Skip to content

openavmkit.utilities.modeling

Model class definitions.

Defines the model classes used by :mod:openavmkit.modeling to train and predict property values. Includes:

  • Tree-based wrappersXGBoostModel, LightGBMModel, CatBoostModel
  • Linear modelsMRAModel, MultiMRAModel
  • Geographic modelsGWRModel, LocalAreaModel, SpatialLagModel
  • BaselinesGarbageModel, AverageModel, NaiveAreaModel, PassThroughModel, GroundTruthModel

Plus helpers (greedy_forward_loocv, TreeBasedCategoricalData) shared across model fitting routines.

When adding a new model, subclass here and follow the existing pattern; register the prediction wrapper in :mod:openavmkit.model_runner and the params/contribs writer in :mod:openavmkit.modeling.

AverageModel

AverageModel(type, sales_chase)

An intentionally bad predictive model, to use as a sort of control. Produces predictions equal to the average of observed sale prices.

Attributes:

Name Type Description
type str

The type of average to use

sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

Initialize an AverageModel

Parameters:

Name Type Description Default
type str

The type of average to use

required
sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

required
Source code in openavmkit/utilities/modeling.py
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
def __init__(self, type: str, sales_chase: float):
    """Initialize an AverageModel

    Parameters
    ----------
    type : str
        The type of average to use
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    """
    self.type = type
    self.sales_chase = sales_chase

CatBoostModel

CatBoostModel(regressor, cat_data)

CatBoost Model

Attributes:

Name Type Description
regressor CatBRegressor

The trained CatBoost CatBRegressor model

cat_data TreeBasedCategoricalData
Source code in openavmkit/utilities/modeling.py
482
483
484
def __init__(self, regressor, cat_data):
    self.regressor = regressor
    self.cat_data = cat_data

GWRModel

GWRModel(coords_train, X_train, y_train, gwr_bw)

Geographic Weighted Regression Model

Attributes:

Name Type Description
coords_train list[tuple[float, float]]

list of geospatial coordinates corresponding to each observation in the training set

X_train ndarray

2D array of independent variables' values from the training set

y_train ndarray

1D array of dependent variable's values from the training set

gwr_bw float

Bandwidth for GWR calculation

df_params_test DataFrame

Coefficients for the test set

df_params_sales DataFrame

Coefficients for the sales set

df_params_universe DataFrame

Coefficients for the universe set

Parameters:

Name Type Description Default
coords_train list[tuple[float, float]]

list of geospatial coordinates corresponding to each observation in the training set

required
X_train ndarray

2D array of independent variables' values from the training set

required
y_train ndarray

1D array of dependent variable's values from the training set

required
gwr_bw float

Bandwidth for GWR calculation

required
Source code in openavmkit/utilities/modeling.py
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
def __init__(
    self,
    coords_train: list[tuple[float, float]],
    X_train: np.ndarray,
    y_train: np.ndarray,
    gwr_bw: float
):
    """
    Parameters
    ----------
    coords_train : list[tuple[float, float]]
        list of geospatial coordinates corresponding to each observation in the training set
    X_train : np.ndarray
        2D array of independent variables' values from the training set
    y_train : np.ndarray
        1D array of dependent variable's values from the training set
    gwr_bw : float
        Bandwidth for GWR calculation
    """
    self.coords_train = coords_train
    self.X_train = X_train
    self.y_train = y_train
    self.gwr_bw = gwr_bw
    self.df_params_sales = None
    self.df_params_univ = None
    self.df_params_test = None

GarbageModel

GarbageModel(min_value, max_value, sales_chase, normal)

An intentionally bad predictive model, to use as a sort of control. Produces random predictions.

Attributes:

Name Type Description
min_value float

The minimum value of to "predict"

max_value float

The maximum value of to "predict"

sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

normal bool

If True, the randomly generated predictions follow a normal distribution based on the observed sale price's standard deviation. If False, randomly generated predictions follow a uniform distribution between min and max.

Initialize a GarbageModel

Parameters:

Name Type Description Default
min_value float

The minimum value of to "predict"

required
max_value float

The maximum value of to "predict"

required
sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

required
normal bool

If True, the randomly generated predictions follow a normal distribution based on the observed sale price's standard deviation. If False, randomly generated predictions follow a uniform distribution between min and max.

required
Source code in openavmkit/utilities/modeling.py
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
def __init__(
    self, min_value: float, max_value: float, sales_chase: float, normal: bool
):
    """Initialize a GarbageModel

    Parameters
    ----------
    min_value : float
        The minimum value of to "predict"
    max_value : float
        The maximum value of to "predict"
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    normal : bool
        If True, the randomly generated predictions follow a normal distribution based on the observed sale price's
        standard deviation. If False, randomly generated predictions follow a uniform distribution between min and max.
    """
    self.min_value = min_value
    self.max_value = max_value
    self.sales_chase = sales_chase
    self.normal = normal

GroundTruthModel

GroundTruthModel(observed_field, ground_truth_field)

Mostly only used in Synthetic models, where you want to compare against simulation ground_truth instead of observed sale price, which you can never do in real life.

Attributes:

Name Type Description
observed_field str

The field that represents observed sale prices

ground_truth_field str

The field that represents platonic ground truth

Initialize a GroundTruthModel object

Parameters:

Name Type Description Default
observed_field str

The field that represents observed sale prices

required
ground_truth_field str

The field that represents platonic ground truth

required
Source code in openavmkit/utilities/modeling.py
242
243
244
245
246
247
248
249
250
251
252
253
def __init__(self, observed_field: str, ground_truth_field: str):
    """Initialize a GroundTruthModel object

    Parameters
    ----------
    observed_field : str
        The field that represents observed sale prices
    ground_truth_field : str
        The field that represents platonic ground truth
    """
    self.observed_field = observed_field
    self.ground_truth_field = ground_truth_field

LayeredCompModel

LayeredCompModel(model)

Layered Comp Model

A bagging ensemble version of the LayeredCompModel algorithm that reduces variance and automatically optimizes the weight_falloff for each tree in the ensemble.

Attributes:

Name Type Description
model LayeredCompModel

The trained LayeredCompModel from the layeredcompmodel package

Initialize a LayeredCompModel

Parameters:

Name Type Description Default
model LayeredCompModel

The trained LayeredCompModel instance

required
Source code in openavmkit/utilities/modeling.py
518
519
520
521
522
523
524
525
526
def __init__(self, model):
    """Initialize a LayeredCompModel

    Parameters
    ----------
    model : layeredcompmodel.LayeredCompModel
        The trained LayeredCompModel instance
    """
    self.model = model

LightGBMModel

LightGBMModel(booster, cat_data)

LightGBM Model

Attributes:

Name Type Description
booster Booster

The trained LightGBM Booster model

cat_data TreeBasedCategoricalData
Source code in openavmkit/utilities/modeling.py
454
455
456
def __init__(self, booster, cat_data):
    self.booster = booster
    self.cat_data = cat_data

LocalAreaModel

LocalAreaModel(loc_map, location_fields, overall_per_impr_area, overall_per_land_area, sales_chase)

Produces predictions equal to the localized average price/area of land or building, multiplied by the observed size of the parcel's land or building, depending on whether it's vacant or improved.

Unlike NaiveAreaModel, this model is sensitive to location, based on user-specified locations, and might actually result in decent predictions.

Attributes:

Name Type Description
loc_map dict[str : tuple[DataFrame, DataFrame]

A dictionary that maps location field names to localized per-area values. The dictionary itself is keyed by the names of the location fields themselves (e.g. "neighborhood", "market_region", "census_tract", etc.) or whatever the user specifies.

Each entry is a tuple containing two DataFrames:

  • Values per improved square foot
  • Values per land square foot

Each DataFrame is keyed by the unique values for the given location. (e.g. "River heights", "Meadowbrook", etc., if the location field in question is "neighborhood") The other field in each DataFrame will be {location_field}_per_impr_{unit} or {location_field}_per_land_{unit}

location_fields list

List of location fields used (e.g. "neighborhood", "market_region", "census_tract", etc.)

overall_per_impr_area float

Fallback value per improved square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.

overall_per_land_area float

Fallback value per land square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.

sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

Initialize a LocalAreaModel

Parameters:

Name Type Description Default
loc_map dict[str : tuple[DataFrame, DataFrame]

A dictionary that maps location field names to localized per-area values. The dictionary itself is keyed by the names of the location fields themselves (e.g. "neighborhood", "market_region", "census_tract", etc.) or whatever the user specifies.

Each entry is a tuple containing two DataFrames:

  • Values per improved square foot
  • Values per land square foot

Each DataFrame is keyed by the unique values for the given location. (e.g. "River heights", "Meadowbrook", etc., if the location field in question is "neighborhood") The other field in each DataFrame will be {location_field}_per_impr_{unit} or {location_field}_per_land_{unit}

required
location_fields list

List of location fields used (e.g. "neighborhood", "market_region", "census_tract", etc.)

required
overall_per_impr_area float

Fallback value per improved square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.

required
overall_per_land_area float

Fallback value per land square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.

required
sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

required
Source code in openavmkit/utilities/modeling.py
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
def __init__(
    self,
    loc_map: dict,
    location_fields: list,
    overall_per_impr_area: float,
    overall_per_land_area: float,
    sales_chase: float,
):
    """Initialize a LocalAreaModel

    Parameters
    ----------
    loc_map : dict[str : tuple[DataFrame, DataFrame]
        A dictionary that maps location field names to localized per-area values. The dictionary itself is keyed by the
        names of the location fields themselves (e.g. "neighborhood", "market_region", "census_tract", etc.) or whatever
        the user specifies.

        Each entry is a tuple containing two DataFrames:

          - Values per improved square foot
          - Values per land square foot

        Each DataFrame is keyed by the unique *values* for the given location. (e.g. "River heights", "Meadowbrook",
        etc., if the location field in question is "neighborhood") The other field in each DataFrame will be
        ``{location_field}_per_impr_{unit}`` or ``{location_field}_per_land_{unit}``
    location_fields : list
        List of location fields used (e.g. "neighborhood", "market_region", "census_tract", etc.)
    overall_per_impr_area : float
        Fallback value per improved square foot, to use for parcels of unspecified location. Based on the
        overall average value for the dataset.
    overall_per_land_area : float
        Fallback value per land square foot, to use for parcels of unspecified location. Based on the overall average
        value for the dataset.
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    """
    self.loc_map = loc_map
    self.location_fields = location_fields
    self.overall_per_impr_area = overall_per_impr_area
    self.overall_per_land_area = overall_per_land_area
    self.sales_chase = sales_chase

MRAModel

MRAModel(fitted_model, intercept, log=False)

Multiple Regression Analysis Model

Plain 'ol (multiple) linear regression

Attributes:

Name Type Description
fitted_model RegressionResults

Fitted model from running the regression

intercept bool

Whether the model was fit with an intercept or not.

log bool

Whether the model was fit on a log-transformed target. When True, predictions are produced in log space and exponentiated back to price space by predict_mra.

Source code in openavmkit/utilities/modeling.py
544
545
546
547
def __init__(self, fitted_model: RegressionResults, intercept: bool, log: bool = False):
    self.fitted_model = fitted_model
    self.intercept = intercept
    self.log = log

MultiMRAModel

MultiMRAModel(coef_map, global_coef, feature_names, intercept, location_fields, log=False)

Multi-MRA (hierarchical local OLS) model.

For each location field (e.g. "block", "neighborhood", ...), and for each distinct value of that field, we fit a separate OLS regression using the same set of independent variables.

We store: - A global OLS coefficient vector (fallback when no local model applies) - A mapping from (location_field, location_value) -> coefficient vector - The feature_names (column order) used for all regressions - Whether an intercept was used - The location_fields (ordered most specific -> least specific)

Attributes:

Name Type Description
coef_map dict[str, dict[Any, ndarray]]

Mapping from location field name to a dict mapping location value -> coefficient vector (aligned with feature_names).

global_coef ndarray

Coefficient vector for the global OLS regression.

feature_names list[str]

Ordered list of feature names used for all regressions.

intercept bool

Whether an intercept column was used.

location_fields list[str]

Location fields in order from most specific to least specific.

Parameters:

Name Type Description Default
coef_map dict[str, dict[Any, ndarray]]

Mapping from location field name to a dict mapping location value -> coefficient vector (aligned with feature_names).

required
global_coef ndarray

Coefficient vector for the global OLS regression.

required
feature_names list[str]

Ordered list of feature names used for all regressions.

required
intercept bool

Whether an intercept column was used.

required
location_fields list[str]

Location fields in order from most specific to least specific.

required
log bool

Whether the regressions were fit on a log-transformed target. When True, predictions are produced in log space and exponentiated back by predict_multi_mra.

False
Source code in openavmkit/utilities/modeling.py
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
def __init__(
    self,
    coef_map: dict[str, dict[Any, np.ndarray]],
    global_coef: np.ndarray,
    feature_names: list[str],
    intercept: bool,
    location_fields: list[str],
    log: bool = False,
):
    """
    Parameters
    ----------
    coef_map : dict[str, dict[Any, np.ndarray]]
        Mapping from location field name to a dict mapping
        location value -> coefficient vector (aligned with feature_names).
    global_coef : np.ndarray
        Coefficient vector for the global OLS regression.
    feature_names : list[str]
        Ordered list of feature names used for all regressions.
    intercept : bool
        Whether an intercept column was used.
    location_fields : list[str]
        Location fields in order from most specific to least specific.
    log : bool
        Whether the regressions were fit on a log-transformed target. When True,
        predictions are produced in log space and exponentiated back by ``predict_multi_mra``.
    """
    self.coef_map = coef_map
    self.global_coef = global_coef
    self.feature_names = feature_names
    self.intercept = intercept
    self.location_fields = location_fields
    self.log = log

NGBoostModel

NGBoostModel(regressor, cat_data)

NGBoost Model (probabilistic gradient boosting)

NGBoost predicts a full probability distribution per row, so it surfaces a per-parcel predictive standard deviation in addition to a point estimate. Its base learner is a numeric-only sklearn tree, so categoricals are encoded via cat_data rather than passed natively.

Attributes:

Name Type Description
regressor NGBRegressor

The trained NGBoost NGBRegressor model

cat_data TreeBasedCategoricalData

Categorical metadata used to build the numeric matrix NGBoost requires

Source code in openavmkit/utilities/modeling.py
502
503
504
def __init__(self, regressor, cat_data):
    self.regressor = regressor
    self.cat_data = cat_data

NaiveAreaModel

NaiveAreaModel(dep_per_built_area, dep_per_land_area, sales_chase)

An intentionally bad predictive model, to use as a sort of control. Produces predictions equal to the prevailing average price/area of land or building, multiplied by the observed size of the parcel's land or building, depending on whether it's vacant or improved.

Attributes:

Name Type Description
dep_per_built_area float

Dependent variable value divided by improved square footage

dep_per_land_area float

Dependent variable value divided by land square footage

sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

Initialize a NaiveAreaModel

Parameters:

Name Type Description Default
dep_per_built_area float

Dependent variable value divided by improved square footage

required
dep_per_land_area float

Dependent variable value divided by land square footage

required
sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

required
Source code in openavmkit/utilities/modeling.py
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
def __init__(
    self, dep_per_built_area: float, dep_per_land_area: float, sales_chase: float
):
    """Initialize a NaiveAreaModel

    Parameters
    ----------
    dep_per_built_area: float
        Dependent variable value divided by improved square footage
    dep_per_land_area: float
        Dependent variable value divided by land square footage
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    """
    self.dep_per_built_area = dep_per_built_area
    self.dep_per_land_area = dep_per_land_area
    self.sales_chase = sales_chase

PassThroughModel

PassThroughModel(field, engine)

Mostly used for representing existing valuations to compare against, such as the Assessor's values

Attributes:

Name Type Description
field str

The field that holds the values you want to pass through as predictions

Initialize a PassThroughModel

Parameters:

Name Type Description Default
field str

The field that holds the values you want to pass through as predictions

required
engine str

The model engine ("assessor" or "pass_through")

required
Source code in openavmkit/utilities/modeling.py
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
def __init__(
    self,
    field: str,
    engine: str
):
    """Initialize a PassThroughModel

    Parameters
    ----------
    field : str
        The field that holds the values you want to pass through as predictions
    engine : str
        The model engine ("assessor" or "pass_through")
    """
    self.field = field
    self.engine = engine

SpatialLagModel

SpatialLagModel(per_area)

Use a spatial lag field as your prediction

Attributes:

Name Type Description
per_area bool

If True, normalize by area unit. If False, use the direct value of the spatial lag field.

Initialize a SpatialLagModel

Parameters:

Name Type Description Default
per_area bool

If True, normalize by square foot. If False, use the direct value of the spatial lag field.

required
Source code in openavmkit/utilities/modeling.py
265
266
267
268
269
270
271
272
273
def __init__(self, per_area: bool):
    """Initialize a SpatialLagModel

    Parameters
    ----------
    per_area : bool
        If True, normalize by square foot. If False, use the direct value of the spatial lag field.
    """
    self.per_area = per_area

TreeBasedCategoricalData dataclass

TreeBasedCategoricalData(feature_names, categorical_cols, category_levels, bool_cols)

Stores categorical metadata needed to reproduce LightGBM-compatible categorical encodings and generate numeric matrices for SHAP.

apply

apply(X, *, fill_missing_cat=False, missing_token='__MISSING__')

Reapply categorical + boolean structure to a dataframe. Unknown categories become NaN (categorical missing) unless fill_missing_cat=True.

Source code in openavmkit/utilities/modeling.py
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
def apply(self, X: pd.DataFrame, *, fill_missing_cat: bool = False, missing_token: str = "__MISSING__") -> pd.DataFrame:
    """
    Reapply categorical + boolean structure to a dataframe.
    Unknown categories become NaN (categorical missing) unless fill_missing_cat=True.
    """
    X = X.reindex(columns=self.feature_names)

    for c in self.bool_cols:
        if c in X.columns:
            X[c] = X[c].astype("boolean")

    for c, levels in self.category_levels.items():
        if c in X.columns:
            X[c] = pd.Categorical(X[c], categories=levels)
            if fill_missing_cat:
                # turn NaN category (including unknowns) into a string token
                X[c] = X[c].astype("string").fillna(missing_token)

    return X

from_training_data classmethod

from_training_data(X_train, categorical_cols)

Build metadata from training data AFTER categoricals have been converted to pandas 'category' dtype.

Source code in openavmkit/utilities/modeling.py
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
@classmethod
def from_training_data(
    cls,
    X_train: pd.DataFrame,
    categorical_cols: List[str],
) -> "TreeBasedCategoricalData":
    """
    Build metadata from training data AFTER categoricals have been
    converted to pandas 'category' dtype.
    """
    feature_names = list(X_train.columns)

    cat_cols = [
        c for c in categorical_cols
        if c in X_train.columns
        and pd.api.types.is_categorical_dtype(X_train[c])
    ]

    category_levels = {
        c: list(X_train[c].cat.categories)
        for c in cat_cols
    }

    bool_cols = [
        c for c in X_train.columns
        if pd.api.types.is_bool_dtype(X_train[c])
        or str(X_train[c].dtype) == "boolean"
    ]

    return cls(
        feature_names=feature_names,
        categorical_cols=cat_cols,
        category_levels=category_levels,
        bool_cols=bool_cols,
    )

to_numeric_matrix

to_numeric_matrix(X)

Convert dataframe to a numeric matrix compatible with SHAP. Categoricals -> integer codes, unknowns/missing -> np.nan.

Source code in openavmkit/utilities/modeling.py
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
def to_numeric_matrix(self, X: pd.DataFrame) -> np.ndarray:
    """
    Convert dataframe to a numeric matrix compatible with SHAP.
    Categoricals -> integer codes, unknowns/missing -> np.nan.
    """
    X = self.apply(X)
    out = X.copy()

    for c in self.categorical_cols:
        codes = out[c].cat.codes.astype(np.float64)
        codes[codes == -1] = np.nan
        out[c] = codes

    for c in self.bool_cols:
        out[c] = out[c].astype("Float64").astype(np.float64)

    return out.to_numpy(dtype=np.float64)

XGBoostModel

XGBoostModel(regressor, cat_data)

XGBoost Model

Attributes:

Name Type Description
regressor XGBRegressor

The trained XGBoost XGBRegressor model

cat_data TreeBasedCategoricalData
Source code in openavmkit/utilities/modeling.py
468
469
470
def __init__(self, regressor, cat_data):
    self.regressor = regressor
    self.cat_data = cat_data

greedy_forward_loocv

greedy_forward_loocv(X, y, *, k_max=None, min_gain=0.002, standardize=True, prescreen_k=15)

Greedy forward selection maximizing LOOCV R² (fast). Auto-detects intercept handling: - If X contains a 'const' column, treats it as intercept (always included, not selectable, not standardized). - Otherwise, adds an intercept internally (as before).

Assumes X, y are numeric, aligned, and NaN-free.

Source code in openavmkit/utilities/modeling.py
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
def greedy_forward_loocv(
    X: pd.DataFrame,
    y: pd.Series,
    *,
    k_max: Optional[int] = None,
    min_gain: float = 0.002,          # stop if best add improves CV-R² < 0.2%
    standardize: bool = True,
    prescreen_k: Optional[int] = 15,  # optionally screen to top K by |corr| for speed
) -> GreedyResult:
    """
    Greedy forward selection maximizing LOOCV R² (fast).
    Auto-detects intercept handling:
      - If X contains a 'const' column, treats it as intercept (always included, not selectable, not standardized).
      - Otherwise, adds an intercept internally (as before).

    Assumes X, y are numeric, aligned, and NaN-free.
    """

    yv = y.to_numpy(dtype=float)
    cols_all = list(X.columns)

    has_const = "const" in cols_all
    if has_const:
        # Treat provided const as the intercept (force-include)
        const_vec = X["const"].to_numpy(dtype=float).reshape(-1, 1)

        # Optional sanity check (not required, but helpful during debugging):
        # if not (np.nanstd(const_vec) < 1e-12):
        #     raise ValueError("'const' column exists but is not (near-)constant; cannot treat as intercept safely.")

        X_no_const = X.drop(columns=["const"])
        cols = list(X_no_const.columns)
        Xmat = X_no_const.to_numpy(dtype=float)
    else:
        const_vec = None
        cols = cols_all
        Xmat = X.to_numpy(dtype=float)

    n, p = Xmat.shape
    if p == 0 and not has_const:
        return GreedyResult([], cv_r2=float("-inf"), train_r2=float("-inf"))

    # SST
    y_centered = yv - yv.mean()
    sst = float(y_centered.T @ y_centered)
    if sst == 0.0:
        return GreedyResult([], cv_r2=0.0, train_r2=0.0)

    # Standardize X for stability (recommended)
    # IMPORTANT: never standardize the intercept; we already removed it above if present.
    if standardize and p > 0:
        mu = Xmat.mean(axis=0)
        sd = Xmat.std(axis=0, ddof=0)
        sd = np.where(sd == 0, 1.0, sd)
        Xmat = (Xmat - mu) / sd

    # Adaptive size cap if not provided
    if k_max is None:
        # if p==0 but has_const, we still just fit const-only
        k_max = 0 if p == 0 else min(p, max(2, n // 5))
    k_max = max(0, min(k_max, p))

    # Optional prescreen to reduce p cheaply
    if prescreen_k is not None and prescreen_k < p and p > 0:
        y_std = y_centered.std(ddof=0)
        if y_std == 0:
            return GreedyResult([], cv_r2=0.0, train_r2=0.0)
        x_std = Xmat.std(axis=0, ddof=0)
        x_std = np.where(x_std == 0, 1.0, x_std)
        corr = (Xmat.T @ y_centered) / (n * x_std * y_std)
        keep_idx = np.argsort(np.abs(corr))[-prescreen_k:]
        keep_idx = np.sort(keep_idx)
        Xmat = Xmat[:, keep_idx]
        cols = [cols[i] for i in keep_idx]
        p = Xmat.shape[1]
        k_max = min(k_max, p)

    selected: List[int] = []
    remaining = set(range(p))

    # baseline
    if has_const:
        # intercept is provided as a column; baseline is const-only
        X0 = const_vec
        best_train_r2, best_cv_r2 = _ols_train_and_loocv_r2(X0, yv, sst)
    else:
        # original behavior: intercept-only baseline
        X0 = np.ones((n, 1))
        best_train_r2, best_cv_r2 = _ols_train_and_loocv_r2(X0, yv, sst)

    while remaining and len(selected) < k_max:
        step_best_cv = best_cv_r2
        step_best_train = best_train_r2
        step_best_j = None

        for j in list(remaining):
            idxs = selected + [j]
            Xsub = Xmat[:, idxs]

            if has_const:
                # Use provided intercept column
                X_design = np.column_stack([const_vec, Xsub])
            else:
                # Add intercept internally (original behavior)
                X_design = np.column_stack([np.ones(n), Xsub])

            train_r2, cv_r2 = _ols_train_and_loocv_r2(X_design, yv, sst)
            if cv_r2 > step_best_cv + 1e-12:
                step_best_cv = cv_r2
                step_best_train = train_r2
                step_best_j = j

        if step_best_j is None or (step_best_cv - best_cv_r2) < min_gain:
            break

        selected.append(step_best_j)
        remaining.remove(step_best_j)
        best_cv_r2 = step_best_cv
        best_train_r2 = step_best_train

    # Return only non-const variables (const is forced-in when present)
    return GreedyResult([cols[i] for i in selected], best_cv_r2, best_train_r2)