`openavmkit.utilities.modeling`

Model class definitions.

Defines the model classes used by :mod:openavmkit.modeling to train and predict property values. Includes:

Tree-based wrappers — XGBoostModel, LightGBMModel, CatBoostModel
Linear models — MRAModel, MultiMRAModel
Geographic models — GWRModel, LocalAreaModel, SpatialLagModel
Baselines — GarbageModel, AverageModel, NaiveAreaModel, PassThroughModel, GroundTruthModel

Plus helpers (greedy_forward_loocv, TreeBasedCategoricalData) shared across model fitting routines.

When adding a new model, subclass here and follow the existing pattern; register the prediction wrapper in :mod:openavmkit.model_runner and the params/contribs writer in :mod:openavmkit.modeling.

AverageModel

AverageModel(type, sales_chase)

An intentionally bad predictive model, to use as a sort of control. Produces predictions equal to the average of observed sale prices.

Attributes:

Name	Type	Description
`type`	`str`	The type of average to use
`sales_chase`	`float`	Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of `sales_chase`. So `sales_chase=0.05` will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

Initialize an AverageModel

Parameters:

Name	Type	Description	Default
`type`	`str`	The type of average to use	required
`sales_chase`	`float`	Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of `sales_chase`. So `sales_chase=0.05` will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.	required

Source code in openavmkit/utilities/modeling.py

def __init__(self, type: str, sales_chase: float):
    """Initialize an AverageModel

    Parameters
    ----------
    type : str
        The type of average to use
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    """
    self.type = type
    self.sales_chase = sales_chase

CatBoostModel

CatBoostModel(regressor, cat_data)

CatBoost Model

Attributes:

Name	Type	Description
`regressor`	`CatBRegressor`	The trained CatBoost CatBRegressor model
`cat_data`	`TreeBasedCategoricalData`

Source code in openavmkit/utilities/modeling.py

def __init__(self, regressor, cat_data):
    self.regressor = regressor
    self.cat_data = cat_data

GWRModel

GWRModel(coords_train, X_train, y_train, gwr_bw)

Geographic Weighted Regression Model

Attributes:

Name	Type	Description
`coords_train`	`list[tuple[float, float]]`	list of geospatial coordinates corresponding to each observation in the training set
`X_train`	`ndarray`	2D array of independent variables' values from the training set
`y_train`	`ndarray`	1D array of dependent variable's values from the training set
`gwr_bw`	`float`	Bandwidth for GWR calculation
`df_params_test`	`DataFrame`	Coefficients for the test set
`df_params_sales`	`DataFrame`	Coefficients for the sales set
`df_params_universe`	`DataFrame`	Coefficients for the universe set

Parameters:

Name	Type	Description	Default
`coords_train`	`list[tuple[float, float]]`	list of geospatial coordinates corresponding to each observation in the training set	required
`X_train`	`ndarray`	2D array of independent variables' values from the training set	required
`y_train`	`ndarray`	1D array of dependent variable's values from the training set	required
`gwr_bw`	`float`	Bandwidth for GWR calculation	required

Source code in openavmkit/utilities/modeling.py

def __init__(
    self,
    coords_train: list[tuple[float, float]],
    X_train: np.ndarray,
    y_train: np.ndarray,
    gwr_bw: float
):
    """
    Parameters
    ----------
    coords_train : list[tuple[float, float]]
        list of geospatial coordinates corresponding to each observation in the training set
    X_train : np.ndarray
        2D array of independent variables' values from the training set
    y_train : np.ndarray
        1D array of dependent variable's values from the training set
    gwr_bw : float
        Bandwidth for GWR calculation
    """
    self.coords_train = coords_train
    self.X_train = X_train
    self.y_train = y_train
    self.gwr_bw = gwr_bw
    self.df_params_sales = None
    self.df_params_univ = None
    self.df_params_test = None

GarbageModel

GarbageModel(min_value, max_value, sales_chase, normal)

An intentionally bad predictive model, to use as a sort of control. Produces random predictions.

Attributes:

Name	Type	Description
`min_value`	`float`	The minimum value of to "predict"
`max_value`	`float`	The maximum value of to "predict"
`sales_chase`	`float`	Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of `sales_chase`. So `sales_chase=0.05` will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
`normal`	`bool`	If True, the randomly generated predictions follow a normal distribution based on the observed sale price's standard deviation. If False, randomly generated predictions follow a uniform distribution between min and max.

Initialize a GarbageModel

Parameters:

Name	Type	Description	Default
`min_value`	`float`	The minimum value of to "predict"	required
`max_value`	`float`	The maximum value of to "predict"	required
`sales_chase`	`float`	Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of `sales_chase`. So `sales_chase=0.05` will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.	required
`normal`	`bool`	If True, the randomly generated predictions follow a normal distribution based on the observed sale price's standard deviation. If False, randomly generated predictions follow a uniform distribution between min and max.	required

Source code in openavmkit/utilities/modeling.py

def __init__(
    self, min_value: float, max_value: float, sales_chase: float, normal: bool
):
    """Initialize a GarbageModel

    Parameters
    ----------
    min_value : float
        The minimum value of to "predict"
    max_value : float
        The maximum value of to "predict"
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    normal : bool
        If True, the randomly generated predictions follow a normal distribution based on the observed sale price's
        standard deviation. If False, randomly generated predictions follow a uniform distribution between min and max.
    """
    self.min_value = min_value
    self.max_value = max_value
    self.sales_chase = sales_chase
    self.normal = normal

GroundTruthModel

GroundTruthModel(observed_field, ground_truth_field)

Mostly only used in Synthetic models, where you want to compare against simulation ground_truth instead of observed sale price, which you can never do in real life.

Attributes:

Name	Type	Description
`observed_field`	`str`	The field that represents observed sale prices
`ground_truth_field`	`str`	The field that represents platonic ground truth

Initialize a GroundTruthModel object

Parameters:

Name	Type	Description	Default
`observed_field`	`str`	The field that represents observed sale prices	required
`ground_truth_field`	`str`	The field that represents platonic ground truth	required

Source code in openavmkit/utilities/modeling.py

def __init__(self, observed_field: str, ground_truth_field: str):
    """Initialize a GroundTruthModel object

    Parameters
    ----------
    observed_field : str
        The field that represents observed sale prices
    ground_truth_field : str
        The field that represents platonic ground truth
    """
    self.observed_field = observed_field
    self.ground_truth_field = ground_truth_field

LayeredCompModel

LayeredCompModel(model)

Layered Comp Model

A bagging ensemble version of the LayeredCompModel algorithm that reduces variance and automatically optimizes the weight_falloff for each tree in the ensemble.

Attributes:

Name	Type	Description
`model`	`LayeredCompModel`	The trained LayeredCompModel from the layeredcompmodel package

Initialize a LayeredCompModel

Parameters:

Name	Type	Description	Default
`model`	`LayeredCompModel`	The trained LayeredCompModel instance	required

Source code in openavmkit/utilities/modeling.py

def __init__(self, model):
    """Initialize a LayeredCompModel

    Parameters
    ----------
    model : layeredcompmodel.LayeredCompModel
        The trained LayeredCompModel instance
    """
    self.model = model

LightGBMModel

LightGBMModel(booster, cat_data)

LightGBM Model

Attributes:

Name	Type	Description
`booster`	`Booster`	The trained LightGBM Booster model
`cat_data`	`TreeBasedCategoricalData`

Source code in openavmkit/utilities/modeling.py

def __init__(self, booster, cat_data):
    self.booster = booster
    self.cat_data = cat_data

LocalAreaModel

LocalAreaModel(loc_map, location_fields, overall_per_impr_area, overall_per_land_area, sales_chase)

Produces predictions equal to the localized average price/area of land or building, multiplied by the observed size of the parcel's land or building, depending on whether it's vacant or improved.

Unlike NaiveAreaModel, this model is sensitive to location, based on user-specified locations, and might actually result in decent predictions.

Attributes:

Name	Type	Description
`loc_map`	`dict[str : tuple[DataFrame, DataFrame]`	A dictionary that maps location field names to localized per-area values. The dictionary itself is keyed by the names of the location fields themselves (e.g. "neighborhood", "market_region", "census_tract", etc.) or whatever the user specifies. Each entry is a tuple containing two DataFrames: Values per improved square foot Values per land square foot Each DataFrame is keyed by the unique values for the given location. (e.g. "River heights", "Meadowbrook", etc., if the location field in question is "neighborhood") The other field in each DataFrame will be `{location_field}_per_impr_{unit}` or `{location_field}_per_land_{unit}`
`location_fields`	`list`	List of location fields used (e.g. "neighborhood", "market_region", "census_tract", etc.)
`overall_per_impr_area`	`float`	Fallback value per improved square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.
`overall_per_land_area`	`float`	Fallback value per land square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.
`sales_chase`	`float`	Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of `sales_chase`. So `sales_chase=0.05` will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

Initialize a LocalAreaModel

Parameters:

Name	Type	Description	Default
`loc_map`	`dict[str : tuple[DataFrame, DataFrame]`	A dictionary that maps location field names to localized per-area values. The dictionary itself is keyed by the names of the location fields themselves (e.g. "neighborhood", "market_region", "census_tract", etc.) or whatever the user specifies. Each entry is a tuple containing two DataFrames: Values per improved square foot Values per land square foot Each DataFrame is keyed by the unique values for the given location. (e.g. "River heights", "Meadowbrook", etc., if the location field in question is "neighborhood") The other field in each DataFrame will be `{location_field}_per_impr_{unit}` or `{location_field}_per_land_{unit}`	required
`location_fields`	`list`	List of location fields used (e.g. "neighborhood", "market_region", "census_tract", etc.)	required
`overall_per_impr_area`	`float`	Fallback value per improved square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.	required
`overall_per_land_area`	`float`	Fallback value per land square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.	required
`sales_chase`	`float`	Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of `sales_chase`. So `sales_chase=0.05` will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.	required

Source code in openavmkit/utilities/modeling.py

def __init__(
    self,
    loc_map: dict,
    location_fields: list,
    overall_per_impr_area: float,
    overall_per_land_area: float,
    sales_chase: float,
):
    """Initialize a LocalAreaModel

    Parameters
    ----------
    loc_map : dict[str : tuple[DataFrame, DataFrame]
        A dictionary that maps location field names to localized per-area values. The dictionary itself is keyed by the
        names of the location fields themselves (e.g. "neighborhood", "market_region", "census_tract", etc.) or whatever
        the user specifies.

        Each entry is a tuple containing two DataFrames:

          - Values per improved square foot
          - Values per land square foot

        Each DataFrame is keyed by the unique *values* for the given location. (e.g. "River heights", "Meadowbrook",
        etc., if the location field in question is "neighborhood") The other field in each DataFrame will be
        ``{location_field}_per_impr_{unit}`` or ``{location_field}_per_land_{unit}``
    location_fields : list
        List of location fields used (e.g. "neighborhood", "market_region", "census_tract", etc.)
    overall_per_impr_area : float
        Fallback value per improved square foot, to use for parcels of unspecified location. Based on the
        overall average value for the dataset.
    overall_per_land_area : float
        Fallback value per land square foot, to use for parcels of unspecified location. Based on the overall average
        value for the dataset.
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    """
    self.loc_map = loc_map
    self.location_fields = location_fields
    self.overall_per_impr_area = overall_per_impr_area
    self.overall_per_land_area = overall_per_land_area
    self.sales_chase = sales_chase

MRAModel

MRAModel(fitted_model, intercept, log=False)

Multiple Regression Analysis Model

Plain 'ol (multiple) linear regression

Attributes:

Name	Type	Description
`fitted_model`	`RegressionResults`	Fitted model from running the regression
`intercept`	`bool`	Whether the model was fit with an intercept or not.
`log`	`bool`	Whether the model was fit on a log-transformed target. When True, predictions are produced in log space and exponentiated back to price space by `predict_mra`.

Source code in openavmkit/utilities/modeling.py

def __init__(self, fitted_model: RegressionResults, intercept: bool, log: bool = False):
    self.fitted_model = fitted_model
    self.intercept = intercept
    self.log = log

MultiMRAModel

MultiMRAModel(coef_map, global_coef, feature_names, intercept, location_fields, log=False)

Multi-MRA (hierarchical local OLS) model.

For each location field (e.g. "block", "neighborhood", ...), and for each distinct value of that field, we fit a separate OLS regression using the same set of independent variables.

We store: - A global OLS coefficient vector (fallback when no local model applies) - A mapping from (location_field, location_value) -> coefficient vector - The feature_names (column order) used for all regressions - Whether an intercept was used - The location_fields (ordered most specific -> least specific)

Attributes:

Name	Type	Description
`coef_map`	`dict[str, dict[Any, ndarray]]`	Mapping from location field name to a dict mapping location value -> coefficient vector (aligned with feature_names).
`global_coef`	`ndarray`	Coefficient vector for the global OLS regression.
`feature_names`	`list[str]`	Ordered list of feature names used for all regressions.
`intercept`	`bool`	Whether an intercept column was used.
`location_fields`	`list[str]`	Location fields in order from most specific to least specific.

Parameters:

Name	Type	Description	Default
`coef_map`	`dict[str, dict[Any, ndarray]]`	Mapping from location field name to a dict mapping location value -> coefficient vector (aligned with feature_names).	required
`global_coef`	`ndarray`	Coefficient vector for the global OLS regression.	required
`feature_names`	`list[str]`	Ordered list of feature names used for all regressions.	required
`intercept`	`bool`	Whether an intercept column was used.	required
`location_fields`	`list[str]`	Location fields in order from most specific to least specific.	required
`log`	`bool`	Whether the regressions were fit on a log-transformed target. When True, predictions are produced in log space and exponentiated back by `predict_multi_mra`.	`False`

Source code in openavmkit/utilities/modeling.py

def __init__(
    self,
    coef_map: dict[str, dict[Any, np.ndarray]],
    global_coef: np.ndarray,
    feature_names: list[str],
    intercept: bool,
    location_fields: list[str],
    log: bool = False,
):
    """
    Parameters
    ----------
    coef_map : dict[str, dict[Any, np.ndarray]]
        Mapping from location field name to a dict mapping
        location value -> coefficient vector (aligned with feature_names).
    global_coef : np.ndarray
        Coefficient vector for the global OLS regression.
    feature_names : list[str]
        Ordered list of feature names used for all regressions.
    intercept : bool
        Whether an intercept column was used.
    location_fields : list[str]
        Location fields in order from most specific to least specific.
    log : bool
        Whether the regressions were fit on a log-transformed target. When True,
        predictions are produced in log space and exponentiated back by ``predict_multi_mra``.
    """
    self.coef_map = coef_map
    self.global_coef = global_coef
    self.feature_names = feature_names
    self.intercept = intercept
    self.location_fields = location_fields
    self.log = log

NGBoostModel

NGBoostModel(regressor, cat_data)

NGBoost Model (probabilistic gradient boosting)

NGBoost predicts a full probability distribution per row, so it surfaces a per-parcel predictive standard deviation in addition to a point estimate. Its base learner is a numeric-only sklearn tree, so categoricals are encoded via cat_data rather than passed natively.

Attributes:

Name	Type	Description
`regressor`	`NGBRegressor`	The trained NGBoost NGBRegressor model
`cat_data`	`TreeBasedCategoricalData`	Categorical metadata used to build the numeric matrix NGBoost requires

Source code in openavmkit/utilities/modeling.py

def __init__(self, regressor, cat_data):
    self.regressor = regressor
    self.cat_data = cat_data

NaiveAreaModel

NaiveAreaModel(dep_per_built_area, dep_per_land_area, sales_chase)

An intentionally bad predictive model, to use as a sort of control. Produces predictions equal to the prevailing average price/area of land or building, multiplied by the observed size of the parcel's land or building, depending on whether it's vacant or improved.

Attributes:

Name	Type	Description
`dep_per_built_area`	`float`	Dependent variable value divided by improved square footage
`dep_per_land_area`	`float`	Dependent variable value divided by land square footage
`sales_chase`	`float`	Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of `sales_chase`. So `sales_chase=0.05` will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

Initialize a NaiveAreaModel

Parameters:

Name	Type	Description	Default
`dep_per_built_area`	`float`	Dependent variable value divided by improved square footage	required
`dep_per_land_area`	`float`	Dependent variable value divided by land square footage	required
`sales_chase`	`float`	Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of `sales_chase`. So `sales_chase=0.05` will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.	required

Source code in openavmkit/utilities/modeling.py

def __init__(
    self, dep_per_built_area: float, dep_per_land_area: float, sales_chase: float
):
    """Initialize a NaiveAreaModel

    Parameters
    ----------
    dep_per_built_area: float
        Dependent variable value divided by improved square footage
    dep_per_land_area: float
        Dependent variable value divided by land square footage
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    """
    self.dep_per_built_area = dep_per_built_area
    self.dep_per_land_area = dep_per_land_area
    self.sales_chase = sales_chase

PassThroughModel

PassThroughModel(field, engine)

Mostly used for representing existing valuations to compare against, such as the Assessor's values

Attributes:

Name	Type	Description
`field`	`str`	The field that holds the values you want to pass through as predictions

Initialize a PassThroughModel

Parameters:

Name	Type	Description	Default
`field`	`str`	The field that holds the values you want to pass through as predictions	required
`engine`	`str`	The model engine ("assessor" or "pass_through")	required

Source code in openavmkit/utilities/modeling.py

def __init__(
    self,
    field: str,
    engine: str
):
    """Initialize a PassThroughModel

    Parameters
    ----------
    field : str
        The field that holds the values you want to pass through as predictions
    engine : str
        The model engine ("assessor" or "pass_through")
    """
    self.field = field
    self.engine = engine

SpatialLagModel

SpatialLagModel(per_area)

Use a spatial lag field as your prediction

Attributes:

Name	Type	Description
`per_area`	`bool`	If True, normalize by area unit. If False, use the direct value of the spatial lag field.

Initialize a SpatialLagModel

Parameters:

Name	Type	Description	Default
`per_area`	`bool`	If True, normalize by square foot. If False, use the direct value of the spatial lag field.	required

Source code in openavmkit/utilities/modeling.py

def __init__(self, per_area: bool):
    """Initialize a SpatialLagModel

    Parameters
    ----------
    per_area : bool
        If True, normalize by square foot. If False, use the direct value of the spatial lag field.
    """
    self.per_area = per_area

TreeBasedCategoricalData `dataclass`

TreeBasedCategoricalData(feature_names, categorical_cols, category_levels, bool_cols)

Stores categorical metadata needed to reproduce LightGBM-compatible categorical encodings and generate numeric matrices for SHAP.

apply

apply(X, *, fill_missing_cat=False, missing_token='__MISSING__')

Reapply categorical + boolean structure to a dataframe. Unknown categories become NaN (categorical missing) unless fill_missing_cat=True.

Source code in openavmkit/utilities/modeling.py

def apply(self, X: pd.DataFrame, *, fill_missing_cat: bool = False, missing_token: str = "__MISSING__") -> pd.DataFrame:
    """
    Reapply categorical + boolean structure to a dataframe.
    Unknown categories become NaN (categorical missing) unless fill_missing_cat=True.
    """
    X = X.reindex(columns=self.feature_names)

    for c in self.bool_cols:
        if c in X.columns:
            X[c] = X[c].astype("boolean")

    for c, levels in self.category_levels.items():
        if c in X.columns:
            X[c] = pd.Categorical(X[c], categories=levels)
            if fill_missing_cat:
                # turn NaN category (including unknowns) into a string token
                X[c] = X[c].astype("string").fillna(missing_token)

    return X

from_training_data `classmethod`

from_training_data(X_train, categorical_cols)

Build metadata from training data AFTER categoricals have been converted to pandas 'category' dtype.

Source code in openavmkit/utilities/modeling.py

@classmethod
def from_training_data(
    cls,
    X_train: pd.DataFrame,
    categorical_cols: List[str],
) -> "TreeBasedCategoricalData":
    """
    Build metadata from training data AFTER categoricals have been
    converted to pandas 'category' dtype.
    """
    feature_names = list(X_train.columns)

    cat_cols = [
        c for c in categorical_cols
        if c in X_train.columns
        and pd.api.types.is_categorical_dtype(X_train[c])
    ]

    category_levels = {
        c: list(X_train[c].cat.categories)
        for c in cat_cols
    }

    bool_cols = [
        c for c in X_train.columns
        if pd.api.types.is_bool_dtype(X_train[c])
        or str(X_train[c].dtype) == "boolean"
    ]

    return cls(
        feature_names=feature_names,
        categorical_cols=cat_cols,
        category_levels=category_levels,
        bool_cols=bool_cols,
    )

to_numeric_matrix

to_numeric_matrix(X)

Convert dataframe to a numeric matrix compatible with SHAP. Categoricals -> integer codes, unknowns/missing -> np.nan.

Source code in openavmkit/utilities/modeling.py

def to_numeric_matrix(self, X: pd.DataFrame) -> np.ndarray:
    """
    Convert dataframe to a numeric matrix compatible with SHAP.
    Categoricals -> integer codes, unknowns/missing -> np.nan.
    """
    X = self.apply(X)
    out = X.copy()

    for c in self.categorical_cols:
        codes = out[c].cat.codes.astype(np.float64)
        codes[codes == -1] = np.nan
        out[c] = codes

    for c in self.bool_cols:
        out[c] = out[c].astype("Float64").astype(np.float64)

    return out.to_numpy(dtype=np.float64)

XGBoostModel

XGBoostModel(regressor, cat_data)

XGBoost Model

Attributes:

Name	Type	Description
`regressor`	`XGBRegressor`	The trained XGBoost XGBRegressor model
`cat_data`	`TreeBasedCategoricalData`

Source code in openavmkit/utilities/modeling.py

def __init__(self, regressor, cat_data):
    self.regressor = regressor
    self.cat_data = cat_data

greedy_forward_loocv

greedy_forward_loocv(X, y, *, k_max=None, min_gain=0.002, standardize=True, prescreen_k=15)

Greedy forward selection maximizing LOOCV R² (fast). Auto-detects intercept handling: - If X contains a 'const' column, treats it as intercept (always included, not selectable, not standardized). - Otherwise, adds an intercept internally (as before).

Assumes X, y are numeric, aligned, and NaN-free.

Source code in openavmkit/utilities/modeling.py

def greedy_forward_loocv(
    X: pd.DataFrame,
    y: pd.Series,
    *,
    k_max: Optional[int] = None,
    min_gain: float = 0.002,          # stop if best add improves CV-R² < 0.2%
    standardize: bool = True,
    prescreen_k: Optional[int] = 15,  # optionally screen to top K by |corr| for speed
) -> GreedyResult:
    """
    Greedy forward selection maximizing LOOCV R² (fast).
    Auto-detects intercept handling:
      - If X contains a 'const' column, treats it as intercept (always included, not selectable, not standardized).
      - Otherwise, adds an intercept internally (as before).

    Assumes X, y are numeric, aligned, and NaN-free.
    """

    yv = y.to_numpy(dtype=float)
    cols_all = list(X.columns)

    has_const = "const" in cols_all
    if has_const:
        # Treat provided const as the intercept (force-include)
        const_vec = X["const"].to_numpy(dtype=float).reshape(-1, 1)

        # Optional sanity check (not required, but helpful during debugging):
        # if not (np.nanstd(const_vec) < 1e-12):
        #     raise ValueError("'const' column exists but is not (near-)constant; cannot treat as intercept safely.")

        X_no_const = X.drop(columns=["const"])
        cols = list(X_no_const.columns)
        Xmat = X_no_const.to_numpy(dtype=float)
    else:
        const_vec = None
        cols = cols_all
        Xmat = X.to_numpy(dtype=float)

    n, p = Xmat.shape
    if p == 0 and not has_const:
        return GreedyResult([], cv_r2=float("-inf"), train_r2=float("-inf"))

    # SST
    y_centered = yv - yv.mean()
    sst = float(y_centered.T @ y_centered)
    if sst == 0.0:
        return GreedyResult([], cv_r2=0.0, train_r2=0.0)

    # Standardize X for stability (recommended)
    # IMPORTANT: never standardize the intercept; we already removed it above if present.
    if standardize and p > 0:
        mu = Xmat.mean(axis=0)
        sd = Xmat.std(axis=0, ddof=0)
        sd = np.where(sd == 0, 1.0, sd)
        Xmat = (Xmat - mu) / sd

    # Adaptive size cap if not provided
    if k_max is None:
        # if p==0 but has_const, we still just fit const-only
        k_max = 0 if p == 0 else min(p, max(2, n // 5))
    k_max = max(0, min(k_max, p))

    # Optional prescreen to reduce p cheaply
    if prescreen_k is not None and prescreen_k < p and p > 0:
        y_std = y_centered.std(ddof=0)
        if y_std == 0:
            return GreedyResult([], cv_r2=0.0, train_r2=0.0)
        x_std = Xmat.std(axis=0, ddof=0)
        x_std = np.where(x_std == 0, 1.0, x_std)
        corr = (Xmat.T @ y_centered) / (n * x_std * y_std)
        keep_idx = np.argsort(np.abs(corr))[-prescreen_k:]
        keep_idx = np.sort(keep_idx)
        Xmat = Xmat[:, keep_idx]
        cols = [cols[i] for i in keep_idx]
        p = Xmat.shape[1]
        k_max = min(k_max, p)

    selected: List[int] = []
    remaining = set(range(p))

    # baseline
    if has_const:
        # intercept is provided as a column; baseline is const-only
        X0 = const_vec
        best_train_r2, best_cv_r2 = _ols_train_and_loocv_r2(X0, yv, sst)
    else:
        # original behavior: intercept-only baseline
        X0 = np.ones((n, 1))
        best_train_r2, best_cv_r2 = _ols_train_and_loocv_r2(X0, yv, sst)

    while remaining and len(selected) < k_max:
        step_best_cv = best_cv_r2
        step_best_train = best_train_r2
        step_best_j = None

        for j in list(remaining):
            idxs = selected + [j]
            Xsub = Xmat[:, idxs]

            if has_const:
                # Use provided intercept column
                X_design = np.column_stack([const_vec, Xsub])
            else:
                # Add intercept internally (original behavior)
                X_design = np.column_stack([np.ones(n), Xsub])

            train_r2, cv_r2 = _ols_train_and_loocv_r2(X_design, yv, sst)
            if cv_r2 > step_best_cv + 1e-12:
                step_best_cv = cv_r2
                step_best_train = train_r2
                step_best_j = j

        if step_best_j is None or (step_best_cv - best_cv_r2) < min_gain:
            break

        selected.append(step_best_j)
        remaining.remove(step_best_j)
        best_cv_r2 = step_best_cv
        best_train_r2 = step_best_train

    # Return only non-const variables (const is forced-in when present)
    return GreedyResult([cols[i] for i in selected], best_cv_r2, best_train_r2)

openavmkit.utilities.modeling

AverageModel

CatBoostModel

GWRModel

GarbageModel

GroundTruthModel

LayeredCompModel

LightGBMModel

LocalAreaModel

MRAModel

MultiMRAModel

NGBoostModel

NaiveAreaModel

PassThroughModel

SpatialLagModel

TreeBasedCategoricalData dataclass

apply

from_training_data classmethod

to_numeric_matrix

XGBoostModel

greedy_forward_loocv

`openavmkit.utilities.modeling`

TreeBasedCategoricalData `dataclass`

from_training_data `classmethod`