Skip to content

openavmkit.utilities.modeling

AverageModel

AverageModel(type, sales_chase)

An intentionally bad predictive model, to use as a sort of control. Produces predictions equal to the average of observed sale prices.

Attributes:

Name Type Description
type str

The type of average to use

sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

Initialize an AverageModel

Parameters:

Name Type Description Default
type str

The type of average to use

required
sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

required
Source code in openavmkit/utilities/modeling.py
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
def __init__(self, type: str, sales_chase: float):
    """Initialize an AverageModel

    Parameters
    ----------
    type : str
        The type of average to use
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    """
    self.type = type
    self.sales_chase = sales_chase

CatBoostModel

CatBoostModel(regressor, cat_data)

CatBoost Model

Attributes:

Name Type Description
regressor CatBRegressor

The trained CatBoost CatBRegressor model

cat_data TreeBasedCategoricalData
Source code in openavmkit/utilities/modeling.py
554
555
556
def __init__(self, regressor, cat_data):
    self.regressor = regressor
    self.cat_data = cat_data

GWRModel

GWRModel(coords_train, X_train, y_train, gwr_bw)

Geographic Weighted Regression Model

Attributes:

Name Type Description
coords_train list[tuple[float, float]]

list of geospatial coordinates corresponding to each observation in the training set

X_train ndarray

2D array of independent variables' values from the training set

y_train ndarray

1D array of dependent variable's values from the training set

gwr_bw float

Bandwidth for GWR calculation

df_params_test DataFrame

Coefficients for the test set

df_params_sales DataFrame

Coefficients for the sales set

df_params_universe DataFrame

Coefficients for the universe set

Parameters:

Name Type Description Default
coords_train list[tuple[float, float]]

list of geospatial coordinates corresponding to each observation in the training set

required
X_train ndarray

2D array of independent variables' values from the training set

required
y_train ndarray

1D array of dependent variable's values from the training set

required
gwr_bw float

Bandwidth for GWR calculation

required
Source code in openavmkit/utilities/modeling.py
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
def __init__(
    self,
    coords_train: list[tuple[float, float]],
    X_train: np.ndarray,
    y_train: np.ndarray,
    gwr_bw: float
):
    """
    Parameters
    ----------
    coords_train : list[tuple[float, float]]
        list of geospatial coordinates corresponding to each observation in the training set
    X_train : np.ndarray
        2D array of independent variables' values from the training set
    y_train : np.ndarray
        1D array of dependent variable's values from the training set
    gwr_bw : float
        Bandwidth for GWR calculation
    """
    self.coords_train = coords_train
    self.X_train = X_train
    self.y_train = y_train
    self.gwr_bw = gwr_bw
    self.df_params_sales = None
    self.df_params_univ = None
    self.df_params_test = None

GarbageModel

GarbageModel(min_value, max_value, sales_chase, normal)

An intentionally bad predictive model, to use as a sort of control. Produces random predictions.

Attributes:

Name Type Description
min_value float

The minimum value of to "predict"

max_value float

The maximum value of to "predict"

sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

normal bool

If True, the randomly generated predictions follow a normal distribution based on the observed sale price's standard deviation. If False, randomly generated predictions follow a uniform distribution between min and max.

Initialize a GarbageModel

Parameters:

Name Type Description Default
min_value float

The minimum value of to "predict"

required
max_value float

The maximum value of to "predict"

required
sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

required
normal bool

If True, the randomly generated predictions follow a normal distribution based on the observed sale price's standard deviation. If False, randomly generated predictions follow a uniform distribution between min and max.

required
Source code in openavmkit/utilities/modeling.py
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
def __init__(
    self, min_value: float, max_value: float, sales_chase: float, normal: bool
):
    """Initialize a GarbageModel

    Parameters
    ----------
    min_value : float
        The minimum value of to "predict"
    max_value : float
        The maximum value of to "predict"
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    normal : bool
        If True, the randomly generated predictions follow a normal distribution based on the observed sale price's
        standard deviation. If False, randomly generated predictions follow a uniform distribution between min and max.
    """
    self.min_value = min_value
    self.max_value = max_value
    self.sales_chase = sales_chase
    self.normal = normal

GroundTruthModel

GroundTruthModel(observed_field, ground_truth_field)

Mostly only used in Synthetic models, where you want to compare against simulation ground_truth instead of observed sale price, which you can never do in real life.

Attributes:

Name Type Description
observed_field str

The field that represents observed sale prices

ground_truth_field str

The field that represents platonic ground truth

Initialize a GroundTruthModel object

Parameters:

Name Type Description Default
observed_field str

The field that represents observed sale prices

required
ground_truth_field str

The field that represents platonic ground truth

required
Source code in openavmkit/utilities/modeling.py
223
224
225
226
227
228
229
230
231
232
233
234
def __init__(self, observed_field: str, ground_truth_field: str):
    """Initialize a GroundTruthModel object

    Parameters
    ----------
    observed_field : str
        The field that represents observed sale prices
    ground_truth_field : str
        The field that represents platonic ground truth
    """
    self.observed_field = observed_field
    self.ground_truth_field = ground_truth_field

LandSLICEModel

LandSLICEModel(alpha, beta, gam_L, med_size, size_field)

SLICE stands for "Smooth Location w/ Increasing-Concavity Equation."

Attributes:

Name Type Description
alpha float
beta float
gam_L LinearGAM
med_size float
size_field str

...

Parameters:

Name Type Description Default
alpha float
required
beta float
required
gam_L LinearGAM
required
med_size float
required
size_field str
required
Source code in openavmkit/utilities/modeling.py
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
def __init__(
    self,
    alpha: float,
    beta: float,
    gam_L: LinearGAM,
    med_size: float,
    size_field: str
):
    """
    ...

    Parameters
    ----------
    alpha: float
    beta : float
    gam_L : LinearGAM
    med_size : float
    size_field : str
    """
    self.alpha = alpha
    self.beta = beta
    self.gam_L = gam_L
    self.med_size = med_size
    self.size_field = size_field

LightGBMModel

LightGBMModel(booster, cat_data)

LightGBM Model

Attributes:

Name Type Description
booster Booster

The trained LightGBM Booster model

cat_data TreeBasedCategoricalData
Source code in openavmkit/utilities/modeling.py
526
527
528
def __init__(self, booster, cat_data):
    self.booster = booster
    self.cat_data = cat_data

LocalAreaModel

LocalAreaModel(loc_map, location_fields, overall_per_impr_area, overall_per_land_area, sales_chase)

Produces predictions equal to the localized average price/area of land or building, multiplied by the observed size of the parcel's land or building, depending on whether it's vacant or improved.

Unlike NaiveAreaModel, this model is sensitive to location, based on user-specified locations, and might actually result in decent predictions.

Attributes:

Name Type Description
loc_map dict[str : tuple[DataFrame, DataFrame]

A dictionary that maps location field names to localized per-area values. The dictionary itself is keyed by the names of the location fields themselves (e.g. "neighborhood", "market_region", "census_tract", etc.) or whatever the user specifies.

Each entry is a tuple containing two DataFrames:

  • Values per improved square foot
  • Values per land square foot

Each DataFrame is keyed by the unique values for the given location. (e.g. "River heights", "Meadowbrook", etc., if the location field in question is "neighborhood") The other field in each DataFrame will be {location_field}_per_impr_{unit} or {location_field}_per_land_{unit}

location_fields list

List of location fields used (e.g. "neighborhood", "market_region", "census_tract", etc.)

overall_per_impr_area float

Fallback value per improved square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.

overall_per_land_area float

Fallback value per land square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.

sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

Initialize a LocalAreaModel

Parameters:

Name Type Description Default
loc_map dict[str : tuple[DataFrame, DataFrame]

A dictionary that maps location field names to localized per-area values. The dictionary itself is keyed by the names of the location fields themselves (e.g. "neighborhood", "market_region", "census_tract", etc.) or whatever the user specifies.

Each entry is a tuple containing two DataFrames:

  • Values per improved square foot
  • Values per land square foot

Each DataFrame is keyed by the unique values for the given location. (e.g. "River heights", "Meadowbrook", etc., if the location field in question is "neighborhood") The other field in each DataFrame will be {location_field}_per_impr_{unit} or {location_field}_per_land_{unit}

required
location_fields list

List of location fields used (e.g. "neighborhood", "market_region", "census_tract", etc.)

required
overall_per_impr_area float

Fallback value per improved square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.

required
overall_per_land_area float

Fallback value per land square foot, to use for parcels of unspecified location. Based on the overall average value for the dataset.

required
sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

required
Source code in openavmkit/utilities/modeling.py
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
def __init__(
    self,
    loc_map: dict,
    location_fields: list,
    overall_per_impr_area: float,
    overall_per_land_area: float,
    sales_chase: float,
):
    """Initialize a LocalAreaModel

    Parameters
    ----------
    loc_map : dict[str : tuple[DataFrame, DataFrame]
        A dictionary that maps location field names to localized per-area values. The dictionary itself is keyed by the
        names of the location fields themselves (e.g. "neighborhood", "market_region", "census_tract", etc.) or whatever
        the user specifies.

        Each entry is a tuple containing two DataFrames:

          - Values per improved square foot
          - Values per land square foot

        Each DataFrame is keyed by the unique *values* for the given location. (e.g. "River heights", "Meadowbrook",
        etc., if the location field in question is "neighborhood") The other field in each DataFrame will be
        ``{location_field}_per_impr_{unit}`` or ``{location_field}_per_land_{unit}``
    location_fields : list
        List of location fields used (e.g. "neighborhood", "market_region", "census_tract", etc.)
    overall_per_impr_area : float
        Fallback value per improved square foot, to use for parcels of unspecified location. Based on the
        overall average value for the dataset.
    overall_per_land_area : float
        Fallback value per land square foot, to use for parcels of unspecified location. Based on the overall average
        value for the dataset.
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    """
    self.loc_map = loc_map
    self.location_fields = location_fields
    self.overall_per_impr_area = overall_per_impr_area
    self.overall_per_land_area = overall_per_land_area
    self.sales_chase = sales_chase

MRAModel

MRAModel(fitted_model, intercept)

Multiple Regression Analysis Model

Plain 'ol (multiple) linear regression

Attributes:

Name Type Description
fitted_model RegressionResults

Fitted model from running the regression

intercept bool

Whether the model was fit with an intercept or not.

Source code in openavmkit/utilities/modeling.py
571
572
573
def __init__(self, fitted_model: RegressionResults, intercept: bool):
    self.fitted_model = fitted_model
    self.intercept = intercept

MultiMRAModel

MultiMRAModel(coef_map, global_coef, feature_names, intercept, location_fields)

Multi-MRA (hierarchical local OLS) model.

For each location field (e.g. "block", "neighborhood", ...), and for each distinct value of that field, we fit a separate OLS regression using the same set of independent variables.

We store: - A global OLS coefficient vector (fallback when no local model applies) - A mapping from (location_field, location_value) -> coefficient vector - The feature_names (column order) used for all regressions - Whether an intercept was used - The location_fields (ordered most specific -> least specific)

Attributes:

Name Type Description
coef_map dict[str, dict[Any, ndarray]]

Mapping from location field name to a dict mapping location value -> coefficient vector (aligned with feature_names).

global_coef ndarray

Coefficient vector for the global OLS regression.

feature_names list[str]

Ordered list of feature names used for all regressions.

intercept bool

Whether an intercept column was used.

location_fields list[str]

Location fields in order from most specific to least specific.

Parameters:

Name Type Description Default
coef_map dict[str, dict[Any, ndarray]]

Mapping from location field name to a dict mapping location value -> coefficient vector (aligned with feature_names).

required
global_coef ndarray

Coefficient vector for the global OLS regression.

required
feature_names list[str]

Ordered list of feature names used for all regressions.

required
intercept bool

Whether an intercept column was used.

required
location_fields list[str]

Location fields in order from most specific to least specific.

required
Source code in openavmkit/utilities/modeling.py
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
def __init__(
    self,
    coef_map: dict[str, dict[Any, np.ndarray]],
    global_coef: np.ndarray,
    feature_names: list[str],
    intercept: bool,
    location_fields: list[str],
):
    """
    Parameters
    ----------
    coef_map : dict[str, dict[Any, np.ndarray]]
        Mapping from location field name to a dict mapping
        location value -> coefficient vector (aligned with feature_names).
    global_coef : np.ndarray
        Coefficient vector for the global OLS regression.
    feature_names : list[str]
        Ordered list of feature names used for all regressions.
    intercept : bool
        Whether an intercept column was used.
    location_fields : list[str]
        Location fields in order from most specific to least specific.
    """
    self.coef_map = coef_map
    self.global_coef = global_coef
    self.feature_names = feature_names
    self.intercept = intercept
    self.location_fields = location_fields

NaiveAreaModel

NaiveAreaModel(dep_per_built_area, dep_per_land_area, sales_chase)

An intentionally bad predictive model, to use as a sort of control. Produces predictions equal to the prevailing average price/area of land or building, multiplied by the observed size of the parcel's land or building, depending on whether it's vacant or improved.

Attributes:

Name Type Description
dep_per_built_area float

Dependent variable value divided by improved square footage

dep_per_land_area float

Dependent variable value divided by land square footage

sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

Initialize a NaiveAreaModel

Parameters:

Name Type Description Default
dep_per_built_area float

Dependent variable value divided by improved square footage

required
dep_per_land_area float

Dependent variable value divided by land square footage

required
sales_chase float

Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of sales_chase. So sales_chase=0.05 will copy each sale price with 5% random noise. NOTE: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.

required
Source code in openavmkit/utilities/modeling.py
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
def __init__(
    self, dep_per_built_area: float, dep_per_land_area: float, sales_chase: float
):
    """Initialize a NaiveAreaModel

    Parameters
    ----------
    dep_per_built_area: float
        Dependent variable value divided by improved square footage
    dep_per_land_area: float
        Dependent variable value divided by land square footage
    sales_chase : float
        Simulates sales chasing. If 0.0, no sales chasing will occur. For any other value, predictions against sold
        parcels will chase (copy) the observed sale price, with a bit of random noise equal to the value of
        ``sales_chase``. So ``sales_chase=0.05`` will copy each sale price with 5% random noise.
        **NOTE**: This is for analytical purposes only, one should not intentionally chase sales when working in actual production.
    """
    self.dep_per_built_area = dep_per_built_area
    self.dep_per_land_area = dep_per_land_area
    self.sales_chase = sales_chase

PassThroughModel

PassThroughModel(field, engine)

Mostly used for representing existing valuations to compare against, such as the Assessor's values

Attributes:

Name Type Description
field str

The field that holds the values you want to pass through as predictions

Initialize a PassThroughModel

Parameters:

Name Type Description Default
field str

The field that holds the values you want to pass through as predictions

required
engine str

The model engine ("assessor" or "pass_through")

required
Source code in openavmkit/utilities/modeling.py
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
def __init__(
    self,
    field: str,
    engine: str
):
    """Initialize a PassThroughModel

    Parameters
    ----------
    field : str
        The field that holds the values you want to pass through as predictions
    engine : str
        The model engine ("assessor" or "pass_through")
    """
    self.field = field
    self.engine = engine

SpatialLagModel

SpatialLagModel(per_area)

Use a spatial lag field as your prediction

Attributes:

Name Type Description
per_area bool

If True, normalize by area unit. If False, use the direct value of the spatial lag field.

Initialize a SpatialLagModel

Parameters:

Name Type Description Default
per_area bool

If True, normalize by square foot. If False, use the direct value of the spatial lag field.

required
Source code in openavmkit/utilities/modeling.py
246
247
248
249
250
251
252
253
254
def __init__(self, per_area: bool):
    """Initialize a SpatialLagModel

    Parameters
    ----------
    per_area : bool
        If True, normalize by square foot. If False, use the direct value of the spatial lag field.
    """
    self.per_area = per_area

TreeBasedCategoricalData dataclass

TreeBasedCategoricalData(feature_names, categorical_cols, category_levels, bool_cols)

Stores categorical metadata needed to reproduce LightGBM-compatible categorical encodings and generate numeric matrices for SHAP.

apply

apply(X)

Reapply categorical + boolean structure to a dataframe. Unknown categories are preserved but will map to NaN later.

Source code in openavmkit/utilities/modeling.py
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
def apply(self, X: pd.DataFrame) -> pd.DataFrame:
    """
    Reapply categorical + boolean structure to a dataframe.
    Unknown categories are preserved but will map to NaN later.
    """
    X = X.reindex(columns=self.feature_names)

    # enforce booleans
    for c in self.bool_cols:
        if c in X.columns:
            X[c] = X[c].astype("boolean")

    # enforce categorical levels
    for c, levels in self.category_levels.items():
        if c in X.columns:
            X[c] = pd.Categorical(X[c], categories=levels)

    return X

from_training_data classmethod

from_training_data(X_train, categorical_cols)

Build metadata from training data AFTER categoricals have been converted to pandas 'category' dtype.

Source code in openavmkit/utilities/modeling.py
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
@classmethod
def from_training_data(
    cls,
    X_train: pd.DataFrame,
    categorical_cols: List[str],
) -> "TreeBasedCategoricalData":
    """
    Build metadata from training data AFTER categoricals have been
    converted to pandas 'category' dtype.
    """
    feature_names = list(X_train.columns)

    cat_cols = [
        c for c in categorical_cols
        if c in X_train.columns
        and pd.api.types.is_categorical_dtype(X_train[c])
    ]

    category_levels = {
        c: list(X_train[c].cat.categories)
        for c in cat_cols
    }

    bool_cols = [
        c for c in X_train.columns
        if pd.api.types.is_bool_dtype(X_train[c])
        or str(X_train[c].dtype) == "boolean"
    ]

    return cls(
        feature_names=feature_names,
        categorical_cols=cat_cols,
        category_levels=category_levels,
        bool_cols=bool_cols,
    )

to_numeric_matrix

to_numeric_matrix(X)

Convert dataframe to a numeric matrix compatible with SHAP. Categoricals -> integer codes, unknowns/missing -> np.nan.

Source code in openavmkit/utilities/modeling.py
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
def to_numeric_matrix(self, X: pd.DataFrame) -> np.ndarray:
    """
    Convert dataframe to a numeric matrix compatible with SHAP.
    Categoricals -> integer codes, unknowns/missing -> np.nan.
    """
    X = self.apply(X)
    out = X.copy()

    for c in self.categorical_cols:
        codes = out[c].cat.codes.astype(np.float64)
        codes[codes == -1] = np.nan
        out[c] = codes

    for c in self.bool_cols:
        out[c] = out[c].astype("Float64").astype(np.float64)

    return out.to_numpy(dtype=np.float64)

XGBoostModel

XGBoostModel(regressor, cat_data)

XGBoost Model

Attributes:

Name Type Description
regressor XGBRegressor

The trained XGBoost XGBRegressor model

cat_data TreeBasedCategoricalData
Source code in openavmkit/utilities/modeling.py
540
541
542
def __init__(self, regressor, cat_data):
    self.regressor = regressor
    self.cat_data = cat_data

greedy_forward_loocv

greedy_forward_loocv(X, y, *, k_max=None, min_gain=0.002, standardize=True, prescreen_k=15)

Greedy forward selection maximizing LOOCV R² (fast). Auto-detects intercept handling: - If X contains a 'const' column, treats it as intercept (always included, not selectable, not standardized). - Otherwise, adds an intercept internally (as before).

Assumes X, y are numeric, aligned, and NaN-free.

Source code in openavmkit/utilities/modeling.py
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
def greedy_forward_loocv(
    X: pd.DataFrame,
    y: pd.Series,
    *,
    k_max: Optional[int] = None,
    min_gain: float = 0.002,          # stop if best add improves CV-R² < 0.2%
    standardize: bool = True,
    prescreen_k: Optional[int] = 15,  # optionally screen to top K by |corr| for speed
) -> GreedyResult:
    """
    Greedy forward selection maximizing LOOCV R² (fast).
    Auto-detects intercept handling:
      - If X contains a 'const' column, treats it as intercept (always included, not selectable, not standardized).
      - Otherwise, adds an intercept internally (as before).

    Assumes X, y are numeric, aligned, and NaN-free.
    """

    yv = y.to_numpy(dtype=float)
    cols_all = list(X.columns)

    has_const = "const" in cols_all
    if has_const:
        # Treat provided const as the intercept (force-include)
        const_vec = X["const"].to_numpy(dtype=float).reshape(-1, 1)

        # Optional sanity check (not required, but helpful during debugging):
        # if not (np.nanstd(const_vec) < 1e-12):
        #     raise ValueError("'const' column exists but is not (near-)constant; cannot treat as intercept safely.")

        X_no_const = X.drop(columns=["const"])
        cols = list(X_no_const.columns)
        Xmat = X_no_const.to_numpy(dtype=float)
    else:
        const_vec = None
        cols = cols_all
        Xmat = X.to_numpy(dtype=float)

    n, p = Xmat.shape
    if p == 0 and not has_const:
        return GreedyResult([], cv_r2=float("-inf"), train_r2=float("-inf"))

    # SST
    y_centered = yv - yv.mean()
    sst = float(y_centered.T @ y_centered)
    if sst == 0.0:
        return GreedyResult([], cv_r2=0.0, train_r2=0.0)

    # Standardize X for stability (recommended)
    # IMPORTANT: never standardize the intercept; we already removed it above if present.
    if standardize and p > 0:
        mu = Xmat.mean(axis=0)
        sd = Xmat.std(axis=0, ddof=0)
        sd = np.where(sd == 0, 1.0, sd)
        Xmat = (Xmat - mu) / sd

    # Adaptive size cap if not provided
    if k_max is None:
        # if p==0 but has_const, we still just fit const-only
        k_max = 0 if p == 0 else min(p, max(2, n // 5))
    k_max = max(0, min(k_max, p))

    # Optional prescreen to reduce p cheaply
    if prescreen_k is not None and prescreen_k < p and p > 0:
        y_std = y_centered.std(ddof=0)
        if y_std == 0:
            return GreedyResult([], cv_r2=0.0, train_r2=0.0)
        x_std = Xmat.std(axis=0, ddof=0)
        x_std = np.where(x_std == 0, 1.0, x_std)
        corr = (Xmat.T @ y_centered) / (n * x_std * y_std)
        keep_idx = np.argsort(np.abs(corr))[-prescreen_k:]
        keep_idx = np.sort(keep_idx)
        Xmat = Xmat[:, keep_idx]
        cols = [cols[i] for i in keep_idx]
        p = Xmat.shape[1]
        k_max = min(k_max, p)

    selected: List[int] = []
    remaining = set(range(p))

    # baseline
    if has_const:
        # intercept is provided as a column; baseline is const-only
        X0 = const_vec
        best_train_r2, best_cv_r2 = _ols_train_and_loocv_r2(X0, yv, sst)
    else:
        # original behavior: intercept-only baseline
        X0 = np.ones((n, 1))
        best_train_r2, best_cv_r2 = _ols_train_and_loocv_r2(X0, yv, sst)

    while remaining and len(selected) < k_max:
        step_best_cv = best_cv_r2
        step_best_train = best_train_r2
        step_best_j = None

        for j in list(remaining):
            idxs = selected + [j]
            Xsub = Xmat[:, idxs]

            if has_const:
                # Use provided intercept column
                X_design = np.column_stack([const_vec, Xsub])
            else:
                # Add intercept internally (original behavior)
                X_design = np.column_stack([np.ones(n), Xsub])

            train_r2, cv_r2 = _ols_train_and_loocv_r2(X_design, yv, sst)
            if cv_r2 > step_best_cv + 1e-12:
                step_best_cv = cv_r2
                step_best_train = train_r2
                step_best_j = j

        if step_best_j is None or (step_best_cv - best_cv_r2) < min_gain:
            break

        selected.append(step_best_j)
        remaining.remove(step_best_j)
        best_cv_r2 = step_best_cv
        best_train_r2 = step_best_train

    # Return only non-const variables (const is forced-in when present)
    return GreedyResult([cols[i] for i in selected], best_cv_r2, best_train_r2)