openavmkit.area_stats

Area-statistic ("neighborhood enrichment") feature generation.

Computes per-location summary statistics and stamps them onto every parcel as new area_stat_<location>_<field>_<stat> features (for example, area_stat_neighborhood_bldg_area_finished_sqft_mean). This is a quantized, group-based counterpart to spatial lag (openavmkit.data.enrich_sup_spatial_lag): instead of a smooth k-nearest-neighbor surface, it summarizes discrete location groups (neighborhood, census tract, ...) at one or more granularities.
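The group-then-stamp idea can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the toy data, the helper name stamp_area_stat, and the f-string used to build the column name are assumptions made for this example, chosen to match the naming pattern described above.

```python
import pandas as pd

# Toy universe: five parcels in two neighborhoods (all values are made up).
universe = pd.DataFrame(
    {
        "key": ["p1", "p2", "p3", "p4", "p5"],
        "neighborhood": ["A", "A", "A", "B", "B"],
        "bldg_area_finished_sqft": [1000.0, 1200.0, 1400.0, 2000.0, 3000.0],
    }
)


def stamp_area_stat(df, location, field, stat):
    """Hypothetical helper: compute one per-group stat and broadcast it to every row."""
    # Assumed naming, following the documented area_stat_<location>_<field>_<stat> pattern.
    colname = f"area_stat_{location}_{field}_{stat}"
    per_group = df.groupby(location)[field].agg(stat)  # one value per location group
    out = df.copy()
    out[colname] = out[location].map(per_group)  # every group member gets the same value
    return out, colname


enriched, colname = stamp_area_stat(
    universe, "neighborhood", "bldg_area_finished_sqft", "mean"
)
print(enriched[["key", "neighborhood", colname]])
```

Every parcel in neighborhood A receives the same mean, which is what makes the feature "quantized" compared with a smooth spatial-lag surface.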

Two rules keep the features honest:

  • Leakage: sale-derived fields (the sale price and its variants) are aggregated over the training set of valid sales only, so test-set prices never enter a feature. Characteristic fields (building area, lot size, quality, zoning, ...) are aggregated over the full universe, since those are known at prediction time. An optional exclude_test_keys flag drops test-key parcels from all aggregation for shops that want strict out-of-sample hygiene.
  • Small samples: a min_count floor blanks a stat to NaN when its group has too few observations, with no fallback. When locations are configured as a hierarchy (coarsest → finest), the coarser levels are simply separate area_stat_* columns the model can lean on where a finer one is missing.
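
As a rough sketch of how the two rules interact, the snippet below restricts a toy sales frame to training valid sales and then blanks any group that falls under the floor. The data, the train_keys set, and the min_count value are invented for illustration; the real function reads these from the canonical splits and the settings.

```python
import numpy as np
import pandas as pd

# Toy sales (made-up values). s3 is a test sale; s5 is not a valid sale.
sales = pd.DataFrame(
    {
        "key_sale": ["s1", "s2", "s3", "s4", "s5"],
        "neighborhood": ["A", "A", "A", "B", "B"],
        "sale_price": [100.0, 200.0, 900.0, 400.0, 500.0],
        "valid_sale": [True, True, True, True, False],
    }
)
train_keys = {"s1", "s2", "s4", "s5"}  # assumed split for this example

# Rule 1 (leakage): aggregate sale prices over training valid sales only.
mask = sales["key_sale"].isin(train_keys) & sales["valid_sale"].eq(True)
train_sales = sales.loc[mask]

grouped = train_sales.groupby("neighborhood")["sale_price"]
mean_price = grouped.mean()
obs_count = grouped.count()  # non-null observations per group

# Rule 2 (small samples): blank groups below the floor, with no fallback.
min_count = 2
mean_price = mean_price.where(obs_count >= min_count, np.nan)
print(mean_price)
```

Neighborhood A keeps its mean because two training sales remain after filtering; neighborhood B has a single usable sale and is blanked to NaN.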

The companion function report_area_stats ranks the generated features by their correlation with sale price and optionally writes a Markdown report.

enrich_sup_area_stats

enrich_sup_area_stats(sup, settings, verbose=False)

Enrich sales and universe with per-location area-statistic features.

Reads the data.process.enrich.area_stats configuration and, for each configured location × field × stat combination, computes the statistic within each location group and stamps it onto every parcel. A per-location count (group size) column is always emitted, along with per-location counts of training valid sales (total, improved, and vacant). If the feature is not configured, sup is returned unchanged.
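
A configuration for this step might look like the sketch below. The option names (locations, fields, stats, min_count, exclude_test_keys) are the keys the function reads from its config; the nesting as plain dictionaries under data.process.enrich.area_stats, the census_tract column name, and the chosen values are assumptions for illustration. The example uses only a characteristic field, because bare sale-price fields are auto-expanded into a larger family of columns.

```python
from itertools import product

# Hypothetical settings fragment; values are illustrative, not defaults.
settings = {
    "data": {
        "process": {
            "enrich": {
                "area_stats": {
                    "locations": ["census_tract", "neighborhood"],  # coarsest -> finest
                    "fields": ["bldg_area_finished_sqft"],
                    "stats": ["mean"],
                    "min_count": 5,
                    "exclude_test_keys": False,
                }
            }
        }
    }
}

cfg = settings["data"]["process"]["enrich"]["area_stats"]

# Enumerate the location x field x stat grid using the documented naming pattern.
names = [
    f"area_stat_{location}_{field}_{stat}"
    for location, field, stat in product(cfg["locations"], cfg["fields"], cfg["stats"])
]
print(names)
```

The per-location count columns are emitted in addition to this grid, so the real output contains more columns than the list printed here.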

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sup | SalesUniversePair | SalesUniversePair containing sales and universe DataFrames. | required |
| settings | dict | Settings dictionary. | required |
| verbose | bool | If True, prints progress information. | False |

Returns:

| Type | Description |
|---|---|
| SalesUniversePair | Enriched SalesUniversePair with new area_stat_* columns. |

Source code in openavmkit/area_stats.py
def enrich_sup_area_stats(
    sup: SalesUniversePair, settings: dict, verbose: bool = False
) -> SalesUniversePair:
    """Enrich sales and universe with per-location area-statistic features.

    Reads the ``data.process.enrich.area_stats`` configuration and, for each configured
    ``location × field × stat`` combination, computes the statistic within each location
    group and stamps it onto every parcel. A per-location ``count`` (group size) column is
    always emitted. If the feature is not configured, ``sup`` is returned unchanged.

    Parameters
    ----------
    sup : SalesUniversePair
        SalesUniversePair containing sales and universe DataFrames.
    settings : dict
        Settings dictionary.
    verbose : bool, optional
        If True, prints progress information.

    Returns
    -------
    SalesUniversePair
        Enriched SalesUniversePair with new ``area_stat_*`` columns.
    """
    cfg = get_area_stats_config(settings)
    if not cfg:
        if verbose:
            print("area_stats: no configuration found; skipping.")
        return sup

    locations = cfg.get("locations", []) or []
    # Bare sale-price fields auto-expand into the full per-area family (level +
    # improved $/bldg-sqft + vacant/improved $/land-sqft), for both raw and time-adjusted.
    explicit_fields = set(cfg.get("fields", []) or [])
    fields = expand_area_stats_fields(settings, cfg.get("fields", []) or [])
    num_stats = cfg.get("stats", AREA_STATS_NUMERIC_DEFAULT) or []
    cat_stats = cfg.get("categorical_stats", AREA_STATS_CATEGORICAL_DEFAULT) or []
    min_count = int(cfg.get("min_count", 0) or 0)
    exclude_test_keys = bool(cfg.get("exclude_test_keys", False))

    df_sales = sup.sales.copy()
    df_universe = sup.universe.copy()

    # Split configured fields into sale-derived (train-only) and characteristic (universe).
    sale_fields = [f for f in fields if is_sale_derived_field(settings, f)]

    # Resolve base-field kinds once (numeric vs categorical) to pick the stat family.
    num_set = set(get_fields_numeric(settings, include_boolean=True))
    cat_set = set(get_fields_categorical(settings))
    unit = area_unit(settings)

    # Source frames -----------------------------------------------------------------
    # The sales frame is always built: sale-derived stats use it AND the per-location
    # sales counts (total / improved / vacant) are derived from it. It is restricted to
    # training valid sales so nothing here leaks the target.
    df_univ_src = df_universe
    df_hydrated = get_hydrated_sales_from_sup(sup)
    try:
        train_keys, test_keys = get_train_test_keys(df_hydrated, settings)
    except KeyError:
        # No model_group column / canonical splits available (e.g. run before the split
        # stage, or minimal frames): we can't build a leakage-safe sales source, so
        # sale-derived stats and sales counts are skipped rather than risk leakage.
        if sale_fields:
            warnings.warn(
                "area_stats: no train/test split available (missing 'model_group' or "
                "canonical splits); sale-derived fields and sales counts will be empty. "
                "Run after write_canonical_splits."
            )
        train_keys, test_keys = np.array([], dtype=str), np.array([], dtype=str)
    train_keys = set(np.asarray(train_keys).astype(str))
    test_keys = set(np.asarray(test_keys).astype(str))

    sale_mask = df_hydrated["key_sale"].astype(str).isin(train_keys)
    if "valid_sale" in df_hydrated.columns:
        sale_mask &= df_hydrated["valid_sale"].eq(True)
    df_sale_src = df_hydrated.loc[sale_mask].copy()
    if sale_fields:
        # Synthesize area-unit-normalized sale fields (e.g. $/finished-sqft) on the
        # train-only frame so they stay leakage-guarded, mirroring spatial lag.
        df_sale_src = _synthesize_sale_unit_fields(df_sale_src, sale_fields, unit)

    if exclude_test_keys:
        test_parcels = set(
            df_hydrated.loc[
                df_hydrated["key_sale"].astype(str).isin(test_keys), "key"
            ].astype(str)
        )
        df_univ_src = df_universe[~df_universe["key"].astype(str).isin(test_parcels)]

    # Compute and stamp --------------------------------------------------------------
    new_cols: list[str] = []

    for location in locations:
        if location not in df_universe.columns:
            warnings.warn(
                f"area_stats: location '{location}' not found in universe; skipping."
            )
            continue

        # Per-location counts (always emitted, never masked by min_count):
        #   count                -> universe parcels in the area
        #   sales_count          -> training valid sales in the area
        #   sales_count_improved -> of those, improved sales
        #   sales_count_vacant   -> of those, vacant sales
        parcels_name = make_area_stat_count_field_name(location, "count")
        df_universe[parcels_name] = df_universe[location].map(
            df_univ_src.groupby(location).size()
        )
        new_cols.append(parcels_name)

        total, improved, vacant = _sales_counts_by_group(df_sale_src, location, unit)
        for kind, series in (
            ("sales_count", total),
            ("sales_count_improved", improved),
            ("sales_count_vacant", vacant),
        ):
            cname = make_area_stat_count_field_name(location, kind)
            df_universe[cname] = df_universe[location].map(series).fillna(0)
            new_cols.append(cname)

        for field in fields:
            is_sale = field in sale_fields
            src = df_sale_src if is_sale else df_univ_src
            if src is None or field not in src.columns or location not in src.columns:
                # Only warn for fields the user listed explicitly; auto-expanded
                # sale-rate variants that don't apply here are skipped silently.
                if field in explicit_fields:
                    warnings.warn(
                        f"area_stats: field '{field}' unavailable for location "
                        f"'{location}'; skipping."
                    )
                continue

            kind = _base_field_kind(settings, field, src[field], num_set, cat_set)
            stats_list = num_stats if kind == "numeric" else cat_stats

            # Non-null observations per group drive the min_count floor.
            obs_count = src.groupby(location)[field].count()
            small_groups = (
                obs_count.index[obs_count < min_count] if min_count > 0 else None
            )

            for stat in stats_list:
                colname = make_area_stat_field_name(location, field, stat)
                series = _aggregate(src, location, field, stat, kind)
                if small_groups is not None and len(small_groups) > 0:
                    series = series.copy()
                    series.loc[series.index.isin(small_groups)] = np.nan
                df_universe[colname] = df_universe[location].map(series)
                new_cols.append(colname)

    # Propagate the universe-level features onto sales by parcel key (matches the
    # universe -> sales pattern used by spatial lag).
    for col in new_cols:
        df_sales = _fill_col_from_universe(df_sales, df_universe, col)

    if verbose:
        print(
            f"area_stats: added {len(new_cols)} column(s) across "
            f"{len(locations)} location(s)."
        )

    return SalesUniversePair(df_sales, df_universe)
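
The count columns and the final universe-to-sales propagation can be pictured with a small pandas sketch. It is only an approximation of the pattern in the listing above: the column names n_parcels and n_train_sales are hypothetical stand-ins (the real names come from make_area_stat_count_field_name), and the key-based map stands in for the private helper that copies columns onto sales.

```python
import pandas as pd

# Toy frames (made-up values). Neighborhood C has parcels but no training sales.
universe = pd.DataFrame(
    {"key": ["p1", "p2", "p3", "p4"], "neighborhood": ["A", "A", "B", "C"]}
)
train_sales = pd.DataFrame(
    {
        "key_sale": ["s1", "s2", "s3"],
        "key": ["p1", "p2", "p3"],
        "neighborhood": ["A", "A", "B"],
    }
)

# Parcel count per area: group size in the universe, mapped back to each parcel.
universe["n_parcels"] = universe["neighborhood"].map(
    universe.groupby("neighborhood").size()
)

# Training-sale count per area: areas with no sales map to NaN, then become 0.
universe["n_train_sales"] = (
    universe["neighborhood"]
    .map(train_sales.groupby("neighborhood").size())
    .fillna(0)
)

# Propagate the universe-level columns onto sales by parcel key.
lookup = universe.set_index("key")
for col in ["n_parcels", "n_train_sales"]:
    train_sales[col] = train_sales["key"].map(lookup[col])

print(universe)
print(train_sales)
```

Filling missing sale counts with 0 keeps "no training sales here" distinct from a blanked statistic, which stays NaN.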

report_area_stats

report_area_stats(sup, settings, outpath=None, threshold=0.1, do_plots=False, verbose=False)

Rank area-stat features by their correlation with sale price.

Computes the correlation of every numeric area_stat_* column with the sale price (over valid sales), returning a DataFrame ranked by correlation strength. When outpath is provided, also writes a Markdown report (and PDF/HTML per analysis.report.formats).
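
To give a feel for the ranking, the sketch below scores toy area_stat_* columns by absolute Pearson correlation with the sale price. This is a simplified stand-in, not what calc_correlations computes: the library's corr_strength, corr_clarity, and corr_score columns are not reproduced here, and the literal "area_stat_" prefix, the column names, and the data are assumptions for illustration.

```python
import pandas as pd

# Toy valid-sales frame (made-up values).
valid_sales = pd.DataFrame(
    {
        "sale_price": [100.0, 200.0, 300.0, 400.0],
        "area_stat_neighborhood_a_mean": [1.0, 2.0, 3.0, 4.0],
        "area_stat_neighborhood_b_mean": [4.0, 1.0, 3.0, 2.0],
        "area_stat_neighborhood_label": ["x", "y", "x", "y"],  # non-numeric, skipped
    }
)

# Keep only numeric columns that carry the (assumed) area-stat prefix.
area_cols = [
    c
    for c in valid_sales.columns
    if c.startswith("area_stat_") and pd.api.types.is_numeric_dtype(valid_sales[c])
]

# Absolute Pearson correlation with sale price, strongest first.
strength = valid_sales[area_cols].corrwith(valid_sales["sale_price"]).abs()
ranked = (
    strength.sort_values(ascending=False)
    .rename_axis("variable")
    .reset_index(name="abs_pearson_r")
)
print(ranked)
```

Taking the absolute value means a strongly negative relationship ranks as high as a strongly positive one, which suits a "how informative is this feature" ordering.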

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sup | SalesUniversePair | SalesUniversePair already enriched via enrich_sup_area_stats. | required |
| settings | dict | Settings dictionary. | required |
| outpath | str | Output path (without extension) for the Markdown report. If None, no file is written and only the ranked DataFrame is returned. | None |
| threshold | float | Correlation score threshold passed to calc_correlations. | 0.1 |
| do_plots | bool | If True, render correlation heatmaps. | False |
| verbose | bool | If True, prints progress information. | False |

Returns:

| Type | Description |
|---|---|
| DataFrame | Columns variable, corr_strength, corr_clarity, corr_score, sorted by corr_strength descending. |

Source code in openavmkit/area_stats.py
def report_area_stats(
    sup: SalesUniversePair,
    settings: dict,
    outpath: str = None,
    threshold: float = 0.1,
    do_plots: bool = False,
    verbose: bool = False,
) -> pd.DataFrame:
    """Rank area-stat features by their correlation with sale price.

    Computes the correlation of every numeric ``area_stat_*`` column with the sale price
    (over valid sales), returning a DataFrame ranked by correlation strength. When
    ``outpath`` is provided, also writes a Markdown report (and PDF/HTML per
    ``analysis.report.formats``).

    Parameters
    ----------
    sup : SalesUniversePair
        SalesUniversePair already enriched via :func:`enrich_sup_area_stats`.
    settings : dict
        Settings dictionary.
    outpath : str, optional
        Output path (without extension) for the Markdown report. If None, no file is
        written and only the ranked DataFrame is returned.
    threshold : float, optional
        Correlation score threshold passed to :func:`calc_correlations`. Defaults to 0.1.
    do_plots : bool, optional
        If True, render correlation heatmaps. Defaults to False.
    verbose : bool, optional
        If True, prints progress information.

    Returns
    -------
    pandas.DataFrame
        Columns ``variable``, ``corr_strength``, ``corr_clarity``, ``corr_score``, sorted
        by ``corr_strength`` descending.
    """
    empty = pd.DataFrame(
        columns=["variable", "corr_strength", "corr_clarity", "corr_score"]
    )

    df = get_hydrated_sales_from_sup(sup)
    if "valid_sale" in df.columns:
        df = df[df["valid_sale"].eq(True)]

    sale_field = get_sale_field(settings, df)
    # get_sale_field returns the time-adjusted field by default even if it isn't
    # present; fall back to raw sale_price so the report still works pre-time-adjustment.
    if sale_field not in df.columns and "sale_price" in df.columns:
        sale_field = "sale_price"
    if sale_field not in df.columns:
        warnings.warn(
            f"area_stats report: sale field '{sale_field}' not found; skipping report."
        )
        return empty

    area_cols = [
        c
        for c in df.columns
        if c.startswith(AREA_STAT_PREFIX) and pd.api.types.is_numeric_dtype(df[c])
    ]
    if not area_cols:
        warnings.warn(
            "area_stats report: no numeric area_stat columns found; "
            "run enrich_sup_area_stats first."
        )
        return empty

    x_corr = df[[sale_field] + area_cols].copy()
    corr = calc_correlations(x_corr, threshold=threshold, do_plots=do_plots)

    ranked = corr["initial"].copy()
    ranked = ranked[ranked["variable"] != sale_field]
    ranked = ranked.sort_values(
        "corr_strength", ascending=False, na_position="last"
    ).reset_index(drop=True)

    if verbose:
        print(f"area_stats report: ranked {len(ranked)} feature(s) vs '{sale_field}'.")

    if outpath is not None:
        _write_area_stats_report(ranked, settings, outpath, sale_field)

    return ranked