
openavmkit.synthetic.basic

SyntheticData

SyntheticData(df_universe, df_sales, time_land_mult, time_bldg_mult)

A simple wrapper for holding generated data along with separate land/building inflation/depreciation curves.

Attributes:

df_universe : DataFrame
    The parcel universe
df_sales : DataFrame
    The sales observations
time_land_mult : DataFrame
    Land inflation curve over time
time_bldg_mult : DataFrame
    Building depreciation curve over time

Initialize a SyntheticData object

Parameters:

df_universe : DataFrame, required
    The parcel universe
df_sales : DataFrame, required
    The sales observations
time_land_mult : DataFrame, required
    Land inflation curve over time
time_bldg_mult : DataFrame, required
    Building depreciation curve over time
Source code in openavmkit/synthetic/basic.py
def __init__(
    self,
    df_universe: pd.DataFrame,
    df_sales: pd.DataFrame,
    time_land_mult: pd.DataFrame,
    time_bldg_mult: pd.DataFrame,
):
    """Initialize a SyntheticData object

    Parameters
    ----------
    df_universe : pd.DataFrame
        The parcel universe
    df_sales : pd.DataFrame
        The sales observations
    time_land_mult : pd.DataFrame
        Land inflation curve over time
    time_bldg_mult : pd.DataFrame
        Building depreciation curve over time
    """
    self.df_universe = df_universe
    self.df_sales = df_sales
    self.time_land_mult = time_land_mult
    self.time_bldg_mult = time_bldg_mult
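
A minimal usage sketch with toy placeholder frames; in practice all four DataFrames come from generate_basic (documented below):

import pandas as pd

universe = pd.DataFrame({"key": ["0-0"], "land_area_sqft": [10000]})
sales = pd.DataFrame({"key": ["0-0"], "sale_price": [250000.0]})
land_idx = pd.DataFrame({"period": pd.to_datetime(["2020-01-01"]), "value": [1.0]})
bldg_idx = pd.DataFrame({"period": pd.to_datetime(["2020-01-01"]), "value": [1.0]})

sd = SyntheticData(universe, sales, land_idx, bldg_idx)
print(sd.df_sales["sale_price"].iloc[0])  # 250000.0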

create_rect

create_rect(x, y, width, height)

Create a Shapely Polygon in the shape of a rectangle

Parameters:

x : float, required
    The x-center of the rectangle
y : float, required
    The y-center of the rectangle
width : float, required
    The width of the rectangle
height : float, required
    The height of the rectangle

Returns:

Polygon
    A Shapely polygon representing a rectangle

Source code in openavmkit/synthetic/basic.py
def create_rect(x: float, y: float, width: float, height: float):
    """Create a Shapely Polygon in the shape of a rectangle

    Parameters
    ----------
    x : float
        The x-center of the rectangle
    y : float
        The y-center of the rectangle
    width : float
        The width of the rectangle
    height : float
        The height of the rectangle

    Returns
    -------
    Polygon
        A Shapely polygon representing a rectangle
    """

    half_width = width / 2
    half_height = height / 2
    # Determine the bounds for the rectangle
    minx = x - half_width
    maxx = x + half_width
    miny = y - half_height
    maxy = y + half_height
    return box(minx, miny, maxx, maxy)
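
A quick sanity check of the geometry (a sketch; Shapely's .bounds returns (minx, miny, maxx, maxy)):

rect = create_rect(x=0.0, y=0.0, width=2.0, height=1.0)
print(rect.bounds)  # (-1.0, -0.5, 1.0, 0.5)
print(rect.area)    # 2.0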

generate_basic

generate_basic(size, percent_sales=0.1, percent_vacant=0.1, noise_sales=0.05, seed=1337, land_inflation=None, bldg_inflation=None)

Build a synthetic real-estate data set of parcels and (optionally) sales.

A square grid of size × size parcels is laid out around a notional CBD (central business district). For each parcel the routine simulates:

  • Land characteristics (area, latitude/longitude, distance to CBD, land value).
  • Improvement characteristics (finished square footage, quality/condition scores, age, building type, depreciated value).
  • Time-varying inflation factors for land and improvements, generated with generate_inflation_curve.
  • Optional sale events. Each parcel is given a Bernoulli trial with success probability percent_sales. A successful trial produces one sale whose price is the sum of time-adjusted land and building values, scaled by a uniform noise multiplier drawn from 1 +/- noise_sales.

A parcel may instead be vacant, controlled by percent_vacant. Vacant parcels have land value only. All random draws are reproducible via the seed argument.
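
Because seed fixes NumPy's global RNG before any draws, repeated calls reproduce identical data; a minimal sketch:

sd_a = generate_basic(size=5, seed=123)
sd_b = generate_basic(size=5, seed=123)
assert sd_a.df_sales.equals(sd_b.df_sales)  # identical sale draws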

Parameters:

size : int, required
    Length of one side of the square study area. The function creates size^2 parcels.
percent_sales : float, default 0.1
    Probability (0–1) that a parcel receives one valid sale event.
percent_vacant : float, default 0.1
    Probability (0–1) that a parcel is vacant (no improvement). Vacant parcels may still transact if selected by percent_sales.
noise_sales : float, default 0.05
    Half-width of the uniform noise band applied to the simulated sale price: the multiplier is drawn from U(1 - noise_sales, 1 + noise_sales).
seed : int, default 1337
    Seed passed to numpy.random.seed for reproducibility.
land_inflation : dict or None, default None
    Keyword arguments forwarded to generate_inflation_curve to create a daily land-value index. If None, a preset dict with 10% mean annual inflation (plus mild seasonality) is used.
bldg_inflation : dict or None, default None
    Same as land_inflation but for building improvements. Defaults to a preset dict with 2% mean annual inflation and no seasonality.

Returns:

SyntheticData
    An object with four public attributes:

    df_universe : geopandas.GeoDataFrame
        One record per parcel with geometry and static attributes (distance to CBD, quality/condition scores, etc.).
    df_sales : pandas.DataFrame
        One record per simulated sale (may be empty). Includes sale price, unit-price metrics, sale date, and vacancy flag.
    time_land_mult : pandas.DataFrame
        Daily land inflation multipliers (period, value).
    time_bldg_mult : pandas.DataFrame
        Daily building inflation multipliers (period, value).

Notes
  • The CBD is assumed to sit at latitude 29.760762° N, longitude 95.361937° W (roughly downtown Houston, TX). Parcel coordinates are spread +/-0.20° lat / +/-0.25° lon from that center.
  • Land value decreases linearly with Euclidean (grid-based) distance from the CBD.
  • Building value per square foot depends on building type ("A", "B", "C"), quality, condition, and age depreciation (linear, capped at 100 years); see the worked sketch below.
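
For instance, using the constants in the source below (a base of 50 per sqft, 5 per sqft per quality point, and a 2.0 multiplier for type "C"), a hypothetical quality-4 parcel prices out as:

base_bldg_value = 50   # $/sqft base rate, from the source below
quality_value = 5      # $/sqft added per quality point
bldg_type_mult = 2.0   # multiplier for type "C"
bldg_quality_num = 4   # hypothetical quality score

rate = (base_bldg_value + quality_value * bldg_quality_num) * bldg_type_mult
print(rate)  # 140.0 per sqft, before age/condition depreciation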

Examples:

>>> sd = generate_basic(size=25, percent_sales=0.2, seed=42)
>>> sd.df_universe.head()
>>> sd.df_sales[['key', 'sale_price', 'sale_date']].sample(5)
Source code in openavmkit/synthetic/basic.py
def generate_basic(
    size: int,
    percent_sales: float = 0.1,
    percent_vacant: float = 0.1,
    noise_sales: float = 0.05,
    seed: int = 1337,
    land_inflation: dict = None,
    bldg_inflation: dict = None,
):
    """Build a synthetic real-estate data set of parcels and (optionally) sales.

    A square grid of ``size × size`` parcels is laid out around a notional
    CBD (central business district).  For each parcel the routine simulates:

    * **Land characteristics** (area, latitude/longitude, distance to CBD,
      land value).
    * **Improvement characteristics** (finished square footage,
      quality/condition scores, age, building type, depreciated value).
    * **Time-varying inflation factors** for land and improvements, generated
      with :func:`generate_inflation_curve`.
    * **Optional sale events.**  Each parcel is given a Bernoulli trial with
      success probability ``percent_sales``.  A successful trial produces one
      sale whose price is the sum of time-adjusted land and building values,
      scaled by a uniform noise multiplier drawn from ``1 +/- noise_sales``.

    A parcel may instead be vacant, controlled by ``percent_vacant``.  Vacant
    parcels have land value only.  All random draws are reproducible via the
    ``seed`` argument.

    Parameters
    ----------
    size : int
        Length of one side of the square study area.  The function creates
        ``size^2`` parcels.
    percent_sales : float, default ``0.1``
        Probability (0–1) that a parcel receives *one* valid sale event.
    percent_vacant : float, default ``0.1``
        Probability (0–1) that a parcel is vacant (no improvement).
        Vacant parcels may still transact if selected by ``percent_sales``.
    noise_sales : float, default ``0.05``
        Half-width of the uniform noise band applied to the simulated sale
        price: the multiplier is drawn from
        :math:`\\mathrm{U}(1-\\text{noise\\_sales},\\;1+\\text{noise\\_sales})`.
    seed : int, default ``1337``
        Seed passed to :pyfunc:`numpy.random.seed` for reproducibility.
    land_inflation : dict or None, optional
        Keyword arguments forwarded to :func:`generate_inflation_curve` to
        create a daily land-value index.  If *None*, a preset dict with
        10 % mean annual inflation (plus mild seasonality) is used.
    bldg_inflation : dict or None, optional
        Same as ``land_inflation`` but for building improvements.  Defaults to
        a preset dict with 2 % mean annual inflation and no seasonality.

    Returns
    -------
    SyntheticData
        An object with four public attributes

        ``df_universe`` : geopandas.GeoDataFrame
            One record per parcel with geometry and static attributes
            (distance to CBD, quality/condition scores, etc.).

        ``df_sales`` : pandas.DataFrame
            One record per simulated sale (may be empty).  Includes sale price,
            unit-price metrics, sale date, and vacancy flag.

        ``time_land_mult`` : pandas.DataFrame
            Daily land inflation multipliers (`period`, `value`).

        ``time_bldg_mult`` : pandas.DataFrame
            Daily building inflation multipliers (`period`, `value`).

    Notes
    -----
    * The CBD is assumed to sit at latitude **29.760762° N**, longitude
      **95.361937° W** (roughly downtown Houston, TX).  Parcel coordinates are
      spread +/-0.20° lat / +/-0.25° lon from that center.
    * Land value decreases linearly with Euclidean (grid-based) distance
      from the CBD.
    * Building value per square foot depends on building type
      (“A”, “B”, “C”), quality, condition, and age depreciation
      (linear, capped at 100 years).

    Examples
    --------
    >>> sd = generate_basic(size=25, percent_sales=0.2, seed=42)
    >>> sd.df_universe.head()
    >>> sd.df_sales[['key', 'sale_price', 'sale_date']].sample(5)
    """
    data = {
        "key": [],
        "geometry": [],
        "neighborhood": [],
        "bldg_area_finished_sqft": [],
        "land_area_sqft": [],
        "bldg_type": [],
        "bldg_quality_num": [],
        "bldg_condition_num": [],
        "bldg_age_years": [],
        "land_value": [],
        "bldg_value": [],
        "total_value": [],
        "dist_to_cbd": [],
        "latitude": [],
        "longitude": [],
        "is_vacant": [],
    }

    data_sales = {
        "key": [],
        "key_sale": [],
        "valid_sale": [],
        "valid_for_ratio_study": [],
        "vacant_sale": [],
        "is_vacant": [],
        "sale_price": [],
        "sale_price_per_impr_sqft": [],
        "sale_price_per_land_sqft": [],
        "sale_age_days": [],
        "sale_date": [],
        "sale_year": [],
        "sale_month": [],
        "sale_quarter": [],
        "sale_year_month": [],
        "sale_year_quarter": [],
    }

    latitude_center = 29.760762
    longitude_center = -95.361937

    height = 0.5
    width = 0.4

    nw_lat = latitude_center - width / 2
    nw_lon = longitude_center - height / 2

    base_land_value = 5
    base_bldg_value = 50
    quality_value = 5

    # set a random seed:
    np.random.seed(seed)

    start_date = dt(year=2020, month=1, day=1)
    end_date = dt(year=2024, month=12, day=31)

    days_duration = (end_date - start_date).days

    # default time/bldg inflation parameters:
    if land_inflation is None:
        land_inflation = {
            "start_year": start_date.year,
            "end_year": end_date.year,
            "annual_inflation_rate": 0.1,
            "annual_inflation_rate_stdev": 0.01,
            "seasonality_amplitude": 0.025,
            "monthly_noise": 0.0125,
            "daily_noise": 0.0025,
        }
    if bldg_inflation is None:
        bldg_inflation = {
            "start_year": start_date.year,
            "end_year": end_date.year,
            "annual_inflation_rate": 0.02,
            "annual_inflation_rate_stdev": 0.005,
            "seasonality_amplitude": 0.00,
            "monthly_noise": 0.01,
            "daily_noise": 0.005,
        }

    # generate the time adjustment if so desired, using `land_inflation` as parameters:
    time_land_mult = generate_inflation_curve(**land_inflation)
    time_bldg_mult = generate_inflation_curve(**bldg_inflation)

    df_time_land_mult = pd.DataFrame(
        {"period": _generate_days(start_date, end_date), "value": time_land_mult}
    )
    df_time_bldg_mult = pd.DataFrame(
        {"period": _generate_days(start_date, end_date), "value": time_bldg_mult}
    )
    df_time_land_mult["period"] = pd.to_datetime(df_time_land_mult["period"])
    df_time_bldg_mult["period"] = pd.to_datetime(df_time_bldg_mult["period"])

    for y in range(0, size):
        for x in range(0, size):

            _x = x / size
            _y = y / size

            latitude = nw_lat + (width * _x)
            longitude = nw_lon + (height * _y)

            dist_x = abs(_x - 0.5)
            dist_y = abs(_y - 0.5)
            dist_center = (dist_x**2 + dist_y**2) ** 0.5

            valid_sale = False
            vacant_sale = False
            # roll for a sale:
            if np.random.rand() < percent_sales:
                valid_sale = True

            # base value with linear falloff from center:
            _base_land_value = base_land_value - 1
            land_value_per_land_sqft = 1 + (_base_land_value * (1 - dist_center))

            key = f"{x}-{y}"
            land_area_sqft = np.random.randint(5445, 21780)
            land_value = land_area_sqft * land_value_per_land_sqft

            if np.random.rand() < percent_vacant:
                is_vacant = True
            else:
                is_vacant = False

            if not is_vacant:
                bldg_area_finished_sqft = np.random.randint(1000, 2500)
                bldg_quality_num = np.clip(np.random.normal(3, 1), 0, 6)
                bldg_condition_num = np.clip(np.random.normal(3, 1), 0, 6)
                bldg_age_years = np.clip(np.random.normal(20, 10), 0, 100)

                bldg_type = np.random.choice(["A", "B", "C"])

                bldg_type_mult = 1.0
                if bldg_type == "A":
                    bldg_type_mult = 0.5
                elif bldg_type == "B":
                    bldg_type_mult = 1.0
                elif bldg_type == "C":
                    bldg_type_mult = 2.0
            else:
                # vacant parcels keep their land value; only the improvement is zeroed
                bldg_area_finished_sqft = 0
                bldg_quality_num = 0
                bldg_condition_num = 0
                bldg_age_years = 0
                bldg_type = ""
                bldg_type_mult = 0

            bldg_value_per_sqft = (
                base_bldg_value + (quality_value * bldg_quality_num)
            ) * bldg_type_mult

            # depreciation fractions, clamped to [0, 1]
            depreciation_from_age = min(1.0, bldg_age_years / 100)
            depreciation_from_condition = max(0.0, 1 - (bldg_condition_num / 6))

            total_depreciation = (
                depreciation_from_age + depreciation_from_condition
            ) / 2

            bldg_value_per_sqft = bldg_value_per_sqft * (1 - total_depreciation)
            bldg_value = bldg_area_finished_sqft * bldg_value_per_sqft

            total_value = land_value + bldg_value

            # TODO: properly evolve the city over time with sales in "real time". As written,
            # the sale price does not account for the building being younger at the sale date
            # than it is at the valuation date.

            sale_price = 0
            sale_price_per_land_sqft = 0
            sale_price_per_impr_sqft = 0
            sale_age_days = 0

            sale_date = None
            sale_year = None
            sale_month = None
            sale_quarter = None
            sale_year_month = None
            sale_year_quarter = None

            if valid_sale:
                # account for time inflation:
                sale_age_days = np.random.randint(0, days_duration)
                land_value_per_land_sqft_sale = (
                    land_value_per_land_sqft * time_land_mult[sale_age_days]
                )
                # bldg_value_per_sqft_sale = bldg_value_per_sqft * time_bldg_mult[sale_age_days]
                bldg_value_per_sqft_sale = bldg_value_per_sqft

                # calculate total values:
                land_value_sale = land_area_sqft * land_value_per_land_sqft_sale
                bldg_value_sale = bldg_area_finished_sqft * bldg_value_per_sqft_sale
                total_value_sale = land_value_sale + bldg_value_sale

                # add some noise
                sale_price = total_value_sale * (
                    1 + np.random.uniform(-noise_sales, noise_sales)
                )

                sale_price_per_land_sqft = sale_price / land_area_sqft
                # guard against division by zero for vacant sales (no improvement)
                sale_price_per_impr_sqft = (
                    sale_price / bldg_area_finished_sqft
                    if bldg_area_finished_sqft > 0
                    else 0.0
                )

                sale_date = start_date + pd.DateOffset(days=sale_age_days)
                sale_year = sale_date.year
                sale_month = sale_date.month
                sale_quarter = (sale_month - 1) // 3 + 1
                sale_year_month = f"{sale_year:04}-{sale_month:02}"
                sale_year_quarter = f"{sale_year:04}Q{sale_quarter}"

                vacant_sale = is_vacant

            # each parcel polygon spans one grid cell (study-area extent / size)
            geometry = create_rect(longitude, latitude, height / size, width / size)

            data["key"].append(str(key))
            data["neighborhood"].append("")
            data["bldg_area_finished_sqft"].append(bldg_area_finished_sqft)
            data["land_area_sqft"].append(land_area_sqft)
            data["bldg_quality_num"].append(bldg_quality_num)
            data["bldg_condition_num"].append(bldg_condition_num)
            data["bldg_age_years"].append(bldg_age_years)
            data["bldg_type"].append(bldg_type)
            data["land_value"].append(land_value)
            data["bldg_value"].append(bldg_value)
            data["total_value"].append(total_value)
            data["dist_to_cbd"].append(dist_center)
            data["latitude"].append(latitude)
            data["longitude"].append(longitude)
            data["geometry"].append(geometry)
            data["is_vacant"].append(is_vacant)

            if valid_sale:
                sale_date_YYYY_MM_DD = sale_date.strftime("%Y-%m-%d")
                data_sales["key"].append(str(key))
                data_sales["key_sale"].append(str(key) + "---" + sale_date_YYYY_MM_DD)
                data_sales["valid_sale"].append(valid_sale)
                data_sales["valid_for_ratio_study"].append(valid_sale)
                data_sales["vacant_sale"].append(vacant_sale)
                data_sales["is_vacant"].append(vacant_sale)
                data_sales["sale_price"].append(sale_price)
                data_sales["sale_price_per_impr_sqft"].append(sale_price_per_impr_sqft)
                data_sales["sale_price_per_land_sqft"].append(sale_price_per_land_sqft)
                data_sales["sale_age_days"].append(sale_age_days)
                data_sales["sale_date"].append(sale_date)
                data_sales["sale_year"].append(sale_year)
                data_sales["sale_month"].append(sale_month)
                data_sales["sale_quarter"].append(sale_quarter)
                data_sales["sale_year_month"].append(sale_year_month)
                data_sales["sale_year_quarter"].append(sale_year_quarter)

    df = gpd.GeoDataFrame(data, geometry="geometry")
    df_sales = pd.DataFrame(data_sales)

    # Derive neighborhood:
    distance_quantiles = [0.0, 0.25, 0.75, 1.0]
    distance_bins = [np.quantile(df["dist_to_cbd"], q) for q in distance_quantiles]
    distance_labels = ["urban", "suburban", "rural"]
    df["neighborhood"] = pd.cut(
        df["dist_to_cbd"],
        bins=distance_bins,
        labels=distance_labels,
        include_lowest=True,
    )

    # Derive based on longitude/latitude what (NW, NE, SW, SE) quadrant a parcel is in:
    df["quadrant"] = ""
    df.loc[df["latitude"].ge(latitude_center), "quadrant"] += "s"
    df.loc[df["latitude"].lt(latitude_center), "quadrant"] += "n"
    df.loc[df["longitude"].ge(longitude_center), "quadrant"] += "e"
    df.loc[df["longitude"].lt(longitude_center), "quadrant"] += "w"

    df["neighborhood"] = (
        df["neighborhood"].astype(str) + "_" + df["quadrant"].astype(str)
    )

    sd = SyntheticData(df, df_sales, df_time_land_mult, df_time_bldg_mult)
    return sd
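
A minimal usage sketch (a 10 × 10 grid yields 100 parcels; column names are taken from the source above):

sd = generate_basic(size=10, percent_sales=0.2, percent_vacant=0.1, seed=42)

print(len(sd.df_universe))                      # 100 parcels
print(sd.df_sales["sale_year"].value_counts())  # sales spread across 2020-2024
print(sd.time_land_mult.tail())                 # daily land index (period, value)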

generate_depreciation_curve

generate_depreciation_curve(lifetime=60, weight_linear=0.2, weight_logistic=0.8, steepness=0.3, inflection_point=20)

Generates a depreciation curve that blends straight-line and logistic ("S-curve") methods.

The function returns an array whose i-th element represents the remaining proportion of value after i years.

A weighted average of straight-line depreciation and a logistic decay is used, giving you control over both the shape (via the logistic parameters) and the relative influence of each curve.

Parameters:

lifetime : int, default 60
    Total service life of the asset in years: the point at which the value is considered fully depreciated (zero).
weight_linear : float, default 0.2
    Weight assigned to the straight-line component. Use 1.0 (and set weight_logistic to 0.0) for pure straight-line depreciation.
weight_logistic : float, default 0.8
    Weight assigned to the logistic (sigmoid) component. Use 1.0 (and set weight_linear to 0.0) for a pure logistic curve.
steepness : float, default 0.3
    The logistic steepness parameter k. Higher values make the "drop-off" around the inflection point sharper; lower values make the curve more gradual.
inflection_point : int, default 20
    Year (zero-based index) at which the logistic curve crosses 50% of its starting value.

      • For years earlier than inflection_point the logistic term is > 0.5, so the asset still retains more than half its value.
      • For years later than inflection_point the logistic term is < 0.5, so the asset value declines faster.

    Adjust this to shift the midpoint of rapid depreciation earlier or later in the asset’s life.

Returns:

ndarray
    An array whose i-th element represents the remaining proportion of value after i years.

Source code in openavmkit/synthetic/basic.py
def generate_depreciation_curve(
    lifetime: int = 60,
    weight_linear: float = 0.2,
    weight_logistic: float = 0.8,
    steepness: float = 0.3,
    inflection_point: int = 20,
) -> np.ndarray:
    """Generates a depreciation curve that blends straight-line and logistic ("S-curve")
    methods.

    The function returns an array whose *i*-th element represents the remaining
    proportion of value after *i* years.

    A weighted average of straight-line
    depreciation and a logistic decay is used, giving you control over both the
    shape (via the logistic parameters) and the relative influence of each curve.

    Parameters
    ----------
    lifetime : int
        Total service life of the asset in years: the point at which the value
        is considered fully depreciated (zero).
    weight_linear : float
        Weight assigned to the straight-line component.
        Use 1.0 (and set ``weight_logistic`` to 0.0) for pure straight-line
        depreciation.
    weight_logistic : float
        Weight assigned to the logistic (sigmoid) component.
        Use 1.0 (and set ``weight_linear`` to 0.0) for a pure logistic curve.
    steepness : float
        The logistic steepness parameter *k*.
        Higher values make the "drop-off" around the inflection point sharper;
        lower values make the curve more gradual.
    inflection_point : int
        Year (zero-based index) at which the logistic curve crosses 50 % of its
        starting value.

        - For years **earlier** than ``inflection_point`` the logistic term is > 0.5,
          so the asset still retains more than half its value.
        - For years **later** than ``inflection_point`` the logistic term is < 0.5,
          so the asset value declines faster.

        Adjust this to shift the midpoint of rapid depreciation earlier or later
        in the asset’s life.

    Returns
    -------
    np.ndarray
        An array whose *i*-th element represents the remaining proportion of value after *i* years.
    """

    depreciation = np.zeros(lifetime)

    for i in range(0, lifetime):
        # linear depreciation
        linear = (lifetime - i) / lifetime

        # logistic depreciation
        logistic = 1 / (1 + np.exp(steepness * (i - inflection_point)))

        # combine the two curves
        y_combined = ((weight_linear * linear) + (weight_logistic * logistic)) / (
            weight_linear + weight_logistic
        )

        depreciation[i] = y_combined

    return depreciation
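
A quick sketch with the defaults (lifetime 60, 20% linear / 80% logistic, inflection at year 20); the expected values in the comments follow directly from the formula above:

curve = generate_depreciation_curve()
print(round(curve[0], 3))   # ~0.998: nearly full value when new
print(round(curve[20], 3))  # ~0.533: 0.2 * (40/60) + 0.8 * 0.5 at the inflection year
print(round(curve[59], 3))  # ~0.003: small residual value at end of life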

generate_inflation_curve

generate_inflation_curve(start_year, end_year, annual_inflation_rate=0.02, annual_inflation_rate_stdev=0.01, seasonality_amplitude=0.1, monthly_noise=0.0, daily_noise=0.0)

Generate a pseudo-random daily price index covering one or more calendar years.

The curve is built in three passes:

  1. Annual step – Each year’s inflation factor is drawn from a normal distribution N(annual_inflation_rate, annual_inflation_rate_stdev).
  2. Monthly step – Values are linearly interpolated to month-ends, then modulated by a sinusoidal seasonal component that peaks in late spring and bottoms out in mid-winter. The seasonal deviation is bounded by seasonality_amplitude (e.g. 0.10 => +/- 10% around the baseline), after which optional multiplicative monthly noise is applied.
  3. Daily step – Each month is linearly interpolated to daily resolution, and optional multiplicative daily noise is applied.

Parameters:

start_year : int, required
    First calendar year (January 1) included in the series.
end_year : int, required
    Last calendar year (December 31) included in the series.
annual_inflation_rate : float, default 0.02
    Mean annual inflation rate (e.g. 0.02 => 2%).
annual_inflation_rate_stdev : float, default 0.01
    Standard deviation of the annual inflation rate used in step 1.
seasonality_amplitude : float, default 0.10
    Maximum proportional deviation caused by intra-year seasonality (positive in spring/summer, negative in winter). Expressed as a fraction of the underlying price level.
monthly_noise : float, default 0.0
    Standard deviation of multiplicative noise applied once per month: the month-end multiplier is drawn from N(1.0, monthly_noise).
daily_noise : float, default 0.0
    Standard deviation of multiplicative noise applied once per day: each daily multiplier is drawn from N(1.0, daily_noise).

Returns:

ndarray
    One-dimensional array of length equal to the number of days from start_year-01-01 through end_year-12-31 inclusive. The index is anchored at a baseline of 1.0; elements represent the cumulative price level (≥ 0) after applying inflation, seasonality, and noise.

Source code in openavmkit/synthetic/basic.py
def generate_inflation_curve(
    start_year: int,
    end_year: int,
    annual_inflation_rate: float = 0.02,
    annual_inflation_rate_stdev: float = 0.01,
    seasonality_amplitude: float = 0.10,
    monthly_noise: float = 0.0,
    daily_noise: float = 0.0,
) -> np.ndarray:
    """
    Generate a pseudo-random daily price index covering one or more calendar years.

    The curve is built in three passes:

    1. **Annual step** – Each year’s inflation factor is drawn from a normal
       distribution *N*( ``annual_inflation_rate`` , ``annual_inflation_rate_stdev`` ).
    2. **Monthly step** – Values are linearly interpolated to month-ends, then
       modulated by a sinusoidal seasonal component that peaks in late spring
       and bottoms in mid-winter.  The seasonal deviation is bounded by
       ``seasonality_amplitude`` (e.g. 0.10 => +/- 10 % around the baseline),
       after which optional multiplicative monthly noise is applied.
    3. **Daily step** – Each month is linearly interpolated to daily resolution,
       and optional multiplicative daily noise is applied.

    Parameters
    ----------
    start_year : int
        First calendar year (January 1) included in the series.
    end_year : int
        Last calendar year (December 31) included in the series.
    annual_inflation_rate : float, default 0.02
        Mean annual inflation rate (e.g. 0.02 => 2 %).
    annual_inflation_rate_stdev : float, default 0.01
        Standard deviation of the *annual* inflation rate used in step 1.
    seasonality_amplitude : float, default 0.10
        Maximum proportional deviation caused by intra-year seasonality
        (positive in spring/summer, negative in winter).  Expressed as a
        fraction of the underlying price level.
    monthly_noise : float, default 0.0
        Standard deviation of multiplicative noise applied once per month:
        the month-end multiplier is drawn from *N*(1.0, ``monthly_noise``).
    daily_noise : float, default 0.0
        Standard deviation of multiplicative noise applied once per day:
        each daily multiplier is drawn from *N*(1.0, ``daily_noise``).

    Returns
    -------
    np.ndarray
        One-dimensional array of length equal to the number of days from
        ``start_year``-01-01 through ``end_year``-12-31 inclusive.  The index
        is anchored at a baseline of 1.0; elements represent the cumulative
        price level (≥ 0) after applying inflation, seasonality, and noise.
    """

    start_date = dt(year=start_year, month=1, day=1)
    end_date = dt(year=end_year, month=12, day=31)

    duration_years = (
        end_year - start_year
    ) + 1  # we add + 1 because we end in December of the end year
    duration_months = (duration_years * 12) + 1
    duration_days = (end_date - start_date).days + 1

    # First we generate a series of data points
    # +1 for the beginning value, then one for the end of each year:
    time_mult_years = np.array([1.0] * (duration_years + 1))

    # We increase each point after the first by the annual inflation rate:
    for i in range(1, duration_years + 1):
        curr_year_inflation_rate = np.random.normal(
            annual_inflation_rate, annual_inflation_rate_stdev
        )
        time_mult_years[i] = time_mult_years[i - 1] * (1 + curr_year_inflation_rate)

    # We subdivide each year into months, interpolating between the yearly values:
    # +1 for the beginning value, then one for the end of each month:
    time_mult_months = np.array([1.0] * duration_months)

    # We interpolate between the yearly values:
    # We start at 1.0, then each next value is for the end of that month
    month = 1
    year = 0
    for t in range(1, duration_months):
        curr_mult = time_mult_years[year]
        next_mult = time_mult_years[year + 1]
        time_mult_months[t] = curr_mult + ((next_mult - curr_mult) * (month / 12))
        month += 1
        if month > 12:
            month = 1
            year += 1

    # We prepare an array for seasonality:
    # +1 for the beginning value, then one for the end of each month:
    time_mult_season = np.array([1.0] * duration_months)

    # We add seasonality amplitude:
    # - prices peak in May/June
    # - prices bottom out in December/January
    # - we use a sine wave to model this:
    t_m = 0
    for t in range(0, duration_months):
        # t_n is the normalized time, ranging from 0 to 1
        t_n = t_m / 12
        # 1.4 * pi is the phase shift to peak in May/June
        time_mult_season[t] = 1.0 + (
            (math.sin((1.4 * math.pi) - (2 * math.pi * t_n))) * seasonality_amplitude
        )
        t_m += 1
        if t_m > 12:
            t_m = 1

    # We overlay the seasonality amplitude onto time_mult_months:
    time_mult_months = time_mult_months * time_mult_season

    # We add monthly noise:
    monthly_noise_values = np.random.normal(1.0, monthly_noise, duration_months)
    monthly_noise_values[0] = 1.0

    # We overlay the monthly noise onto time_mult:
    time_mult_months = time_mult_months * monthly_noise_values

    # Then we subdivide each month into days, interpolating between the monthly values:
    time_mult_days = np.array([1.0] * duration_days)

    curr_date = start_date
    curr_month = curr_date.month - 1
    curr_month_len_in_days = (curr_date + pd.DateOffset(months=1) - curr_date).days
    t_month = 0

    day_of_month = 1

    # We iterate over the days, interpolating between the monthly values:
    for t in range(0, duration_days):
        # add a time delta to curr_date of one day:
        t_month_next = t_month + 1
        mult_curr = time_mult_months[t_month]
        mult_next = time_mult_months[t_month_next]
        frac = day_of_month / curr_month_len_in_days
        time_mult_days[t] = mult_curr + (mult_next - mult_curr) * frac

        # add daily noise
        time_mult_days[t] *= np.random.normal(1.0, daily_noise)

        curr_date = curr_date + pd.DateOffset(days=1)
        new_month = curr_date.month - 1
        day_of_month += 1
        if new_month != curr_month:
            day_of_month = 1
            t_month += 1
            curr_month = new_month
            curr_month_len_in_days = (
                curr_date + pd.DateOffset(months=1) - curr_date
            ).days

    return time_mult_days
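
A minimal sketch wiring the curve to a daily index frame; pd.date_range stands in here for the private _generate_days helper that generate_basic uses:

import numpy as np
import pandas as pd

np.random.seed(7)  # the function draws from NumPy's global RNG
curve = generate_inflation_curve(start_year=2020, end_year=2024, annual_inflation_rate=0.05)
days = pd.date_range("2020-01-01", "2024-12-31", freq="D")
df_index = pd.DataFrame({"period": days, "value": curve})
print(len(curve))        # 1827 days (2020 and 2024 are leap years)
print(df_index.tail(3))  # cumulative level after ~5% annual inflation plus seasonality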