Plumbing

This is a general overview of how OpenAVMKit is organized and how data fundamentally flows through it.

(This article is still a work in progress -- more to come soon)

In OpenAVMKit, all the functions you need to run the notebooks are organized in the pipeline module, located at openavmkit/pipeline.py.

openavmkit/
├── other_directories/
├── other_modules.py
└── pipeline.py   # central module containing the public functions

In Python, a module is simply a .py file that groups together related functions, classes, and constants.

By looking into openavmkit/pipeline.py, you'll find the public functions that the notebooks rely on, along with their parameters (the inputs they need to run) and references to where the supporting functionality is defined. The groups below catalog those functions; a short usage sketch for each group follows the list.
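
For example, a notebook can pull everything it needs from this single module. A minimal sketch (the exact set of imports varies by notebook):

    # All of the public functions live in openavmkit/pipeline.py,
    # so a notebook needs only one import site.
    from openavmkit.pipeline import (
        init_notebook,
        load_settings,
        load_dataframes,
        process_data,
    )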

  • Initializing & Syncing the notebooks:
    • init_notebook():
      • Initialize the notebook environment for a specific locality.
    • load_settings():
      • Load and return the settings dictionary for the locality.
    • cloud_sync():
      • Synchronize local files to cloud storage.
    • from_checkpoint():
      • Read cached data from a checkpoint file or generate it via a function.
    • write_checkpoint():
      • Write data to a checkpoint file.
    • delete_checkpoints():
      • Delete all checkpoints that match the given prefix.
    • read_pickle():
      • Read and return data from a pickle file.
    • write_notebook_output_sup():
      • Write notebook output to disk.
  • Data ETL (Extract, Transform & Load):
    • load_dataframes():
      • Load dataframes based on the provided settings and return them in a dictionary.
      • As seen in: Assemble Notebook.
    • process_data():
      • Process raw dataframes according to settings and return a SalesUniversePair.
      • As seen in: Assemble Notebook.
    • process_sales():
      • Process sales data within a SalesUniversePair.
      • As seen in: Assemble Notebook, Clean Notebook.
    • load_and_process_data():
      • Load and process data according to provided settings.
    • load_cleaned_data_for_modeling():
      • Read and return the cleaned data produced by the Clean Notebook (notebook 2).
      • As seen in: Model Notebook.
    • enrich_sup_streets():
      • Enrich a GeoDataFrame with street network data.
      • As seen in: Assemble Notebook.
    • enrich_sup_spatial_lag():
      • Enrich the sales and universe DataFrames with spatial lag features.
      • As seen in: Model Notebook.
    • tag_model_groups_sup():
      • Tag model groups for a SalesUniversePair.
      • As seen in: Assemble Notebook.
    • fill_unknown_values_sup():
      • Fill unknown values with default values as specified in settings.
      • As seen in: Clean Notebook.
    • read_sales_univ():
      • Create a SalesUniversePair from an existing checkpoint.
      • As seen in: Assessment Quality Notebook.
  • Checking that the data is correct:
    • examine_df():
      • Print examination details of the dataframe.
      • As seen in: Assemble Notebook.
    • examine_df_in_ridiculous_detail():
      • Print details of the dataframe, but in RIDICULOUS DETAIL.
      • As seen in: Assemble Notebook.
    • examine_sup():
      • Print examination details of the sales and universe data from a SalesUniversePair.
      • As seen in: Assemble Notebook, Clean Notebook, Model Notebook.
    • examine_sup_in_ridiculous_detail():
      • Print details of the sales and universe data from a SalesUniversePair, but in RIDICULOUS DETAIL.
      • As seen in: Assemble Notebook.
  • Clustering:
    • mark_ss_ids_per_model_group_sup():
      • Cluster parcels for a sales scrutiny study by assigning sales scrutiny IDs.
      • As seen in: Clean Notebook.
    • mark_horizontal_equity_clusters_per_model_group_sup():
      • Cluster parcels for a horizontal equity study by assigning horizontal equity cluster IDs.
      • As seen in: Clean Notebook.
    • run_sales_scrutiny():
      • Run sales scrutiny analysis for each model group within a SalesUniversePair.
      • As seen in: Clean Notebook.
    • run_sales_scrutiny_per_model_group_sup():
      • Run sales scrutiny analysis for each model group within a SalesUniversePair (per-model-group variant of run_sales_scrutiny()).
  • Modeling:
    • write_canonical_splits():
      • Separate the sales DataFrame into training and test sets, and store the keys to disk.
      • As seen in: Model Notebook.
    • try_variables():
      • Run tests on variables to figure out which might be the most predictive.
      • It can also write a PDF report to disk by setting the parameter "do_report" to True; generating the PDF report requires that the wkhtmltopdf library be installed.
      • As seen in: Model Notebook.
    • try_models():
      • Try out predictive models on the given SalesUniversePair. Optimized for speed and iteration: it does not finalize results or write anything to disk.
      • As seen in: Model Notebook.
    • run_models():
      • Run predictive models on the given SalesUniversePair, taking detailed instructions from the provided settings dictionary.
    • finalize_models():
      • Try out predictive models on the given SalesUniversePair, finalize the results, and write them to disk.
      • As seen in: Model Notebook.
  • Evaluating the Assessment Quality:
    • run_and_write_ratio_study_breakdowns():
      • Run ratio study breakdowns and write the results to disk.
      • Generating the PDF report requires that the wkhtmltopdf library be installed.
      • As seen in: Model Notebook.
    • run_ratio_study():
      • Run a ratio study for the designated time period.
      • As seen in: Assessment Quality Notebook.
    • run_horizontal_equity_study():
      • Run a horizontal equity study for each cluster.
      • As seen in: Assessment Quality Notebook.
    • run_vertical_equity_study():
      • Run a vertical equity study for the designated time period.
      • As seen in: Assessment Quality Notebook.
    • plot_prediction_vs_sales():
      • Plot a scatterplot of the predicted values from the assessment versus the actual sale prices.
      • As seen in: Assessment Quality Notebook.
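
To make the catalog above concrete, here is a short usage sketch for each group, in pipeline order. These are illustrative only: the function names come from the list above, but every argument name, argument order, and return value shown is an assumption and may differ from the actual signatures in openavmkit/pipeline.py. First, initializing and syncing:

    from openavmkit.pipeline import (
        init_notebook,
        load_settings,
        load_dataframes,
        from_checkpoint,
        delete_checkpoints,
    )

    # Point the notebook at a specific locality (the slug is illustrative).
    init_notebook("us-nc-guilford")

    # Load the settings dictionary that drives everything downstream.
    settings = load_settings()

    # Read cached data from a checkpoint if one exists; otherwise call the
    # supplied function to generate it (the checkpoint name is illustrative).
    dataframes = from_checkpoint("1-assemble-load", lambda: load_dataframes(settings))

    # Throw away every checkpoint whose name matches the given prefix.
    delete_checkpoints("1-assemble")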
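
Data ETL, continuing with the settings and dataframes from the sketch above (again, a sketch under assumed signatures):

    from openavmkit.pipeline import process_data, process_sales, tag_model_groups_sup

    # Turn the raw dataframes into a SalesUniversePair: the paired sales and
    # universe (all-parcels) datasets the rest of the pipeline operates on.
    sup = process_data(dataframes, settings)

    # Further process the sales side of the pair.
    sup = process_sales(sup, settings)

    # Tag each record with its model group.
    sup = tag_model_groups_sup(sup, settings)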
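
Checking that the data is correct (assuming the examine functions take the object to inspect and print to standard output):

    from openavmkit.pipeline import examine_sup, examine_sup_in_ridiculous_detail

    # Print a summary of both the sales and universe data.
    examine_sup(sup)

    # The same, but far more verbose.
    examine_sup_in_ridiculous_detail(sup)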
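
Clustering, assuming each function returns an updated SalesUniversePair:

    from openavmkit.pipeline import (
        mark_ss_ids_per_model_group_sup,
        mark_horizontal_equity_clusters_per_model_group_sup,
        run_sales_scrutiny,
    )

    # Assign sales scrutiny IDs and horizontal equity cluster IDs,
    # one model group at a time.
    sup = mark_ss_ids_per_model_group_sup(sup, settings)
    sup = mark_horizontal_equity_clusters_per_model_group_sup(sup, settings)

    # Run the sales scrutiny analysis for each model group.
    sup = run_sales_scrutiny(sup, settings)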
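
Modeling. Only the "do_report" parameter is documented above; the other arguments are assumptions:

    from openavmkit.pipeline import (
        write_canonical_splits,
        try_variables,
        try_models,
        finalize_models,
    )

    # Freeze the train/test split so every model run sees the same keys.
    write_canonical_splits(sup, settings)

    # Screen candidate variables; do_report=True also writes a PDF report
    # (requires wkhtmltopdf to be installed).
    try_variables(sup, settings, do_report=True)

    # Fast iteration pass: nothing is finalized or written to disk.
    try_models(sup, settings)

    # Final pass: finalize the results and write them to disk.
    finalize_models(sup, settings)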
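
Finally, evaluating the assessment quality (same caveats as above):

    from openavmkit.pipeline import (
        run_ratio_study,
        run_horizontal_equity_study,
        run_vertical_equity_study,
        plot_prediction_vs_sales,
    )

    # Standard assessment-quality statistics for the designated time period.
    run_ratio_study(sup, settings)

    # Equity studies: one horizontal result per cluster, plus a vertical
    # equity study over the same period.
    run_horizontal_equity_study(sup, settings)
    run_vertical_equity_study(sup, settings)

    # Scatterplot of predicted values against actual sale prices.
    plot_prediction_vs_sales(sup, settings)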