Plumbing

This is a general overview of how OpenAVMKit is organized and how data fundamentally flows through it.

(This article is still a work in progress -- more to come soon)

In OpenAVMKit, all the functions you need to run the notebooks are organized in the pipeline module, located at openavmkit/pipeline.py.

openavmkit/
├── other_directories/
├── other_modules.py
└── pipeline.py   # central module containing the public functions

In Python, a module is simply a .py file that groups together related functions, classes, and constants.

By looking into openavmkit/pipeline.py, you'll find the public functions that the notebooks rely on, along with their parameters (the inputs they need to run) and references to where the supporting functionality is defined. The groups below catalog those functions; a short usage sketch for each group follows the list.
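
For example, a notebook can pull everything it needs from this single module. A minimal sketch (the exact set of imports varies by notebook):

    # All of the public functions live in openavmkit/pipeline.py,
    # so a notebook needs only one import site.
    from openavmkit.pipeline import (
        init_notebook,
        load_settings,
        load_dataframes,
        process_data,
    )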

  • Initializing & Syncing the notebooks:
    • init_notebook():
      • Initialize the notebook environment for a specific locality.
    • load_settings():
      • Load and return the settings dictionary for the locality.
    • cloud_sync():
      • Synchronize local files to cloud storage.
    • from_checkpoint():
      • Read cached data from a checkpoint file or generate it via a function.
    • write_checkpoint():
      • Write data to a checkpoint file.
    • delete_checkpoints():
      • Delete all checkpoints that match the given prefix.
    • read_pickle():
      • Read and return data from a pickle file.
    • write_notebook_output_sup():
      • Write notebook output to disk.
  • Data ETL (Extract, Transform & Load):
    • load_dataframes():
      • Load dataframes based on the provided settings and return them in a dictionary.
      • As seen in: Assemble Notebook.
    • process_data():
      • Process raw dataframes according to settings and return a SalesUniversePair.
      • As seen in: Assemble Notebook.
    • process_sales():
      • Process sales data within a SalesUniversePair.
      • As seen in: Assemble Notebook, Clean Notebook.
    • load_and_process_data():
      • Load and process data according to provided settings.
    • load_cleaned_data_for_modeling():
      • Read and return the cleaned data produced by the Clean Notebook (notebook 2).
      • As seen in: Model Notebook.
    • enrich_sup_streets():
      • Enrich a GeoDataFrame with street network data.
      • As seen in: Assemble Notebook.
    • enrich_sup_spatial_lag():
      • Enrich the sales and universe DataFrames with spatial lag features.
      • As seen in: Model Notebook.
    • tag_model_groups_sup():
      • Tag model groups for a SalesUniversePair.
      • As seen in: Assemble Notebook.
    • fill_unknown_values_sup():
      • Fill unknown values with default values as specified in settings.
      • As seen in: Clean Notebook.
    • read_sales_univ():
      • Create a SalesUniversePair from an existing checkpoint.
      • As seen in: Assessment Quality Notebook.
  • Checking that the data is correct:
    • examine_df():
      • Print examination details of the dataframe.
      • As seen in: Assemble Notebook.
    • examine_df_in_ridiculous_detail():
      • Print details of the dataframe, but in RIDICULOUS DETAIL.
      • As seen in: Assemble Notebook.
    • examine_sup():
      • Print examination details of the sales and universe data from a SalesUniversePair.
      • As seen in: Assemble Notebook, Clean Notebook, Model Notebook.
    • examine_sup_in_ridiculous_detail():
      • Print details of the sales and universe data from a SalesUniversePair, but in RIDICULOUS DETAIL.
      • As seen in: Assemble Notebook.
  • Clustering:
    • mark_ss_ids_per_model_group_sup():
      • Cluster parcels for a sales scrutiny study by assigning sales scrutiny IDs.
      • As seen in: Clean Notebook.
    • mark_horizontal_equity_clusters_per_model_group_sup():
      • Cluster parcels for a horizontal equity study by assigning horizontal equity cluster IDs.
      • As seen in: Clean Notebook.
    • run_sales_scrutiny():
      • Run sales scrutiny analysis for each model group within a SalesUniversePair.
      • As seen in: Clean Notebook.
    • run_sales_scrutiny_per_model_group_sup():
      • Run sales scrutiny analysis for each model group within a SalesUniversePair (per-model-group variant of run_sales_scrutiny()).
  • Modeling:
    • write_canonical_splits():
      • Separate the sales DataFrame into training and test sets, and store the keys to disk.
      • As seen in: Model Notebook.
    • try_variables():
      • Run tests on variables to figure out which might be the most predictive.
      • It can also write a PDF report to disk by setting the parameter "do_report" to True; generating the PDF report requires that the wkhtmltopdf library be installed.
      • As seen in: Model Notebook.
    • try_models():
      • Try out predictive models on the given SalesUniversePair. Optimized for speed and iteration: it does not finalize results or write anything to disk.
      • As seen in: Model Notebook.
    • run_models():
      • Run predictive models on the given SalesUniversePair, taking detailed instructions from the provided settings dictionary.
    • finalize_models():
      • Try out predictive models on the given SalesUniversePair, finalize the results, and write them to disk.
      • As seen in: Model Notebook.
  • Evaluating the Assessment Quality:
    • run_and_write_ratio_study_breakdowns():
      • Run ratio study breakdowns and write the results to disk.
      • Generating the PDF report requires that the wkhtmltopdf library be installed.
      • As seen in: Model Notebook.
    • run_ratio_study():
      • Run a ratio study for the designated time period.
      • As seen in: Assessment Quality Notebook.
    • run_horizontal_equity_study():
      • Run a horizontal equity study for each cluster.
      • As seen in: Assessment Quality Notebook.
    • run_vertical_equity_study():
      • Run a vertical equity study for the designated time period.
      • As seen in: Assessment Quality Notebook.
    • plot_prediction_vs_sales():
      • Plot a scatterplot of the predicted values from the assessment versus the actual sale prices.
      • As seen in: Assessment Quality Notebook.
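
To make the catalog above concrete, here is a short usage sketch for each group, in pipeline order. These are illustrative only: the function names come from the list above, but every argument name, argument order, and return value shown is an assumption and may differ from the actual signatures in openavmkit/pipeline.py. First, initializing and syncing:

    from openavmkit.pipeline import (
        init_notebook,
        load_settings,
        load_dataframes,
        from_checkpoint,
        delete_checkpoints,
    )

    # Point the notebook at a specific locality (the slug is illustrative).
    init_notebook("us-nc-guilford")

    # Load the settings dictionary that drives everything downstream.
    settings = load_settings()

    # Read cached data from a checkpoint if one exists; otherwise call the
    # supplied function to generate it (the checkpoint name is illustrative).
    dataframes = from_checkpoint("1-assemble-load", lambda: load_dataframes(settings))

    # Throw away every checkpoint whose name matches the given prefix.
    delete_checkpoints("1-assemble")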
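
Data ETL, continuing with the settings and dataframes from the sketch above (again, a sketch under assumed signatures):

    from openavmkit.pipeline import process_data, process_sales, tag_model_groups_sup

    # Turn the raw dataframes into a SalesUniversePair: the paired sales and
    # universe (all-parcels) datasets the rest of the pipeline operates on.
    sup = process_data(dataframes, settings)

    # Further process the sales side of the pair.
    sup = process_sales(sup, settings)

    # Tag each record with its model group.
    sup = tag_model_groups_sup(sup, settings)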
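
Checking that the data is correct (assuming the examine functions take the object to inspect and print to standard output):

    from openavmkit.pipeline import examine_sup, examine_sup_in_ridiculous_detail

    # Print a summary of both the sales and universe data.
    examine_sup(sup)

    # The same, but far more verbose.
    examine_sup_in_ridiculous_detail(sup)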
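
Clustering, assuming each function returns an updated SalesUniversePair:

    from openavmkit.pipeline import (
        mark_ss_ids_per_model_group_sup,
        mark_horizontal_equity_clusters_per_model_group_sup,
        run_sales_scrutiny,
    )

    # Assign sales scrutiny IDs and horizontal equity cluster IDs,
    # one model group at a time.
    sup = mark_ss_ids_per_model_group_sup(sup, settings)
    sup = mark_horizontal_equity_clusters_per_model_group_sup(sup, settings)

    # Run the sales scrutiny analysis for each model group.
    sup = run_sales_scrutiny(sup, settings)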
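
Modeling. Only the "do_report" parameter is documented above; the other arguments are assumptions:

    from openavmkit.pipeline import (
        write_canonical_splits,
        try_variables,
        try_models,
        finalize_models,
    )

    # Freeze the train/test split so every model run sees the same keys.
    write_canonical_splits(sup, settings)

    # Screen candidate variables; do_report=True also writes a PDF report
    # (requires wkhtmltopdf to be installed).
    try_variables(sup, settings, do_report=True)

    # Fast iteration pass: nothing is finalized or written to disk.
    try_models(sup, settings)

    # Final pass: finalize the results and write them to disk.
    finalize_models(sup, settings)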
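
Finally, evaluating the assessment quality (same caveats as above):

    from openavmkit.pipeline import (
        run_ratio_study,
        run_horizontal_equity_study,
        run_vertical_equity_study,
        plot_prediction_vs_sales,
    )

    # Standard assessment-quality statistics for the designated time period.
    run_ratio_study(sup, settings)

    # Equity studies: one horizontal result per cluster, plus a vertical
    # equity study over the same period.
    run_horizontal_equity_study(sup, settings)
    run_vertical_equity_study(sup, settings)

    # Scatterplot of predicted values against actual sale prices.
    plot_prediction_vs_sales(sup, settings)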