openavmkit.utilities.clustering
make_clusters
make_clusters(df_in, field_location, fields_categorical, fields_numeric=None, split_on_vacant=True, min_cluster_size=15, unit='sqft', verbose=False, output_folder='', t=None)
Partition a DataFrame into hierarchical clusters based on location, vacancy, categorical, and numeric fields.
Clustering proceeds in phases:
- Location split: if `field_location` is given and present in `df_in`, rows are initially grouped by unique values of that column.
- Vacancy split: if `split_on_vacant` is True and the column `is_vacant` exists, clusters are further subdivided by vacancy status (True/False).
- Categorical split: for each column in `fields_categorical`, clusters are refined by appending the stringified category value.
- Numeric split: for each entry in `fields_numeric`, attempt to subdivide each cluster on a numeric field (or the first available field from a list) by calling `_crunch()`. Clusters smaller than `min_cluster_size` are skipped, ensuring no cluster falls below this threshold.
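The first three phases can be pictured as building one label per row by appending a piece at each split. The sketch below is illustrative only and is not the library's code: it works on plain dictionaries instead of a DataFrame to stay dependency-free, and the helper name `build_labels`, the `"|"` separator, the `"all"` placeholder, and the sample field names are all assumptions.

```python
from collections import Counter


def build_labels(rows, field_location, fields_categorical, split_on_vacant=True):
    """Return one hierarchical label per row (hypothetical helper)."""
    labels = []
    for row in rows:
        # Phase 1: location split, only if the field is given and present.
        if field_location is not None and field_location in row:
            parts = [str(row[field_location])]
        else:
            parts = ["all"]  # assumed placeholder: every row starts in one cluster
        # Phase 2: vacancy split, only if requested and the column exists.
        if split_on_vacant and "is_vacant" in row:
            parts.append(str(bool(row["is_vacant"])))
        # Phase 3: append the stringified value of each categorical field.
        for field in fields_categorical:
            if field in row:
                parts.append(str(row[field]))
        labels.append("|".join(parts))
    return labels


# Sample records with made-up field names.
rows = [
    {"neighborhood": "north", "is_vacant": False, "bldg_type": "single_family"},
    {"neighborhood": "north", "is_vacant": True, "bldg_type": "none"},
    {"neighborhood": "south", "is_vacant": False, "bldg_type": "duplex"},
    {"neighborhood": "south", "is_vacant": False, "bldg_type": "duplex"},
]
labels = build_labels(rows, "neighborhood", ["bldg_type"])
cluster_sizes = Counter(labels)
```

Rows that share every split value end up with the same label, so counting labels gives the cluster sizes that the numeric phase would then compare against `min_cluster_size`.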
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df_in` | `DataFrame` | Input data to cluster. Each row will be assigned a final cluster ID. | required |
| `field_location` | `str` or `None` | Column name to use for an initial split. If None or not found, all rows start in one cluster. | required |
| `fields_categorical` | `list of str` | Categorical column names for successive splits. Each unique value in these fields refines cluster labels. | required |
| `fields_numeric` | `list of str` or `list of list of str` | Numeric fields (or lists of fallbacks) for recursive clustering. If None, a default set is used. Each entry represents a variable to attempt splitting upon, in order. | `None` |
| `split_on_vacant` | `bool` | Whether to split clusters by vacancy status. | `True` |
| `min_cluster_size` | `int` | Minimum number of rows required to split a cluster on a numeric field. | `15` |
| `unit` | `str` | Unit used for area: "sqft" or "sqm". | `"sqft"` |
| `verbose` | `bool` | If True, print progress messages at each phase and sub-cluster iteration. | `False` |
| `output_folder` | `str` | Path to save any intermediate outputs (currently unused). | `""` |
| `t` | `TimingData` | TimingData object to record performance metrics. | `None` |
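To make the `fields_numeric` and `min_cluster_size` descriptions concrete, here is a rough sketch of how a fallback list might be resolved and how a size guard could block a split. It is not the library's `_crunch()`: a simple median split stands in for whatever that function does, the helpers `resolve_field` and `split_numeric` are hypothetical, rejecting splits whose halves fall under the threshold is an assumed reading of the guard, and the field names are invented.

```python
from statistics import median


def resolve_field(entry, available):
    """Return the first name in a fields_numeric entry that is available, else None."""
    candidates = [entry] if isinstance(entry, str) else list(entry)
    for name in candidates:
        if name in available:
            return name
    return None


def split_numeric(rows, entry, min_cluster_size=15):
    """Split rows into low/high groups on one numeric field (median split is a stand-in)."""
    if not rows:
        return [rows]
    field = resolve_field(entry, rows[0].keys())
    # Skip when no candidate field exists or the cluster is already too small.
    if field is None or len(rows) < min_cluster_size:
        return [rows]
    cut = median(r[field] for r in rows)
    low = [r for r in rows if r[field] <= cut]
    high = [r for r in rows if r[field] > cut]
    # Assumed guard: reject a split that leaves either side under the threshold.
    if len(low) < min_cluster_size or len(high) < min_cluster_size:
        return [rows]
    return [low, high]


# 40 sample parcels with a made-up land area column.
rows = [{"land_area_sqft": 1000 + 100 * i} for i in range(40)]
# The first candidate is missing, so the second one is used.
parts = split_numeric(rows, ["bldg_area_sqft", "land_area_sqft"], min_cluster_size=15)
# A 10-row cluster is below the threshold and is returned unsplit.
small = split_numeric(rows[:10], "land_area_sqft", min_cluster_size=15)
```

Passing a plain string and passing a one-element list behave the same way here, which mirrors the "field or list of fallbacks" wording in the table.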
Returns:
| Name | Type | Description |
|---|---|---|
| `cluster_ids` | `Series` | Zero-based string IDs for each row’s final cluster. |
| `fields_used` | `list of str` | Names of fields (categorical or numeric) that resulted in at least one split. |
| `cluster_labels` | `Series` | Hierarchical cluster labels encoding the sequence of splits applied to each row. |
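The relationship between the hierarchical labels and the zero-based string IDs can be illustrated with a small mapping. This is only a sketch under assumptions: the helper `labels_to_ids` is hypothetical, the `"|"`-separated sample labels are invented, and numbering labels in order of first appearance is a guess, since the ordering the library actually uses is not stated here.

```python
def labels_to_ids(labels):
    """Map each distinct label to "0", "1", ... in order of first appearance."""
    mapping = {}
    ids = []
    for label in labels:
        if label not in mapping:
            # The next unused integer, stored as a string to match "string IDs".
            mapping[label] = str(len(mapping))
        ids.append(mapping[label])
    return ids, mapping


# Made-up hierarchical labels: location, then vacancy status.
cluster_labels = ["north|False", "north|True", "south|False", "north|False"]
cluster_ids, mapping = labels_to_ids(cluster_labels)
```

Rows with identical labels share an ID, so the IDs are a compact key while the labels keep the readable history of splits.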
Source code in openavmkit/utilities/clustering.py