openavmkit.utilities.clustering
make_clusters
make_clusters(df_in, field_location, fields_categorical, fields_numeric=None, min_cluster_size=15, verbose=False, output_folder='')
Partition a DataFrame into hierarchical clusters based on location, vacancy, categorical, and numeric fields.
Clustering proceeds in phases:
- Location split: if `field_location` is given and present in `df_in`, rows are initially grouped by unique values of that column.
- Vacancy split: if the column `is_vacant` exists, clusters are further subdivided by vacancy status (`True`/`False`).
- Categorical split: for each column in `fields_categorical`, clusters are refined by appending the stringified category value.
- Numeric split: for each entry in `fields_numeric`, attempt to subdivide each cluster on a numeric field (or first available from a list) by calling `_crunch()`. Clusters smaller than `min_cluster_size` are skipped, ensuring no cluster falls below this threshold.
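
The first three phases can be pictured as building one label per row by appending stringified values. The following is a minimal sketch of that idea, not the library's code: plain dicts stand in for DataFrame rows so it runs without dependencies, and the helper name `label_rows`, the `"_"` separator, and the sample field names are all assumptions made for illustration.

```python
def label_rows(rows, field_location, fields_categorical):
    """Build a hierarchical label per row (illustrative stand-in only)."""
    labels = []
    for row in rows:
        parts = []
        # Location split: applies only when the field is given and present.
        if field_location is not None and field_location in row:
            parts.append(str(row[field_location]))
        # Vacancy split: applies only when the row carries an is_vacant flag.
        if "is_vacant" in row:
            parts.append(str(bool(row["is_vacant"])))
        # Categorical split: append each stringified category value in order.
        for field in fields_categorical:
            if field in row:
                parts.append(str(row[field]))
        # The "_" separator is an assumption, not taken from the library.
        labels.append("_".join(parts))
    return labels


# Hypothetical sample data.
rows = [
    {"neighborhood": "north", "is_vacant": False, "bldg_type": "ranch"},
    {"neighborhood": "north", "is_vacant": True, "bldg_type": "ranch"},
    {"neighborhood": "south", "is_vacant": False, "bldg_type": "condo"},
    {"neighborhood": "north", "is_vacant": False, "bldg_type": "ranch"},
]
labels = label_rows(rows, "neighborhood", ["bldg_type"])
```

Rows that share every split value end up with the same label, so the first and last rows here land in the same cluster. With no location field, no `is_vacant` key, and no categorical fields, every row gets the same empty label, which corresponds to all rows starting in one cluster.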
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`df_in` | `DataFrame` | Input data to cluster. Each row will be assigned a final cluster ID. | required |
`field_location` | `str or None` | Column name to use for an initial split. If None or not found, all rows start in one cluster. | required |
`fields_categorical` | `list of str` | Categorical column names for successive splits. Each unique value in these fields refines cluster labels. | required |
`fields_numeric` | `list of str or list of list of str` | Numeric fields (or lists of fallbacks) for recursive clustering. If None, a default set is used. Each entry represents a variable to attempt splitting upon, in order. | `None` |
`min_cluster_size` | `int` | Minimum number of rows required to split a cluster on a numeric field. | `15` |
`verbose` | `bool` | If True, print progress messages at each phase and sub-cluster iteration. | `False` |
`output_folder` | `str` | Path to save any intermediate outputs (currently unused). | `""` |
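
To make the interplay of `fields_numeric` fallbacks and `min_cluster_size` concrete, here is a hedged sketch. The documentation does not say how `_crunch()` chooses its split, so a simple median cut is swapped in purely for illustration. The helper names `resolve_field` and `split_numeric`, the field names, and the extra check that rejects a split producing an undersized half are all assumptions.

```python
from statistics import median


def resolve_field(entry, available):
    """Return the first usable field name for an entry, or None.

    An entry is either a single field name or a list of fallbacks.
    """
    candidates = [entry] if isinstance(entry, str) else list(entry)
    for name in candidates:
        if name in available:
            return name
    return None


def split_numeric(rows, entry, min_cluster_size=15):
    """Try to split one cluster at the median of a numeric field.

    Hypothetical stand-in for the library's numeric split. Returns a list
    of sub-clusters, or [rows] unchanged when no split is made.
    """
    if len(rows) < min_cluster_size:
        return [rows]  # cluster too small: skipped
    field = resolve_field(entry, set(rows[0]))
    if field is None:
        return [rows]  # none of the candidate fields is available
    cut = median(r[field] for r in rows)
    low = [r for r in rows if r[field] <= cut]
    high = [r for r in rows if r[field] > cut]
    # Assumption: refuse a split that would leave an undersized half.
    if len(low) < min_cluster_size or len(high) < min_cluster_size:
        return [rows]
    return [low, high]


# Hypothetical data: 40 rows with values 1..40 in "area_sqft".
rows = [{"area_sqft": v} for v in range(1, 41)]
parts = split_numeric(rows, ["living_area", "area_sqft"], min_cluster_size=15)
```

In this sketch the fallback list resolves to `area_sqft` because `living_area` is absent, and the 40 rows split into two halves of 20. A 10-row cluster passed to the same function would be returned unsplit.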
Returns:
Name | Type | Description |
---|---|---|
cluster_ids | `Series` | Zero-based string IDs for each row’s final cluster. |
fields_used | `list of str` | Names of fields (categorical or numeric) that resulted in at least one split. |
cluster_labels | `Series` | Hierarchical cluster labels encoding the sequence of splits applied to each row. |
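
The difference between `cluster_labels` and `cluster_ids` can be illustrated with a small sketch: each distinct hierarchical label maps to one compact zero-based ID stored as a string. This is only one plausible reading of the table above. The helper `labels_to_ids`, the label format, and the numbering by order of first appearance are assumptions, and plain lists stand in for the returned Series.

```python
def labels_to_ids(labels):
    """Map each distinct label to a zero-based ID, returned as strings."""
    seen = {}
    ids = []
    for label in labels:
        if label not in seen:
            # Assumption: IDs are assigned in order of first appearance.
            seen[label] = len(seen)
        ids.append(str(seen[label]))
    return ids


# Hypothetical labels in an assumed "location_vacancy_category" format.
cluster_labels = [
    "north_False_ranch",
    "north_True_ranch",
    "south_False_condo",
    "north_False_ranch",
]
cluster_ids = labels_to_ids(cluster_labels)
```

Given the Returns table, a call would presumably be unpacked as `cluster_ids, fields_used, cluster_labels = make_clusters(df, "neighborhood", ["bldg_type"])`, where the column names are placeholders.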
Source code in openavmkit/utilities/clustering.py