Preprocessing Reference
philanthropy.preprocessing
CRM data cleaning, fiscal-year-aware feature engineering, and clinical-encounter feature engineering for medical philanthropy.
FiscalYearTransformer
Bases: TransformerMixin, BaseEstimator
Append the organisation-specific fiscal year and quarter to dates.
Source code in philanthropy/preprocessing/transformers.py
CRMCleaner
Bases: TransformerMixin, BaseEstimator
Standardise raw CRM exports.
CRMCleaner performs lightweight, defensive cleaning of CRM datasets
exported from systems such as Salesforce NPSP, Raiser's Edge NXT, or
Ellucian Advance. It is designed to be chained in a sklearn.pipeline.Pipeline
along with WealthScreeningImputer to handle missing wealth values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| date_col | str | Column containing ISO-8601 gift dates. Parsed to | "gift_date" |
| amount_col | str | Column containing raw gift amounts. Forced to | "gift_amount" |
| fiscal_year_start | int | Month (1–12) that begins the organisation's fiscal year. | 7 |
Attributes:
| Name | Type | Description |
|---|---|---|
| feature_names_in_ | list of str | Column names of |
| n_features_in_ | int | Number of columns in |
Source code in philanthropy/preprocessing/transformers.py
WealthScreeningImputer
Bases: TransformerMixin, BaseEstimator
Leakage-safe median/constant imputation for wealth-screening columns.
This transformer learns fill statistics only from the training fold
during fit and applies them in transform. It is designed
to slot cleanly into a sklearn.pipeline.Pipeline immediately after
philanthropy.preprocessing.CRMCleaner and before any model that
cannot natively handle NaN values.
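The fit/transform split described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the library's implementation: the helper names `fit_fill_values` and `apply_fill_values` are hypothetical, the data is made up, and only the median strategy is shown.

```python
import numpy as np


def fit_fill_values(X_train):
    """Learn one median per column from the training fold only (NaNs ignored)."""
    return np.nanmedian(X_train, axis=0)


def apply_fill_values(X, fill_values):
    """Fill NaNs with the frozen training medians and return missingness flags."""
    was_missing = np.isnan(X)
    X_filled = np.where(was_missing, fill_values, X)
    return X_filled, was_missing.astype(float)


# Hypothetical single wealth-screening column, split into two folds.
X_train = np.array([[1e6], [np.nan], [5e5], [2e6]])
X_test = np.array([[np.nan], [3e6]])

fill = fit_fill_values(X_train)  # statistics come from X_train only
X_test_filled, test_flags = apply_fill_values(X_test, fill)
```

Because the held-out fold never contributes to `fill`, the imputed values cannot leak information from the rows being scored.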
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| wealth_cols | list of str or None | Column names containing third-party wealth-screening numeric values. If | None |
| strategy | ('median', 'mean', 'zero') | Imputation strategy applied to each wealth column: | "median" |
| add_indicator | bool | If | True |
| fiscal_year_start | int | Month (1–12) starting the organisation's fiscal year. Inherited for pipeline compatibility. | 7 |
Attributes:
| Name | Type | Description |
|---|---|---|
| fill_values_ | dict of {str: float} | Mapping from column name to the computed fill value, frozen at |
| imputed_cols_ | list of str | Wealth columns that were actually present in |
| n_features_in_ | int | Number of columns in |
| feature_names_in_ | ndarray of str | Column names of |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from philanthropy.preprocessing import WealthScreeningImputer
>>> X = pd.DataFrame({
... "estimated_net_worth": [1e6, np.nan, 5e5, np.nan, 2e6],
... "real_estate_value": [np.nan, 3e5, np.nan, 4e5, np.nan],
... "gift_amount": [5000, 250, 1000, 750, 10000],
... })
>>> imp = WealthScreeningImputer(
... wealth_cols=["estimated_net_worth", "real_estate_value"],
... strategy="median",
... add_indicator=True,
... )
>>> imp.set_output(transform="pandas")
WealthScreeningImputer(...)
>>> X_out = imp.fit_transform(X)
>>> bool(X_out["estimated_net_worth"].isna().any())
False
>>> "estimated_net_worth__was_missing" in X_out.columns
True
See Also
philanthropy.preprocessing.CRMCleaner : Upstream cleaner that standardises column dtypes before this imputer.
philanthropy.models.ShareOfWalletRegressor : Downstream model that uses wealth-screening features to estimate philanthropic capacity.
Source code in philanthropy/preprocessing/_wealth.py
fit(X, y=None)
Learn fill statistics from training data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Training-set feature matrix. Missing wealth columns are silently skipped (a | required |
| y | ignored | Present for scikit-learn API compatibility. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| self | WealthScreeningImputer | Fitted imputer. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
Source code in philanthropy/preprocessing/_wealth.py
transform(X)
Apply imputation with frozen fill values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Feature matrix (training or held-out). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_out | ndarray | Copy of |
Raises:
| Type | Description |
|---|---|
| NotFittedError | If |
Source code in philanthropy/preprocessing/_wealth.py
EncounterTransformer
Bases: TransformerMixin, BaseEstimator
Merge clinical encounter history into philanthropic feature matrices.
Given a lookup encounter_df containing at least one discharge date per
donor, this transformer enriches a gift-level DataFrame with two continuous
temporal features:
- days_since_last_discharge: Integer number of days between the donor's most recent discharge date (observed at fit time) and the gift_date in X. Negative values indicate gifts made before discharge; because pre-admission solicitations are uncommon, they are set to NaN unless allow_negative_days=True.
- encounter_frequency_score: Log-scaled count of distinct encounter records for the donor. Because the distribution of encounter counts is highly right-skewed in real AMC data, the log transform normalises the feature for downstream linear models. Donors with zero encounters receive a score of 0.0.
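The two features above can be sketched with the standard library alone. The function below is hypothetical and makes assumptions the text does not confirm: `log1p` is used as the log scaling (it maps a count of zero to 0.0), and donors with no encounter history get NaN for the day count.

```python
import math
from datetime import date

# Hypothetical encounter history: donor_id -> discharge dates.
encounters = {
    1: [date(2022, 1, 1), date(2023, 6, 15)],
    2: [date(2022, 9, 30)],
}


def encounter_features(donor_id, gift_date, allow_negative_days=False):
    """Return (days_since_last_discharge, encounter_frequency_score)."""
    discharges = encounters.get(donor_id, [])
    if not discharges:
        # Assumption: no history means no recency value and a score of 0.0.
        return math.nan, 0.0
    days = (gift_date - max(discharges)).days
    if days < 0 and not allow_negative_days:
        days = math.nan  # gift precedes the most recent discharge
    # Assumption: log1p is the log scaling applied to the encounter count.
    return days, math.log1p(len(discharges))
```

Calling `encounter_features(1, date(2023, 8, 1))` measures the gap from the 2023-06-15 discharge, since only the most recent discharge is used.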
All identifier columns (merge_key and any column whose name contains
common PII substrings: "id", "mrn", "ssn", "name",
"dob", "zip") are silently dropped from the output DataFrame before
it is returned, preventing accidental downstream leakage.
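A substring heuristic of this kind might look like the sketch below. The function name, the case-insensitive matching, and the example column names are assumptions made for illustration; the library's actual matching rules may differ.

```python
PII_SUBSTRINGS = ("id", "mrn", "ssn", "name", "dob", "zip")


def split_identifier_columns(columns, merge_key="donor_id", extra=()):
    """Partition column names into (kept, dropped) using substring matching."""
    kept, dropped = [], []
    for col in columns:
        lowered = col.lower()  # assumption: matching ignores case
        is_identifier = (
            col == merge_key
            or col in extra
            or any(token in lowered for token in PII_SUBSTRINGS)
        )
        (dropped if is_identifier else kept).append(col)
    return kept, dropped


kept, dropped = split_identifier_columns(
    ["donor_id", "gift_date", "gift_amount", "patient_name", "paid_in_full"]
)
```

Under this assumed logic, plain substring matching is broad: `paid_in_full` is dropped because it contains "id". Checking the list of dropped columns after fitting is a reasonable precaution.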
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| encounter_df | DataFrame | Reference table of clinical encounters. Must contain | None |
| discharge_col | str | Column in | "discharge_date" |
| gift_date_col | str | Column in | "gift_date" |
| merge_key | str | Column name present in both | "donor_id" |
| allow_negative_days | bool | If | False |
| id_cols_to_drop | list of str or None | Additional column names to explicitly drop on output, beyond those detected via the PII heuristic. Useful when non-standard identifiers (e.g., | None |
| fiscal_year_start | int | Month (1–12) that begins the organisation's fiscal year. | 7 |
Attributes:
| Name | Type | Description |
|---|---|---|
| encounter_summary_ | DataFrame | Per-donor summary table (indexed by |
| dropped_cols_ | list of str | Names of the columns that were removed from |
| n_features_in_ | int | Number of columns seen in |
| feature_names_in_ | ndarray of str | Column names of |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| ValueError | If |
Examples:
>>> import pandas as pd
>>> from philanthropy.preprocessing import EncounterTransformer
>>> enc = pd.DataFrame({
... "donor_id": [1, 1, 2],
... "discharge_date": ["2022-01-01", "2023-06-15", "2022-09-30"],
... })
>>> gifts = pd.DataFrame({
... "donor_id": [1, 2, 3],
... "gift_date": ["2023-08-01", "2023-01-01", "2023-05-01"],
... "gift_amount": [10000.0, 750.0, 250.0],
... })
>>> t = EncounterTransformer(encounter_df=enc, merge_key="donor_id")
>>> t.set_output(transform="pandas")
EncounterTransformer(...)
>>> out = t.fit_transform(gifts)
>>> "donor_id" not in out.columns
True
>>> "days_since_last_discharge" in out.columns
True
>>> "encounter_frequency_score" in out.columns
True
Source code in philanthropy/preprocessing/_encounters.py
fit(X, y=None)
Compute per-donor encounter summaries from encounter_df.
The fitted artefact encounter_summary_ is a lightweight per-donor
lookup containing the most-recent discharge date and total encounter
count. No information from X flows into this summary, which
prevents temporal data leakage when the transformer is placed before
a time-based train/test split inside a pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | DataFrame | Gift-level DataFrame. Used only to infer | required |
| y | ignored | Present for scikit-learn API compatibility. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| self | EncounterTransformer | Fitted transformer instance. |
Raises:
| Type | Description |
|---|---|
| ValueError | If required columns are missing from |
Source code in philanthropy/preprocessing/_encounters.py
transform(X)
Append encounter features and strip identifying columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | DataFrame | Gift-level DataFrame. Must contain | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_out | ndarray | Enriched array with two new columns: All identifier-like columns (including |
Raises:
| Type | Description |
|---|---|
| NotFittedError | If |
| ValueError | If |
Source code in philanthropy/preprocessing/_encounters.py
RFMTransformer
Bases: TransformerMixin, BaseEstimator
Transforms transaction logs into Recency, Frequency, and Monetary (RFM) features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| reference_date | str or datetime-like | The date used as the reference point to calculate recency. If None, the maximum gift_date in the dataframe is used. | None |
| agg_func | str or callable | The aggregation function to calculate the monetary value. Typical values are 'sum' (cumulative) or 'mean' (average). | 'sum' |
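As a rough illustration of the parameters above, the standard-library sketch below aggregates a transaction log into per-donor RFM values. The function `rfm`, the tuple layout of the log, and the per-donor grouping are assumptions; here `agg_func` is a plain callable rather than a string such as 'sum'.

```python
from collections import defaultdict
from datetime import date


def rfm(transactions, reference_date=None, agg_func=sum):
    """Aggregate (donor_id, gift_date, amount) rows into per-donor RFM tuples."""
    if reference_date is None:
        # Default described above: the latest gift date in the data.
        reference_date = max(gift_date for _, gift_date, _ in transactions)
    by_donor = defaultdict(list)
    for donor_id, gift_date, amount in transactions:
        by_donor[donor_id].append((gift_date, amount))
    features = {}
    for donor_id, gifts in by_donor.items():
        last_gift = max(d for d, _ in gifts)
        recency = (reference_date - last_gift).days  # days since last gift
        frequency = len(gifts)                       # number of gifts
        monetary = agg_func([a for _, a in gifts])   # aggregated amount
        features[donor_id] = (recency, frequency, monetary)
    return features


log = [
    ("A", date(2023, 1, 10), 100.0),
    ("A", date(2023, 3, 1), 50.0),
    ("B", date(2023, 2, 1), 500.0),
]
features = rfm(log)
```

Passing an explicit `reference_date` keeps recency comparable across datasets extracted on different days; leaving it as None ties recency to whatever the latest gift in the data happens to be.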
Source code in philanthropy/preprocessing/_rfm.py
fit(X, y=None)
Fits the transformer. This simply validates the input and returns self.
Source code in philanthropy/preprocessing/_rfm.py
transform(X)
Transforms the transaction logs into RFM features.
Source code in philanthropy/preprocessing/_rfm.py
PlannedGivingSignalTransformer
Bases: TransformerMixin, BaseEstimator
Extract features for bequest / planned-giving intent classification.
Planned giving (bequests, charitable remainder trusts) requires a separate predictive model from major gifts. Key drivers are donor age ≥ 65, giving tenure ≥ 10 years, and a wealth-screening vendor "charitable inclination" score. This transformer extracts a four-column feature vector optimised for bequest/legacy gift intent classifiers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| age_col | str | Column containing donor age in years. | "donor_age" |
| tenure_col | str | Column containing number of years the donor has been active. | "years_active" |
| planned_gift_inclination_col | str | Column containing the wealth-screening vendor's charitable inclination score, expected to be in [0, 1]. Missing values are replaced with a sentinel value (-1.0) to distinguish "vendor data absent" from a genuine 0 score. | "planned_gift_inclination" |
| age_threshold | int | Minimum age (inclusive) for the is_legacy_age flag. | 65 |
| tenure_threshold_years | int | Minimum years active (inclusive) for the is_loyal_donor flag. | 10 |
Attributes:
| Name | Type | Description |
|---|---|---|
| n_features_in_ | int | Number of input features seen at fit time. |
| feature_names_in_ | ndarray of str | Column names of X at fit time (set when X is a DataFrame). |
Notes
Output columns
| Col | Name | Description |
|---|---|---|
| 0 | is_legacy_age | uint8: 1 if age >= age_threshold, else 0. NaN age → 0. |
| 1 | is_loyal_donor | uint8: 1 if tenure >= tenure_threshold_years, else 0. NaN tenure → 0. |
| 2 | inclination_score | float64: raw planned_gift_inclination value, clipped to [0, 1]. Missing → -1.0 sentinel (distinguishable from a genuine 0 score). |
| 3 | composite_score | float64: is_legacy_age + is_loyal_donor + max(inclination_score, 0). Range [0.0, 3.0]. |
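The column definitions in the table above translate fairly directly into NumPy. The sketch below is an illustration only: the function `planned_giving_features` is hypothetical, it takes three 1-D arrays rather than a DataFrame, and it returns every column as float64.

```python
import numpy as np


def planned_giving_features(age, tenure, inclination,
                            age_threshold=65, tenure_threshold_years=10):
    """Build the four described columns from three 1-D array-likes."""
    age = np.asarray(age, dtype=float)
    tenure = np.asarray(tenure, dtype=float)
    inclination = np.asarray(inclination, dtype=float)

    # Comparisons with NaN evaluate to False, so missing values map to 0.
    is_legacy_age = (age >= age_threshold).astype(float)
    is_loyal_donor = (tenure >= tenure_threshold_years).astype(float)

    # Clip to [0, 1], then mark missing scores with the -1.0 sentinel.
    score = np.where(np.isnan(inclination), -1.0, np.clip(inclination, 0.0, 1.0))

    # max(score, 0) stops the sentinel from lowering the composite.
    composite = is_legacy_age + is_loyal_donor + np.maximum(score, 0.0)
    return np.column_stack([is_legacy_age, is_loyal_donor, score, composite])


out = planned_giving_features(
    age=[70, 60, np.nan],
    tenure=[15, 5, 12],
    inclination=[0.8, 0.3, np.nan],
)
```

The third row shows why the sentinel matters: the inclination column carries -1.0 to signal absent vendor data, while the composite still counts only the loyalty flag.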
Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from philanthropy.preprocessing import PlannedGivingSignalTransformer
>>> X = pd.DataFrame({
... "donor_age": [70, 60, None],
... "years_active": [15, 5, 12],
... "planned_gift_inclination": [0.8, 0.3, None],
... })
>>> t = PlannedGivingSignalTransformer()
>>> out = t.fit_transform(X)
>>> out.shape
(3, 4)
Source code in philanthropy/preprocessing/_planned_giving.py
fit(X, y=None)
Validate input schema and record n_features_in_.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Donor-level feature matrix. | required |
| y | ignored | | None |
Returns:
| Name | Type | Description |
|---|---|---|
| self | PlannedGivingSignalTransformer | |
Source code in philanthropy/preprocessing/_planned_giving.py
transform(X, y=None)
Compute the 4-column planned-giving feature vector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Donor-level feature matrix. Accepts pd.DataFrame (columns may or may not exist; missing columns are handled gracefully with NaN / 0). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_out | np.ndarray of shape (n_samples, 4), dtype float64 | Columns: [is_legacy_age, is_loyal_donor, inclination_score, composite_score]. |
Raises:
| Type | Description |
|---|---|
| NotFittedError | If |
Source code in philanthropy/preprocessing/_planned_giving.py
GratefulPatientFeaturizer
Bases: TransformerMixin, BaseEstimator
Featurize clinical signals from grateful-patient encounter data.
This transformer bridges EHR service-line and treating-physician data with the advancement CRM to produce clinical-depth features for major gift propensity models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| encounter_df | DataFrame or None | Reference table of clinical encounters. Must contain | None |
| encounter_path | str or None | Path to a Parquet or CSV file containing clinical encounters. Alternative to | None |
| service_line_col | str | Column in the encounter table holding service line / department name. | "service_line" |
| physician_col | str | Column in the encounter table holding the attending physician ID. | "attending_physician_id" |
| drg_weight_col | str or None | Optional column holding DRG (Diagnosis Related Group) relative weights. If present, total DRG weight per donor is computed. | None |
| use_capacity_weights | bool | If True, apply AMC-benchmarked service-line capacity weights to scale the clinical gravity score. | True |
| merge_key | str | Column name present in both the encounter table and | "donor_id" |
| discharge_col | str | Column in the encounter table holding discharge dates. | "discharge_date" |
Attributes:
| Name | Type | Description |
|---|---|---|
| encounter_summary_ | DataFrame | Per-donor aggregated encounter features, indexed by |
| n_features_in_ | int | Number of features seen at fit time (set by |
| feature_names_in_ | ndarray of str | Column names of |
Raises:
| Type | Description |
|---|---|
| ValueError | If neither |
Notes
The four output columns are:
| Column | Description |
|---|---|
| clinical_gravity_score | Encounter count × service-line capacity weight. |
| distinct_service_lines | Number of unique service lines. |
| distinct_physicians | Number of unique attending physicians. |
| total_drg_weight | Sum of DRG relative weights (NaN if unavailable). |
Donors absent from the encounter table receive zeros for all columns.
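To make the first three columns concrete, here is a small standard-library sketch. It rests on assumptions: the weight values are invented for illustration (the library's benchmarked weights are not listed here), each encounter is assumed to contribute its own service line's weight to the gravity score, and total_drg_weight is left out.

```python
# Hypothetical capacity weights, for illustration only.
CAPACITY_WEIGHTS = {"cardiac": 1.5, "oncology": 2.0}
DEFAULT_WEIGHT = 1.0

encounters = [
    # (donor_id, service_line, attending_physician_id)
    (1, "cardiac", "P1"),
    (1, "cardiac", "P2"),
    (2, "oncology", "P3"),
]


def summarise(donor_id, use_capacity_weights=True):
    """Return (gravity, n_service_lines, n_physicians) for one donor."""
    rows = [r for r in encounters if r[0] == donor_id]
    # Assumption: each encounter adds the weight of its service line.
    gravity = sum(
        CAPACITY_WEIGHTS.get(line, DEFAULT_WEIGHT) if use_capacity_weights else 1.0
        for _, line, _ in rows
    )
    return (
        float(gravity),
        len({line for _, line, _ in rows}),
        len({physician for _, _, physician in rows}),
    )
```

A donor with no rows in the encounter list falls through to zeros for all three values, matching the behaviour described for donors absent from the encounter table.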
Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from philanthropy.preprocessing import GratefulPatientFeaturizer
>>> enc = pd.DataFrame({
... "donor_id": [1, 1, 2],
... "discharge_date": ["2022-01-01", "2023-06-15", "2022-09-30"],
... "service_line": ["cardiac", "cardiac", "oncology"],
... "attending_physician_id": ["P1", "P2", "P3"],
... })
>>> X = pd.DataFrame({"donor_id": [1, 2, 3]})
>>> gpf = GratefulPatientFeaturizer(encounter_df=enc)
>>> gpf.fit(X)
GratefulPatientFeaturizer(...)
>>> out = gpf.transform(X)
>>> out.shape
(3, 4)
Source code in philanthropy/preprocessing/_grateful_patient.py
fit(X, y=None)
Build per-donor encounter summaries from encounter data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Donor-level feature matrix. Used only for schema registration via | required |
| y | ignored | | None |
Returns:
| Name | Type | Description |
|---|---|---|
| self | GratefulPatientFeaturizer | |
Raises:
| Type | Description |
|---|---|
| ValueError | If neither |
Source code in philanthropy/preprocessing/_grateful_patient.py
transform(X, y=None)
Merge clinical features into the donor feature matrix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Donor-level feature matrix. Must contain | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_out | np.ndarray of shape (n_samples, 4), dtype float64 | Columns in order: |
Source code in philanthropy/preprocessing/_grateful_patient.py
DischargeToSolicitationWindowTransformer
Bases: TransformerMixin, BaseEstimator
Flag donors in the clinical fundraising post-discharge solicitation window.
This transformer outputs two features:
- in_solicitation_window (col 0): 1 if within window, 0 otherwise.
- window_position_score (col 1): proximity to midpoint [0.0, 1.0].
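The reference text does not give the exact formula for window_position_score, so the sketch below assumes a linear decay from 1.0 at the window midpoint to 0.0 at either edge, and assumes that missing or out-of-window values score 0. The function name and its argument names are hypothetical.

```python
import math


def solicitation_window(days, min_days=90, max_days=365):
    """Return (in_solicitation_window, window_position_score) for one donor."""
    if days is None or math.isnan(days) or not (min_days <= days <= max_days):
        return 0, 0.0  # assumption: missing or out-of-window values score 0
    half_width = (max_days - min_days) / 2
    if half_width == 0:
        return 1, 1.0  # degenerate one-day window: the only valid day is the midpoint
    midpoint = (min_days + max_days) / 2
    # Assumption: linear decay from 1.0 at the midpoint to 0.0 at the edges.
    return 1, 1.0 - abs(days - midpoint) / half_width
```

With the default bounds the midpoint is day 227.5, so a donor 90 days post-discharge is inside the window but scores 0.0 under this assumed formula.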
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| min_days_post_discharge | int | Start of the solicitation window, in days post-discharge (inclusive). | 90 |
| max_days_post_discharge | int | End of the solicitation window, in days post-discharge (inclusive). | 365 |
| days_since_discharge_col | str | Column name containing days since last discharge. | "days_since_last_discharge" |
Source code in philanthropy/preprocessing/_discharge_window.py
fit(X, y=None)
Fit the transformer (no-op, validates parameters).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Training data. | required |
| y | Ignored | Not used, present for API consistency. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| self | DischargeToSolicitationWindowTransformer | |
Source code in philanthropy/preprocessing/_discharge_window.py
transform(X, y=None)
Transform X to two columns: in_window, window_position_score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like or DataFrame of shape (n_samples, n_features) | Data with days_since_discharge column or first column as days. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| out | ndarray of shape (n_samples, 2) | Columns: in_window (0/1), window_position_score [0,1]. |
Source code in philanthropy/preprocessing/_discharge_window.py
get_feature_names_out(input_features=None)
Get output feature names.
WealthPercentileTransformer
Bases: TransformerMixin, BaseEstimator
Computes wealth percentile ranks.
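This entry does not say how the ranks are defined, so the following is only one plausible reading: the percentile rank of a value is the share of training values less than or equal to it, on a 0 to 100 scale. The function name and the sample figures are hypothetical.

```python
from bisect import bisect_right


def percentile_rank(train_values, x):
    """Share of training values less than or equal to x, on a 0-100 scale."""
    ordered = sorted(train_values)
    return 100.0 * bisect_right(ordered, x) / len(ordered)


# Hypothetical training-fold net-worth estimates.
train_net_worth = [250_000, 500_000, 1_000_000, 2_000_000]
rank = percentile_rank(train_net_worth, 1_000_000)
```

Ranking new donors against a fixed training distribution keeps the feature stable between fitting and scoring, which is the usual reason for doing this inside a fitted transformer.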
Source code in philanthropy/preprocessing/_wealth_percentile.py
EncounterRecencyTransformer
Bases: TransformerMixin, BaseEstimator
Transform HIPAA-safe encounter-date columns into predictive recency features.
Given one or more date-only columns (no PHI — dates only), this transformer produces three downstream-model-ready features per date column:
- days_since_last_encounter: Integer days between reference_date and the encounter date. NaN for missing/unparseable dates. Always non-negative when reference_date >= encounter_date; negative values indicate future dates (rare in production) and are left as-is to allow models to detect data-quality anomalies.
- encounter_in_last_90d: Float64 0.0 / 1.0 flag: 1.0 if days_since_last_encounter <= 90. Missing dates → 0.0.
- fiscal_year_of_encounter: Integer fiscal year containing the encounter date, labelled by the calendar year in which that fiscal year ends (e.g., with a July-start fiscal year, a June-30 encounter falls in the fiscal year ending that calendar year, while a July-1 encounter starts the next fiscal year). Missing → np.nan (returned as float64).
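The first two features can be sketched for a single date with the standard library. The function `recency_features` is hypothetical, and the fiscal-year feature is omitted here.

```python
from datetime import date


def recency_features(encounter_date, reference_date):
    """Return (days_since_last_encounter, encounter_in_last_90d) for one date."""
    if encounter_date is None:
        return float("nan"), 0.0  # missing date: NaN days, flag 0.0
    days = (reference_date - encounter_date).days  # negative for future dates
    return float(days), 1.0 if days <= 90 else 0.0


features = recency_features(date(2023, 6, 1), date(2023, 9, 1))
```

Note that a literal `<= 90` test also sets the flag for future-dated encounters, whose day count is negative; whether the library does the same is not stated here.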
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| date_col | str or list of str | Column name(s) in | "last_encounter_date" |
| fiscal_year_start | int | Month (1–12) on which the organisation's fiscal year begins. | 7 |
| reference_date | str, datetime-like, or None | The anchor date used to compute | None |
| timezone | str or None | Optional timezone name (e.g., | None |
Attributes:
| Name | Type | Description |
|---|---|---|
| reference_date_ | Timestamp | The reference date frozen at |
| n_features_in_ | int | Number of columns in |
| feature_names_in_ | ndarray of str | Column names of |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| TypeError | If the resolved |
Examples:
>>> import pandas as pd
>>> from philanthropy.preprocessing import EncounterRecencyTransformer
>>> X = pd.DataFrame({
... "last_encounter_date": ["2023-06-01", "2022-12-15", None],
... })
>>> t = EncounterRecencyTransformer(fiscal_year_start=7, reference_date="2023-09-01")
>>> t.set_output(transform="pandas")
EncounterRecencyTransformer(...)
>>> out = t.fit_transform(X)
>>> out.shape
(3, 3)
>>> int(out.iloc[0, 0]) # days since 2023-06-01 from 2023-09-01 = 92
92
>>> bool((out.iloc[:, 1] >= 0).all())
True
Notes
HIPAA note: This transformer accepts only date columns. Ensure that
no PHI fields (MRN, patient name, diagnosis code) are included in X.
Fiscal year convention: With fiscal_year_start=7, the fiscal year
is identified by the calendar year in which it ends. A date of
2023-07-01 belongs to FY 2024; a date of 2023-06-30 belongs to FY
2023. This matches the convention used by most US research universities
and many hospital foundations.
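The convention in this note can be written as a short function. The name `fiscal_year` is hypothetical, and treating fiscal_year_start=1 as the plain calendar year is an assumption that follows from the "year in which it ends" rule.

```python
from datetime import date


def fiscal_year(d, fiscal_year_start=7):
    """Label a date with the calendar year in which its fiscal year ends."""
    if fiscal_year_start == 1:
        return d.year  # fiscal year coincides with the calendar year
    # Months on or after the start month belong to the fiscal year ending next year.
    return d.year + 1 if d.month >= fiscal_year_start else d.year


fy_july_first = fiscal_year(date(2023, 7, 1))   # first day of a new fiscal year
fy_june_last = fiscal_year(date(2023, 6, 30))   # last day of the previous one
```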
Source code in philanthropy/preprocessing/_encounter_recency.py
fit(X, y=None)
Validate parameters and freeze the reference date from training data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Input data. Must contain the column(s) specified in | required |
| y | ignored | | None |
Returns:
| Name | Type | Description |
|---|---|---|
| self | EncounterRecencyTransformer | |
Source code in philanthropy/preprocessing/_encounter_recency.py
transform(X, y=None)
Compute encounter recency features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Input data containing the date column(s). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_out | np.ndarray of shape (n_samples, 3 * n_date_cols), dtype float64 | Feature columns, in order, for each |
Raises:
| Type | Description |
|---|---|
| NotFittedError | If |
Source code in philanthropy/preprocessing/_encounter_recency.py
get_feature_names_out(input_features=None)
Return output feature names.
Returns:
| Name | Type | Description |
|---|---|---|
| feature_names | ndarray of str | |
Source code in philanthropy/preprocessing/_encounter_recency.py
WealthScreeningImputerKNN
Bases: TransformerMixin, BaseEstimator
Leakage-safe KNN imputation for wealth-screening vendor columns.
Extends the median/mean/zero strategies of
philanthropy.preprocessing.WealthScreeningImputer with a
"knn" strategy using sklearn.impute.KNNImputer. KNN
imputation is recommended when wealth columns cluster meaningfully
(e.g., by zip-code-based real-estate quartile), which is common in
curated hospital prospect pools where WealthEngine / DonorSearch
data has geographic structure.
This estimator delegates to sklearn.impute.KNNImputer internally
and inherits its Pipeline composability and clone-safety.
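Since the estimator delegates to sklearn.impute.KNNImputer, the underlying behaviour can be previewed with scikit-learn directly. The sketch below uses made-up vendor values and calls KNNImputer itself, not WealthScreeningImputerKNN; it only illustrates fitting on a training fold and imputing held-out rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical wealth-screening matrix with two numeric vendor columns.
X_train = np.array([
    [1.0e6, 4.0e5],
    [1.2e6, np.nan],
    [np.nan, 3.5e5],
    [2.0e5, 9.0e4],
    [2.5e5, np.nan],
    [3.0e5, 1.1e5],
])
X_new = np.array([[1.1e6, np.nan], [np.nan, 1.0e5]])

imputer = KNNImputer(n_neighbors=2)
imputer.fit(X_train)  # neighbours are drawn from the training fold only
X_new_filled = imputer.transform(X_new)
```

Observed entries pass through unchanged; only the missing cells are replaced, using the nearest training rows that have a value for that column.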
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| wealth_cols | list of str or None | Subset of columns to impute. If | None |
| strategy | ('median', 'mean', 'zero', 'knn') | Imputation strategy. | "median" |
| n_neighbors | int | Number of neighbours used when | 5 |
| add_indicator | bool | Append a binary | True |
| group_col_idx | int or None | Column index of a group variable (e.g., zip-code encoded as int) to stratify KNN imputation. When provided (and | None |
Attributes:
| Name | Type | Description |
|---|---|---|
| imputed_cols_ | list of str | Wealth columns that were actually present in |
| fill_values_ | dict of {str: float} | Fill statistics (only populated for non-KNN strategies). |
| knn_imputer_ | KNNImputer or None | The fitted |
| n_features_in_ | int | Number of columns in |
| feature_names_in_ | ndarray of str | Column names of |
Examples:
>>> import numpy as np
>>> from philanthropy.preprocessing._share_of_wallet import WealthScreeningImputerKNN
>>> rng = np.random.default_rng(42)
>>> X = rng.uniform(0, 1e6, (50, 3))
>>> X[rng.random((50, 3)) < 0.3] = np.nan
>>> imp = WealthScreeningImputerKNN(strategy="knn", n_neighbors=3, add_indicator=False)
>>> imp.fit(X)
WealthScreeningImputerKNN(...)
>>> out = imp.transform(X)
>>> bool(np.isnan(out).any())
False
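For intuition about what the "knn" strategy delegates to, here is a simplified plain-Python sketch of nearest-neighbour imputation. It is not the library's code. It assumes a NaN-aware Euclidean distance and an unweighted mean over the k nearest training rows that observed the column, which mirrors the documented defaults of sklearn.impute.KNNImputer. The helper names `nan_euclidean` and `knn_impute` and the toy data are hypothetical.

```python
import math

NAN = float("nan")


def nan_euclidean(a, b):
    """Euclidean distance over coordinates observed in both rows,
    rescaled to compensate for the coordinates that were skipped."""
    shared = [(x - y) ** 2 for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf  # no overlapping coordinates: treat as infinitely far
    return math.sqrt(len(a) / len(shared) * sum(shared))


def knn_impute(train, rows, k=2):
    """Fill each NaN in `rows` with the mean of that column over the k
    nearest `train` rows in which the column was observed."""
    out = []
    for row in rows:
        filled = list(row)
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Only training rows that actually observed column j can donate.
            donors = [t for t in train if not math.isnan(t[j])]
            donors.sort(key=lambda t: nan_euclidean(row, t))
            nearest = donors[:k]
            if nearest:  # leave NaN in place if nobody can donate
                filled[j] = sum(t[j] for t in nearest) / len(nearest)
        out.append(filled)
    return out


# Two clusters of prospects: modest and high modelled wealth (toy values).
train = [[100.0, 10.0], [110.0, 12.0], [900.0, 95.0], [950.0, 100.0]]
new_rows = [[105.0, NAN], [NAN, 98.0]]

imputed = knn_impute(train, new_rows, k=2)
print(imputed)
```

Each gap is filled from the cluster the row resembles rather than from a pool-wide statistic, which is the situation in which the "knn" strategy is recommended above.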
Source code in philanthropy/preprocessing/_share_of_wallet.py
fit(X, y=None)
Learn fill statistics or fit the KNN imputer from training data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | | required |
| y | ignored | | None |
Returns:
| Name | Type | Description |
|---|---|---|
| self | WealthScreeningImputerKNN | |
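The leakage-safe idea behind learning fill statistics in fit can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the helper name `fit_fill_values`, the integer column keys, and the 0.0 fallback for a column with no observed values are all hypothetical.

```python
import math
from statistics import median

NAN = float("nan")


def fit_fill_values(X_train):
    """Per-column median of the observed (non-NaN) training values."""
    fill = {}
    for j in range(len(X_train[0])):
        observed = [row[j] for row in X_train if not math.isnan(row[j])]
        # Assumption: fall back to 0.0 when a column is entirely missing.
        fill[j] = median(observed) if observed else 0.0
    return fill


X_train = [[100.0, NAN], [300.0, 20.0], [NAN, 40.0], [200.0, 60.0]]
X_test = [[NAN, NAN], [5000.0, 1.0]]

# Statistics come from the training fold only ...
fill_values = fit_fill_values(X_train)

# ... and are applied unchanged to unseen rows, so the extreme 5000.0
# in the test fold cannot influence the fill value.
X_test_filled = [
    [fill_values[j] if math.isnan(v) else v for j, v in enumerate(row)]
    for row in X_test
]
print(fill_values, X_test_filled)
```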
Source code in philanthropy/preprocessing/_share_of_wallet.py
transform(X, y=None)
Apply imputation and optionally append missingness indicators.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_out | ndarray | Imputed array (float64), with indicator columns appended if |
Raises:
| Type | Description |
|---|---|
| NotFittedError | |
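As a rough picture of what "apply imputation and optionally append missingness indicators" can look like, the sketch below fills NaNs from previously learned values and appends one 0/1 column per input column. The output layout is an assumption made for illustration (the reference does not spell out which indicator columns are emitted), and `impute_with_indicator` is a hypothetical helper, not part of the package.

```python
import math

NAN = float("nan")


def impute_with_indicator(X, fill_values, add_indicator=True):
    """Replace NaNs with learned fill values; optionally append a 0/1
    missingness flag for every input column (assumed layout)."""
    out = []
    for row in X:
        missing = [math.isnan(v) for v in row]
        filled = [fill_values[j] if m else v
                  for j, (v, m) in enumerate(zip(row, missing))]
        if add_indicator:
            # Flags record where a value was imputed, so a downstream model
            # can still see that the vendor returned nothing for this donor.
            filled += [1.0 if m else 0.0 for m in missing]
        out.append(filled)
    return out


X = [[NAN, 5.0], [3.0, NAN]]
learned = {0: 1.0, 1: 2.0}  # e.g. medians learned from a training fold

with_flags = impute_with_indicator(X, learned)
without_flags = impute_with_indicator(X, learned, add_indicator=False)
print(with_flags)
print(without_flags)
```

Keeping the flags is useful when missingness itself carries signal, for example when a screening vendor has no record for a prospect.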
Source code in philanthropy/preprocessing/_share_of_wallet.py
ShareOfWalletScorer
Bases: TransformerMixin, BaseEstimator
Compute a normalised Share-of-Wallet score and capacity-tier label.
This transformer is designed as the final stage of a major-gift capacity scoring pipeline. It consumes a numeric feature matrix and produces two outputs per row:
sow_score (float64, [0, 1])
Normalised Share-of-Wallet:
SoW = estimated_capacity / (total_modelled_wealth + epsilon)
where `total_modelled_wealth` is the row-wise sum of all columns
specified in `wealth_col_indices` (or all columns if not
specified), and `estimated_capacity` is the column at
`capacity_col_idx`.
capacity_tier (float64, categorical encoding)
A numeric encoding of the human-readable tier label, usable by
downstream sklearn estimators (e.g., a classifier trained to
predict tier upgrades). The mapping is:
| SoW score | Tier label | Recommended action |
|---|---|---|
| ≥ 0.75 | Principal | Schedule personal visit with campaign chair. |
| 0.40 – 0.75 | Major | Assign major gift officer. |
| 0.00 – 0.40 | Leadership | Include in leadership annual giving. |
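The formula and tier thresholds above can be worked through by hand. The following is a minimal plain-Python sketch, not the transformer itself. Two details are assumptions rather than documented behaviour: the score is clipped to [0, 1], and a score of exactly 0.40 is treated as Major. The function names `sow_score` and `tier_label` are hypothetical.

```python
def sow_score(row, capacity_col_idx=0, wealth_col_indices=None, epsilon=1.0):
    """SoW = estimated_capacity / (total_modelled_wealth + epsilon)."""
    cols = range(len(row)) if wealth_col_indices is None else wealth_col_indices
    total_modelled_wealth = sum(row[j] for j in cols)
    score = row[capacity_col_idx] / (total_modelled_wealth + epsilon)
    return min(max(score, 0.0), 1.0)  # assumption: clip into [0, 1]


def tier_label(score):
    """Map a SoW score to the tier labels from the table above."""
    if score >= 0.75:
        return "Principal"
    if score >= 0.40:  # assumption: the 0.40 boundary belongs to Major
        return "Major"
    return "Leadership"


# Column 0 is estimated capacity; all three columns count as modelled wealth.
rows = [
    [800_000.0, 100_000.0, 99_999.0],   # 800000 / 1000000
    [500_000.0, 300_000.0, 199_999.0],  # 500000 / 1000000
    [100_000.0, 500_000.0, 399_999.0],  # 100000 / 1000000
]
scores = [sow_score(r) for r in rows]
labels = [tier_label(s) for s in scores]
print(list(zip(scores, labels)))
```

With the default epsilon of 1.0, a row whose wealth columns are all zero yields 0 / 1 = 0 instead of a division error.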
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| capacity_col_idx | int | Column index (0-based) in | 0 |
| wealth_col_indices | list of int or None | Column indices to sum as "total modelled wealth". If | None |
| epsilon | float | Small constant added to the denominator to prevent division by zero when all wealth columns are zero. | 1.0 |
| capacity_floor | float | Minimum value to enforce on | 0.0 |
Attributes:
| Name | Type | Description |
|---|---|---|
| wealth_scale_ | float | 95th-percentile total modelled wealth observed at fit time, used to clip outlier wealth sums during :meth: |
| n_features_in_ | int | |
| feature_names_in_ | ndarray of str | |
Examples:
>>> import numpy as np
>>> from philanthropy.preprocessing._share_of_wallet import ShareOfWalletScorer
>>> rng = np.random.default_rng(0)
>>> X = rng.uniform(0, 1e6, (20, 4))
>>> scorer = ShareOfWalletScorer(capacity_col_idx=0, epsilon=1.0)
>>> scorer.fit(X)
ShareOfWalletScorer(...)
>>> out = scorer.transform(X)
>>> out.shape
(20, 2)
>>> bool(((out[:, 0] >= 0) & (out[:, 0] <= 1)).all())
True
Source code in philanthropy/preprocessing/_share_of_wallet.py
fit(X, y=None)
Fit the scorer: record wealth scale from training data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | | required |
| y | ignored | | None |
Returns:
| Name | Type | Description |
|---|---|---|
| self | ShareOfWalletScorer | |
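Recording a wealth scale at fit time and using it to clip outliers later can be pictured as follows. This is a sketch under assumptions: the 95th percentile is computed with linear interpolation via the standard library, and clipping is taken to mean capping each row's wealth sum at that value. How the transformer applies the scale internally is not shown in this reference, and the helper names `fit_wealth_scale` and `clip_total` are hypothetical.

```python
from statistics import quantiles


def fit_wealth_scale(train_totals):
    """95th percentile of the training wealth sums (linear interpolation)."""
    # quantiles(..., n=100) returns the 99 percentile cut points,
    # so index 94 is the 95th percentile.
    return quantiles(train_totals, n=100, method="inclusive")[94]


def clip_total(total, wealth_scale):
    """Cap a row's total modelled wealth at the scale learned during fit."""
    return min(total, wealth_scale)


# Toy training fold: wealth sums from 100k to 2M in 100k steps.
train_totals = [100_000 * i for i in range(1, 21)]
wealth_scale = fit_wealth_scale(train_totals)

# At transform time, an extreme new prospect is capped; ordinary rows pass through.
capped = clip_total(50_000_000, wealth_scale)
unchanged = clip_total(400_000, wealth_scale)
print(wealth_scale, capped, unchanged)
```

Learning the cap from the training fold keeps a single very wealthy prospect in later data from stretching the scale.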
Source code in philanthropy/preprocessing/_share_of_wallet.py
transform(X, y=None)
Compute SoW score and numeric capacity tier.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_out | np.ndarray of shape (n_samples, 2), dtype float64 | Column 0: |
Raises:
| Type | Description |
|---|---|
| NotFittedError | |
Source code in philanthropy/preprocessing/_share_of_wallet.py
get_tier_labels(X)
Return human-readable tier labels for each row.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like compatible with `transform` | | required |
Returns:
| Name | Type | Description |
|---|---|---|
| labels | ndarray of str, shape (n_samples,) | One of |