Machine Learning Case Study by Chris Kucewicz
Real-world classification project predicting serious injuries in Chicago traffic crashes using interpretable models on imbalanced, messy public data.
Tech Stack: Python, Pandas, scikit-learn, imbalanced-learn (SMOTE), GridSearchCV
Photo by Sawyer Bengtson on Unsplash
Problem & Objective
Can we predict whether a traffic crash results in serious injury — and understand why?
Using 900K+ crash records from Chicago’s Open Data Portal, this project predicts serious injury outcomes (fatal or incapacitating) to support data-driven traffic safety policy. The goal was not just prediction, but interpretation — building a model that can inform real-world decision-making.
Data sources:
ML Problem Framing
- Task: Binary classification – Serious Injury vs. Non-Serious Injury
- Target:
most_severe_injury
→ recoded as binary - Challenge: Class imbalance (~5% serious injury)
- Goal: Use interpretable, policy-relevant models
- Data Sources: Crash, Vehicle, and People datasets (merged)
Serious injuries account for less than 5% of all crashes, highlighting a significant class imbalance in the dataset. To address this, the modeling pipeline used oversampling with SMOTE and class weighting to improve recall for serious injury predictions.
Modeling Pipeline
Step | Approach / Tool |
---|---|
Baseline Models | Logistic Regression, Decision Tree |
Imbalance Handling | SMOTE (oversampling), Class Weights |
Evaluation Metric | Precision-Recall AUC (PR AUC) |
Hyperparameter Tuning | GridSearchCV (on class weights) |
Final Model | Decision Tree (PR AUC = 0.096) |
Explore the modeling process step-by-step in the notebook. View it here.
Feature Insights
Feature | Insight |
---|---|
Airbag Deployment | Top predictor; likely a proxy for high-speed or large-vehicle impacts |
Sex (Male) | Overrepresented in serious injuries (54% vs. 47.8%) |
Crash Cause | High gain, though often “Unknown/Other” |
Season (Winter/Summer) | Seasonal extremes associated with elevated risk |
Decision Tree Plot of Final Model
The decision tree revealed interpretable splits based on the driver reversing direction and/or stopping, along with environmental and road conditions.
Top Feature Importances from Decision Tree Model
Feature | Category | Importance |
---|---|---|
Airbag | Not Deployed | 0.0625 |
Crash Type | Unknown/Other | 0.0499 |
Sex | Male | 0.0437 |
Season | Summer | 0.0414 |
Season | Winter | 0.0397 |
Season | Spring | 0.0391 |
The model prioritized contextual and driver-related features. Injury severity was more likely when airbags did not deploy, crashes occurred in less favorable seasons, or the driver was male. These results provide interpretable insights aligned with known crash risk factors.
Policy-Relevant Takeaways
- Male involvement and mid-speed roads are linked to serious crashes
- Airbag deployment (as a proxy for high-impact collisions) was the strongest predictor
- Findings support targeted safety campaigns, vehicle weight regulations, and airbag inspection policies
What This Project Demonstrates
- Handling real-world imbalanced classification with appropriate metrics (PR AUC)
- Use of interpretable models (white-box over black-box)
- Practical feature engineering: merging 3 datasets into crash-level records
- ML for public good: policy-relevant, equity-aware analysis
- Full workflow: data prep → modeling → evaluation → insights
Next Steps
- Explore ensemble models with SHAP for interpretability
- Refine feature engineering (e.g., separate driver vs. passenger sex)
- Compare with statistical models (e.g., Poisson/Negative Binomial)
Full Jupyter Notebook | Presentation Slides | GitHub Repo | Contact: cfkucewicz@gmail.com