Machine Learning Models for Crime Prediction

1 Introduction to Crime Prediction

Predictive modeling of crime has become an important tool for law enforcement and public safety planning. In this section, we explore how machine learning techniques can be used to predict the type of crime a victim might experience based on demographic characteristics and contextual factors. By understanding these patterns, authorities can develop more targeted prevention strategies and allocate resources more effectively.

2 Data Preparation for Modeling

To build our predictive models, we used the Los Angeles crime dataset with features including victim age, gender, ethnicity, time of occurrence, and location type. We preprocessed the data by:

Categorizing crime types into seven major categories
Encoding categorical variables
Handling missing values
Scaling numerical features
Splitting data into training (70%), validation (15%), and test (15%) sets
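
The 70/15/15 split in the last step can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the analysis; the function name `split_indices`, the fixed seed, and the row count are assumptions made for the example.

```python
import random

def split_indices(n_rows, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle row indices and cut them into train/validation/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = round(n_rows * train_frac)
    n_val = round(n_rows * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains, roughly 15%
    return train, val, test

# Hypothetical dataset size, for illustration only
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the crime categories are imbalanced, a stratified split (sampling each category separately) may be preferable in practice; the sketch above does not stratify.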

Summary of Features Used in Predictive Models
Feature	Unique Values	Examples
Crime Categories	7	Other, Robbery, Fraud
Victim Gender	3	Male, Unknown, Female
Age Groups	5	50-69, 30-49, 18-29
Premise Types	10	Residential, Commercial, Bank
Time of Day	4	Night, Evening, Morning, Afternoon
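
The grouped features in the table could be derived from raw values roughly as below. The age bands 18-29, 30-49, and 50-69 and the four time-of-day labels come from the table; the labels "Under 18" and "70+" and all hour cutoffs are assumptions, since the exact bin edges are not stated.

```python
def age_group(age):
    """Map a victim age in years to one of five age bands."""
    if age < 18:
        return "Under 18"  # assumed label for the youngest band
    if age <= 29:
        return "18-29"
    if age <= 49:
        return "30-49"
    if age <= 69:
        return "50-69"
    return "70+"  # assumed label for the oldest band

def time_of_day(hour):
    """Map an hour (0-23) to a time-of-day label; cutoffs are assumed."""
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    return "Night"

print(age_group(25), time_of_day(23))
```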

3 Model Comparison

We built and evaluated several machine learning models to predict crime types:

Multinomial Logistic Regression: A baseline model for multi-class classification
Random Forest: A tree-based ensemble method with class weight balancing
XGBoost: A gradient boosting framework that handles imbalanced data well
Support Vector Machine (SVM): A model focused on finding optimal decision boundaries

Each model was evaluated on the test set using accuracy, class-wise sensitivity, and Top-N accuracy metrics.
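
For reference, overall accuracy and class-wise sensitivity (recall) can be computed as in the sketch below. The labels and predictions are made-up toy values, not results from the dataset.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of cases where the predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def classwise_sensitivity(y_true, y_pred):
    """Per class: correctly predicted cases / actual cases of that class."""
    actual = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / actual[label] for label in actual}

# Toy example (illustrative values only)
y_true = ["Theft", "Theft", "Theft", "Assault", "Assault", "Fraud"]
y_pred = ["Theft", "Theft", "Assault", "Assault", "Theft", "Theft"]

overall = accuracy(y_true, y_pred)
per_class = classwise_sensitivity(y_true, y_pred)
print(overall, per_class)
```

In the toy data the rare class (Fraud) is never predicted, so its sensitivity is zero even though overall accuracy looks moderate. The same effect can occur on imbalanced real data.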

3.1 Multinomial Logistic Regression Results

Test Accuracy: Multinomial Logistic Regression Model
Crime Category	Accuracy (%)
Overall (Validation)	46.53
Overall (Test)	46.89
Class: Assault	66.24
Class: Fraud	11.46
Class: Property_Damage	32.96
Class: Public_Order	0.10
Class: Sexual_Crimes	0.06
Class: Theft	62.29
Class: Violent_Crimes	0.00

The logistic regression model achieved a moderate overall test accuracy of 46.89%. It performed best on the common crime types Assault (66.24%) and Theft (62.29%), recovered only 11.46% of Fraud cases, and almost never identified Public Order (0.10%), Sexual Crimes (0.06%), or Violent Crimes (0.00%). This pattern suggests the model is strongly influenced by class imbalance and defaults to the largest classes.

3.2 Weighted Random Forest Results

Test Accuracy: Weighted Random Forest Model
Crime Category	Accuracy (%)
Overall (Validation)	33.54
Overall (Test)	33.65
Class: Assault	13.86
Class: Fraud	67.34
Class: Property_Damage	36.45
Class: Public_Order	27.96
Class: Sexual_Crimes	53.02
Class: Theft	37.98
Class: Violent_Crimes	53.49

The weighted Random Forest has lower overall test accuracy (33.65%) than logistic regression (46.89%) but more balanced performance across crime categories. Class weighting improved sensitivity for minority classes such as Fraud (67.34%, up from 11.46%) and Public Order (27.96%, up from 0.10%), at the cost of the majority classes: Assault fell from 66.24% to 13.86% and Theft from 62.29% to 37.98%.
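
One common way to derive class weights is inverse-frequency weighting, the heuristic scikit-learn uses for `class_weight="balanced"`: each class receives the weight n_samples / (n_classes * class_count). The text does not state which weighting scheme was used, so the sketch below is an assumption, and the label counts are invented.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Toy label distribution (illustrative only)
labels = ["Theft"] * 6 + ["Assault"] * 3 + ["Fraud"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes get the largest weights, so errors on them cost more during training. That is consistent with the shift in sensitivity from majority to minority classes seen in the table.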

3.3 Top-N Accuracy Comparison

Top-N Accuracy (Validation & Test Sets) using Weighted Random Forest
Top N	Validation Accuracy (%)	Test Accuracy (%)
Top-1	49.11	49.50
Top-2	73.28	73.62
Top-3	85.61	85.84
Top-4	92.93	93.02
Top-5	97.31	97.26

Top-N Accuracy (Validation & Test Sets) using XGBoost Classifier
Top N	Validation Accuracy (%)	Test Accuracy (%)
Top-1	49.01	49.07
Top-2	73.27	73.58
Top-3	85.78	85.92
Top-4	93.17	93.12
Top-5	97.38	97.34

Top-N accuracy counts a prediction as correct when the true crime type is among the model's N most probable classes. Both ensemble methods improve substantially beyond their Top-1 accuracy of about 49%: Random Forest and XGBoost each exceed 85% accuracy when their top 3 predictions are considered and 97% with their top 5. The two models are nearly indistinguishable on this metric, differing by less than 0.5 percentage points at every N. This suggests that while predicting the exact crime type is challenging, identifying a small set of likely crime types is highly feasible.
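
Top-N accuracy can be computed from per-class predicted probabilities as in this sketch. The function name `top_n_accuracy` and the probability rows are illustrative assumptions, not output from the models above.

```python
def top_n_accuracy(y_true, prob_rows, n):
    """Share of cases whose true class is among the n most probable classes."""
    hits = 0
    for true_label, probs in zip(y_true, prob_rows):
        # Rank class labels from most to least probable
        ranked = sorted(probs, key=probs.get, reverse=True)
        if true_label in ranked[:n]:
            hits += 1
    return hits / len(y_true)

# Toy predicted probabilities for three cases (illustrative only)
y_true = ["Theft", "Fraud", "Assault"]
prob_rows = [
    {"Theft": 0.50, "Assault": 0.30, "Fraud": 0.20},
    {"Theft": 0.60, "Assault": 0.10, "Fraud": 0.30},
    {"Theft": 0.45, "Assault": 0.15, "Fraud": 0.40},
]

for n in (1, 2, 3):
    print(n, top_n_accuracy(y_true, prob_rows, n))
```

Top-N accuracy can only stay level or rise as N grows, and it reaches 100% once N equals the number of classes, so Top-5 figures for a seven-class problem should be read with that ceiling in mind.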

4 Confusion Matrix Analysis

The confusion matrix helps us understand where our models make mistakes and identify patterns in misclassification.

The confusion matrix reveals several important patterns:

Common misclassifications: Theft is often confused with Robbery, likely due to similarities in these crime types.
Strong diagonal: Most crime types show relatively strong diagonal values, indicating the model performs reasonably well at identifying the correct class.
Minority class challenges: Smaller classes like Sexual Crimes show more off-diagonal misclassifications proportional to their size.
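
The patterns above can be extracted from predictions programmatically. The sketch below counts (true, predicted) pairs and ranks off-diagonal cells by the share of the true class they absorb, which is one way to read "proportional to their size". The data are toy values and the helper names are assumptions.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true class, predicted class) pair occurs."""
    return Counter(zip(y_true, y_pred))

def most_confused_pairs(y_true, y_pred, k=3):
    """Rank misclassification pairs by their share of the true class."""
    counts = confusion_counts(y_true, y_pred)
    class_sizes = Counter(y_true)
    off_diagonal = [
        ((t, p), c / class_sizes[t])
        for (t, p), c in counts.items()
        if t != p
    ]
    return sorted(off_diagonal, key=lambda item: item[1], reverse=True)[:k]

# Toy data (illustrative only)
y_true = ["Theft"] * 5 + ["Sexual_Crimes"] * 2 + ["Assault"] * 3
y_pred = (["Theft", "Theft", "Theft", "Assault", "Theft"]
          + ["Assault", "Assault"]
          + ["Assault", "Assault", "Theft"])

counts = confusion_counts(y_true, y_pred)
pairs = most_confused_pairs(y_true, y_pred)
print(pairs)
```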

5 Model Deployment Strategy

Based on our comprehensive model evaluation, we recommend the following deployment strategy:

Model Deployment Recommendations
Scenario	Recommended Model	Justification
Emergency Response Prioritization	XGBoost with Top-3 Predictions	Balance between speed and accuracy; provides actionable multiple scenarios
Community Risk Awareness	Random Forest with Top-5 Predictions	Higher coverage ensures most potential risks are identified for community education
Resource Allocation	Logistic Regression with Class-Specific Thresholds	Simple interpretable model with adjustable thresholds for different resource types
Individual Safety Planning	XGBoost with Probability Calibration	Best calibrated probability estimates for personal risk assessment
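
As an illustration of the "class-specific thresholds" idea in the Resource Allocation row, the sketch below flags every crime category whose predicted probability meets its own threshold. How thresholds would actually be applied is not specified in the table, so this decision rule, the function name, and all numeric values are assumptions.

```python
def predict_with_thresholds(probs, thresholds, default=0.5):
    """Return classes whose probability meets their threshold, most likely first."""
    flagged = [c for c, p in probs.items() if p >= thresholds.get(c, default)]
    return sorted(flagged, key=probs.get, reverse=True)

# Hypothetical model output and thresholds (illustrative only)
probs = {"Theft": 0.40, "Assault": 0.35, "Fraud": 0.15, "Violent_Crimes": 0.10}
thresholds = {"Theft": 0.50, "Assault": 0.30, "Fraud": 0.20, "Violent_Crimes": 0.05}

flagged = predict_with_thresholds(probs, thresholds)
print(flagged)
```

Lower thresholds for rare but serious categories make the system flag them more readily, which trades additional false alarms for higher sensitivity on those categories.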

6 Interactive Crime Prediction Tool

Below is a demonstration of how our crime prediction model could be implemented in an interactive tool. Using demographic information and location details, the tool estimates the probability of different crime types.

Sample Prediction Scenarios
Scenario	Gender	Age	Location	Time
Scenario 1	Female	18-29	Commercial	Evening
Scenario 2	Male	30-49	Street	Night
Scenario 3	Female	50-69	Residential	Morning
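
Before a scenario can be scored, its categorical fields must be encoded numerically. The sketch below one-hot encodes the three sample scenarios; it builds the category vocabulary from the scenarios themselves purely for illustration, whereas a real tool would reuse the vocabulary fitted on the training data.

```python
scenarios = [
    {"Gender": "Female", "Age": "18-29", "Location": "Commercial", "Time": "Evening"},
    {"Gender": "Male", "Age": "30-49", "Location": "Street", "Time": "Night"},
    {"Gender": "Female", "Age": "50-69", "Location": "Residential", "Time": "Morning"},
]

def build_vocabulary(rows):
    """Collect the sorted distinct levels seen for each field."""
    vocab = {}
    for row in rows:
        for field, value in row.items():
            vocab.setdefault(field, set()).add(value)
    return {field: sorted(values) for field, values in vocab.items()}

def one_hot(row, vocab):
    """Encode one scenario as a flat 0/1 vector, one slot per (field, level)."""
    vector = []
    for field, levels in vocab.items():
        vector.extend(1 if row[field] == level else 0 for level in levels)
    return vector

vocab = build_vocabulary(scenarios)
encoded = [one_hot(row, vocab) for row in scenarios]
for row in encoded:
    print(row)
```

Each encoded vector would then be passed to the trained classifier, whose per-class probabilities could be reported as a ranked Top-3 list.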

7 Limitations and Future Work

While our models show promising results, several limitations should be acknowledged:

Data quality issues: Missing or inaccurate information in police reports affects model performance
Temporal dynamics: Crime patterns change over time, requiring model retraining
Spatial granularity: Neighborhood-level factors are not fully captured
Causality vs. correlation: Predictions reflect statistical patterns, not necessarily causal relationships
Ethical considerations: Care must be taken to avoid reinforcing existing biases

Future research directions include:

Incorporating spatiotemporal modeling: Using geographic and temporal information more effectively
Ensemble methods: Combining multiple models for improved performance
Deep learning approaches: Using neural networks for more complex pattern recognition
External data integration: Adding socioeconomic, weather, and infrastructure data
Interpretable AI: Developing more transparent models for law enforcement use

8 Conclusion

Our analysis shows that machine learning can narrow down the likely crime type from victim demographics and contextual factors, even though exact prediction remains difficult. Top-1 accuracy stays below 50% for every model, but the ensemble methods, Random Forest and XGBoost, place the true crime type among their top 3 predictions about 86% of the time, which is practical for real-world applications.

The most important predictors of crime type are victim age, location type, and time of day. Different demographic groups face distinctly different risk profiles, which supports targeted prevention strategies rather than one-size-fits-all approaches.

For practical deployment, we recommend using a Top-3 prediction system that presents the most likely crime scenarios rather than a single prediction. This approach balances accuracy with actionable information, enabling more effective resource allocation and safety planning.