1 Introduction to Crime Prediction

Predictive modeling of crime has become an important tool for law enforcement and public safety planning. In this section, we explore how machine learning techniques can be used to predict the type of crime a victim might experience based on demographic characteristics and contextual factors. By understanding these patterns, authorities can develop more targeted prevention strategies and allocate resources more effectively.

2 Data Preparation for Modeling

To build our predictive models, we used the Los Angeles crime dataset with features including victim age, gender, ethnicity, time of occurrence, and location type. We preprocessed the data by:

  1. Categorizing crime types into seven major categories
  2. Encoding categorical variables
  3. Handling missing values
  4. Scaling numerical features
  5. Splitting data into training (70%), validation (15%), and test (15%) sets

Summary of Features Used in Predictive Models

| Feature | Unique Values | Examples |
|---|---|---|
| Crime Categories | 7 | Other, Robbery, Fraud |
| Victim Gender | 3 | Male, Unknown, Female |
| Age Groups | 5 | 50-69, 30-49, 18-29 |
| Premise Types | 10 | Residential, Commercial, Bank |
| Time of Day | 4 | Night, Evening, Morning, Afternoon |
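
The 70/15/15 split from the preprocessing list can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name `stratified_split`, the fixed seed, and the choice to stratify by crime category are all assumptions, and the labels are toy values rather than the real dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy labels for illustration only (assumed, not from the crime dataset)
labels = ["Theft"] * 60 + ["Assault"] * 40
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting within each class keeps the category proportions roughly equal across the three sets, which matters when some crime categories are rare.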

3 Model Comparison

We built and evaluated several machine learning models to predict crime types:

  1. Multinomial Logistic Regression: A baseline model for multi-class classification
  2. Random Forest: A tree-based ensemble method with class weight balancing
  3. XGBoost: A gradient boosting framework that can accommodate imbalanced data through instance weighting
  4. Support Vector Machine (SVM): A model focused on finding optimal decision boundaries

Each model was evaluated on the test set using accuracy, class-wise sensitivity, and Top-N accuracy metrics.
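
For reference, overall accuracy and class-wise sensitivity can be computed as in the sketch below. The function names and the toy labels are assumptions made for illustration; they are not taken from the analysis code.

```python
from collections import Counter

def overall_accuracy(y_true, y_pred):
    """Fraction of cases where the predicted class equals the actual class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def classwise_sensitivity(y_true, y_pred):
    """Sensitivity (recall) per class: correctly predicted cases / actual cases."""
    actual = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / count for label, count in actual.items()}

# Toy example (assumed values)
y_true = ["Theft", "Theft", "Theft", "Assault", "Assault", "Fraud"]
y_pred = ["Theft", "Theft", "Assault", "Assault", "Theft", "Theft"]

accuracy = overall_accuracy(y_true, y_pred)
sensitivity = classwise_sensitivity(y_true, y_pred)
print(accuracy, sensitivity)
```

A model can score well on overall accuracy while having near-zero sensitivity for a rare class, which is why both metrics are reported.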

3.1 Multinomial Logistic Regression Results

Test Accuracy: Multinomial Logistic Regression Model

| Crime Category | Accuracy (%) |
|---|---|
| Overall (Validation) | 46.53 |
| Overall (Test) | 46.89 |
| Class: Assault | 66.24 |
| Class: Fraud | 11.46 |
| Class: Property_Damage | 32.96 |
| Class: Public_Order | 0.10 |
| Class: Sexual_Crimes | 0.06 |
| Class: Theft | 62.29 |
| Class: Violent_Crimes | 0.00 |

The logistic regression model achieved a moderate overall accuracy of 46.89% on the test set. It performed best on common crime types such as Assault (66.24%) and Theft (62.29%), but it almost never identified Public Order (0.10%), Sexual Crimes (0.06%), or Violent Crimes (0.00%) cases, and it recovered only 11.46% of Fraud cases. This pattern suggests that class imbalance pushes the model toward the majority classes.

3.2 Weighted Random Forest Results

Test Accuracy: Weighted Random Forest Model

| Crime Category | Accuracy (%) |
|---|---|
| Overall (Validation) | 33.54 |
| Overall (Test) | 33.65 |
| Class: Assault | 13.86 |
| Class: Fraud | 67.34 |
| Class: Property_Damage | 36.45 |
| Class: Public_Order | 27.96 |
| Class: Sexual_Crimes | 53.02 |
| Class: Theft | 37.98 |
| Class: Violent_Crimes | 53.49 |

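The weighted Random Forest has lower overall test accuracy (33.65%) than logistic regression (46.89%), but its performance is more balanced across crime categories. With class weighting, sensitivity for minority classes is far higher than under logistic regression: Fraud rises from 11.46% to 67.34%, Public Order from 0.10% to 27.96%, and Violent Crimes from 0.00% to 53.49%. This comes at the cost of the majority classes, where Assault falls from 66.24% to 13.86% and Theft from 62.29% to 37.98%.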
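
The text does not state which weighting scheme was used. One common choice is the inverse-frequency heuristic, which scikit-learn documents for `class_weight="balanced"`. The sketch below implements that formula under this assumption; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy class distribution (assumed, not the real category counts)
labels = ["Theft"] * 6 + ["Assault"] * 3 + ["Fraud"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive larger weights, so errors on them cost more during training. This is consistent with the trade-off seen above, where minority-class sensitivity improves while majority-class sensitivity drops.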

3.3 Top-N Accuracy Comparison

Top-N Accuracy (Validation & Test Sets) using Weighted Random Forest

| Top-N | Validation Accuracy (%) | Test Accuracy (%) |
|---|---|---|
| Top-1 | 49.11 | 49.50 |
| Top-2 | 73.28 | 73.62 |
| Top-3 | 85.61 | 85.84 |
| Top-4 | 92.93 | 93.02 |
| Top-5 | 97.31 | 97.26 |

Top-N Accuracy (Validation & Test Sets) using XGBoost Classifier

| Top-N | Validation Accuracy (%) | Test Accuracy (%) |
|---|---|---|
| Top-1 | 49.01 | 49.07 |
| Top-2 | 73.27 | 73.58 |
| Top-3 | 85.78 | 85.92 |
| Top-4 | 93.17 | 93.12 |
| Top-5 | 97.38 | 97.34 |

When evaluated with Top-N accuracy, both models improve substantially beyond their Top-1 predictions. Most notably, Random Forest and XGBoost both exceed 85% test accuracy when their top 3 predictions are considered, and reach about 97% with their top 5. This suggests that while predicting the exact crime type is challenging, identifying a small set of likely crime types is highly feasible.
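
Top-N accuracy counts a prediction as correct when the true class appears among the N classes with the highest predicted probability. A minimal sketch follows, assuming each prediction is available as a dictionary of class probabilities; the function name and the toy probabilities are illustrative only.

```python
def top_n_accuracy(y_true, probabilities, n):
    """Share of cases whose true class is among the n highest-probability classes."""
    hits = 0
    for true_label, probs in zip(y_true, probabilities):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if true_label in ranked[:n]:
            hits += 1
    return hits / len(y_true)

# Toy predicted probabilities (assumed values, not model output)
probabilities = [
    {"Theft": 0.50, "Assault": 0.30, "Fraud": 0.20},
    {"Theft": 0.40, "Assault": 0.35, "Fraud": 0.25},
    {"Theft": 0.60, "Assault": 0.10, "Fraud": 0.30},
]
y_true = ["Theft", "Assault", "Assault"]

for n in (1, 2, 3):
    print(n, top_n_accuracy(y_true, probabilities, n))
```

By construction, Top-N accuracy never decreases as N grows and reaches 100% when N equals the number of classes, so the gains from Top-1 to Top-3 are the informative part of the comparison.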

4 Confusion Matrix Analysis

The confusion matrix helps us understand where our models make mistakes and identify patterns in misclassification.

The confusion matrix reveals several important patterns:

  1. Common misclassifications: Theft is often confused with Robbery, likely due to similarities in these crime types.
  2. Strong diagonal: Most crime types show relatively strong diagonal values, indicating the model performs reasonably well at identifying the correct class.
  3. Minority class challenges: Smaller classes like Sexual Crimes show more off-diagonal misclassifications relative to their size.
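
The patterns above are read from a matrix of actual versus predicted classes. The sketch below shows one way to build such a matrix and normalize each row, so that small classes can be compared with large ones. The function names and the toy labels are assumptions; they do not reproduce the matrix discussed here.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(actual, predicted)] for predicted in classes] for actual in classes]

def row_normalize(matrix):
    """Convert counts to per-class proportions so small classes are comparable."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

# Toy labels (assumed values)
classes = ["Assault", "Sexual_Crimes", "Theft"]
y_true = ["Theft", "Theft", "Theft", "Theft", "Assault", "Assault",
          "Sexual_Crimes", "Sexual_Crimes"]
y_pred = ["Theft", "Theft", "Theft", "Assault", "Assault", "Theft",
          "Assault", "Sexual_Crimes"]

counts = confusion_matrix(y_true, y_pred, classes)
proportions = row_normalize(counts)
print(counts)
print(proportions)
```

After row normalization, the diagonal entry of each row equals that class's sensitivity, and the off-diagonal entries show where its misclassified cases go.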

5 Model Deployment Strategy

Based on our comprehensive model evaluation, we recommend the following deployment strategy:

Model Deployment Recommendations

| Scenario | Recommended Model | Justification |
|---|---|---|
| Emergency Response Prioritization | XGBoost with Top-3 Predictions | Balance between speed and accuracy; provides actionable multiple scenarios |
| Community Risk Awareness | Random Forest with Top-5 Predictions | Higher coverage ensures most potential risks are identified for community education |
| Resource Allocation | Logistic Regression with Class-Specific Thresholds | Simple interpretable model with adjustable thresholds for different resource types |
| Individual Safety Planning | XGBoost with Probability Calibration | Best calibrated probability estimates for personal risk assessment |
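
To illustrate what class-specific thresholds could look like in the resource allocation scenario, the sketch below flags every crime type whose predicted probability meets its own cutoff. The threshold values, the default cutoff, the function name, and the probabilities are all hypothetical. In practice the thresholds would need to be tuned on the validation set.

```python
def flag_classes(probs, thresholds, default=0.40):
    """Return classes whose probability meets their threshold, most likely first."""
    flagged = [
        label for label, p in probs.items()
        if p >= thresholds.get(label, default)
    ]
    return sorted(flagged, key=lambda label: probs[label], reverse=True)

# Hypothetical lower cutoffs for rare classes (assumed values)
thresholds = {"Fraud": 0.10, "Sexual_Crimes": 0.05}

# Hypothetical model output for one case (assumed values)
probs = {
    "Theft": 0.45,
    "Assault": 0.30,
    "Fraud": 0.15,
    "Property_Damage": 0.06,
    "Sexual_Crimes": 0.04,
}

flagged = flag_classes(probs, thresholds)
print(flagged)
```

Lower thresholds for rare classes trade more false alarms for fewer missed cases, which is the kind of adjustment the table's justification refers to.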

6 Interactive Crime Prediction Tool

Below is a demonstration of how our crime prediction model could be implemented in an interactive tool. Using demographic information and location details, the tool estimates the probability of different crime types.

Sample Prediction Scenarios

| Scenario | Gender | Age | Location | Time |
|---|---|---|---|---|
| Scenario 1 | Female | 18-29 | Commercial | Evening |
| Scenario 2 | Male | 30-49 | Street | Night |
| Scenario 3 | Female | 50-69 | Residential | Morning |
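
Before a scenario like these can be passed to a model, its categorical inputs must be turned into numbers. The sketch below uses one-hot encoding, which is an assumption, since the text says only that categorical variables were encoded. The vocabularies are deliberately partial: they list only values named in this document, whereas the full data has 5 age groups and 10 premise types. The names `VOCAB` and `one_hot_encode` are hypothetical.

```python
# Partial vocabularies (assumed): only values that appear in this document
VOCAB = {
    "gender": ["Female", "Male", "Unknown"],
    "age_group": ["18-29", "30-49", "50-69"],
    "premise": ["Commercial", "Residential", "Street"],
    "time_of_day": ["Morning", "Afternoon", "Evening", "Night"],
}

def one_hot_encode(scenario):
    """Map a scenario dict to a flat 0/1 feature vector, one slot per known value."""
    vector = []
    for feature, values in VOCAB.items():
        value = scenario[feature]
        if value not in values:
            raise ValueError(f"Unknown {feature}: {value!r}")
        vector.extend(1 if value == v else 0 for v in values)
    return vector

scenario_1 = {
    "gender": "Female",
    "age_group": "18-29",
    "premise": "Commercial",
    "time_of_day": "Evening",
}
features = one_hot_encode(scenario_1)
print(features)
```

The resulting vector would then be passed to the trained classifier, whose class probabilities can be ranked to present the most likely crime types.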

7 Limitations and Future Work

While our models show promising results, several limitations should be acknowledged:

  1. Data quality issues: Missing or inaccurate information in police reports affects model performance
  2. Temporal dynamics: Crime patterns change over time, requiring model retraining
  3. Spatial granularity: Neighborhood-level factors are not fully captured
  4. Causality vs. correlation: Predictions reflect statistical patterns, not necessarily causal relationships
  5. Ethical considerations: Care must be taken to avoid reinforcing existing biases

Future research directions include:

  1. Incorporating spatiotemporal modeling: Using geographic and temporal information more effectively
  2. Ensemble methods: Combining multiple models for improved performance
  3. Deep learning approaches: Using neural networks for more complex pattern recognition
  4. External data integration: Adding socioeconomic, weather, and infrastructure data
  5. Interpretable AI: Developing more transparent models for law enforcement use

8 Conclusion

Our analysis demonstrates that machine learning can narrow down the likely crime type from victim demographics and contextual factors, even though exact prediction remains difficult: the best Top-1 test accuracy was about 49%. Ensemble methods like Random Forest and XGBoost place the true crime type among their top 3 predictions in about 86% of test cases, which is practical for real-world applications.

The most important predictors of crime type are victim age, location type, and time of day. Different demographic groups face distinctly different risk profiles, which supports targeted prevention strategies rather than one-size-fits-all approaches.

For practical deployment, we recommend using a Top-3 prediction system that presents the most likely crime scenarios rather than a single prediction. This approach balances accuracy with actionable information, enabling more effective resource allocation and safety planning.