Libraries: scikit-learn (models, grid search, cross-validation, and evaluation metrics) and NumPy.
Objective: The primary objective of this project is to predict customer churn so that businesses can proactively retain at-risk customers. The project uses a structured dataset of customer features; the target variable is binary, indicating whether a customer has churned (1) or not (0).
Dataset: The generated dataset includes various features that might influence the likelihood of churn. This synthetic dataset simulates a scenario in which customer features are generated randomly, including factors that may contribute to churn, and it provides a basis for building and testing customer churn prediction models.
Note: Synthetic data (only the first and last few features are shown in the preview above).
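The feature preview referenced above is an image in the original repository. As a stand-in, the sketch below shows one way such a synthetic dataset and train/test split might be produced; `make_classification`, the ~80/20 class weighting (inferred from the ~79% class-0 rate implied by the results table further down), and every parameter here are illustrative assumptions, not the project's actual generation code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the project's synthetic churn dataset:
# randomly generated customer features with a binary churn label (1 = churned).
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.8, 0.2],   # assumed imbalance: ~20% churners
    random_state=42,
)

# Held-out test set used by the evaluation code further down.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42,
)
```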
Models Used: Random Forest, Logistic Regression, Support Vector Machine (SVM), Gradient Boosting, and K-Nearest Neighbors (KNN), each tuned via grid search over the hyperparameter grids shown below.
Code snippet showing the model and hyperparameter-grid setup:
```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Model list with models and their hyperparameter grids
models = {
    'Random Forest': (RandomForestClassifier(),
                      {'n_estimators': [50, 100, 200],
                       'max_depth': [None, 10, 20, 30],
                       'min_samples_split': [2, 5, 10],
                       'min_samples_leaf': [1, 2, 4]}),
    'Logistic Regression': (LogisticRegression(),
                            {'C': [0.001, 0.01, 0.1, 1, 10, 100]}),
    'SVM': (SVC(),
            {'C': [0.1, 1, 10],
             'kernel': ['linear', 'rbf']}),
    'Gradient Boosting': (GradientBoostingClassifier(),
                          {'n_estimators': [50, 100, 200],
                           'learning_rate': [0.01, 0.1, 0.2],
                           'max_depth': [3, 4, 5]}),
    'KNN': (KNeighborsClassifier(),
            {'n_neighbors': [3, 5, 7],
             'weights': ['uniform', 'distance'],
             'p': [1, 2]}),
}
```
Code snippet showing model execution:
```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

accuracies, model_names = [], []

for model_name, (model, param_grid) in models.items():
    # Hyperparameter tuning using grid search
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)

    # Best model after hyperparameter tuning
    best_model = grid_search.best_estimator_

    # Cross-validation scores for the tuned model
    cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='accuracy')
    accuracy = np.mean(cv_scores)
    accuracies.append(accuracy)
    model_names.append(model_name)

    print(f"\nModel: {model_name}")
    print("Best Hyperparameters:", grid_search.best_params_)
    print("Cross-Validation Scores:", cv_scores)
    print("Average Cross-Validation Accuracy:", accuracy)

    # Fit the best model on the entire training set for later evaluation
    # (GridSearchCV's refit=True default has already done this; kept as a safeguard)
    best_model.fit(X_train, y_train)

    # Evaluate on the held-out test set
    y_pred = best_model.predict(X_test)
    print("Test Set Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
```
Model Results:
Model | Best Hyperparameters | Cross-Validation Accuracy | Test Set Accuracy | Precision (Class 0) | Recall (Class 1)
--- | --- | --- | --- | --- | ---
Random Forest | {'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200} | 86.73% | 78.95% | 79% | 0%
Logistic Regression | {'C': 0.001} | 63.82% | 59.25% | 80% | 42%
SVM | {'C': 10, 'kernel': 'rbf'} | 79.21% | 69.65% | 80% | 21%
Gradient Boosting | {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100} | 86.43% | 78.90% | 79% | 0%
KNN | {'n_neighbors': 3, 'p': 1, 'weights': 'distance'} | 79.21% | 61.40% | 80% | 35%
Model Accuracy:
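This heading likely accompanied a bar chart in the original page. Below is a minimal sketch that reproduces such a chart from the `model_names` and `accuracies` lists collected during training (assuming matplotlib is installed):

```python
import matplotlib.pyplot as plt

# Bar chart of average cross-validation accuracy per model.
plt.figure(figsize=(8, 4))
plt.bar(model_names, accuracies, color='steelblue')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Model Accuracy Comparison')
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()
```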
Let’s analyze the results for each model:
1. Random Forest: Highest cross-validation accuracy (86.73%) and test accuracy (78.95%), but 0% recall for class 1, meaning it fails to identify any churners.
2. Logistic Regression: The weakest overall accuracy (59.25% on the test set), yet the best recall for churners (42%).
3. SVM: Moderate accuracy (69.65% on the test set) with low churner recall (21%).
4. Gradient Boosting: Nearly identical to Random Forest (86.43% CV, 78.90% test) and likewise 0% recall for class 1.
5. KNN: Strong cross-validation accuracy (79.21%) that does not carry over to the test set (61.40%), with 35% churner recall.
Overall Summary: The two highest-accuracy models (Random Forest and Gradient Boosting) achieve their scores largely by predicting the majority (non-churn) class, as their 0% recall for class 1 shows, while lower-accuracy models such as Logistic Regression and KNN recover noticeably more actual churners. Accuracy alone is therefore a misleading criterion for this task.
Note: These observations highlight the trade-off between precision and recall, especially on imbalanced datasets. Further tuning and alternative evaluation metrics may be warranted depending on the business objectives and the relative costs of false positives and false negatives.
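As one hedged illustration of that note: optimizing the grid search for F1 instead of accuracy, and re-weighting the minority class, penalizes models that ignore churners. This is a sketch of an alternative setup, not the project's code; the reduced grid here is arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# scoring='f1' rewards correctly identifying churners (class 1) rather than
# raw accuracy, and class_weight='balanced' re-weights the minority class.
imbalance_aware_search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced'),
    {'n_estimators': [100, 200], 'max_depth': [10, 20, None]},  # arbitrary reduced grid
    cv=5,
    scoring='f1',
    n_jobs=-1,
)
imbalance_aware_search.fit(X_train, y_train)
print("Best params:", imbalance_aware_search.best_params_)
print("Best CV F1:", imbalance_aware_search.best_score_)
```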