Sparkify: Predicting the user churn using Apache Spark

Published in Analytics Vidhya, Jun 1, 2021 (8 min read)

This project aims to predict user churn for a fictional digital music streaming service called Sparkify.

Churn prediction is one of the most sought-after use cases of Big Data in industry. It helps businesses act proactively: if a customer is predicted to churn in the near future, the business can target them with promotional campaigns and give them a reason to stay.

We’ll analyze, explore and model our solution following CRISP-DM (Cross-Industry Standard Process for Data Mining), an industry-wide practice, and explain the solution under the following sections:

Business Understanding
Data Understanding
Data Preparation
Data Modeling
Evaluate the Results
Deploy

Business Understanding

Sparkify is a digital music streaming service that can be used for free, with advertisements played between songs, or ad-free for a monthly subscription fee. At any time, users can downgrade from premium to free, upgrade from free to premium, or cancel the service altogether.

A user is considered to have churned when they cancel the service.

Data Understanding & Data Preparation

The dataset is a log of every action users take on the platform. Each action is marked with a timestamp, ts, recording when it took place.

In the dataset, we have 691 users with 543,705 events.

These users perform different actions and visit different pages. As the distribution of page events shows, the bulk of user actions are NextSong events.

Some users have upgraded or downgraded their service, and 99 users have cancelled it. These 99 users are marked as churned.

We will now look at some aspects of the users that have churned and how they compare to the users that haven’t.

We will first look at how gender is distributed among churned and active users.

*Male customers are more likely to churn than female customers*

We will now look at some metrics and how they compare between churned and active users.

Data Modeling

As part of the Data Modeling, 11 features were identified and used to predict user churn. These features are:

From the above list of fields, userId was omitted. The categorical variables gender and level were one-hot encoded; since each had only 2 distinct values, they became single binary columns.

Features such as total_lifetime (time since registration) and avg_songs_played (average number of songs played per session) had to be engineered:

from pyspark.sql.functions import col

# Total time since registration; dividing by 1000/3600/24 converts milliseconds to days
total_lifetime = user_log_valid.select('userId', 'registration', 'ts').withColumn('total_lifetime', (user_log_valid.ts - user_log_valid.registration))\
    .groupBy('userId').agg({'total_lifetime' : 'max'})\
    .withColumnRenamed('max(total_lifetime)', 'total_lifetime')\
    .select('userId', (col('total_lifetime')/1000/3600/24).alias('total_lifetime'))
total_lifetime.show(5)

# Average songs played per session: count NextSong events per session, then average per user
avg_song_played = user_log_valid.where('page == "NextSong"')\
    .groupby(['userId', 'sessionId']).count()\
    .groupby(['userId']).agg({'count':'avg'})\
    .withColumnRenamed('avg(count)', 'avg_songs_played')
avg_song_played.show(5)
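To make the two engineered features concrete, here is a minimal plain-Python sketch of the same arithmetic on a tiny event log. The user IDs, sessions and timestamps are invented for illustration, and the sketch assumes that ts and registration are in milliseconds, which is what the division by 1000/3600/24 implies. The real computation runs on Spark DataFrames.

```python
from collections import defaultdict

MS_PER_DAY = 1000 * 3600 * 24

# Hypothetical events: (userId, sessionId, page, registration_ms, ts_ms)
events = [
    ("u1", 1, "NextSong", 0, 1 * MS_PER_DAY),
    ("u1", 1, "NextSong", 0, 2 * MS_PER_DAY),
    ("u1", 2, "Home",     0, 9 * MS_PER_DAY),
    ("u1", 2, "NextSong", 0, 10 * MS_PER_DAY),
    ("u2", 7, "NextSong", 5 * MS_PER_DAY, 8 * MS_PER_DAY),
]

total_lifetime = defaultdict(float)   # max days since registration, per user
songs_per_session = defaultdict(int)  # NextSong count per (user, session)

for user, session, page, registration, ts in events:
    days = (ts - registration) / MS_PER_DAY
    total_lifetime[user] = max(total_lifetime[user], days)
    if page == "NextSong":
        songs_per_session[(user, session)] += 1

# Average the per-session song counts for each user
counts_by_user = defaultdict(list)
for (user, _session), n_songs in songs_per_session.items():
    counts_by_user[user].append(n_songs)
avg_songs_played = {u: sum(c) / len(c) for u, c in counts_by_user.items()}

print(dict(total_lifetime))
print(avg_songs_played)
```

As in the Spark version, sessions without any NextSong event do not enter the average, because the filter is applied before grouping.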

Most of the other features are sums aggregated per userId.

The label (user churn) and the Downgraded flag were also calculated using a Window function.
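The article does not show the labeling code, so the following is only a sketch of the idea behind a per-user window: flag the cancellation event, then spread the maximum of that flag over all of the user's events. It is written in plain Python rather than Spark, the events are invented, and the page name in CHURN_PAGE is an assumption about how cancellations appear in the log.

```python
CHURN_PAGE = "Cancellation Confirmation"  # assumed page name, not confirmed by the article

# Hypothetical events
events = [
    {"userId": "u1", "ts": 1, "page": "NextSong"},
    {"userId": "u1", "ts": 2, "page": CHURN_PAGE},
    {"userId": "u2", "ts": 1, "page": "NextSong"},
    {"userId": "u2", "ts": 3, "page": "Home"},
]

# Step 1: flag each individual event (1 if it is a cancellation, else 0)
for event in events:
    event["churn_event"] = int(event["page"] == CHURN_PAGE)

# Step 2: the "window" step. Take the max flag over all events of the same user,
# which is what max(...) over a window partitioned by userId would compute.
user_label = {}
for event in events:
    user = event["userId"]
    user_label[user] = max(user_label.get(user, 0), event["churn_event"])

# Step 3: write the per-user label back onto every event of that user
for event in events:
    event["label"] = user_label[event["userId"]]

print(user_label)
```

The same pattern would apply to the Downgraded flag with a different page value.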

These features were then vectorized and scaled using VectorAssembler and StandardScaler, after omitting the userId and label columns.

from pyspark.ml.feature import VectorAssembler, StandardScaler

# Three ML algorithms will be used to build models, and F1 score will be used to optimize. These are
# 1. Logistic Regression, 2. Random Forest Classifier and 3. Gradient Boosted Trees
# First, assemble the numerical variables in model_data into a single feature vector,
# leaving out the label and the userId.
columns = [field.name for field in model_data.schema.fields
           if field.name not in ('label', 'userId')]
assembler = VectorAssembler(inputCols=columns, outputCol="num_features")
model_data = assembler.transform(model_data)

# Scale each feature to unit standard deviation using StandardScaler
scaler = StandardScaler(inputCol="num_features", outputCol="features", withStd=True)
scalerModel = scaler.fit(model_data)
model_data = scalerModel.transform(model_data)
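With withStd=True and the default withMean=False, StandardScaler divides each feature by its sample standard deviation without subtracting the mean. A minimal plain-Python sketch of that arithmetic for one made-up feature column (the values are invented for illustration):

```python
from statistics import mean, stdev

# Hypothetical values of a single feature across four users
values = [2.0, 4.0, 4.0, 6.0]

# Sample standard deviation (n - 1 in the denominator)
sample_std = stdev(values)

# Scale to unit standard deviation; the mean is NOT subtracted
scaled = [v / sample_std for v in values]

print("std before:", sample_std)
print("std after: ", stdev(scaled))
print("mean after:", mean(scaled))
```

The scaled column has unit standard deviation but a non-zero mean. Centering would require withMean=True, which produces dense vectors.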

The final vectorized, scaled feature dataset was then split into train, validation, and test sets to train and evaluate the models.

# Split the full dataset into train, test, and validation sets.
train, rest = model_data.randomSplit([0.6, 0.4], seed=42)
validation, test = rest.randomSplit([0.5, 0.5], seed=42)
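The two calls aim for roughly 60% train, 20% validation and 20% test, but randomSplit treats the weights as proportions for a random draw per row, so the resulting sizes are only approximate. The sketch below mimics that idea in plain Python. The random_split helper is hypothetical and is not Spark's implementation, so the exact sizes will differ from what Spark produces.

```python
import random

def random_split(rows, weights, seed):
    """Hypothetical helper: assign each row to one part by an independent random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total  # normalize weights into cumulative bounds
        bounds.append(running)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for idx, bound in enumerate(bounds):
            if draw < bound:
                parts[idx].append(row)
                break
        else:
            parts[-1].append(row)  # guard against floating-point rounding in bounds
    return parts

users = list(range(691))  # 691 users, as in the dataset
train, rest = random_split(users, [0.6, 0.4], seed=42)
validation, test = random_split(rest, [0.5, 0.5], seed=42)
print(len(train), len(validation), len(test))
```

With only a few hundred users, this variation means the validation and test sets can contain quite few churned users, which is worth keeping in mind when reading the metrics.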

Evaluate the Results

A Confusion Matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.

Accuracy measures how often the classifier makes the correct prediction. It’s the ratio of the number of correct predictions to the total number of predictions:

Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives)

Precision tells us what proportion of the cases we predicted as positive were actually positive. It is the ratio of true positives to all predicted positives:

Precision = True Positives / (True Positives + False Positives)

Recall (Sensitivity) tells us what proportion of the cases that were actually positive we predicted as positive. It is the ratio of true positives to all actual positives:

Recall = True Positives / (True Positives + False Negatives)

F-beta score is a metric that considers both precision and recall, with beta controlling how much more weight recall gets than precision; beta = 1 gives the F1 score:

F-beta = (1 + beta^2) * Precision * Recall / (beta^2 * Precision + Recall)

The dataset is imbalanced (99 churned users out of 691), which means accuracy is not very helpful: a model can reach high accuracy while its predictions of churn are poor. Precision and recall are usually recommended in these cases.
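A small worked example shows why accuracy can mislead on imbalanced data. The function below computes the metrics defined above from confusion-matrix counts. The counts themselves are invented for illustration (100 users, 15 of whom churn) and are not results from this project.

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute accuracy, precision, recall and F-beta from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    f_beta = (1 + beta_sq) * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f_beta": f_beta}

# Hypothetical model A: predicts "no churn" for everyone
always_negative = classification_metrics(tp=0, fp=0, tn=85, fn=15)

# Hypothetical model B: catches 10 of the 15 churners, with 5 false alarms
useful = classification_metrics(tp=10, fp=5, tn=80, fn=5)

print(always_negative)
print(useful)
```

Model A reaches 85% accuracy without identifying a single churner, while its precision, recall and F1 are all zero. Note that Spark's MulticlassClassificationEvaluator with metricName='f1' reports an F1 averaged over both labels and weighted by label frequency, so its values are not the same as the F1 of the churn class alone.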

Let’s compare the results of 3 models:

Logistic Regression
Random Forest
Gradient Boosted Trees

1. Logistic Regression

# 1. Logistic Regression
from time import time
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# initialize classifier, set evaluator and build paramGrid
lr = LogisticRegression(maxIter=10)
f1_evaluator = MulticlassClassificationEvaluator(metricName='f1')
paramGrid = ParamGridBuilder().build()
crossval_lr = CrossValidator(estimator=lr, evaluator=f1_evaluator, estimatorParamMaps=paramGrid, numFolds=3)

# Time the training of the model.
start_time = time()
cvModel_lr = crossval_lr.fit(train)
end_time = time()
cvModel_lr.avgMetrics
seconds = end_time - start_time

# Score the validation set with the fitted model
results_lr = cvModel_lr.transform(validation)

evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")

Logistic Regression Metrics:
Accuracy of model is : 0.7443609022556391
F1 score of model is :0.6724855617304131
The training process of model took 13.907336235046387 seconds

2. Random Forest

# 2. Random Forest Classifier
from pyspark.ml.classification import RandomForestClassifier

# initialize classifier, set evaluator and build paramGrid
rf = RandomForestClassifier()
f1_evaluator = MulticlassClassificationEvaluator(metricName='f1')
paramGrid = ParamGridBuilder().build()
crossval_rf = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=f1_evaluator, numFolds=3)

# Time the training of the model.
start_time = time()
cvModel_rf = crossval_rf.fit(train)
end_time = time()
cvModel_rf.avgMetrics
seconds = end_time - start_time

# Score the validation set with the fitted model
results_rf = cvModel_rf.transform(validation)

evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")

Random Forest Metrics:
Accuracy of model is : 0.7969924812030075
F1 score of model is :0.7585255822483037
The training process of model took 12.764495134353638 seconds

3. Gradient Boosted Trees

# 3. Gradient Boosted Trees
from pyspark.ml.classification import GBTClassifier

# initialize classifier, set evaluator and build paramGrid
gbt = GBTClassifier(maxIter=10, seed=42)
f1_evaluator = MulticlassClassificationEvaluator(metricName='f1')
paramGrid = ParamGridBuilder().build()
crossval_gbt = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=f1_evaluator, numFolds=3)

# Time the training of the model.
start_time = time()
cvModel_gbt = crossval_gbt.fit(train)
end_time = time()
cvModel_gbt.avgMetrics
seconds = end_time - start_time

# Score the validation set with the fitted model
results_gbt = cvModel_gbt.transform(validation)

evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")

Gradient Boosted Trees Metrics:
Accuracy of model is : 0.8270676691729323
F1 score of model is :0.799474962304081
The training process of model took 31.332529306411743 seconds
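Putting the three validation results reported above side by side makes the trade-off easier to see. The snippet below only re-uses the numbers printed above; it does not retrain anything.

```python
# Validation metrics exactly as reported above
results = {
    "Logistic Regression": {
        "accuracy": 0.7443609022556391, "f1": 0.6724855617304131, "seconds": 13.907336235046387},
    "Random Forest": {
        "accuracy": 0.7969924812030075, "f1": 0.7585255822483037, "seconds": 12.764495134353638},
    "Gradient Boosted Trees": {
        "accuracy": 0.8270676691729323, "f1": 0.799474962304081, "seconds": 31.332529306411743},
}

# Rank the models by F1 score, best first
ranked = sorted(results, key=lambda name: results[name]["f1"], reverse=True)
for name in ranked:
    m = results[name]
    print(f"{name:<24} f1={m['f1']:.3f}  accuracy={m['accuracy']:.3f}  train_time={m['seconds']:.1f}s")

best_model = ranked[0]
```

Gradient Boosted Trees ranks first on both F1 and accuracy, at the cost of the longest training time of the three.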

Refinement

Gradient Boosted Trees has the best F1 score, so I chose it for the next step: tuning its hyperparameters by letting ParamGridBuilder() and CrossValidator() search over a grid of candidate values.

# Optimizing hyperparameters of the Gradient Boosted Trees classifier
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.sql.functions import col

clf = GBTClassifier()
maxIter = [5, 10, 20]
maxDepth = [10, 20]
paramGrid = ParamGridBuilder().addGrid(clf.maxIter, maxIter).addGrid(clf.maxDepth, maxDepth).build()
crossval = CrossValidator(estimator = Pipeline(stages=[clf]),
                         estimatorParamMaps = paramGrid,
                         evaluator = MulticlassClassificationEvaluator(metricName='f1'),
                         numFolds = 3)

cvModel_gbt = crossval.fit(train)
predictions = cvModel_gbt.transform(test)

evaluator = MulticlassClassificationEvaluator(metricName='f1')
f1_score = evaluator.evaluate(predictions.select(col('label'), col('prediction')))
print('The F1 score is {:.3f}'.format(f1_score))

bestPipeline = cvModel_gbt.bestModel

The F1 score is 0.818
Best parameters : max depth:10, max Iter:5
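It helps to see how much work this search implies. The grid is the Cartesian product of the candidate values, and CrossValidator fits every combination once per fold before refitting the best combination on the full training set. A plain-Python sketch of that count, using the same candidate values as above (the variable names are mine):

```python
from itertools import product

max_iter_values = [5, 10, 20]
max_depth_values = [10, 20]
num_folds = 3

# Every (maxIter, maxDepth) combination, analogous to what ParamGridBuilder().build() returns
param_grid = [
    {"maxIter": max_iter, "maxDepth": max_depth}
    for max_iter, max_depth in product(max_iter_values, max_depth_values)
]

fits_during_cv = len(param_grid) * num_folds  # one fit per combination per fold
total_fits = fits_during_cv + 1               # plus the final refit of the best combination

for params in param_grid:
    print(params)
print("combinations:", len(param_grid), "total model fits:", total_fits)
```

Six combinations over three folds already mean 19 model fits, which is why the grid is kept small on a cluster with limited resources.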

Feature Importance

We then extract the feature importances from the final tuned model.

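import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
gbt_best_model = cvModel_gbt.bestModel.stages[0]  # tuned GBT model: the only stage of the best pipeline
Feature_Importance_Scores = gbt_best_model.featureImportances.toArray().tolist()  # dense, so scores align with columns
Feature_Importance_df = pd.DataFrame({'Feature_Importance_Scores': Feature_Importance_Scores, 'Features': columns})
plt.title('Features Importance Scores of Gradient Boosted Trees Model')
sns.barplot(x='Feature_Importance_Scores', y='Features', data=Feature_Importance_df, color="blue")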

We observe that register_duration (in days) and the average number of songs listened to per session are the two most important features for predicting churn.

Deploy

To train and deploy the model I used IBM Watson Studio. Its free tier lets you create a Spark environment with limited resources, which I used to explore, train and evaluate my model.

Create a Project with a Spark 2.4 and Python 3.7 Environment

The environment consists of an Apache Spark cluster with 1 Driver Node (1 vCPU and 4 GB RAM) and 2 Executor Nodes (1 vCPU and 4 GB RAM).

These nodes are running Apache Spark v2.4 and Python 3.7 and have an attached notebook that is used to explore, train and evaluate the model.

IBM Watson Project with Notebook and Apache Spark Environment

Conclusion

Our goal was to predict whether a user is going to cancel the service, so that the company can offer them deals or discounts to try to retain them. After cleaning the data and shaping it into a dataset ready for ML training, we tested the performance of three different models. Judged by F1 score, the best model is Gradient Boosted Trees. Despite the good results, the model could be improved by engineering more features that capture behavior patterns related to user satisfaction with the service. For example, is the recommendation engine good, meaning do the songs recommended to users really match their tastes? According to the GBT feature importances, register_duration and the average number of songs listened to per session are the most important features.

Note: Code can be found in this GitHub repository.