Authored by Alexander Held, Data Scientist, Der Spiegel, Germany.
Alexander Held will be the featured guest at one of the next monthly meet-ups of the WAN-IFRA Data Science Expert Group on 22 February 2023. Register here if you want to discuss this article further.
Several audience research projects at Der Spiegel suggest that the decision to take out a news subscription is not a sudden, instantaneous action. Instead, it is an informed decision shaped by our readers' experience and behaviour over time.² This applies at least to visitors who outlast our free trial phase and are not solely interested in the content of a specific paywalled article.
Engagement features are well suited to building a subscription prediction model. We therefore developed a machine learning approach that predicts lasting subscriptions and enables us to segment readers according to their affinity to subscribe.
To outline the machine learning system we have built at Der Spiegel, this article walks through the Data & Features (1), the Model (2), the Evaluation (3), the ML Operations (4), and the Next Steps (5).
Data & Features
As we plan to train a machine learning model to make predictions, we need a data source to learn from. Based on spiegel.de's Adobe Data Feed, the raw website tracking data from spiegel.de, usage features are therefore derived at the device level. We currently compute four classes of input features:
- Engagement: web_visit, app_visit, avg_time_spent_article, avg_time_spent_article_across_sessions, mean_articles_read, mean_visit_duration, number_articles_read, paywall_loads, total_visit_duration, visit_subscription_page, number_of_visits
- Location: federal states
- Referrer: aggregated URLs
- Editorial: department, type (text, audio, video) and format (op-ed, investigative, …)
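To make the feature classes above concrete, here is a minimal, hypothetical sketch of how raw hit-level tracking rows could be aggregated into device-level engagement features with pandas. The column names and toy data are illustrative only and do not reflect Der Spiegel's actual Adobe Data Feed schema.

```python
import pandas as pd

# Toy stand-in for raw tracking data: one row per hit, keyed by device
hits = pd.DataFrame({
    "device_id": ["a", "a", "a", "b", "b"],
    "is_article": [1, 1, 0, 1, 0],
    "time_spent": [120, 60, 10, 30, 5],
    "paywall_load": [0, 1, 0, 0, 0],
})

# Aggregate to one row per device
features = hits.groupby("device_id").agg(
    number_articles_read=("is_article", "sum"),
    paywall_loads=("paywall_load", "sum"),
    number_of_visits=("device_id", "size"),  # simplification: one hit = one visit
).reset_index()

# avg_time_spent_article is computed over article hits only
article_time = hits[hits["is_article"] == 1].groupby("device_id")["time_spent"].mean()
features["avg_time_spent_article"] = features["device_id"].map(article_time).fillna(0)
```

In the real pipeline, such aggregations run over the full lookback window per user rather than a handful of toy rows.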
A long backlog of further potential features already exists, but the first step of development was primarily about implementing an end-to-end ML pipeline. To paraphrase the fourth rule of the Google ML Guide: Keep the first model simple and get the infrastructure right. The first model provides the biggest boost to your product, so it doesn’t need to be fancy. But you will run into many more infrastructure issues than you expect.³
However, building suitable features is only one part of feature engineering. Because customer journey data is involved, we also set sliding windows within our user journeys: a lookback and a lookahead window for training data, whereas data for inference only has the former. Accordingly, the features mentioned above are calculated over the last 30 days per user. Each training record also carries information on whether the user took out a subscription, together with the maximum number of days the subscription was used after conversion.
For this purpose, a maximum of 40 days from purchase is considered. Together with our lookback window, this gives a total sliding window of 70 days for training. This approach keeps the model flexible: a retention threshold can be set so that training only considers subscriptions as a target value if the user was still using the subscription X days after the actual order. This makes it possible to train the model to predict long-term subscriptions only, where readers stay with us beyond the trial phase. To be precise, we currently train the model on purchases where readers still use their subscription at least 35 days after conversion. Journeys that do not meet these criteria are removed from training.
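The windowing and labelling logic above can be sketched as a small helper. The constants mirror the 30/40/35-day values described in the text; the function and parameter names are illustrative, not the actual pipeline code.

```python
from datetime import date

LOOKBACK_DAYS = 30    # features come from the 30 days before the reference date
LOOKAHEAD_DAYS = 40   # observe up to 40 days after conversion
RETENTION_DAYS = 35   # positive label only if the subscription is still used 35+ days later

def label_journey(conversion_date, last_active_date, reference_date):
    """Return 1 for a lasting subscription, 0 otherwise,
    or None if the lookahead window is not yet over."""
    if conversion_date is None:
        return 0  # no subscription at all
    days_observed = (reference_date - conversion_date).days
    if days_observed < LOOKAHEAD_DAYS:
        return None  # too early to judge retention; drop from training for now
    days_retained = (last_active_date - conversion_date).days
    return 1 if days_retained >= RETENTION_DAYS else 0
```

For example, a user who converted on 1 January and was still active on 10 February (40 days later) would be labelled positive, while a user last active after 19 days would not.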
Many other journeys are also discarded on a large scale before the model is trained. This is necessary to achieve adequate predictive quality for the model's purpose: predicting lasting subscriptions and identifying potentially loyal customers. The main reason for removing users from training is that many journeys consist of very short and/or very few interactions, including journeys that led to a subscription.
These data quality problems stem from the absence of a login requirement for non-subscribers and the low data quality at the cookie/device level. As a result, at least 70% of user journeys that led to a conversion are excluded from model training because they lack sufficient quality.
Even larger numbers of user journeys that did not lead to a subscription are removed from the training dataset for the same reason: they contain only very short and/or few interactions. These circumstances are among the biggest challenges of this ML task.
Training our model on only a subset of positive cases naturally impacts inference because the same short and incomplete journeys are also present in the classification dataset.
This is important to keep in mind if the model is validated on unseen data without subsetting it according to the training data. In that case, we end up with a so-called train-serving skew, where the unseen data comes from a different distribution, so the model does not generalise correctly.
For this reason, our model does not perform well at predicting the total number of daily subscriptions, which was expected and is not its goal. Instead, it is accurate at predicting explicitly lasting subscriptions and, even more importantly, at identifying potential customers.
Finally, it is important to note that our ML task at hand is a highly imbalanced binary classification problem: far more users did not subscribe than journeys that ended in a subscription. Therefore, we downsample negative cases, once more removing large numbers of journeys from the training data.
We currently see the best performance when we balance the training data at a ratio of one to nine: for every journey with a subscription, we keep nine journeys without one.
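The one-to-nine downsampling step could be implemented along these lines; this is a minimal sketch with an assumed `subscribed` label column, not the production code.

```python
import pandas as pd

def downsample_negatives(df, label_col="subscribed", ratio=9, seed=42):
    """Keep every positive journey and at most `ratio` negatives per positive."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_neg = min(len(neg), ratio * len(pos))
    neg_sample = neg.sample(n=n_neg, random_state=seed)
    # Shuffle so positives and negatives are interleaved
    return pd.concat([pos, neg_sample]).sample(frac=1, random_state=seed)

# Toy example: 2 positive journeys and 100 negative ones -> 2 + 18 rows remain
journeys = pd.DataFrame({"subscribed": [1] * 2 + [0] * 100})
balanced = downsample_negatives(journeys)
```

Fixing the random seed keeps the sampled training set reproducible between runs.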
Model

Do we need ML to better identify potential customers and predict lasting subscriptions? ML solutions are only helpful when there are complex patterns to learn. The question is therefore the same as with every other ML system: is the pattern complex and very challenging to specify manually? If so, instead of telling our system how to calculate the affinity to subscribe from a list of characteristics, we let our ML system figure out the pattern itself.⁴
Let’s paraphrase Google’s ML Guide: “A simple heuristic can get your product out the door. A complex heuristic is unmaintainable. Once you have data and a basic idea of what you are trying to accomplish, move on to machine learning. As in most software engineering tasks, you will want to be constantly updating your approach, whether it is a heuristic or a machine-learned model, and you will find that the machine-learned model is easier to update and maintain (see Rule #16).”
For our approach, a random forest model is therefore trained on the features described earlier. The model is then used to give every user a model score between 0 and 100: the higher the value, the higher the user's affinity to subscribe.
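Turning class probabilities into a 0-100 affinity score can be sketched as follows. The synthetic data stands in for the real engagement features, which cannot be shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the journey features (~10% positives)
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X, y)

# Probability of the positive class, scaled to an affinity score from 0 to 100
scores = (clf.predict_proba(X)[:, 1] * 100).round().astype(int)
```

Scaling `predict_proba` rather than using the hard `predict` output preserves the ranking information needed for segmenting readers by affinity.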
The training method used is sklearn.ensemble.RandomForestClassifier, with a training time of less than three hours, of which cross-validation and feature selection take up the most. We went through a model experimentation phase, during which we discarded the following approaches due to insufficient performance uplifts:
- Random Forest outperformed XGBoost
- SMOTE with random undersampling of the majority class did not create any uplift to the model
- RFECV (Recursive feature elimination with cross-validation) outperforms other feature selection methods like SelectKBest() or SelectFromModel()
- The basic RandomForestClassifier() outperforms RandomForestClassifier(class_weight='balanced_subsample') and BalancedRandomForestClassifier()
We applied recursive feature elimination in a cross-validation loop to find the optimal number of features and estimate the importance of individual ones. As a result, only features from the engagement category were considered relevant during this process; none of the location, referrer, or editorial features showed much predictive power.
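A minimal RFECV setup looks roughly like this; the estimator sizes, fold count, and scoring choice here are illustrative, not the exact production configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 10 candidate features, only 4 of them informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    cv=StratifiedKFold(3),
    scoring="f1",              # optimise for the minority (subscriber) class
    min_features_to_select=1,
)
selector.fit(X, y)

# selector.support_ is a boolean mask marking the retained features
kept = selector.support_
```

`RFECV` repeatedly drops the weakest features and keeps the subset with the best cross-validated score, which is how the engagement-only feature set described above was arrived at.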
Evaluation

With these features selected, the current model in production performed as follows during cross-validation on training data: precision 0.71, recall 0.41, F1 score 0.54, and accuracy 0.93.
Looking at the confusion matrix, the model can easily identify most journeys that did not lead to a subscription. Identifying around forty per cent of actual conversions (recall) also seems acceptable. Nevertheless, it is important to note that we removed positively labelled journeys with few interactions before training, as described in the Data & Features section.
For this reason, our recall is only assessable under these circumstances. Furthermore, a precision of 71% also seems satisfying, especially considering how useful false positives can be: these users behave like potential subscribers, so they would be worthwhile targets for specific campaigns.
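The cross-validated metrics reported above can be reproduced in structure (not in value, since the real data is not public) with scikit-learn's `cross_validate` on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data standing in for the real journey features
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

cv_results = cross_validate(
    RandomForestClassifier(random_state=0), X, y,
    cv=5, scoring=["precision", "recall", "f1", "accuracy"],
)

mean_scores = {m: cv_results[f"test_{m}"].mean()
               for m in ["precision", "recall", "f1", "accuracy"]}
```

On imbalanced data, accuracy alone is misleading (always predicting "no subscription" already scores ~0.9 here), which is why precision, recall, and F1 are reported alongside it.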
Apart from using a threshold to turn the predicted model scores into a binary classification, where only journeys with a score > 0.5 count as successful subscriptions (as we did for the confusion matrix above), we use the model scores as actual probabilities of how likely a user is to subscribe with us.
We verified that the classifier is well calibrated by putting the predicted model scores into buckets and comparing each bucket with the proportion of its predictions that were actual subscriptions. This proportion rises for higher buckets, so the probability is roughly in line with the average prediction for that bucket.⁵ Note that buckets with higher model scores contain significantly fewer users.
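This bucketed comparison is exactly what scikit-learn's `calibration_curve` computes. A minimal sketch on synthetic scores (constructed to be perfectly calibrated, so the two curves agree):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Hypothetical predicted probabilities, and labels drawn so that a score of p
# really converts with probability p (i.e. perfectly calibrated by construction)
y_prob = rng.uniform(0, 1, 2000)
y_true = (rng.uniform(0, 1, 2000) < y_prob).astype(int)

# frac_pos: actual conversion rate per bucket; mean_pred: average score per bucket
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
```

For a well-calibrated model, `frac_pos` tracks `mean_pred` closely in every bucket; large gaps indicate that the raw scores should not be read as probabilities.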
ML Operations

Our model is retrained monthly, using all historical training data. It therefore becomes more accurate over time, as it takes more input data from a longer period into account. Each trained model is versioned, and the training dataset and feature selection are saved alongside it. In addition, an evaluation document is automatically created for each model, containing cross- and hold-out validation with the most important metrics such as precision and recall.
We do not use tools like DVC or MLflow for versioning or experiment tracking but have the whole project wrapped up in a Django app. This enables us to use management commands to control and serve most of the daily workflows, like data pre-processing and feature engineering, building the classification dataset, serving classifications via the model, uploading model scores to Adobe, and running monitoring jobs.
On a monthly cadence, we also trigger workflows to build the training dataset and train a new model. Furthermore, Django's ability to initialise modules helps us capture logs and monitoring data at different levels, from raw tracking data to derived features and predictions.
The model currently in production serves predictions in batch mode: all users who visited our website the previous day get new or updated scores. The project scope included making the resulting model scores available in Adobe Analytics daily, for accessible analysis by various stakeholders.
From Analytics, we then send segments of users with high scores to Adobe Target daily, so we can personalise our product based on the model scores. For this purpose, we use the Adobe Data Sources API to attach scores at the user-session level.
Finally, all components of the ML system run in Azure. Raw data from the Adobe Data Feed is stored as binary Parquet files in Azure Blob Storage, and data processing tasks are written in PySpark to take advantage of distributed batch processing.
For computing infrastructure, we rely on a mid-sized virtual machine, and for code repositories and continuous integration we use Azure DevOps. Development is done in VS Code, as it integrates smoothly with our Azure ecosystem.
Next steps

- Data distribution shift: News and user taste fluctuate. What’s trendy today might be old news tomorrow.⁶ We have already built features to capture editorial department usage, but they don’t show much predictive power yet. In building these features, we might need to account for fast and substantial shifts in the training data, which also means rethinking appropriate timings for retraining and feature engineering.
- Return on investment: Anecdotal reports claim that 90% of ML models don’t make it to production, and others claim that 85% of ML projects fail to deliver value.⁷ With our model now in production, we have just started searching for suitable approaches and campaigns to evaluate whether it was worth the investment.
- Targeted surveys: Using our model scores to target visitors with specific surveys, to find out what motives and attitudes separate them, can be very insightful. This is extremely relevant if you consider the fraction of variance that cannot be explained by our usage-behaviour features but rather by more qualitative factors, like the perceived price or added value of the subscription. In another Medium post, I wrote about how we combine user research and data science methods at Der Spiegel.
- Precision-recall trade-off: We see potential in adjusting the probability threshold in favour of precision or recall. In particular, a higher number of false positives could be an interesting segment to work with, as these users naturally behave like visitors who typically subscribe with us.
- Switch to continuous predictions: Instead of a binary classifier, we could reframe the overall ML task as a regression problem, predicting the number of days a user will keep their subscription.
- Compare ML and RFV models: Only engagement features survived the recursive feature elimination process, and none of the location, referrer, or editorial features seem to have much predictive power. This raises the question of how machine learning-based scores actually differ from bespoke engagement scorings, such as the RFV models at the Financial Times.
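The precision-recall trade-off mentioned above can be explored with scikit-learn's `precision_recall_curve`. A minimal sketch on synthetic data, using one simple heuristic (the lowest threshold that still reaches a target precision):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real journey features
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
precision, recall, thresholds = precision_recall_curve(
    y_te, clf.predict_proba(X_te)[:, 1]
)

# Pick the lowest threshold that still yields, say, >= 80% precision;
# fall back to the default 0.5 if no threshold qualifies
candidates = [t for p, t in zip(precision[:-1], thresholds) if p >= 0.8]
chosen = min(candidates) if candidates else 0.5
```

Lowering the chosen threshold trades precision for recall, which is how a deliberately larger false-positive segment for campaigns could be carved out.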
⁴ ⁶ Designing Machine Learning Systems (by Chip Huyen)
² Time-Aware Subscription Prediction Model for User Acquisition in Digital News Media (by H. Davoudi, M. Zihayat & A. An)
³ Rules of Machine Learning: Best Practices for ML Engineering (by Google)
⁵ Are Model Predictions Probabilities? (by Google)
⁷ Operationalizing Machine Learning: An Interview Study (by S. Shankar, R. Garcia, J. M. Hellerstein & A. G. Parameswaran)