Machine learning can make a lot of things better. But as everyone who deals with artificial intelligence knows, a model is only as good as the data used to train it.
At F-Secure, we’re not only applying machine learning to the cyber security domain, we’re using it to help improve the relationships that we and our service provider partners have with the customers we serve.
What we’ve learned about churn
Keeping customers is cheaper than acquiring new ones. According to the White House Office of Consumer Affairs, “it is 6–7 times more expensive to acquire a new customer than to retain an old one.” This is why many companies invest in machine learning models that predict churn: models that help identify the customers who are about to leave and the reasons why, and that support suitable retention strategies as well as improvements to services and customer satisfaction. Such models help companies act before customers leave rather than react afterwards.
For the past few years, I have been working on building churn prediction models using machine learning. In each project, the data set has been different, capturing different perspectives of the customer, which leads to models that perform quite differently. As many people in this field know, the data used to train machine learning algorithms is the crucial part, and unfortunately, the data that companies have is sometimes not sufficient for predicting which customers will churn.
“When a model fails to predict something, it’s because the information used to train it lacks predictive power,” Max Pagel explained. “Having a fancy or complex algorithm wouldn’t help either if you do not have the right data.”
In this post, I look at the data points that anyone thinking about building an accurate churn prediction model should collect.
How to train your churn model
The first step, of course, is to define what churn means. Is it that an end user uninstalls a product, stops making payments, cancels a subscription, does not renew, or something else? When that part is clear, the second step is to collect data that captures, among other things, who this user is and how they interact with the product and the company. The data should reflect the reality of the end users and customers, because that is what models are: abstractions of reality.
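To make the first step concrete, here is a minimal sketch of one possible churn definition: a subscription counts as churned if it was not renewed within a grace period after its end date. The `Subscription` fields, the `churn_label` function, and the 30-day grace period are illustrative assumptions, not a definition taken from any particular project.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Subscription:
    # Hypothetical record layout, for illustration only.
    customer_id: str
    end_date: date                # last day of the current paid period
    renewed_on: Optional[date]    # renewal date, or None if no renewal seen


def churn_label(sub: Subscription, as_of: date, grace_days: int = 30) -> Optional[bool]:
    """Return True (churned), False (retained), or None (too early to tell).

    Assumes the data only contains renewals observed up to `as_of`.
    """
    deadline = sub.end_date + timedelta(days=grace_days)
    if sub.renewed_on is not None and sub.renewed_on <= deadline:
        return False  # renewed in time
    if as_of > deadline:
        return True   # grace period has passed without a timely renewal
    return None       # grace period still running, outcome unknown
```

Returning `None` for customers whose grace period is still running keeps undecided cases out of the training labels instead of silently counting them as retained.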
In my experience, the best way to identify the data needed is to look at that reality and capture as much of it as possible while respecting consumer privacy. Consumers need to be made aware of what data is being collected and to give their consent to its collection.
With the above in mind, the reality could be captured by asking the following six questions:
Who are our customers, really?
Answering this question gives you information such as the age, gender, residence, occupation, income, ethnicity, education, etc. of your customers. Many companies might already have this information, which is mostly used for segmentation. Besides predicting churn, this type of information allows you to personalize services and engagement, which also helps retain customers.
Which product(s) do the customers have?
For each customer, you need to know which of your products they have; whether they got them for free, with a promotion, or at full price; whether as a standalone product or in a bundle with other products; how many licenses they bought, on which platform, and through which channel; whether they pay annually or monthly; at what price; and so on.
How do customers use our product?
Getting the answer to this question is very important, because customers stay with companies or products where they see value. The answer provides an overview of customer behavior, e.g., how often they log in, the time spent with the product per session, which features they access, and the number, type, and frequency of errors they encounter. For digital products, this type of data is easier to collect because the collection can be built into the product.
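As a rough illustration of how such behavioral data could be condensed, the sketch below aggregates per-session records into one usage summary per customer. The record keys (`customer_id`, `minutes`, `features`, `errors`) and the function name `summarize_usage` are hypothetical; real product telemetry will look different.

```python
from collections import defaultdict


def summarize_usage(sessions):
    """Aggregate per-session records into one usage summary per customer.

    Each record is a dict with hypothetical keys:
    customer_id (str), minutes (number), features (list of str), errors (int).
    """
    totals = defaultdict(
        lambda: {"sessions": 0, "minutes": 0.0, "features": set(), "errors": 0}
    )
    for s in sessions:
        t = totals[s["customer_id"]]
        t["sessions"] += 1
        t["minutes"] += s["minutes"]
        t["features"].update(s["features"])  # track distinct features used
        t["errors"] += s["errors"]

    return {
        cid: {
            "sessions": t["sessions"],
            "avg_minutes": t["minutes"] / t["sessions"],
            "distinct_features": len(t["features"]),
            "errors_per_session": t["errors"] / t["sessions"],
        }
        for cid, t in totals.items()
    }
```

Each summary row can then serve as part of the input for a churn model, alongside the answers to the other questions.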
How are customers engaged?
While this question might seem similar to the one above, it captures something different: the engagement between the company and the customer, online or offline. Highly engaged customers often also demonstrate more loyalty. This can include information such as emails or newsletters sent, events hosted, etc.
How do customers feel about our products and us?
This question gives you data that captures how your customers feel about your products and their user experience, for instance from ratings, feedback, social media posts, or customer care interactions. Negative sentiments, emotions, or opinions from customers are strong indicators of churn. However, this information is not available for all customers.
What other external factors influence our customers?
Sometimes customers just leave for reasons beyond your control. For example, many customers have been adversely affected by Covid-19 and have unsubscribed or cancelled several services. There are also other external factors such as new competitors in the market, network coverage, etc., that might drive your customers away. Capturing this information is difficult but if possible, it is valuable to a prediction model.
Respect the data and users’ privacy
The above six questions can give you a vast amount of information to help you identify the customers that stay or leave and, most importantly, understand why.
With the available data, it is then the task of the data scientist to use all or some of this information for feature engineering: capturing trends and patterns over time that might be informative for churn modeling, such as whether a customer increased the number of features they interact with or decreased their login activity. When companies act on this information, customers tend to stick around for the best possible reason: their needs are being met by the service.
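One simple way to turn raw activity into a trend feature is to compare a recent time window with the one before it. The sketch below does this for login activity; the function name `login_trend`, the 30-day window, and the output keys are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def login_trend(login_dates, as_of, window_days=30):
    """Compare login counts in the most recent window with the window before it.

    login_dates: iterable of datetime.date values, one per login.
    """
    recent_start = as_of - timedelta(days=window_days)
    previous_start = recent_start - timedelta(days=window_days)

    # Windows are half-open on the left: (start, end]
    recent = sum(1 for d in login_dates if recent_start < d <= as_of)
    previous = sum(1 for d in login_dates if previous_start < d <= recent_start)

    return {
        "recent_logins": recent,
        "previous_logins": previous,
        "delta": recent - previous,
        "decreased": recent < previous,
    }
```

The same window comparison could be applied to other signals mentioned above, such as the number of distinct features a customer uses.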
Capturing and recording this information requires companies to have a data strategy and to invest in data architectures that provide both long-term and short-term storage, so that historical and current analyses can be performed. Naturally, privacy concerns will limit data collection or the ability to connect all of these data points. However, wherever possible, aim to collect as much data as customers knowingly and willingly share, which will give you a model that closely reflects reality.