This post is about one of the most important topics in data science: Feature Engineering! More specifically, how to create great customer metrics for understanding churn, using feature engineering techniques.
If you are operating any kind of product or service with repeated interactions with users or customers, then you are probably collecting data about those interactions in some kind of data warehouse. It is common to refer to such interactions as events for short. That's because interactions tracked in a data warehouse invariably have a time stamp telling you when each one happened. But this post is not about how to collect that data; rather, it’s about how to put that data to good use. Because if you aren’t using the data you are collecting, why are you collecting it?
Here's how it fits into the overall picture for fighting churn with data:
Why Churn Feature Engineering?
In my previous post on calculating churn rates, you learned how to measure churn (that's the "Churn Metrics" in the diagram).
If you want to reduce churn, the first step in putting the raw data to use is turning the event data into a set of measurements. These measurements summarize the events and collectively produce a profile of the users’ behaviors. These measurements are often called customer metrics or just metrics for short. Turning related events into measurements is necessary because each event is like one tiny dot in a big picture, and by itself one event usually doesn’t mean much.
But while we need to zoom out from each individual interaction, we are not going to zoom out very far - not yet, anyway. Each measurement is taken individually for each customer, and repeatedly over their lifetime as a customer. That’s because user engagement and churn are dynamic processes for each individual. You need to watch how those metrics change over subscribers' lifetimes in order to fight churn with data.
People trained in data science, machine learning or statistics call this area “feature engineering”, and that's why this post is titled feature engineering customer metrics for churn.
Basic Churn Feature Engineering with Customer Metrics
The process of using feature engineering to create customer metrics is illustrated in the drawing below: For each user you have a series of events. Only the series for a single event is shown: Logins. In typical scenarios there will be many different types of events, and the events can be at any time; for some types of events there can even be multiple events at the same time.
In order to compare subscribers, a metric defines fixed time windows and counts (or otherwise measures) the events in each window. In the example the windows are defined as consecutive four week periods. The metrics are calculated on the day after each period ends so that the observation of events is complete. For example, on the 29th of January you can calculate the number of logins per subscriber in the four weeks covering January 1st-28th. Then on February 26th you can measure the number of logins for the four weeks from January 29th-February 25th, etc.
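The window arithmetic above can be checked with a few lines of Python. This is only a sketch: the names `WINDOW` and `window_for` are made up for illustration, and the window is assumed to be half-open, running from 28 days before the calculation date up to but not including the calculation date itself.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=28)  # four weeks


def window_for(calc_date):
    """Return (start, end) of the four week window measured on calc_date.

    The window is half-open: start <= event day < end, where end is the
    calculation date itself (the day after the window closes).
    """
    return calc_date - WINDOW, calc_date


# Three consecutive, non-overlapping windows starting with the one
# measured on January 29th, 2017.
first_calc = date(2017, 1, 29)
for k in range(3):
    calc = first_calc + k * WINDOW
    start, end = window_for(calc)
    last_day = end - timedelta(days=1)
    print(f"measured on {calc}: covers {start} through {last_day}")
```

The first two lines printed correspond to the January 1st-28th and January 29th-February 25th windows described above.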
Why Use Weekly Measurements for Customer Metric Churn Features
You are probably wondering why I use measurements calculated over four week periods and not over whole months. Weeks are important because nearly all human activities follow a weekly cycle. So the events in your data warehouse probably also follow a weekly cycle: If your product is something that people use for work then most events occur on Monday thru Friday, and Saturday and Sunday have few events. On the other hand, if your product is something people use for leisure, like watching videos, then most events are on Friday thru Sunday, and Monday thru Thursday will be slower. For short observation windows (like a month) it's better to measure metrics over time periods that are a whole number of weeks, so that every window contains the same number of each day of the week and every calculation is exactly comparable.
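To see why calendar months are not exactly comparable, here is a small Python sketch (the helper `weekend_days` is a made-up name for illustration) that counts the Saturdays and Sundays in a few calendar months of 2017 and in several 28 day windows.

```python
import calendar
from datetime import date, timedelta


def weekend_days(start, n_days):
    """Count Saturdays and Sundays in the n_days beginning at start."""
    return sum(
        1
        for i in range(n_days)
        # weekday() returns 5 for Saturday and 6 for Sunday
        if (start + timedelta(days=i)).weekday() >= 5
    )


# Calendar months: the number of weekend days depends on the month.
for month in (1, 2, 3, 4):
    n_days = calendar.monthrange(2017, month)[1]
    n_weekend = weekend_days(date(2017, month, 1), n_days)
    print(f"2017-{month:02d}: {n_days} days, {n_weekend} weekend days")

# 28 day windows: always exactly 4 Saturdays and 4 Sundays,
# no matter which day the window starts on.
for offset in range(7):
    start = date(2017, 1, 1) + timedelta(days=offset)
    print(f"28 days from {start}: {weekend_days(start, 28)} weekend days")
```

For a product with a strong weekday or weekend pattern, a month with more weekend days would show a different event count even if the customer's behavior did not change at all.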
Churn Feature Engineering: Customer Metrics with SQL
A simple SQL select that calculates one customer metric like this on an event table with columns for an account id, an event time and event type is shown below.
    with calc_date as (
        -- measure on the day after the four week window closes
        select '2017-01-29'::timestamp as the_date
    )
    select account_id, count(*) as n_login
    from event e
    inner join calc_date d
        -- window covers January 1st through January 28th inclusive
        on e.event_time < d.the_date
        and e.event_time >= d.the_date - interval '28 day'
    where e.event_type_id=1
    group by account_id;
However, the simple approach to feature engineering customer metrics shown above has a problem: If you literally follow this approach you only update the metric once every four weeks which is not very dynamic. A lot can happen in four weeks. If you are going to reduce churn you will probably have to move faster. The drawing below illustrates the solution: Repeat the four week measurements at more frequent intervals, in this case weekly. As shown, the resulting four week windows overlap. You can also see that the measurement will gradually track between the monthly measurements made with the simple approach. For subscriber 1 the first couple of four week non-overlapping measurements were 2, 4, 2. The overlapping measurements include intermediate points where the value was 3, representing the transition period.
The SQL below demonstrates how to implement the staggered metric concept illustrated above:
    with date_vals AS (
        select i::timestamp as metric_date
        from generate_series('2017-01-29', '2017-04-16', '7 day'::interval) i
    )
    select account_id, metric_date, count(*) as n_login
    from event e
    inner join date_vals d
        on e.event_time < metric_date
        and e.event_time >= metric_date - interval '28 day'
    where e.event_type_id=18
    group by account_id, metric_date
    order by account_id, metric_date;
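The same staggered calculation can be sketched in Python to see how the overlapping values move. This is an illustration, not a replacement for the SQL: the function `staggered_counts` and the login times are hypothetical, and the times were chosen so that the non-overlapping four week counts come out as 2, 4, 2.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=28)  # length of each measurement window
STEP = timedelta(days=7)     # how often the measurement is repeated


def staggered_counts(event_times, first_date, last_date):
    """Count events in the 28 days before each weekly measurement date.

    Uses the same condition as the SQL join:
    metric_date - 28 days <= event_time < metric_date.
    Dates with no events are omitted, as they are by the inner join.
    """
    counts = {}
    metric_date = first_date
    while metric_date <= last_date:
        n = sum(1 for t in event_times if metric_date - WINDOW <= t < metric_date)
        if n > 0:
            counts[metric_date] = n
        metric_date += STEP
    return counts


# Hypothetical login times for one account.
logins = [
    datetime(2017, 1, 10, 9, 0),
    datetime(2017, 1, 24, 9, 0),
    datetime(2017, 1, 31, 9, 0),
    datetime(2017, 2, 7, 9, 0),
    datetime(2017, 2, 14, 9, 0),
    datetime(2017, 2, 21, 9, 0),
    datetime(2017, 3, 7, 9, 0),
    datetime(2017, 3, 21, 9, 0),
]

weekly = staggered_counts(logins, datetime(2017, 1, 29), datetime(2017, 3, 26))
for metric_date, n_login in weekly.items():
    print(metric_date.date(), n_login)
```

With this data every fourth measurement (January 29th, February 26th, March 26th) reproduces the non-overlapping values, and the weekly measurements in between pass through 3 on the way up and on the way down.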
Other Types of Customer Metrics
So far we have only looked at customer metrics for churn that are simple counts of events. But whenever events have data in additional fields you will probably want to summarize that data in the metric. The most typical case is when an event has a numeric value associated with it. Common examples are:
- The event has a duration in time, such as the length of a session or playback of some media
- The event has a monetary value, such as a retail purchase or an overage charge
In such scenarios the most common metrics are either:
- The total value of all the events
- The average value per event
Either one of these, and many others, can be calculated with SQL very similar to that shown above, and I will save the details for the book. That's already plenty for this post! Now you know the basics of calculating behavioral metrics on users when you are Fighting Churn with Data.
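As a rough illustration of the total and average metrics, here is a Python sketch over the same kind of four week window. Everything in it is an assumption made for the example: the function `value_metrics`, the layout of each event as (account id, event time, value), and the session durations in minutes. In SQL the analogous change would be to aggregate a value column with sum or avg instead of using count(*).

```python
from collections import defaultdict
from datetime import datetime, timedelta


def value_metrics(events, metric_date, window_days=28):
    """Total and average event value per account in one window.

    events is an iterable of (account_id, event_time, value) tuples.
    The window is metric_date - window_days <= event_time < metric_date.
    """
    start = metric_date - timedelta(days=window_days)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for account_id, event_time, value in events:
        if start <= event_time < metric_date:
            totals[account_id] += value
            counts[account_id] += 1
    return {
        account_id: {
            "total": totals[account_id],
            "average": totals[account_id] / counts[account_id],
        }
        for account_id in totals
    }


# Hypothetical session events: (account_id, event_time, minutes)
events = [
    (1, datetime(2016, 12, 15, 9, 0), 45.0),  # before the window, ignored
    (1, datetime(2017, 1, 5, 10, 0), 30.0),
    (1, datetime(2017, 1, 20, 18, 0), 60.0),
    (2, datetime(2017, 1, 12, 12, 0), 20.0),
]

metrics = value_metrics(events, datetime(2017, 1, 29))
for account_id, m in sorted(metrics.items()):
    print(account_id, m["total"], m["average"])
```

Accounts with no events in the window do not appear in the result, which is the same behavior as the inner join in the SQL above.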
The Most Important Metrics
With all this discussion of types of events you are probably already wondering which types are the most important. There is no definitive answer for this, but I can give some general guidelines. And just to be clear: Figuring out which events are the most important is one of the main points! That's why you should do the analyses described on this site and in the forthcoming book, because it is different for every product or service. So this is just a preview of things I explain in depth later.
The bottom line is this: the most important events are the events that are closest to the customer achieving the goal or purpose of the service. That’s vague, but some examples make it clear:
Software products usually have some goal, for example writing documents. So creating documents is more important than just logging in. In general login events are much less important than the events that are directly involved with achieving the goals.
Many B2B software products are used for making money. So if there is any way to measure how much money is likely to be made from the events, then those are the most important. For example, if a product is a Customer Relationship Management system (CRM) used to track sales deals, then closed deals and their value are probably the most important type of event. Often a system is not that close to the money customers make. In that case you should focus on other key metrics that show the customers are using the system as intended. For example, in cloud services the events that capture computation or data handled on behalf of business customers show how much the businesses are using. Naturally, the provider hopes those are profitable endeavors.
For most media services, the purpose is to enjoy the media. So playing the content is generally important. But more specifically, the most important events are indicators of enjoying the content: like watching the whole thing, giving it a like, or sharing it. You can never directly measure enjoyment, because it's a subjective state of the users.
For a dating service the purpose is to go on dates, so actual meetings are probably more important than things like searching, viewing profiles or online interactions. That presents another type of challenge because measuring success on the service is well defined, but the actual meetups may be arranged through means outside of the service such as phone or text.
For gaming, the purpose is to have fun. Just like with media, subjective feelings are hard to measure, so the most important events may be things like achieving scores and levels, or social interactions with friends.
There are many important caveats that go along with this point and we have noted just a few, but the fact remains:
TAKEAWAY If you want to fight churn with data, look for events that are as close as possible to the value created by the service, even when that value cannot be measured directly.
My book and the rest of the posts on this site are all about bringing rigor to this simple intuition. This post has not gotten very far into how you actually use these customer metrics for reducing churn, so for that check out the details, which are available in chapter 3 of the electronic edition of my book, Fighting Churn With Data.