Lots of people talk about churn, but not everyone knows how to calculate it! In this post I’ll show you how to calculate churn with SQL for a typical subscription database.
Why Calculate Churn with SQL?
The usual scenario is that subscription data is stored in a database, and we need to figure out which accounts churned and what the churn rate is. There are a few things that make this hard: Part of the challenge is complexity, and another challenge is logistical.
The logistical challenge of calculating churn has two aspects: First, the data is sensitive. For any subscription product and service the subscription database is one of its most valuable assets. The subscription database usually contains Personally Identifiable Information (PII) of the subscribers and sensitive financial information for the company. So this is not data you want to extract and leave lying around in un-secure locations. The second pitfall is one that comes with success: If your product or service is successful, this data is big. For both of these reasons taking the data out of the database for processing is not a good idea.
So the problem is that dynamic logic is necessary to calculate churn but we’d be better off if we can do what needs to be done in the database without extracting the data. For that reason we will calculate churn using short SQL programs that return the result as the output of a select statement. This is a best practice: Do as much of the work as possible inside the database or data warehouse, and only extract reduced data when necessary.
The Meaning of the Churn Rate
The picture below demonstrates the idea of churn in a simple diagram. Each of the two circles represents the subscriber pool of at different points in time. The area of the circle can represent either the number of subscribers or the total amount that they pay, the latter being used whenever subscribers pay different amounts. We will refer to churn measurement based on subscriber count as the “standard” churn or just the churn, if we do not specifically say the churn measure is based on revenue. But either way the concept is the same: The churn is the downward facing crescent at the top, the part of the start circle that is not overlapping the bottom (end) circle – that is those subscribers that are no longer with the service. To complete the picture: The overlap between the two circles are the retained subscribers, and the upward crescent at the bottom of the end circle that doesn’t overlap the top circle are the newly acquired subscribers. Note that in general the size of the two circles will not be exactly the same, and Figure 1 shows a growing subscription service.
The churn rate is defined as what proportion of the start subscribers left by the end time. In an equation that is:
Churn Rate = # Churns / # Subscribers at the start
Where “Start” means the area of the start circle, and “Churn” means the area of the “Churn” crescent. Note that the churn rate does not use either the total subscribers at the end, or the subscribers acquired. Look at the diagram to understand why: The churn rate is the portion of the start circle that is not covered by the end circle, but the size of the end circle depends on both retention and the amount of new subscriber acquisition. The acquisition of new subscribers is an incredibly important subject but it is a separate matter from the churns because it results from a different set of processes (and is normally the job of different departments to manage). Consequently, the churn rate is based only on churn in relation to the subscribers at the start; if the Churn were instead divided by the area of the End subscriber pool it would be incoherent because that would be mixing the parts of the two pools and it would confuse the contributions of acquisition and retention to the size of the subscriber pool.
The Typical Subscription Database
Most subscription products or services have a database which tracks when subscriptions begin and end. The key elements of a typical subscription database are shown in table 1. In practice there are usually more fields than this, but for the purposes of learning how to fight churn with data we will say that a subscription, as a data entity, consists of the following core data elements:
- Subscription id: A unique identifier for each subscription
- Customer id: An identifier for the account holder or user. These are unique at the account level, but are not assumed to be unique in the subscription table. In general, accounts can hold multiple subscriptions.
- Product Id: An identifier for the unique product that is subscribed to. In this data model it is one product per subscription, but as mentioned accounts may hold multiple products. If a subscription service only offers a single product there may be no product field, but it is fair to consider it a required field because single product services that plan to grow are normally going to come up with new offerings.
- Start Date: Every subscription must start on some date. We will assume these are simple dates without times.
- End Date: Subscriptions may or may not have end dates. If there is no end date the subscription is assumed to last until explicitly cancelled. Subscriptions with end dates are usually referred to as “termed” subscriptions and the time between the start and the end is called the term. Subscriptions which last forever until cancelled are usually called “evergreen” subscriptions.
- Monthly Recurring Revenue (MRR) : Paid subscriptions have a recurring revenue amount associated with the subscription
- Billing Period: For paid subscriptions there may be a billing period (i.e. monthly, quarterly, annual) that may be distinct from the term of the subscription. For example, an subscription with a one year term may be billed monthly or quarterly.
In words, a subscription is a product sold to a customer beginning on a specific start date and for a specified recurring cost; as discussed there may be an end date or there may not. The rest is details. Here’s what the structure of the subscription data looks like in a table:
|subscription_id||integer or char||Yes|
|customer_id||integer or char||Yes|
|product_id||integer or char||Yes|
Note that your database table (or tables) that contains this information might not be called “subscription”: If you work at a B2B company that uses a Customer Relationship Management (CRM) system to track deals your company might store this information in the entity which represents “Opportunities”; alternatively if your company uses a Subscription Business Management (SBM) product designed for tracking a multi-product subscription offerings it might be called a “Product Rate Plan”. But as long as all of the required data elements are available to you then you have what you need to calculate churn.
SQL to Calculate the Churn Rate
This SQL program uses Common Table Expressions (CTE) which are a relatively new extension to ANSI SQL. CTE’s allows the definition of temporary tables in a query in the order they appear, and without excessive verbiage for the temporary table definition. We will use CTE’s for teaching these techniques because they allow a clear, step by step presentation of the program logic. If you use an older database that does not support CTE’s or if you interact with your database through a layer of business objects, it should not be difficult to re-write these queries using either classically declared temporary tables of sub-queries.
The following is a high level overview of the program, which is described in relation to the Churn diagram above:
- Set the start and end times for the measurement
- Identify the subscribers at the start and count the total number of them; this is the top circle in the diagram above
- Identify the subscribers at the end; this is the bottom circle in the diagram above
- Identify the churned subscribers and count how many there are: this is the upper (downward facing) crescent in the diagram above
- Divide the number of churns by the number of accounts at the start. This is the churn equation above
Next here is the SQL program – I’ll explain more below.
with date_range as ( select '2017-01-01'::date as start_date, '2018-01-01'::date as end_date ), start_accounts as ( select distinct account_id from subscription s inner join date_range d on s.start_date <= d.start_date and (s.end_date > d.start_date or s.end_date is null) ), end_accounts as ( select distinct account_id from subscription s inner join date_range d on s.start_date <= d.end_date and (s.end_date > d.end_date or s.end_date is null) ), churned_accounts as ( Select s.account_id from start_accounts s left outer join end_accounts e on s.account_id=e.account_id where e.account_id is null ), start_count as ( select count(start_accounts.*) as n_start from start_accounts ), churn_count as ( select count(churned_accounts.*) as n_churn from churned_accounts ) select n_churn::float/n_start::float as churn_rate, 1.0-n_churn::float/n_start::float as retention_rate, n_start, N_churn from start_count, end_count, churn_count
Here’s an explanation of each CTE in relationship to the steps in the calculation:
- date_range : A one row table holding the start and end date for the calculation. This is step 1.
- start_accounts : A table with one row per account active at the start. That table is created by selecting from the subscription table based on the condition that the account is active at the start of the churn measurement. This is step 2.
- end_accounts : A table with one row per account active at the end of the churn measurement. The condition for being “active’ is the same as for the start accounts, just using the churn measurement end date for the criteria. This is step 3.
- churned_accounts : A table with one row per account that is active at the start but NOT active at the end. This is created by the outer join on account id between the start_accounts table and the end_accounts table, and the where clause which removes accounts where the end account_id is not NULL. This is step 4.
- start_count : A one row table that sums the total number of accounts at the start of the churn measurement, for clarity
- churn_count : A one row table that sums the total number of accounts that churned during the measurement period, for clarity
- The final select statement : takes the results from the one row result tables start_count and churn_count and calculates the final results by plugging the values into the churn equation. This is step 5, the final step in the program.
This shows that in a series of five simple steps the Churn rate and Retention rate can be calculated from a typical subscription database in SQL. This SQL program was tested on a sample data set loaded into a Postgres 8 database and produced the following result
More to come…
This is just a sample from my forthcoming book, Fighting Churn with Data. In the book I’ll provide more details on advanced subjects related to this one like calculating Net Retention and MRR Churn with SQL