To be clear, this post is not about how to improve the performance of your title (that’s the point of pretty much all the other posts in this blog). It’s about how to write the queries that will provide you with the output you need to build LTV curves. More specifically, queries that don’t require too much computational power and that are flexible enough to compare the LTV curves of different user segments.
When I’m talking about LTV curves, I’m talking about a curve that shows the LTV for each day – not the discrete juxtaposition of Day 1, 3, 7, 14, etc. metrics. Take the example below. LTV curves are interesting because they show you something you won’t see immediately if you look at discrete day X metrics side by side: namely, the monetization trend of your userbase as it ages.
If you’re using a visualization tool like Tableau (although I started off a big sceptic, I’m now a big Tableau fan), then more often than not your solution for displaying LTV as curves will be to use count distinct. The problem is that count distinct is a heavy operation, and having your data source at the user level seriously limits the volume of data you can operate with. That makes it hard to look at different segments side by side when large volumes are involved (for example across countries, platforms or install periods). It’s even harder to get those curves in a way that allows you to filter and segment your userbase if you are using Excel and pivot tables.
There are two things you need to be attentive to when you write a query that provides the LTV at each day since install. First, you need to make sure the user is baked: you don’t want to consider the day 30 LTV of a user who installed 21 days ago. Second, once you have ensured you are only considering days users could actually have logged in, you still need to account for the LTV of a user on days where s/he doesn’t log in – in other words, the LTV on a date for which there is no corresponding activity record. Going back to the user in the previous example: s/he installed 21 days ago and might not have logged in on day 12, but there is still a corresponding LTV value for day 12. What you want is a subtable or subquery that lists, for each user, the activity dates s/he could potentially have played. That way, both points mentioned above are covered: 1) you will only consider days for which your user is baked, and 2) you will account for all calendar days where your user could have played – even if s/he did not log in.
The main trick involves creating a cartesian product in order to append all potential activity dates to your user table. You might be working with a user table that has one record per user and includes some relevant user-level state metrics (install date, country, platform, total number of matches played, etc.). What you want is as many rows as there are days the user could potentially have logged in. In other words, a table where, for each user, you have one record for each calendar day since s/he installed. You can append to this table as many fields as you find relevant. For the purpose of this illustration, I’ll start by focusing on the curve itself, and I’ll provide at the end a sample query that groups by country, platform and install date.
In order to list, for each user, all the calendar dates since s/he installed, you need to create an artificial join key. The purpose of that join key is to create the cartesian product between each user and all existing activity dates: you will append to your user table all possible activity dates the user could have logged in on. All you need for that is access to a user table and an activity table (the working assumption being that you have some activity record for each calendar date). You then add a join key to both. The query looks something like this:
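Here is a minimal sketch of what that could look like, in Redshift-flavoured SQL (user_table, activity_table and all column names here are assumptions – adapt them to your own schema):

```sql
SELECT
    u.user_id,
    u.install_date,
    a.activity_date,
    DATEDIFF(day, u.install_date, a.activity_date) AS days_since_install
FROM
    -- one record per user, with an artificial join key
    (SELECT user_id, install_date, 1 AS join_key FROM user_table) u
JOIN
    -- one record per existing calendar date, with the same join key
    (SELECT DISTINCT activity_date, 1 AS join_key FROM activity_table) a
    ON u.join_key = a.join_key
-- keep only the dates on or after each user's install date
WHERE DATEDIFF(day, u.install_date, a.activity_date) >= 0
```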
The point of the join_key is to append each distinct activity_date to the user_table. You are creating a field whose only purpose is to provide a key linking user_id and dates. It’s important to insist on the fact that there must be only one record per user in whatever you use as the user_table. Also note that you want to limit the activity dates considered to those that follow the user’s install date (or else you will also be listing calendar days that precede install). That’s the purpose of the final datediff part. The output of this query will look something like this:
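With hypothetical values (a user who installed on 2016-03-01, and an activity table whose last date is 2016-03-03), the output would be structured like this:

```
user_id | install_date | activity_date | days_since_install
--------+--------------+---------------+-------------------
101     | 2016-03-01   | 2016-03-01    | 0
101     | 2016-03-01   | 2016-03-02    | 1
101     | 2016-03-01   | 2016-03-03    | 2
102     | 2016-03-02   | 2016-03-02    | 0
102     | 2016-03-02   | 2016-03-03    | 1
```

Note how the later install (user 102) automatically stops at days_since_install = 1: that’s the baking constraint being taken care of.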
So you will be repeating, on each row, your relevant user status metrics (in this case user_id and install_date). But the rows will list all existing activity_dates – regardless of whether your user has an actual record on that date. You can then use this as a subquery on which you left join your revenue table. Whether the user spends on a given day or not, there is a record for him/her at each potential activity date. In this case you left join on both user and date. Keep in mind you can use this same approach for any activity-based KPI (such as retention or conversion) you want to see continuously from the perspective of days since install. The next part of the query looks like this:
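A sketch of that step, with the previous query nested as a subquery (revenue_table is an assumption, and I’m assuming it holds at most one row per user and day – pre-aggregate it first if it doesn’t):

```sql
SELECT
    d.user_id,
    d.install_date,
    d.activity_date,
    d.days_since_install,
    -- running sum of daily revenue = cumulative revenue on each day since install
    SUM(COALESCE(r.revenue, 0)) OVER (
        PARTITION BY d.user_id
        ORDER BY d.activity_date
        ROWS UNBOUNDED PRECEDING
    ) AS cumulative_revenue
FROM
    (
        -- one row per user per potential activity date (the query from above)
        SELECT
            u.user_id,
            u.install_date,
            a.activity_date,
            DATEDIFF(day, u.install_date, a.activity_date) AS days_since_install
        FROM (SELECT user_id, install_date, 1 AS join_key FROM user_table) u
        JOIN (SELECT DISTINCT activity_date, 1 AS join_key FROM activity_table) a
            ON u.join_key = a.join_key
        WHERE DATEDIFF(day, u.install_date, a.activity_date) >= 0
    ) d
LEFT JOIN revenue_table r
    ON d.user_id = r.user_id
    AND d.activity_date = r.activity_date
```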
Two small points of detail. First, you can use a window function (a running sum) to get the LTV value on each day since install. Second, use a coalesce function (or nvl if you’re using Redshift) on the revenue field. It might not be necessary in most cases, but if you are looking at very small user segments (or very low performing geos), there might be cohorts in which nobody has spent at all. Coalesce ensures you always have a value of 0 instead of a null. The output of this query will be structured as follows:
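Again with hypothetical values: user 101 spent on day 0 and day 2 but not on day 1, and still has a cumulative_revenue value on day 1; user 102 never spent, and the coalesce keeps his/her value at 0 instead of null:

```
user_id | install_date | activity_date | days_since_install | cumulative_revenue
--------+--------------+---------------+--------------------+-------------------
101     | 2016-03-01   | 2016-03-01    | 0                  | 0.99
101     | 2016-03-01   | 2016-03-02    | 1                  | 0.99
101     | 2016-03-01   | 2016-03-03    | 2                  | 3.98
102     | 2016-03-02   | 2016-03-02    | 0                  | 0.00
102     | 2016-03-02   | 2016-03-03    | 1                  | 0.00
```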
Once you’re at this stage, you just need to aggregate the result of this query by days since install. There are two main advantages. The first is that you don’t have to add where clauses to ensure users are baked. The second – the biggest one – is that you are dealing with aggregated data, structured in a way that makes it easy to manipulate. This is especially important when you are thinking ahead to the visualization of your data.
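A sketch of that aggregation, again with the previous steps nested inside (same hypothetical table names as above):

```sql
SELECT
    days_since_install,
    -- the subquery has exactly one row per user per day,
    -- so this counts the baked installs at each day since install
    COUNT(DISTINCT user_id) AS baked_installs,
    SUM(cumulative_revenue) AS total_cumulative_revenue
FROM
    (
        SELECT
            d.user_id,
            d.days_since_install,
            SUM(COALESCE(r.revenue, 0)) OVER (
                PARTITION BY d.user_id
                ORDER BY d.activity_date
                ROWS UNBOUNDED PRECEDING
            ) AS cumulative_revenue
        FROM
            (
                SELECT
                    u.user_id,
                    u.install_date,
                    a.activity_date,
                    DATEDIFF(day, u.install_date, a.activity_date) AS days_since_install
                FROM (SELECT user_id, install_date, 1 AS join_key FROM user_table) u
                JOIN (SELECT DISTINCT activity_date, 1 AS join_key FROM activity_table) a
                    ON u.join_key = a.join_key
                WHERE DATEDIFF(day, u.install_date, a.activity_date) >= 0
            ) d
        LEFT JOIN revenue_table r
            ON d.user_id = r.user_id
            AND d.activity_date = r.activity_date
    ) t
GROUP BY days_since_install
ORDER BY days_since_install
```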
By keeping the baked_installs and total_cumulative_revenue fields separate, you can then have a calculated field to determine the LTV for each days_since_install: sum(total_cumulative_revenue)/sum(baked_installs). This allows you to look at the LTV curve of your entire data source, or compare across different segments, with one and the same operation. And because in this case the operation is a sum (not a count distinct), you should see big performance improvements.
You can add as many dimensions as you want to aggregate by: country, platform, install date, etc. Once your data is structured this way, you can process it, add filters, etc. in a much more efficient way. You are summing aggregated data – you don’t need to use count distinct, which is a heavier operation and reduces the amount of data you can process. That means that in one and the same data source you have the data of multiple cohorts, structured in a manner that is much easier to manipulate. Below is a final example of the LTV query that aggregates data per country, platform and install date.
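A sketch of that full query, carrying country, platform and install date through each level (all table and column names still hypothetical):

```sql
SELECT
    days_since_install,
    country,
    platform,
    install_date,
    COUNT(DISTINCT user_id) AS baked_installs,
    SUM(cumulative_revenue) AS total_cumulative_revenue
FROM
    (
        SELECT
            d.user_id,
            d.country,
            d.platform,
            d.install_date,
            d.days_since_install,
            SUM(COALESCE(r.revenue, 0)) OVER (
                PARTITION BY d.user_id
                ORDER BY d.activity_date
                ROWS UNBOUNDED PRECEDING
            ) AS cumulative_revenue
        FROM
            (
                -- one row per user per potential activity date,
                -- now carrying the segmentation dimensions along
                SELECT
                    u.user_id,
                    u.country,
                    u.platform,
                    u.install_date,
                    a.activity_date,
                    DATEDIFF(day, u.install_date, a.activity_date) AS days_since_install
                FROM (SELECT user_id, country, platform, install_date, 1 AS join_key FROM user_table) u
                JOIN (SELECT DISTINCT activity_date, 1 AS join_key FROM activity_table) a
                    ON u.join_key = a.join_key
                WHERE DATEDIFF(day, u.install_date, a.activity_date) >= 0
            ) d
        LEFT JOIN revenue_table r
            ON d.user_id = r.user_id
            AND d.activity_date = r.activity_date
    ) t
GROUP BY days_since_install, country, platform, install_date
```

Point your visualization tool at this output, add the sum(total_cumulative_revenue)/sum(baked_installs) calculated field, and you can slice the same curve by any of those dimensions.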
nesting bad. cte’s good.
haha – yes we all have our little quirks when writing queries.