This post discusses SQL statements that compute basic metrics for mobile game analytics. We start with two sample data sets given in .csv files: a record of installs and a record of in-app purchases. We will compute some metrics dealing with installs and revenue, and finish by producing a .csv file containing information about average revenue per user (ARPU).
We’ll start with two comma-separated .csv files:
- app_installs.csv – This file (~ 27 MB) has 4 columns: player_id, time, device_os, country. Each entry represents an installation of our game on a device.
- item_purchases.csv – This file has 4 columns: player_id, time, item_type, item_cost. Each entry represents a purchase of a single item from within the game.
The two files cover the same time period, which is about a month long. By the end of our work, we’ll have several more .csv files filled with some interesting metrics related to install and purchase events.
We’ll do all of our work inside SQLite3, so get it running. We will assume that we’re running SQLite3 from the same directory that the .csv files are held in.
Once we have SQLite3 running, we can prepare it for importing .csv files with the command

.mode csv
We will also use the following command to display column names while we are working.

.headers on
If you’d like to play around a bit with the settings, you can use .help to get a list of commands specific to the SQLite3 program.
We’ll load in our two files as tables using the following two commands
.import item_purchases.csv purchases
.import app_installs.csv installs
This loads the files in as tables named purchases and installs, respectively.
We can use the command .schema to see that the tables have been created and that the column names have been set using the names from the first line of the .csv files.
Note that everything has been loaded in as text in our tables. So, even though the item_cost column contains integers, it will be treated as a string of text. We won’t worry about this right now, but if it is an issue, we can update the table or treat these entries as integers using the CAST function.
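To see why the text-vs-integer distinction matters, here is a minimal sketch using Python's standard sqlite3 module; the table and values are invented for illustration. Sorting text that happens to contain digits gives lexicographic order, while CAST restores numeric order.

```python
import sqlite3

# In-memory database standing in for our imported tables.
con = sqlite3.connect(":memory:")
con.execute("create table purchases (player_id text, item_cost text)")
con.executemany("insert into purchases values (?, ?)",
                [("a", "5"), ("b", "25"), ("c", "10")])

# Values are stored as text, so ordering is lexicographic: '10' < '25' < '5'.
rows = con.execute(
    "select item_cost from purchases order by item_cost").fetchall()
print(rows)  # [('10',), ('25',), ('5',)]

# CAST gives the numeric ordering we actually want.
rows2 = con.execute(
    "select item_cost from purchases order by cast(item_cost as int)").fetchall()
print(rows2)  # [('5',), ('10',), ('25',)]
```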
What can we compute?
We have a list of install events and purchase events. These are just records like
- “player ___ installed the game at date/time ___ on their ___ device in country ___”
- “player ___, at date/time ___, purchased item ___, which costs ___”
One of the simplest interesting things we can do is count these events, broken down by day. That is, we can compute things like “on day ___, there were ___ installs”. Later, we can use this to compute cumulative results, like “there were ___ total installs, up to and including day ___”.
We’ll do these computations for installs and revenue. Afterwards, we can compute ratios to get information about revenue per user.
These are examples of a few basic metrics for mobile apps or games. The game is intended to make money, so we’re interested in metrics that help us understand how the game is performing as a business. The metrics we are going to compute help with understanding the acquisition and monetization of players in our game.
In a future post, we’ll talk more about where these metrics come from and why they are useful.
The first thing we’ll compute is the number of installations, grouped by day.
We can see an example of a row in our install table using a SELECT query with LIMIT 1. The query and its result are shown below.
select * from installs limit 1;
player_id,time,device_os,country
9b1ce66d0ab91cd2,"2015-06-01 14:49:02",android,pt
This represents a single installation event. This player installed the game at this time. We are only interested in the fact that this installation happened on a certain day.
In words, we want to look through the list of installations, keeping track of what day the installation occurred, and then count the number of installations that happened on each given day. In the end, we want a table with two columns: one for the day, and one for the number of installations on that day.
We can get this with a SELECT query, using GROUP BY. Because the times in the table are already nicely formatted, we can extract the year-month-day part of the timestamp using the DATE function. To get the number of installs on each day, we will just count the number of player_id’s we see.
select date(time), count(player_id)
from installs
group by date(time);
This gets the job done, but is a little sloppy. Also, we want to store this output so that we can use it in queries later. The commands and query below will
- set SQLite3 to output the results to a new .csv file, daily_installs.csv,
- run the query,
- load daily_installs.csv into a table so we can keep working with the results, and
- set SQLite3 to output to the usual output window again.
We’ll also update the query a bit and rename the columns in the output.
.output daily_installs.csv
select date(time) as day,
       count(distinct player_id) as daily_installs
from installs
group by day
order by day;
.import daily_installs.csv daily_installs
.output stdout
Now we have a new file daily_installs.csv saved to the directory we’re working in. Also, we have a new table daily_installs loaded for us to work with.
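The aggregation above can be sanity-checked outside the SQLite shell. Here is a small Python sketch with toy data (the player IDs and timestamps are invented for illustration); note how count(distinct player_id) counts a player only once per day even if they appear twice.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table installs (player_id text, time text)")
con.executemany("insert into installs values (?, ?)", [
    ("p1", "2015-06-01 14:49:02"),
    ("p1", "2015-06-01 15:00:00"),  # same player, same day: counted once
    ("p2", "2015-06-01 20:10:33"),
    ("p3", "2015-06-02 09:05:00"),
])

rows = con.execute("""
    select date(time) as day, count(distinct player_id) as daily_installs
    from installs
    group by day
    order by day
""").fetchall()
print(rows)  # [('2015-06-01', 2), ('2015-06-02', 1)]
```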
We’ve just made a table that tells us how many installs happened on each day. Next, we’ll make a table that tells us how many installs in total have happened up to each given day.
This means we need to compute a running total, or cumulative sum, of the data from our daily_installs table. To get this, we look at each day and compute the total number of installs that occurred on any day up to and including that day.
We can do this with a nested query. That is, we’ll do a SELECT query which contains another SELECT query inside of it. We will use the inner query to pick out the “earlier days” and sum over their installs.
The code, with output and import commands included, looks like this:
.output daily_cumulative_installs.csv
select t1.day,
       t1.daily_installs,
       ( select sum(t2.daily_installs)
         from daily_installs t2
         where t2.day <= t1.day ) as cumulative
from daily_installs t1;
.import daily_cumulative_installs.csv daily_cumulative_installs
.output stdout
This creates a new .csv file, executes the query, loads the new .csv file daily_cumulative_installs.csv into a new table daily_cumulative_installs, and then sets the output back to the usual output window.
Notice that the inner SELECT query depends on the day t1.day we’re considering in the outer SELECT query. It computes a sum over the number of all installs which occurred on days up to and including t1.day.
This table we just made has three columns: day, daily_installs, and cumulative. Each row tells us how many installs occurred on a given day, and how many installs happened up to and including that day.
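The correlated-subquery pattern for a running total can be checked on a few rows of toy data (a Python sketch using the standard sqlite3 module; the dates and counts are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table daily_installs (day text, daily_installs int)")
con.executemany("insert into daily_installs values (?, ?)",
                [("2015-06-01", 2), ("2015-06-02", 1), ("2015-06-03", 4)])

# The inner query re-sums all days up to and including the outer row's day.
rows = con.execute("""
    select t1.day,
           t1.daily_installs,
           ( select sum(t2.daily_installs)
             from daily_installs t2
             where t2.day <= t1.day ) as cumulative
    from daily_installs t1
    order by t1.day
""").fetchall()
print(rows)
# [('2015-06-01', 2, 2), ('2015-06-02', 1, 3), ('2015-06-03', 4, 7)]
```

Because the inner query runs once per outer row, this is quadratic in the number of days; that is fine for a month of daily rows.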
We’ve just computed some metrics for installs, by day and cumulative. We will do the same thing in this section for revenue.
The purchases table holds rows like the following:
select * from purchases limit 1;
player_id,time,item_type,item_cost
7472bb562cd9897d,"2015-06-20 07:13:13","item 6",25
This row tells us that this player, at this time, purchased item 6, which cost them 25 units. We will assume these units are the real cash value paid for the item in game.
The queries we use will be similar to the ones from the installs section.
However, to compute revenue we need to sum over the item_cost column, rather than just count the number of rows we see. To do this, we use the SUM function. Since the imported values are stored as text, we will also CAST item_cost to an integer as we go, so that SUM is working with numbers rather than strings.
We will just include the queries below.
.output daily_revenue.csv
select date(time) as day,
       sum(cast(item_cost as int)) as daily_revenue
from purchases
group by day
order by day;
.import daily_revenue.csv daily_revenue

.output daily_cumulative_revenue.csv
select t1.day,
       t1.daily_revenue,
       ( select sum(t2.daily_revenue)
         from daily_revenue t2
         where t2.day <= t1.day ) as cumulative
from daily_revenue t1;
.import daily_cumulative_revenue.csv daily_cumulative_revenue
.output stdout
We now have two more .csv files, (daily_revenue.csv and daily_cumulative_revenue.csv) and two more tables (daily_revenue and daily_cumulative_revenue) loaded that we can work with.
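The SUM-with-CAST step can be verified on a handful of toy purchases (a Python sketch; the players, items, and costs are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""create table purchases
               (player_id text, time text, item_type text, item_cost text)""")
con.executemany("insert into purchases values (?, ?, ?, ?)", [
    ("p1", "2015-06-20 07:13:13", "item 6", "25"),
    ("p2", "2015-06-20 12:00:00", "item 1", "10"),
    ("p1", "2015-06-21 08:30:00", "item 2", "5"),
])

# Sum the (text) costs as integers, grouped by calendar day.
rows = con.execute("""
    select date(time) as day, sum(cast(item_cost as int)) as daily_revenue
    from purchases
    group by day
    order by day
""").fetchall()
print(rows)  # [('2015-06-20', 35), ('2015-06-21', 5)]
```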
Revenue per user
In this section, we will use our cumulative installs and cumulative revenue tables to compute the average revenue obtained from each install we’ve seen.
This is one interpretation of average revenue per user (ARPU). We will obtain a table where each row tells us a day and the ARPU computed using the cumulative installs and cumulative revenue up to and including that day. In other situations, we might want to compute ARPU over different time periods, or only using a subset of our users.
Because ARPU is a ratio of the values we’ve stored in two different tables, we will need to join our tables daily_cumulative_installs and daily_cumulative_revenue. We will join them on their day column, so that we obtain an intermediate table listing each day, the cumulative installs up to that day, and the cumulative revenue up to that day.
We will store this joined table as a VIEW so we can reference it later. This is really just an aesthetic choice for presenting the queries here.
After the VIEW is created, we will use it in a query to compute the ARPU. The queries are below.
.output arpu.csv
create view cumulative_join as
select t1.day,
       t1.cumulative as cumulative_installs,
       t2.cumulative as cumulative_revenue
from daily_cumulative_installs t1
join daily_cumulative_revenue t2
  on t1.day = t2.day;
select day,
       cumulative_revenue * 1.0 / cumulative_installs as arpu
from cumulative_join;
.output stdout
We now have a file arpu.csv which lists each day with the ARPU computed up to that day.
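The join-and-divide step can be sketched end to end on toy cumulative tables (Python, standard sqlite3 module; the numbers are invented). Note the multiplication by 1.0, which forces floating-point division; an integer divided by an integer would be truncated.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""create table daily_cumulative_installs
               (day text, daily_installs int, cumulative int)""")
con.execute("""create table daily_cumulative_revenue
               (day text, daily_revenue int, cumulative int)""")
con.executemany("insert into daily_cumulative_installs values (?, ?, ?)",
                [("2015-06-01", 2, 2), ("2015-06-02", 1, 3)])
con.executemany("insert into daily_cumulative_revenue values (?, ?, ?)",
                [("2015-06-01", 30, 30), ("2015-06-02", 15, 45)])

# Join the two cumulative tables on day, then divide revenue by installs.
con.execute("""
    create view cumulative_join as
    select t1.day,
           t1.cumulative as cumulative_installs,
           t2.cumulative as cumulative_revenue
    from daily_cumulative_installs t1
    join daily_cumulative_revenue t2 on t1.day = t2.day
""")
rows = con.execute("""
    select day, cumulative_revenue * 1.0 / cumulative_installs as arpu
    from cumulative_join
    order by day
""").fetchall()
print(rows)  # [('2015-06-01', 15.0), ('2015-06-02', 15.0)]
```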
Because we’ve computed ARPU from the cumulative installs and revenue, the values on later days are less sensitive to change, and can be interpreted as tending toward a longer-term, overall ARPU. One downside is that a problem with sales on, say, the 25th day might not be clearly reflected in that day’s value, since it is averaged together with the previous 24 days.
In this example, we took two .csv files containing data about installs and purchase events, we computed some basic metrics for a mobile game, and we used SQL to output a .csv file containing the ARPU up to each day we’ve seen.
These queries (and a few more) and the output .csv files can be found over at my GitHub page.
In future posts, we will
- compute some other interesting metrics, like conversion rates,
- plot this information using python and matplotlib, and
- discuss where these metrics come from and why they are important.