
Welcome to part two of analyzing your game data in R. The first part in the series covered data manipulation; this part deals with making plots in R. In particular, we will learn how to use the ggplot2 library.

The ggplot2 library makes plotting very easy and returns rather nice-looking results by default. With a little more effort you can customize the graphs it returns as well.

Plotting is where R really comes into its own as an analysis tool. The graphics tools available can be used both for exploratory analysis and for presentation-quality output. R can also make interactive visualizations through a variety of means. While I won’t cover interactive plots in this tutorial, if you are interested I’d recommend looking at ggvis, shiny, and the range of htmlwidgets-based libraries.

We’ll use a different data set from the last blog and look at play times; that is, the length of time a user plays a mobile game. The query available here summarizes our data to a user level. We’ve pulled out when each user started, when they last played, the highest level they’ve reached and some demographic information.

Again, the data is available to download as a .csv file. If you are trying to connect to deltaDNA’s Direct Access, our demo data is now hosted at ‘demo-account.demo-game’.

Step-by-step guide to plotting in R

I’ve made up a video to show you the exact steps to follow in this tutorial. Alternatively, you can follow the text with code below.

Of course, using a new library means installing that library, so start with:
install.packages('ggplot2')

The data loading setup is almost the same as in the last R tutorial, so I’ll go through this quickly.
library(RPostgreSQL) # for database connection
library(dplyr) # for data manipulation
library(ggplot2) # for plotting

driver <- dbDriver("PostgreSQL")
connection <- dbConnect(driver,
                        user = "[email protected]",
                        password = "********",
                        host = "data.deltadna.net",
                        dbname = "demo-account.demo-game")

# Only select game start events
query <- readLines('query_gamestart.txt', warn = FALSE)
query <- paste(query, collapse = " ")
data <- dbGetQuery(connection, query)

# Convert to a tibble (tbl_df() is deprecated in dplyr 1.0.0 and later; as_tibble() is its replacement)
data <- as_tibble(data)

Here’s what the data we are going to be working with looks like:

## Source: local data frame [29,797 x 6]
##
##                user_id  last_seen gender age_group level start_date
## 1  1000-0016-5531-FAKE 2015-06-17 FEMALE     30-34    10 2015-04-02
## 2  1000-0016-8400-FAKE 2015-05-23   MALE     25-29     8 2015-04-07
## 3  1000-0016-3718-FAKE 2015-05-26 FEMALE     25-29     9 2015-03-30
## 4  1000-0016-5221-FAKE 2015-06-13 FEMALE     40-49    10 2015-04-01
## 5  1000-0012-5602-FAKE 2015-06-16 FEMALE     13-17    44 2015-01-31
## 6  1000-0016-4042-FAKE 2015-05-21 FEMALE     40-49    12 2015-03-30
## 7  1000-0002-6831-FAKE 2015-06-17 FEMALE     35-39    82 2014-10-12
## 8  1000-0016-3748-FAKE 2015-06-16 FEMALE     18-24    17 2015-03-30
## 9  1000-0016-8241-FAKE 2015-05-27 FEMALE     18-24    17 2015-04-06
## 10 1000-0002-5100-FAKE 2015-06-16 FEMALE     30-34    76 2014-10-10
## ..                 ...        ...    ...       ...   ...        ...

Dealing with dates and times in R

Before we start looking at plotting, let’s have a quick interlude to look at dealing with dates and times in R. Install the lubridate library, which makes working with dates much easier.

install.packages('lubridate')

library(lubridate)

We extracted the start date of each player and the day they were last seen from the database. We are interested in working out how long our players have been playing for, so we want to know the difference between these two dates in days. To do this we use %--%, which creates an interval between two dates, and %/%, which performs integer division on that interval. The calculation days_seen = date_difference %/% days(1) finds the number of whole days between the two dates.

We then use the ‘select’ verb with a negative sign, which means select all variables except this one.
data <-
  mutate(data,
         date_difference = start_date %--% last_seen,
         days_seen = date_difference %/% days(1)) %>%
  select(-date_difference) # We won't need this again, so we can drop it
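
As a quick sanity check of the date arithmetic, here is a minimal sketch using the two dates from the first row of the sample output above. The variable names start and last are made up for illustration, and the base R line at the end is only a cross-check, not part of the workflow in this tutorial.

```r
library(lubridate)

# Two example dates (first row of the sample data shown earlier)
start <- ymd("2015-04-02")
last  <- ymd("2015-06-17")

# %--% builds an interval; %/% days(1) counts the whole days it contains
date_difference <- start %--% last
days_seen <- date_difference %/% days(1)
days_seen

# Cross-check with plain base R date subtraction
base_days <- as.numeric(as.Date("2015-06-17") - as.Date("2015-04-02"))
base_days
```

Both approaches should agree for simple dates like these; lubridate becomes more useful once times and time zones are involved.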

Play time vs. level: Creating a scatter plot 

In the last tutorial you learnt how to use the pipe operator (%>%) to connect functions. The ggplot library has a similar syntax where layers, aesthetics, and themes are built up to create a plot using the + operator.

Below is the code to make a scatter plot of the number of days we’ve seen each person play, against what level they’ve reached – I’ll explain how it works soon. We can see from the scatter plot that clearly – and perhaps unsurprisingly – the longer a person plays for, the higher a level they will get to.
ggplot(data) +
aes(x = days_seen, y = level) +
geom_point()

I’ll talk you through what each function does in the plot above. This method of plotting can seem a bit abstract at first. It also probably seems a bit over-complicated for making a simple scatter plot, but learning the ggplot system makes producing complicated graphics much easier.
ggplot(data)

All ggplots start with the ggplot function. The first argument is the data frame we are going to plot elements from. If you want to plot vectors not included in a data frame you can set this first argument to NULL.
aes(x = days_seen, y = level)

Here we introduce the variables that we want to plot. The letters ‘aes’ stand for aesthetics – these are the elements of the plot which depend on data. Some elements of the plot will depend on the data and some will be fixed. Clearly the x and y position of the points will depend on the data. Color is a good example of an element which can either be fixed or depend on the data: we can give the whole plot the same color, or we can color points depending on some value. We’ll see an example of this in a minute.
geom_point()

Now we are telling ggplot that we want this plot to be made of points, i.e. a scatter plot. We could have used a different geom to plot this data in a different way. For example, here’s the plot with a line geom instead:
ggplot(data) +
aes(x = days_seen, y = level) +
geom_line()

But clearly a line graph is really the wrong sort of graph for this data. Different geoms use elements passed from aes in different ways. Read the documentation for the geom you are using to understand which aesthetics it requires.
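
To make the layering idea concrete, here is a small sketch on a made-up data frame (toy is hypothetical and simply stands in for the real player table). It builds the data and aesthetics once, then adds two different geoms, and uses ggplot2’s layer_data function to peek at the values a layer will draw.

```r
library(ggplot2)

# Hypothetical toy data standing in for the real player table
toy <- data.frame(days_seen = c(0, 3, 10, 40),
                  level     = c(1, 4, 9, 30))

# The data and aesthetics are shared; only the geom differs
base_plot    <- ggplot(toy) + aes(x = days_seen, y = level)
scatter_plot <- base_plot + geom_point()
line_plot    <- base_plot + geom_line()

# layer_data() returns the data frame that the first layer actually draws
point_data <- layer_data(scatter_plot)
point_data[, c("x", "y")]
```

Storing the shared part in a variable like base_plot is just a convenience; typing print(scatter_plot) or print(line_plot) draws each version.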

Adding color, transparency and facets to scatter plots

Let’s put some color and transparency into the plot. Since we want all points to have the same color and transparency, we do not use aes. Just pass color and transparency (called alpha in ggplot) to the geom function.

Color can be a hex color or one of the named colors seen here. You can spell it as either colour or color. I like pink and I’m British, so I’ve used colour = 'deeppink3'. Transparency can be a very useful plotting element. With transparency added, we can now see that there are many more players in the bottom left of the plot than in the top right.
ggplot(data) +
aes(x = days_seen, y = level) +
geom_point(colour = 'deeppink3', alpha = 0.2)

Now that we’ve seen color as a fixed value, let’s see it used as an aesthetic. We’ve set color to be equal to gender. We need to do this inside aes because we are mapping an element of the data (gender) onto an element of the plot (color). Now the plot will color different points depending on the gender variable.
ggplot(data) +
aes(x = days_seen, y = level, color = gender) +
geom_point()

Interesting! It seems like there might be a difference in final level reached for males and females.
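
The difference between a fixed color and a mapped color can be checked directly. The sketch below uses a made-up four-row data frame (toy is hypothetical) and inspects the colors ggplot2 assigns to each point via layer_data.

```r
library(ggplot2)

# Hypothetical toy data with a gender column
toy <- data.frame(days_seen = c(0, 3, 10, 40),
                  level     = c(1, 4, 9, 30),
                  gender    = c("FEMALE", "MALE", "FEMALE", "MALE"))

# Fixed colour and alpha: set outside aes(), identical for every point
fixed_plot <- ggplot(toy) +
  aes(x = days_seen, y = level) +
  geom_point(colour = "deeppink3", alpha = 0.2)

# Mapped colour: set inside aes(), one colour per gender value
mapped_plot <- ggplot(toy) +
  aes(x = days_seen, y = level, colour = gender) +
  geom_point()

fixed_colours  <- unique(layer_data(fixed_plot)$colour)
mapped_colours <- unique(layer_data(mapped_plot)$colour)
fixed_colours
mapped_colours
```

The fixed plot should report a single color, while the mapped plot reports one color per gender, chosen from ggplot2’s default palette.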

Let’s investigate the age_group variable and see if there are any differences between age groups. We want to do this while still being able to see the gender difference, and check whether it changes across age groups. To examine age and gender at the same time you could try mapping age group to a different aesthetic – maybe the size of points, or their transparency. However, I think that would be a little difficult to interpret. Instead, let’s make a new plot for each age group.

This is very easy to do in ggplot using facet_grid. This will make a grid of plots where each plot contains a subset of data corresponding to the faceting variable.
ggplot(data) +
aes(x = days_seen, y = level, color = gender) +
geom_point() +
facet_grid(~ age_group)

I don’t see any clear differences between age groups, although we can see that all the players of unknown gender also have an unknown age group. Let’s now make some bar charts to investigate further.
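
As a small illustration of faceting, the sketch below uses a made-up data frame (toy is hypothetical) with every combination of two genders and three age groups. It also shows that facet_grid accepts a variable on each side of the formula, giving rows as well as columns of panels.

```r
library(ggplot2)

# Hypothetical toy data: one row per gender / age group combination
toy <- expand.grid(gender    = c("FEMALE", "MALE"),
                   age_group = c("18-24", "25-29", "30-34"),
                   stringsAsFactors = FALSE)
toy$days_seen <- c(1, 2, 5, 8, 13, 21)
toy$level     <- c(2, 3, 6, 7, 12, 15)

base_plot <- ggplot(toy) +
  aes(x = days_seen, y = level, colour = gender) +
  geom_point()

# One column of panels per age group, as in the tutorial
by_age <- base_plot + facet_grid(~ age_group)

# Rows for gender, columns for age group
by_both <- base_plot + facet_grid(gender ~ age_group)

# Count the panels that contain data in each version
panels_by_age  <- length(unique(layer_data(by_age)$PANEL))
panels_by_both <- length(unique(layer_data(by_both)$PANEL))
```

Whether the two-way grid is easier to read than color plus one-way facets is a judgment call that depends on how many groups you have.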

How long do people play for? Putting data into charts and histograms 

Let’s just look at the length of time people play for. It seems from the scatter plots we just made that this might change for different genders and for different age groups. We can also look at a new geom, geom_bar, and see how the same geom treats different types of data.

First, let’s plot a histogram of the length of times people play for to understand how it is distributed.

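We only need one aesthetic since we are only plotting one variable. Let’s use geom_bar; in current versions of ggplot2 it counts the rows at each x value, and since days_seen is a whole number of days, that gives us a histogram with one bar per day.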
ggplot(data) +
aes(x = days_seen) +
geom_bar()

So days_seen is very right skewed. Most players play for a short time, and a small number of players play for a long time. Let’s split days_seen into groups – that should be easier to understand.
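
As an aside on the histogram above: in current versions of ggplot2, geom_histogram is the geom designed for binning a continuous variable, and its binwidth argument controls how wide each bar is. This is a minimal sketch on made-up values (toy is hypothetical); the bin width of 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical play times in days, right skewed like the real data
toy <- data.frame(days_seen = c(0, 0, 0, 1, 1, 2, 5, 30))

hist_plot <- ggplot(toy) +
  aes(x = days_seen) +
  geom_histogram(binwidth = 5)

# Inspect the bins ggplot2 computed: edges and number of players in each
bins <- layer_data(hist_plot)
bins[, c("xmin", "xmax", "count")]
```

Every player lands in exactly one bin, so the counts should add up to the number of rows in the data.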

We can use the cut function to turn a continuous variable like days_seen into a factor. Factors are a special R data type for variables which take on only certain discrete values. For example the age group and the gender variables that we’ve been working with are factors.
data <-
  data %>%
  mutate(days_seen_group = cut(days_seen, breaks = c(-1, 0, 5, 10, 100, 500)))

We have set breaks to c(-1, 0, 5, 10, 100, 500). This gives us 5 groups: above -1 up to 0, above 0 up to 5, above 5 up to 10, above 10 up to 100, and finally above 100 up to 500. By default each group includes its upper break but not its lower one, so I set the first group to be -1 to 0 to catch all values which are 0. I was careful to ensure that the top value in breaks was greater than the maximum value in the data, so that every data point falls into a group.

There are a few other ways of setting groupings when using cut. See ?cut for details.
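
To see what cut does with these breaks, here is a short sketch on a handful of made-up values. The labels vector in the second call is an illustrative choice of my own, not something used elsewhere in this tutorial.

```r
# A few example play times in days
days_seen <- c(0, 3, 5, 7, 50, 200)
breaks    <- c(-1, 0, 5, 10, 100, 500)

# Default labels show each interval; "]" means the upper break is included
groups <- cut(days_seen, breaks = breaks)
as.character(groups)

# Optional: supply friendlier labels, one per interval
labelled <- cut(days_seen, breaks = breaks,
                labels = c("0", "1-5", "6-10", "11-100", "101-500"))
as.character(labelled)
```

Note that the value 5 falls into the (0,5] group rather than (5,10], because each interval is closed on the right by default.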

Okay, now that we’ve got a new factor variable, let’s plot this in a bar chart to understand how many people fall into each group. Note that this is almost exactly the same syntax as the first geom_bar we made, except we are using the new factor variable instead of the old continuous variable.
ggplot(data) +
aes(x = days_seen_group) +
geom_bar()

We have a bar plot. Note that there is space between the bars. The ggplot2 library knows the difference between categorical variables and continuous variables and can make plots that suit each type.

Understanding the difference between genders in bar charts

Now that you’ve learnt how to make bar plots, let’s try to understand the difference between the genders using them. I’ve used dplyr to pipe a data frame without the unknown genders into ggplot, so we’ll only look at the difference between males and females. Note that for bar charts we use ‘fill’ rather than ‘color’. For geom_bar, color controls the color of the line around the bar.
data %>%
filter(gender != 'UNKNOWN') %>%
ggplot() +
aes(x = days_seen_group, fill = gender) +
geom_bar()
I find this chart with the bars stacked a little hard to interpret. Let’s try it with the male and female bars placed side by side. To do this we need to change the position to ‘dodge’ in geom_bar.
data %>%
filter(gender != 'UNKNOWN') %>%
ggplot() +
aes(x = days_seen_group, fill = gender) +
geom_bar(position = 'dodge')
This makes it clear that there are many more females than males who’ve been playing for more than 100 days. Now let’s look at age group.
data %>%
filter(age_group != 'UNKNOWN') %>%
ggplot() +
aes(x = days_seen_group, fill = age_group) +
geom_bar()
The age group plot looks pretty but it’s hard to tell if the proportions in each age group are different in the ‘days played’ groups. We can set the position to ‘fill’ to get a comparison of proportions.
data %>%
filter(age_group != 'UNKNOWN') %>%
ggplot() +
aes(x = days_seen_group, fill = age_group) +
geom_bar(position = 'fill')
That looks like pretty much the same proportion to me. Different geoms have various options for position; check the help files to find more options for plotting. You might need to play around with graphs a bit to find the clearest or most persuasive way of presenting your data.
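
The three position options used above can be compared side by side on a tiny made-up data frame (toy and its group names are hypothetical). The sketch reads the top of the tallest bar from layer_data to show what each position does to the bar heights.

```r
library(ggplot2)

# Hypothetical toy data: 3 players in one group, 2 in the other
toy <- data.frame(
  days_seen_group = c("short", "short", "short", "long", "long"),
  gender          = c("FEMALE", "FEMALE", "MALE", "FEMALE", "MALE"))

base_plot <- ggplot(toy) + aes(x = days_seen_group, fill = gender)

stacked <- base_plot + geom_bar()                    # default: stacked counts
dodged  <- base_plot + geom_bar(position = "dodge")  # side-by-side counts
filled  <- base_plot + geom_bar(position = "fill")   # stacked proportions

stack_top <- max(layer_data(stacked)$ymax)  # total players in the biggest group
dodge_top <- max(layer_data(dodged)$ymax)   # largest single gender count
fill_top  <- max(layer_data(filled)$ymax)   # every stack is rescaled to 1
```

Stacking answers "how many in total", dodging answers "which gender has more", and filling answers "what share does each gender have".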

Heat plots and more complicated graphs

I’ll finish this tutorial with a more complicated graph. I’ll also go into a bit of detail about customizing graphics in ggplot.

I want to present a lot of information about the final level reached in one graph, and I want it to look good so I can show everyone what I’ve found. So I’m going to make a heat plot, where the color corresponds to the final level people reach.

This requires a bit more data preparation than the other graphs – we’ll need to find the average level that players reach for each age group and gender. I’m going to remove all users who have played recently – they are probably still playing and could reach higher levels in the future. If you are familiar with dplyr, the code below should make sense.
plot_data <-
  data %>%
  filter(last_seen < today() - days(7)) %>%
  filter(gender != 'UNKNOWN') %>%
  filter(age_group != 'UNKNOWN') %>%
  group_by(gender, age_group) %>%
  summarise(average_level = mean(level)) %>%
  ungroup()
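
If the group_by and summarise step is unfamiliar, this small sketch runs the same pattern on a made-up four-row data frame (toy is hypothetical), where the averages are easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data: two females share a group, the males do not
toy <- data.frame(gender    = c("FEMALE", "FEMALE", "MALE", "MALE"),
                  age_group = c("18-24", "18-24", "18-24", "25-29"),
                  level     = c(10, 20, 6, 8))

# One output row per gender / age group combination
toy_summary <-
  toy %>%
  group_by(gender, age_group) %>%
  summarise(average_level = mean(level)) %>%
  ungroup()

toy_summary
```

The two females in the 18-24 group collapse into a single row with the mean of their levels, while each male keeps his own row.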

Now we can plot this data. To make a heat plot, the geom we want is geom_raster. The graph plots three variables: gender and age_group, which position the boxes along the y axis and x axis, and average_level, which we calculated earlier and map to ‘fill’.
ggplot(plot_data) +
aes(y = gender, x = age_group, fill = average_level) +
geom_raster()

Looks okay, but there are a few things I want to do to improve it. First, let’s label each square in the heat map with the actual value of average_level. For the labels we want the numbers rounded to one decimal place, and they need to be character variables for geom_text to plot them.
plot_data$label <- plot_data$average_level %>% round(1) %>% as.character

We also want a different color scheme. The RColorBrewer package has some nice color schemes for expressing variables, so let’s install it and load it in.
install.packages('RColorBrewer')

library(RColorBrewer)

Now we can build our plot up in layers. Note that this plot has two geoms – geom_raster and geom_text. Geoms added later will lie on top of geoms added earlier; if we had put geom_text before geom_raster, then geom_text would be hidden by the colored boxes.

These geoms both take their aesthetics from the first aes function. However, if you want to give different aesthetics, or even different data, to different geoms in the same plot, you can do that by specifying them inside the geom function. Any added aesthetics need to be wrapped in the aes function to distinguish them from fixed elements.

The scale_fill_distiller function is part of ggplot2 and applies a ColorBrewer palette as the new color scale. There is a range of such functions that act on either fill or color and on factor or continuous variables (brewer for factor variables, distiller for continuous variables). I’ve used a diverging color scheme to highlight the differences between males and females. You should try experimenting with different palettes and types.

I have changed the default x and y labels. You can use the newline character ‘\n’ in labels to format them the way you want or add space.

Lastly, I have changed the theme from the default ggplot theme. It’s hard to see the full effect of the new theme in this plot but you can see the background is now white rather than grey. For more themes try the ggthemes package, or try writing your own themes.
ggplot(plot_data) +
  aes(y = gender, x = age_group, fill = average_level, label = label) +
  geom_raster() +
  geom_text(colour = 'white',
            size = 8) +
  scale_fill_distiller(name = 'Average\nFinal\nLevel',
                       type = 'div',
                       palette = 3) +
  xlab('\nAge Group') +
  ylab('Gender') +
  theme_bw()
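
As mentioned earlier, aesthetics can also be given to individual geoms rather than to the whole plot. The sketch below is one way that might look, using a made-up summary table (toy is hypothetical): the shared x and y go in the top-level aes, while fill and label are attached only to the geoms that use them.

```r
library(ggplot2)

# Hypothetical summary table shaped like plot_data
toy <- data.frame(gender        = c("FEMALE", "MALE", "FEMALE", "MALE"),
                  age_group     = c("18-24", "18-24", "25-29", "25-29"),
                  average_level = c(12.34, 9.87, 15.5, 11.02))
toy$label <- as.character(round(toy$average_level, 1))

layered <- ggplot(toy) +
  aes(x = age_group, y = gender) +                 # shared by both layers
  geom_raster(aes(fill = average_level)) +         # fill only matters here
  geom_text(aes(label = label), colour = "white")  # label only matters here

# The second layer (geom_text) carries the rounded labels
text_labels <- layer_data(layered, 2)$label
text_labels
```

For a plot with only two layers the result is the same either way, so this is mostly a matter of taste; it becomes more useful when layers draw from different data frames.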

 

This really only scratches the surface of what you can do with ggplot and the range of options for plotting in R. I have found the ggplot cheatsheet very useful, as well as the R Graphics Cookbook by Winston Chang.

If you enjoyed this tutorial, read about how our team analysed in-app purchase pricing strategy in games to ask: do bundle discounts work?

The deltaDNA platform is a powerful toolkit for analysts, made especially for games. Try it out for free, find out more about the features available or request a demo
