In this tutorial we are going to look at some basic statistical modelling. The previous two parts were about manipulating data with dplyr and making plots with ggplot2. We will be using these two libraries a little in this tutorial, so you might want to go back and take a look. If you are familiar with the basics of R, this tutorial should stand on its own.

If you’re looking for a very quick step-by-step guide on linear modeling in R, check out the video below. Otherwise, keep reading for the full tutorial with code and visualizations. Use our downloadable sample data to follow this tutorial.

## Why do we make models?

There are probably two main reasons to make a model: to understand your data, or to make predictions. If all you care about is making accurate predictions of future data, then you might want to look at machine learning type algorithms. However, if you want to understand and explain the patterns you are seeing in your data, you’ll want to build a traditional statistical model. Traditional statistical modeling is what this tutorial will cover.

When you build a model you assume that some sort of simple formula describes the underlying behavior of your data. Almost always this simple behavior will be hidden by some sort of inherent randomness. When modeling, you are attempting to find that simple underlying formula as well as understanding how bad the randomness is. Understanding how bad the randomness is helps you understand how likely your formula is to be right. You have to assume:

data = simple formula + randomness

Creating a model has several advantages over just summarizing the data. Firstly, you can describe what the underlying relationship is. Secondly, you can understand the uncertainty in any relationships that you see.

Say you are researching a link between intelligence and playing video games. It might look like the two are related from the raw data, but a model can help you understand if this link is just due to random chance or an actual relationship.

Always when modelling you have to be careful about causation. Though your model will be able to tell you if intelligence and video game playing are related, it won’t be able to tell you if smart people are more likely to enjoy playing games, or if playing games makes you smarter. Another very common possibility is that the two things you are studying are related to a third variable. For example, it might be that smart people have more money and so have the ability to play more video games, not that their intelligence is directly leading them to play more games.

### A simple test between two samples

If you have ever done a statistical test, you’ll have used a model. A simple test for the difference between two samples is the most basic model. The underlying relationship is these two samples have a different means, and the test is trying to work out if this difference is caused by noise.

I said earlier that you have to assume that your data is created by signal and noise. But to actually do the mathematics of creating a model that best fits your data, you need to make a bunch of other assumptions about exactly what this signal is and what the noise is. Different models have different signals and different noise, and it can be difficult to know which signal and noise best fit your data until you actually fit some models. For each model you’ll need to check if your assumptions about the data are true.

Hopefully all this talk about signal and noise will become more concrete when we look at the data. So, let’s load it in! We’re using the same sample data as last time. In particular we’ll be looking at the relationship between time played and the level reached.

```library(RPostgreSQL)

library(dplyr)

library(lubridate)

library(ggplot2)

driver```

We need to calculate the length of time that people have played for. We are also going to remove people who have played recently. We are interested in finding the final amount of time people have played for and the final level that we reach. For this reason we want to exclude people who have played recently – they might still be playing the game and will get to later levels. I’m also going to remove players of unknown age and gender. When we explored the data there seemed to be an unusual pattern with these players, and I don’t trust these observations.

The reason that I’m using the same data as in the previous tutorial is that doing a proper exploration of the data before modeling is really important. If we hadn’t visually explored the data in detail earlier, we would have never known about the potential problem with unknown genders and ages.
```# Add in the length of time people have played for data <- mutate(data, date_difference = data\$start_date %--% data\$last_seen, days_seen = as.period(date_difference) %/% days(1)) %>% select(-date_difference) # We won't need this again, so we can drop it # Clean the data data <- data %>% filter(last_seen < today() - days(7)) %>% # Only people who haven't played recently filter(gender != 'UNKNOWN') %>% filter(age_group != 'UNKNOWN') ```

## Making a model

Let’s have a look at the scatter plot of time played vs. level reached.

```ggplot(data) + aes(x = days_seen, y = level) + geom_point()``` It looks like the longer a player plays for, the higher a level they reach (pretty obvious). Also, it seems like this happens at a steady rate, which is potentially what you were hoping to achieve when you balanced the game.

The model we are going to use assumes that the signal is some sort of linear progression though the levels, where each day a person plays they move up though a certain number of levels. There’s some noise because everyone finds different levels difficult.

Here are the questions we can answer with modelling:

1. How fast do players move though the levels on average?

To do this we use the lm function included in base R (this is just part of R, we don’t need a library). This function creates a linear model. We are using a linear model because it seems there is a linear relationship between the variables. To go back to our formula, here we will assume:

level = a + b x days_seen + noise

The numbers a and b will be estimated by R using the data. What do a and b mean? The number a is the level people are at before their first day (you’d really expect this to be 0, but let’s see what the model decides). The number b will be the estimate for the number of levels a player is expected to move up each day they play. These numbers, a and b, are often known as the coefficients of the model. Making this model in R is very easy:
`model <- lm(level ~ days_seen, data)`

That’s it. The little ~ sign just means ‘model level as a function of days_seen’. What happens when we print this model to the screen?
`model`

```##
## Call:
## lm(formula = level ~ days_seen, data = data)
##
## Coefficients:
## (Intercept)    days_seen
##     -0.8691       0.2599```

The coefficients are that a and b I was talking about earlier.

So the model estimates that before you play, you’ll be at level -0.8434 (hmmm) and for each day you play you move up 0.2576 levels, i.e. it takes about 4 days to move up one level.

So, that’s our first question answered – players move up though levels at a rate of about a quarter of a level a day. It’s up to you to decide if this is a good thing for a bad thing for your game.

Now for the second question – how sure are we about this estimate? One way of doing this is to look at the p-value for these estimates. You can do this by calling summary on the model.
`model `

```##
## Call:
## lm(formula = level ~ days_seen, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -26.6552  -1.5495   0.6135   1.8691  25.6884
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8690556  0.0386026  -22.51   <2e-16 ***
## days_seen    0.2599123  0.0009649  269.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.312 on 10027 degrees of freedom
## Multiple R-squared:  0.8786, Adjusted R-squared:  0.8786
## F-statistic: 7.256e+04 on 1 and 10027 DF,  p-value: < 2.2e-16```

We can ignore most of the information here.

Look into the section called coefficients. Do you see the little table with our two coefficients that we mentioned earlier? Look at the number of stars at the end of each row of the table. The number of stars tells us how sure we are that this coefficient isn’t 0. Three stars means ‘very sure’ this number isn’t zero, two starts means ‘sure’ this isn’t zero, one star means ‘pretty sure’.

So the model is ‘very sure’ that players don’t start at level 0 (hmmm!), and it’s ‘very sure’ that the rate that players move up levels isn’t zero.

You can never be 100% certain, no matter how many stars R gives the coefficient. This is partly because of random chance – you might just be really unlucky and players are actually not moving up levels, on average, we just managed to pick a really strange looking bunch that did. Partly, and more importantly, this is because models come with assumptions and these assumptions can be wrong (I’m pretty sure that players are starting at level zero!).

## How to measure the uncertainty in values

So we’re ‘very sure’ that players are moving up levels the more they play. But that’s not really very interesting, we want to know how sure the model is about its estimate. To answer this question, we need confidence intervals about the parameters. This is really easy in R, just use the confint function.

`confint(model)`

```##                  2.5 %     97.5 %
## (Intercept) -0.9447244 -0.7933868
## days_seen    0.2580208  0.2618037```

Confidence intervals are the model saying that okay, the average levels per day might not be exactly 0.2576 but I’m fairly sure it’s somewhere between 0.2557115 and 0.2594721. So, the model is pretty darn sure about that number. If the confidence interval had come out as between 0.1 and 7.4 we could summarize that we don’t really have any good idea about the speed people move though levels.

Those last few paragraphs are major, major simplifications of what p-values and confidence intervals are. What I really want you to understand is that they are both measures of the uncertainty in values. The p-value is measuring how sure you are that this number isn’t zero (if it was zero, then the term shouldn’t be in the model). Confidence intervals give you an idea of the uncertainty of the estimate. Understanding these more precisely isn’t difficult; any introduction to statistics course will give you a good overview. Here are some online courses, and although I haven’t tried them they look good: Coursera, Khan Academy and Udacity.

We can visualize our estimated average movement though levels as well as the uncertainty in this estimate. The model’s best estimate is shown in pink over the data and the most extreme ends of our confidence intervals are shown in grey.
```ggplot(data) + aes(x = days_seen, y = level) + geom_point() + geom_abline(intercept = -0.8478, slope = 0.2582, colour = 'deeppink') + geom_abline(intercept = -0.9012297, slope = 0.2556008, colour = 'darkgrey') + geom_abline(intercept = -0.7510752, slope = 0.2593519, colour = 'darkgrey') ``` ## 4 assumptions to check

A lot of people would stop here and be done with it. Not me though! Remember at the start when I said that all statistical models make assumptions? Well, a very important final step after you make a model is to check these assumptions.

There are 4 main assumptions to check for a basic linear model like this.

1. Linearity
2. Normality (of residuals)
3. Constant Error Variance (of residuals)
4. Independence (of residuals)

So, most of the assumptions are about residuals. What are residuals you ask? They are the noise part of the equation.

data = underlying equation + noise

i.e.

data = model + residuals

So, to find the residuals we simply rearrange that equation:

residuals = data – model

Residuals are what is left over after your model, they are the bit the model can’t explain. There’s nothing wrong with the model not explaining everything! If your model did explain everything it would almost certainly be too complicated. We just need to ensure that the residuals are random in the way that the model expects them to be random. Why does this matter? The residuals mainly matter from the point of view of those uncertainty estimates. If your residuals are not what the model expects, then it will be under or over confident about it’s estimates, and your p-values and confidence intervals will be wrong. Sometimes this uncertainty doesn’t matter very much, sometimes it matters a lot.

### 1. Linearity

The first, and most important assumption is that relationship between our two variables is linear. To check for a linear relationship go back and look at that scatter plot at the start – does it look like a linear relationship? It sure does to me.

If you attempt to model a non-linear relationship with a linear model you’ll get results, but those results may not make much sense. Here’s an example of what a non-linear relationship would look like. Say this was the level your players were at vs. the number of coins in their balance. If you run a linear model on this data you get an estimated number of coins at level 0 of 66 and an estimate that players lose half a coin per level. Does that make any sense? No! But that’s what doing a linear model would be telling you.

### 2. Normality

The mathematical definition used in linear modeling assumes that this noise component comes from a normal distribution. You may know a normal distribution as a bell curve, a classic example would look like this: Since we are assuming the noise comes from a normal distribution then the residuals should come from a normal distribution. We can extract the residuals from the model running the residuals function on it. Let’s do a histogram of the residuals and see if they look like the example of a normal distribution above.
```ggplot(NULL) + aes(x = residuals(model)) + geom_bar()``` Not really. So, they sort of look like a normal distribution that got cut off in the middle. How much does this matter? Not hugely. The uncertainty estimates will be off. In particular, the confidence interval suggests that we are equally likely to be underestimating or overestimating the rate of moving up levels. However, this probably isn’t true for this data.

### 3. Constant Residual Variance

We only have one estimate for the uncertainty across the whole range of the data. So, it stands to reason that the noise needs to be equally noisy across all the data or the estimates of uncertainty will be too tight in some places and too loose in other places. Let’s try plot the the residuals against the fitted values of the model and see if they are equally noisy everywhere.
```ggplot(NULL) + aes(x = fitted(model), y = residuals(model)) + geom_point()``` This is pretty much a classic case of the residuals being higher for higher values. This makes sense – at higher levels there will be more variability in what people do. Again, like normality this might not matter too much, just something to keep in mind when describing results. In particular, if you want to make predictions about what level a user will be at when they’ve been playing for a long time, then the model will give unreasonably small uncertainty estimates.

For reference, a perfect version of this residual plot would look something like this: ### 4. Independence

Another way your error estimates can go wrong is if your observations aren’t independent i.e. For our data that would mean that if one user got to an usually high level then the next user would also get to an unusually high level and vice versa for low levels. To me this sounds pretty unlikely – there’s nothing special about the order the users appeared in. However, we should check this anyway. We do this by simply plotting the residuals in the order they appear in the data.
```ggplot(NULL) + aes(y = residuals(model), x = 1:length(residuals(model)) ) + geom_point() + coord_cartesian(x = c(1, 100)) ``` I’ve zoomed the plot in to the first 100 points since its hard to see any patterns with all the points. As I suspected I don’t see any evidence of non-independence. Here’s what the plot might looks like if there was non-independence. If you see non-independence you probably need to model the order that observations came from in an explicit way.

## Should you use this model?

So you’re probably wondering if this is a good model overall. Heck yes, it’s pretty good! I know I’ve just pointed out a whole bunch of problems with it, but as models go this one is amazing. If I was asked how fast users are going up levels I would say: on average our users are going up a level every 4 days, and I’d be pretty happy with that.

Modeling can be difficult, there’s lots of ways your model can mislead you. This data is fabricated and you’ll probably never see anything as clean as this in real life, and it still failed two of our assumptions and gave a clearly wrong answer for the intercept. It’s a bit of a cliche in statistics, but all models are wrong yet some are useful. You just need to keep plenty of common sense and understand what you are doing.

Obviously statistical modeling is a huge field and this tutorial can only really give you a taste of it. I do hope you’ve gained an understanding of what models can do, the challenges and subtleties of modelling, and maybe even been inspired to learn more.

Check out other R tutorial videos in this series on manipulating data with dplyr and making plots with ggplot2. Or why not find out how to do text mining in R?

Recommended Posts
• Chuka Ikokwu

This is very helpful; question though – How is the model great if it failed 2 of the tests and the intercept was off? Is it because it had 3 stars (* * *) or low P-Value and ‘0’ wasn’t contained in the Confidence Interval? Also you mention if the CI was 0.1 to 7.4 then the model would be bad…is it because this interval is too wide?

• Mhairi

The model is certainly imperfect, however, it could be useful.

Almost every model has some problems, and no model fits the data perfectly, but often the model can help you do what you want to do. Here, you want to work out how many levels people move up per day on average, and the model is useful for letting you calculate that number. It can be difficult to evaluate a model in simple terms of ‘good’ and ‘bad’ – it may help instead to think about exactly what you want to do and how accurate it needs to be.

As for the confidence interval, there is very little difference between a confidence interval of (0.1, 7.4) and a confidence interval of (0, 7.4). Depending on the size on your data, that could just be a few data-points slightly different. However, some people would draw very different conclusions from those two intervals.  