
Today we’ll be looking at how to do data manipulation with R, using a library called dplyr.

If you currently analyze your data in SQL, the logic of dplyr will feel familiar. Indeed, if you are a SQL analyst you may ask: why learn R when I can do exactly the same things in SQL? That question will be answered in upcoming R tutorials, but whatever those tutorials cover, you’ll first need to become adept at moving data around in R.

If you are used to Excel, R might seem a little strange at first. Manipulating data with R has three big advantages over using Excel:

  1. It’s much harder to accidentally delete or change your data
  2. R is more flexible and can do more data manipulation tasks
  3. You can work with much bigger data sets in R

I’ve recorded this step-by-step video guide to show you a number of ways to do this. If you prefer to follow text, you’ll find the transcript below. I’ve also included some downloadable sample data in a .csv file, and just the code (if you’d rather copy/paste).

(Downloadable .csv sample data file or the code)

Installing R

Now that I have convinced you that you must learn R, let’s get started!

Go to http://www.r-project.org/ and download R.

Once you’ve done that, I’d strongly recommend downloading RStudio, which is the most popular IDE for R.

Now that you’ve got RStudio, I’d recommend setting up a project for working through this tutorial (go to File > New Project). Making a project just tells RStudio that your R scripts, data, and any other files you need for the analysis are in the same place, which makes managing those files much more straightforward.

Loading in the data

I’ll start by explaining how to extract data from Direct Access. For a more detailed explanation, see our documentation. I’ll be using our test game deltaCrunch in all my examples – if you are signed up to the platform (you can do this for free) you will have Direct Access and will be able to get at deltaCrunch’s data. If you don’t have Direct Access, I’ve included the data I’ve used as a csv file. Simply load the data into your R session using:
data <- read.csv('currency_data.csv')
This will work as long as your R script and your csv file are both in the same project.

First you’ll need to install the RPostgreSQL package. R packages are collections of code that other people have written; the RPostgreSQL package is a set of code for connecting to PostgreSQL databases, like the Direct Access database.
install.packages('RPostgreSQL')
Once you have installed the package, you’ll need to load it into the workspace to use it. For your next project you won’t need to install the package again; you can just call the library function to load it.
library(RPostgreSQL)

Now set up your connection to the database. You’ll need to use your own deltaDNA username and password.
driver <- dbDriver("PostgreSQL")
connection <- dbConnect(driver,
user = "[email protected]",
password = "yourpasswordhere",
host = "data.deltadna.net",
dbname = "demo-co.deltacrunch")

We’re going to read in a SQL query that is saved in a text file. You can download the query here. All the query does is select all events which correspond to someone spending money. We don’t need every column in the database, so we’ve selected only the 5 relevant ones for this analysis.
query <- readLines('query.txt', warn = FALSE)

The readLines function reads each line separately, so we need to use the paste function to stick those lines back together again. Then we can use dbGetQuery to extract the data out of Direct Access and store it in the variable ‘data’.
query <- paste(query, collapse = "")
data <- dbGetQuery(connection, query)
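Once the data has been pulled into R you can close the database connection. The rest of this tutorial only uses the ‘data’ variable, so this is just housekeeping:
dbDisconnect(connection)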

Data frames

So, the data is in ‘data’ and is loaded into the workspace. Let’s now look at how R stores the data.

The str (short for structure) function is one of R’s most useful functions. Here’s the structure of data.
str(data)

## 'data.frame': 13610 obs. of 5 variables:
## $ user_id : chr "1000-0015-8438-FAKE" "1000-0018-4033-FAKE" "1000-0000-0690-FAKE" "1000-0000-5884-FAKE" ...
## $ time : POSIXct, format: "2015-06-16 13:30:36" "2015-06-16 12:33:14" ...
## $ gender : chr "FEMALE" "FEMALE" "MALE" "FEMALE" ...
## $ age_group : chr "25-29" "13-17" "40-49" "18-24" ...
## $ spend_amount: num 399 299 299 549 99 999 99 99 99 99 ...

So, the first thing it tells us is that data is a data frame. This is R’s default way of storing data and is like a table in SQL or a sheet in Excel.

It tells us we have 13,610 observations of 5 variables, and then it gives us a bit more information about each of those variables, including the variable’s name, its type, and the first few observations for that variable.
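If you’re coming from Excel, it can help to see how a data frame is built by hand. Here’s a toy example – the values are made up purely for illustration and aren’t part of the sample data set:
toy <- data.frame(user_id = c("A", "B", "C"),
spend_amount = c(99, 299, 549))
str(toy)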

We can see the data in more detail using View. This will bring up a separate window in RStudio with the data in tabular form.
View(data)

What is the average amount of currency spent?

Now we’ve got the data into R, let’s answer some questions we might have about it. Each observation in this data set corresponds to one spend event. How much do people spend on average each time they spend?

The information about the amount of money spent is contained in the variable spend_amount. We can access just this variable out of our data frame using $. Let’s have a look at the structure of the spend_amount variable.
str(data$spend_amount)

## num [1:13610] 399 299 299 549 99 999 99 99 99 99 ...

R has a whole bunch of inbuilt functions that let you do statistical operations on your data. We simply call one of these functions on the spend_amount variable to get the mean amount spent per transaction and the median amount spent per transaction.
mean(data$spend_amount)

## [1] 523.0191

median(data$spend_amount)

## [1] 299
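Mean and median aren’t the only options; base R has plenty of other summary functions that work in exactly the same way. A few examples (run them yourself to see the values):
sum(data$spend_amount) # total amount spent across all transactions
max(data$spend_amount) # the single largest transaction
quantile(data$spend_amount, 0.9) # 90% of transactions are at or below this amount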

Basic dplyr

That’s all the data manipulation that we’re going to do using vanilla R. Now let’s install and load the dplyr package. The name dplyr is a mixture of data frame and pliers, the idea being that this package is a multi-purpose tool that lets you push and pull your data into any form you want. Start by installing dplyr – if it’s not already installed – and loading it into the session.
install.packages('dplyr')

library(dplyr)

A useful feature of dplyr is that you can convert your data frames into tabular data frames. These behave basically the same as data frames, but printing one gives you a quick summary of the data without using str.
data <- tbl_df(data)
data

## Source: local data frame [13,610 x 5] ##
## user_id time gender age_group spend_amount
## 1 1000-0015-8438-FAKE 2015-06-16 13:30:36 FEMALE 25-29 399
## 2 1000-0018-4033-FAKE 2015-06-16 12:33:14 FEMALE 13-17 299
## 3 1000-0000-0690-FAKE 2015-05-16 05:08:54 MALE 40-49 299
## 4 1000-0000-5884-FAKE 2015-05-16 10:40:48 FEMALE 18-24 549
## 5 1000-0001-6372-FAKE 2015-05-16 07:02:22 FEMALE 18-24 99
## 6 1000-0002-3313-FAKE 2015-05-16 04:14:12 FEMALE 25-29 999
## 7 1000-0003-7303-FAKE 2015-05-16 20:32:43 FEMALE 18-24 99
## 8 1000-0003-7464-FAKE 2015-05-16 04:56:16 FEMALE 18-24 99
## 9 1000-0004-5122-FAKE 2015-05-16 00:32:29 FEMALE 40-49 99
## 10 1000-0004-7247-FAKE 2015-05-16 04:48:28 FEMALE 18-24 99
## .. ... ... ... ... ...

Data manipulation in dplyr is done through five ‘verbs’, which can be stacked together to do almost any type of manipulation you want.

Each verb is simply a function that takes a data frame/tabular data frame as its first argument and returns a data frame/tabular data frame with some sort of manipulation performed on it.

If you want more information about any of these verbs (or any function in R), you can call the help file using ‘?’. For example, to see the help on select use ?select.

Verb 1: Select

Select returns a copy of the data frame with only the variables you ask for. Here we are asking for the time and the gender variables.
select(data, time, gender)

## Source: local data frame [13,610 x 2] ##
## time gender
## 1 2015-06-16 13:30:36 FEMALE
## 2 2015-06-16 12:33:14 FEMALE
## 3 2015-05-16 05:08:54 MALE
## 4 2015-05-16 10:40:48 FEMALE
## 5 2015-05-16 07:02:22 FEMALE
## 6 2015-05-16 04:14:12 FEMALE
## 7 2015-05-16 20:32:43 FEMALE
## 8 2015-05-16 04:56:16 FEMALE
## 9 2015-05-16 00:32:29 FEMALE
## 10 2015-05-16 04:48:28 FEMALE
## .. ... ...
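Select isn’t limited to naming columns one by one: you can drop a variable with a minus sign, or pick a range of adjacent variables with a colon. For example:
select(data, -user_id) # everything except user_id
select(data, time:age_group) # all variables from time through to age_group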

Verb 2: Filter

Filter returns the data frame with only the observations that meet the condition we ask for. Select and filter are similar, except select operates on variables and filter operates on observations.

Here we are only returning observations where the amount spent is greater than 999 cents.
filter(data, spend_amount > 999)
## Source: local data frame [1,342 x 5] ##
## user_id time gender age_group spend_amount
## 1 1000-0013-7975-FAKE 2015-05-16 13:05:49 FEMALE 50-64 1499
## 2 1000-0016-3689-FAKE 2015-05-16 08:36:55 FEMALE 50-64 2999
## 3 1000-0011-2482-FAKE 2015-05-17 09:02:25 FEMALE 40-49 4999
## 4 1000-0016-7893-FAKE 2015-05-17 02:00:54 MALE 40-49 1499
## 5 1000-0009-9670-FAKE 2015-05-18 02:07:34 FEMALE 25-29 1999
## 6 1000-0011-6573-FAKE 2015-05-18 18:34:04 FEMALE 13-17 1999
## 7 1000-0015-4630-FAKE 2015-05-18 08:44:39 FEMALE 35-39 2499
## 8 1000-0017-2402-FAKE 2015-05-18 10:08:50 FEMALE 50-64 1499
## 9 1000-0017-2549-FAKE 2015-05-18 08:10:52 FEMALE 50-64 1999
## 10 1000-0017-6308-FAKE 2015-05-18 01:19:23 FEMALE 25-29 4999
## .. ... ... ... ... ...
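You can also combine several conditions in a single filter call. Conditions separated by commas must all hold (a logical AND), and you can use | for OR. For example:
filter(data, spend_amount > 999, gender == 'MALE') # big spends by male players
filter(data, age_group == '13-17' | age_group == '18-24') # younger players only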

Verb 3: Arrange

Arrange reorders the observations in a data frame by the variable selected. Here we are ordering by time, so that the events that happened first are first in the data frame.
arrange(data, time)

## Source: local data frame [13,610 x 5] ##
## user_id time gender age_group spend_amount
## 1 1000-0003-2493-FAKE 2015-05-16 00:03:34 FEMALE 40-49 299
## 2 1000-0011-3198-FAKE 2015-05-16 00:07:16 FEMALE 25-29 999
## 3 1000-0016-2209-FAKE 2015-05-16 00:12:38 FEMALE 18-24 299
## 4 1000-0016-5350-FAKE 2015-05-16 00:12:50 FEMALE 25-29 99
## 5 1000-0011-4369-FAKE 2015-05-16 00:12:56 FEMALE 18-24 299
## 6 1000-0010-2359-FAKE 2015-05-16 00:31:15 FEMALE 40-49 299
## 7 1000-0015-1311-FAKE 2015-05-16 00:32:04 FEMALE 30-34 99
## 8 1000-0004-5122-FAKE 2015-05-16 00:32:29 FEMALE 40-49 99
## 9 1000-0010-2194-FAKE 2015-05-16 00:33:38 FEMALE 35-39 99
## 10 1000-0005-8418-FAKE 2015-05-16 00:43:49 FEMALE 18-24 1499
## .. ... ... ... ... ...
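By default arrange sorts in ascending order. Wrap a variable in desc() to reverse this, so the most recent events come first:
arrange(data, desc(time))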

Verb 4: Mutate

Mutate creates a new variable or changes an existing variable. You simply give the name of the new variable and a formula for calculating it in terms of the other variables.

In this example we are overwriting the spend_amount variable with the spend in dollars, rather than cents.
mutate(data, spend_amount = spend_amount/100)

## Source: local data frame [13,610 x 5] ##
## user_id time gender age_group spend_amount
## 1 1000-0015-8438-FAKE 2015-06-16 13:30:36 FEMALE 25-29 3.99
## 2 1000-0018-4033-FAKE 2015-06-16 12:33:14 FEMALE 13-17 2.99
## 3 1000-0000-0690-FAKE 2015-05-16 05:08:54 MALE 40-49 2.99
## 4 1000-0000-5884-FAKE 2015-05-16 10:40:48 FEMALE 18-24 5.49
## 5 1000-0001-6372-FAKE 2015-05-16 07:02:22 FEMALE 18-24 0.99
## 6 1000-0002-3313-FAKE 2015-05-16 04:14:12 FEMALE 25-29 9.99
## 7 1000-0003-7303-FAKE 2015-05-16 20:32:43 FEMALE 18-24 0.99
## 8 1000-0003-7464-FAKE 2015-05-16 04:56:16 FEMALE 18-24 0.99
## 9 1000-0004-5122-FAKE 2015-05-16 00:32:29 FEMALE 40-49 0.99
## 10 1000-0004-7247-FAKE 2015-05-16 04:48:28 FEMALE 18-24 0.99
## .. ... ... ... ... ...
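Mutate doesn’t have to overwrite an existing variable. If you give it a name that isn’t already in the data frame, it adds a new column and leaves the original untouched:
mutate(data, spend_dollars = spend_amount/100)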

Verb 5: Summarize

Summarize collapses a data frame down to a single row of summary values. It’s similar to mutate, except mutate returns a data frame the same size as the original data frame, while summarize returns a data frame with just one row.

In this example, we do the same calculation as we did earlier – calculating the mean amount spent in each transaction and the median amount spent in each transaction. Summarize becomes very powerful when combined with group_by, as we shall see in the next section.
summarize(data, mean_spend = mean(spend_amount),
median_spend = median(spend_amount))

## Source: local data frame [1 x 2] ##
## mean_spend median_spend
## 1 523.0191 299
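Summarize isn’t limited to mean and median: any function that returns a single value will work, and dplyr adds helpers such as n() and n_distinct(). For example, you could count the total number of spend events and the number of distinct spenders in one call (the column names here are just illustrative):
summarize(data, transactions = n(), spenders = n_distinct(user_id))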

Is average spend different for different genders?

Now that you understand the basics of dplyr, we can start to answer some more interesting questions about the data. Let’s try to work out whether the amount spent differs between genders.

First we use mutate to convert our spend into dollars. We save the output of this as data2 and then use group_by on data2. When you use the group_by function on a data set and then use summarize on it, you get a separate summary for each group. Since we are summarizing by gender, and there are three genders in the data set (female, male and unknown), we will get a data set with three observations returned. Let’s see this in action.
data2 <- mutate(data, spend_amount = spend_amount/100) # Change to dollars
data3 <- group_by(data2, gender) # Group by gender
data4 <- summarize(data3, # Summarize each group
mean = mean(spend_amount),
median = median(spend_amount),
count = n())
data4

## Source: local data frame [3 x 4] ##
## gender mean median count
## 1 FEMALE 5.217094 2.99 11438
## 2 MALE 5.343067 2.99 1875
## 3 UNKNOWN 5.021987 2.99 297

Now with piping!

So that worked pretty well, but it’s a bit annoying that we had to make all those temporary data sets: data2, data3 and data4. One of the most powerful features of dplyr is the pipe operator, which lets us avoid temporary data sets and write code in a clear, logical way.

To understand pipes, let’s look at a simple function in R – the identical function. This function just checks if its two arguments are the same:
identical(4, 4)
## [1] TRUE
identical(5, 4)
## [1] FALSE
We can perform exactly the same operation with a pipe %>%. All the pipe operator does is place the argument on the left of the operator as the first argument of the function on the right. So we can write the code above as:
4 %>% identical(4)
## [1] TRUE
5 %>% identical(4)
## [1] FALSE

For the identical function, piping doesn’t add much, but for data manipulation tasks it can be very useful. Data manipulation is often a series of operations that the data passes through until it produces an output.

The pipe operator works particularly well with dplyr, since every dplyr verb takes a data frame as its first argument and returns a data frame. That means we can rewrite our code which found the average spend for each gender as:
data %>%
mutate(spend_amount = spend_amount/100) %>%
group_by(gender) %>%
summarize(mean = mean(spend_amount),
median = median(spend_amount),
count = n())

## Source: local data frame [3 x 4] ##
## gender mean median count
## 1 FEMALE 5.217094 2.99 11438
## 2 MALE 5.343067 2.99 1875
## 3 UNKNOWN 5.021987 2.99 297

Is this different for different age groups?

The best way to understand dplyr and piping data is to see some more examples. Here’s a more complex one where we start by removing observations with unknown gender and age group. Also, we are going to group by both gender and age group; this gives us a data frame where we summarize by all combinations of gender and age group.

The ungroup function may seem a little mysterious. Using summarize removes one layer of grouping, so when we group by only one variable it removes the grouping entirely. However, since we are grouping by two variables here, one layer of grouping is left behind and we need ungroup to remove it. It’s a good idea to make sure the data sets you are working with don’t have any grouping left on them, or you may get unexpected results.
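If you’re ever unsure whether a data frame still carries a grouping, you can check it with dplyr’s groups function. For example, data3 from the previous section was grouped by gender:
groups(data3) # lists the grouping variables still attached to data3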

To make this example easier to understand, I’ll show you the results of each verb and how they stack together.

The starting data:
data
## Source: local data frame [13,610 x 5] ##
## user_id time gender age_group spend_amount
## 1 1000-0015-8438-FAKE 2015-06-16 13:30:36 FEMALE 25-29 399
## 2 1000-0018-4033-FAKE 2015-06-16 12:33:14 FEMALE 13-17 299
## 3 1000-0000-0690-FAKE 2015-05-16 05:08:54 MALE 40-49 299
## 4 1000-0000-5884-FAKE 2015-05-16 10:40:48 FEMALE 18-24 549
## 5 1000-0001-6372-FAKE 2015-05-16 07:02:22 FEMALE 18-24 99
## 6 1000-0002-3313-FAKE 2015-05-16 04:14:12 FEMALE 25-29 999
## 7 1000-0003-7303-FAKE 2015-05-16 20:32:43 FEMALE 18-24 99
## 8 1000-0003-7464-FAKE 2015-05-16 04:56:16 FEMALE 18-24 99
## 9 1000-0004-5122-FAKE 2015-05-16 00:32:29 FEMALE 40-49 99
## 10 1000-0004-7247-FAKE 2015-05-16 04:48:28 FEMALE 18-24 99
## .. ... ... ... ... ...

All observations with unknown gender are removed:
data %>%
filter(gender != 'UNKNOWN')

## Source: local data frame [13,313 x 5] ##
## user_id time gender age_group spend_amount
## 1 1000-0015-8438-FAKE 2015-06-16 13:30:36 FEMALE 25-29 399
## 2 1000-0018-4033-FAKE 2015-06-16 12:33:14 FEMALE 13-17 299
## 3 1000-0000-0690-FAKE 2015-05-16 05:08:54 MALE 40-49 299
## 4 1000-0000-5884-FAKE 2015-05-16 10:40:48 FEMALE 18-24 549
## 5 1000-0001-6372-FAKE 2015-05-16 07:02:22 FEMALE 18-24 99
## 6 1000-0002-3313-FAKE 2015-05-16 04:14:12 FEMALE 25-29 999
## 7 1000-0003-7303-FAKE 2015-05-16 20:32:43 FEMALE 18-24 99
## 8 1000-0003-7464-FAKE 2015-05-16 04:56:16 FEMALE 18-24 99
## 9 1000-0004-5122-FAKE 2015-05-16 00:32:29 FEMALE 40-49 99
## 10 1000-0004-7247-FAKE 2015-05-16 04:48:28 FEMALE 18-24 99
## .. ... ... ... ... ...

All observations with unknown age group are removed (the row count doesn’t change here, because none of the remaining observations have an unknown age group):
data %>%
filter(gender != 'UNKNOWN') %>%
filter(age_group != 'UNKNOWN')

## Source: local data frame [13,313 x 5] ##
## user_id time gender age_group spend_amount
## 1 1000-0015-8438-FAKE 2015-06-16 13:30:36 FEMALE 25-29 399
## 2 1000-0018-4033-FAKE 2015-06-16 12:33:14 FEMALE 13-17 299
## 3 1000-0000-0690-FAKE 2015-05-16 05:08:54 MALE 40-49 299
## 4 1000-0000-5884-FAKE 2015-05-16 10:40:48 FEMALE 18-24 549
## 5 1000-0001-6372-FAKE 2015-05-16 07:02:22 FEMALE 18-24 99
## 6 1000-0002-3313-FAKE 2015-05-16 04:14:12 FEMALE 25-29 999
## 7 1000-0003-7303-FAKE 2015-05-16 20:32:43 FEMALE 18-24 99
## 8 1000-0003-7464-FAKE 2015-05-16 04:56:16 FEMALE 18-24 99
## 9 1000-0004-5122-FAKE 2015-05-16 00:32:29 FEMALE 40-49 99
## 10 1000-0004-7247-FAKE 2015-05-16 04:48:28 FEMALE 18-24 99
## .. ... ... ... ... ...

Change spend_amount to be in dollars:
data %>%
filter(gender != 'UNKNOWN') %>%
filter(age_group != 'UNKNOWN') %>%
mutate(spend_amount = spend_amount/100)

## Source: local data frame [13,313 x 5] ##
## user_id time gender age_group spend_amount
## 1 1000-0015-8438-FAKE 2015-06-16 13:30:36 FEMALE 25-29 3.99
## 2 1000-0018-4033-FAKE 2015-06-16 12:33:14 FEMALE 13-17 2.99
## 3 1000-0000-0690-FAKE 2015-05-16 05:08:54 MALE 40-49 2.99
## 4 1000-0000-5884-FAKE 2015-05-16 10:40:48 FEMALE 18-24 5.49
## 5 1000-0001-6372-FAKE 2015-05-16 07:02:22 FEMALE 18-24 0.99
## 6 1000-0002-3313-FAKE 2015-05-16 04:14:12 FEMALE 25-29 9.99
## 7 1000-0003-7303-FAKE 2015-05-16 20:32:43 FEMALE 18-24 0.99
## 8 1000-0003-7464-FAKE 2015-05-16 04:56:16 FEMALE 18-24 0.99
## 9 1000-0004-5122-FAKE 2015-05-16 00:32:29 FEMALE 40-49 0.99
## 10 1000-0004-7247-FAKE 2015-05-16 04:48:28 FEMALE 18-24 0.99
## .. ... ... ... ... ...

Add age and gender groupings to the data set:
data %>%
filter(gender != 'UNKNOWN') %>%
filter(age_group != 'UNKNOWN') %>%
mutate(spend_amount = spend_amount/100) %>%
group_by(age_group, gender)

## Source: local data frame [13,313 x 5]
## Groups: age_group, gender
##
## user_id time gender age_group spend_amount
## 1 1000-0015-8438-FAKE 2015-06-16 13:30:36 FEMALE 25-29 3.99
## 2 1000-0018-4033-FAKE 2015-06-16 12:33:14 FEMALE 13-17 2.99
## 3 1000-0000-0690-FAKE 2015-05-16 05:08:54 MALE 40-49 2.99
## 4 1000-0000-5884-FAKE 2015-05-16 10:40:48 FEMALE 18-24 5.49
## 5 1000-0001-6372-FAKE 2015-05-16 07:02:22 FEMALE 18-24 0.99
## 6 1000-0002-3313-FAKE 2015-05-16 04:14:12 FEMALE 25-29 9.99
## 7 1000-0003-7303-FAKE 2015-05-16 20:32:43 FEMALE 18-24 0.99
## 8 1000-0003-7464-FAKE 2015-05-16 04:56:16 FEMALE 18-24 0.99
## 9 1000-0004-5122-FAKE 2015-05-16 00:32:29 FEMALE 40-49 0.99
## 10 1000-0004-7247-FAKE 2015-05-16 04:48:28 FEMALE 18-24 0.99
## .. ... ... ... ... ...

Find the average spend amount and the number of observations for each combination of age group and gender:
data %>%
filter(gender != 'UNKNOWN') %>%
filter(age_group != 'UNKNOWN') %>%
mutate(spend_amount = spend_amount/100) %>%
group_by(age_group, gender) %>%
summarize(mean = mean(spend_amount),
median = median(spend_amount),
count = n())

## Source: local data frame [14 x 5]
## Groups: age_group
##
## age_group gender mean median count
## 1 13-17 FEMALE 5.339866 2.99 746
## 2 13-17 MALE 5.331954 2.99 174
## 3 18-24 FEMALE 4.991363 2.99 2201
## 4 18-24 MALE 5.168873 2.99 355
## 5 25-29 FEMALE 5.676702 2.99 1722
## 6 25-29 MALE 6.111908 2.99 283
## 7 30-34 FEMALE 5.342721 2.99 1470
## 8 30-34 MALE 5.365000 2.99 200
## 9 35-39 FEMALE 5.179470 2.99 1301
## 10 35-39 MALE 4.686281 2.99 242
## 11 40-49 FEMALE 5.102578 2.99 2083
## 12 40-49 MALE 5.266812 2.99 345
## 13 50-64 FEMALE 5.069112 2.99 1915
## 14 50-64 MALE 5.441087 2.99 276

Remove the extra grouping that still remains:
data %>%
filter(gender != 'UNKNOWN') %>%
filter(age_group != 'UNKNOWN') %>%
mutate(spend_amount = spend_amount/100) %>%
group_by(age_group, gender) %>%
summarize(mean = mean(spend_amount),
median = median(spend_amount),
count = n()) %>%
ungroup()

## Source: local data frame [14 x 5] ##
## age_group gender mean median count
## 1 13-17 FEMALE 5.339866 2.99 746
## 2 13-17 MALE 5.331954 2.99 174
## 3 18-24 FEMALE 4.991363 2.99 2201
## 4 18-24 MALE 5.168873 2.99 355
## 5 25-29 FEMALE 5.676702 2.99 1722
## 6 25-29 MALE 6.111908 2.99 283
## 7 30-34 FEMALE 5.342721 2.99 1470
## 8 30-34 MALE 5.365000 2.99 200
## 9 35-39 FEMALE 5.179470 2.99 1301
## 10 35-39 MALE 4.686281 2.99 242
## 11 40-49 FEMALE 5.102578 2.99 2083
## 12 40-49 MALE 5.266812 2.99 345
## 13 50-64 FEMALE 5.069112 2.99 1915
## 14 50-64 MALE 5.441087 2.99 276

Order the data first by gender, then by age_group:
data %>%
filter(gender != 'UNKNOWN') %>%
filter(age_group != 'UNKNOWN') %>%
mutate(spend_amount = spend_amount/100) %>%
group_by(age_group, gender) %>%
summarize(mean = mean(spend_amount),
median = median(spend_amount),
count = n()) %>%
ungroup() %>%
arrange(gender, age_group)

## Source: local data frame [14 x 5] ##
## age_group gender mean median count
## 1 13-17 FEMALE 5.339866 2.99 746
## 2 18-24 FEMALE 4.991363 2.99 2201
## 3 25-29 FEMALE 5.676702 2.99 1722
## 4 30-34 FEMALE 5.342721 2.99 1470
## 5 35-39 FEMALE 5.179470 2.99 1301
## 6 40-49 FEMALE 5.102578 2.99 2083
## 7 50-64 FEMALE 5.069112 2.99 1915
## 8 13-17 MALE 5.331954 2.99 174
## 9 18-24 MALE 5.168873 2.99 355
## 10 25-29 MALE 6.111908 2.99 283
## 11 30-34 MALE 5.365000 2.99 200
## 12 35-39 MALE 4.686281 2.99 242
## 13 40-49 MALE 5.266812 2.99 345
## 14 50-64 MALE 5.441087 2.99 276

Does the amount people spend increase over time?

Let’s finish on a more complicated example.

Here we group by user and then arrange by time. This means we have a grouped data frame where each group is in date order. We then use the row_number function. On grouped data this simply gives the order of each observation within its group. Since the groups are in time order, this numbers a user’s first spend as one, their second spend as two, their third spend as three, and so on.

Then we can get rid of that grouping and instead group by spend number. Using summarize we can calculate how much is spent on average at each spend number. Try running the pipeline one step at a time to understand how it works.
data %>%
group_by(user_id) %>%
arrange(time) %>%
mutate(spend_number = row_number()) %>%
ungroup() %>%
group_by(spend_number) %>%
summarize(mean = mean(spend_amount),
median = median(spend_amount),
count = n())

## Source: local data frame [13 x 4] ##
## spend_number mean median count
## 1 1 520.6153 299 3442
## 2 2 531.4109 299 2891
## 3 3 531.0567 299 2363
## 4 4 493.5052 299 1829
## 5 5 531.3621 299 1287
## 6 6 535.1882 299 829
## 7 7 507.3838 299 495
## 8 8 496.9554 299 269
## 9 9 657.5470 299 117
## 10 10 424.4717 99 53
## 11 11 688.1304 299 23
## 12 12 486.5000 424 8
## 13 13 249.0000 299 4
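If you want to keep a summary like this for later, assign the result of a pipeline to a variable and write it out to a csv file in your project folder – for example, the gender summary from earlier (the file name below is just an example):
spend_by_gender <- data %>%
group_by(gender) %>%
summarize(mean_spend = mean(spend_amount))
write.csv(spend_by_gender, 'spend_by_gender.csv', row.names = FALSE)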

You should now be able to do some pretty useful data manipulation with R.

If you enjoyed this R tutorial, why not watch or read how to do text mining in R? You’ll also learn how to make a word cloud.

 
