
Text analysis is the hot new trend in analytics, and with good reason! Text is a huge, mainly untapped source of data, and with Wikipedia alone estimated to contain 2.6 billion English words, there’s plenty to analyze. Performing a text analysis lets you find out what people are saying about your game in their own words, but in a quantifiable manner. In this tutorial, you will learn how to do text mining in R, get the tools to do a bespoke analysis on your own, and find out how to plot a word cloud.

Text mining in R: how to find term frequency

A great way of applying text analysis to your game reviews is to count how often each word is used. I’ll show you how to do this in the video below, and how to then plot these frequencies as a word cloud.


If you don’t have your own data to use, download our sample of 1000 reviews of popular free-to-play games from the iTunes store. This is the data that will be used in this example.

Here’s a step-by-step guide

First, you’ll need to ensure you have the most recent version of R. Head over to http://cran.r-project.org/ to download it.

You can copy and paste the following commands into the R Console, although we use RStudio and would recommend it.

Then you’ll need to install “tm”, the text mining library for R.

install.packages('tm')

Once it’s installed you need to load the library into your session.

library(tm)

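Use setwd() to change the working directory to wherever you saved your CSV file (note that on Windows you need to use either forward slashes or doubled backslashes in the path, because a single backslash starts an escape sequence in an R string).

setwd("C:/Users/Mhairi/Documents/")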

Then read in the CSV file containing your data. This assumes that the CSV file of reviews is in the working directory you just set. If you’ve gathered your own data, you’ll need to find a way of loading it in. We need to set stringsAsFactors to FALSE, because we’re going to be treating our strings as strings, rather than as categories.

reviews <- read.csv("reviews.csv", stringsAsFactors = FALSE)

Have a quick look at reviews to see if the csv has loaded correctly.

str(reviews)

Output:
‘data.frame’: 1000 obs. of 4 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 …
$ rating : int 5 5 5 5 5 1 3 5 4 4 …
$ location: chr “us” “us” “nz” “gb” …
$ text : chr ” SO ADDICTING DEFF DOWNLAOD ITS EPIC YOU CAT LOVERS WILL FALL IN LOVE <3″ ” Great game I love this game. Unlike other games they constantly give you money to play. They are always given you a bone. Keep”| __truncated__ ” Sooo much FUN I would definitely recommend this game, it’s fun for dress up and business. It’s extremely entertaining, I’m hoo”| __truncated__ ” AWESOME Epic game so addictive 5stars <f0><U+009F><U+0098><U+0084>” …

Now we start using the tm package. The tm package is designed for comparing different texts against each other. These are the steps the tm package expects you to take:

1. Set up a source for your text
2. Create a corpus from that source (a corpus is just another name for a collection of texts)
3. Create a document-term matrix, which tells you how frequently each term appears in each document in your corpus
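
To make the idea of a document-term matrix concrete, here is a minimal base-R sketch that builds one by hand for two made-up reviews. It does not use tm at all, and the names docs, tokens, terms and toy_dtm are purely illustrative; tm does the same job with far more care.

```r
# Two made-up "documents" (illustrative data, not from the review sample)
docs <- c(review1 = "fun game fun", review2 = "great game")

# Split each document into words
tokens <- strsplit(docs, " ", fixed = TRUE)
names(tokens) <- names(docs)

# The vocabulary: every distinct term across all documents
terms <- sort(unique(unlist(tokens)))

# Count each term in each document; vapply returns a terms x documents matrix
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(counts) <- terms

# Transpose so that rows are documents and columns are terms
toy_dtm <- t(counts)
print(toy_dtm)

# Summing each column gives the total count of each term across all documents
print(colSums(toy_dtm))
```

Each row is one document and each column is one term, so a corpus containing a single document produces a matrix with a single row.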

 

We are going to do a simpler analysis where we just treat all the reviews as one text and look at the word count in those reviews. However, it’s still worth using the tm library as it has excellent text cleaning tools and makes counting words easy. Some of the syntax will look a little strange as we will be giving tm a corpus of size one.

We currently have all the text of every review in a vector, reviews$text, of size 1000. Each element of the vector corresponds to one review. Since we’re currently not interested in the difference between each review, we can simply paste every review together, separating with a space.

review_text <- paste(reviews$text, collapse=" ")

The collapse argument in paste tells R that we want to paste together elements of a vector, rather than pasting vectors together.
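
As a quick illustration of the difference, here is a small sketch on a made-up vector (reviews_toy is a hypothetical stand-in for reviews$text):

```r
# A made-up vector of three short "reviews"
reviews_toy <- c("great game", "so much fun", "love it")

# With collapse: the elements of one vector are joined into a single string
joined <- paste(reviews_toy, collapse = " ")
print(joined)
print(length(joined))

# Without collapse: paste works element by element and returns a vector
suffixed <- paste(reviews_toy, "!!", sep = "")
print(suffixed)
print(length(suffixed))
```

The first call returns one string, which is what we want to hand to tm as a single document; the second returns a vector that is still three elements long.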
Now we can set up the source and create a corpus.

review_source <- VectorSource(review_text)

corpus <- Corpus(review_source)

Easy!

Next, we begin cleaning the text. We use the multipurpose tm_map function inside tm to do a variety of cleaning tasks:

corpus <- tm_map(corpus, content_transformer(tolower))

corpus <- tm_map(corpus, removePunctuation)

corpus <- tm_map(corpus, stripWhitespace)

corpus <- tm_map(corpus, removeWords, stopwords("english"))

So, what have we just done? We’ve transformed every word to lower case, so that ‘Fun’ and ‘fun’ now count as the same word. We’ve removed all punctuation, so ‘fun’ and ‘fun!’ will now be the same. We stripped out any extra whitespace and we removed stop words. Stop words are just common words which we may not be interested in. If we look at the result of stopwords("english") we can see what is getting removed:

stopwords("english")

Output:
[1] “i” “me” “my” “myself” “we”

[6] “our” “ours” “ourselves” “you” “your”

Depending on what you are trying to achieve with your analysis, you may want to do the data cleaning step differently. You may want to know what punctuation is being used in your text, or the stop words might be an important part of your analysis. So use your head and have a look at the getTransformations() function to see what your data cleaning options are.
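
To see what each cleaning step contributes, and what you would lose or keep by skipping one, here is a rough base-R sketch on a single made-up sentence. It is not how tm implements these steps, and toy_stop_words is a tiny hand-picked list rather than the output of stopwords("english").

```r
# A made-up review and a tiny, hand-picked stop word list (both illustrative)
raw <- "This game is   FUN! So fun, I love it."
toy_stop_words <- c("this", "is", "so", "i", "it")

# Lower-case everything so that "FUN" and "fun" are the same word
clean <- tolower(raw)

# Drop punctuation so that "fun!" and "fun," both become "fun"
clean <- gsub("[[:punct:]]", "", clean)

# Split on runs of whitespace, which also absorbs the extra spaces
tokens_clean <- strsplit(clean, "[[:space:]]+")[[1]]

# Remove the stop words
tokens_clean <- tokens_clean[!tokens_clean %in% toy_stop_words]
print(tokens_clean)

# Count what is left
print(table(tokens_clean))
```

Commenting out any one of these lines and rerunning shows how that step changes the final counts, which is a cheap way to decide which steps your own analysis needs.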


Now we create the document-term matrix.

dtm <- DocumentTermMatrix(corpus)

Since we only have one document in this case, our document-term matrix will only have one row, with one column for each term.

The tm package stores document-term matrices as sparse matrices for efficiency. Since we only have 1000 reviews and one document, we can just convert our document-term matrix into a normal matrix, which is easier to work with.

dtm2 <- as.matrix(dtm)

We then take the column sums of this matrix, which will give us a named vector.

frequency <- colSums(dtm2)

And now we can sort this vector to see the most frequently used words:

frequency <- sort(frequency, decreasing=TRUE)

head(frequency)

Output:
game great fun good love get
917 249 241 236 222 149

Voila!

Plotting a word cloud

However, a list of words and frequencies is a little hard to interpret. Let’s install and load the wordcloud package to visualize these words as a word cloud.

install.packages('wordcloud')

library(wordcloud)

The wordcloud package is very easy to use: we just need a list of the words to be plotted and their frequencies. To get the list of words we can take the names of our named vector.

words <- names(frequency)

Let’s plot the top 100 words in our cloud.

wordcloud(words[1:100], frequency[1:100])

Now, I’m sure this is far from the prettiest word cloud you’ve ever seen. But I hope it inspires you to try a piece of text analysis.
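
If you want to tidy the plot up a little, the wordcloud() function takes a few optional arguments. The sketch below uses the six frequencies from the output earlier in this post as stand-in data (toy_frequency is an illustrative name); the colours and sizes are arbitrary choices, and the requireNamespace() guard is only there so the snippet does nothing if the package is missing.

```r
# Stand-in data: the top six counts shown earlier in this post
toy_frequency <- c(game = 917, great = 249, fun = 241,
                   good = 236, love = 222, get = 149)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(42)  # the layout is random, so fix the seed for a repeatable plot
  wordcloud::wordcloud(
    words = names(toy_frequency),
    freq = toy_frequency,
    scale = c(4, 0.8),       # size range, largest word to smallest
    min.freq = 1,            # plot every word in this small example
    random.order = FALSE,    # place the most frequent words first, in the centre
    rot.per = 0.1,           # proportion of words drawn rotated by 90 degrees
    colors = c("grey50", "steelblue", "darkred")  # least to most frequent
  )
}
```

With your own data, you would pass words[1:100] and frequency[1:100] instead of the stand-in vector.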

If you liked this, you may be interested in reading our tutorial on data manipulation in R.

For a free tool that automatically compares your reviews against other reviews, performs sentiment analysis on them and visualizes their key words, check out the Benchmark dashboard on deltaDNA. If you’re not already on our platform, you can sign up for a free trial.
