It’s not just our clients that like making magic with our Data Mining tools. Spread across our various teams, there is an elite band of individuals constantly experimenting with ways to get more out of data using SQL.
We asked each of our resident query kings to choose one favorite query that our clients could use to unearth valuable insights from their game data. These are the answers they gave.
Russell ‘Mighty Joe’ Young
I like to have a nose at how a random player flows through the game. Looking at individual flows is a really good way of getting a quick feel for typical event sequences in the platform.
This very simple query returns a whole lot of interesting and useful information:
select eventTimestamp, eventName, missionName, productName, transactionVector
from events
where userID = '1001-0211-7978-FAKE'
order by 1
The screencap below shows the event sequence of a player's first couple of sessions, in which they breeze through two missions.
Looking at what happened immediately after, we can see transactions starting to come into the equation. After completing mission First Time User Forest #2, the player spends some of their gold coins in exchange for an inventory ‘shuffle’ before going onto the next mission with a new set of items.
Isaac ‘Wild’ Roseboom
Chief Product Officer
Along the same lines as Russell above, my query allows you to see the event stream for players that meet certain criteria. It’s super handy for QA and debugging because you’ll get a different user every time you run it.
If you suspect, for example, that events aren’t being correctly recorded for players who installed a certain version of your game, you can run something like the following:
select collectInsertedTimestamp, eventTimestamp, userID, eventName, sessionID, clientVersion
from events
where userID in (
  select user_ID
  from user_metrics
  where eventNewPlayerCount is null
    and user_first_seen_date between '2019-08-01' and '2019-09-25'
  limit 1
)
order by collectInsertedTimestamp
What comes back is the event stream of a random player running the game version that seems to be experiencing problems. Do this a few times and you can very quickly start to spot patterns in the data.
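If you want to experiment with this pick-one-random-user pattern offline, here is a minimal sketch using SQLite in Python. The table, column names, and rows are made-up stand-ins for a real events warehouse, not the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-in for an events table (illustrative data only).
cur.executescript("""
create table events (
    collectInsertedTimestamp text,
    eventTimestamp text,
    userID text,
    eventName text,
    clientVersion text
);
insert into events values
 ('2019-09-01 10:00:00','2019-09-01 09:59:58','u1','gameStarted','1.2.0'),
 ('2019-09-01 10:00:05','2019-09-01 10:00:03','u1','missionStarted','1.2.0'),
 ('2019-09-01 11:00:00','2019-09-01 10:59:59','u2','gameStarted','1.1.0');
""")

# Pick one user on the suspect client version via a LIMIT 1 subquery,
# then pull that user's whole event stream in insertion order.
rows = cur.execute("""
    select collectInsertedTimestamp, eventTimestamp, userID, eventName
    from events
    where userID in (
        select userID from events where clientVersion = '1.2.0' limit 1
    )
    order by collectInsertedTimestamp
""").fetchall()

for r in rows:
    print(r)
```

The outer query only ever sees a single user's stream, so re-running it against live data (where the subquery hits a different row each time) gives you a fresh sample player per run.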
Michael ‘The Rain’ Forrest
Your global playerbase will show peaks and troughs in user activity, but I really like to observe player behaviors in the context of specific timezones. By looking at activity patterns against common 24-hour time periods, you can start to build a clear picture of how individuals like to play and when.
Not all games track local time but, assuming yours does, you can do something like the following query:
select date_part('hour', userTimestamp) as hour_of_day
 , count(*) as events
 , count(distinct userID) as users
 , sum(msSinceLastEvent)/1000/60/60 as time_spent_hours
 , sum(msSinceLastEvent)/count(distinct userID)/1000/60/60 as time_spent_hours_per_user
 , count(case when eventName = 'transaction' and eventLevel = 0 then 1 end) as purchases
 -- and so on
from events
group by 1
order by 1
Here’s a nice example of the kind of insights you can get from looking at things in terms of timezones:
For games that do track local time as a timestamp, you can do some magic to shift that timestamp and compare the behaviors of players from multiple timezones as if they were all on the same schedule. More on that particular wizardry here. What a world.
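As a rough illustration of that timestamp-shifting idea, you can add each user's UTC offset to the event timestamp so that everyone's evening lands in the same bucket. The events and offsets below are invented for the sketch:

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical events: (UTC timestamp, user's UTC offset in hours).
events = [
    (datetime(2019, 9, 1, 23, 0), -5),  # 18:00 local, New York-ish
    (datetime(2019, 9, 1, 17, 0), 1),   # 18:00 local, Berlin-ish
    (datetime(2019, 9, 1, 9, 0), 9),    # 18:00 local, Tokyo-ish
]

# Shift each timestamp into the player's local clock, then bucket by hour.
local_hours = Counter(
    (ts + timedelta(hours=offset)).hour for ts, offset in events
)

print(local_hours)  # all three events land in the same local-hour bucket
```

Three events that look scattered across the UTC day all collapse into one local "after work" hour once shifted, which is exactly why the normalized view is so much easier to read.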
Ben ‘Brand’ Newbon
Senior Data Analyst
This is an absolute monster of a query but it’s a belter, honest. The complexity is justified because what you get back is a list of all the units/items/cards that a player can own, as well as performance and usage stats per player.
To help you deconstruct this behemoth, I’ve made some of the key terms bold so that I can refer to them specifically and explain what they mean below. Godspeed:
with starters as (
  select 'deckChanged' eventName, 'ADDED' changeType, 'CARD_FIRETEAM' cardName
  union select 'deckChanged', 'ADDED', 'CARD_RPG'
  union select 'deckChanged', 'ADDED', 'CARD_SHOTGUNNER'
  union select 'deckChanged', 'ADDED', 'COMMAND_BOMB_DROP'
)
, activeDeckData as (
  select eventTimestamp, userId, eventName, cardName, changeType, missionType, outcome, userScore
  from events
  where eventName in ('cardGained', 'deckChanged', 'cardUpgraded')
    and userId in (select user_Id from userSelection)
  union all
  select min(eventTimestamp) as eventTimestamp, userId, s.eventName, s.cardName, s.changeType,
         '' as missionType, '' as outcome, userScore
  from events e
  full outer join starters s on 1
  where e.eventName = 'newPlayer'
  group by userId, s.eventName, s.cardName, s.changeType, missionType, outcome, userScore
)
, battleData as (
  select eventTimestamp, userId, eventName, cardName, changeType, missionType, outcome, userScore
  from events
  where eventName in ('missionStarted', 'missionSummary')
)
, userCardData as (
  select userId, cardName
  from activeDeckData
  group by userId, cardName
)
, enrichedBattleData as (
  select eventTimestamp, b.userId, eventName, u.cardName, changeType, missionType, outcome, userScore
  from battleData b
  inner join userCardData u on b.userId = u.userID
)
, combinedData as (
  select * from activeDeckData
  union all
  select * from enrichedBattleData
)
, data as (
  select *
   , 1 as unlocked
   , greatest(0, sum(case when eventName = 'deckChanged' and changeType = 'ADDED' then 1
                          when eventName = 'deckChanged' and changeType = 'SWAPPED' then -1
                          else 0 end)
              over (partition by userId, cardName order by eventTimestamp)) as inDeck
  from combinedData
  order by userId, cardName, eventTimestamp
)
select cardName
 , count(distinct case when unlocked > 0 then userId end) as owners
 , round(count(case when eventName = 'cardUpgraded' then 1 end)
         / count(distinct case when unlocked > 0 then userId end), 2.0) as Upgrades
 , round(count(case when eventName = 'missionStarted' then 1 end)
         / count(distinct case when unlocked > 0 then userId end), 2.0) as BattlesSinceUnlocked
 , round(count(case when eventName = 'missionStarted' and inDeck > 0 then 1 end)
         / count(distinct case when unlocked > 0 then userId end), 2.0) as BattlesPlayedEquipped
 , round(count(case when eventName = 'missionStarted' and inDeck > 0 then 1 end)
         / greatest(1, count(case when eventName = 'missionStarted' and unlocked > 0 then 1 end)) * 100, 2.0) as 'usage %'
 , round(count(case when eventName = 'missionSummary' and outcome = 'WON' and inDeck > 0 then 1 end)
         / greatest(1, count(case when eventName = 'missionSummary' and inDeck > 0 then 1 end)) * 100, 2.0) as 'win %'
 , round(count(case when eventName = 'missionSummary' and outcome = 'LOST' and inDeck > 0 then 1 end)
         / greatest(1, count(case when eventName = 'missionSummary' and inDeck > 0 then 1 end)) * 100, 2.0) as 'loss %'
from data
group by 1
order by 2 desc
As intimidating as it may look in SQL, the results of this query are actually very easily digestible. Check them out below.
Notes & terms
The named sections are CTEs (common table expressions): temporary, named result sets that exist only for the duration of the query, which let you build the final answer up in readable stages.
- starters – hard-codes the cards that every player begins the game with, since no event is sent when a player receives these starter cards (unlike the cards they gain during gameplay).
- activeDeckData – combines starters with actual data received surrounding the gaining, assigning and upgrading of cards in time order.
- battleData – pulls out all the mission data for each player; the card-related fields come through as null from the mission events.
- userCardData – lists all the cards ever owned by each player.
- enrichedBattleData – combines battleData and userCardData so each battle now has a line for each card (even if the card wasn’t used for that battle).
- combinedData – adds activeDeckData to enrichedBattleData.
- data – sorts combinedData in time order and adds tags to track when a card is unlocked and when it is equipped or not.
- The final section uses data to pull out the overall stats for each card.
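The inDeck calculation in data is essentially a running sum, clamped at zero, over each player's deckChanged events. Here is a toy re-implementation of that bookkeeping in Python (the card and event sequence are invented) to show how the flag flips on and off:

```python
# One player's deckChanged history for a single card, in time order
# (invented example data).
deck_events = [
    ("CARD_RPG", "ADDED"),    # card goes into the deck
    ("CARD_RPG", "SWAPPED"),  # card comes back out
    ("CARD_RPG", "ADDED"),    # and goes in again
]

# Mirror: greatest(0, sum(+1 for ADDED, -1 for SWAPPED)
#                     over (partition by userId, cardName order by eventTimestamp))
running = 0
in_deck_flags = []
for card, change in deck_events:
    if change == "ADDED":
        running += 1
    elif change == "SWAPPED":
        running -= 1
    in_deck_flags.append(max(0, running))  # clamp at zero, like greatest(0, ...)

print(in_deck_flags)  # [1, 0, 1]
```

When this running flag is joined onto the mission rows, each battle "knows" whether the card was equipped at that moment, which is what powers the usage % and win % columns.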
Steven ‘The Transit’ Van Stiphout
I’ve chosen to highlight one particular trick that utilizes the structure of our data warehouse to make running queries more efficient. It really comes into its own when drilling into large amounts of data with complex computation, because it lets you work with a specific subset of users instead of the whole playerbase.
Because our data warehouse is columnar — it stores all the values of a column together, rather than whole rows — limiting the number of columns included makes queries run a LOT faster.
Let’s say you want to run this simple query against your data:
select count(distinct eventName, userID) / count(distinct userID) as 'avg number of event types per user'
from events
You might expect it to run fairly quickly given its simplicity but, with a large dataset, that isn’t always the case. Taking advantage of the shape of our warehouse and the fact that userIDs are stored as a hash, a quick trick can be used to reduce the number of users included and so increase the query speed:
select count(distinct eventName, userID) / count(distinct userID) as 'avg number of event types per user'
from events
where mod(hash(userID), 100) = 66
Here, we’re splitting our users into 100 pseudo-randomly assigned and numbered buckets and taking a look at bucket #66.
Adding this code into simple queries won’t make much difference but, as we increase computational complexity, filtering data in this way becomes more and more important. It saves time and money, and it allows you to get insights on sample subsets of users without having to process entire datasets.
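To see why this sampling is stable across runs, here is a small Python sketch of the same idea: hash each userID deterministically into one of 100 buckets and keep only bucket #66. MD5 is just a stand-in here for whatever hash function the warehouse uses:

```python
import hashlib

def bucket(user_id: str, n_buckets: int = 100) -> int:
    """Deterministically assign a userID to one of n_buckets buckets."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Invented userIDs standing in for a real playerbase.
user_ids = [f"user-{i}" for i in range(10_000)]

# Keep only bucket #66 — roughly a 1% pseudo-random sample.
sample = [u for u in user_ids if bucket(u) == 66]

print(len(sample))
```

Because the bucket is a pure function of the userID, the same users land in bucket #66 every time, so you can iterate on an expensive query against a consistent ~1% sample before paying for the full dataset.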
In the pursuit of knowledge, our SQL heroes have forgone countless Friday nights, relationships, weddings, and Christmas dinners* so that you don’t have to. When you run these queries, take a moment to remember the brave souls that created them.