About Redshift
Redshift is Amazon's cloud-based data warehouse service, which we use to store the analytics data from our games. Its columnar, massively parallel architecture is designed specifically for big data applications.
Connecting to Redshift
We have two Redshift clusters, one for staging games and one for prod games. Each contains a single database, named staging and prod respectively. Per-game tables follow the naming convention rawdata_<game_skuid>_current.
Connecting through Command Line
The most straightforward way to connect to the Redshift databases is through the PostgreSQL command-line interface. Install the psql client and connect to the databases with these commands:
psql -h analytics-staging.c3daf8xd7gja.us-east-1.redshift.amazonaws.com -p 5439 -U magic -W -d staging # analytics staging
psql -h analytics-prod.c3daf8xd7gja.us-east-1.redshift.amazonaws.com -p 5439 -U magic -W -d prod # analytics prod
You will be prompted for the magic user's password, which is not included in this document for security reasons. Once connected, you can run standard PostgreSQL queries against the data tables.
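For example, you can list the per-game tables with a standard catalog query (this assumes they live in the default public schema):
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name LIKE 'rawdata_%_current';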
Connecting through SQL Workbench/J
Download SQL Workbench/J here.
You will also need the Amazon Redshift Driver here (this is for JDBC 4.1).
Setup and configure SQL Workbench/J as follows:
- Launch SQL Workbench/J. The “Connect Window” will be displayed with “New Profile” highlighted and ready for editing. Give the profile a name like “Analytics DB Prod/QA”.
- Next, you will need to select the Redshift JDBC driver. Before you can do that, you have to add it: click “Manage Drivers” in the lower left corner of the “Connect Window”. The “Manage Drivers” window should be displayed.
- In the top left corner of the “Manage Drivers” window, click the icon that looks like a page/paper. This will display a “New Driver” dialog. For the name, type “Redshift”, then select the driver you just downloaded by clicking the folder icon with the green arrow. This should auto-fill the Classname field; then just click “OK.”
- Now back in the “Connect Window,” select the Redshift driver from the “Driver” drop down menu.
- URLs are of the form:
jdbc:redshift://analytics-staging.c3daf8xd7gja.us-east-1.redshift.amazonaws.com:5439/staging for staging and
jdbc:redshift://analytics-prod.c3daf8xd7gja.us-east-1.redshift.amazonaws.com:5439/prod for prod.
- Get the login credentials from the Scryptkeepers.
- Click “OK” to connect. Test your connection with a simple query like:
SELECT 1
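Once that returns, a slightly more useful smoke test is to count recent events for one of the games (replace <game_skuid> with a real SKU id):
SELECT DATE(ts) AS day, COUNT(*) AS events
FROM rawdata_<game_skuid>_current
GROUP BY 1
ORDER BY 1 DESC
LIMIT 7;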
ETL
There are machines in our AWS cloud that are responsible for gathering the analytics sent up from user devices and pushing them into the data warehouse in Redshift. This process is called ETL. We have two ETL machines connected via a load balancer that are both responsible for running a server that accepts incoming analytic events from devices. The code for the node server and associated ETL scripts is located in beanstalk under the magic-etl repository.
The flow of an analytic event is as follows:
- Analytic events are stored as JSON on the device and are periodically sent up to https://magic-analytics.magmic.com/v2/magic.php
- The load balancer distributes the requests between the ETL machines running a node server
- The node server saves these analytic events on the local machine in a temporary folder by skuid
- A cronjob running on the machine periodically uploads the contents of the temporary analytic events folder to the magic-etl bucket on S3 (s3push.js script)
- A cronjob running specifically on the ETL-1 machine grabs a batch of events from the magic-etl bucket, processes them into a CSV file, and places it temporarily in magic-etl-csv. It then performs a COPY command in Redshift to ingest this CSV data into the data warehouse (s3JSONtoCSV.js script). A sketch of this COPY appears after the list.
- To avoid race conditions, only one machine performs this operation
- Basic processing is done on each analytic event, such as tossing out old, invalid, or otherwise incompatible event data
- Analytic events that are moving through the s3 buckets are not stored indefinitely. Check the individual buckets’ expiry settings (either 30 or 7 days).
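For reference, the ingest step's COPY looks roughly like the statement below. This is a sketch only: the file path is a placeholder and the exact formatting, compression, and error-handling options are defined in the s3JSONtoCSV.js script.
-- Sketch only; placeholder file path and credentials
COPY rawdata_<game_skuid>_current
FROM 's3://magic-etl-csv/<batch_file>.csv'
CREDENTIALS 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>'
CSV;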
Maintenance
Adding New Games to Redshift
Adding a new game through the Hope dashboard automatically adds the corresponding Redshift tables, but if a table needs to be created manually (for example, for an older game), here is how:
CREATE TABLE rawdata_<game_skuid>_current (
mid varchar(64) not null encode zstd,
rid varchar(1024) encode zstd,
api varchar(12) not null encode zstd,
pid varchar(64) not null encode zstd,
did varchar(64) not null encode zstd,
os varchar(64) not null encode zstd,
d varchar(64) not null encode zstd,
a varchar(12) not null encode zstd,
ver varchar(24) not null encode zstd,
c varchar(24) not null encode zstd,
lg varchar(24) not null encode zstd,
ab varchar(8) encode zstd,
uid varchar(128) encode zstd,
sid varchar(64) not null encode zstd,
tz varchar(8) encode zstd,
ts timestamp sortkey encode zstd,
tig int not null encode zstd,
n varchar(36) encode zstd,
st1 varchar(128) encode zstd,
st2 varchar(128) encode zstd,
st3 varchar(128) encode zstd,
v varchar(128) encode zstd,
ut varchar(1024) encode zstd,
p1 varchar(128) encode zstd,
p2 varchar(128) encode zstd,
p3 varchar(128) encode zstd,
p4 varchar(128) encode zstd,
p5 varchar(128) encode zstd,
p6 varchar(128) encode zstd,
p7 varchar(128) encode zstd,
p8 varchar(128) encode zstd,
p9 varchar(128) encode zstd,
p10 varchar(128) encode zstd
)
DISTSTYLE EVEN;
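Once the table exists, one way to confirm that the sortkey and column encodings were applied is to query the pg_table_def catalog view (this assumes the table is in the public schema and on your search path):
SELECT "column", type, encoding, sortkey
FROM pg_table_def
WHERE tablename = 'rawdata_<game_skuid>_current';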
Archiving Data
Archiving old data has several benefits:
- Making your active data tables smaller increases query performance
- Data is still retained and available for queries if needed
- Archived tables are much easier to move into more cost effective storage solutions if space needs to be freed up
- We can continue to stay under our current Periscope package’s row count by continually archiving the inactive data
Our previous approach to archiving was to run an UNLOAD command, which copied the data into the magic-redshift-unloads S3 bucket; we would then delete the backed-up rows from the table and run a VACUUM DELETE ONLY on it to reclaim space.
Now that we better understand how column compression in Redshift works and how Periscope accesses tables, we use an improved approach based on time-series tables.
A script now automatically runs through the rawdata_<game_skuid>_current table for each game, slicing up the data and inserting it into the corresponding time-series tables of the form rawdata_<game_skuid>_YYYYMM.
For example, here is how to manually create a time-series table for Scattergories Android containing the February 2017 data (that is, rows with ts from 2017/02/01 up to but not including 2017/03/01).
-- Creating a time-series or archive table.
-- Using the LIKE version of this operation preserves the sortkey, distkey, and column compression settings of the original table.
-- NOTE the periscopedata user does not automatically get select privileges on this table (which is good)
CREATE TABLE IF NOT EXISTS rawdata_ec50174c33f5430c3b35d5d8_201702(LIKE rawdata_ec50174c33f5430c3b35d5d8_current);
INSERT INTO rawdata_ec50174c33f5430c3b35d5d8_201702 (
SELECT *
FROM rawdata_ec50174c33f5430c3b35d5d8_current
WHERE
DATE(ts) >= '2017/02/01' AND
DATE(ts) < '2017/03/01'
);
ANALYZE rawdata_ec50174c33f5430c3b35d5d8_201702;
Note that the name of the time-series table tells you which month of data it contains. Now that the archive table is populated, the active data table can be cleaned up.
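Before deleting anything, it is worth a quick sanity check that the archive holds exactly the rows you are about to remove; for example:
-- These two counts should match before running the delete below
SELECT COUNT(*) FROM rawdata_ec50174c33f5430c3b35d5d8_201702;
SELECT COUNT(*) FROM rawdata_ec50174c33f5430c3b35d5d8_current
WHERE DATE(ts) >= '2017/02/01' AND DATE(ts) < '2017/03/01';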
-- Delete the old rows from the active table
DELETE FROM rawdata_ec50174c33f5430c3b35d5d8_current WHERE DATE(ts) >= '2017/02/01' AND DATE(ts) < '2017/03/01';
-- Vacuuming with 'DELETE ONLY' reclaims the space from the delete
VACUUM DELETE ONLY rawdata_ec50174c33f5430c3b35d5d8_current;
-- It can be a good idea to 'ANALYZE' a table after a major change to its contents.
ANALYZE rawdata_ec50174c33f5430c3b35d5d8_current;
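To confirm that the space was actually reclaimed, the svv_table_info system view reports per-table size and row counts:
-- size is the number of 1 MB blocks the table occupies
SELECT "table", size, tbl_rows
FROM svv_table_info
WHERE "table" = 'rawdata_ec50174c33f5430c3b35d5d8_current';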
Backing up Archives
If more space is needed on the Redshift cluster, archived time-series tables can be backed up to S3 and removed with the following statements:
UNLOAD ('SELECT * FROM rawdata_ec50174c33f5430c3b35d5d8_201702')
TO 's3://magic-redshift-unloads/2018-03-01/Scattergories_Android_'
CREDENTIALS 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>'
GZIP
ESCAPE
MANIFEST;
DROP TABLE rawdata_ec50174c33f5430c3b35d5d8_201702;
Reloading Backups
If a backed-up table needs to be re-imported into Redshift, it can be done with the following statements:
-- First, create the table to copy the backup data into.
CREATE TABLE rawdata_ec50174c33f5430c3b35d5d8_201702(LIKE rawdata_ec50174c33f5430c3b35d5d8_current);
COPY rawdata_ec50174c33f5430c3b35d5d8_201702
FROM 's3://magic-redshift-unloads/2018-03-01/Scattergories_Android_manifest'
WITH CREDENTIALS 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>'
GZIP
ESCAPE
MANIFEST
TRUNCATECOLUMNS -- Optional; prevents the COPY from failing if a value is too long for its column.
;
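If the COPY aborts or loads fewer rows than expected, the most recent load errors can be inspected through Redshift's stl_load_errors system table:
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;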
The schema of much older archives sometimes differs from the current one; for example, the rid column was added relatively recently. In that case, we have to specify the name and order of the columns present in the archived data. Any additional columns in the table will be filled in with their default values.
CREATE TABLE rawdata_ec50174c33f5430c3b35d5d8_20161107_s3archive(LIKE rawdata_ec50174c33f5430c3b35d5d8_current);
COPY rawdata_ec50174c33f5430c3b35d5d8_20161107_s3archive(mid, api, pid, did, os, d, a, ver, c, lg, ab, uid, sid, tz, ts, tig, n, st1, st2, st3, v, ut, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10)
FROM 's3://magic-redshift-unloads/2016-11-07/Scattergories_Android_manifest'
WITH CREDENTIALS 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>'
-- GZIP -- the really old archives might not have been created with the GZIP option, so we have to remove this
ESCAPE
MANIFEST
TRUNCATECOLUMNS -- Optional; prevents the COPY from failing if a value is too long for its column.
;
Creating Time-Series Views
A SQL view can be created that combines several time-series archives with UNION ALL; queries against the view will run against all of the tables it includes.
CREATE OR REPLACE VIEW scattergories_android_2017_06_01_to_present AS (
SELECT * FROM rawdata_ec50174c33f5430c3b35d5d8_current
UNION ALL
SELECT * FROM rawdata_ec50174c33f5430c3b35d5d8_201708
UNION ALL
SELECT * FROM rawdata_ec50174c33f5430c3b35d5d8_201707
UNION ALL
SELECT * FROM rawdata_ec50174c33f5430c3b35d5d8_201706
);
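For example, a monthly event count like the one below runs transparently across the _current table and every archived month included in the view:
SELECT DATE_TRUNC('month', ts) AS month, COUNT(*) AS events
FROM scattergories_android_2017_06_01_to_present
GROUP BY 1
ORDER BY 1;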
Periscope
Periscope Data is a data visualization platform. We use Periscope's tools to display our analytics data in more digestible and meaningful ways. Periscope's pricing tier is determined by the number of rows in our databases that it has read access to. We control how much data is exposed to Periscope by limiting the tables the periscope user has been granted access to; in practice, we grant it access to a single time-series view per game, which we can redefine at any time to cover whatever range of months we want.
The nightly maintenance script re-creates a view specifically for Periscope, named periscope_<game_skuid>, which includes the _current table, the current month's time-series table, and a configurable number of previous time-series tables as defined in a configuration file (soon to be moved into a manifest for ETL).
Manually give the Periscope user access to a table with this query:
GRANT SELECT ON periscope_<game_skuid> TO periscopedata;
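To confirm the grant took effect, you can check it with the has_table_privilege function (the view name here is just the pattern used above):
SELECT has_table_privilege('periscopedata', 'periscope_<game_skuid>', 'select');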
If you set up a new cluster, you will first need to create the Periscope user and grant it access to the public schema.
CREATE USER periscopedata PASSWORD '<can be found in lastpass>';
GRANT USAGE ON SCHEMA public TO periscopedata;
Nightly Maintenance Jobs
Nightly maintenance operations are run on the Redshift tables by our ETL server. These operations include the following (a combined sketch of the per-game SQL appears after the list):
- Generating and/or updating time-series tables from the data in each rawdata_<game_skuid>_current table.
- Running a VACUUM and ANALYZE on every rawdata_<game_skuid>_current table and on any time-series table that was modified.
- Re-creating the Periscope views.
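Taken together, a rough sketch of the per-game SQL that such a nightly pass issues might look like the following. This is illustrative only, assuming February 2018 as the current month and one retained previous month; the real month boundaries, retention count, and incremental-update logic are determined by the ETL scripts and their configuration.
-- Illustrative sketch only; the authoritative logic lives in the magic-etl scripts
CREATE TABLE IF NOT EXISTS rawdata_<game_skuid>_201802 (LIKE rawdata_<game_skuid>_current);
INSERT INTO rawdata_<game_skuid>_201802 (
    SELECT *
    FROM rawdata_<game_skuid>_current
    WHERE DATE(ts) >= '2018/02/01' AND DATE(ts) < '2018/03/01'
    -- Simplified date-range insert; the real job's incremental logic is not shown here
);
VACUUM rawdata_<game_skuid>_current;
ANALYZE rawdata_<game_skuid>_current;
VACUUM rawdata_<game_skuid>_201802;
ANALYZE rawdata_<game_skuid>_201802;
-- Re-create the Periscope view over the _current table plus the retained months
CREATE OR REPLACE VIEW periscope_<game_skuid> AS (
    SELECT * FROM rawdata_<game_skuid>_current
    UNION ALL
    SELECT * FROM rawdata_<game_skuid>_201802
    UNION ALL
    SELECT * FROM rawdata_<game_skuid>_201801
);
GRANT SELECT ON periscope_<game_skuid> TO periscopedata;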