This evening, I spent some time looking into how engineering/data-proud companies like Netflix and Spotify do their data analytics.
Here are my notes on "DATA & ANALYTICS - #bigwins with BigQuery @Spotify"
Spotify has the largest Hadoop cluster in Europe
- Technologies: Scalding, Hive (SQL-like queries, for ad-hoc analysis), Apache Crunch, Python MapReduce (for stuff that's scheduled)
This presented two main problems
- Speed of iteration was VERY slow
- Spotify required QUICK analysis (think bigger stars like Taylor Swift) but this became impossible--the queries themselves were taking an hour.
- Complex analysis would span several days. I feel this…even with smaller databases, good analysis takes time. I can only imagine how frustrating it can be when you begin to explore data structures like these.
- When Taylor Swift's music was removed from Spotify…what happened? "I want to understand what happened with the users when something bad happened." This means creating hypotheses (Churn rate? Users left to explore other replacement artists?)
- High resource cost when debugging failures--like a bad cron job that takes resources you can never get back.
When using this on-premise cluster, Spotify also needed to maintain tools:
- Spotify built internal web UIs that let analysts go to a web site, build their own SQL query, and execute it
- The downside of this is building products for internal use--all the resource constraints and tech debt that this brings.
Platform Spotify used to solve these problems? BigQuery
- Fast, SQL (no learning curve), and no infrastructure to manage
- Permissions managed through Google Groups
- "Data Commons" is the main data set, owned by multiple teams, you can trust the data…this is free for everyone.
- Additionally, there are projects for specific teams--like their own little sandboxes, ad-hoc data sets for debugging, analysis. Freedom to do this without ruining the Data Commons.
- Data is still going to HDFS, but from there it goes to GCS (Google Cloud Storage), and from there, to BigQuery.
- BigQuery is an easy place for people to just run their own SQL!
- BigQuery will also work with Jupyter notebooks--there are plugins to make this easy.
- (Google Datalab was not used as the notebook environment; Jupyter was chosen just because of familiarity)
- Workshops on BigQuery.
- I can host these!!
- Google also visited Spotify to help
- Onboarding material for new employees (this has gone well for Spotify--it allows new devs, PMs, etc. to immediately begin understanding the data).
- Create a learning culture
- Slack, BigQuery reference, StackOverflow
There's some ultra-techy stuff to consider…I'll come back to this when it matters.
Common Query Types
- "I want to do a social targeting campaign--give me all Foo Fighters superfans in Australia"
- "How many 18-30 year old hip hop fans are there in the US? What platforms do hip-hop fans like to use?"
- It used to take a lot of time waiting for Hive to complete, moving data between parties, etc…
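To make the "superfans in Australia" question concrete, here is a minimal sketch of what such a query might look like. Everything in it is an assumption on my part: the `plays` table, its columns, and the 100-play "superfan" threshold are hypothetical, and an in-memory SQLite database stands in for BigQuery so the example runs anywhere. The talk did not show Spotify's actual schema or queries.

```python
import sqlite3

# Hypothetical schema: one row per (user, artist) with a play count.
# SQLite is only a local stand-in for BigQuery here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays (user_id TEXT, country TEXT, artist TEXT, play_count INTEGER)"
)
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?, ?)",
    [
        ("u1", "AU", "Foo Fighters", 250),
        ("u2", "AU", "Foo Fighters", 12),
        ("u3", "US", "Foo Fighters", 900),
        ("u4", "AU", "Taylor Swift", 400),
        ("u5", "AU", "Foo Fighters", 130),
    ],
)

SUPERFAN_MIN_PLAYS = 100  # assumed threshold, not from the talk


def superfans(conn, artist, country, min_plays=SUPERFAN_MIN_PLAYS):
    """Return (user_id, total_plays) for heavy listeners of one artist in one country."""
    return conn.execute(
        """
        SELECT user_id, SUM(play_count) AS total_plays
        FROM plays
        WHERE artist = ? AND country = ?
        GROUP BY user_id
        HAVING SUM(play_count) >= ?
        ORDER BY total_plays DESC
        """,
        (artist, country, min_plays),
    ).fetchall()


result = superfans(conn, "Foo Fighters", "AU")
print(result)
```

The point is less the SQL itself than the turnaround: a question like this is a single short query, so waiting an hour for it is what hurts.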
Web app built!
- Audience Insights
- Web App -> Request Filters
- Freed up time for analysts, and empowered teams to answer their own damn questions :)
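The "Web App -> Request Filters" idea can be sketched roughly like this: the app takes a handful of filters from a form and turns them into a parameterized SQL query. This is my own guess at the shape of it, not how Audience Insights is actually built. The filter names, the `users` table, and the column names are all hypothetical.

```python
from typing import Any

# Hypothetical filter names mapped to hypothetical (column, operator) pairs.
# Only filters listed here are accepted, so user input never becomes raw SQL.
ALLOWED_FILTERS = {
    "country": ("country", "="),
    "genre": ("favourite_genre", "="),
    "min_age": ("age", ">="),
    "max_age": ("age", "<="),
}


def build_audience_query(filters: dict[str, Any]) -> tuple[str, list[Any]]:
    """Turn request filters into a SQL string with positional placeholders plus its parameters."""
    clauses = []
    params = []
    for name, value in filters.items():
        if name not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {name}")
        column, op = ALLOWED_FILTERS[name]
        clauses.append(f"{column} {op} ?")
        params.append(value)
    sql = "SELECT COUNT(*) AS audience_size FROM users"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# "How many 18-30 year old hip hop fans are there in the US?"
sql, params = build_audience_query(
    {"country": "US", "genre": "hip hop", "min_age": 18, "max_age": 30}
)
print(sql)
print(params)
```

Keeping the values in `params` rather than pasting them into the string is the usual way to avoid SQL injection, and it means teams only ever pick from filters the analysts have already vetted.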