
Descriptive Statistics in SQL

I recently wrote an article on why you should learn SQL for data analysis. I received lots of feedback on that post: people wanted to see some concrete examples of using SQL for data analysis. I decided to apply some of my own advice and load up one of those awesome datasets to run some basic data exploration queries on it.

For this post, I am using an open data set. The Million Song Dataset is a «freely-available collection of audio features and metadata for a million contemporary popular music tracks».

If you want to follow along, here are the steps:

  1. Download the Million Song Dataset. I used the 10K songs subset.
  2. Download a SQLite client.
  3. Open the subset_track_metadata.db file with SQLite and start exploring.

Why Descriptive Statistics?

When I start the analysis of a new data set, I run some basic queries to get a sense of how the data is organized and how the values are distributed. I am trying to understand what I’m dealing with. For numerical data, I will often run some descriptive statistics on the data set: I will measure the central tendency (mean, median, mode) and the level of variability of the data. Those measurements are generally a good start to data exploration. They will often lead to new questions that fuel my analysis.

Central Tendency

Mean

The mean is the number you obtain when you sum up a given set of numbers and then divide this sum by how many numbers are in the set. The mean is very sensitive to outliers: it can be drastically affected by values that are much higher or lower than the rest of the data set.

SELECT CAST(AVG(songs.year) as int) as avg_year FROM songs

-- | avg_year |
-- |----------|
-- | 934      |
  • CAST: Run-time data type conversion between compatible data types. In this case, I’m converting a float to an integer for rounding purposes.
  • AVG: Aggregation function that returns the mean of the input expression value.
  • as avg_year: Temporarily renames a column heading. This is only for readability purposes, to make the code more human-friendly. I will use this aliasing throughout the post.
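
That sensitivity shows up right here: 934 is not a plausible release year. The rows with a missing year, stored as 0, drag the mean way down. As a sketch, here is the same query with those rows excluded (output omitted; run it against your own copy of the data set):

SELECT CAST(AVG(songs.year) as int) as avg_year
FROM songs
WHERE songs.year != 0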

Median

The median is the number separating the higher half of an ordered data set from the lower half. The median is sometimes a better measure of the midpoint because each data point is weighted equally: unlike the mean, it is not dragged around by a few extreme values.

SELECT songs.year as median_year
FROM songs 
ORDER BY songs.year 
LIMIT 1 
OFFSET (SELECT COUNT(*) FROM songs) / 2

-- | median_year |
-- |-------------|
-- | 0           |
  • ORDER BY: Sorts the resulting data set by one or more columns. The ordering can be ascending ASC (default) or descending DESC.
  • COUNT: Aggregation function that returns the number of rows matching a criterion.
  • LIMIT: Specifies the maximum number of rows that can be returned from the resulting data set.
  • OFFSET: Skip X rows before beginning to return rows. In this specific example, we are skipping 5000 rows (which is equal to the total row count COUNT(*) divided by 2).
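
One caveat: when the row count is even, this query returns a single middle row rather than the average of the two middle values, which is the strict definition of the median. A sketch that handles both even and odd counts by averaging the middle row(s):

SELECT AVG(middle.year) as median_year
FROM (
    SELECT songs.year
    FROM songs
    ORDER BY songs.year
    LIMIT 2 - (SELECT COUNT(*) FROM songs) % 2
    OFFSET ((SELECT COUNT(*) FROM songs) - 1) / 2
) as middle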

Mode

The mode is the value that appears most often in a set of data.

SELECT 
    songs.year,
    COUNT(*) as count
FROM songs
GROUP BY songs.year
ORDER BY COUNT(*) DESC
LIMIT 1

-- | year | count |
-- |------|-------|
-- | 0    | 5320  |
  • GROUP BY: Used with aggregation functions such as COUNT, AVG, etc. Groups the resulting data set by one or more columns.
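
Note that if several years were tied for the highest count, LIMIT 1 would keep only one of them, chosen arbitrarily. A sketch that returns every tied value instead, using HAVING to compare each group against the highest count:

SELECT songs.year, COUNT(*) as count
FROM songs
GROUP BY songs.year
HAVING COUNT(*) = (
    SELECT MAX(year_count)
    FROM (SELECT COUNT(*) as year_count FROM songs GROUP BY songs.year)
)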

Variability

Min/Max Values

Minimum/Maximum value in a data set.

SELECT 
    MIN(songs.year) as min_year,
    MAX(songs.year) as max_year
FROM
    songs

-- | min_year | max_year |
-- |----------|----------|
-- | 0        | 2010     |
  • MIN: Aggregation function that returns the smallest value in a data set.
  • MAX: Aggregation function that returns the largest value in a data set.
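
SQLite ships no standard deviation aggregate, but the population variance can be computed with AVG alone, using the fact that the variance is the mean of the squares minus the square of the mean. A sketch (take the square root outside of SQLite, or with the SQRT() function if your SQLite build includes the optional math functions):

SELECT AVG(songs.year * songs.year) - AVG(songs.year) * AVG(songs.year) as variance_year
FROM songs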

Distribution of songs per year

Count of songs released in each year.

SELECT 
    songs.year,
    COUNT(*) as song_count
FROM songs
GROUP BY songs.year
ORDER BY songs.year ASC

-- | year | song_count |
-- |------|------------|
-- | 0    | 5320       |
-- | 1926 | 2          |
-- | 1927 | 3          |
-- | ...  | ...        |
-- | 2009 | 250        |
-- | 2010 | 64         |

Next Steps

The SQL queries in this post are fairly simple. They are not the result of technical gymnastics. They are just simple measurements that help us understand the data set, and that’s what’s great about them.

In this specific data set, we noticed that for more than half of the rows, the year of the song is equal to 0. This means we are either looking at a data set of very old songs or we are dealing with missing values. The latter is more realistic. If we filter out songs from year 0, our data makes more sense: the songs range from 1926 to 2010 and the median year is 2001.
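
As a sketch, here is the earlier median query with the year 0 rows filtered out:

SELECT songs.year as median_year
FROM songs
WHERE songs.year != 0
ORDER BY songs.year
LIMIT 1
OFFSET (SELECT COUNT(*) FROM songs WHERE songs.year != 0) / 2

-- | median_year |
-- |-------------|
-- | 2001        |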

With the data partially cleaned up, we can start exploring other columns of our data set and asking ourselves more questions: How many unique artists composed songs each year? How has that evolved through time? Are songs shorter nowadays than before? What’s great about those simple measurements is that they can be reused as our queries and filters get more and more complex. By applying these simple descriptive statistics, we get a good grasp of the data and we can keep exploring deeper and deeper. The first of those questions is sketched below.
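
COUNT combined with DISTINCT does the job, assuming the songs table carries an artist_name column, as the Million Song subset does:

SELECT 
    songs.year,
    COUNT(DISTINCT songs.artist_name) as unique_artists
FROM songs
WHERE songs.year != 0
GROUP BY songs.year
ORDER BY songs.year ASC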

SQL for Data Analysis

I’ve had a few discussions over the past months with people wanting to get into the field of data analysis. One of the most frequent questions I get is: “What programming language should I learn to get my first data analysis job? Should I learn R or Python?”. My answer to them is: none of these for now. First learn SQL.

Why SQL?

SQL is the most common denominator for data analysis. It’s a special-purpose programming language designed to interact with databases. It is ubiquitous. As a data analyst you will inevitably interact with an SQL interface on a frequent basis.

SQL is a simple language that gets the job done. You can learn its basics in a few hours, and with those concepts you can powerfully wrangle your data, whether it’s a few rows or millions of data points.

With the rise of SQL-based business intelligence tools such as Looker, Periscope and Mode Analytics, it is now, more than ever, an indispensable tool for a data analyst. The big data ecosystem is also ruled by SQL: Hadoop, Redshift and other massively parallel processing data warehouses expose SQL interfaces. As a data analyst, you have no excuse not to master it.

Get your hands dirty

As with many skills, the best way to learn SQL is by getting your hands dirty. To get started, I’d suggest going through Learn SQL The Hard Way. While the book is not complete yet, it is a good hands-on reference. It helps you get set up with a local version of SQLite and teaches you some basic commands to create, retrieve, update and delete data, as sketched below.
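
To give a taste of those four operations, here is a minimal sketch against a throwaway table; the table and values are illustrative, not taken from the book:

-- Create: define a table and add a row
CREATE TABLE plays (id INTEGER PRIMARY KEY, track TEXT, listens INTEGER);
INSERT INTO plays (track, listens) VALUES ('Intro', 42);
-- Retrieve: read rows back with a filter
SELECT * FROM plays WHERE listens > 10;
-- Update: modify existing rows in place
UPDATE plays SET listens = listens + 1 WHERE track = 'Intro';
-- Delete: remove rows matching a condition
DELETE FROM plays WHERE listens = 0;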

Once you get more familiar with SQL, load up one of those awesome public data sets in your local SQLite instance. Pick a dataset that you are interested in, formulate some questions about it and analyze it. Publish the results of your analysis, including your code, on Github. This will let you practice your analysis and SQL skills while building your own portfolio. Even if it’s not perfect, it’s great to start building assets out in the wild.

Getting Started with Statistics for Data Science

Learning statistics can be a daunting journey for aspiring data scientists who do not come from a quantitative field. Whether you are a computer science undergrad, a developer seeking a career change, or an MBA graduate, the statistical part of data science often seems the most intimidating. As a business school graduate, it was for me.

Statistics is a serious discipline; some people spend their lives studying it. As an aspiring data scientist, how should you approach learning stats? What do you need to know? What’s the best way to learn about it? Here’s how you should go about this.

Start Simple

You can get tremendous value from understanding simple statistical concepts. In many data science projects, you don’t need advanced stats knowledge to draw significant conclusions. For this reason, you should focus on learning the basics of statistics, applying them to your work and expanding from there.

The two main branches of statistics that you need to know are descriptive statistics and inferential statistics. You can get a ton of value by understanding them properly.

Descriptive Statistics

Descriptive statistics quantitatively describe a collection of information. They summarize the observed data. Unlike inferential statistics, they do not deduce facts about the greater population; they only describe the collected data set.

You have surely interacted with those statistics in the past. Some common measurements in descriptive statistics gauge the central tendency (mean, median, mode…) and others the variability (standard deviation…) of the data set.

Inferential Statistics

Inferential statistics enable us to infer properties about a population based on a sample data set. They use the sample to form conclusions beyond the collected data.

In practical data science, inferential statistics are heavily used when comparing conversion rates, analyzing an experiment such as an A/B test, etc.
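
The first step of such an analysis often starts in SQL itself. A sketch against a hypothetical experiments table, where each row carries a variant label and a 0/1 converted flag, so that AVG yields the conversion rate per variant:

SELECT 
    variant,
    COUNT(*) as users,
    AVG(converted) as conversion_rate
FROM experiments
GROUP BY variant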

Online Courses

For me, online classes worked like a charm to learn the basics.

These classes are interactive and include exercises and videos. I find they are a very good way to get started in this field. They will provide you with just enough knowledge to start getting more comfortable with statistics.

Books

On a general note, I recommend the book Naked Statistics: Stripping the Dread from the Data. This book by Charles Wheelan covers, among other topics, descriptive and inferential statistics, and provides a good overview of each field. It demystifies statistics through some very concrete and cheerful examples.

Build from there

Remember, the best way to learn these concepts is by applying your knowledge to concrete examples. Once you have started to integrate these concepts into your analyses, I recommend you pick up a statistics manual, such as All of Statistics, and deepen your knowledge.

Goals for 2016

Personal

Running & Health

  • Run all year long, at least twice per week.
  • Run 365 miles during the year.
  • Run one half-marathon in 1h40 or below.
  • Run a marathon in 3h40 or below.
  • Run an ultramarathon.
  • Target Weight: 166-170 pounds

Finance

  • Save 24% of annual salary

Writing/Reading

  • Write 12 essays, at least one per month.
  • Read 12 books, at least one per month.

Productivity

  • Measure how much time I am wasting watching/reading useless stuff.
  • Reduce the amount of time I spend on this stuff by 50%.

Professional

Skills

  • Take 2 classes about Machine Learning
  • Develop skills in Customer Development

Business

  • Experiment with at least 3 business ideas.
  • Generate $1,000 in revenue with a business project.

Work

  • Measure time spent in meetings and reduce it.

Think Smaller

For the last few years, I’ve been participating in Hackathons, a type of event where hackers and designers meet to build a project in a limited amount of time. I’ve seen awesome products get built, but I have also witnessed a great deal of failures. These failures were generally not caused by a lack of work or talent in the team. These projects failed because the team aimed too high: their vision was so big they didn’t know where to start, resulting in an unclear product solving no problem.

As a first step to any project, ask yourself what its core element is. What is the one thing your app needs to do to be useful? Think small. You will have to say no to many features in order to stay focused. As humans, we get motivated when we feel we’re having an impact. Building a product in small steps is incredibly motivating: a small feature that is going to affect people tomorrow is a lot more engaging to work on than one that will be released a year from now. Furthermore, taking small steps is a great way of minimizing the risk of building stuff you don’t need.

This philosophy of taking small, realistic steps is not limited to Hackathons or web products. You can apply it to almost any kind of endeavour. Whatever you have in mind right now, think about the smallest first step you could take to get closer to your goal. You want to build an international food chain? Why not start by cooking a meal for your friends? Don’t try to drink the river.