Jean-Nicholas Hould Data Science & Business Bootstrapping

Tidy Data in Python

I recently came across a paper named Tidy Data by Hadley Wickham. Published back in 2014, the paper focuses on one aspect of data cleaning, tidying data: structuring datasets to facilitate analysis. Throughout the paper, Wickham demonstrates how any dataset can be structured in a standardized way prior to analysis. He presents in detail the different types of datasets and how to wrangle them into a standard format.

As a data scientist, I think you should get very familiar with this standardized dataset structure. Data cleaning is one of the most frequent tasks in data science. No matter what kind of data you are dealing with or what kind of analysis you are performing, you will have to clean the data at some point. Tidying your data into a standard format makes things easier down the road: you can reuse a standard set of tools across your different analyses.

In this post, I will summarize some tidying examples Wickham uses in his paper and I will demonstrate how to do so using the Python pandas library.

Defining tidy data

The structure Wickham defines as tidy has the following attributes:

  • Each variable forms a column and contains values
  • Each observation forms a row
  • Each type of observational unit forms a table

A few definitions:

  • Variable: A measurement or an attribute. Height, weight, sex, etc.
  • Value: The actual measurement or attribute. 152 cm, 80 kg, female, etc.
  • Observation: All values measured on the same unit. Each person, for example.

An example of a messy dataset:

Treatment A Treatment B
John Smith - 2
Jane Doe 16 11
Mary Johnson 3 1

An example of a tidy dataset:

Name Treatment Result
John Smith a -
Jane Doe a 16
Mary Johnson a 3
John Smith b 2
Jane Doe b 11
Mary Johnson b 1
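The transformation from the first table to the second is a melt: the treatment values move out of the column headers and into their own column. A minimal sketch with pandas (the column names here are illustrative, and `None` stands in for the missing value):

```python
import pandas as pd

# Messy: treatment values live in the column headers
messy = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"],
    "a": [None, 16, 3],
    "b": [2, 11, 1],
})

# Tidy: one row per (name, treatment) observation
tidy = pd.melt(messy, id_vars=["name"],
               var_name="treatment", value_name="result")
print(tidy)
```

Each of the six (name, treatment) pairs becomes its own row, which is exactly the tidy layout shown above.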

Tidying messy datasets

Through the following examples extracted from Wickham’s paper, we’ll wrangle messy datasets into the tidy format. The goal here is not to analyze the datasets but rather prepare them in a standardized way prior to the analysis. These are the five types of messy datasets we’ll tackle:

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column.
  • Variables are stored in both rows and columns.
  • Multiple types of observational units are stored in the same table.
  • A single observational unit is stored in multiple tables.

Note: All of the code presented in this post is available on GitHub.

Column headers are values, not variable names

Pew Research Center Dataset

This dataset explores the relationship between income and religion.

Problem: The column headers are composed of the possible income values.

import pandas as pd
import datetime
from os import listdir
from os.path import isfile, join
import glob
import re

df = pd.read_csv("./data/pew-raw.csv")
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
Agnostic 27 34 60 81 76 137
Atheist 12 27 37 52 35 70
Buddhist 27 21 30 34 33 58
Catholic 418 617 732 670 638 1116
Dont know/refused 15 14 15 11 10 35
Evangelical Prot 575 869 1064 982 881 1486
Hindu 1 9 7 9 11 34
Historically Black Prot 228 244 236 238 197 223
Jehovahs Witness 20 27 24 24 21 30
Jewish 19 19 25 25 30 95

A tidy version of this dataset is one in which the income values are not column headers but rather values in an income column. In order to tidy this dataset, we need to melt it. The pandas library has a built-in function that allows us to do just that: it “unpivots” a DataFrame from a wide format to a long format. We’ll reuse this function a few times throughout the post.

formatted_df = pd.melt(df,
                       id_vars=["religion"],
                       var_name="income",
                       value_name="freq")
formatted_df = formatted_df.sort_values(by=["religion"])

This outputs a tidy version of the dataset:

religion income freq
Agnostic <$10k 27
Agnostic $30-40k 81
Agnostic $40-50k 76
Agnostic $50-75k 137
Agnostic $10-20k 34
Agnostic $20-30k 60
Atheist $40-50k 35
Atheist $20-30k 37
Atheist $10-20k 27
Atheist $30-40k 52

Billboard Top 100 Dataset

This dataset represents the weekly rank of songs for the 75 weeks after they first enter the Billboard Top 100.


Problems:

  • The column headers are composed of values: the week number (x1st.week, …)
  • If a song is in the Top 100 for less than 75 weeks, the remaining columns are filled with missing values.
df = pd.read_csv("./data/billboard.csv", encoding="mac_latin2")
year artist.inverted track time genre date.entered date.peaked x1st.week x2nd.week ...
2000 Destiny's Child Independent Women Part I 3:38 Rock 2000-09-23 2000-11-18 78 63.0 ...
2000 Santana Maria, Maria 4:18 Rock 2000-02-12 2000-04-08 15 8.0 ...
2000 Savage Garden I Knew I Loved You 4:07 Rock 1999-10-23 2000-01-29 71 48.0 ...
2000 Madonna Music 3:45 Rock 2000-08-12 2000-09-16 41 23.0 ...
2000 Aguilera, Christina Come On Over Baby (All I Want Is You) 3:38 Rock 2000-08-05 2000-10-14 57 47.0 ...
2000 Janet Doesn't Really Matter 4:17 Rock 2000-06-17 2000-08-26 59 52.0 ...
2000 Destiny's Child Say My Name 4:31 Rock 1999-12-25 2000-03-18 83 83.0 ...
2000 Iglesias, Enrique Be With You 3:36 Latin 2000-04-01 2000-06-24 63 45.0 ...
2000 Sisqo Incomplete 3:52 Rock 2000-06-24 2000-08-12 77 66.0 ...
2000 Lonestar Amazed 4:25 Country 1999-06-05 2000-03-04 81 54.0 ...

A tidy version of this dataset is one without the week numbers as columns but rather as values of a single column. In order to do so, we’ll melt the week columns into a single date column. We will create one row per week for each record. If there is no data for the given week, we will not create a row.

# Melting
id_vars = ["year",
           "artist.inverted",
           "track",
           "time",
           "genre",
           "date.entered",
           "date.peaked"]

df = pd.melt(frame=df,id_vars=id_vars, var_name="week", value_name="rank")

# Formatting
df["week"] = df['week'].str.extract(r'(\d+)', expand=False).astype(int)

# Cleaning out unnecessary rows
df = df.dropna()

# Ranks are integers once the missing weeks are removed
df["rank"] = df["rank"].astype(int)

# Create "date" columns
df['date'] = pd.to_datetime(df['date.entered']) + pd.to_timedelta(df['week'], unit='w') - pd.DateOffset(weeks=1)

df = df[["year",
         "artist.inverted",
         "track",
         "time",
         "genre",
         "week",
         "rank",
         "date"]]
df = df.sort_values(ascending=True, by=["year","artist.inverted","track","week","rank"])

# Assigning the tidy dataset to a variable for future usage
billboard = df


A tidier version of the dataset is shown below. There is still a lot of repetition of the song details: the track name, time and genre. For this reason, this dataset is still not completely tidy as per Wickham’s definition. We will address this in the next example.

year artist.inverted track time genre week rank date
2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 1 87 2000-02-26
2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 2 82 2000-03-04
2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 3 72 2000-03-11
2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 4 77 2000-03-18
2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 5 87 2000-03-25
2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 6 94 2000-04-01
2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 7 99 2000-04-08
2000 2Ge+her The Hardest Part Of Breaking Up (Is Getting Ba... 3:15 R&B 1 91 2000-09-02
2000 2Ge+her The Hardest Part Of Breaking Up (Is Getting Ba... 3:15 R&B 2 87 2000-09-09
2000 2Ge+her The Hardest Part Of Breaking Up (Is Getting Ba... 3:15 R&B 3 92 2000-09-16

Multiple types in one table

Following up on the Billboard dataset, we’ll now address the repetition problem of the previous table.


Problem:

  • Multiple observational units (the song and its rank) are stored in a single table.

We’ll first create a songs table which contains the details of each song:

songs_cols = ["year", "artist.inverted", "track", "time", "genre"]
songs = billboard[songs_cols].drop_duplicates()
songs = songs.reset_index(drop=True)
songs["song_id"] = songs.index
year artist.inverted track time genre song_id
2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) 4:22 Rap 0
2000 2Ge+her The Hardest Part Of Breaking Up (Is Getting Ba... 3:15 R&B 1
2000 3 Doors Down Kryptonite 3:53 Rock 2
2000 3 Doors Down Loser 4:24 Rock 3
2000 504 Boyz Wobble Wobble 3:35 Rap 4
2000 98° Give Me Just One Night (Una Noche) 3:24 Rock 5
2000 A*Teens Dancing Queen 3:44 Pop 6
2000 Aaliyah I Don't Wanna 4:15 Rock 7
2000 Aaliyah Try Again 4:03 Rock 8
2000 Adams, Yolanda Open My Heart 5:30 Gospel 9

We’ll then create a ranks table which only contains the song_id, date and the rank.

ranks = pd.merge(billboard, songs, on=["year","artist.inverted", "track", "time", "genre"])
ranks = ranks[["song_id", "date","rank"]]
song_id date rank
0 2000-02-26 87
0 2000-03-04 82
0 2000-03-11 72
0 2000-03-18 77
0 2000-03-25 87
0 2000-04-01 94
0 2000-04-08 99
1 2000-09-02 91
1 2000-09-09 87
1 2000-09-16 92

Multiple variables stored in one column

Tuberculosis Records from the World Health Organization

This dataset documents the count of confirmed tuberculosis cases by country, year, age and sex.


Problems:

  • Some columns contain multiple variables: sex and age.
  • A mixture of zeros and missing values (NaN), due to the data collection process; the distinction is important for this dataset.
df = pd.read_csv("./data/tb-raw.csv")
country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014
AD 2000 0 0 1 0 0 0 0 NaN NaN
AE 2000 2 4 4 6 5 12 10 NaN 3
AF 2000 52 228 183 149 129 94 80 NaN 93
AG 2000 0 0 0 0 0 0 1 NaN 1
AL 2000 2 19 21 14 24 19 16 NaN 3
AM 2000 2 152 130 131 63 26 21 NaN 1
AN 2000 0 0 1 2 0 0 0 NaN 0
AO 2000 186 999 1003 912 482 312 194 NaN 247
AR 2000 97 278 594 402 419 368 330 NaN 121
AS 2000 NaN NaN NaN NaN 1 1 NaN NaN NaN

In order to tidy this dataset, we need to remove the different values from the header and unpivot them into rows. We’ll first need to melt the sex + age group columns into a single one. Once we have that single column, we’ll derive three columns from it: sex, age_lower and age_upper. With those, we’ll be able to properly build a tidy dataset.

df = pd.melt(df, id_vars=["country","year"], value_name="cases", var_name="sex_and_age")

# Extract Sex, Age lower bound and Age upper bound group
tmp_df = df["sex_and_age"].str.extract(r"(\D)(\d+)(\d{2})")

# Name columns
tmp_df.columns = ["sex", "age_lower", "age_upper"]

# Create `age` column based on `age_lower` and `age_upper`
tmp_df["age"] = tmp_df["age_lower"] + "-" + tmp_df["age_upper"]

# Merge 
df = pd.concat([df, tmp_df], axis=1)

# Drop unnecessary columns and rows
df = df.drop(['sex_and_age',"age_lower","age_upper"], axis=1)
df = df.dropna()
df = df.sort_values(ascending=True, by=["country", "year", "sex", "age"])

This results in a tidy dataset.

country year cases sex age
AD 2000 0 m 0-14
AD 2000 0 m 15-24
AD 2000 1 m 25-34
AD 2000 0 m 35-44
AD 2000 0 m 45-54
AD 2000 0 m 55-64
AE 2000 3 f 0-14
AE 2000 2 m 0-14
AE 2000 4 m 15-24
AE 2000 4 m 25-34

Variables are stored in both rows and columns

Global Historical Climatology Network Dataset

This dataset represents the daily weather records for a weather station (MX17004) in Mexico for five months in 2010.


Problem:

  • Variables are stored in both rows (tmin, tmax) and columns (days).
df = pd.read_csv("./data/weather-raw.csv")
id year month element d1 d2 d3 d4 d5 d6 d7 d8
MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN NaN
MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN NaN
MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN NaN
MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN NaN
MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN NaN
MX17004 2010 3 tmin NaN NaN NaN NaN 14.2 NaN NaN NaN
MX17004 2010 4 tmax NaN NaN NaN NaN NaN NaN NaN NaN
MX17004 2010 4 tmin NaN NaN NaN NaN NaN NaN NaN NaN
MX17004 2010 5 tmax NaN NaN NaN NaN NaN NaN NaN NaN
MX17004 2010 5 tmin NaN NaN NaN NaN NaN NaN NaN NaN

In order to make this dataset tidy, we want to move the misplaced variables into individual columns: tmin, tmax and date.

# Melting the day columns into rows
df = pd.melt(df, id_vars=["id", "year", "month", "element"],
             var_name="day_raw", value_name="value")

# Extracting day
df["day"] = df["day_raw"].str.extract(r"d(\d+)", expand=False)
df["id"] = "MX17004"

# To numeric values
df[["year", "month", "day"]] = df[["year", "month", "day"]].apply(pd.to_numeric)

# Creating a date from the different columns
def create_date_from_year_month_day(row):
    return datetime.datetime(year=int(row["year"]), month=int(row["month"]), day=int(row["day"]))

df["date"] = df.apply(lambda row: create_date_from_year_month_day(row), axis=1)
df = df.drop(['year',"month","day", "day_raw"], axis=1)
df = df.dropna()

# Unmelting column "element"
df = df.pivot_table(index=["id","date"], columns="element", values="value")
df.reset_index(drop=False, inplace=True)
id date tmax tmin
MX17004 2010-02-02 27.3 14.4
MX17004 2010-02-03 24.1 14.4
MX17004 2010-03-05 32.1 14.2

One type in multiple tables

Dataset: Illinois Male Baby Names for the year 2014/2015.


Problems:

  • The data is spread across multiple tables/files.
  • The “Year” variable is present in the file name.

In order to load those different files into a single DataFrame, we can run a custom script that will append the files together. Furthermore, we’ll need to extract the “Year” variable from the file name.

def extract_year(string):
    match = re.match(r".+(\d{4})", string)
    if match is not None:
        return match.group(1)

path = './data'
allFiles = glob.glob(path + "/201*-baby-names-illinois.csv")
df_list = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    df.columns = map(str.lower, df.columns)
    df["year"] = extract_year(file_)
    df_list.append(df)
df = pd.concat(df_list)
rank name frequency sex year
1 Noah 837 Male 2014
2 Alexander 747 Male 2014
3 William 687 Male 2014
4 Michael 680 Male 2014
5 Liam 670 Male 2014

Final Thoughts

In this post, I focused on one aspect of Wickham’s paper, the data manipulation part. My main goal was to demonstrate the data manipulations in Python. It’s important to mention that there is a significant section of his paper that covers the tools and visualizations from which you can benefit by tidying your dataset. I did not cover those in this post.

Overall, I enjoyed preparing this post and wrangling the datasets into a streamlined format. The standardized format makes it easier to query and filter the data, to reuse libraries and code across analyses, and to share a dataset with other data analysts.

Getting your first job in data science

A few weeks ago, I came across a great post from David Robinson about his first year as a data scientist at Stack Overflow. The post went into great detail about how David landed his job there and the things he’s been working on since then.

In a section of the post, David advised graduate students who wish to get into data science to create public artifacts. David landed his job partly because of some public artifacts he created: blog posts and answers to questions on StackExchange.

His advice really resonated with me. Many of the great things that happened to me in the last few years are the result of making my work public: meeting new people, landing a job at PasswordBox, creating and selling a side project. The best way to get a job if you don’t have any experience is to make your work public.

Making My Work Public

I learned to code fairly late by tech-world standards. I was 24 years old. At that age, I was temporarily living in Chicago to attend the Starter League, a three-month intensive coding boot camp. As part of the program, we had a final project where we had to form a small team and ship a web project of our choice. At the time, I was already passionate about data: I had a few years of experience as a digital analytics consultant at a creative agency under my belt.

At the Starter League, I met Sam and Enrique, two great guys who eventually became my teammates for the final project. After a few iterations on the idea, we decided to build an analytics platform for Tumblr. We called it MountainMetrics. The project solved a pain Enrique was experiencing managing the Chicago History Museum Tumblr account: tracking the number of followers over time.

At the end of the three-month boot camp, we had a fully functional product and at least one user, the Chicago History Museum. We open sourced the code. Little did I know at the time, but the project would be featured on Hacker News and attract the attention of many interesting people, including the Tumblr engineering team. More importantly, this project helped me land a data science job at PasswordBox.

MountainMetrics wasn’t in any way a technological feat. It was a simple Rails web application that queried data from multiple APIs and reported the data back to the end-user in a sensible way. However, it demonstrated that I had a few very important data science skills: I could ship, I was passionate about data and I had the tech skills to make things happen.

Done is better than perfect

We always want to show our best side. We fear criticism. Psychologically, we humans want to be loved and accepted. This is one of the reasons why we want our work to be perfect before showing it to the world. This is also why so many people struggle to ship anything.

In his post, David Robinson talks about how he used to work on scientific papers during his Ph.D. Those papers need to be “perfect” before they are published. They go through a slow revision process and oftentimes are never made public.

The good news is that you don’t have to make your work perfect before making it public. What you ship is not set in stone. You can come back and improve it. Don’t get lost in the details, just get some interesting work out of the door. The worst that can happen is that nobody notices.

What should you share?

Share things that can provide value to people. Don’t take for granted that everybody knows what you know. It might be trivial for you to write about statistical concepts like the Beta Distribution, but it’s not the case for everyone.

Here are a few ideas on what you can do:

  • Write a post about a new concept you learned
  • Analyze open data sets.
  • Open source some code you wrote
  • Answer questions on public forums

The list could go on and on. What matters is that you start small and that you deliver.

A long journey starts with a single step

If you are not willing to play the long game, stop now. There will always be something new to learn in our field. You need to embrace that. Every week, there is a new skill you can pick up, a new paper on machine learning, a new technology that you could learn. Don’t try to learn everything before starting to apply your knowledge.

People who want to get into data science generally want to know all of the skills they should learn before getting a job. They spend an absurd amount of time on forums debating which skills to learn. I think that’s a form of procrastination. Start applying what you know. Learn to extract value from data, no matter how you do it.

Whether you are transitioning from another career or you are just starting out, leverage your experiences. If you have worked as an accountant in the past, how can you use those skills to transition into data science? Perhaps there are startups out there that need a data analyst to understand the financials of their marketing acquisition channels. Over time, you can incorporate more advanced techniques into your work.

There are many jobs that involve working with data to make better decisions but are not labelled “data scientist” or “data analyst”. Cast your net wide. The transition from accountant to data scientist building predictive models is generally not done in a single step; find a way to make that transition progressively.

As you progress in your career, making your work public will help you create new opportunities, meet new people and get external feedback on your work. Take a moment and think about the people you look up to in data science. They most likely have one thing in common: they created public artifacts.

What I Wish I Knew About Data For Startups

For the last 4 years, I’ve been through quite a startup journey at PasswordBox. We went through multiple phases: launching our consumer product, raising a Series A, improving our product and optimizing, growing to millions of users and eventually getting acquired.

Throughout the journey, I’ve been part of the data team, where I’ve been doing all things data: analysis, presentations, engineering, training, etc. We started with a very rudimentary stack to track and understand our product usage. Over time, we improved our infrastructure and our tooling. We built dashboards, data pipelines, reports, etc. We invested heavily in the data culture. We trained our colleagues to leverage our tools to make better decisions with data.

While we are now working harder than ever to improve our data stack, I think we are at a stage where we can take a step back and reflect on our past.

Today, I want to share some learnings I have made during the last four years.

Empower People

Data science is a team sport. The biggest challenge in building a data-informed culture is not a technological one. You can have all the best tools in the world, but if nobody uses them to make decisions, you have failed.

Invest in People

Investing in people will help you scale data analysis and insight extraction. You can create much more value in your data analysis by involving product managers, user experience designers, developers, marketers, etc. These people have their own unique world views and can bring another angle to an analysis. You want people who are not data experts to perform their own analyses and generate value from them.

No matter what size your team is, start empowering people with data now. This investment can take many shapes or forms. When you are just starting out, it might be as simple as making third-party data tools, such as Mixpanel, available to the team and training your colleagues to use them.

Personally, I really enjoy doing weekly office hours where I sit down with a colleague to get a question answered. During this time, we can review the question, understand how we can get it answered and then work our way through it with our tools. This is a unique opportunity to understand their challenges, teach them what I know about data and help them use the tools at their disposal.

Your mission is not only to democratize data access for all but also teach your colleagues about data science. You are a guide, not a gatekeeper.

Learn to Communicate

Communication is one of the most important parts of data analysis. You can spend your days building sophisticated data models, but if you can’t explain your analysis to product managers or designers, it’s worthless.

When presenting, don’t hesitate to explain fundamentals. Present what the data tells you and what it doesn’t. Experiment with different mediums to share your learnings: write an internal post, create an internal newsletter, do a weekly presentation to your team on your findings. Whatever you do, make sure presenting data is part of your routine and that you are getting better at it.

Deliver Value

Imagine if you could stop time, build all the tools and infrastructure that you need and then come back with those assets for the business. Wouldn’t that be awesome? Surely. But, that’s not real life.

I found one of the most challenging aspects of working in a startup data team was balancing infrastructure groundwork with consistently delivering value to the business through data analysis. You don’t want to disappear for six months building tooling, but you also don’t want to keep delivering analyses with rudimentary tools. Your time is a limited resource. Where should you invest it?

Know Your Customer

When building any product, your goal is to solve a problem. With internal data products, it’s no different. Your core customer is the business.

How can you help the business with data? What are the main problems to solve? The business needs will help you prioritize your engineering and analysis work. There is nothing worse than a data team going off the grid for months to build an internal tool that doesn’t solve any core problems.

Ship often

One way to make sure you are aligned with the business needs is to ship often by small increments. It will help you validate that what you are delivering brings value to the team and that you are on the right track.

In most startups, the data team is a support team. You are there to make your colleagues’ lives easier. At the end of each work day, ask yourself: did you help people?

Time is Money

If you find yourself manually creating the same report more than once, you should automate it. When you are doing data analysis and research, make sure it is easily reproducible. Jupyter Notebooks are great for this. In the future, you will be very happy you can update your analysis in a few minutes. Sure, it’s an upfront investment, but it will pay off: you will save time in the long run and you will be much more confident in your reporting.

Also, when doing any engineering work, always consider existing products as an alternative. Build vs. buy. I’ve written about why you shouldn’t build a dashboard from scratch and why you should consider existing solutions. This philosophy is not limited to analytics dashboards. It is applicable to your whole data stack: tracking, pipelines, reporting, etc. It is almost always cheaper to buy an existing solution than to build and maintain a custom one indefinitely.

Stick With Boring Technology

As Martin Weiner, Reddit’s CTO, puts it, stick with boring technology when building your startup. Mature technology will be more stable and offer a wider pool of talent. Unless you really need it, stick with technologies that have a proven track record.

Don’t try to build the ideal infrastructure from the ground up. Iterate. Asana’s initial data warehouse was MySQL. Over time, they transitioned to Redshift and a state-of-the-art data infrastructure. They iteratively built and improved their data infrastructure as they scaled.

About Data

Document Tracking

I’m surprised by how little literature has been written about documentation for tracking events. I know, this is a boring topic. But, this is so critical to the success of a data team, both on the engineering and the culture side.

If you are building a consumer application, you will most likely have to support different platforms: iOS, Android, web, etc. Each one of those platforms will have a tracking implementation. How will you make sure the naming conventions are respected throughout those implementations? Where do you document this?

Furthermore, if people who are not familiar with your data implementation want to dig into the tools, where can they find a definition for each of your tracking events? What properties should be tracked with each of those events? What are their data types?

Build a central repository where you document all of the tracking events and their properties. The simplest solution to get started is an Excel worksheet. At some point, I recommend moving this documentation to a machine-readable format so it can be reused for testing, schema generation, etc.
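As a sketch of what a machine-readable event catalog might look like, here is a hypothetical example (the event names, properties, and validation helper are all made up for illustration):

```python
# Hypothetical event catalog: each tracking event documented with its
# supported platforms and its expected properties and data types.
tracking_schema = {
    "signup_completed": {
        "platforms": ["ios", "android", "web"],
        "properties": {"plan": "string", "referrer": "string"},
    },
    "item_shared": {
        "platforms": ["ios", "web"],
        "properties": {"item_id": "string", "channel": "string"},
    },
}

def validate_event(name, properties, schema):
    """Check an incoming event against the documented schema."""
    if name not in schema:
        return False
    expected = schema[name]["properties"]
    return set(properties) == set(expected)

print(validate_event("signup_completed",
                     {"plan": "pro", "referrer": "blog"},
                     tracking_schema))
```

Because the catalog is plain data, the same file can drive automated tests on incoming events, generate warehouse schemas, or render human-readable documentation.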

Quality over Quantity

“Every single company I’ve worked at and talked to has the same problem without a single exception so far — poor data quality, especially tracking data. Either there’s incomplete data, missing tracking data, duplicative tracking data.” - DJ Patil, U.S. Chief Data Scientist at White House Office of Science and Technology

It is often said that 80% of data analysis is spent on cleaning and preparing the data. This is true. Data quality is a challenging aspect of working with data. Bad data quality can take many forms at different stages of the data life cycle. Here are a few examples:

  • Tracking events triggered with an invalid name, properties or property values
  • Tracking events triggered at the wrong time
  • Database columns with inconsistent values
  • Incorrect transformation of the data in an ETL process
  • Invalid calculation in reporting

Implement a data quality process early on. At every stage of the data life cycle, you want to be testing: tracking, ETL, reporting. This is costly. You will need to invest time and effort in making sure you are not collecting garbage data or generating erroneous reports.

I strongly recommend you automate the testing process for your data. You should be monitoring your data quality just like you monitor the KPIs of your product. If you can’t automate it, make sure it gets tested manually to some degree.
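An automated check can be as small as a function that counts known failure modes. A toy example with pandas (the column names and injected problems are invented for illustration):

```python
import pandas as pd

# Toy tracking data with two injected quality problems:
# a fully duplicated event row and a negative session length.
events = pd.DataFrame({
    "event": ["signup", "signup", "purchase"],
    "user_id": [1, 1, 2],
    "session_seconds": [34, 34, -5],
})

def quality_report(df):
    """Return simple data quality counts; run this on a schedule."""
    return {
        "duplicates": int(df.duplicated().sum()),
        "negative_durations": int((df["session_seconds"] < 0).sum()),
        "missing_values": int(df.isnull().sum().sum()),
    }

report = quality_report(events)
print(report)  # any non-zero count is worth investigating
```

Wiring a report like this into a scheduled job or CI pipeline is one way to get the monitoring habit started before investing in heavier tooling.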

Focus on collecting less data but better data. Some might say that since the cost of storage is so low, you should try to track everything in your product. I disagree. Even if the cost of data storage is cheap, the overhead of dealing with volumes of garbage data isn’t worth it. Prioritize quality over quantity.

If you think you don’t have data quality problems, you are probably screwed.

Control & Ownership

Make sure you are able to easily start and stop sending your tracking events to any sources when you wish to, without requiring any changes in your product. This will allow you to independently activate new tracking tools without having to change anything in the tracking implementation.

Also, make sure you have access to all of your raw tracking data. At first, you might not have the resources to crunch all of this data easily. As you grow, you will develop the tools and knowledge. Being able to crunch the raw data will enable you to tackle more complex analyses.

If you don’t want to build a custom tool for this, I recommend Segment.

Closing Thoughts

Over the last 4 years, I’ve learned many lessons through experience. From choosing technology to investing in data quality, those lessons were learned through trial and error.

Many things have changed during this period. Technology has solved many of the storage and computing challenges we once had with “big” data. A startup can now, in a matter of hours, have a managed data warehouse up and running, plugged into powerful analytical tools. This wasn’t the case when we started out. For most startups, the challenge is now less a technological one than a human one.

The Theorem Every Data Scientist Should Know (Part 2)

Last week, I wrote a post about the Central Limit Theorem. In that post, I explained through examples what the theorem is and why it’s so important when working with data. If you haven’t read it yet, go do it now. To keep the post short and focused, I didn’t go into many details; the goal was to communicate the general concept of the theorem. In the days following its publication, I received many messages. People wanted me to go into more details.

Today, I’ll dive into more specifics. I’ll be focusing on answering the following question: How do we calculate confidence intervals and margins of error with the CLT?

By the end of this post, you should be able to explain how we calculate confidence intervals to your colleagues.

More Details On The CLT

The theorem states that if we collect a large enough sample from a population, the sample mean will be more or less equal to the population mean. Moreover, if we collect a large number of sample means, the distribution of those means will take the shape of a normal distribution, no matter what the population distribution is. We call this distribution of means the sampling distribution.

Knowing that the sampling distribution will take the shape of a normal distribution is what makes the theorem so powerful. With a few pieces of information about a sample, we are able to calculate the probability that the sample mean will differ from the population mean, and by how much. Sound familiar? The Central Limit Theorem is foundational to the concepts of confidence intervals and margins of error in frequentist statistics.
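This clustering of sample means around the population mean is easy to check with a quick simulation (not from the original post; the exponential "age" population here is an arbitrary, deliberately non-normal choice):

```python
import random
import statistics

random.seed(42)

# A population that is clearly NOT normal: right-skewed exponential "ages".
population = [random.expovariate(1 / 40) for _ in range(100_000)]

# Draw many samples of 100 and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 100))
    for _ in range(2_000)
]

# The mean of the sample means lands very close to the population mean,
# even though the population itself is heavily skewed.
print(round(statistics.mean(population), 1))
print(round(statistics.mean(sample_means), 1))
```

Plotting `sample_means` as a histogram would show the familiar bell shape of the sampling distribution.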

When explaining the theorem, we keep referring to two distributions: the population distribution and the sampling distribution of the mean. The reason we keep referring to them is because they are connected:


  1. The mean of the sampling distribution will cluster around the population mean.
  2. The standard deviation of the sampling distribution is tied to the standard deviation of the population distribution. Given the standard deviation of the sampling distribution and the sample size, we can calculate the standard deviation of the population distribution, and vice versa. The standard deviation of the sampling distribution is called the standard error.
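The connection between the two distributions can be sketched with a quick simulation. The population below is hypothetical (exponentially distributed ages, deliberately not normal); only the standard library is used:

```python
import random
import statistics

random.seed(42)

# Hypothetical, deliberately non-normal population of "ages"
# (exponential with mean 40).
population = [random.expovariate(1 / 40) for _ in range(100_000)]

# Draw many samples of size 100 and record each sample mean.
sample_means = [statistics.mean(random.sample(population, 100))
                for _ in range(2_000)]

# 1. The sample means cluster around the population mean.
print(statistics.mean(population), statistics.mean(sample_means))

# 2. The spread of the sample means (the standard error) is roughly
#    the population standard deviation divided by sqrt(n).
print(statistics.stdev(population) / 100 ** 0.5,
      statistics.stdev(sample_means))
```

Plotting a histogram of `sample_means` would show the familiar bell curve, even though the population itself is heavily skewed.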

Ok, so technically, how does calculating a confidence interval work?

Beer, beer, beer…

Let’s go back to the beer example from my previous post. Say we are studying American beer drinkers and we want to know the average age of the US beer drinker population. We hire a firm to conduct a survey of 100 random American beer drinkers. From that sample, we get the following (totally made up) results:

  • n (sample size): 100
  • Standard Deviation of Age: 15
  • Arithmetic Mean of Age: 40

What can we infer from the population with this information? Quite a lot, actually.

With this data at hand and based on what we learned about the CLT, our best guess is that the population mean is more or less equal to 40, the mean of our sample. However, how can we be confident about this number? What are the chances that we are wrong?

What is the probability that the mean age of the US beer drinker population is between 38 and 42? (I selected those values to keep the example simple. By the end of the post, you should be able to calculate this for any range.)

Standard Error & Standard Deviation

Here’s an important bit of information I haven’t provided you with yet. The following formula describes the relationship between the Standard Error of the Mean and the Standard Deviation of the Population. We need it in order to calculate confidence intervals and margins of error.

Standard Error of the Mean = Standard Deviation of Population / √n

The challenge is that with the data provided above, we have neither the Standard Error nor the Standard Deviation of the Population. To work around this, we substitute our best estimator for the population standard deviation: the sample standard deviation.

Standard Error of the Mean = 15 / √100 = 1.5

We now know that our best estimate for the Standard Error of the mean is 1.5. This is equivalent to saying the standard deviation of the sampling distribution of the mean is 1.5. This value is essential in calculating the probability of us being wrong.
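The calculation above is a one-liner; here it is as a minimal sketch, using the made-up survey numbers from earlier:

```python
import math

# Survey results from the (made up) sample above.
n = 100           # sample size
sample_sd = 15    # standard deviation of age in the sample
sample_mean = 40  # mean age in the sample

# The sample standard deviation stands in for the unknown
# population standard deviation in the formula.
standard_error = sample_sd / math.sqrt(n)
print(standard_error)  # 1.5
```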

Probability of an observation

Armed with the standard error, we can now calculate the probability of the population mean being between 38 and 42. When working with a distribution such as the normal distribution, we generally normalize absolute values in terms of standard deviations. What does the range of 2 years above and below our arithmetic mean represent in terms of standard deviations? We can normalize this range by dividing the 2 years by the standard error. It represents 2 / 1.5, or 1.33 standard deviations above or below the sample mean.

Since the normal distribution is a distribution of probabilities and it has been studied extensively, there is a table, called the Z-Table, that documents the probability that a statistic is observed. With the Z-Table, we can easily look up the probability that an observation will occur above or below a certain number of standard deviations. Here, we look up the probability of an observation being within 1.33 standard deviations of our mean.

In this case, the table tells us that the probability that the mean age of the US beer drinker population is between 38 and 42 is 81.64%. This is equivalent to saying that we are approximately 81.64% confident that the population mean is within 2 years of our sample mean. There you have it: a confidence interval and a margin of error.
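If you don’t have a Z-Table handy, the same lookup can be sketched with the standard library’s error function, which gives the cumulative probability of the standard normal. The numbers (standard error 1.5, margin of 2 years) come from the example above:

```python
import math

def normal_cdf(z):
    # Cumulative probability of the standard normal distribution,
    # computed from the error function instead of a printed Z-Table.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

standard_error = 1.5  # 15 / sqrt(100), from the sample above
margin = 2.0          # +/- 2 years around the sample mean of 40

# Express the margin in standard errors, rounded to two decimals
# the way a Z-Table lookup would be.
z = round(margin / standard_error, 2)  # 1.33

# Probability that an observation falls within +/- z standard errors.
confidence = normal_cdf(z) - normal_cdf(-z)
print(round(confidence, 4))  # roughly 0.816, in line with the Z-Table
```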

This example is fairly simple, I agree. It’s important to remember that a good portion of a data scientist’s work is just arithmetic. Understanding the fundamentals is essential if you want to interpret data. It will also help you do a better job of teaching your colleagues about it. As a data scientist, a major part of your job is to clearly communicate statistical concepts to people with various levels of statistical knowledge.

The Theorem Every Data Scientist Should Know

Yesterday, I was reading a thread on Quora. The people in this thread were answering the following question: What are 20 questions to detect fake data scientists? The most upvoted answer contained a list of questions that could leave a good number of data scientists off guard.

In that thread, my attention was drawn to one particular question. Not because it was especially hard, but because I doubt many data scientists can answer it. Yet most of them, whether they know it or not, use this concept on a daily basis.

The question was: What is the Central Limit Theorem? Why is it important?

Explain the Theorem Like I’m Five

Let’s say you are studying the population of beer drinkers in the US. You’d like to understand the mean age of those people but you don’t have time to survey the entire US population.

Instead of surveying the whole population, you collect one sample of 100 beer drinkers in the US. With this data, you are able to calculate an arithmetic mean. Maybe for this sample, the mean age is 35 years old. Say you collect another sample of 100 beer drinkers. For that new sample, the mean age is 39 years old. As you collect more and more means from samples of 100 beer drinkers, you get what is called a sampling distribution. The sampling distribution is the distribution of the sample means. In this example, 35 and 39 would be two observations in that sampling distribution.

The theorem states that the sampling distribution, the distribution of the sample means you collected, will approximately take the shape of a bell curve around the population mean. This shape is also known as a normal distribution. Don’t get the statement wrong: the CLT is not saying that any population will have a normal distribution. It says the sampling distribution will.

As your samples get bigger, the sampling distribution will tend to look more and more like a normal distribution. The theorem holds for any population, regardless of its distribution. There are some important conditions for the theorem to hold, but I won’t cover them in this post.

Why is it important?

The Central Limit Theorem is at the core of what every data scientist does daily: make statistical inferences about data.

The theorem gives us the ability to quantify the likelihood that our sample will deviate from the population without having to take any new sample to compare it with. We don’t need the characteristics of the whole population to understand the likelihood of our sample being representative of it.

The concepts of confidence intervals and hypothesis testing are based on the CLT. By knowing that our sample mean will fit somewhere in a normal distribution, we know that 68 percent of the observations lie within one standard deviation of the population mean, 95 percent will lie within two standard deviations, and so on.
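Those 68 and 95 percent figures can be checked directly with the standard library’s error function, which gives cumulative probabilities of the normal distribution:

```python
import math

def prob_within(k):
    # Probability that a normally distributed value falls within
    # k standard deviations of its mean.
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

Running this prints approximately 0.6827, 0.9545, and 0.9973, the familiar 68–95–99.7 rule.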

The CLT is not limited to making inferences from a sample about a population. There are four kinds of inferences we can make based on the CLT:

  1. We have the information of a valid sample. We can make accurate assumptions about its population.
  2. We have the information of the population. We can make accurate assumptions about a valid sample from that population.
  3. We have the information of a population and a valid sample. We can accurately infer whether the sample was drawn from that population.
  4. We have the information about two different valid samples. We can accurately infer whether the two samples were drawn from the same population.

As a data scientist, you should be able to deeply understand this theorem. You should be able to explain it and understand why it’s so important. This post skips many important aspects of the theorem, such as its mathematical demonstration, the criteria for it to hold, and the details of the statistical inferences that can be made from it. These elements are material for another post.