12 Aug 2016
A few weeks ago, I came across a great post from David Robinson about his first year as a data scientist at Stack Overflow. The post went into great detail about how David landed his job there and the things he’s been working on since then.
In a section of the post, David advised graduate students who wish to get into data science to create public artifacts. David landed his job partly because of some public artifacts he created: blog posts and answers to questions on StackExchange.
His advice really resonated with me. Many of the great things that happened to me in the last few years are the result of making my work public: meeting new people, landing a job at PasswordBox, creating and selling a side project. The best way to get a job if you don’t have any experience is to make your work public.
Making My Work Public
I learned to code fairly late by tech-world standards. I was 24 years old. At that age, I was temporarily living in Chicago to attend the Starter League, a three-month intensive coding boot camp. As part of the program, we had a final project where we had to form a small team and ship a web project of our choice. At the time, I was already passionate about data. I had a few years of experience as a digital analytics consultant at a creative agency under my belt.
At the Starter League, I met Sam and Enrique, two great guys who eventually became my teammates for the final project. After a few iterations on the idea, we decided to build an analytics platform for Tumblr. We called it MountainMetrics. The project solved a pain Enrique was experiencing managing the Chicago History Museum’s Tumblr account: tracking the number of followers over time.
At the end of the three-month boot camp, we had a fully functional product and at least one user, the Chicago History Museum. We open sourced the code. Little did I know at the time, but the project would be featured on Hacker News and attract the attention of many interesting people, including the Tumblr engineering team. More importantly, this project helped me land a data science job at PasswordBox.
MountainMetrics wasn’t in any way a technological feat. It was a simple Rails web application that queried data from multiple APIs and reported it back to the end user in a sensible way. However, it demonstrated that I had a few very important data science skills: I could ship, I was passionate about data and I had enough technical skills to make things happen.
Done is better than perfect
We always want to show our best side. We fear getting criticized. Psychologically, we humans want to be loved and accepted. This is one of the reasons why we want our work to be perfect before showing it to the world. This is also why so many people struggle to ship anything.
In his post, David Robinson talks about how he used to work on scientific papers during his Ph.D. Those papers needed to be “perfect” before they were published. They had to go through a slow revision process and oftentimes were never made public.
The good news is that you don’t have to make your work perfect before making it public. What you ship is not set in stone. You can come back and improve it. Don’t get lost in the details, just get some interesting work out of the door. The worst that can happen is that nobody notices.
What should you share?
Share things that can provide value to people. Don’t take for granted that everybody knows what you know. It might be trivial for you to write about statistical concepts like the Beta Distribution, but it’s not the case for everyone.
Here are a few ideas on what you can do:
The list could go on and on. What matters is that you start small and that you deliver.
A long journey starts with a single step
If you are not willing to play the long game, stop now. There will always be something new to learn in our field. You need to embrace that. Every week, there is a new skill you can pick up, a new paper on machine learning, a new technology that you could learn. Don’t try to learn everything before starting to apply your knowledge.
People who want to get into data science generally want to know all of the skills they should learn before getting a job. They spend an absurd amount of time on forums discussing which skills they need to land one. I think that’s a form of procrastination. Start applying what you know. Learn to extract value from data, no matter how you do it.
Whether you are transitioning from another career or just starting out, leverage your experiences. If you have worked as an accountant in the past, how can you use those skills to transition into data science? Perhaps there are startups out there that need a data analyst to understand the financials of their marketing acquisition channels. Over time, you can incorporate more advanced techniques into your work.
There are many jobs that involve working with data to make better decisions but are not labelled “data scientist” or “data analyst”. Cast your net wide. The transition from being an accountant to a data scientist building predictive models is generally not done in a single step. Find a way to make that transition progressively.
As you progress in your career, making your work public will help you create new opportunities, meet new people and get external feedback on your work. Take a moment and think about the people you look up to in data science. They most likely have one thing in common: they created public artifacts.
04 Aug 2016
For the last 4 years, I’ve been through quite a startup journey at PasswordBox. We went through multiple phases: launching our consumer product, raising a Series A, improving our product and optimizing, growing to millions of users and eventually getting acquired.
Throughout the journey, I’ve been part of the data team where I’ve been doing all things data: analysis, presentations, engineering, training, etc. We started with a very rudimentary stack to track and understand our product usage. Over time, we improved our infrastructure and our tooling. We built dashboards, data pipelines, reports, etc. We invested heavily in the data culture. We trained our colleagues to leverage our tools to make better decisions with data.
While we are now working harder than ever to improve our data stack, I think we are at a stage where we can take a step back and reflect on our past.
Today, I want to share some of the lessons I’ve learned during the last four years.
Data science is a team sport. The biggest challenge in building a data-informed culture is not a technological one. You can have all the best tools in the world, but if nobody uses them to make decisions, you have failed.
Invest in People
Investing in people will help you scale data analysis and insight extraction. You can create much more value in your data analysis by involving product managers, user experience designers, developers, marketers, etc. These people have their own unique world views and can bring another angle to an analysis. You want people who are not data experts to perform their own analyses and generate value from them.
No matter what size your team is, start empowering people with data now. This investment can take many shapes or forms. When you are just starting out, it might be as simple as making third-party data tools, such as Mixpanel, available to the team and training your colleagues to use them.
Personally, I really enjoy doing weekly office hours where I sit down with a colleague to get a question answered. During this time, we can review the question, understand how we can get it answered and then work our way through it with our tools. This is a unique opportunity to understand their challenges, teach them what I know about data and help them use the tools at their disposal.
Your mission is not only to democratize data access for all but also to teach your colleagues about data science. You are a guide, not a gatekeeper.
Learn to Communicate
Communication is one of the most important parts of data analysis. You can spend your days building sophisticated data models, but if you can’t explain your analysis to product managers or designers, it’s worthless.
When presenting, don’t hesitate to explain fundamentals. Present what the data tells you and what it doesn’t. Experiment with different mediums to share your learnings. Write an internal post. Create an internal newsletter. Do a weekly presentation to your team on your findings. Whatever you do, make sure presenting data is part of your routine and that you are getting better at it.
Imagine if you could stop time, build all the tools and infrastructure that you need and then come back with those assets for the business. Wouldn’t that be awesome? Surely. But that’s not real life.
I found that one of the most challenging aspects of working on a startup data team is balancing infrastructure groundwork with consistently delivering value to the business through data analysis. You don’t want to disappear for 6 months building tooling, but you also don’t want to keep delivering analyses with rudimentary tools. Your time is a limited resource. Where should you invest it?
Know Your Customer
When building any product, your goal is to solve a problem. With internal data products, it’s no different. Your core customer is the business.
How can you help the business with data? What are the main problems to solve? The business needs will help you prioritize your engineering and analysis work. There is nothing worse than a data team going off the grid for months to build an internal tool that doesn’t solve any core problems.
One way to make sure you are aligned with the business needs is to ship often, in small increments. It will help you validate that what you are delivering brings value to the team and that you are on the right track.
In most startups, the data team is a support team. You are there to make your colleagues’ lives easier. At the end of each work day, ask yourself: did I help people today?
Time is Money
If you find yourself manually creating the same report more than once, you should automate it. When you are doing data analysis and research, make sure it is easily reproducible. Jupyter Notebooks are great for this. In the future, you will be very happy you can update your analysis in a few minutes. Sure, it’s an investment upfront, but it will pay off. You will save time in the long run and you will be much more confident in your reporting.
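To make the idea concrete, here is a minimal sketch of what turning a recurring report into a reusable function can look like. The data shape and metric names are made up for illustration; the point is that rerunning the function with fresh rows regenerates the report in seconds.

```python
from datetime import date

def weekly_signup_report(rows, week_start):
    """Build a plain-text signup report from raw rows.

    `rows` is a list of dicts with hypothetical keys 'date' and 'signups'.
    """
    total = sum(r["signups"] for r in rows)
    best_day = max(rows, key=lambda r: r["signups"])
    lines = [
        f"Signup report for week of {week_start}",
        f"Total signups: {total}",
        f"Best day: {best_day['date']} ({best_day['signups']} signups)",
    ]
    return "\n".join(lines)

rows = [
    {"date": date(2016, 8, 1), "signups": 120},
    {"date": date(2016, 8, 2), "signups": 95},
]
print(weekly_signup_report(rows, date(2016, 8, 1)))
```

Once the report is a function, scheduling it weekly is a one-line cron job instead of an afternoon of copy-pasting.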
Also, when doing any engineering work, always consider existing products as an alternative. Build vs. buy. I’ve written about why you shouldn’t build a dashboard from scratch and why you should consider existing solutions. This philosophy is not limited to analytics dashboards. It applies to your whole data stack: tracking, pipelines, reporting, etc. It is almost always cheaper to buy an existing solution than to build and maintain a custom one indefinitely.
Stick With Boring Technology
As Martin Weiner, Reddit’s CTO, puts it, stick with boring technology when building your startup. Mature technology will be more stable and offer a wider pool of talent. Unless you really need it, stick with technologies that have a proven track record.
Don’t try to build the ideal infrastructure from the ground up. Iterate. Asana’s initial data warehouse was MySQL. Over time, they transitioned to Redshift and a state-of-the-art data infrastructure. They iteratively built and improved their data infrastructure as they scaled.
I’m surprised by how little has been written about documenting tracking events. I know, this is a boring topic. But it is critical to the success of a data team, both on the engineering and the culture side.
If you are building a consumer application, you will most likely have to support different platforms: iOS, Android, web, etc. Each one of those platforms will have a tracking implementation. How will you make sure the naming conventions are respected throughout those implementations? Where do you document this?
Furthermore, if people who are not familiar with your data implementation want to dig into the tools, where can they find a definition for each of your tracking events? What properties should be tracked with each of those events? What are their data types?
Build a central repository where you document all of the tracking events and their properties. The simplest solution to get started is to document this in an Excel worksheet. At some point, I recommend that this documentation should be in a machine-readable format so it can be re-used for testing, schema generation, etc.
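As an illustration of what machine-readable can mean here, the catalog can be as simple as a dictionary mapping each event name to its expected properties and types. The event names below are made up; the same structure can then drive validation, schema generation and tests.

```python
# A hypothetical tracking event catalog: event name -> property name -> type.
TRACKING_SCHEMA = {
    "signup_completed": {"platform": str, "referrer": str},
    "password_saved": {"platform": str, "vault_size": int},
}

def validate_event(name, properties):
    """Return a list of problems for one tracked event; empty if it is valid."""
    expected = TRACKING_SCHEMA.get(name)
    if expected is None:
        return [f"unknown event: {name}"]
    problems = []
    for prop, prop_type in expected.items():
        if prop not in properties:
            problems.append(f"{name}: missing property '{prop}'")
        elif not isinstance(properties[prop], prop_type):
            problems.append(f"{name}: '{prop}' should be {prop_type.__name__}")
    return problems
```

Because the catalog is plain data, every platform team can test its implementation against the same source of truth instead of a stale spreadsheet.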
Quality over Quantity
“Every single company I’ve worked at and talked to has the same problem without a single exception so far — poor data quality, especially tracking data. Either there’s incomplete data, missing tracking data, duplicative tracking data.” - DJ Patil, U.S. Chief Data Scientist at the White House Office of Science and Technology Policy
It is often said that 80% of data analysis is spent on cleaning and preparing the data. This is true. Data quality is a challenging aspect of working with data. Bad data quality can take many forms at different stages of the data life cycle. Here are a few examples:
- Tracking events triggered with an invalid name, properties or property values
- Tracking events triggered at the wrong time
- Database columns with inconsistent values
- Incorrect transformation of the data in an ETL process
- Invalid calculation in reporting
Implement a data quality process early on. At every stage of the data life cycle, you want to be testing: tracking, ETL, reporting. This is costly. You will need to invest time and effort in making sure you are not collecting garbage data or generating erroneous reports.
I strongly recommend you automate the testing of your data. You should be monitoring your data quality just like you monitor the KPIs of your product. If you can’t automate it, make sure it gets tested manually to some degree.
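A minimal sketch of what an automated check might look like, assuming you can query daily event counts from your warehouse (the numbers and thresholds here are hypothetical). A sudden drop usually means broken tracking; a spike often means duplicates.

```python
def check_daily_counts(counts, expected_min, expected_max):
    """Flag days whose event volume falls outside an expected range.

    `counts` maps a date string to the number of events collected that day.
    """
    alerts = []
    for day, count in sorted(counts.items()):
        if count < expected_min:
            alerts.append(f"{day}: only {count} events (possible tracking outage)")
        elif count > expected_max:
            alerts.append(f"{day}: {count} events (possible duplicate tracking)")
    return alerts

counts = {"2016-08-01": 10500, "2016-08-02": 1200, "2016-08-03": 9800}
print(check_daily_counts(counts, expected_min=5000, expected_max=20000))
```

Run a handful of checks like this on a schedule and send the alerts to Slack or email, exactly like you would monitor a production service.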
Focus on collecting less data but better data. Some might say that since the cost of storage is so low, you should try to track everything in your product. I disagree. Even if the cost of data storage is cheap, the overhead of dealing with volumes of garbage data isn’t worth it. Prioritize quality over quantity.
If you think you don’t have data quality problems, you are probably screwed.
Control & Ownership
Make sure you are able to easily start and stop sending your tracking events to any source whenever you wish, without requiring any changes in your product. This will allow you to activate new tracking tools independently, without having to change anything in the tracking implementation.
Also, make sure you have access to all of your raw tracking data. At first, you might not have the resources to crunch all of this data easily. As you grow, you will develop those tools and knowledge. Being able to crunch the raw data will enable you to perform more complex analyses.
If you don’t want to build a custom tool for this, I recommend Segment.
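If you do roll your own, the core idea is a thin wrapper in front of your tracking calls that keeps a raw copy of every event and fans it out to whichever destinations are currently enabled, so turning a tool on or off is a configuration change rather than a product change. A rough sketch, with placeholder destination names:

```python
class EventRouter:
    """Fan tracked events out to enabled destinations, keeping the raw copy."""

    def __init__(self, destinations):
        # destination name -> callable taking (event_name, properties)
        self.destinations = destinations
        self.enabled = set(destinations)
        self.raw_log = []  # always keep your own copy of the raw data

    def set_enabled(self, names):
        self.enabled = set(names) & set(self.destinations)

    def track(self, event_name, properties):
        self.raw_log.append((event_name, properties))
        for name in self.enabled:
            self.destinations[name](event_name, properties)
```

The product only ever calls track(); which analytics vendors receive the event is decided entirely inside the router.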
Over the last 4 years, I’ve learned many lessons through experience. From choosing technology to investing in data quality, those lessons came through trial and error.
Many things have changed during this period. Technology has solved many of the storage and computing challenges we once had with “big” data. A startup can now, in a matter of hours, have a managed data warehouse up and running, plugged into powerful analytical tools. This wasn’t the case when we started out. For most startups, the challenge is now less a technological one than a human one.
13 Jul 2016
Last week, I wrote a post about the Central Limit Theorem. In that post, I explained through examples what the theorem is and why it’s so important when working with data. If you haven’t read it yet, go do it now. To keep the post short and focused, I didn’t go into many details. The goal of that post was to communicate the general concept of the theorem. In the days following its publication, I received many messages. People wanted me to go into more detail.
Today, I’ll dive into more specifics. I’ll be focusing on answering the following question: How do we calculate confidence intervals and margins of error with the CLT?
By the end of this post, you should be able to explain how we calculate confidence intervals to your colleagues.
More Details On The CLT
The theorem states that if we collect a large enough sample from a population, the sample mean should be approximately equal to the population mean. If we collect a large number of different sample means, the distribution of those sample means should take the shape of a normal distribution, no matter what the population distribution is. We call this distribution of means the sampling distribution.
Knowing that the sampling distribution will take the shape of a normal distribution is what makes the theorem so powerful. With a little information about a sample, we are able to calculate the probability that the sample mean will differ from the population mean and by how much. Sound familiar? The Central Limit Theorem is foundational to the concepts of confidence intervals and margins of error in frequentist statistics.
When explaining the theorem, we keep referring to two distributions: the population distribution and the sampling distribution of the mean. The reason we keep referring to those two distributions is that they are connected:
- The mean of the sampling distribution will cluster around the population mean.
- The standard deviation of the population distribution is tied to the standard deviation of the sampling distribution. With the standard deviation of the sampling distribution and the sample size, we are able to calculate the standard deviation of the population distribution. The standard deviation of the sampling distribution is called the standard error.
Ok, so technically, how does calculating a confidence interval work?
Beer, beer, beer…
Let’s go back to the beer example from my previous post. Say we are studying American beer drinkers and we want to know the average age of the US beer drinker population. We hire a firm to conduct a survey of 100 random American beer drinkers. From that sample, we get the following (totally made up) results:
- n (sample size): 100
- Standard Deviation of Age: 15
- Arithmetic Mean of Age: 40
What can we infer from the population with this information? Quite a lot, actually.
With this data at hand and based on what we learned about the CLT, our best guess is that the population mean is more or less equal to 40, the mean of our sample. However, how confident can we be about this number? What are the chances that we are wrong?
What is the probability that the mean age of the US beer drinker population is between 38 and 42? (I selected those values to keep the example simple. By the end of the post, you should be able to calculate this for any range.)
Standard Error & Standard Deviation
Here’s an important bit of information I haven’t provided you with yet. The following formula describes the relationship between the Standard Error of the Mean and the Standard Deviation of the Population. It is necessary in order to calculate confidence intervals and margins of error.
Standard Error of the Mean = Standard Deviation of Population / √(Sample Size)
The challenge is that with the data provided above, we have neither the Standard Error nor the Standard Deviation of the Population. To work around this, we can substitute our best estimator for the Standard Deviation of the Population: the sample standard deviation.
Standard Error of the Mean = 15 / √100 = 1.5
We now know that our best estimate for the Standard Error of the Mean is 1.5. This is equivalent to saying that the standard deviation of the sampling distribution of the mean is 1.5. This value is essential in calculating the probability of us being wrong.
Probability of an observation
Armed with the standard error, we can now calculate the probability of the population mean being between 38 and 42. When working with a distribution such as the normal distribution, we generally want to express absolute values in terms of standard deviations. What does a range of 2 years above and below our arithmetic mean represent in terms of standard deviations? We can normalize this range by dividing the 2 years by the standard error: it represents 1.33 standard deviations above or below the sample mean.
Since the normal distribution is a distribution of probabilities and has been studied extensively, there is a table, called the Z-Table, that documents the probability that a statistic is observed. With the Z-Table, we can easily find the probability that an observation will occur above or below a certain number of standard deviations. We can look up the probability of an observation being within 1.33 standard deviations of our mean.
In this case, the table tells us that the probability that the mean age of the US beer drinker population is between 38 and 42 is 81.64%. This is equivalent to saying that we are approximately 81.64% confident that the population mean is within more or less 2 years of our sample mean. There you have it: a confidence interval and a margin of error.
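If you would rather compute this directly than read a Z-table, the whole calculation fits in a few lines. Here it is sketched in Python, using the standard normal CDF (via math.erf) in place of the table lookup:

```python
from math import erf, sqrt

n = 100          # sample size
sample_sd = 15   # sample standard deviation of age
margin = 2       # we ask about the mean +/- 2 years, i.e. 38 to 42

standard_error = sample_sd / sqrt(n)    # 15 / 10 = 1.5
z = round(margin / standard_error, 2)   # 2 / 1.5 -> 1.33, as read off a Z-table
probability = erf(z / sqrt(2))          # P(-z < Z < z) for a standard normal

print(f"standard error: {standard_error}")
print(f"z: {z}")
print(f"probability: {probability:.4f}")  # ~0.8165, the 81.64% from the Z-table
```

The tiny difference from 81.64% comes from rounding: the Z-table truncates, while erf computes the exact area under the curve.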
This example is fairly simple, I agree. It’s important to remember that a good portion of a data scientist’s work is just arithmetic. Understanding the fundamentals is essential if you want to interpret data. It will also help you do a better job of teaching your colleagues about it. As a data scientist, a major part of your job is to communicate statistical concepts clearly to people with various levels of statistical knowledge.
04 Jul 2016
Yesterday, I was reading a thread on Quora. The people in this thread were answering the following question:
What are 20 questions to detect fake data scientists? The most upvoted answer contained a list of questions that could catch a good number of data scientists off guard.
In that thread, my attention was drawn to one particular question. Not because it was especially hard, but because I doubt many data scientists can answer it. Yet most of them, whether they know it or not, use this concept on a daily basis.
The question was: What is the Central Limit Theorem? Why is it important?
Explain the Theorem Like I’m Five
Let’s say you are studying the population of beer drinkers in the US. You’d like to understand the mean age of those people but you don’t have time to survey the entire US population.
Instead of surveying the whole population, you collect one sample of 100 beer drinkers in the US. With this data, you are able to calculate an arithmetic mean. Maybe for this sample, the mean age is 35 years old. Say you collect another sample of 100 beer drinkers. For that new sample, the mean age is 39 years old. As you collect more and more means of those samples of 100 beer drinkers, you get what is called a sampling distribution. The sampling distribution is the distribution of the sample means. In this example, 35 and 39 would be two observations in that sampling distribution.
The statement of the theorem says that the sampling distribution, the distribution of the sample means you collected, will approximately take the shape of a bell curve around the population mean. This shape is also known as a normal distribution. Don’t get the statement wrong: the CLT is not saying that any population will have a normal distribution. It says the sampling distribution will.
As your samples get bigger, the sampling distribution will tend to look more and more like a normal distribution. The theorem holds true for any population, regardless of its distribution. There are some important conditions for the theorem to hold true, but I won’t cover them in this post.
Why is it important?
The Central Limit Theorem is at the core of what every data scientist does daily: make statistical inferences about data.
The theorem gives us the ability to quantify the likelihood that our sample mean will deviate from the population mean without having to take any new samples to compare it with. We don’t need the characteristics of the whole population to understand the likelihood of our sample being representative of it.
The concepts of confidence interval and hypothesis testing are based on the CLT. By knowing that our sample mean will fit somewhere in a normal distribution, we know that 68 percent of the observations lie within one standard deviation from the population mean, 95 percent will lie within two standard deviations and so on.
The CLT is not limited to making inferences from a sample about a population. There are four kinds of inferences we can make based on it:
- We have the information of a valid sample. We can make accurate assumptions about its population.
- We have the information of the population. We can make accurate assumptions about a valid sample from that population.
- We have the information of a population and a valid sample. We can accurately infer if the sample was drawn from that population.
- We have the information about two different valid samples. We can accurately infer if the two samples were drawn from the same population.
As a data scientist, you should deeply understand this theorem. You should be able to explain it and know why it’s so important. This post skips many important aspects of the theorem, such as its mathematical proof, the conditions for it to hold and the details of the statistical inferences that can be made from it. These elements are material for another post.
22 Jun 2016
If you are working at a startup and you are the data guy/girl, you’re most likely doing some form of ETL / data pipelines. Perhaps you are pulling data from a database, aggregating it and storing it as CSVs?
As the data needs of your team grow, you will be managing more and more of those jobs. If you don’t have a proper solution to manage them, you will soon be drowning under a mountain of ETLs that can fail at any point without you knowing.
Nowadays, there are many solutions to automate and manage data pipelines, ranging from full-blown enterprise solutions to barely maintained open source projects. With all of those choices, you are probably asking yourself: which tool should I use for my data pipelines?
I’ve recently gone through the selection and setup process of such a tool with my team. Here’s my take on this.
Solution #1: The Best Code is No Code At All
I am a believer in the saying “the best code is no code at all”. Code requires maintenance. Maintenance requires time. Time is money. I try to stay away from building custom solutions when it’s possible. It’s easy to sink a whole lot of time in creating and maintaining those and it’s not the best use of your time.
I came across two potential solutions I would seriously consider if I were setting up my data pipelines from scratch today, especially at an early-stage startup: RJ Metrics and Segment Sources.
Both of these solutions are SaaS products that pull your data from different third-party providers and store it in your data warehouse. RJ Metrics currently only supports Amazon Redshift as a data warehouse, while Segment Sources supports both Redshift and Postgres.
If you are heavily using third-party products such as Mixpanel, Stripe, Intercom or Zendesk to collect data, those solutions can be very interesting. They will save you the hassle of coding and maintaining data pipelines to extract and load this data. There is very little you have to do to get up and running.
Those solutions currently have a limited number of integrations. There’s a good chance some of your data sources aren’t currently supported. Also, those solutions are focused on extracting and loading your data into a warehouse. If you want to apply any transformations to your data prior to loading, you are out of luck.
Solution #2: Airflow
Airflow is a workflow management platform that was open sourced last year by Airbnb. It’s coded in Python, it’s actively worked on and it has become a serious option for managing your batch tasks. It is basically an orchestrator for your ETLs. It not only schedules and executes your tasks, but also helps manage the sequence and dependencies of those tasks through workflows.
The basic setup of Airflow is fairly easy to complete if you have some technical chops. If you have some experience with Python, writing your first jobs in Airflow shouldn’t be a problem.
Once you are set up, the web interface is user-friendly and can provide the status of each task that has run. You can trigger new tasks to run from the interface and visually see all the dependencies between the pipelines.
Airflow is a very complete solution. Out of the box, it can connect to a wide variety of databases. It can alert you by email if a task fails, write a message on Slack when a task has finished running, etc. You might not need everything it has to offer out of the box.
Also, Airflow is designed to scale. The workers can be distributed across different nodes. Generally, there shouldn’t be much heavy computing done by the workers, but being able to scale them will provide some headroom for your tasks to run smoothly. The setup for scaling workers is more advanced.
If you don’t have an engineer on your team or someone who’s technical, this type of solution is not a good choice for you. While the setup is easy, you need someone who can manage and maintain your Airflow instance. Also, if you are running in a Windows environment, you might run into trouble setting up Airflow.
Getting started with Airflow
If you already have some data pipelines, I recommend you pick the simplest one and get it running through Airflow. You can install Airflow in a few commands. I strongly encourage you to go through the whole tutorial, it will enable you to grasp the core concepts with ease.
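To give a feel for what migrating a pipeline involves, here is a minimal DAG file, a sketch based on the Airflow tutorial; the DAG name, schedule and task commands are placeholders for your own pipeline:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-team",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Run the pipeline once a day; Airflow handles scheduling and retries.
dag = DAG(
    "daily_export",
    default_args=default_args,
    start_date=datetime(2016, 6, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

# Declare the dependency: load only runs after extract succeeds.
load.set_upstream(extract)
```

Drop a file like this in your DAGs folder and Airflow picks it up, schedules it and shows the extract → load dependency in the web interface.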
Choosing a data pipeline solution is an important choice because you’ll most likely live with it for a while. This post is in no way an exhaustive list of tools for managing ETL’s.
No matter what tool you choose, remember that you want to choose based on your own resources and requirements. Don’t set yourself up for failure by choosing a solution you can’t maintain and that will add to your workload.