How many times have you heard someone say they are data-driven or data-centric, or that “data is at the heart of everything they do”? I’ve heard this so many times that the urge to wince is becoming irresistible. The sentiment is emblematic of a vanity of the big data era – we are scientific, evidence-based, and we use data to make our decisions.

This slogan is repeated at every company I’ve worked at, by everyone from CEOs to “thought leaders”. In this post, I want to explain why we’re not as data-driven as we’d like to think. I’ve noticed some otherwise very smart people make recurring mistakes when looking at data and making decisions based on it, and I want to discuss a few of them in the hope of injecting some humility and skepticism into the discourse.

Before I begin, let me give the benefit of the doubt to those people who express this sentiment earnestly. I’m aware that people are well intentioned when they say this – they want to express the fact that they’re not exclusively guided by their experience, hearsay or whim when making decisions. They want to consider facts from the outside world and let this guide their thought process about a particular issue. As far as this goes – fine, but I contend that this sentiment is so general and fair-seeming that it goes without saying. Not many people would openly admit that they don’t like to consider the facts before making a decision. 

I would go further and say that people who frequently want you to know how data-savvy or data-focused they are do so as a way of sounding clever and lending legitimacy to whatever they’re about to say next. Whenever someone tells me they’re data-driven, I inevitably lean in to hear what silly comment couched in technical-sounding language will follow – the phrase has the opposite effect to the one intended.

Let’s explore some of the common blunders made by data-driven decision makers. 

No one has ever let data speak for itself – ever

If a data tree falls in a (random?) forest, does it make a sound? Easy – no. This is an error I see made by non-technical people, or those from non-scientific fields, who love the idea of objective, bias-free analysis. Too bad no one has ever done that in data science. Data cannot speak without an interpreter, and that interpreter is always mischievous, mixing in words of its own.

All data have assumptions and biases. Where we get our data from, the quality of that data and how we use it in analysis all imbue it with a structure we have to be mindful of. Even an activity as simple as reading a line graph carries assumptions that can trip us up if unchecked: there could be missing data, the time scale might not be homogeneous (have we skipped days?), or the scale of what is recorded could change halfway through. Even the choice of which range of values to show on the x and y axes will affect how the reader interprets the chart. These assumptions must be considered, and made explicit if there is any fear they could mislead the end user. If there is no fear – do it all the same.
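As a minimal sketch of the “have we skipped days?” check – the dates and values below are made up for illustration – a few lines of Python can surface the gaps that a line chart silently interpolates over:

```python
from datetime import date, timedelta

# Hypothetical daily readings. Note the silent gap between the 3rd and
# the 6th: a line chart would happily connect these points as if the
# series were continuous.
readings = {
    date(2023, 1, 1): 102,
    date(2023, 1, 2): 98,
    date(2023, 1, 3): 105,
    date(2023, 1, 6): 240,  # the recording scale may also have changed here
}

def missing_days(series):
    """List the dates absent between the first and last observation."""
    days = sorted(series)
    span = (days[-1] - days[0]).days + 1
    expected = {days[0] + timedelta(d) for d in range(span)}
    return sorted(expected - set(days))

print(missing_days(readings))  # the gap a quick glance at a chart hides
```

Running a check like this before plotting makes the assumption of a homogeneous time scale explicit rather than implicit.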

Where your data comes from is as important as how much data you have

The sad truth is that, in practice, most of the data we have as data scientists is junk. Choosing what to ignore is essential for meaningful analysis. More data does not equate to better analysis – especially if it comes from a biased or unreliable source. Garbage in, garbage out.

I was once asked to model how some process evolves over time. The problem was that the data points were not measuring the same unit: our observation of some phenomenon at time epoch 2 was not an observation of the same phenomenon we saw at time epoch 1. We were therefore not measuring the evolution of anything over time, but rather making statements about independent data on different days. To illustrate, consider measuring daily occupancy in a hotel – we count the number of guests in rooms every day and plot the result. If, however, each day’s figure came from a different hotel, we could not reasonably plot the numbers on a time-indexed line graph, since we would not be looking at the occupancy rate of the same hotel over time.
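The hotel example can be turned into a precondition check. This is a hypothetical sketch – the record layout and hotel IDs are invented – but the idea is simply to verify that every observation refers to the same entity before treating the series as “X over time”:

```python
# Hypothetical occupancy records: (date, hotel_id, occupied_rooms).
records = [
    ("2023-01-01", "hotel_a", 80),
    ("2023-01-02", "hotel_b", 95),  # a different hotel!
    ("2023-01-03", "hotel_a", 70),
]

def is_single_entity(rows):
    """True only if every observation refers to the same hotel --
    a precondition for plotting the series as 'occupancy over time'."""
    return len({hotel for _, hotel, _ in rows}) == 1

print(is_single_entity(records))  # False: this is not one hotel's history
```

A guard like this is trivial to write, and it forces the question of what the data actually measures to be answered before any chart is drawn.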

This distinction is important and fundamentally shifts the robustness of our whole analysis. We have to interrogate carefully where our data comes from, what biases it might have and whether it is in the format we need to test our hypotheses.

Just being skeptical doesn’t make you smart

It is true that confirmation bias is real, and there are many trigger-happy stakeholders who will make decisions based on one significant test in a sea of non-significant results (my views on significance testing in business belong in a different post). It pays dividends to be careful, demand more data and adopt a wait-and-see approach. Often, once we feel we have modelled a problem effectively, the best thing to do is wait for more data.
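To put a number on the “one significant test in a sea of non-significant results” worry: under the standard textbook assumptions of independent tests at a 5% significance level with no true effect anywhere, the chance of seeing at least one spuriously significant result grows quickly with the number of tests.

```python
# With 20 independent tests at alpha = 0.05 and no true effect anywhere,
# the probability of at least one "significant" result is
# 1 - P(no test comes up significant) = 1 - (1 - alpha) ** n.
alpha, n_tests = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one, 3))  # ~0.642
```

In other words, a trigger-happy stakeholder scanning twenty dashboards for a “winner” will find one nearly two times out of three even when nothing is going on.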

However, there are people at the other extreme for whom no amount of analysis is sufficient to act. They try to show how level-headed they are with comments like “we need other data sources to better understand the problem”, “we need to synthesise our data to separate signal from noise” or “the data is out of date”. All of these remarks are reasonable, but in many cases they are used as justification for inaction.

The purpose of data work and experimentation should be to inform action. If you are hiding behind these excuses to avoid making changes that otherwise seem reasonable, you’re not being cautious, you’re being cowardly. If you have no measurable threshold for how much data would be enough for you to act (how that is determined is a different problem), or for how you intend to separate signal from noise, then you’re making this mistake.

It’s a tricky blunder to spot because it disguises itself as wisdom. If data analysis is repeatedly used to say no because stakeholders see no evidence that an action will be successful, then by that logic they should go the whole way and never act at all – there will never be evidence of efficacy until after the fact. One telltale sign of this behaviour is how conveniently people settle for less data when there is a decision they already want to make. The standard of evidence for ideas they dislike is far higher than for ideas they favour.

Automation is not a substitute for thinking

When you write code it helps to automate as much as possible. Putting models into production and scheduling scripts are powerful tools that ultimately save time and effort. There is a danger, however, when we try to automate decisions that require a domain-specific understanding of the problem.

Natural Language Processing is full of problems which require a human to determine whether the results make sense – from interpreting word2vec similarity outputs to validating that text cleaning has been sensible, some problems require humans in the loop (for the time being). We can make the space those humans occupy small, but we cannot fully automate it away without risking making our analysis meaningless. Clustering is another example: we can automate it to a great extent, but if the output does not make sense for the problem in question, it may be necessary to compromise on some performance metrics in order to gain interpretability.
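To make the word2vec point concrete, here is a toy sketch of the kind of nearest-neighbour output a human has to review. The “embeddings” below are invented three-dimensional vectors, not a real trained model; only the cosine-similarity ranking itself is standard:

```python
from math import sqrt

# Hypothetical toy "embeddings" -- in a real project these would come
# from a trained word2vec model; the numbers here are invented purely
# for illustration.
embeddings = {
    "hotel":  (0.90, 0.10, 0.00),
    "motel":  (0.85, 0.15, 0.05),
    "banana": (0.00, 0.20, 0.95),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def nearest(word, k=2):
    """Rank every other word by similarity -- output meant for human review."""
    sims = [(other, cosine(embeddings[word], vec))
            for other, vec in embeddings.items() if other != word]
    return sorted(sims, key=lambda t: -t[1])[:k]

# No metric can tell us whether these neighbours make sense for the
# domain -- a human still has to eyeball the list and judge.
print(nearest("hotel"))
```

The similarity computation is fully automated; the judgement of whether “motel” is a sensible neighbour of “hotel” for the problem at hand is exactly the part that stays with the human.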

If it helps, I’ve formulated Junaid’s 1st law “Automate as far as possible, but not further”. 

Sometimes it’s okay to ignore the data

Sometimes it helps to take a risk, ignore what the data is presently suggesting and go with your instinct. Frequent A/B testing can lead teams to chase local optima and lose sight of the big picture. If you gambled on a new product feature and early results show that users respond adversely, it does not necessarily follow that you should discontinue it. Think about the other benefits it might bring to your business and users further down the line. It may be that once users have got used to the new feature, they begin to respond positively.

Ultimately there is nothing wrong with using intuition sometimes. The problem arises when we mistake our intuition for inference from data and act with false confidence. I have seen people deliberately dress their opinions up as inferences from data analysis and refuse to acknowledge that their views originate in their own minds rather than the external world. We should admit when we’re using our intuition and be open to challenge if others have differing ideas.

There are more data sins but these are the main ones that come to mind. Hopefully I’ve illustrated the value of thinking through what we’re doing, checking our assumptions and realising that this enterprise is far more complicated than the buzzwords and slogans would have us believe. 
