Slawomir Laskowski: 17th May 2020
Everyone makes mistakes, even data analysts. Just because a piece of work has been labelled "analysis" does not make it an oracle. Similarly, finding a statistical report on a glossy website awash with stylish infographics and 'Institute of...' emblems is no guarantee of its findings' validity. (This is especially true in the era of fake news.)
Data analysis, like any calculation/interpretation-based study, is prone to errors. And even when the study itself is correct, further problems arise when someone starts to misquote or cherry-pick convenient stats from it. The importance of this is what puts the I – Interpret – into UCOVI.
Analysis projects can produce various types of mistakes. What are the most common ones? How can you avoid them as an analyst? What red flags should you look for as a report reader?
Analysts are not always able or willing to check the correctness of the source data they are using. It could contain missing values, rounding errors, or duplicate records. If these go unaddressed or unaccounted for, they will produce descriptive statistics that simply do not describe the population or situation.
Data can also be unreliable when it is "second-hand". Analysis reports are often based on the results of other reports, citing studies by external research or consulting companies. These studies, irrespective of the reputations of the organisations that put them together, may themselves draw on incomplete data, rest on invalid assumptions, or be four years out of date. That is not reliable data for the purpose of the analysis you are doing or reading.
In analytics, there is a simple rule: rubbish in, rubbish out.
Data analysts should check their data for telltale signs such as multiple rows sharing the same 'ID' value, or "average" measurements suspiciously rounded to whole numbers. Rectify data quality issues before running the analysis, or at least disclose them as potential sources of uncertainty, presenting findings as confidence intervals if necessary.
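For analysts working in a tool like Python with pandas, these checks take only a few lines. The sketch below is illustrative rather than prescriptive – the file name and column names ("orders.csv", "order_id", "amount") are hypothetical stand-ins for whatever your source data actually contains.

```python
# A minimal sketch of pre-analysis data quality checks, assuming a pandas
# DataFrame with hypothetical columns "order_id" and "amount".
import pandas as pd

orders = pd.read_csv("orders.csv")  # hypothetical source file

# 1. Duplicate records: multiple rows sharing the same ID value.
duplicate_ids = orders[orders.duplicated(subset="order_id", keep=False)]
print(f"{len(duplicate_ids)} rows share an order_id with another row")

# 2. Missing values per column.
print(orders.isna().sum())

# 3. Suspicious rounding: a supposedly continuous measure where every value is whole.
if (orders["amount"].dropna() % 1 == 0).all():
    print("Every 'amount' is a whole number - possibly pre-rounded at source")
```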
It is also important for report users to take the time to read the italics, small print and disclaimers behind the asterisks.
An inherent element of data analysis is comparing results with an appropriate benchmark. This comparison can be a different period (last month or year) or another entity - data for other products, companies, industries.
But you can't compare the growth dynamics of a large corporation with those of a small start-up. Nor can you make a like-for-like comparison of December sales against July's if your business is highly affected by seasonality.
Incorrect benchmarking can hide the true increase behind a superficial decrease, causing the false alarms within a business that make its leaders wish they had gone with their hunch all along.
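A rough sketch of what a like-for-like comparison might look like for seasonal data, again assuming pandas and a hypothetical monthly sales file with "month" and "sales" columns; the dates are invented. The point is simply that December is benchmarked against the previous December, not the preceding July.

```python
# A sketch of a like-for-like benchmark for seasonal data (hypothetical file and dates).
import pandas as pd

sales = pd.read_csv("monthly_sales.csv", parse_dates=["month"]).set_index("month")

dec_2019 = sales.loc["2019-12", "sales"].sum()
jul_2019 = sales.loc["2019-07", "sales"].sum()   # naive benchmark, distorted by seasonality
dec_2018 = sales.loc["2018-12", "sales"].sum()   # like-for-like benchmark

print(f"vs July 2019:     {(dec_2019 / jul_2019 - 1):+.1%}")   # inflated or deflated by the season
print(f"vs December 2018: {(dec_2019 / dec_2018 - 1):+.1%}")   # the comparison that matters
```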
Producing charts and tables without context is number-crunching, not analysis. Remember that there is a story behind the movements in every dataset. It can be a significant event (the start of a new advertising campaign, or a recession), seasonality, or increased competition, to name but a few. Sometimes a single fact or anecdote, however trivial or idiosyncratic it might seem, changes the entire context of the analysis, and thus how the data should be interpreted. A good piece of analysis is enriched by real-life events presented as constant lines or milestones on a chart, and a data analyst can make this happen by asking a business user for a second pair of eyes on the more unusual trends the data unearths.
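As a sketch of what that might look like in practice, the matplotlib snippet below overlays example events on a trend line. The dates, labels and source file are placeholders, not real data.

```python
# Overlaying real-life events on a trend so the reader gets the story, not just the line.
import matplotlib.pyplot as plt
import pandas as pd

sales = pd.read_csv("monthly_sales.csv", parse_dates=["month"]).set_index("month")

fig, ax = plt.subplots()
ax.plot(sales.index, sales["sales"])

# Milestones that explain the movements in the data (invented examples).
events = {
    "2019-03-01": "New ad campaign",
    "2019-09-01": "Competitor enters market",
}
for date, label in events.items():
    ax.axvline(pd.Timestamp(date), linestyle="--", color="grey")
    ax.annotate(label, xy=(pd.Timestamp(date), ax.get_ylim()[1]),
                rotation=90, va="top", fontsize=8)

ax.set_title("Monthly sales with business context")
plt.show()
```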
The average is a comfortable measure to reach for. Just take one period versus another, calculate their averages and compare. Something has risen or fallen – simply say what it is, and the "analysis" is done. Unfortunately, it's not that easy. Segmentation by category and awareness of your data's distribution are both necessary to see if the average is skewed by outlying groups or data points.
Too many analysts and decision makers look at the average and take it for "the truth", "what occurs most often", "what we need to remedy", or all of the above. Say the average score on a test taken by 100 people is 65%. That doesn't mean most people get that score. It could easily be that within the 100 test-takers, there are two clusters scoring around the low forties and high eighties respectively, balancing each other out to put the mean half-way between them at 65%. The conventional mean average – sum divided by count – can be genuinely misleading in cases like this.
Good analysis swerves around this by using the most appropriate measure of what is "normal" – the central tendency. This could be the mean, but it could also be the median, or a histogram of values grouped into ranges with each range's frequency, which most effectively exposes bimodal distributions like the test-score example above.
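A quick synthetic illustration of that test-score scenario (arbitrary random seed, invented numbers): with two equal clusters, both the mean and the median land in the empty middle of the distribution, and only the histogram shows what is really going on.

```python
# Two clusters of scores around the low forties and high eighties give a mean
# near 65% even though almost nobody scores 65%.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
scores = np.concatenate([
    rng.normal(42, 4, 50),   # cluster in the low forties
    rng.normal(88, 4, 50),   # cluster in the high eighties
])

print(f"mean:   {scores.mean():.1f}")      # ~65, a score almost nobody gets
print(f"median: {np.median(scores):.1f}")  # also misleading here - it falls between the clusters

# The histogram makes the bimodal shape impossible to miss.
plt.hist(scores, bins=np.arange(30, 101, 5))
plt.xlabel("Test score (%)")
plt.ylabel("Number of test-takers")
plt.show()
```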
Even better analysis will consider whether normality or "most-common" is what's important. Sometimes the maximum, minimum or distance from a fixed constant needs to be front-and-centre in a report. If you were building flood defences, you wouldn't base their height on the average sea level on the nearest coast, would you?
Sharp rises or falls from one data point to another on charts attract the report user's attention quickly. The reason they are so impactful is that the user infers from the sharpness of the change that something significant has happened that they need to understand and react to.
If a significant change is due to something other than the core trend the chart intends to report – a change in calculation methodology or an event specific to that data point – then this needs to be explained prominently in the report's commentary.
If not, the intended message or "true story" will be overshadowed by this unexplained glitch – or, worse, the result will be the type of false alarm mentioned in point two (incorrect benchmarking).
The final message is essential to the analysis. It also goes by other names: the application, the recommendation, the idea. In practice, many consumers of data analysis read only the outcomes – and are not very interested in the rest. On the other hand, many analysts focus on showing as many calculations, tables, charts, infographics and descriptions as possible. This is because being an analyst is quite a thankless role. Hardly anyone is interested in the complexity of a given analytical problem, the difficulty of obtaining data, or the sophistication of the code used to query the database. Showing as comprehensive and detailed a range of stats and trends as possible is the analyst's way of validating the technical hoops he has had to jump through.
By designing a report so that its opening state is a simple, uncluttered dashboard of the most important metrics, with detailed, broken-down insight included as supporting pages or tooltips, the analyst can show the full depth of the work he has done in a way that prioritises what the business user needs to know first.
Analysis is susceptible to bias because it is itself a product of the knowledge, perspective and experience of the analyst and of the user who requested it. Bias also creeps in through other routes – for example, over-emphasis on one aspect of the problem. Instead of analysing the issue from different perspectives, the shortcut is taken of considering only the parameters of the original request, which builds in the confirmation bias of whoever commissioned the analysis.
On the other hand, when certain assumptions are treated as off the table for questioning but the data shows otherwise, many analysts prefer not to risk deviating from the starting hypothesis and instead actually adapt the data to fit it. In more advanced data models (machine learning), a particular variation of bias lies in trying to show how effective your model is, typically by over-fitting the model to historical data. In this case, the desire to prove the correctness of your own work overshadows the higher goal – the model's accuracy on data it has not yet seen.
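A small sketch of the over-fitting trap, using scikit-learn on synthetic data: an unconstrained decision tree looks near-perfect on the history it was fitted to and noticeably worse on data it has never seen. The numbers here are invented purely to make the point.

```python
# Over-fitting illustrated with a held-out test set and synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=300)   # signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = DecisionTreeRegressor()          # no depth limit: free to memorise the noise
model.fit(X_train, y_train)

print(f"R^2 on the data it was fitted to: {model.score(X_train, y_train):.2f}")  # typically ~1.00
print(f"R^2 on held-out data:             {model.score(X_test, y_test):.2f}")    # noticeably lower
```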
A good statistical method to overcome bias in analysing the data around a subject is to actively acknowledge it in the original framing of the question you want to answer – set up a hypothesis that your analysis project will formally test. "I believe Y happened this quarter because X happened – is this true?" will give you the framework not only to scrutinise the strength of the correlation between them, but also to bring other candidate factors – the Zs you had not considered – into the equation as well.
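As a minimal sketch of putting that question on a statistical footing, the snippet below uses scipy to measure the strength and significance of the X–Y relationship and then brings a second candidate factor, Z, into the picture. The file and column names are placeholders for your own metrics.

```python
# A hedged sketch of testing "did Y move because X moved?" rather than asserting it.
import pandas as pd
from scipy import stats

df = pd.read_csv("quarterly_metrics.csv")   # hypothetical file with columns x, y, z

# How strong is the X-Y relationship, and could it plausibly be chance?
r_xy, p_xy = stats.pearsonr(df["x"], df["y"])
print(f"X vs Y: r = {r_xy:.2f}, p = {p_xy:.3f}")

# Bringing Z into the equation: perhaps it explains Y at least as well.
r_zy, p_zy = stats.pearsonr(df["z"], df["y"])
print(f"Z vs Y: r = {r_zy:.2f}, p = {p_zy:.3f}")
```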