The UCOVI Blog

An epic fail with Python text analysis

Ned Stratton: 20th June 2020

They say yesterday's news is today's chip paper, and nothing could describe better last week's UCOVI blog post about the COVID-19 NHSX Track and Trace phone app – abandoned as a project by the government this week. (The Political Spectrum of Data quiz is still going nicely though, with a healthy number of responses.)

Fresh in my mind right now is a schoolboy data error from yesterday, which I'll share with you now.

I work in finance in London as a data analyst and report developer - mainly SQL and Power BI with a bit of Python on the side. Every month I send out an email to business users announcing new Power BI dashboards built or edited since the last one.

As we were a bit low on updates for June, I dedicated yesterday to visualise the most common bigrams and trigrams* from our customer call centre CRM notes, which would be sliceable by account type and time periods as to allow the business to see changes over time in the prevalence of issues facing customers.

I would use Python's NLTK library to extract unique bigrams and trigrams from the free-text notes, then write them back with their frequency counts by week and category to a new data warehouse table, which would be the data source for a Power BI model and dashboard.

Super-advanced or what?!

It was all going well; script debugged, half a million rows of notes processed, table built, DAX measures working. "I'm on fire today, Doug", I said to my flatmate (we are both working from home during lockdown) as I opened up Power BI desktop at 4pm ready to transform the business's understanding of what our customers have been telling call centre agents over the past 6 months. If I'd have been near a mirror, I'd have high-fived it.

This is where the fail starts.

So the first visual I create is a simple table of bigrams/two-word pairs and total sum of their occurrences across all notes, sorted in descending order by most common, and filtered to the top 20. As we're in finance, I'm expecting something like 'card replacement' at the top with 15612 (plucking random number from the sky), then 'transfer funds' with somewhere around 1000 fewer mentions, then some other common financial term, or possibly something Coronavirus related, under that with a different, slightly lower total.

But to my bemusement, the bigrams in position 4-7 have EXACTLY THE SAME FREQUENCY COUNT. The ones in 8-11 share the same count as well, and the pattern goes on until about position 17.

Why? It turns out that the 'free-text' notes I'm analysing aren't all that free-text. A significant minority of them are actually fixed questions that the call centre agents write answers to on a form. For example, "Q1: Is the caller the main account holder? No – caller is POA to acct hldr."

So my most common two-word phrases are actually bits of a standardised question that occurs 1000s of times and has been included in the notes data – meaning the business will learn precisely nothing from my Big Data mining. (Well, perhaps that the forms are working…).

This is entirely correctable with a block of code to discount the questions before doing the analysis, and can still make the June newsletter… but only after wasting a whole day.

The lesson is: always start with Exploratory Data Analysis – the Understanding that puts the U in UCOVI.

Exploratory Data Analysis is fancy term for something simple – looking at your data. In SQL, it's a 'SELECT TOP 1000 * FROM', then running counts of rows for category columns you identify, with a max and min of what looks like the date column to see the duration your data spans. In Excel, it's opening the spreadsheet, seeing how long and wide it is, and expanding text columns to see if any go over the page.

Only by doing this will you spot missing values, unclean text, or columns that contain things entirely different from what the column heading suggests.

Luckily I got a few reports done while I was waiting for Python to natural-language-process survey templates for 4 hours.

*Bigrams and trigrams are two-word and three-word sequences. In the sentence "The cat sat on the mat", the cat, cat sat, sat on, on the, the mat are all the possible bigrams, and the cat sat, cat sat on, sat on the, on the mat are all the trigrams.

Excel VBA in the modern tech business: The case for and against (13/07/2020) ⏪ ⏩ Track and Trace and The Political Spectrum of Data - Liberators vs Protectors (12/06/2020)

⌚ Back to Latest Post