The UCOVI Blog

Beyond SIC codes – web scraping and text mining at the heart of modern industry classification: An interview with Agent and Field's Matt Childs

Ned Stratton: 8th December 2021

B2B publishing, exhibitions and digital marketing are often the breeding-ground industries, at least in the UK, for entry-level data analysts keen to pop their SQL cherries and get their first hands-on experience solving business data challenges with datasets in the millions of records.

These datasets are the marketing databases owned by companies in these industries. Stored in anything from a big spreadsheet to a HubSpot CRM, they contain contact records of all of that company's previous and current customers, enquiry form fills and other sales leads. And of course pre-GDPR contact records bought from marketing list brokers; your uncle Steve who shops too much online and never unticks the mailing preference boxes.

In theory, the marketing database is the holy-grail strategic resource of high-engagement email blasts and business intelligence, but this is often undermined by patchy industry classification: contacts in automotive companies categorised as banks, or not categorised at all.

This wrecks any chance of understanding the industry split of a customer base, executing marketing campaigns tailored to the activities of organisations, or building data and prioritising lead gen in relevant organisations absent from the database. Faced with this, companies stick to marketing and reporting lead flow on a person-by-person basis, even in B2B.

This was my predicament in my first data job at an events business tailored to the UK public sector some five years ago, where 70% of the marketing database were UK government and wider public sector professionals, with the remainder all private sector businesses. At the outset, 33% of the records had no industry sector assigned. With fuzzy text matching and up-to-date lists of public bodies, schools, charities and health trusts, my data team and I were able to accurately classify all the public sector data. For the private sector, we wrapped our heads around the 928 UK SIC 2007 codes and their parent sections, pulled the latest extract of Companies House, and managed a two-thirds match rate against our own data.
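As a rough illustration of the fuzzy text matching step, here is a minimal Python sketch using only the standard library. It is not the code we ran at the time: the reference list, the sector labels, the `normalise` and `classify` helpers and the 0.85 cutoff are all assumptions made up for the example.

```python
import difflib
import re

# Hypothetical reference list: official organisation name -> sector label.
# In practice this would be loaded from published lists of public bodies.
REFERENCE = {
    "Leeds City Council": "Local Government",
    "Department for Education": "Central Government",
    "Barts Health NHS Trust": "Health",
}

# Common legal suffixes that add noise when comparing names.
SUFFIXES = re.compile(r"\b(ltd|limited|plc|llp)\b")


def normalise(name: str) -> str:
    """Lower-case, strip punctuation and legal suffixes, collapse spaces."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    name = SUFFIXES.sub(" ", name)
    return " ".join(name.split())


# Map each normalised reference name back to its official spelling.
NORMALISED = {normalise(official): official for official in REFERENCE}


def classify(raw_name: str, cutoff: float = 0.85):
    """Return (official name, sector), or (None, None) if nothing clears the cutoff."""
    hits = difflib.get_close_matches(
        normalise(raw_name), list(NORMALISED), n=1, cutoff=cutoff
    )
    if not hits:
        return None, None
    official = NORMALISED[hits[0]]
    return official, REFERENCE[official]
```

Calling `classify("Leeds City Counsil")` would pick up the misspelt council, while a name with no close counterpart in the list comes back as `(None, None)` and stays uncategorised. The cutoff is the main design choice: set it too low and false matches creep in, too high and typos slip through unmatched.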

Because the public sector data was of higher priority than the private, and with the overall uncat rate down to 10% from 33%, our efforts were judged a success, and for years I told myself that I was a data-classification expert. But the SIC codes had a sting in the tail.

Fast forward to a year ago when I started my current data analyst role for an international commodities news service. I genuinely believed that I could replicate my exploits on the public sector database with a fully private sector one which had as its thematic industrial scope the entire global market for supply and purchase of metals, minerals, agricultural and forest products. Not quite.

SIC codes (and their American cousins NAICS codes) are undone both by their own structure and outdatedness and the limitations of the data collected about the companies they are intended to categorise.

SIC codes in the UK were last updated in 2007, the year the first iPhone was released and two years before Bitcoin's genesis block. As such, they offer separate codes for men's and women's underwear as well as for marine fishing and marine aquaculture, but only offer "Computer facilities management activities" and "Data processing, hosting and related activities" to cover anything from SME outsourced IT providers to Facebook, AWS, Hyve Managed Hosting and WeTransfer. They are next to redundant as a framework around which to identify and classify companies in modern sectors, and they are as unreliable as they are meaningless. A company's SIC code is picked by whoever submits the annual financial returns to Companies House. There's no legal imperative or financial pressure to choose the correct one, and for Limited Liability Partnerships no requirement to submit one at all – something of a constraint given how many consulting, accountancy and law firms are LLPs.

So if not SIC codes, what else?

Marketing teams do their bit to add industry classification to a database by including it as a dropdown question on enquiry and booking forms. You're in luck as a data analyst if they make it a mandatory field. It's definitely your day when "Other" is not in the options, and you've practically won EuroMillions if they make the industry options consistent with existing segmentation on the marketing database, and make the form single-select so that users don't tick all ten of the options. Even then, you rely on the user actually caring what they put. It's better than nothing I guess.

Other options available are trade association membership lists, "Top 100" online articles, and other sources of proxy classification such as the FTSE 100 index, the S&P 500 or the Fortune 500, which can be web scraped or cut and pasted into a spreadsheet. Though more accurate than form-filled data or SIC codes, the weakness of these sources is that they give you the obvious big-player organisations within an industry, when what you really want is volume and companies you didn't know about already.

A more pioneering approach is the one taken by The Data City. It specialises in building databases of sector-classified companies in emerging technical areas such as FinTech, AgriTech and AI, and does so by web scraping and using data science techniques to identify a company's sector, and its relationship to other sectors, from the text content in key areas of that company's corporate website. It has constructed a corpus of 40 or so top-level Real-Time Industrial Classifications (RTICs) around this, which describes the digital economy in understandable terminology and with enough detail to build niche marketing lists.
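To give a feel for what classifying a company from its website text can look like, here is a toy Python sketch of a naive Bayes text classifier. It is emphatically not The Data City's method, which I have not seen: the training snippets, the two sector labels and the `tokens`, `train` and `predict` helpers are invented for illustration, and a real system would work from scraped pages and far more labelled examples.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labelled snippets standing in for scraped website text.
TRAINING = [
    ("FinTech", "mobile payments platform for banks and lenders"),
    ("FinTech", "open banking api for payments and lending"),
    ("AgriTech", "sensors and drones for crop monitoring on farms"),
    ("AgriTech", "precision farming software for crop yield and soil"),
]


def tokens(text):
    """Split text into lower-case alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per sector and documents per sector."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokens(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the sector with the highest naive Bayes log-score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # sector prior
        for word in tokens(text):
            if word not in vocab:
                continue  # words never seen in training carry no signal
            # Add-one smoothing so an unseen word/sector pair is not fatal.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

With `model = train(TRAINING)`, a call such as `predict("We build payments software for banks", model)` scores the text against each sector and returns the best fit. The appeal over SIC codes is that the labels come from what a company says it does today, not from a box ticked on a filing.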

While classifying marketing data into arms of the UK public sector in my first role five years ago, I met Matt Childs, a marketing director with a passion for data who would go on to become a market strategist at The Data City. He is now heading up Agent and Field, which specialises in building databases of companies in emerging sectors around the same methodology driven by supervised machine learning and open textual data from company websites. We caught up and discussed the difficulties around accurate classification of company data and the limitations it places on account-based marketing, and compliance with GDPR. Matt offers the insight that classification tools even as sophisticated as The Data City will never be able to give a marketing data team all of the companies in a chosen sector, but nonetheless act as a useful recommender system for closely related emerging sectors, with companies attributable to them. This offers marketing and product teams within businesses the chance to be truly data driven in what they produce and who they sell to.