The UCOVI Blog



Welcome to UCOVI's repository of data discussions and interviews.

➡ Click here if you wish to contribute an article.


Latest Post - Developments in data beyond the AI smokescreen: Microsoft Fabric vs Data Contracts


➡ Go to Previous Articles

Developments in data beyond the AI smokescreen: Microsoft Fabric vs Data Contracts

A look at the key talking points of the data community in 2023/24 that don't involve ChatGPT

Ned Stratton: 30th January 2024

I like to keep my finger on the pulse of data stuff by attending meetups and conferences in London and the UK. A quirk of the data meetup and conference scene I've noticed is that it divides itself into two camps: Microsoft vs everything else.

Why is this? Microsoft pumps out new data products and renamed versions of existing ones each year, with Fabric in 2023 being no exception, then uses its Most Valuable Professional (MVP) scheme to reward the people who blog about and speak on these products ad nauseam. MVP status gives them cheaper rates on cloud computing, a nice plaque with MVP written on it, and other benefits in kind/ego scratches. The fierce competition for it among Microsoft data folk keeps the Microsoft community turning on its axis.

That accounts for the Microsoft events (SQLBits, PASS Data Summit etc.) and their popularity. What do I mean by everything else? I'm getting at modern data stack tools: Snowflake, dbt, Databricks, and other tools made by brands you'll see with exhibitor stands at Big Data LDN. These tools do the same things as Microsoft's tools, for people doing the same range of data jobs; they're just not made by Microsoft. Their vendors therefore sponsor any and all data-themed conferences going that aren't Microsoft-affiliated.

Explained like this, it sounds like the data profession and community are a permanent "Europe vs Rest of the World" charity football match, which is ridiculous. But it's sort of true, and exemplified by two meetups in London I attend: London Business Analytics Group and London Analytics Group. Same names (almost), same format (100-150 people in central London on a weekday evening once a month getting free pizza and a 45-minute talk), same target audience (data analysts/engineers), same subject matter (data stuff). Same everything really, except the former leads on Microsoft analytics tools and approaches (Excel/Power BI/SQL/Fabric), and the latter covers analytics without ever mentioning the M-word. As such, there is practically zero crossover in the people who attend both.

Looping back eventually to the title and first sentence, I make a point of going to both so that I can get each camp's zeitgeist of data industry developments. Predictably, the rise of AI and ChatGPT has been the prominent talking point for both sides in 2023 and 2024 so far. The interesting bit is their respective second specialist subjects of the past 12 months.

The Microsoft-ers have been all about Fabric, which was announced in May 2023 and is essentially Power BI, Azure, SQL Server in the cloud and ersatz Jupyter notebooks all in one toothpaste-green branded mothership. It's the technological development that's dominating the collective conversation, attention, and emotions (a mixture of excitement and resigned dread at all the new names to learn) of the Microsoft Data Community.

On the other side, the everything else-ers have been getting slowly excited about Data Contracts.

The brainchild of Andrew Jones (Principal Data Engineer at leading fintech GoCardless), Data Contracts are a data governance and quality management approach underpinned by a new way of thinking about who is responsible for managing data. He's written it up in his book Driving Data Quality with Data Contracts, published in June 2023, which is an accessible 180-page read that explains the concept and its implementation options well. It also has a useful potted history of warehouses/lakehouses/ETL/ELT, and a good explainer on change data capture.

The basic idea behind it is that businesses aren't managing their data well enough for their data analysts to go beyond descriptive statistics, because the time and cost of data engineering teams transforming all of it from source into a semantic layer are too restrictive, and that the way to solve this is to shift left. This means tasking the software developers who make data-generating apps with documenting and supporting that data for the benefit of the analysts and business stakeholders downstream of them. This is formalised by the data contract, which is a metadata document in YAML or JSON (or anything human- and machine-readable) that describes the data produced by any software app in terms of what fields it has (nullability, data type, anonymisation method if PII), where it is stored, how long it can be kept, how frequently it gets updated, and anything else useful.
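As a rough illustration (my own sketch, not an example taken from Jones's book), a contract for a hypothetical orders event stream might capture something like the following. The field names and policies are invented; in practice the contract would live as a YAML or JSON file in a shared repo, and it's shown here as the Python structure you'd get after loading it.

# A minimal, hypothetical data contract for an "orders" event stream.
# In practice this would be a YAML/JSON file in a shared repository;
# it is shown here as the Python dict you would get after loading it.
orders_contract = {
    "dataset": "orders",
    "owner": "checkout-app-team",            # the team accountable for the data
    "storage": "s3://warehouse/raw/orders",  # where consumers can find it
    "retention_days": 365,                   # how long it may be kept
    "refresh": "hourly",                     # how frequently it gets updated
    "fields": [
        {"name": "order_id",    "type": "string",    "nullable": False, "pii": False},
        {"name": "customer_id", "type": "string",    "nullable": False, "pii": True,
         "anonymisation": "sha256"},          # PII fields declare how they are masked
        {"name": "amount_gbp",  "type": "decimal",   "nullable": False, "pii": False},
        {"name": "created_at",  "type": "timestamp", "nullable": False, "pii": False},
    ],
}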

The reason it's called a contract and not a simple document is that it sets the expectations of whoever uses the data for derivative business products, further analytics or AI, and binds that software application to the continued creation of the same data as specified in the contract. So if a future release of the app generates data that is missing one of the columns included in the original contract, a warning is triggered - or some other alert that someone might actually act on - before a weekly report crashes embarrassingly in front of the CEO.
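A minimal sketch of what that enforcement could look like, assuming the hypothetical orders_contract above; this is purely illustrative rather than the tooling Jones describes, but it shows the principle of a contract being something a machine can check rather than a document nobody reads.

def check_against_contract(records, contract):
    """Return a list of warnings where a batch of records breaks its contract.

    For illustration it checks just two rules: every contracted field is
    present, and non-nullable fields are never null.
    """
    problems = []
    for field in contract["fields"]:
        name = field["name"]
        for i, record in enumerate(records):
            if name not in record:
                problems.append(f"record {i}: missing contracted field '{name}'")
            elif record[name] is None and not field["nullable"]:
                problems.append(f"record {i}: non-nullable field '{name}' is null")
    return problems

# Example: a new release of the checkout app stops sending amount_gbp.
new_batch = [{"order_id": "A1", "customer_id": "c9", "created_at": "2024-01-30T10:00:00Z"}]
for problem in check_against_contract(new_batch, orders_contract):
    print("WARNING:", problem)  # in a real pipeline this would alert the owning team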

Cynical and long inured to buzzwords, I was initially sceptical, on the lookout for flaws, and tempted to speculate that Jones might be another under-utilised perm-role data guy channelling professional satisfaction into a personal project. (See the recent UCOVI blog post on free roles.) A bit of me also resented the idea that data engineering teams and the pipelines they build are “bottlenecks” - a signal of the self-loathing and original sin that the data profession seems to indulge in like no other. (Dentists aren't referred to as "bottlenecks in the process of alleviating chronic toothache" because root canal work costs a bit and might take an afternoon out of your time. They're just referred to as dentists.)

But in a world where so many new technologies and approaches are actually rehashes of things that have already come and gone, data contracts might genuinely be problem-solving and transformative. They kill two birds with one stone: documentation and data quality control. The contract describes and defines what the important data is at source for the benefit of the business, and keeps the source accountable for its continued, non-degraded production. They also sit nicely with the UCOVI principle that much of data sits within IT, and that tech teams should expand themselves to incorporate the management and first steps of analytics on the data they generate as a byproduct.

The possibility that data contracts might just work is borne out by two things. Firstly, Andrew Jones has actually implemented them at GoCardless, where he's been for nearly 7 years. Secondly, my own experience in data roles has shown me that poor documentation and flimsy data governance tend to be joined at the hip. I've seen sudden changes in the content of a source data column - which only the app developer knew was about to happen, and which they had no platform to communicate to others - flow unchecked into onward analytical datasets, only to be discovered through failed warehouse loads and wrong reports. Data contracts unify the document and the control into one system that machines can automate from and humans can refer to, which would seem to be the solution to this.

Some would question the core assumption underpinning the viability of data contracts: that app developers will, with no fuss, incorporate into their core roles both accountability for their apps' data byproduct and responsibility for its documentation and analytical ease of use. I disagree; developers (or good ones anyway) gain job satisfaction from using their coding and design skills to build products that have defined and lasting value. High-quality, usable data emanating from what they have built is an extension of that value.

The major room for improvement in Andrew Jones's conception of data contracts – both in the theory and the technical blueprints in his book (JSON/YAML scripts for each dataset forming part of a consolidated repo to be used to enforce rules and fire warnings) – is in taking them beyond telemetry data from software and apps.

Yes, much of the data that has analytical or AI value for a given business does come from the logs generated by their software products.

But what about crap Salesforce data, which is the fault of sales and finance leaders asking for superfluous new columns in Order tables that make no sense and/or allowing shoddy record keeping by their teams, aided and abetted by a CRM team offering no pushback on this and making no attempt to preserve data quality?

What about historical customer data owned by a business that's been bought in a merger/corporate acquisition, where the data is immensely valuable to the buying organisation but only if understood properly?

These are both common cases of acute pain for analytics teams and roadblocks to business understanding and insight, and both should be right in the firing line of an effective system of data contracts. The people responsible for these situations are not the kind you'd expect to know what a YAML file is, and for data contracts to help them, the contracts will need to be more approachable, less developer-y, and built to anticipate a different set of problems.

Even so, I'm more convinced by the durability and game-changing potential of Data Contracts than I am by Fabric OneLake.


Previous Articles

No-code data part II (Ned Stratton: 22nd November 2023)

White Paper: The free-role data analyst (Ned Stratton: 4th September 2023)

Do data analysts need to read books? (Ned Stratton: 10th May 2023)

No code data tools: the complexity placebo (Ned Stratton: 17th March 2023)

The 2023 data job market with Jeremy Wyatt (Ned Stratton: 24th January 2023)

Making up the Numbers - When Data Analysts Go Rogue (Ned Stratton: 2nd December 2022)

Data in Politics Part 2 - Votesource (Ned Stratton: 12th September 2022)

Data in Politics Part 1 - MERLIN (Ned Stratton: 2nd September 2022)

Interview: Adrian Mitchell - Founder, Brijj.io (Ned Stratton: 28th June 2022)

The Joy of Clunky Data Analogies (Ned Stratton: 14th April 2022)

Event Review - SQLBits 2022, London (Ned Stratton: 17th March 2022)

Interview: Susan Walsh - The Classification Guru (Ned Stratton: 21st February 2022)

Upskilling as a data analyst - acquiring knowledge deep, broad and current (Ned Stratton: 31st January 2022)

Beyond SIC codes – web scraping and text mining at the heart of modern industry classification: An interview with Agent and Field's Matt Childs (Ned Stratton: 8th December 2021)

Debate: Should Data Analytics teams sit within Sales/marketing or IT? (Ned Stratton: 26th October 2021)

Event Review: Big Data LDN 2021 (Ned Stratton: 27th September 2021)

The Swiss Army Knife of Data - IT tricks for data analysts (Ned Stratton: 9th September 2021)

UK Google Trends - Politics, Porn and Pandemic (Ned Stratton: 15th October 2020)

How the UK broadcast media have misreported the data on COVID-19 (Ned Stratton: 7th October 2020)

The Power BI End Game: Part 3 – Cornering the BI market (Ned Stratton: 21st September 2020)

The Power BI End Game: Part 2 – Beyond SSAS/SSIS/SSRS (Ned Stratton: 28th August 2020)

The Power BI End Game: Part 1 – From Data Analyst to Insight Explorer (Ned Stratton: 14th August 2020)

Excel VBA in the modern business - the case for and against (Ned Stratton: 13th July 2020)

An epic fail with Python Text Analysis (Ned Stratton: 20th June 2020)

Track and Trace and The Political Spectrum of Data - Liberators vs Protectors (Ned Stratton: 12th June 2020)

Defining the role of a Data Analyst (Slawomir Laskowski: 31st May 2020)

The 7 Most Common Mistakes Made in Data Analysis (Slawomir Laskowski: 17th May 2020)

COVID-19 Mortality Rates - refining media claims with basic statistics (Ned Stratton: 10th May 2020)