A look at the key talking points of the data community in 2023/4 that don't involve ChatGPT
Ned Stratton: 30th January 2024
I like to keep my finger on the pulse of data stuff by attending meetups and conferences in London and the UK. A quirk of the data meetup and conference scene I've noticed is that it divides itself into two camps: Microsoft vs everything else.
Why is this? Microsoft pumps out new data products and rebrands of existing ones each year, with Fabric in 2023 being no exception, then uses its Most Valuable Professional (MVP) scheme to reward the people who blog about and speak on these products ad nauseam with MVP status. This gives them cheaper rates on cloud computing, a nice plaque with MVP written on it, and other benefits in kind/ego scratches. The fierce competition for this among Microsoft data folk keeps the Microsoft community spinning on its axis.
That accounts for the Microsoft events (SQLBits, PASS Data Summit, etc.) and their popularity. What do I mean by everything else? I'm getting at modern data stack tools: Snowflake, dbt, Databricks, and the other tools made by brands you'll see with exhibitor stands at Big Data LDN. These tools do the same things as Microsoft's, for people doing the same range of data jobs; they're just not made by Microsoft. Their makers therefore sponsor any and all data-themed conferences going that aren't Microsoft-affiliated.
Explained like this, it sounds like the data profession and community are a permanent "Europe vs Rest of The World" charity football match, which is ridiculous. But it's sort of true, and exemplified by two meetups in London I attend: London Business Analytics Group and London Analytics Group. Same names (almost), same format (100-150 people in central London on a weekday evening once a month getting free pizza and a 45-minute talk), same target audience (data analysts/engineers), same subject matter (data stuff). Same everything really, except the former leads on Microsoft analytics tools and approaches (Excel/Power BI/SQL/Fabric), and the latter covers analytics without ever mentioning the M-word. As such, there's zero actual crossover between the people who attend each.
Looping back eventually to the title and first sentence, I make a point of going to both to get each one's zeitgeist of data industry developments. Predictably, the rise of AI and ChatGPT has been the most prominent talking point on both sides in 2023 and 2024 so far. The interesting bit is their respective second specialist subjects of the past 12 months.
The Microsoft-ers have been all about Fabric, which was announced in May 2023 and is essentially Power BI, Azure, SQL Server in the cloud, and ersatz Jupyter notebooks all in one toothpaste-green branded mothership. It's the technological development that's dominating the collective conversation, attention, and emotions (a mixture of excitement and resigned dread at all the new names to learn) of the Microsoft data community.
On the other side, the everything else-ers have been getting slowly excited about Data Contracts.
The brainchild of Andrew Jones (Principal Data Engineer at leading fintech GoCardless), Data Contracts are a data governance and quality management practice underpinned by a new way of thinking about who is responsible for managing data. He's written it up in his book Driving Data Quality with Data Contracts, published in June 2023: an accessible 180-page read that explains the concept and its implementation options well. It also has a useful potted history of warehouses/lakehouses/ETL/ELT, and a good explainer on change data capture.
The basic idea is that businesses aren't managing their data well enough for their analysts to go beyond descriptive statistics, because the time and cost of data engineering teams transforming all of it from source into a semantic layer is too restrictive. The way to solve this is to shift left: task the software developers who make data-generating apps with documenting and supporting that data for the benefit of the analysts and business stakeholders downstream of them. This is formalised by the data contract, a metadata document in YAML or JSON (or anything human- and machine-readable) that describes the data produced by a software app in terms of what fields it has (nullability, data type, anonymisation method if PII), where it is stored, how long it can be kept, how frequently it gets updated, and anything else useful.
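To make that concrete, here's a minimal sketch of what such a contract might look like in YAML. The dataset, keys, and field names are my own illustrative assumptions for a hypothetical payments app, not a schema prescribed by Jones's book:

```yaml
# Illustrative data contract for a hypothetical "payments" dataset.
# Keys and values are assumptions for illustration, not a prescribed schema.
dataset: payments
owner: payments-app-team@example.com
storage: warehouse.raw.payments
retention: 7 years
update_frequency: hourly
fields:
  - name: payment_id
    type: string
    nullable: false
  - name: customer_email
    type: string
    nullable: false
    pii: true
    anonymisation: sha256_hash
  - name: amount_pence
    type: integer
    nullable: false
  - name: created_at
    type: timestamp
    nullable: false
```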
The reason it's called a contract and not a simple document is that it sets the expectations of whoever uses the data for derivative business products, further analytics, or AI, and binds that software application to the continued creation of the same data as specified in the contract. So if a future release of the app causes it to generate data that's missing one of the columns included in the original contract, a warning is triggered that someone can actually act on before a weekly report crashes embarrassingly in front of the CEO.
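On the enforcement side, here's a rough sketch of the kind of check a pipeline could run against a contract like the one above, assuming it's been parsed with PyYAML. The helper function and warning wording are hypothetical, not any specific data contract tool's API:

```python
# Sketch of contract enforcement: compare the columns an app actually
# produced against what its contract promises, and warn on anything missing.
# Hypothetical helper, not a real data-contract tool's API. Requires PyYAML.
import yaml

CONTRACT_YAML = """
dataset: payments
fields:
  - name: payment_id
  - name: customer_email
  - name: amount_pence
  - name: created_at
"""

def missing_columns(contract: dict, produced: set) -> list:
    """Return the columns the contract promises but the latest load lacks."""
    promised = {field["name"] for field in contract["fields"]}
    return sorted(promised - produced)

contract = yaml.safe_load(CONTRACT_YAML)

# Simulate a new app release that quietly stopped emitting customer_email.
latest_load = {"payment_id", "amount_pence", "created_at"}

for col in missing_columns(contract, latest_load):
    # In a real setup this would fire an alert (Slack, on-call, etc.)
    # before the weekly report breaks in front of the CEO.
    print(f"Contract breach in '{contract['dataset']}': promised column '{col}' is missing.")
```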
Cynical and long inured to buzzwords, I was initially sceptical, on the lookout for flaws, and inclined to speculate that Jones might be another under-utilised perm-role data guy channelling professional satisfaction into a personal project (see the recent UCOVI blog post on free roles). A bit of me also resented the idea that data engineering teams and the pipelines they build are "bottlenecks" - a signal of the self-loathing and original sin that the data profession seems to indulge in like no other. (Dentists aren't referred to as "bottlenecks in the process of alleviating chronic toothache" because root canal work costs a bit and might take an afternoon out of your time. They're just referred to as dentists.)
But in a world where so much innovation in new technology and approaches is actually a rehash of things that have already come and gone, data contracts might genuinely be problem-solving and transformative. They kill two birds with one stone: documentation and data quality control. The contract describes and defines what the important data is at source for the benefit of the business, and keeps the source accountable for its continued, non-degraded production. They also sit nicely with the UCOVI principle that much of data work sits within IT, and that tech teams should expand to take on the management of, and first steps of analytics on, the data they generate as a byproduct.
The possibility that data contracts might just work is borne out by two things. Firstly, Andrew Jones has actually implemented them at GoCardless, where he's been for nearly 7 years. Secondly, my own experience in data roles has shown me that poor documentation and flimsy data governance tend to be joined at the hip. I've seen sudden changes in the content of a source data column - which only the app developer knew was about to happen, and which they had no platform to communicate to others - flow unchecked into onward analytical datasets, only to be discovered through failed warehouse loads and wrong reports. Data contracts unify the document and the control into one system that machines can automate from and humans can refer to, which would seem to be exactly the solution to this.
Some would question the core assumption underpinning the viability of data contracts: that app developers will be willing to take on accountability for their apps' data byproduct, and responsibility for its documentation and analytical ease of use, as part of their core roles with no fuss. I disagree; developers (or good ones anyway) gain job satisfaction from using their coding and design skills to build products that have defined and lasting value. High-quality, usable data emanating from what they have built is an extension of that value.
The major room for improvement in Andrew Jones's conception of data contracts – both in the theory and the technical blueprints in his book (JSON/YAML scripts for each dataset forming part of a consolidated repo to be used to enforce rules and fire warnings) – is in taking them beyond telemetry data from software and apps.
Yes, much of the data that has analytical or AI value for a given business does come from the logs generated by their software products.
But what about crap Salesforce data: the fault of sales and finance leaders who ask for superfluous new columns in Order tables that make no sense and/or allow shoddy record-keeping by their teams, aided and abetted by a CRM team offering no pushback and making no attempt to preserve data quality?
What about the historical customer data of a business that's been bought in a merger or corporate acquisition, where the data is immensely valuable to the buying organisation, but only if understood properly?
These are both common causes of acute pain for analytics teams, and roadblocks to business understanding and insight that should be right in the firing line of an effective system of data contracts. But the people responsible for these situations are not the kind you'd expect to know what a YAML file is, and for data contracts to help them, contracts will need to be more approachable, less developer-y, and built to anticipate a wider range of problems.
Even so, I'm more convinced by the durability and game changing potential of Data Contracts than I am of Fabric OneLake.