Ned Stratton: 10th May 2023
Books about data
In the picture above you can see my data library. It opened in 2016, has cost £250 to date, and totals 5,294 pages, of which I've actually read 3,033.
I've long wanted to do a comprehensive review of the most important books to read for current and prospective data analysts or data-interested people, and this is what will follow. Patient fans of the UCOVI Blog (a hopeful plural there) who read to the end of this will get a full rundown of every book pictured above, covering why I bought and read them (or just bought them), what they cover, and who should read them (if anyone). But first, let me answer my own question.
There are many ways to define the role of a data analyst - I have my own - and there are several routes to acquiring the skills that prop this up. The most obvious are via studying maths, statistics and/or computer science at school or university, and by work experience. Super-keenos can and do supplement this by attending data meetups where they can expect to pick up titbits of info and coding wisdom from speakers, and by doing video courses online.
The technical and practical sides of data are hard to learn passively, so the mainstay of my development has been through my work experience and personal projects, supplemented by weekends and days off spent following video courses on Youtube and Udemy, the success or failure of which depends on actually doing the exercises between the video content.
Given all this, and the fact that we forget 90% of what we read two weeks later, book wormery has a tough fight for a seat at the table of a data person's priorities. These are the benefits I'd point to when making the case for it, in order of serious and obvious to unconsidered and slightly trivial:
- Grasp of statistical theory and jargon: Statistical knowledge is something that producers and consumers of numbers need in breadth more than in depth. Nobody's expected to calculate the standard deviation of a set of survey results and present the answer neatly in an exercise book. But it's helpful to know that it measures how much noise and variability there is in data. And to know when the median average might be more appropriate than the mean, or whether the results collected are sufficiently representative and numerous to derive insights with any confidence, as well as a load of other stuff. The fastest and most reliable way to accumulate this knowledge is from synoptic books about this subject from experts, of which there are many (at least two in my review list) that have stood the test of time, given that this field is slower moving than technologies and programming languages.
- Discovery of new perspectives and important intellectual habits: The best books about data that can't be turned into video courses are the ones that ask whether we should build data science models to decide who is eligible for parole, not whether we can, and those that expand on and provide examples for the ideal, universal traits of well-produced (and well-interpreted) data analysis, which include curiosity, attention-to-detail, creativity and lateral thinking.
- Intellectual vanity and the glow of genius: Reading books allows us to stand on the shoulders of our history's great thinkers. Or failing that, sit in the meeting rooms of our prospective employers and pretentiously slip Hans Rosling quotes into answers about our problem-solving aptitudes. All for the facile pseudo-intellectual delectation of the HR manager and Head of Marketing Insights listening to us and drooling over the cultured erudition and wisdom that complements our Silicon Valley tech-ninjery.
- Giving your screen-battered eyes a rest: Showing my age a bit here, but who else remembers as children being told by their parents that too much TV would give you square eyes? That white lie didn't age well as we entered the era of iPhones, laptops and unbroken screentime. But just in case it's true, reading books for information or entertainment is a way to give your eyes a rest if nothing else, so embrace it.
For data professionals, their managers, users of their services in business, or anyone just looking to think a bit more collectively and rationally about the world or self-wean off TikTok before their brains turn to mush, these are the reasons to read books about data and the selection criteria for which data books to read. For anything else – coding skills, data management best practices etc – look to project work and video courses.
Let's see how my data book library stacks up against these success metrics, in the order they appear in the photo.
1. The Art of Statistics – Learning from Data (David Spiegelhalter, 2019 – 380 pages, £11) Top of my pile, top of the batting line up for success metrics one (grasping statistics) and three (parroting clever people). This is a must read, even if the 380-page mathsathon leaves your info-bulging noggin weighing you down like a top-heavy fraction. Cambridge academic and President of the Royal Statistical Society David Spiegelhalter achieves depth, breadth and readability with The Art of Statistics, covering pretty much everything you need to know if you're or data-interested layman or hands-on data analyst who could do with a stats refresher. The best bits are Chapter 4 (a correlation vs causation checklist for data decision makers) and Spiegelhalter's comparison of Bayesian statistics with methods around hypothesis testing and P-values. Read this for fluency, trying not to worry about grasping everything first time round. The only difficulty is long chapters with titillating as opposed to summative sub-headings, meaning they need to be read in single sittings. Don't try this on your morning commute.
2. Statistics without Tears (Derek Rowntree, 1981 – 193 pages, £10) Like Spiegelhalter's book, but half the length and about a third the breadth of subject matter, focussing in on sampling, descriptive statistics, confidence intervals and correlation vs causation. However, Rowntree's narrative structure and ability to explain concepts with followable examples are excellent, and reading this is after Spiegelhalter would be my advice to solidify concepts and get comfortable.
3. How to Make the World Add Up (Tim Harford, 2020 – 296 pages, £11) This is the best example I've encountered for item two in my data book success metrics (intellectual habits and perspectives). It's structured around a 10-point checklist for properly interpreting and scrutinising any data or stats thrust upon you in a corporate report, scientific study or sensationalised news article. It talks eminent sense about contemporary concerns such as the ethics of algorithms and misinformation, and touches lightly on a few statistical concepts such as sampling and confidence intervals/margin of error, which makes for a good warm up for The Art of Statistics.
4. The Signal and The Noise (Nate Silver, 2012 – 454 pages, 2012) Nate Silver's book is a bit of a chef d'oeuvre in the genre of statistical writing and polemics around data, and its strength is the way he centres the chapters around important data-related stories and domains such as climate change, flu pandemics, baseball, and the 2008 financial crisis. This brings statistical concepts and predictive methods to life and emphasizes how costly making the wrong assumptions from data can be. The book is also a brilliant explainer of and advocate for Bayesian statistics, and covers logarithmic charting and common-sense approaches to interpreting statistics well too. The two slight downsides it has are again long, breathless chapters, and age – the 11 years since the book was published feels like a century ago and Silver could rewrite this three times over with news events since then.
5. Invisible Women – Exploring Data Bias in a World Designed for Men (Caroline Criado Perez, 2020 – 432 pages, £11) This is in the list as a perspective-broadening read and testament to how opinions can be changed by compelling writing that uses statistics and data in a powerful way. Not a data book per se in that it doesn't tell you how to analyse data better, but it's probably the most important, convincing, and quantitative study on gender inequality and feminism going. A must read for business leaders, policy makers, and any data analyst having one of those days where they feel like code monkeys pulling stats that are meaningless and needing a reminder of the importance of their work.
6. Small Data (Martin Lindstrom, 2016 – 251 pages, £11) This is a book about a market research guy who pulls branding solutions out of his backside for businesses and their products around the world by talking to their target audience and watching weird things about them such as how the stack their cups in cupboards after unloading the dishwasher. And it's fascinating. It makes my list as of books relevant to data analysts as a reminder that, when looking for the story in the data or working on an investigative project, anecdotal evidence can be just as valuable as overall trends in big tabular datasets, as can talking to people. Martin Lindstrom takes this to the nth degree - not bad at all for book I bought in Gatwick airport WH Smiths on a two for one.
7. Weapons of Math Destruction (Cathy O'Neil, 2016 – 218 pages, £10) I read this on recommendation from a speaker at SQLBits 2022, and it's often talked about reverentially as an important polemic on the dangers of AI, as well as being quoted from by other authors at the top this list. Cathy O'Neil talks about the dangers and injustices of (and (to a lesser extent the remedies to) biased, black box, and often just shoddy data science algorithms when unleashed as tools of corporate and public-policy decision making. She chooses examples of this across finance, criminal justice and education policy which are correct and thorough in their explanations of the faulty algorithms and sad to read, but you could probably get the content and theory of what O'Neil covers here from chapter seven of How To Make The World Add Up by Tim Harford, so I'd counsel glossing over this one and practicing your SQL instead. The Nate Silver syndrome is also in play here; seven years is a long time in AI.
8. Bad Data (Georgina Sturge, 2022 – 238 pages, £15) This was on my 2022 Christmas holiday reading list as a book about data challenges faced by the UK government, written by a statistician at the House of Commons. In the wake of the 2020 revelation that COVID stats were being kept by the government literally in Excel 97 spreadsheets, I was expecting similar juice and hoping for something like a cross between Nate Silver and The Thick Of It. Overall it lived up to this, highlighting the patchy data available to government decision makers, the problem of policy priorities moving on before data is gathered and analysed, and media misrepresentation. In terms of utility, for aspiring data analysts it touches lightly on statistics must-knows such as margin of error, sampling bias and the severity of its repercussions, and the threshold for establishing causation. For business stakeholders wishing to be data driven it's a good reinforcer of "rubbish in, rubbish out" and should serve as motivation to invest in data collection and good survey/form design, which is often overlooked.
9. Between The Spreadsheets – Classifying and Fixing Dirty Data (Susan Walsh, 2021 – 153 pages, £32.55) I bought this book after seeing its author Susan Walsh – The Classification Guru and British Data Awards 2023 finalist – give a standing-room only talk at BigDataLDN in 2021 on the importance of data quality and accurate industry and product classification in finance data. Her vivacious character and love of clothing analogies shines through in this book. On the face of it, this is a lesson on Excel VLOOKUPs and Pivot Tables in print format but it makes for a quick, fun read and a practical, scalable and future-proofed framework for maintaining a business taxonomy and classification system that can be implemented in SQL, Python and other higher-spec data tools. Some might point out that £32.55 is a punchy RRP for this book given that you could get three lots of The Art of Statistics for it, but I say more fool David Spiegelhalter's publishers for leaving cash on the table.
10. Be Data Literate (Jordan Morrow, 2021 – 214 pages, £17) Jordan Morrow was on my radar from his YouTube footprint of sound public talks on the topic of data literacy, so I made his book my 2022 summer holiday reading to try and get to the nub of this growing but vague paradigm. This was a frustrating read and a missed opportunity. Through the fog of some clumsy and long-winded prose I was just about able to detect that the intended audience were data analysts themselves. (A strange choice - if any group in business or wider society is already data literate and not needing lessons, it's data analysts). The book said little to business leaders on how to foster data culture, obtain the best from data teams, or measure the level of data literacy in their own companies.
11. The Data Garden and Other Allegories (Paul Daniel Jones, 2020 – 76 pages, £13) In the picture this is the thin slither of green between Jordan Morrow and SQL Cookbook. Bought and read as research to put flesh on the bones of my blog piece last year about data analogies, the book is six short moral fables of fantasy "data organisations". They include a driving school, public gardens, and a hospital all as allegories for what a company needs to accomplish to have good data and awareness of data; good documentation, a well-balanced team of analysts, engineers and quality people, and of course investment. A good use of a few lunch breaks for penny pinching CEOs wondering why there are three different versions of this month's sales figures doing the rounds.
12. SQL Cookbook (Anthony Molinaro, 2012 – 590 pages, £27) As we get further down the pile we go further back in time to my early days in 2016. At this point I'd reached the limits of pure Excel, and keen to pop my SQL cherry I marched off to Blackwell's Tottenham Court Road branch and bought myself this as a textbook to self-teach with. I knew no better and wanted to emulate other developer types who had programming language reference books next to their desks. It does what it says on the tin, treating database query problems as meals alongside annotated SQL queries as the cooking steps. It covers plenty, offers solutions in Oracle, MySQL and IBM DB2 as well as SQL Server, and has a useful technique of "walking the string" to achieve word frequency analysis in a SQL query (difficult), but its big drawback is black and white print. The queries Molinaro gives are full of complex nested joins, and without colour coding to distinguish the keywords and function names from the database objects and aliases this book is difficult to use. Stick to Youtube or Stack Overflow for SQL.
13. Excel 2013 – Power Programming with VBA (John Walkenbach, 2013 – 1018 pages, £22) I bought this on the same shopping trip to Blackwell's as SQL Cookbook (doing my back in carrying them both home). I rarely use VBA now, but I do use this reference book on VBA for changing lightbulbs.
14. Advanced Level Statistics (J. Crawshaw and J. Chambers, 2015 – 653 pages, £38) Skip forward two years later to 2018, and I'd moved on from SQL fanboy to "insecure humanities grad hobby coder", having been rejected by a recruiter for lack of a statistics or STEM background. Nothing wrong with the content and layout of this book in the context of its intended purpose (a revision guide for those doing A-level statistics), but my under-use of it serves as a learning curve about what depth of stats knowledge a data analyst needs with access to modern computers and Python.
15. How to Lie with Statistics (Darrell Huff, 1954 – 128 pages, £9) Not in the picture because I'd taken it before I felt duty bound to read this, given that Darrell Huff had been name-dropped in David Spiegelhalter and Tim Harford's books as the original polemicist on "lies, damned, less and statistics". It makes for a slightly tedious read 70 years after publication; Huff's main whipping boys in critiquing statistical malpractice are advertisers and their dubious product satisfaction survey claims, which is the least of our problems in 2023. This is especially when compared to the more recent and substantial literature on statistics from authors further up in my list. But mercifully short at 128 pages.