Stop Blaming Your Data Quality - Start Understanding Your Data Users
If you're tired of hearing 'data quality issues' blamed for project delays and incorrect reports, this article will help you understand what actually makes data 'high quality' and how to get more of it.
Much handwringing and eye-rolling occurs every day as teams debate the usefulness of data. Are these statements familiar?
“I don’t trust that data. It is low quality.”
“The report was wrong because of data quality issues.”
“We spent most of our time this sprint fixing the quality of the data.”
There can be real-world consequences when data expectations and reality don’t match. Consider metrics, which are usually created to quantify something valuable. Metric data might be factually ‘correct’ yet its intended meaning can still be missed (intentionally or otherwise), as when Facebook miscalculated its video view metrics in 2016, resulting in the need for an embarrassing apology.
More dangerous outcomes are possible as well, like when the Mars Climate Orbiter was lost because one engineering team supplied measurements in imperial units while the other expected metric.
Nobody says “please give me low quality data”, except perhaps the most masochistic or flamboyant data engineers.
But what is high quality data?
If you are reading this, you’ve probably heard the oft-stated axiom, “garbage in, garbage out.”
This is fine, but why must data be reduced to a binary judgement of garbage, or not? This is an over-simplification. After all…
“One man’s trash is another man’s treasure.”
Wikipedia outlines three dimensions for understanding data quality: the consumer perspective, the business perspective, and the standards-based perspective.
If you look closely, these are all versions of the same thing: the consumer perspective. The relevant question to answer is ‘who is the consumer and what are they expecting?’ It’s often an ‘aha moment’ when people start treating all users of data as consumers, including business teams and standards organisations.
Data quality can only be usefully understood through the lens of real consumers. This is why data quality is rarely absolute, more often relative. Further, these consumers should be understood as individual humans, rather than an amorphous group (side note - applications/computers can also be treated as consumers but this is a topic for another day).
People change their opinions and motivations about what they are trying to achieve, and new people are continually introduced to existing data. It follows that data will always be ‘low quality’ for emerging use cases where consumer need and data attributes have not yet been aligned. Or, put another way…
“You can please some of the people all of the time, you can please all of the people some of the time, but you can’t please all of the people all of the time.”
High quality data in one context may be low quality in another. While this seems obvious, it’s important to remember that enterprise data is created by, and used by, multiple sets of consumers with differing perspectives. So, my working definition for high quality enterprise data is:
“High quality enterprise data is understood by humans and can be made available to a variety of use cases with confidence and credibility.
As the variety and value of the use cases increase, it can be said that the quality of the data has increased too.”
Why does the quality of my data erode over time?
Data starts out being useful as it is created and used in its original context. This is often an application of some sort. Then, there are two main paths which lead to data eventually being perceived as low quality.
The path of good intentions.
Data which has been created for a specific reason is accessed by a different set of users and enriched for use in a tangential business process. Then, someone wants to draw insight from the transactional data, and so reshapes it into a format suitable for analytics.
At each step along the way, new value has been generated. Adding fields, especially calculated fields, is often called ‘enrichment’. However, this comes at a cost of increased complication, and users of the more complicated data may become confused and frustrated. Unfortunately, the people most interested in more complicated data are usually the ones making the most impactful decisions, and these are not the folks we want to upset.
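To make that trade-off concrete, here is a minimal sketch in Python with pandas. The table, column names, and the threshold are invented purely for illustration:

```python
import pandas as pd

# Hypothetical transactional data, created for one purpose.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_total": [120.0, 80.0, 250.0],
})

# 'Enrichment': a calculated field that quietly embeds a business rule.
# Useful for one consumer, but the next consumer has to rediscover
# what 'high value' means here, and why the threshold is 100.
orders["is_high_value"] = orders["order_total"] > 100
```

The field itself is trivial; the cost is the undocumented rule it carries forward to every later consumer.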
This process of enrichment to complication is almost ubiquitous in organisations across the world, and usually happens without much reflection on the impacts.
The path of impossible expectations.
This path is easier to explain and will be familiar to most teams who manipulate data for others.
Consider a scenario where executives want to create a recommendation algorithm using some newly acquired customer data. By when? ASAP, of course.
Avoiding this path can be challenging and involves expectation management, education, trust building, and influence.
How can I increase the quality of my data?
You can see from the above that both paths would be mitigated by greater understanding (from the consumer perspective) of the data itself, particularly its lineage. Deep, specific knowledge about data, inclusive of any omissions, quirks, deficiencies and subtleties, will help build a strong bridge between the data you have and the use case. On this foundation, future data cleanup or quality improvement exercises can be compared and celebrated.
Some of the most important activities you can do to increase data quality are:
Talk to data customers and understand their perspectives.
Inspect your data and describe it in ways which make sense to those customers (a minimal sketch follows this list).
Maintain documentation as communication. It should feel like journalism, or marketing. No boring, unread documentation, please.
Simplify. Too many cooks spoil the dish.
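As a starting point for the ‘inspect and describe’ activity, here is a minimal sketch in Python with pandas. The file name and the specific checks are illustrative assumptions, not a standard recipe:

```python
import pandas as pd

# Hypothetical dataset; substitute your own source.
df = pd.read_csv("customers.csv")

# Surface the quirks a consumer would otherwise trip over.
print(df.dtypes)                   # are the types what consumers expect?
print(df.isna().sum())             # omissions: where are the gaps?
print(df.duplicated().sum())       # hidden duplicates
print(df.describe(include="all"))  # ranges, outliers, cardinality
```

The point is not the tooling; it is that what you find gets written up in the consumer’s language.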
Of course, all these activities take time and effort, so you must consider your scope. Some positions you can adopt include:
Screw everybody else. This is the default scope for a single use case. If you find yourself in this situation, try to leave some good breadcrumbs for those who follow to ‘enrich’ or make use of the data for other purposes. It might be future you.
Help everybody else. Also known as “cataloguing all our data” or “building a data warehouse for everyone” or similar. Be prepared for shadow data teams and a slow, inexorable decline in investment dollars.
Of course, a pragmatic middle ground is best, but how do you walk this tightrope?
In recent years, there has been great movement in the industry around making data available as ‘products’. The exact definition of a data product can be debated, but the important thing to apply is a customer/consumer orientation: giving someone something they can use.
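What a data product looks like in practice varies, and there is no single standard. As one illustration (every name and field below is invented), a product descriptor might make the consumer orientation explicit:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataProduct:
    name: str
    owner: str                # a human the consumer can actually talk to
    description: str          # written for the consumer, not the producer
    known_quirks: List[str] = field(default_factory=list)

orders_daily = DataProduct(
    name="orders_daily",
    owner="jane.doe@example.com",
    description="One row per order, refreshed daily at 02:00 UTC.",
    known_quirks=["order_total excludes tax for orders before 2021"],
)
```

Whatever the format, the test is the same: could a new consumer pick this up and use it with confidence?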
Similarly, paradigms such as data mesh and data fabric are worth exploring as they articulate ways to build up appropriately complicated data systems incrementally and with parallel efforts.
Finally, adopting a rigorous discipline around Continuous Integration and Continuous Delivery will pay off. Even basic practices like version control for your data efforts will help you build credibility while making it easier to turn back the clock when you inevitably take a step in the wrong direction.
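As an illustration of what that discipline can look like day to day, here is a minimal sketch of a pytest-style data check. The file, columns, and rules are hypothetical; the point is that the check lives in version control and runs on every change:

```python
import pandas as pd

def test_orders_are_credible():
    # Hypothetical dataset and rules; replace with your own expectations.
    orders = pd.read_csv("orders.csv")
    assert orders["order_id"].is_unique, "duplicate order ids"
    assert (orders["order_total"] >= 0).all(), "negative order totals"
```

When a check like this fails after a change, version control gives you a clock you can actually turn back.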