At its core, data lineage allows you to understand how data flows through your ecosystem and to answer three questions: where did the data come from, where does it go, and what happens to it as it moves between sources?
Data lineage is the process of describing what data exists, where it is stored and how it flows between systems. Data lineage is metadata; that is, it is data that describes the actual data. There are many reasons why data lineage is important, but at a high level they boil down to three categories, which we will explore here:
The most common use case for data lineage is to give users a self-service way to understand where data comes from. Data scientists, analysts and other data users who query data often ask questions such as:
- What upstream process populates this table?
- How is the data transformed?
- What time is this schema made available for consumption?
These are all critically important questions when analyzing data or building operational processes. Data lineage helps to answer them by providing structure to the metadata that depicts how the data moves from its original source location to a target destination.
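As a minimal sketch of what that structure can look like, lineage can be recorded as a list of source-to-target steps and walked backwards from a table to see what populates it and how the data is transformed along the way. The dataset names, the transformation descriptions and the `upstream_of` helper are all hypothetical; they are not taken from any particular lineage tool.

```python
from collections import defaultdict

# Hypothetical lineage metadata: (source, target, how the data is transformed).
LINEAGE_EDGES = [
    ("orders_db.orders", "s3.raw_orders", "nightly full extract"),
    ("s3.raw_orders", "warehouse.orders", "deduplicate and cast types"),
    ("warehouse.orders", "warehouse.daily_order_totals", "aggregate by day"),
    ("crm_db.accounts", "s3.raw_accounts", "nightly full extract"),
]


def upstream_of(target, edges):
    """Return every (source, target, transformation) step that feeds `target`."""
    # Index the edges by their target so we can walk the flow backwards.
    incoming = defaultdict(list)
    for source, dest, how in edges:
        incoming[dest].append((source, dest, how))

    steps, seen, stack = [], {target}, [target]
    while stack:
        node = stack.pop()
        for step in incoming[node]:
            steps.append(step)
            source = step[0]
            if source not in seen:  # guard against cycles and repeats
                seen.add(source)
                stack.append(source)
    return steps


for source, dest, how in upstream_of("warehouse.daily_order_totals", LINEAGE_EDGES):
    print(f"{source} -> {dest}: {how}")
```

The steps come back nearest-first, so the first line printed is the process that directly populates the table being asked about.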
However, this example is quite basic as it only considers one side of the data flow: where the data comes from. Both sides of the data flow, where the data comes from and where it goes, must be given equal visibility in order to effectively use data lineage.
Let's consider a real-world example, something that you have likely seen at your company as well. In this example, an analyst is asked to create a dashboard that shows how well a marketing campaign is working. The analyst's company does not currently have a data lineage tool but, after talking to other teammates, the analyst is able to sketch how they perceive the data flow for this process:
The analyst learned that there are two sources of data, a marketing database and a customer database, and that both of these are extracted into S3 each night. The analyst now plans to query these two locations in S3 and to save a data cube in Redshift which the dashboard will use to serve the data.
Unfortunately, our analyst didn't know who they should talk to in order to understand the data. The developer who built the data extract to S3 also happened to build extracts for some tables from S3 into Redshift, including the ones that the analyst needs, and the current data flow actually looks like this:
If the analyst had known about this extract, they might have decided to build their dashboard directly on top of the two tables in Redshift or to create their data cube as a view over the underlying data. By creating their own data extract into Redshift they were duplicating data, creating more code to maintain and further increasing the challenges that their team has with understanding what data flows between systems.
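The gap between the flow the analyst perceived and the flow that actually exists can be sketched as a simple comparison of two sets of edges. The dataset names below are hypothetical stand-ins for the systems in this example.

```python
# Edges are (source, target) pairs; all dataset names are hypothetical.
perceived_flow = {
    ("marketing_db", "s3.marketing"),
    ("customer_db", "s3.customer"),
}

# The actual flow also contains the existing extracts from S3 into Redshift.
actual_flow = perceived_flow | {
    ("s3.marketing", "redshift.marketing"),
    ("s3.customer", "redshift.customer"),
}

# Flows that exist but that the analyst did not know about.
unknown_to_analyst = sorted(actual_flow - perceived_flow)
for source, target in unknown_to_analyst:
    print(f"{source} -> {target}")
```

Without a shared lineage record, the second set exists only in the head of the developer who built the extracts.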
Exploring Data Lineage
Data lineage is a unique piece of metadata in that there is tremendous value in interacting with it dynamically.
Here, we will take a look at what our analyst would have learned about the data lineage for their process by replicating their data flow within Tree Schema and exploring the data lineage:
As seen here, one of the key benefits of data lineage is being able to uncover the unknown. As it turns out, in this scenario the conversion analytics table that our analyst was planning to build already exists in Redshift.
Significant progress has been made over the past few years to clearly define data privacy laws and standards for handling sensitive data. GDPR, in Europe, and CCPA, in California, are two of the most notable, but there are also industry-specific regulations. For instance, the banking industry is one of the most regulated, and additional compliance requirements come from international accords, such as Basel, and from self-imposed requirements such as the Payment Card Industry Data Security Standard.
While none of these directly states that companies must have data lineage, requests for information from these governing bodies - whether as part of a certification process or in response to a data breach - often come down to proving that you know where and how your data is stored and who can access it.
Data lineage is one of the tools that can enable companies to quickly respond to these types of requests. When used in conjunction with other cataloging techniques, such as tagging, data lineage gives you the ability to find sensitive data (e.g. credit card numbers, SSNs, customer addresses) and to identify all of the places in your ecosystem where it is saved and the external partners that you send that data to.
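A minimal sketch of combining tags with lineage is shown below: start from the datasets tagged as sensitive and follow the lineage downstream to every place that data lands. The dataset names, the tags and the `locations_of` helper are hypothetical, and the sketch assumes that a tag carries over to everything downstream, which may overstate the result if a transformation drops the sensitive fields.

```python
from collections import defaultdict

# Hypothetical catalog: tags attached to datasets, plus (source, target) lineage edges.
TAGS = {
    "customer_db.customers": {"pii", "address"},
    "payments_db.cards": {"pii", "credit_card"},
    "marketing_db.campaigns": set(),
}
LINEAGE_EDGES = [
    ("customer_db.customers", "s3.customers"),
    ("s3.customers", "redshift.customers"),
    ("redshift.customers", "partner_sftp.customer_export"),
    ("payments_db.cards", "s3.cards"),
    ("marketing_db.campaigns", "s3.campaigns"),
]


def locations_of(tag, tags, edges):
    """Return every dataset that holds, or receives, data carrying `tag`."""
    children = defaultdict(set)
    for source, dest in edges:
        children[source].add(dest)

    # Start from the datasets tagged directly, then follow the flow downstream.
    # Assumption: the tag applies to everything downstream of a tagged dataset.
    stack = [name for name, dataset_tags in tags.items() if tag in dataset_tags]
    found = set(stack)
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


print(sorted(locations_of("credit_card", TAGS, LINEAGE_EDGES)))
print(sorted(locations_of("address", TAGS, LINEAGE_EDGES)))
```

In this made-up catalog, the second query also surfaces the export to an external partner, which is exactly the kind of location that is easy to miss without lineage.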
One of the most underappreciated aspects of data lineage is how valuable it can be to so many different roles. Data lineage is often thought of as a tool for data users - that is, the data analysts, business analysts, data scientists, etc. - because they spend much of their day using the data and it helps them to understand how to leverage the data more efficiently. However, for the roles that create the data, such as developers, data engineers, DevOps and other tech resources, data lineage is just as important.
When making a change to any data collection or data transformation process, engineers should generally be asking questions such as:
- What downstream systems or processes might be impacted?
- Will this cause a breaking change for the consumers of this data?
- Who uses this data?
Without data lineage to help them answer these questions, developers will typically fall back to reading through code in Git or talking to other developers who may know the process, but neither of these is efficient or reliable.
Data lineage enables development teams to easily find the resources that they are impacting downstream and to identify the right subject matter expert as efficiently as possible. The value of data lineage can really be seen here: when a developer is able to research who is impacted by a data stream before making a change, they are able to communicate the change ahead of time. This in turn enables consumers to adapt proactively instead of reactively to changes in data.
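This kind of impact analysis can be sketched as a breadth-first walk downstream from the dataset being changed, pairing each impacted dataset with its owner. The dataset names, the owners and the `impacted_by` helper are hypothetical and shown only for illustration.

```python
from collections import deque

# Hypothetical (source, target) lineage edges and dataset owners.
LINEAGE_EDGES = [
    ("customer_db.customers", "s3.customers"),
    ("s3.customers", "redshift.customers"),
    ("redshift.customers", "redshift.conversion_analytics"),
    ("redshift.conversion_analytics", "dashboard.campaign_performance"),
]
OWNERS = {
    "s3.customers": "data-engineering",
    "redshift.customers": "data-engineering",
    "redshift.conversion_analytics": "analytics",
}


def impacted_by(changed, edges, owners):
    """List (dataset, owner) pairs downstream of `changed`, nearest first."""
    children = {}
    for source, dest in edges:
        children.setdefault(source, []).append(dest)

    seen = {changed}
    queue = deque([changed])
    impacted = []
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                # Datasets without a recorded owner are flagged as "unknown".
                impacted.append((child, owners.get(child, "unknown")))
                queue.append(child)
    return impacted


for dataset, owner in impacted_by("s3.customers", LINEAGE_EDGES, OWNERS):
    print(f"{dataset} (owner: {owner})")
```

An "unknown" owner in the output is itself useful: it points to a consumer that nobody has claimed and that would otherwise be surprised by the change.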
Data lineage is an important capability for all companies to have, as it facilitates both the use of data and the maintenance of the data pipelines that move data between schemas in your ecosystem. As more external regulations are added over time, they will continue to push forward the need for insight into where data is stored, which will further increase the value that data lineage provides. Companies can get out in front of these requirements by capturing their data lineage early, before the need arises, and they can increase the value that data lineage provides by using automated techniques to keep the data catalog up to date. The more time that your team saves by not having to research data, ask colleagues the same repetitive questions or develop one-off reports on where your data resides, the more value data lineage will bring.