Automating Data Lineage with Tree Schema & Looker

Automating Data Lineage with Tree Schema & Looker

by Grant
January 4, 2021
Looker & Tree Schema Integration

Tree Schema partners with Looker to bring automated data lineage to shared users.

Analytical and development workflows are highly reliant on understanding where data comes from and how it is used. Data lineage enables engineers and analysts to expedite their research, bringing data products to market faster by providing an end-to-end view of where the data comes from and how it is used in one location.

When an engineer changes a data pipeline they may ask “what are the downstream impacts to removing this field?”. Or an analyst may need to know, “which of the six date fields in this file is the right one to use for my dashboard?”. The answers to these questions traditionally require a search through Git or asking a colleague, both of which take valuable time from the developer and fail to guarantee a comprehensive answer.

Capturing data lineage with Tree Schema and Looker

Data lineage is notoriously difficult to capture and maintain. Let’s consider one example where an analyst wants to understand the lineage from their warehouse to their Looker content. There are three questions that the analyst needs to answer:

  1. How does data move from my warehouse to my Looker Explores?
  2. What Looker Explore fields are used for each of my dashboard elements?
  3. What Looker Explore fields are used for each of my Looks?

The standard method of documenting this data lineage is to painstakingly reconstruct every step within Excel or an internal Wiki that can be used to share insights. This task is generally too time consuming and cumbersome to yield a return on investment and breaks down the moment changes to the data lineage occur and the documentation becomes out of date. This causes companies to be reactive when curating their data lineage which, in turn, leads to incomplete documentation and broken dashboards.

Tree Schema and Looker augment the process of capturing data lineage by automatically extracting the lineage between your Looker content and Looker Explores as well as from your Looker Explores back to your data warehouse. As Tree Schema sits on the outside of your data ecosystem and reads existing metadata, this data lineage can be further connected to upstream data sources to give you a holistic view of your data.

Tree Schema and Looker Ecosystem
Looker & Tree Schema ecosystem

In the example defined above, if the analyst needs to document 20 dashboards, 15 Looks and 3 Explores they may easily spend several days tracing the data flow between all of their Looker content. This task may become even more time consuming when considering how many underlying tables exist in the warehouse that feeds into the Looker Explores.

Nearly all of this effort can be automated through the Tree Schema / Looker integration, allowing you to document your data lineage in minutes with just a few click.

Capturing data lineage with Tree Schema and Looker

Looker provides a comprehensive API that, among other things, enables metadata to be extracted from Looker. One of the more powerful pieces of metadata exposed via the API is the underlying LookML code for each of the fields in your Explores. As seen here, Tree Schema reads these field-level attributes and maps them back to their source location, which is also provided by the Looker API.

  explore_ml = sdk.lookml_model_explore(
      lookml_model_name='tree_schema_model', explore_name='items'
  field = explore_ml.fields.measures[4]
  print('Field name: %s' %
  print('Field SQL: %s' % field.sql)  
  # Field name: tpcds_sf100_tcl_web_sales.profit_cost
  # Field SQL: ${tpcds_sf100_tcl_item.i_wholesale_cost} - ${TABLE}."WS_NET_PROFIT"

The corresponding LookML model and Explore can be seen in the Looker model file:

Tree Schema and Looker Ecosystem

And the field-level SQL is in the associated view:

Tree Schema and Looker Ecosystem

Looker’s powerful substitution operator ($) provides the context for the source location of each field. Using this operator, Tree Schema is able to traverse the Explores within each Look model to build a data flow that maps your lineage back to the database where the underlying data resides.

Similarly, the API provides metadata about the source view and fields for all of the content in Looker. Looking at this graph there are three fields used to create this dashboard element: created date, average price, and total sales price.

Tree Schema and Looker Ecosystem

Because Tree Schema uses the metadata from the Looker API it knows about the fields used within each Looker dashboard and explore.

Tree Schema Looker Dashboard Element Metadata

This enables data lineage to be created which connects the content to the source Explore.

Tree Schema Looker Dashboard Element Metadata

Exploring Looker data lineage within Tree Schema

Once Looker metadata has been synced with Tree Schema data, users will be able to interact with the lineage. Engineers can quickly identify all of the content that will be impacted by potentially removing a field from a table, and can take proactive steps to make adjustments before receiving a call from an executive saying that the dashboard is broken.

Tree Schema and Looker Ecosystem
End to end data lineage

Your analysts can look at this data from the opposite direction. They will be able to review the complete upstream lineage and documentation for each of the six date fields in the source database to ensure that the dashboard they’re building is accurate.

Closing thoughts

By integrating your Looker instance with Tree Schema you can automatically extract and structure your metadata. Your teams will be able to research dependencies and deliver data products faster and with more confidence.

Share this article:

Like this article? Get great articles direct to your inbox