Reading:
Create & Explore Tableau Data Lineage with Tree Schema
Share:

Create & Explore Tableau Data Lineage with Tree Schema

Avatar
by Grant
November 29, 2020
Tree Schema Lineage Header

Exploring Data Lineage with the Tableau Metadata API

I love data. The ways it can be used to curate value and express relationships never ceases to amaze me. To this extent, visualizing data is often one of the most powerful ways to share insights and Tableau certainly is one of - if not the - most popular data visualization tools on the market. It's extremely simple for non-technical users to develop rich and meaningful graphs with a pretty intuitive UI and there are some really nice features under the hood that are used to speed up query performance when extracts are stored within Tableau.

My absolute favorite Tableau feature is that you can query your metadata using the same GraphQL API that Tableau itself uses. A portion of the metadata exposed includes the lineage for the fields, sheets, tables and data stores that exist within your Tableau Site. Exposing the metadata via an extensive API like this is a really forward thinking idea from the team behind Tableau. We have taken this API and integrated it within Tree Schema, allowing users to point Tree Schema towards their Tableau Site where Tree Schema will pull all of the metadata and data lineage from Tableau.

Here is a simple example of the data lineage extracted for a single column of a database in Tableau:

Tableau Data Lineage Visualized

Having the ability to quickly check where your data comes from and how it is being used is imperative. Consider a slightly more complex example, where a database in Tableau has six similar "date" fields in a sheet. How would an analyst know which one to use in their chart?

Data lineage solves this problem. The analyst can backtrack the field to it's original source, reviewing metadata along the way. This feature becomes even more valuable when connecting your Tableau data assets to the rest of your data lineage within Tree Schema. The same data assets can be tracked all the way back to their source in the database where they were originally created.

Tableau Full Data Lineage Visualized

How to Use the Tableau Metadata API

The Tableau Metadata API is exposed via GraphQL and Tableau is wrapped in a python library, the Tableau Server Client. This library is one of the easiest APIs to use - Tableau has simplified all authentication and serialization to allow users to just focus on the query they want to execute.

Pros:

  • The graph enables many different entities and data assets within Tableau to be queried
  • The API performance is really good, even when requesting a large number of multidimensional relationships
  • The Python client is extremely simple and intuitive to use, handling the authentication and serialization for the user

Cons:

  • The documentation is sparse - it's not clear when to expect upstream or downstream data lineage assets to be provided or when they will be null
  • A "full" lineage for each data asset is not available, you can only extract lineage from one step upstream or one step downstream (at least from what I could tell from using the API)
  • Tableau releases a new API version every quarter or so but the docs do not depict which features are available in which version

Let's look at some code that can be used to query your Tableau metadata:

Authentication

You can use the Tableau API by authenticating with your username and password but the more secure and suggested approach is to use a client token. I've also created a simple helper function below to authenticate and execute queries.

            
  import os
  import tableauserverclient as TSC
  
  TOKEN_NAME = os.environ.get('TOKEN_NAME' ,'some-token')
  TOKEN = os.environ.get('TOKEN', 'your-token-value')
  SITE_NAME = os.environ.get('SITE_NAME', 'your-site')
  
  # If using Tableau Online this might be 'https://prod-useast-b.online.tableau.com'
  SERVER = os.environ.get('SERVER', 'your-server')
  SERVER_VERSION = os.environ.get('SERVER_VERSION', '3.9')
  
  tableau_auth = TSC.PersonalAccessTokenAuth(TOKEN_NAME, TOKEN, SITE_NAME)
  server = TSC.Server(SERVER)
  server.version = SERVER_VERSION
  
  # Helper function to run queries
  def run_query(query):
      with server.auth.sign_in(tableau_auth):
          resp = server.metadata.query(query)
          resp = resp['data']
          if isinstance(resp, list):
              resp = resp[0]
          return resp  
            
          
Define the Query

The Tableau Metadata API is a fantastic way to start learning GraphQL since Tableau handles all of the serialization for you and their Graph follows a consistent and easy to understand set of conventions.

The function below executes a query that will return all calculated fields that exist within your Site. The beautiful thing here with GraphQL is that we can simultaneously ask Tableau to return all of the fields that reference each of the calculated fields and we can go even deeper to request all of the sheets for each field that is referencing a calculated field.

            
  def get_all_calculated_fields(batch_size=100):
      all_calculated_fields = []
      has_next = True
      start = 0
      while has_next is True:
          query = """
          {
              calculatedFieldsConnection (first: %s, offset: %s){
                  nodes {
                      id
                      name
                      formula
                      referencedByFields {
                          fields {
                            id
                            name
                            sheets {
                              id
                              name
                            }
                          }
                      }
                  }
                  pageInfo {
                      hasNextPage
                      endCursor
                  }
              }
          }
          """ % (batch_size, start)
          resp = run_query(query)
          all_calculated_fields.extend(resp['calculatedFieldsConnection']['nodes'])
          start = start + batch_size
          if resp['calculatedFieldsConnection']['pageInfo']['hasNextPage'] == False:
              has_next = False

      return all_calculated_fields
        
      
Create Your Data Lineage

Now that you have your metadata from Tableau, how you structure and use the output is completely up to you. Tree Schema defines database column-to-sheet field mappings but you could easily define your own data structure given the flexibility of the Tableau API.

The example below defines the nodes and edges for data lineage:

        
  def format_nodes_and_edges(calc_fields):
      nodes = []
      edges = []
      for calc in calc_fields:
          # Add each calculated field to the nodes
          calc_field_name = 'CalcField - ' + calc['name']
          nodes.append(calc_field_name)

          # For each field that references the calculated field, add a node
          for ref_field in calc['referencedByFields']:
              for field in ref_field['fields']:
                  # Calculated fields may show up under referenced fields, if that 
                  # happens, do not overwrite the existing node
                  if field['id'] not in nodes:
                      field_name = 'Field - ' + field['name']
                      edges.append((calc_field_name, field_name))

                      # Create a reference to each sheet that uses this field
                      for sheet in field['sheets']:
                          sheet_name = 'Sheet - ' + sheet['name']
                          nodes.append(sheet_name)
                          edges.append((field_name, sheet_name))
      return list(set(nodes)), edges

  calculated_fields = get_all_calculated_fields()
  nodes, edges = format_nodes_and_edges(calculated_fields)
          
        

We now have our nodes and edges that can be used in just about any network graphing library:

          
  nodes 
  # [
  #   ...
  #   'CalcField - Click-to-Open',
  #   'Sheet - Sheet 5',
  #   'CalcField - Minutes of Delay per Flight',
  #   'Sheet - Opportunities ',
  #   ...
  # ]

  edges 
  # [
  #   ...
  #   ('CalcField - Difference from Region', 'Field - State'),
  #   ('Field - State', 'Sheet - Obesity Scatter Plot'),
  #   ('Field - State', 'Sheet - Obesity Map'),
  #   ...
  # ]
            
          

Closing Thoughts

I applaud Tableau for enabling this form of data access even though I believe this is a highly underutilized and underleveraged benefit. Many companies do not fully leverage this metadata from Tableau to the full extent. Understanding how data moves and dependencies between data is such a critical feature especially as organizations try to maintain well-managed practices and controls around how their data is used. As you look to leverage Tableau metadata and data lineage within your company, make sure that you're taking the extra step to connect that data lineage with the upstream processes to give a complete and comprehensive perspective of your lineage.


Share this article:

Like this article? Get great articles direct to your inbox