Tree Schema 101 - How to Build Your Tree Schema Data Catalog

by Asher
September 24, 2020
How To Tree Schema 101

Learn how to use Tree Schema to build an effective data catalog.

Overview

Tree Schema provides the essential data catalog capabilities required to build a strong data culture. In this tutorial we’ll walk through how to build your Tree Schema data catalog from scratch so that you can get up and running in no time!

The topics below are covered in order, but feel free to skip ahead at any time! This tutorial provides a comprehensive overview of how to populate your Tree Schema data catalog, and you can always find more details and step-by-step instructions in our help & documentation as well.

Add your data stores

The very first thing you will want to do in Tree Schema is add your data stores, since nearly all of your other data assets - tables, fields, sample values, etc. - reside within a data store. When you create a data store you have two options, which you will see shortly:

  1. Connect directly to your data store: this will enable Tree Schema to automatically populate your data on your behalf and is the suggested approach. By using this approach you can populate your entire Tree Schema catalog in under 5 minutes!
  2. Create the data store without the connection: there may be times when you cannot connect Tree Schema to your data store. Perhaps the data store is owned by one of your vendors and you do not have access, or maybe Tree Schema does not yet have an automated connector. Creating a data store without a connection gives you the ability to manually define your data assets within the data store.

When you first log into your Tree Schema account you will land on your organization dashboard. Navigate to the Data Stores page using the left-hand navigation bar.

Tree Schema Dashboard

Select the Create Data Store button in the top right corner.

Tree Schema Create New Data Store

The automated connectors that Tree Schema has will be displayed. Select the one that corresponds to your data store.

Tree Schema Data Store Connectors

As a quick aside - if there is a database or any other tool that you need but we don’t yet support, please let us know. We are always adding new connectors! In the meantime, we suggest using the Other type.

For this example, we will be using the Postgres connector. After you select your data store type you will need to enter a data store name and users for two roles:

  • Technical point of contact: this is primarily for informational purposes within Tree Schema. If your users need help connecting to the database, downloading drivers, or have general technical questions, the technical point of contact tells them who can help.
  • Data steward: data stewards are used within Tree Schema both for informational purposes, to help your users understand who they should contact when they have questions about using your data, and for assigning governance actions. Any governance action associated with a data asset is assigned to the corresponding data steward.

Tree Schema Postgres Connectors

After you select next you will be given the option to connect to your data store.

Connect directly to your data store

As mentioned above, connecting Tree Schema to your data store will save you tons of time as Tree Schema will do all of the heavy lifting to populate your data catalog on your behalf. Select “Yes, set up connection” to create the automated connection.

Tree Schema Postgres Connectors

Each data store has its own unique set of fields required for the connection. Fill in the fields for your data store and hit “Test connection” to see if Tree Schema can reach your data store! If your data sits behind a firewall or is generally not available via the public internet, you can set up a jump server to route your traffic through.

Tree Schema Data Store Connected

The final step when creating a connected data store is to determine which teams will have access to view the sample values within this data store. We will cover the specifics later on, in the sample values section, but the important thing to know for now is that a user in any of the teams granted full access to the data store will always be able to see the sample values within that data store. If a user is not in any of those teams, their ability to view specific sample values within the data store can be revoked.

Here we’ve given the default team full access to the data store.

Tree Schema Postgres Connectors

That’s it! When you hit submit you will now see your data store listed under the Data Stores page.

Tree Schema Data Store Created

Create the data store without a connection

To create a data store without a connection, follow the same steps as above, but when you are asked whether to set up the automated connection select “No, create it manually”.

Tree Schema Data Store Connected

This will complete the data store creation process and you will see your data store listed under your data stores view.

Tree Schema Data Store Connected

Add schemas to your data stores

A schema represents the shape and semantics of your data. If your data store is a SQL database then your schema may be represented as a table or, if your data store is a NoSQL database, your schema may be represented as JSON objects or Parquet files. The common denominator is that a schema sits within a data store and has one or more fields.
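If it helps to picture this hierarchy in code, here is a minimal Python sketch of a data store that contains schemas, which in turn contain fields. The class and attribute names (DataStore, Schema, Field, field_paths) are our own illustration and are not part of any Tree Schema API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    name: str
    data_type: str
    sample_values: List[str] = field(default_factory=list)


@dataclass
class Schema:
    name: str
    fields: List[Field] = field(default_factory=list)


@dataclass
class DataStore:
    name: str
    store_type: str
    schemas: List[Schema] = field(default_factory=list)

    def field_paths(self) -> List[str]:
        """Return every field as 'store.schema.field'."""
        return [
            f"{self.name}.{schema.name}.{fld.name}"
            for schema in self.schemas
            for fld in schema.fields
        ]


# Hypothetical example data: one Postgres data store with a single table.
users = Schema(
    "users",
    [Field("id", "integer"), Field("email", "string", ["a@example.com"])],
)
store = DataStore("app_db", "postgres", [users])
print(store.field_paths())
```

Whatever the underlying technology, the same three levels apply: a table, a JSON object and a Parquet file would each be a Schema in this sketch.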

There are three ways to create schemas within Tree Schema:

  1. Automated schema creation from your data store: This is the recommended approach and should be used when you have created a connected data store. With this approach Tree Schema will not only identify the schemas that exist within your data store but it will also populate the fields for each schema and sample values for each field.
  2. Automated schema inference from a file: You can automatically create a schema by providing a sample file. This is useful if you do not have a connected data store to represent your data.
  3. Manual schema definition: Just as the name implies, you can manually create or adjust your schemas.

To add a new schema you first need to navigate to the data store that the schema will reside within and then add the schema within that data store. Select the data store that you created in the steps above to navigate to the data store details.

Tree Schema Empty Data Store

In the details panel at the bottom, the schemas tab will be selected. There are two buttons here, Manage Existing Schemas and Add New Schemas. Both are fairly self-explanatory; for now we will select Add New Schemas.

This will bring up a modal at which point you can decide to create schemas directly from the data store or to create them manually - either with a file or by creating the entire schema yourself.

Tree Schema Data Store Add Schema Selection

Automated schema creation from your data store

To add schemas automatically, select the Automatically from Data Store button in the pop up. This will display a quick blurb that tells you that depending on the type of data store you are connecting to and the number of schemas that exist within your data store it may take up to a few minutes to retrieve all of the results. Hit next to confirm and to have Tree Schema capture your schemas.

Tree Schema Data Store Add Schema Blurb

When the results load you will have the ability to choose which schemas are saved and which schemas should have the fields added. By default Tree Schema selects all of your schemas to save and adds fields for each schema. You can exclude any specific schemas here that you would like.

Tree Schema Data Store Add Schema Schemas Found

When you are done hit submit and save your schemas! If you have several hundred or thousands of schemas within your data store it may take a few minutes for the results to fully load. Tree Schema will send you an email once all of your schemas have been added.

Tree Schema Data Store Add Schema Schemas Results

When you close the modal and refresh the page you will see your schemas in the details section at the bottom.

(don’t forget - the fields may not be available until you receive the completion email!)

Tree Schema Data Store Add Schema Results Listed

Automated schema inference from a file

If you have a file that represents your schema (maybe you created an extract from your database or a client sent you a sample file), you can upload it to Tree Schema to have it automatically create the schema based on the content of the file.

Navigate to add schemas again but this time select Manually / Sample File.

Tree Schema Data Store Add Schema Selection

You will need to enter the schema name as well as to assign the tech point of contact and the data steward. When you create schemas automatically from the data store Tree Schema applies the same steward and technical point of contact of the data store to all data assets created.

Tree Schema Data Store Add Schema Manual Definition

The final step in the process is to upload a file. You will see the empty schema definition view with a button at the top to upload a sample file. Select the button to infer the schema from a sample file, choose your file extension and select a file.

Tree Schema Data Store Add Schema Infer From File

When you hit submit the file will be uploaded and Tree Schema will infer your schema.

Tree Schema Data Store Add Schema Infer From File

You can hit submit at the bottom (not depicted in the picture) to save your schema.
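To give a feel for what inferring a schema from a sample file involves, here is a rough Python sketch that guesses one type per column of a CSV file. This is not how Tree Schema does it internally; the functions infer_type and infer_schema, the type names and the widening rule are all assumptions made for illustration.

```python
import csv
import io
from typing import Dict


def infer_type(value: str) -> str:
    """Guess a scalar type name for one CSV cell."""
    for caster, name in ((int, "integer"), (float, "number")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    if value.lower() in ("true", "false"):
        return "boolean"
    return "string"


def infer_schema(csv_text: str) -> Dict[str, str]:
    """Infer one type per column; fall back to 'string' when rows disagree."""
    reader = csv.DictReader(io.StringIO(csv_text))
    schema: Dict[str, str] = {}
    for row in reader:
        for column, value in row.items():
            guessed = infer_type(value)
            previous = schema.setdefault(column, guessed)
            if previous != guessed:
                # integer + number widens to number; any other mix becomes string
                widened = {previous, guessed} == {"integer", "number"}
                schema[column] = "number" if widened else "string"
    return schema


# Hypothetical sample file content.
sample = "id,price,active,name\n1,9.99,true,widget\n2,12,false,gadget\n"
print(infer_schema(sample))
```

An inferred schema is only as good as the sample, so it is worth reviewing the data types before you hit submit.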

Manual schema definition

The manual schema definition follows the same steps as inferring a schema from a file. The only difference is that you manually define all, or parts, of the schema. In the schema definition you can add new fields, change data types, or create sample values for your fields.

Tree Schema uses “dot” notation for embedded objects, so a schema definition such as:

Tree Schema Dot Notation

will create a set of fields that has the following structure:

            
  {
    "field": {
        "sub_field": "string",
        "sub_field2": "string"
    }
  }
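
The mapping from dot notation to a nested structure can be sketched in a few lines of Python. We are assuming here that the field names entered in the schema definition were field.sub_field and field.sub_field2, which matches the structure above; the function name expand_dot_notation is hypothetical.

```python
from typing import Any, Dict


def expand_dot_notation(flat_fields: Dict[str, str]) -> Dict[str, Any]:
    """Turn {'a.b': 'string'} style field names into nested dictionaries."""
    nested: Dict[str, Any] = {}
    for path, data_type in flat_fields.items():
        *parents, leaf = path.split(".")
        node = nested
        for part in parents:
            # Walk down, creating each embedded object the first time it is seen.
            node = node.setdefault(part, {})
        node[leaf] = data_type
    return nested


flat = {"field.sub_field": "string", "field.sub_field2": "string"}
print(expand_dot_notation(flat))
# {'field': {'sub_field': 'string', 'sub_field2': 'string'}}
```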
            
            

Populate sample values for your fields

Sample values allow your users to understand what specific values exist for each field and what those values mean. They are a critical aspect to allowing your data users to effectively use your data.

From the data store details page select Visit Schema for one of your schemas that was created.

Tree Schema Data Store Select Schema

This will bring up the schema details page. Now, select a specific field in order to update the sample values.

Tree Schema Data Schema Select Field

The field that I’ve chosen only has one sample value populated. Edit the description by selecting the edit button on the right side.

Tree Schema Data Field Overview

This brings up the edit sample value modal. You can change the sample value itself, its description, and whether or not users without full data access can view this value.

Tree Schema Edit Field Value

Back when we created a connected data store we assigned teams to have full access to the data store. A user will not be able to see this sample value if all three of the following are true:

  1. The user is not in one of the teams that has full access to the data store
  2. The user is not an admin
  3. The value for Allow users without full access to this Data Store to view & edit this value? is set to “No”
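
The three conditions above boil down to a single boolean rule, sketched here in Python. The function and parameter names are ours and only mirror the rule as described in this tutorial; this is not Tree Schema's actual implementation.

```python
from typing import Set


def can_view_sample_value(
    user_teams: Set[str],
    full_access_teams: Set[str],
    is_admin: bool,
    allow_without_full_access: bool,
) -> bool:
    """A value is hidden only when all three conditions hold at once."""
    in_full_access_team = bool(user_teams & full_access_teams)
    hidden = (
        not in_full_access_team
        and not is_admin
        and not allow_without_full_access
    )
    return not hidden


# A non-admin outside the full access teams cannot see a restricted value...
print(can_view_sample_value({"analytics"}, {"default"}, False, False))  # False
# ...but can see it again once the value is opened up to everyone.
print(can_view_sample_value({"analytics"}, {"default"}, False, True))  # True
```

Put another way, any one of team membership, admin rights or the “Yes” setting is enough to make the value visible.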

You can also add additional sample values as needed from the field details page.

Data lineage

Data lineage allows your data users to understand how your data moves and is a critical capability for all data users whether they are researching potential impacts when making changes to a data pipeline or trying to understand which of the six date fields in a given table should be used for reporting.

Define your data flows

In order to have data lineage you first need to capture how your data moves. Tree Schema captures data movements through Transformations. A Transformation in Tree Schema is simply a reference to data moving from a field in one schema to a field in another schema.
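
As a mental model, a Transformation can be thought of as a named list of field-to-field links. The Python sketch below uses hypothetical names (FieldRef, Transformation, links) and made-up example assets purely to illustrate that idea.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class FieldRef:
    """Points at one field inside a schema inside a data store."""
    data_store: str
    schema: str
    field: str


@dataclass
class Transformation:
    name: str
    # Each link says: data moves from the first field to the second field.
    links: List[Tuple[FieldRef, FieldRef]]

    def sources(self) -> Set[FieldRef]:
        return {source for source, _ in self.links}

    def targets(self) -> Set[FieldRef]:
        return {target for _, target in self.links}


# Hypothetical example: an email column copied from an app database to a warehouse.
raw_email = FieldRef("app_db", "users", "email")
clean_email = FieldRef("warehouse", "dim_user", "email_address")
load_users = Transformation("load_dim_user", [(raw_email, clean_email)])
print(load_users.sources())
print(load_users.targets())
```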

To create a transformation in Tree Schema, navigate to the Transformations page.

Tree Schema Navigate to Transformation

Select Create Transformation to move forward with creating a new transformation.

Tree Schema Create New Transformation

Similar to data stores and schemas, the transformations have a name, type and points of contact.

Tree Schema New Transformation Info

The last step is to simply define where the data comes from: the source(s), and where the data is going: the target(s). From the Select source schema(s) tab, first choose the data store, then the schema and fields you want to include in your transformation. Once you have selected the fields you want to include in your transformation hit Add to transformation diagram to place the fields into the transformation.

Tip: You can add more than one source schema at a time.

Tree Schema New Transformation Sources Added

Next, choose Select target schema(s) at the top to select the target schemas and fields. This follows the same process.

Tree Schema New Transformation Targets Added

Now, click the triangle next to a source field and then click the triangle next to the target field to connect them.

Tip: Click and release to create a connection. Do not click and hold!

Tree Schema New Transformation Targets Added

When you have completed creating your transformation, click I’m done, save transformation! to finalize your transformation.

Explore data lineage

When you have created transformations for your data assets you can start to explore them with data lineage. Every data store, schema, field and transformation has a tab under their corresponding details section for lineage. When you load the page for a data asset all of the immediate data lineage connections for that asset will be displayed.
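
Conceptually, exploring lineage is a graph walk over the links that your transformations define. Here is a small Python sketch, assuming each link is a (source, target) pair of field names; the function name reachable and the example field names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source field, target field)


def reachable(start: str, edges: Iterable[Edge], upstream: bool = False) -> Set[str]:
    """Collect every field downstream (or upstream) of `start`."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for source, target in edges:
        if upstream:
            adjacency[target].add(source)
        else:
            adjacency[source].add(target)

    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)  # only relevant if the links form a cycle
    return seen


# Hypothetical links taken from two transformations.
edges = [
    ("raw.orders.created_at", "staging.orders.order_date"),
    ("staging.orders.order_date", "reporting.daily_sales.sale_date"),
    ("raw.orders.amount", "staging.orders.amount"),
]
print(reachable("raw.orders.created_at", edges))
print(reachable("reporting.daily_sales.sale_date", edges, upstream=True))
```

The immediate connections shown when a page loads correspond to a single hop in this graph; expanding up or downstream repeats the walk one step further.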

Tree Schema Data Lineage

From here you can explore up and downstream connections for your data lineage. A more fully-connected example is shown below:

Capture your team’s knowledge

We’ve now walked through how to capture and relate all of your data assets. In this section you’ll learn some of the ways that you can share knowledge about your data with your teammates.

Rich-text documentation

All data assets in Tree Schema - data stores, schemas, fields and transformations - have the ability to define rich-text documentation in their own corresponding README panel. Update the README to start sharing your knowledge!

Tree Schema Data Asset Readme

Comments & conversations

Your users will have questions and oftentimes more users will have the same questions later on. The comments section is a great way to allow your users to share additional information about your data. You can also attach files to comments, which can be a great way to share common content such as queries, access instructions and more.

Tree Schema Data Asset Comments

Assign experts

Knowing who uses your data is important, since those who use certain data assets are generally the ones who can help others as well. In Tree Schema all data assets have the following types of experts:

  • Power Users: those who visit and use the data asset most often
  • Volunteer Experts: those who volunteer as experts for the given data asset

Tree Schema Data Asset Experts

Define your keywords in the data dictionary

You can use Tree Schema to capture the keywords that drive your business. When you define your keywords you also create a context for that keyword. A context is the scope for which the keyword has a meaning. Consider the example keyword “channel”.

For the marketing team a channel could be how the user came to your app. For the product team, it could be a segment of correlated user groups, and for the development team the channel could mean the type of pipeline that is processing the data.

To create a keyword, navigate to the dictionary and select Add New Keyword.

Tree Schema Add New Keyword

The keyword creation modal will be displayed. Complete the form and save your keyword.

Tree Schema Add New Keyword Details

Every time that you add a new keyword, the context is registered as a tag. In the example above we just created the context “Marketing”, so the rest of our data assets in Tree Schema will now have the tag “Marketing” available. While this is not the only way to create tags, using the context from your keywords to pre-define your tags can be a great way to limit the scope and structure for your tags.
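
The relationship between keywords, contexts and tags can be sketched as a small Python class. The DataDictionary class, its methods and the example definitions below are hypothetical; the sketch only illustrates that one keyword can carry several context-specific meanings and that every context doubles as a tag.

```python
from typing import Dict, Set, Tuple


class DataDictionary:
    def __init__(self) -> None:
        # (keyword, context) -> definition
        self.definitions: Dict[Tuple[str, str], str] = {}
        self.tags: Set[str] = set()

    def add_keyword(self, keyword: str, context: str, definition: str) -> None:
        self.definitions[(keyword.lower(), context)] = definition
        # Registering a keyword also makes its context available as a tag.
        self.tags.add(context)

    def lookup(self, keyword: str) -> Dict[str, str]:
        """Return every definition of a keyword, keyed by context."""
        return {
            context: definition
            for (known, context), definition in self.definitions.items()
            if known == keyword.lower()
        }


dictionary = DataDictionary()
dictionary.add_keyword("channel", "Marketing", "How the user came to the app")
dictionary.add_keyword("channel", "Product", "A segment of correlated user groups")
print(dictionary.lookup("Channel"))
print(sorted(dictionary.tags))  # ['Marketing', 'Product']
```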

Tag your data assets

Tags can be used to group your data assets together. Common uses for tags include identifying data assets for a use-case (e.g. marketing), finding all PII data, or grouping assets into a structured set for training.

Every data asset in Tree Schema has the ability to add tags. To add a tag to an asset, just type in the tag you would like to add and either select a similar existing tag or create a new one. In this example, I’ll navigate to a data asset and just type the letter “M”. As you can see, the “Marketing” tag is available because we defined the word “channel” with the context “Marketing” above.

Tree Schema Context Tag

You can add as many tags as you would like to each data asset.

Tree Schema Tag Multiple

View assets by tag

Once your assets are tagged you can navigate to the tags page to view all assets by each of your tags.

Tree Schema Tag Overview

Selecting a row will break out each different type of asset that is associated with that tag:

Tree Schema Tag Details
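
The tag overview is essentially a grouping of assets first by tag and then by asset type. A minimal Python sketch of that grouping follows; the function name assets_by_tag and the example assets are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Asset = Tuple[str, str, List[str]]  # (asset type, asset name, tags)


def assets_by_tag(assets: List[Asset]) -> Dict[str, Dict[str, List[str]]]:
    """Group asset names by tag, then by asset type."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for asset_type, name, tags in assets:
        for tag in tags:
            grouped[tag][asset_type].append(name)
    # Convert the nested defaultdicts into plain dictionaries for display.
    return {tag: dict(by_type) for tag, by_type in grouped.items()}


# Hypothetical tagged assets.
assets = [
    ("data_store", "app_db", ["Marketing"]),
    ("schema", "users", ["Marketing", "PII"]),
    ("field", "users.email", ["PII"]),
]
print(assets_by_tag(assets))
```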

That’s it!

You've now gone through the core Tree Schema features. Make sure to check out the help & documentation for full details and more examples!

