So you’re ready to build your data catalog - where do you begin?
Building a data catalog can be a daunting task. Data is complex: it comes in many shapes and formats, and each needs to be captured and depicted in its native form. The way that users access and use data is unique to each team. Most challenging of all, getting a data catalog populated with enough data to be useful can be a time-consuming process.
I’ll walk through how you can build a data catalog that represents your company and your users, enabling your team to govern data better, improve self-sufficiency for metadata access, and share knowledge about your data more effectively. Along the way we'll touch on how the Tree Schema data catalog helps teams turn the seemingly insurmountable task of documenting your data from a mundane chore into a deeply ingrained part of your data culture!
There are five steps to building an effective data catalog:
- Capture your data
- Assign owners and points of contact
- Document your team’s knowledge
- Keep your data catalog up to date
- Optimize for your team
When you plan to build a data catalog and you need to capture your data, there are two questions you need to answer: what metadata should be captured, and how should it be captured? Let's address each in turn.
What metadata should be captured?
The first step in building a data catalog is to populate it with the shape, structure and semantics of your data. Most data users (data scientists, data engineers, business analysts, etc.) talk about data in terms of the schema, or table, where data resides. Consider the following questions and answers that may be common in your organization:
- “Where can I find customers who have bought at least one item?” Check out the “cust_purchases” table.
- “How do I calculate the cost of acquisition?” You need to join the “mkt_leads” and “customers” tables on email address.
- “How are invoices created?” An invoice is made up of one or more orders; you can find those in the “invoices” and “orders” tables respectively. If an invoice has been paid you can find the payment in “payments”.
Users generally speak about data in terms of schemas, or tables, because they represent entities and it is natural for users to talk about how a lead becomes a customer and how a customer has one or more orders. A well-defined data catalog should have first-class support for all types of schemas, not just tables, and should enable interactions around schemas.
A schema, however, is not the only important entity. A schema resides within a data store, a schema is made up of one or more fields, and each field has 0-N unique values. On top of this, transformations move data from field to field between schemas; this is what creates data lineage. While a data catalog should have a schema-first approach in order to be aligned with how users think and use data, it must also be able to accommodate this full hierarchical set of relationships in order to answer the minute and detailed questions that users will have about the data.
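This hierarchy (data store contains schemas, a schema contains fields, fields have sample values) plus lineage edges between fields can be sketched as a small data model. This is purely an illustrative sketch, not Tree Schema's internal model; all class and attribute names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, minimal model of the catalog hierarchy described above.
@dataclass
class Field:
    name: str
    data_type: str
    sample_values: List[str] = field(default_factory=list)  # 0-N unique values

@dataclass
class Schema:
    name: str                       # a table, topic, nested file format, etc.
    fields: List[Field] = field(default_factory=list)

@dataclass
class DataStore:
    name: str                       # e.g. a Postgres instance or an S3 bucket
    schemas: List[Schema] = field(default_factory=list)

@dataclass
class Transformation:
    """A lineage edge: a transformation moves data from one field to another."""
    source: str                     # e.g. "mkt_leads.email"
    target: str                     # e.g. "customers.email"

store = DataStore("warehouse", [
    Schema("customers", [Field("email", "varchar"), Field("name", "varchar")]),
])
lineage = [Transformation("mkt_leads.email", "customers.email")]
```

A schema-first catalog surfaces the `Schema` level to users while keeping the full tree navigable for the detailed questions.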
In today’s world, streaming data and non-tabular data (e.g. JSON, Parquet structs, etc.) are commonplace and their usage is growing at an increasing rate. Even if you do not use these technologies today, look for a data catalog that fully supports nested data structures and will allow you to integrate streaming technologies in the future.
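To see why nested-structure support matters, consider how many distinct field paths even a small JSON document contains. A short sketch (the event document is a made-up example) that enumerates nested field paths:

```python
import json

def nested_field_paths(obj, prefix=""):
    """Recursively list dotted field paths in a nested (JSON-like) structure."""
    paths = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.append(path)
            paths.extend(nested_field_paths(value, path))
    elif isinstance(obj, list) and obj:
        # Describe list elements using the first item's structure
        paths.extend(nested_field_paths(obj[0], prefix + "[]"))
    return paths

event = json.loads(
    '{"user": {"id": 1, "address": {"city": "NYC"}}, "items": [{"sku": "A1"}]}'
)
print(nested_field_paths(event))
# -> ['user', 'user.id', 'user.address', 'user.address.city', 'items', 'items[].sku']
```

A flat, tables-only catalog would collapse all of this into a single opaque column.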
Finally, it is imperative that any data catalog be able to capture data lineage. Data lineage enables users to see where data came from and where it is going; this is critical to providing context that users often need when using data. Here are some common scenarios that exemplify the value that data lineage provides:
- A data scientist is looking at user acquisition costs and wants to know what upstream production table outputs data into the analytical marketing tables.
- A data engineer is updating a pipeline and wants to see what downstream processes are impacted.
- A CIO wants to track where PII data is being moved after it lands in the data lake.
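Each of these scenarios is a traversal over the lineage graph. A minimal sketch of the downstream-impact question (table names are hypothetical):

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: source asset -> target asset
edges = [
    ("orders", "analytics.marketing"),
    ("mkt_leads", "analytics.marketing"),
    ("analytics.marketing", "dashboards.acquisition"),
]

def downstream(start, edges):
    """Breadth-first walk of the lineage graph starting from `start`."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(downstream("orders", edges))
# Reversing each edge answers the upstream question the same way.
```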
How should the metadata be captured?
When you build your data catalog you will want to use a tool that is able to populate the catalog on your behalf, so that you do not need to spend countless hours manually updating every database, table and field in your data ecosystem. Nearly all databases and data stores (e.g. Kafka, AWS S3, etc.) have APIs available that will allow you to extract the metadata that represents the shape and semantics of your data; therefore you should strongly consider the ability to automatically populate your metadata when building your data catalog.
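As an illustration of how little code such extraction takes, here is a sketch against SQLite's built-in metadata interface; most databases expose an equivalent (e.g. `information_schema` in Postgres and MySQL):

```python
import sqlite3

# In-memory database standing in for a real data store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")

def extract_schema(conn):
    """Pull table and column metadata straight from the database's own catalog."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
        # Each PRAGMA row: (cid, name, type, notnull, default_value, pk)
        catalog[table] = {col[1]: col[2] for col in columns}
    return catalog

print(extract_schema(conn))
# -> {'customers': {'id': 'INTEGER', 'email': 'TEXT', 'name': 'TEXT'}}
```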
There are scenarios where you may not be able to connect directly to your database, for example, if you do not want to expose sensitive data or if you are using a managed database that is not publicly available. In these scenarios you should be able to use sample files and extracts from your data store in place of a direct connection to your database.
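When only a sample extract is available, much of the same shape information can still be inferred from it. A sketch using Python's standard library with deliberately naive type guessing (the sample data is made up):

```python
import csv, io

# A sample extract standing in for a file pulled from the data store
sample = io.StringIO("id,email,age\n1,a@example.com,34\n2,b@example.com,29\n")

def infer_schema(fileobj):
    """Guess a column -> type mapping from a CSV sample extract."""
    reader = csv.DictReader(fileobj)
    rows = list(reader)
    schema = {}
    for column in reader.fieldnames:
        values = [row[column] for row in rows]
        # Naive rule: call it an integer only if every sampled value parses as one
        schema[column] = "integer" if all(v.isdigit() for v in values) else "string"
    return schema

print(infer_schema(sample))
# -> {'id': 'integer', 'email': 'string', 'age': 'integer'}
```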
As seen below, with the click of a button Tree Schema can extract your data schema in a matter of seconds, giving you the ability to quickly create your data catalog and populate it with sample values from your underlying data store:
When all else fails you should be able to quickly and easily capture your data on your own, without automation. No process is perfect, and no tool will be able to keep up with the frequency of change across the client libraries of every disparate database, so having a way to correct problems or document your data yourself is critical.
Once you have data in your data catalog you will want to identify who the important people are for each data asset. Assigning owners to your data assets is important for two main reasons:
- It allows users who have additional questions to know who to reach out to
- It establishes responsibility for ensuring that documentation is complete and accurate
When data users have questions about your data the questions generally can be placed into one of two categories:
What is the business context for this data asset?
- What does this status code “01” mean?
- What does Null mean for this field?
What are the technical attributes for the data asset?
- How do I connect to the database?
- Who can add this new field to the schema?
- Where can I get an SSH key to access the associated jump-server?
While a data catalog may have many types of owners (e.g. data steward, technical owner, business owner, executive owner, etc.), the two most important are the data steward and the technical owner. The data steward should enable your users to know who to go to for all business related information and the technical owner will have the answers to the more tech-oriented questions that your users have.
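In practice this can start as a simple registry mapping each asset to its two key roles, so that business questions route to the steward and technical questions to the technical owner. A sketch (asset names and people are hypothetical):

```python
# Hypothetical ownership registry: asset -> {role: person}
owners = {
    "customers": {"data_steward": "Mary", "technical_owner": "Mark"},
    "invoices":  {"data_steward": "Mary", "technical_owner": "Priya"},
}

def who_to_ask(asset, question_type):
    """Route business questions to the data steward, technical ones to the owner."""
    role = "data_steward" if question_type == "business" else "technical_owner"
    return owners[asset][role]

print(who_to_ask("customers", "business"))   # -> Mary
print(who_to_ask("invoices", "technical"))   # -> Priya
```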
When you are creating a data catalog you will want to assign tasks to your owners; these tasks are intended to ensure that your data catalog is well documented and useful to your other teammates. It is important to remember that your data catalog does not directly create value for your organization; it is an enabler for your team to be more efficient. The tasks that your data catalog creates should be aligned with this and focused specifically on helping you get the maximum utilization out of your data; tasks should not be directly intended to drive engagement within the data catalog.
When you are starting to document your data in a data catalog, the amount of information that you want to capture can seem overwhelming at first. If you only have two databases, each with a few dozen tables, and each table has a handful of fields, you are already looking at hundreds of data assets. Don’t worry, not everything needs to be fully defined in your data catalog to get the most value!
Start by picking a single methodology and slowly adding documentation over time, with the goal of reaching a certain percentage of coverage (perhaps 90%, or even 40%) within a few months. Some of the more common methodologies are:
- When you learn about it, document it - everyone takes responsibility for updating the data catalog when they learn something new that was not documented
- When you change the code, change the documentation - as teams release new features part of the checklist is to update the data documentation
- Set aside time on a set cadence - have each of your team members spend an hour a week, or perhaps 15 minutes each morning, to add documentation for the data assets that they know well or research the ones they do not know as well
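Whichever methodology you pick, tracking coverage keeps the goal concrete. A sketch of computing documentation coverage over a set of assets (the catalog state below is hypothetical):

```python
# Hypothetical catalog state: asset -> description (None means undocumented)
descriptions = {
    "customers": "One row per customer account.",
    "orders": "One row per order line item.",
    "mkt_leads": None,
    "payments": None,
    "invoices": "One row per issued invoice.",
}

def coverage(descriptions):
    """Percentage of assets that have a description."""
    documented = sum(1 for text in descriptions.values() if text)
    return 100 * documented / len(descriptions)

print(f"{coverage(descriptions):.0f}% documented")   # 3 of 5 assets -> 60%
```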
All data assets should be able to have rich-text documentation within the data catalog to give users the ability to highlight key points. Data catalogs should also provide users the ability to group assets together in common sets, one common way to do this is by tagging your data. For example, if you want to be able to see a report on all of your Personally Identifiable Information (PII) you could tag all of your tables and fields that contain this data with “PII”.
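Tag-based grouping is essentially an inverted index from tag to assets. A minimal sketch of the PII report described above (the tagged fields are hypothetical):

```python
from collections import defaultdict

# Hypothetical tags applied to fields in the catalog
tags = {
    "customers.email": ["PII"],
    "customers.name": ["PII"],
    "orders.total": ["finance"],
}

def assets_with_tag(tags, wanted):
    """Invert field -> tags into tag -> fields, then report one tag's members."""
    index = defaultdict(list)
    for asset, asset_tags in tags.items():
        for tag in asset_tags:
            index[tag].append(asset)
    return sorted(index[wanted])

print(assets_with_tag(tags, "PII"))
# -> ['customers.email', 'customers.name']
```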
Documentation for your data is most powerful when your data catalog allows your users to have conversations about your data, with your data. When a user has a question about data and that question is eventually answered, the question, the answer and the conversation that led to the answer should be documented within the catalog. This allows the next data user who has a similar question to see the previous question and understand the context around the answer, saving countless conversations that repeat the same set of questions and answers. For example:
- Mary: "Where is the SSH key to access the database and how do I set up a tunnel to get there from my computer?"
- Mark: "We got rid of SSH keys a while back, you just need to be logged into the VPN and you can point directly to the database host."
This is only a simple example, but consider how frustrated Mark would be if every new user asked him this! Or how your operations could slow down if Mark left the company!
All of the data assets in Tree Schema share a similar layout, which allows you to quickly digest the information that your teammates have posted. The example below walks through the documentation for a data schema, but you have just as much rich information available for everything from fields to your data lineage!
Keeping your data catalog fresh can be a major challenge. Your developers probably change the structure of databases all the time and create new pipelines on a weekly basis. Your data scientists and business analysts are likely creating data cubes or moving data between analytical environments to create new dashboards just as frequently. Your data catalog should enable you to automatically identify these changes where possible and to update itself accordingly.
There is only so much that a data catalog can do on its own to stay fresh, so some user interaction to double-check the quality and staleness of the information is important. Governance actions can be used by your data catalog to nudge your users to take action when it thinks the underlying documentation may be outdated.
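At its core, automatic change detection is a diff between the live schema and the last cataloged snapshot. A sketch (both snapshots below are hypothetical):

```python
# Hypothetical snapshots: column -> type, last cataloged vs. currently live
cataloged = {"id": "integer", "email": "string"}
live      = {"id": "integer", "email": "string", "signup_date": "date"}

def schema_diff(old, new):
    """Report added, removed, and retyped columns between two schema snapshots."""
    return {
        "added":   sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }

print(schema_diff(cataloged, live))
# -> {'added': ['signup_date'], 'removed': [], 'retyped': []}
```

Any non-empty bucket is a signal to update the catalog entry or nudge its owner.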
The way that every company uses a data catalog is going to be unique. It is important that you set standards and team norms for how you want your organization to utilize the data catalog. The way that your team plans to use the data catalog will heavily influence how you capture documentation; if you do not know how your team will use it, the time you spend documenting your knowledge is likely to lead to suboptimal results. Here are some things that your team can do to optimize your interactions with a data catalog:
- Set standardized documentation formats and use this for all of your databases, schemas, fields and data lineage
- Identify key learning plans (e.g. new associate onboarding) and tag the assets that are included in each learning plan with a common theme
- Reinforce your team norms for how you use the data catalog so that it becomes deeply embedded in your data culture
A data catalog is only as valuable as the people who use it and the information that they put into it. Whether you are using an off-the-shelf product such as Tree Schema or building your own, make sure to give feedback to the developers who are building the catalog or to the company that provides it. Feedback such as how you are using the data catalog and what features you would like to see will help create the features that you need and that are unique to you and your data.