If you’re reading this, then you are probably already familiar with data lineage. We’ll skip over the nuts and bolts; if you need a quick refresher, I find that Wikipedia’s definition is quite good. Instead, I’ll discuss some of the key reasons why data lineage is not more prevalent among small and medium-sized businesses today, and how we’re trying to change that here at Tree Schema.
Company culture is the #1 factor in creating a top tier data organization
You cannot execute well on data lineage if you do not have a strong data culture. My colleagues and I have each worked at several companies, and we’ve consulted with many more. After you’ve worked with a few dozen companies, it becomes clear which actions and traits organizations with strong data cultures share. Here are a few examples of what they do:
- Spend time to curate their data and thoroughly document it
- Keep their metadata up to date when things change
- Provide clear & reliable processes for how their users can have self-service access to knowledge as well as how to escalate questions
- Promote knowledge and training through data stewards and/or other data experts
- Reinforce the value of data and importance of proper documentation through social norms
To be clear, doing all of these, and doing them well, is rare. These are all important when it comes to providing data lineage to your organization, but two are imperative: how well you document the current state of your data and how well you keep that information up to date. If your data is not comprehensively cataloged, or if your metadata is out of date, your users will lose trust.
As the graphic above illustrates, if you do not prioritize a quality data culture there will be a lack of emphasis within your team, and a lack of emphasis leads to a lack of effort spent growing your data culture. Even when the entire team is aligned, there are still challenges that make building a strong data culture difficult. Some of the most common objections I hear from companies are:
- There are not enough resources to do this effectively
- There are always higher priorities, or teams have an I’ll-just-document-that-later mindset
- The ROI for building a strong data culture is hard to quantify
These are all valid points; however, they are generally just confirmation that the company does not have a good data culture and is not willing to prioritize making it better. These are the three responses that I generally give, respectively:
- Having a strong data culture means that everyone participates as part of their job; it is not about having more resources so much as ensuring everyone does their part
- Developers, analysts, data scientists, and other data users should spend less than 5% of their time on documentation; if it takes a week to create a new feature, spend one to two hours documenting it
- Consider the most naive value assessment: how many hours a month would self-service data lineage / a data catalog save your data users? Multiply that by the average hourly rate you pay. This is not a perfect methodology, but even so, most small to medium companies will tell you it comes out to at least a few hundred to a few thousand dollars per month.
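The naive value assessment above boils down to one multiplication. Here is a minimal sketch; every number below is a hypothetical placeholder, not a benchmark, so substitute your own figures:

```python
# Naive ROI estimate for self-service data lineage / a data catalog.
# All inputs are hypothetical placeholders; replace them with your own.

hours_saved_per_user_per_month = 2   # time no longer spent chasing tribal knowledge
number_of_data_users = 15            # analysts, engineers, data scientists, etc.
average_hourly_rate = 60             # fully loaded cost per hour, in dollars

# Monthly value = hours saved per user x number of users x hourly rate
monthly_value = (
    hours_saved_per_user_per_month * number_of_data_users * average_hourly_rate
)
print(f"Estimated value of self-service lineage: ${monthly_value}/month")
# → Estimated value of self-service lineage: $1800/month
```

Even with conservative placeholder inputs like these, the estimate lands squarely in the "few hundred to a few thousand dollars per month" range.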
Sure, achieving this may require some adjustments to the way an organization is run, and there may be other areas where you need to make a sacrifice and steal some time; but that’s how change happens, right?
I fundamentally believe that every company has the capacity to lay a solid foundation for building a data culture. This can be a challenge for many smaller companies for any number of reasons: perhaps leadership is more product focused, or the entire tech team is concentrated on shipping the next release so the company can meet a contractual obligation. My colleagues and I will continue to share knowledge and suggestions about how to build a data culture and to make this available to everyone; if you ever have questions, whether you’re a customer or not, we’re always happy to listen and to try to give advice.
Integrating into a tech stack is challenging
There are a handful of products out there that do a really good job of automating metadata capture from data persistence layers. Here at Tree Schema, we pride ourselves on doing that as well, to make it as seamless as possible for you to get up and running.
The challenging part is capturing the movement of and relationships between your data. Some providers do an exceptional job at this if your entire data pipeline is in SQL, but even then, automated lineage capture for SQL can break down once you have complicated scripts with sub-tables, complicated window functions, chained functions, dynamically generated scripts, and so on. There are open source tools, such as Apache Atlas, that work really well as long as you’re using certain tools within the HDFS ecosystem. However, there is no fully encompassing, one-size-fits-all solution, nor should there be.
Furthermore, the prevalence of serverless capabilities (e.g. AWS Lambda, Google Cloud Functions, Azure Functions), dockerized services, and a plethora of ever-growing open source and pluggable components (e.g. Kafka Connect, Debezium, Maxwell, etc.) means that many companies will not be tied to a single technology for their entire data pipeline stack. Tool sets will consolidate within companies, but this still imposes a large burden on service providers that build data lineage integrations. Each data movement product means another set of APIs for data lineage providers to integrate with, and another set of APIs to maintain as those products evolve. To make matters worse, the number of different frameworks continues to grow.
Since it is impossible for a data lineage tool to actively maintain the ability to parse data movement from every external source, the alternative would be a standardized API that all external data movement providers could integrate with. Again, Apache Atlas has done the best job of describing this with its create-relationships API. The API itself is pretty good, but even if we simply took it as an industry standard, the challenge would remain that most of the good integrations with Apache Atlas reside within the HDFS ecosystem, which most small and many medium-sized companies do not use.
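As a thought experiment, such a standardized API could be as simple as accepting a source/target edge from any data movement tool. The sketch below is purely illustrative; the field names and values are hypothetical and are not the Apache Atlas API or any real provider’s contract:

```python
# A hypothetical, minimal payload a standardized lineage API might accept.
# Every field name and value here is illustrative only.
from dataclasses import dataclass, asdict
import json

@dataclass
class LineageEdge:
    source: str     # fully qualified upstream field or table
    target: str     # fully qualified downstream field or table
    transform: str  # free-text description of the movement / transformation
    producer: str   # the tool that performed the data movement

edge = LineageEdge(
    source="postgres.prod.users.email",
    target="warehouse.analytics.dim_users.email",
    transform="lowercased and deduplicated",
    producer="nightly-etl",
)

# Any data movement tool could emit this payload to the lineage provider.
payload = json.dumps(asdict(edge))
print(payload)
```

The appeal of an edge-level contract like this is that each data movement tool only needs to describe what it moved, leaving the lineage provider to stitch the edges into a graph.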
Our approach has been slightly different: we’re not aiming for fully automated data lineage capture; instead we are aiming to make it as easy as possible for all users, technical and non-technical alike, to create and maintain data lineage relationships. We may create a systematic way to ingest lineage at some point, as we are always looking for more ways to make data cataloging and data lineage easier, but it really depends on the feedback we get from our customers. Currently our customers love the simplicity of being able to quickly capture and represent any relationship between their data with a few clicks, so we’re going to focus on making this feature even simpler.
In order to get access to data lineage you need to buy... well, everything else
There is a major market demand for data lineage right now, especially from large, international companies. This is being driven by several factors:
- Data users, such as analysts, data scientists, and engineers, remain in high demand; this means more pipelines, more tables, more databases, and less knowledge of the whole picture held by any one person
- Data collection is (still) growing at an ever-increasing rate
The thing is, when a big company buys a product for its entire organization, it wants the product to do… well, everything. Large companies want the data lineage tool to also be the tool their analysts use to understand summary statistics. They want the same tool to be the place where access is provisioned, where reports are generated, and where ETL pipelines are created. Intuitively, this makes sense: if you can build these deep integrations, you get tremendous benefits in metadata capture, more informed users, and automated lineage creation.
Regardless of how effective these integrations actually are, the cost is certainly not cheap, and it is often prohibitive for any company that makes less than a few million in revenue per year. The components that make up existing frameworks are not (at least currently) available independently; if you just want access to a data catalog with data lineage, you have to purchase the whole package, which starts in the range of $1k - $12k (at least from what we’re aware of) and generally carries additional per-user fees.
Tree Schema is not trying to attract the Fortune 500, nor do we intend to overload our product with ancillary features; we know that there are many startups and small companies that want to move quickly and only want to pay for the capability to move their tribal knowledge into documented form. Our product aims to bring the bar down, way down, so that everyone can have access to a data catalog with data lineage from day one without a substantial investment.
We’re making a data lineage tool your team will want to use
One thing we’re clearly passionate about, and that I fundamentally believe, is that the user experience must come first (yes, even for a data catalog). Our aim is to keep things simple, to make them intuitive, and to give you the right amount of automation in the right places. Sometimes data governance and cataloging take a little hands-on time, and if you’re building a strong data culture we think you will agree with us. In these areas, we look to give you the easiest ways to document your data, connect your sources, create your lineage, and explore your data.