In modern data-driven businesses, the complexity that arises from fast-paced analytics, data mining, and ETL processes makes metadata increasingly important. Traditionally, however, metadata is stored and queried inside the system that generates it. Examples include databases such as Oracle, Teradata, and Hive on Hadoop; NoSQL datastores such as MongoDB and Cassandra; ETL systems such as Informatica; BI systems such as MicroStrategy; and scheduling systems such as Oozie, Azkaban, and UC4. This siloing of metadata causes problems: each system has only a partial view of the end-to-end data pipeline and data storage organization. As a result, it is very hard for data producers, consumers, and other interested parties (e.g. legal compliance teams) to perform fast and accurate analysis of the entire data ecosystem for data provenance and compliance-related use cases.
WhereHows, a project of the LinkedIn Data team, aims to solve this problem by creating a central metadata repository for the processes, people, and knowledge around the most important element of any big data system: the data itself. The repository has captured the status of 50 thousand datasets (with a storage footprint of more than 15 petabytes across multiple Hadoop, Teradata, and other clusters), 14 thousand comments, and 35 million job executions along with related lineage information. Current challenges include the quest for a generic data model and expansion to NoSQL datastores and stream processing systems.