Why You Need Data Documentation in 2024

Data teams have no shortage of priorities in 2024. From alluring projects like building new pipelines and exploring new tools to the inescapable platform migrations and ad-hoc questions, teams may struggle to prioritize their abundant backlogs. This year, the pressure goes beyond the usual push and pull between maintenance and innovation, as teams also face budget pressures on headcount and cloud costs.

In such an environment, where should teams channel their limited time and energy? The answer may be none of the above. Investing in data documentation may be the dark horse solution to help offload demands for user support, improve internal team coordination, and create a clear-eyed data strategy.

Why you need data documentation in 2024

Data documentation is too often seen as a nice-to-have, like putting the final bow on a Christmas present. But thinking of documentation as a passive, static artifact or, even worse, a frivolous ornament misses the point. When done right, data documentation can be a catalyst for data teams by playing an active role in supporting data users, coordinating data teams, and informing executives.

Documentation is a data product’s developer advocate

Documentation is the “front end” packaging of our data products and plays crucial functions in marketing and customer support. Data analysts and data scientists often feel impeded by their organization’s data management – including data availability, accessibility, and quality – to the befuddlement of data teams. However, the root cause of these perceived gaps in infrastructure is often a gap in understanding of what data exists and how it is intended to be used. 

Documentation improves user support and satisfaction by acting like a 24/7 developer advocate. It’s always available to articulate the value of a data product, answer questions, and help users problem-solve. This type of support can dramatically improve the experience of working with data by helping users avoid snags that come from misunderstanding the data and by empowering them to troubleshoot issues when they arise. 

For example, consider product analysts at an e-commerce company working with an order_fulfillment table modeled with Slowly Changing Dimensions Type 2:

  • If analysts assume the table contains data only on shipped orders, they may report data issues when they find null values in the shipping_date field. However, such a value could instead represent a valid record for an order that has been placed but not yet shipped.
  • If analysts assume the table contains only one record per order (perhaps updated with Slowly Changing Dimensions Type 1), they may report duplicate order_id values as a data issue, even though order_id alone is not actually a primary key.
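
Both misreadings can be made concrete with a small sketch. The rows, the valid_from column, and the helper function below are hypothetical illustrations rather than the schema of any real table; the only assumption is that a Type 2 table keeps one row per version of an order.

```python
from collections import Counter

# Hypothetical rows from an order_fulfillment table modeled as SCD Type 2:
# each status change adds a new row instead of overwriting the old one.
rows = [
    {"order_id": 1, "status": "placed", "shipping_date": None, "valid_from": "2024-01-02"},
    {"order_id": 1, "status": "shipped", "shipping_date": "2024-01-04", "valid_from": "2024-01-04"},
    {"order_id": 2, "status": "placed", "shipping_date": None, "valid_from": "2024-01-03"},
]


def duplicated_keys(rows, key_columns):
    """Return the key tuples that appear on more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# order_id alone repeats, which looks like a duplicate under a Type 1 assumption.
print(duplicated_keys(rows, ["order_id"]))                # [(1,)]
# The assumed grain (order_id plus valid_from) is unique.
print(duplicated_keys(rows, ["order_id", "valid_from"]))  # []
# Null shipping_date values belong to placed-but-unshipped versions, not bad data.
print([r["order_id"] for r in rows if r["shipping_date"] is None])  # [1, 2]
```

A sentence of documentation stating the grain ("one row per order status update") and the meaning of a null shipping_date would answer both questions before anyone has to run checks like these.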

These occurrences can be equally frustrating to data users and data teams. Data users feel their time is wasted “cleaning the data” as they must guess-and-check assumptions. Then, as assumptions fail or others slip by unchallenged, incorrect analysis may be blamed on bad data. Similarly, data teams can get bogged down fielding these erroneous reports and lose sight of actual data quality issues and opportunities. 

So, by empowering data users, documentation actually helps data teams build capacity. And that’s not the only way documentation is an asset to data teams.

Documentation is a data team’s project manager

Users aren’t the only data constituent that may need help aligning on data’s intent. As data teams and organizations evolve over time, different understandings of data’s intended use can arise between developers across time, within a team, or across multiple data teams. 

Documentation serves as the original data contract by establishing a precise set of expectations of what data means, how it should behave, and what different constituents “owe” to one another should that intent evolve. Seen in this light, documentation plays the role of an effective project and product manager: forcing clarity and consensus on what has been and should be built. 

The alignment provided by good documentation can enable data teams in multiple ways:

  • Creating confidence in upstream data dependencies with a clear understanding of their intent
  • Reducing redundant exercises in data modeling, naming, etc. as documentation has helped drive an organization toward shared truths
  • Catching internal inconsistencies before they arise in a single team’s data work

For example, the meaning of fields can accidentally evolve over time as different developers interpret ambiguous human-language terms in different ways. It’s become a cliche among data practitioners that organizations typically have many definitions of what constitutes a “customer” (e.g. An account? A subscription? A household? A person?), but this is a symptom of a lack of consensus-forming documentation. 

In addition to avoiding errors, the way that documentation codifies intent can also accelerate data teams by making it easier to understand the full landscape of a team’s data products. This visibility can also support data leaders in delivering an effective data strategy.

Documentation is a data leader’s chief of staff

While data leaders may play a wide variety of roles, a survey from MIT’s 2022 Chief Data Officer Symposium highlighted the most common responsibilities as establishing clear and effective data governance, improving data quality, and building and maintaining data capabilities. 

Investments in data documentation help deliver on each of these objectives by surfacing and cascading information to help leaders evaluate and nurture their data organization to maturity. As examples of the contributions data documentation can make:

  • Effective data governance: Combining the power of column-level lineage and field-level tagging of sensitive data can improve data security. Even properly secured fields can inadvertently leak into the results of downstream ETL jobs, but understanding how sensitive data is used throughout the ecosystem enables the propagation of PII tags or other safeguards.
  • Improving data quality: Documenting fields establishes what should be true about data at the table and field levels. Such clear, unambiguous assertions are a necessary first step in any data quality agenda. Leaders can use this information with their teams to design tests, measure test coverage, and set goals to maintain acceptable data quality thresholds.
  • Building and maintaining data capabilities: Many data leaders face an epidemic of growing tables with intermingled purposes, fields, and definitions. Redundant data products confuse users, waste valuable team time on maintenance, and reduce trust in the overall data initiative. Organizing documentation in searchable data catalogs helps leaders plan more strategically with their data teams by identifying similar information in different siloed parts of the data ecosystem. This visibility can inspire consolidation of duplicative data sources for improved maintenance or fuel new data products that synthesize related but distinct use cases.
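
To illustrate the governance point, here is a minimal sketch of how a PII tag might be propagated along column-level lineage. The lineage mapping, the column names, and the propagate_tag function are all hypothetical; real lineage would come from a catalog or query-log parser, which is outside the scope of this sketch.

```python
from collections import deque

# Hypothetical column-level lineage: each column maps to the downstream
# columns derived from it.
lineage = {
    "raw.customers.email": ["staging.customers.email"],
    "staging.customers.email": ["marts.orders.customer_email", "marts.users.email_domain"],
    "raw.orders.amount": ["marts.orders.amount"],
}


def propagate_tag(lineage, tagged_columns):
    """Return tagged_columns plus every column reachable downstream of them."""
    seen = set(tagged_columns)
    queue = deque(tagged_columns)
    while queue:
        column = queue.popleft()
        for child in lineage.get(column, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


pii_columns = propagate_tag(lineage, {"raw.customers.email"})
print(sorted(pii_columns))
```

The traversal deliberately over-tags: a derived column such as an email domain may not be sensitive, so the output is better treated as a review list for a data steward than as a final classification.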

How to improve your documentation in 2024

If you’re convinced that data documentation can massively enhance your teams’ leverage in 2024, you can kick off the new year with many small, achievable goals.

Survey coverage

First, determine the current state of your teams’ documentation. List out all of your teams’ data assets and track how many of them have any documentation currently.

This step may sound simple. However, you may want to push to take a broad definition of what data products require documentation. For example:

  • Usage statistics can highlight the most popular tables to prioritize for documentation or surprisingly underutilized assets that documentation can better promote.
  • dbt staging models may not be intended for public consumption, but they should still be documented to encode a data team’s institutional knowledge (a process that can be aided with automation). Additionally, if these tables exist in a public schema, they may have more users than you think!
  • Metric documentation of downstream reports and dashboards is critical to ensure your stakeholders have the context to interpret results at the right level of specificity.
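
A coverage survey along these lines can start very small. The sketch below assumes you can export an inventory with a description and a recent query count per asset; the asset names, the queries_30d field, and the rule that any non-blank description counts as "documented" are assumptions made for illustration.

```python
# Hypothetical inventory: description text and a 30-day query count per asset.
assets = [
    {"name": "marts.orders", "description": "One row per order status update.", "queries_30d": 420},
    {"name": "marts.customers", "description": "", "queries_30d": 310},
    {"name": "staging.stg_orders", "description": None, "queries_30d": 15},
    {"name": "marts.refunds", "description": "One row per refund.", "queries_30d": 0},
]


def is_documented(asset):
    """Treat an asset as documented if it has a non-blank description."""
    return bool((asset["description"] or "").strip())


def coverage(assets):
    """Share of assets with any documentation, from 0.0 to 1.0."""
    if not assets:
        return 0.0
    return sum(is_documented(a) for a in assets) / len(assets)


def documentation_backlog(assets):
    """Undocumented assets, most-queried first."""
    missing = [a for a in assets if not is_documented(a)]
    ranked = sorted(missing, key=lambda a: a["queries_30d"], reverse=True)
    return [a["name"] for a in ranked]


print(f"{coverage(assets):.0%}")       # 50%
print(documentation_backlog(assets))   # ['marts.customers', 'staging.stg_orders']
```

Sorting the backlog by usage puts the popular but undocumented tables first, while documented assets with zero queries (marts.refunds here) point to the opposite problem: data that documentation may need to promote better.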

Add substance & style

Beyond clear gaps in documentation, evaluate whether existing documentation is well-designed to anticipate and address the likely questions of its readers. Documentation that stops at a simple column-level data dictionary is often insufficient to help users fully anticipate and appropriately use data assets. Some additional topics that thorough documentation might address are:

  • Data sources and business entities: What real-world processes are captured as new records in this dataset? Returning to our e-commerce shipping example, documentation could clarify if we are fundamentally modeling “orders” or “order status updates”
  • Refresh cadence: How often is data loaded from the source and how fresh is that data? Is there any reason some events may be slower to load than others due to sourcing via different systems or processes?
  • Field encoding: For any fields encoded with codes or abbreviations, what does each code mean? Sometimes this is documented directly in the field, but it can be even better to refer users to a relevant lookup table that they can join onto this field for automatic decoding
  • Null value handling: How are missing values addressed? Are fields null or are sentinel values used? Is there a way to differentiate between types of missingness, such as fields that are missing because they’re unknown (e.g. no shipping date for a shipment yet to be made) versus missing by irrelevance (e.g. no shipping date for an e-book)?
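
One way to keep these topics from being skipped is to give each a named slot in a structured record. The TableDoc class and missing_topics helper below are a hypothetical sketch, not the format of any particular catalog tool, and the example values are invented for the e-commerce scenario.

```python
from dataclasses import dataclass, field, fields


@dataclass
class TableDoc:
    """Hypothetical documentation record covering the topics listed above."""
    table: str
    grain: str = ""            # what real-world event one row represents
    refresh_cadence: str = ""  # how often data loads and how fresh it is
    field_encodings: dict = field(default_factory=dict)  # column -> code meanings or lookup table
    null_handling: dict = field(default_factory=dict)    # column -> what a missing value means


def missing_topics(doc):
    """Names of the topics that are still empty for this table."""
    return [f.name for f in fields(doc) if f.name != "table" and not getattr(doc, f.name)]


doc = TableDoc(
    table="order_fulfillment",
    grain="One row per order status update (SCD Type 2), not one row per order.",
    null_handling={"shipping_date": "NULL means the order has not shipped yet."},
)
print(missing_topics(doc))  # ['refresh_cadence', 'field_encodings']
```

Not every table needs every slot filled, so the list of missing topics works best as a prompt for the table's owner rather than a hard requirement.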

Select Star’s automated documentation tools can help jumpstart the process of writing such accurate and detailed documentation by learning and drawing context from across your warehouse. This can allow data experts to focus their time and energy on documenting the more unique and esoteric tribal knowledge that cannot be automatically inferred.

Improve accessibility

Of course, all of this work is in vain if this documentation cannot be found! Centralizing documentation onto a single, standardized platform and enabling a robust ability to search is critical to realizing the value of documentation.

A minimum viable way to improve accessibility might be a basic solution like a shared Google Sheets file. However, this approach is severely limited if users can only find an entry by knowing the exact table name. A robust solution should better meet users where they are by returning search results based on table names, field names, and human-language field definitions. 
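
As a rough sketch of the difference, the snippet below searches across all three kinds of text at once. The catalog entries and the search function are hypothetical, and simple substring matching stands in for whatever ranking a real catalog tool would use.

```python
# Hypothetical catalog entries: table name, column name, human-language definition.
catalog = [
    ("order_fulfillment", "shipping_date", "Date the order left the warehouse; null until shipped."),
    ("order_fulfillment", "order_id", "Identifier of the order; repeats across status updates."),
    ("customers", "signup_ts", "Timestamp when the account was created."),
]


def search(catalog, query):
    """Return entries where every query word appears in the table, column, or definition."""
    words = query.lower().split()
    results = []
    for table, column, definition in catalog:
        haystack = f"{table} {column} {definition}".lower()
        if all(word in haystack for word in words):
            results.append((table, column))
    return results


print(search(catalog, "warehouse"))     # [('order_fulfillment', 'shipping_date')]
print(search(catalog, "order status"))  # [('order_fulfillment', 'order_id')]
print(search(catalog, "account"))       # [('customers', 'signup_ts')]
```

None of these queries contains an exact table name, yet each finds a relevant field because the human-language definitions are part of what gets searched.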

Now is the time to improve your docs

If you want happy users, you need data documentation; if you want effective developers, you need data documentation; and if you want a clear-eyed data strategy, you need data documentation. It’s hard to imagine a more important priority for 2024 than documentation.

Here's good news: you don’t have to tackle documentation alone! Select Star offers best-in-class automated doc tools to streamline your doc efforts. Book a demo to see how.
