You're closer to well-documented data than you think

Data documentation is an essential component of delivering useful data products. Yet despite its benefits, teams often fall into documentation debt due to two misconceptions:

Data documentation is a heavy lift to produce
Data documentation is a liability to maintain because things change and it goes stale

But here’s some good news for teams that want to uplevel their documentation but don’t know how to fit it into their busy roadmaps: documentation isn’t a separate task from what you’ve already been doing, so you’re closer to well-documented data than you think!

Focus on building latent and lasting documentation

Analytics engineers have long borrowed best practices from software engineering for their own workflows. For documentation, we can once again look to engineering for inspiration.

Different engineering communities have explored how to recycle resources created during the development of a project instead of treating documentation as a separate task. We can find many examples of latently-generated, long-lasting documentation such as:

Automated documentation generated from existing project assets, using tools like Sphinx and MkDocs for Python and pkgdown for R
Passive documentation (as described in Silona Bonewald’s O’Reilly Media book Understanding the InnerSource Checklist) reflects on how well-organized stories and written conversations between project members and users serve as useful no-cost documentation
Immutable documentation (as explored on Etsy’s Code as Craft blog) provides a framework for separating out “how-docs” and “why-docs” to better guarantee the ongoing relevance of the latter, so that only the former needs regular upkeep

Reuse makes sense both for initial documentation and for upkeep. The more closely the actual project function “depends on” documentation elements, the more likely it is that these elements will be appropriately updated as anything inside the project changes.

What documentation do you already have?

Due to the inherently cross-functional nature of data work, analytics engineering is rife with the kind of resources that can serve as the basis for robust documentation. Planning documentation, code itself, and subsequent user behavior can all be potent source material. This creates a virtuous cycle where the more that teams focus on doing core data work to the best of their abilities, the better the documentation they can get for free.

So where specifically do these documentation assets arise? Let’s consider each stage of planning, development, and user behavior one at a time.

Planning is documentation

Approaching project planning artifacts from the mindset of end-user documentation not only gives your data team a leg up on documentation but also helps you build better data products.

Ultimately, users seek documentation to clarify the intent behind a table’s entities and fields and for deeper intuition on how the table works. However, too often teams fail to align on this intent with their stakeholders, and even amongst themselves, in the planning phases of projects, resulting in failure or rework.

Tactically, we can undertake proactive project planning and product discovery while frontloading the creation of data documentation by:

Using a data dictionary as your product requirements doc: Writing out table requirements in a data dictionary in partnership with data users can help ensure that the resulting product will fully meet their needs. This is an opportunity to consider the intent of each table and its required performance capabilities, all of the fields that should be included, how fields can be strategically named to best communicate their definitions, things that must always be true about these columns (which can be converted to data quality tests), and potentially even column-level lineage describing where each field should be sourced from. These requirements can serve as a contract throughout data development and ultimately form the basis of the end-state data dictionary.
Mapping out your entity-relationship diagram to beta test the data user experience: Once you understand your user requirements, it’s important to think about how the pieces of your solution will come together into an intuitive whole. This is the core of data modeling and can be visualized with an entity-relationship diagram (ERD), which shows how the concepts represented by the different tables relate. ERDs are useful in planning for engineers since they have implications for data quality checks like primary/foreign keys and uniqueness, and they are also useful for end-users planning their queries.
Defining higher-order concepts than just field names: As you understand your fields and entities, you will likely unearth higher-order concepts which may not appear by name in your data project, yet which your team must understand in the same way (e.g. understanding what constitutes a “new” annual subscription versus a “renewal” is critical to getting the primary ID correct in a subscription table). These can be defined in a glossary.
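
To make the first idea concrete, here is a minimal sketch of how planning-stage data dictionary rows could be converted into column descriptions and data quality tests. The FieldSpec structure, the to_schema_entry helper, and the example subscription fields are all hypothetical; the output dict is only shaped loosely like a dbt model entry (a columns list with unique, not_null, and accepted_values tests) and is not something dbt produces for you.

```python
from dataclasses import dataclass, field


@dataclass
class FieldSpec:
    """One row of a planning-stage data dictionary (hypothetical structure)."""
    name: str
    description: str
    unique: bool = False
    required: bool = False
    allowed_values: list = field(default_factory=list)


def to_schema_entry(table_name, table_description, fields):
    """Turn data dictionary rows into a dict shaped like a dbt model entry."""
    columns = []
    for spec in fields:
        # "Things that must always be true" become test declarations.
        tests = []
        if spec.unique:
            tests.append("unique")
        if spec.required:
            tests.append("not_null")
        if spec.allowed_values:
            tests.append({"accepted_values": {"values": list(spec.allowed_values)}})
        column = {"name": spec.name, "description": spec.description}
        if tests:
            column["tests"] = tests
        columns.append(column)
    return {"name": table_name, "description": table_description, "columns": columns}


# Illustrative requirements agreed with data users (made-up example).
subscription_fields = [
    FieldSpec("subscription_id", "One row per annual subscription term.",
              unique=True, required=True),
    FieldSpec("subscription_type", "Whether the term is a first purchase or a renewal.",
              required=True, allowed_values=["new", "renewal"]),
    FieldSpec("cancelled_at", "Timestamp of cancellation; empty while active."),
]

entry = to_schema_entry("subscriptions", "Annual subscription terms.", subscription_fields)
for column in entry["columns"]:
    print(column["name"], column.get("tests", []))
```

Keeping the requirements in one structure like this means the same text can feed both the end-user data dictionary and the project's test configuration, so the two are less likely to drift apart.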

Development is documentation

While planning docs can be highly effective documentation, they require some proactive effort to structure and good processes to ensure that they are updated along with any changes to project plans. Even better is when we can use development artifacts themselves for truly latent documentation. A few potent examples are:

dbt field definitions: Many analytics engineers already meticulously document data fields in their projects’ YAML files. However, this documentation is often not easily accessible to end-users if it is confined to a project’s GitHub repo or the autogenerated dbt docs site with limited searchability. Of course, this information should ideally be the same as what was already written in your planning docs!
Orchestration DAG diagrams: Any modern tool with orchestration capabilities (e.g. dbt, Airflow, Prefect, Dagster) is able to visualize its dependency graph as a diagram. This creates a pragmatic complement to ER diagrams; users can use ER diagrams to plan how they want to query the data, whereas they can use DAGs to better understand what the data contains or what challenges they might encounter (e.g. e-commerce orders sourced from a fulfillment “orders shipped” table will give a lagged perspective on sales versus a financial “orders placed” table).
Tags for each field’s tests: Automatically surfacing information on what data tests are applied to each column (ideally along with results) can be a useful way for users to ground their expectations and understand aspects of uniqueness, null handling, and allowed values that can help them plan savvier and more resilient queries.
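
As a rough illustration of harvesting these development artifacts, the sketch below flattens model entries (already parsed into Python dicts, shaped like the columns section of a dbt YAML file) into one record per column with its test names as tags, and walks a dependency mapping to list everything upstream of a table. The function names, the sample models, and the depends_on mapping are assumptions made up for this example; a real project would read them from its own YAML files or from its orchestration tool's metadata.

```python
def column_catalog(models):
    """Flatten model entries into one record per column, tagged with test names."""
    records = []
    for model in models:
        for column in model.get("columns", []):
            tags = []
            for test in column.get("tests", []):
                # A test is either a bare name ("unique") or a one-key dict with arguments.
                tags.append(test if isinstance(test, str) else next(iter(test)))
            records.append({
                "table": model["name"],
                "column": column["name"],
                "description": column.get("description", ""),
                "tests": tags,
            })
    return records


def upstream(table, depends_on):
    """Return every table that feeds `table`, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(table, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(depends_on.get(parent, []))
    return sorted(seen)


# Made-up sample project metadata.
models = [{
    "name": "orders",
    "columns": [
        {"name": "order_id", "description": "Primary key.", "tests": ["unique", "not_null"]},
        {"name": "status", "tests": [{"accepted_values": {"values": ["placed", "shipped"]}}]},
    ],
}]
depends_on = {"orders": ["stg_orders_shipped"], "stg_orders_shipped": ["raw_fulfillment"]}

catalog = column_catalog(models)
lineage = upstream("orders", depends_on)
for record in catalog:
    print(record["table"], record["column"], record["tests"])
print("orders is built from:", lineage)
```

Here the lineage output would tell a user that orders is ultimately sourced from a fulfillment table, which is exactly the kind of hint that explains a lagged view of sales.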

And that’s just the tip of the iceberg! The opportunities to harvest development assets directly from your project vary greatly with the specific tool stack you choose. For example, speaking of icebergs, if your data processing is driven through a lakehouse with an open table format like Apache Iceberg, telling users which fields are used for hidden partitioning can help them write more efficient queries.

User behavior is documentation

Ironically, the more popular a data product grows, the less its data team is the sole or best source of usage documentation! Users build their own dependencies, come to their own realizations of what they can find in the data, and develop a set of their own derivative queries, tables, dashboards, reporting, and analyses that depend on the data. It’s more likely than not that other users will learn as much or more from these examples as from the raw data documentation itself (assuming that they are appropriately familiar with the base tables).

Capturing usage statistics such as a table’s top users, most used fields, and most commonly joined tables is the epitome of latent documentation. Without any type of customer success engineer mocking up projects or writing long tutorials, these statistics (sometimes along with example queries) can breathe life into data documentation by beginning to relate it back to users’ specific use cases.
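
A minimal sketch of how such statistics could be derived from a query log follows. It assumes the log is available as (user, SQL text) pairs, which is a made-up format, and it finds table names with a deliberately naive regular expression that will miscount CTEs, subqueries, and quoted identifiers. Tables that appear together in one query are treated as an approximation of "commonly joined".

```python
import re
from collections import Counter
from itertools import combinations

# Naive pattern: the identifier that follows FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def usage_stats(query_log):
    """Count queries per table, users per table, and tables queried together."""
    table_counts = Counter()
    users_by_table = {}
    join_pairs = Counter()
    for user, sql in query_log:
        # Deduplicate and sort so each pair has one canonical ordering.
        tables = sorted({name.lower() for name in TABLE_PATTERN.findall(sql)})
        for table in tables:
            table_counts[table] += 1
            users_by_table.setdefault(table, Counter())[user] += 1
        join_pairs.update(combinations(tables, 2))
    return table_counts, users_by_table, join_pairs


# Made-up sample log.
query_log = [
    ("ana", "select * from orders o join customers c on o.customer_id = c.id"),
    ("ben", "SELECT count(*) FROM orders"),
    ("ana", "select * from orders join payments on orders.id = payments.order_id"),
    ("cy", "select * from customers join orders on customers.id = orders.customer_id"),
]

table_counts, users_by_table, join_pairs = usage_stats(query_log)
print("Most queried tables:", table_counts.most_common(2))
print("Top users of orders:", users_by_table["orders"].most_common(1))
print("Most common pairings:", join_pairs.most_common(1))
```

A production version would more likely rely on the warehouse's own query history and a proper SQL parser, but even counts this simple start to show which tables matter and who to ask about them.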

Select Star makes it easy to identify and communicate with table users both to share updates about data assets and to solicit more scripts and examples of user behavior.

Unread documentation isn’t documentation

Planning docs, code and comments, and user behavior can all be great seeds for a comprehensive documentation solution. So, what doesn’t count as documentation? Documentation that users cannot find to read.

That means you have probably already done the hard part of structuring useful resources to understand your data. What your data team may feel as a lack of documentation is generally a lack of organization of those resources on a centralized and searchable platform.

Need a doc tool? Select Star offers best-in-class automated doc tools to streamline your doc efforts. Book a demo to see how.
