Nature data documentation: 3 key challenges
Introducing principles for nature data documentation.
Welcome to the Nature Data Newsletter. Each month, we share insights with data enthusiasts, GIS experts, and investors learning about nature data.
In our last newsletter, we explored challenges caused by the increased scale and complexity of nature datasets, and discussed spatial indexes. In this newsletter, we address one of the most common difficulties faced by individuals working with nature data: finding, accessing, and understanding documentation.
3 challenges faced when using nature data documentation
Nature data documentation provides essential information about datasets, enabling users to understand and effectively utilise the data. It explains what a dataset quantifies, defines each variable, guides data access, and supports analysis design. It highlights methodological or technical limitations, and indicates situations or locations where the data may be more or less reliable.
Documentation is crucial at all stages of a nature data workflow because:
Nature data is generated from a combination of technical inputs, including existing datasets, new measurements, and statistical models.
Each input carries its own biases or limitations, and data providers vary in how they handle these elements.
Data providers deliver data in a range of formats and parameters, often through custom user interfaces or data integrations.
Our research, which involved over 30 professionals working with nature data, revealed three common documentation challenges:
Incomplete or outdated: Important details are omitted, such as the regions covered by training data or basic criteria like spatial resolution. Documentation frequently lags behind live products and lacks version control for new features and changes to existing ones.
Fragmented: Users often need to work across several documents at once. Documentation can also be hidden in the bowels of websites or locked behind paywalls or email sign-ups.
Inconsistent or ambiguous: Users often need to grapple with technical language, dense prose, or inconsistent formatting. Naming conventions can be ambiguous, either because one term is used to represent multiple things or because names differ across API references, scientific documentation, and company websites.
These challenges arise because nature data products bridge different disciplines across environmental science, spatial data analysis, and software engineering. This complexity creates unique expectations for documentation.
As a result, individuals spend weeks trawling through websites, documents, and academic papers, as well as engaging directly with providers – all to get the information they need to start an analysis workflow.
Principles for nature data documentation
At Cecil, we've encountered these same challenges firsthand. We believe two key principles can help overcome them and create effective nature data documentation: completeness and accessibility.
Completeness
The cornerstone of good documentation is completeness. Users need to understand how a dataset was created, its limitations, and how to access it. Without this information, datasets are prone to misuse, such as applying them to regions where their accuracy is low.
Best-in-class examples of complete documentation include scientific papers and software documentation. Every scientific paper includes a methods section that provides enough detail about data collection, processing, and validation to repeat the study. Good software documentation is up to date, consistently named, and well organised, with detailed examples throughout.
We've outlined the minimum content that we expect in nature data documentation:
Data collection: the technologies and methods used to create input data
If using existing data, include source, version, and date accessed
If collecting new data, include devices (e.g. manufacturer, model), methods (e.g. ISO standard), and sampling details (e.g. location, plot sizes, dates)
Data processing: the steps taken to generate output data from input data
How the input data is pre-treated
The calculations or statistical models applied (e.g. linear regression), and the tools used to run them
Dataset validation: the benchmarking done to evaluate dataset accuracy
Include source, locations, and details of ground truth data, steps taken, and graphs and statistical test outputs showing performance
State dataset recommendations and limitations (e.g. Impact Observatory's Biodiversity Intactness Index is unsuitable for point-based data extraction)
Dataset structure: how the data is organised
Include format, variable types, and metadata
Apply version control to datasets and documentation, including version numbers and change logs
For metadata, adhere to domain standards (e.g. the FGDC Geospatial Metadata Standards)
Provide key nature data criteria, which we introduced in Newsletter 2: selecting nature data sources
Dataset access: how users access the dataset
Best practice is a step-by-step guide, including code snippets (e.g. Stripe)
Usage rights: the protections that apply to the data
State the standardised (e.g. CC-BY-4.0) or bespoke licence set out in a commercial agreement
State additional terms of use (e.g. WWF Living Planet Index Data Use Policy)
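As a rough illustration, the checklist above could be captured as a simple machine-readable record so that gaps are easy to spot before documentation is published. The sketch below is only one possible approach: the class name `DatasetDocumentation`, the function `missing_sections`, the field names, and the example values are all hypothetical and not part of any existing tool or standard.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetDocumentation:
    """One free-text entry per section of the minimum-content checklist.

    Field names are hypothetical; they mirror the six checklist headings.
    """
    data_collection: str = ""
    data_processing: str = ""
    dataset_validation: str = ""
    dataset_structure: str = ""
    dataset_access: str = ""
    usage_rights: str = ""


def missing_sections(doc: DatasetDocumentation) -> list[str]:
    """Return the names of checklist sections that are empty or whitespace-only."""
    return [f.name for f in fields(doc) if not getattr(doc, f.name).strip()]


# Hypothetical, partially completed record (values are illustrative only).
example = DatasetDocumentation(
    data_collection="Existing data: source, version, and date accessed are listed.",
    data_processing="Inputs resampled to a common grid; linear regression applied.",
    usage_rights="CC-BY-4.0",
)

print(missing_sections(example))
# → ['dataset_validation', 'dataset_structure', 'dataset_access']
```

A check like this only confirms that each section exists, not that its content is sufficient, so it would complement rather than replace a human review.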
Accessibility
Teams using nature data come from diverse backgrounds, including academic research, data system design, and financial services. Given this variety of expertise, it's crucial to balance completeness with accessibility. We believe this hinges on two key factors: the clarity of the documentation and sufficient context for interpretation.
Despite the technical complexity involved in creating such data, the documentation can be simplified by using clear and concise language and avoiding unnecessary jargon. Long paragraphs should be minimised and, where appropriate, text can be replaced with images or diagrams. Key details about the dataset, such as usage notes or spatial resolution, should be prominently displayed rather than embedded within dense blocks of text, ensuring that users can easily locate essential information and make decisions effectively.
It's not always possible to provide all necessary context in main documentation. Technical users may need more detail, such as equations for converting raw LiDAR waveforms into height values. Non-technical users may find such detail challenging, and instead benefit from additional learning materials to provide further background context – like our Plant Biomass concept note.
We suggest a set of simple best practices to make documentation more accessible:
Limit jargon to essential words only, with acronyms spelt out on first use
Use short paragraphs or bullets, and replace blocks of text with tables or figures
Highlight or pull out essential details, like key criteria, variable descriptions, and usage notes
Provide step-by-step guides, code snippets, and examples
Follow a standard and conventional template (see Completeness, above)
Provide URLs to deeper source documentation for technical users
Provide learning resources for less technical users (e.g. scientific concept notes)
Ensure the documentation is easily findable throughout the website and visible on the dataset page
Avoid any stage-gates impeding access to documentation (e.g. requiring email sign-ups)
Notices
Cecil is bringing London's nature data community together to geek out over dumplings and drinks on Wednesday 18th September. RSVP here.
Cecil + The Landbanking Group are hosting a breakfast during New York Climate Week on Thursday 26th September. RSVP here.
Cecil + Earthmover are bringing New York’s nature data community together to geek out over dumplings and drinks on Thursday 26th September. RSVP here.
Chloris Geospatial is hosting a breakfast during New York Climate Week to share how new VCM standards, project developers and buyers of carbon credits are using new monitoring technologies. RSVP here.
Kanop is running a pickleball, bagels and networking event at New York Climate Week. RSVP here.
Thank yous
A special thank you to everyone who supported us this month with feedback, introductions, and advice.
Adam Weiner, Alex Burns, Coby Strell, Danielle Rappaport, Elaine Mitchell, Gregg Treinish, Helen Crowley, Mac Bryla, Matthias Mohr, Nate Trappe, Noah Golmant, Peter Levine, Roberto Miethe, Romain Fau, Syakira Syafiqah, Sylvain Vaquer, and Wade Cooper.
Sponsorship
Do you want to share monthly updates with our engaged audience of 1700+ nature data users and investors? Please contact rory@cecil.earth to find out more about our newsletter sponsorship packages.
This newsletter is curated by Cecil, a team on a mission to make nature data accessible. Their platform helps data teams access analysis-ready commercial and public nature datasets, eliminating the need for cleaning, harmonising, and pre-processing tasks. Whilst currently focused on aboveground biomass, they will soon launch land cover and land use datasets to support market-leading nature-tech applications, consultants, and nature restoration professionals.
Enjoy the newsletter? Please forward to a friend.