Welcome to the Nature Data Newsletter. Here, we share insights with nature data teams, spatial data scientists, and enthusiasts learning about nature data.
A number without context has little meaning. Consider a temperature value of 23.5 – is it today’s weather or the mean temperature of the past century? Measured by hand or by satellite sensor? In Celsius or Fahrenheit? Without these details, it’s impossible to know what this number means or how to use it in a data analysis workflow.
These details are called metadata. Metadata is the information that provides context to numbers and datasets, telling users what they are, how they were generated, and how to interpret them. Metadata is critical to all domains that use data, including environmental and geospatial data science. And metadata is becoming even more essential as AI begins to leverage nature data at unprecedented scale and without human supervision.
At Cecil, we think about metadata every day and continue to explore the best ways to represent it in our platform. This newsletter shares some of the key things we’ve learnt along the way. We first examine different types of metadata, then introduce some of the most relevant metadata standards and frameworks for nature data. Finally, we touch on why metadata is becoming even more important as AI transforms how nature data is used.
Types of metadata
Nature data is multidimensional, typically consisting of a range of measurements (variables) taken at multiple locations over time (observations). Metadata can exist on all dimensions of this data.
Consider a simple case where a dataset is delivered in a table, with observations as rows and variables as columns. Metadata can apply to a single cell in that table, a single row or column, groups of rows or columns, or the entire table:
Datapoints
Datapoint-level metadata describes properties of specific observations for specific variables (i.e. individual table cells). Examples include flags for outliers, sensor failures, or other problems isolated to individual datapoints. Cell-wise metadata is usually stored in a separate table with the same row-column structure as the main data, which is then used as a mask to filter datapoints as required.
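To make the cell-wise idea concrete, here is a minimal sketch in plain Python. The variable names (ndvi, canopy_height_m) and the flag values ("ok", "outlier", "sensor_failure") are assumptions made up for illustration, not taken from any particular dataset.

```python
# Main data table: one dict per observation (row), one key per variable (column).
data = [
    {"ndvi": 0.71, "canopy_height_m": 18.2},
    {"ndvi": 0.05, "canopy_height_m": 17.9},
    {"ndvi": 0.68, "canopy_height_m": 250.0},
]

# Cell-wise metadata table with the same row-column structure as the data.
# The flag vocabulary here is hypothetical.
flags = [
    {"ndvi": "ok", "canopy_height_m": "ok"},
    {"ndvi": "sensor_failure", "canopy_height_m": "ok"},
    {"ndvi": "ok", "canopy_height_m": "outlier"},
]


def mask_flagged(data, flags, keep=("ok",)):
    """Return a copy of data where cells whose flag is not in `keep` become None."""
    return [
        {var: (value if flag_row[var] in keep else None) for var, value in row.items()}
        for row, flag_row in zip(data, flags)
    ]


clean = mask_flagged(data, flags)
```

Because the flags live in their own table, the original values stay untouched and different analyses can choose which flags to accept.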
Observations
Observation-level metadata describes properties of observations that apply to all variables (i.e. row-wise). Examples include latitude/longitude locations, sampling dates, or quality flags that apply to groups of pixels from the same raster tile (e.g. cloud cover). Increasingly, row-wise metadata is stored as additional variables in the main data table (i.e. denormalised data), which avoids joins and keeps common data operations efficient.
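As a rough sketch of the denormalised layout, the rows below carry their row-wise metadata as ordinary columns next to the measured variable. The column names, coordinates, and the 20% cloud threshold are illustrative assumptions.

```python
# Denormalised table: location, date and tile-level cloud cover are stored
# as extra columns alongside the measured variable (ndvi).
observations = [
    {"lat": -33.87, "lon": 151.21, "date": "2024-01-15", "tile_cloud_cover_pct": 5.0, "ndvi": 0.71},
    {"lat": -33.90, "lon": 151.25, "date": "2024-01-15", "tile_cloud_cover_pct": 62.0, "ndvi": 0.12},
    {"lat": -34.01, "lon": 151.10, "date": "2024-01-20", "tile_cloud_cover_pct": 18.0, "ndvi": 0.66},
]


def filter_clear(rows, max_cloud_pct=20.0):
    """Keep only observations whose tile cloud cover is at or below the threshold."""
    return [row for row in rows if row["tile_cloud_cover_pct"] <= max_cloud_pct]


clear_rows = filter_clear(observations)
```

The filter needs no lookup or join, since everything it depends on is already in the row.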
Variables
Variable-level metadata describes properties of variables that apply to all observations (i.e. column-wise). Examples include measurement units and limits (e.g. maximum 100%), or identifiers for variables from the same measurement platform (e.g. Sentinel-2 bands). Column-wise metadata is usually stored in a lookup table, where each row represents a variable in the main data table and columns contain metadata about each variable.
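One way to picture such a lookup table is a mapping from variable name to its metadata, which can then drive simple checks. This is a sketch only; the variable names, units, and limits below are assumptions for illustration.

```python
# Column-wise lookup: one entry per variable in the main data table.
# A limit of None means no documented bound on that side.
variable_metadata = {
    "tree_cover": {"unit": "%", "min": 0.0, "max": 100.0},
    "canopy_height": {"unit": "m", "min": 0.0, "max": None},
}


def out_of_range(row, metadata):
    """Return the names of variables in `row` whose value violates its documented limits."""
    bad = []
    for var, value in row.items():
        meta = metadata[var]
        below = meta["min"] is not None and value < meta["min"]
        above = meta["max"] is not None and value > meta["max"]
        if below or above:
            bad.append(var)
    return bad


suspect = out_of_range({"tree_cover": 104.0, "canopy_height": 12.0}, variable_metadata)
```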
Variable-level metadata can also allow for grouping of variables to simplify interpretation. One example is the IUCN’s Global Ecosystem Typology dataset, where the presence/absence of 109 ecosystem functional groups can be simplified to 25 biomes or 10 realms using a metadata lookup table.
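The grouping step can be sketched as follows. The three codes and their biome and realm assignments are illustrative placeholders and should be checked against the published typology; the full lookup would have one entry per functional group.

```python
# Lookup from functional group code to coarser groupings (illustrative entries only).
efg_lookup = {
    "T1.1": {"biome": "T1", "realm": "Terrestrial"},
    "T1.2": {"biome": "T1", "realm": "Terrestrial"},
    "M1.1": {"biome": "M1", "realm": "Marine"},
}

# Presence/absence of each functional group at one location (made-up values).
presence = {"T1.1": False, "T1.2": True, "M1.1": False}


def aggregate_presence(presence, lookup, level):
    """Mark a biome or realm as present if any of its functional groups is present."""
    grouped = {}
    for efg, present in presence.items():
        group = lookup[efg][level]
        grouped[group] = grouped.get(group, False) or present
    return grouped


by_biome = aggregate_presence(presence, efg_lookup, "biome")
by_realm = aggregate_presence(presence, efg_lookup, "realm")
```

The same function serves both levels because the level is just a column in the lookup table.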
Datasets
Dataset-level metadata describes properties of the entire dataset (i.e. table-wise). Common examples include dataset name or version, as well as how missing data is encoded. Table-wise metadata can be stored in a variety of ways, including dataset or file level tags, changelogs, or product documentation.
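The missing-data case shows why table-wise metadata matters in practice. In this sketch the dataset name, version, and the -9999 sentinel are all assumptions; the point is that the sentinel is read from the metadata rather than hard-coded in the analysis.

```python
# Table-wise metadata: applies to every value in the dataset.
dataset_metadata = {"name": "example_dataset", "version": "1.0.0", "nodata": -9999}

raw_values = [12.5, -9999, 14.1, -9999, 13.0]


def decode_missing(values, metadata):
    """Replace the dataset's documented nodata sentinel with None."""
    nodata = metadata["nodata"]
    return [None if v == nodata else v for v in values]


def mean_ignoring_missing(values):
    """Average the non-missing values, or return None if there are none."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


decoded = decode_missing(raw_values, dataset_metadata)
mean_value = mean_ignoring_missing(decoded)
```

Averaging raw_values directly would silently include the sentinel and produce a large negative number, which is the kind of error this metadata is there to prevent.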
Common metadata standards
It is challenging enough to organise metadata for one dataset, let alone multiple datasets that draw on different measurement technologies and bridge geospatial and environmental science domains. In fact, in our experience no two nature datasets deliver metadata the same way.
Thankfully, recent efforts to align on metadata standards and frameworks have begun to enable greater data sharing, interoperability, and automation across the nature sector. While new resources are still emerging, here are a few of the common standards and frameworks that are used for nature data today:
Geospatial data:
ISO 19115 – the global metadata standard for geospatial datasets
SpatioTemporal Asset Catalogs (STAC) – a metadata specification for indexing geospatial data
CF (Climate and Forecast) Conventions – a metadata standard for climate and forecast data, commonly used with NetCDF files
Biological data:
Darwin Core – standard identifiers, labels, and definitions for biodiversity data
Ecological Metadata Language (EML) – a standard vocabulary and syntax for ecological research data
MIxS – a metadata specification for genomic samples (e.g. eDNA)
General principles:
FAIR principles – guidelines to ensure data is findable, accessible, interoperable, and reusable
CARE principles – a framework to ensure Indigenous data sovereignty
DataCite – a framework for citing and tracking research datasets and assets
Metadata and AI
Organising metadata effectively is already important today. However, as AI becomes more prevalent in nature data applications, metadata plays an increasingly critical role in ensuring that these systems work as intended – i.e. accurately and ethically in the absence of human supervision. Some examples include:
Training data – metadata is needed to constrain predictions based on what the model was trained on (e.g. satellite-based models need metadata about sensor calibration, cloud masking, etc.)
Data quality – metadata is needed to improve predictions by flagging outliers and poor quality values (e.g. weighting gap-filled observations differently)
Automated data discovery – metadata allows AI systems to automatically find and use relevant datasets (e.g. STAC allows AI to automate discovery of satellite data for specific regions / dates)
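The data quality example above can be sketched as a weighted mean, where a gap_filled flag on each observation decides how much it counts. The field names and the 0.5 weight are assumptions for illustration; a suitable weight depends on the dataset and the model.

```python
# Each observation carries a quality flag saying whether it was gap-filled.
observations = [
    {"value": 0.70, "gap_filled": False},
    {"value": 0.50, "gap_filled": True},
    {"value": 0.72, "gap_filled": False},
]


def weighted_mean(rows, gap_filled_weight=0.5):
    """Average values, giving gap-filled observations a reduced weight."""
    weights = [gap_filled_weight if row["gap_filled"] else 1.0 for row in rows]
    total = sum(weights)
    if total == 0:
        raise ValueError("all observations have zero weight")
    return sum(w * row["value"] for w, row in zip(weights, rows)) / total


plain = weighted_mean(observations, gap_filled_weight=1.0)
downweighted = weighted_mean(observations, gap_filled_weight=0.5)
```

Without the flag, a model has no way to tell measured values from interpolated ones, so every observation would be treated as equally reliable.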
At Cecil, we build data infrastructure to make nature data accessible and interoperable. To this end, we have the privilege of supporting the whole sector in its efforts to build a standard and mature approach to metadata. We are excited to be working alongside our partners and peers on this, and look forward to the opportunity to share more soon.
Recent updates at Cecil:
The GLAD laboratory Global Forest Change dataset (from Hansen et al. 2013) is now available on Cecil
The IUCN Global Ecosystem Typology dataset (from Keith et al. 2022) is now available on Cecil
Join our Slack Community to learn from other teams tackling nature data
Get in touch if you’d like to learn more about the Cecil platform, or the work we’re doing to prepare the nature data industry for scale.
My latest episode with Cecil covers what they’re doing about metadata for nature data.
✒️ https://www.geospatial.fm/p/cecil
🎙️ http://open.spotify.com/episode/1sEKDJTzwDSLrbEuiwNV0t
🎥 https://youtu.be/DcAur7HFbcE