FAIR Data

FAIR data are human- and machine-readable, structured datasets which can be found in trusted repositories and reused under well-defined conditions. Such data should be understandable not only to a small group of experts but also to the broader scientific community and, at least to some extent, the general public.

The FAIR principles were originally described in an article published in the journal Scientific Data in 2016. Let’s first explain the four letters of the abbreviation.

Findable

As the term suggests, such data must be accompanied by machine-readable metadata so that they can be found through search engines or catalogues (e.g. Google search or EOSC Portal), and they must have a unique and persistent identifier (PID) for future reference.
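To make "machine-readable metadata" more concrete, here is a minimal sketch in Python that builds a metadata record as JSON. It assumes the schema.org Dataset vocabulary, which is one common choice for search-engine indexing but is not prescribed by the text above. The function name, the DOI and the dataset details are hypothetical placeholders; the ORCID iD is the sample iD used in ORCID's own documentation.

```python
import json


def build_dataset_metadata(name, description, doi, license_url,
                           creator_name, creator_orcid):
    """Return a metadata record as a plain dict (hypothetical helper)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # The PID is stored in its resolvable URL form.
        "identifier": f"https://doi.org/{doi}",
        "license": license_url,
        "creator": {
            "@type": "Person",
            "name": creator_name,
            "identifier": f"https://orcid.org/{creator_orcid}",
        },
    }


record = build_dataset_metadata(
    name="Example temperature measurements",          # placeholder title
    description="Hourly air temperature readings.",   # placeholder text
    doi="10.1234/example-dataset",                    # hypothetical DOI
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator_name="Josiah Carberry",
    creator_orcid="0000-0002-1825-0097",
)

# Serialise to JSON so that catalogues and harvesters can parse the record.
print(json.dumps(record, indent=2))
```

In practice a repository usually generates such a record from its deposit form, so you rarely write it by hand; the sketch only shows what kind of information ends up in it.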

Accessible

It has to be obvious how a user can access your data and under what conditions. This is usually defined by a license. Some websites state that metadata are always accessible, but this is not true for all records in trusted research environments that hold confidential and sensitive data.

Interoperable

Data are stored using open and standard formats, and their descriptions (metadata) follow well-defined and known standards. If possible, data should be combinable and exchangeable with other data to enable large, machine-driven studies of available data. The use of controlled vocabularies and community standards helps.

Reusable

Data are accompanied by rich documentation to support data interpretation and reuse by a broad research community and, potentially, citizens. Whenever possible, use globally accepted standard formats (e.g. TIFF or PNG for images and CSV or ODS for spreadsheets) and general, widely understood language. This should help to preserve the validity of the data in the long term as well.
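As an illustration of combining a standard format with documentation, the following Python sketch writes tabular data to a CSV file and stores a short description of each column next to it as JSON. The function name, the file names and the example columns are hypothetical; treat this as one possible layout, not as a required convention.

```python
import csv
import json
from pathlib import Path


def write_dataset(directory, rows, columns):
    """Write rows to data.csv and column descriptions to data_dictionary.json.

    `columns` maps each column name to a human-readable description,
    including units where they apply.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)

    data_path = directory / "data.csv"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(columns))
        writer.writeheader()
        writer.writerows(rows)

    dictionary_path = directory / "data_dictionary.json"
    dictionary_path.write_text(json.dumps(columns, indent=2), encoding="utf-8")
    return data_path, dictionary_path
```

A call such as `write_dataset("my_dataset", [{"sample_id": "S1", "temperature_c": 21.5}], {"sample_id": "Sample identifier", "temperature_c": "Air temperature in degrees Celsius"})` would produce both files in the `my_dataset` folder, ready to be deposited together.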

Persistent Identifiers

To unequivocally identify datasets in a repository, or any other digital entities and resources, we use persistent identifiers (PIDs). They reliably point to a digital entity; e.g. an ORCID iD points to a digital record about a researcher. As the name states, they must be persistent, i.e. they must not change over time. Dedicated authorities are responsible for assigning PIDs to digital resources, e.g. Crossref and DataCite for scholarly communication. PIDs for research organisations are assigned in the ROR registry. PIDs also exist for more specific entities, such as chemical compounds (e.g. the IUPAC InChI). Use these globally recognised PIDs to clearly identify the entities described in your metadata. A repository must provide a PID (e.g. a DataCite DOI) for your published dataset.
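Many PIDs have a fixed structure that software can check before the identifier is stored in metadata. As a small example, the Python sketch below validates the format and the check character of an ORCID iD, using the ISO 7064 MOD 11-2 checksum that ORCID documents for its identifiers. The function names are hypothetical, and the check only confirms that an iD is well formed, not that it is registered to a real person.

```python
import re

# Four groups of four characters; only the last character may be "X".
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if the string is a well-formed ORCID iD with a correct checksum."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]


# Sample iD from ORCID's documentation, then the same iD with a wrong last digit.
print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False
```

A check like this catches typing errors early, which matters because a mistyped PID silently points to nothing, or to the wrong entity.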

Licenses

One needs a basic driving license to drive a car. But a car, not a truck. Similarly, the license of a dataset defines whether the data are freely available or whether certain limitations apply to their use. In the case of confidential or sensitive data (see below), the data may not be available to general visitors of the repository, and a permit for data access from a designated authority may be required. Licenses known from journal websites or web archives for documents (e.g. Creative Commons) are applied to datasets as well. If you are not sure which license to use for your dataset or algorithm/code, try an advisory tool such as the License Selector from CLARIN.

Confidential and Sensitive Data