16.4 Social Impact

When making predictions from data, the provenance or lineage of the data – where it came from and how it was manipulated – can make the difference between a good prediction and nonsense. Provenance is typically recorded as metadata – data about the data – including answers to questions such as:

  • Who collected each piece of data? What are their credentials?

  • Who transcribed the information?

  • What was the protocol used to collect the data? Was the data chosen at random, chosen because it was interesting, or chosen for some other reason?

  • What were the controls? What was manipulated, when?

  • What sensors were used? What is their reliability and operating range?

  • What processing has been done to the data?

Such metadata is needed for environmental, geospatial, and social data – data about the Earth – that is collected by people and used for environmental decision making [Gil et al., 2019].
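
As a concrete illustration, such provenance can be recorded as structured metadata that travels with the data. The following sketch stores answers to the questions above in a plain Python dictionary; all field names and values are hypothetical, not a standard schema.

    # A minimal sketch of provenance metadata stored alongside a dataset.
    # All field names and values here are hypothetical illustrations.
    provenance = {
        "collected_by": "J. Smith (field ecologist, Dept. of Biology)",
        "transcribed_by": "survey team, with double-entry verification",
        "protocol": "stratified random sampling of 120 sites, 2019-2021",
        "controls": "reference sites left unmanipulated; treatment applied in spring",
        "sensors": {
            "model": "TempLogger-3",
            "operating_range_celsius": [-40, 85],
            "reported_accuracy_celsius": 0.5,
        },
        "processing": [
            "readings beyond 3 standard deviations flagged as outliers",
            "hourly readings averaged to daily means",
        ],
    }

    # Downstream code can consult the metadata before trusting a prediction,
    # for example refusing to extrapolate outside the sensor's operating range.
    low, high = provenance["sensors"]["operating_range_celsius"]

    def in_sensor_range(reading):
        return low <= reading <= high

In practice, a shared vocabulary (such as the W3C PROV ontology) is preferable to ad hoc field names; this is one motivation for the FAIR principles below.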

This is particularly important if the data is to be FAIR [Wilkinson et al., 2016]:

  • Findable – the (meta)data uses unique persistent identifiers, such as IRIs.

  • Accessible – the data is available using free and open protocols, and the metadata is accessible even when the data is not.

  • Interoperable – the vocabulary is defined using formal knowledge representation languages (ontologies).

  • Reusable – the data uses rich metadata, including provenance, and an appropriate open license, so that the community can use the data.
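
To make these principles concrete, FAIR metadata is commonly published as linked data: a persistent IRI identifies the dataset (findable), the record can be retrieved over open web protocols (accessible), the terms come from shared ontologies such as Dublin Core and W3C PROV (interoperable), and the license and provenance are stated explicitly (reusable). The following sketch uses the rdflib Python library; the dataset IRI, the agent, and all literal values are hypothetical examples.

    # A sketch of a FAIR metadata record as linked data, using rdflib.
    # The dataset IRI, agent IRI, and literal values are hypothetical.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS

    PROV = Namespace("http://www.w3.org/ns/prov#")  # W3C provenance ontology

    # Findable: a persistent IRI identifies the dataset.
    dataset = URIRef("https://example.org/datasets/stream-temperature-2021")

    g = Graph()
    g.bind("dcterms", DCTERMS)
    g.bind("prov", PROV)

    # Interoperable: terms come from shared ontologies (Dublin Core, PROV).
    g.add((dataset, DCTERMS.title,
           Literal("Stream temperature observations, 2019-2021")))
    g.add((dataset, DCTERMS.creator,
           Literal("Dept. of Biology field survey team")))
    g.add((dataset, PROV.wasAttributedTo,
           URIRef("https://example.org/agents/field-survey-team")))

    # Reusable: an explicit open license and a provenance statement.
    g.add((dataset, DCTERMS.license,
           URIRef("https://creativecommons.org/licenses/by/4.0/")))
    g.add((dataset, DCTERMS.provenance,
           Literal("hourly sensor readings averaged to daily means")))

    # Accessible: serialize to an open format (Turtle) that can be served
    # over standard web protocols.
    print(g.serialize(format="turtle"))

The resulting metadata record can be indexed and retrieved by a repository even when access to the underlying data itself is restricted.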

Data repositories based on these principles are available for many areas, including Earth observations [NASA, 2022], social sciences [King, 2007], computational workflows [Goble et al., 2020], and all research domains [Springer Nature, 2022]. FAIR data is an important part of modern data-driven science; however, researchers who have commercial or military reasons to see themselves as being in competition with one another may have an incentive not to follow the FAIR guidelines.

Stodden et al. [2016], Gil et al. [2017], and Sikos et al. [2021] survey ways to enhance reproducibility in data science. Gebru et al. [2021] propose 57 questions about the content of a dataset and the workflow used to produce it.