Paleoclimate Data Standards

From Linked Earth Wiki
A key objective of LinkedEarth is to promote the development of a community standard for paleoclimate data and metadata.

= Background =

== What is a standard? ==
[http://www.earthcube.org/document/2015/ecstandardsrecs EarthCube defines] a standard as follows:
  
a public specification documenting some practice or technology that is adopted and used by a community. [..] There is a continuum starting with any documented practice in some community. If lots of people use a particular documented practice it could be adopted as a best practice. If almost everyone uses some documented practice, then it is a de facto standard.
  
Notice the emphasis on community and on practice. If only one person uses a technical specification, it's not a standard. If it's voted on but not applied in practice, it's worthless as well. Thus, the objective of this EarthCube activity is to propose a standard with broad community appeal and adoption.
  
== Why do we need standards? ==
Modern life would simply be unlivable without standards. Imagine having to use a separate browser for each web page you visit, or a separate power-transmission system for every appliance you use! You only have to travel to a country that uses a different electric plug than the one your computer and phone employ to appreciate what a nightmare that would be. In science, the ultimate objective of a standard is to make data understandable by others (including machines), and the derived analyses reproducible. Thus, a key objective of LinkedEarth is to promote the development of a community standard for paleoclimate data. Indeed, despite some ad hoc gatherings among communities of interest over many years, until recently there had never been a concerted effort to produce a standard applicable to all paleoclimate observations. Given the growing importance of synthesis work (e.g. PAGES2k, Shakun et al. 2012, Marcott et al. 2013, MARGO), it is increasingly important that a common solution be found.
  
= Prior Work =

== Pre-2016 ==
Here is a non-exhaustive list of past discussions that we are aware of:
  
*  [[media: Reporting_Standards_for_Paleoceanographic_PMIP3_Dec2013.docx  | PMIP3 Workshop in Corvallis]] (December 2013)
* [http://www.clivar.org/sites/default/files/documents/TriesteReport_nov5.pdf report for the workshop "REPRESENTING AND REDUCING UNCERTAINTIES IN HIGH-RESOLUTION PROXY CLIMATE DATA"] (Trieste, Italy, June 2008)
  
We welcome any summaries of prior discussions on data standards you may have had at meetings or workshops: either link to a document that you have uploaded to the wiki, or create a new page and list it here.
  
== LiPD ==
* "essential" = data cannot be uploaded without it (the dataset would be utterly useless without any of this information missing)
+
The [[Linked_Paleo_Data|LiPD]] format embodies one part of this solution: it offers a container that can wrap tightly around a wide variety of paleoclimate [[:Category:Dataset (L)|datasets]], providing a vessel for paleoclimate content. Other formats would, of course, be acceptable; however, no viable alternative currently exists, which is why [[Nicholas_McKay | Nick]] and [[J._Emile-Geay | Julien]] had to go through the trouble of inventing such a format. Another reason to adopt it is that '''there is a growing code ecosystem being developed around LiPD''' in Matlab, R and Python: the [https://github.com/nickmckay/LiPD-utilities LiPD utilities], which allow crosswalks between many commonly used formats, [http://nickmckay.github.io/GeoChronR/ GeoChronR] for the analysis of time-uncertain paleo data, and [https://github.com/LinkedEarth/Pyleoclim_util Pyleoclim] to visualize and analyze the data.
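To give a concrete flavor of that ecosystem, here is a minimal sketch of reading a LiPD file with the Python LiPD utilities. It assumes the <code>lipd</code> package is installed, that <code>example.lpd</code> is a hypothetical local file, and that the key names follow common (but not guaranteed) LiPD conventions.

<syntaxhighlight lang="python">
# Minimal sketch (not an official example): read a LiPD file and list its variables
# with the Python LiPD utilities. "example.lpd" is a hypothetical filename.
import lipd

dataset = lipd.readLipd("example.lpd")   # nested dictionary of metadata + data tables

# Flatten the dataset into a list of time-series objects (one per variable)
ts_list = lipd.extractTs(dataset)

for ts in ts_list:
    # Key names follow the LiPD time-series convention; availability varies by dataset
    print(ts.get("dataSetName"),
          ts.get("archiveType"),
          ts.get("paleoData_variableName"),
          ts.get("paleoData_units"))
</syntaxhighlight>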
* "Essential metadata" should mean something different for legacy datasets vs. new datasets
+
Why not just stick with LiPD and call it a day, you ask? Well, LiPD's infinite flexibility is a double-edged sword: it can accommodate all manner of information, but that information may or may not align with community best practices. It is thus '''necessary for the community to decide on such practices'''. In other words, while LiPD provides a field-tested answer to the question of ''how'' paleoclimate data should be stored, it says nothing about '''''what''''' should be stored: that decision is up to the community.
* "Essential" may vary by archive type
+
  
== First workshop on paleoclimate data standards ==
The [[PDS_workshop_2016 | 2016 workshop on paleoclimate data standards]] (PDS workshop, for short) served as a stepping stone to initiate a broader process of community engagement and feedback elicitation, with the goal of generating such a community-vetted standard. The workshop identified a need to '''delineate a set of essential, recommended and desired properties for each dataset'''.
By default, any and all information is desired. A subset of that should be recommended to ensure optimal re-use. A still smaller subset is '''essential''', in the sense that a paleoclimate dataset should not be acceptable without this information (for more details, see the [[PDS_workshop_2016 | PDS workshop page]]). Four additional themes emerged:
=== Cross-Archive Standards ===

Some essential data/metadata are shared among all conceivable archive types (a schematic sketch follows this list):
  
* '''A table with at least two columns''', one representing time, the other a climate indicator of some sort
* '''Geolocation''': coordinates, polygons, or, in cases where coordinates or polygons cannot be given, general location info
* '''Source''': PI, contributor, or database (the first author of the publication if published, or some other person who can speak for the dataset if it is unpublished or the first author is no longer active)
* '''Names''' and '''Units''' of the variables in the dataset
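Purely as an illustration (this is not the official LiPD or PaCTS schema), these four essentials could be captured in a structure as simple as the following, where all field names are hypothetical placeholders:

<syntaxhighlight lang="python">
# Hypothetical minimal record illustrating the cross-archive essentials above;
# field names are placeholders, not the official LiPD/PaCTS vocabulary.
minimal_record = {
    "source": "J. Doe (PI / contributor / database)",       # who can speak for the data
    "geolocation": {"latitude": -2.5, "longitude": 117.0},  # or a polygon / general location
    "variables": [
        {"name": "age", "units": "yr BP"},                  # time axis
        {"name": "d18O", "units": "permil"},                # climate indicator
    ],
    # the table itself: at least two columns, time plus a climate indicator
    "data": {
        "age":  [100, 200, 300],
        "d18O": [-1.2, -1.5, -1.1],
    },
}
</syntaxhighlight>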
  
=== Archive-specific standards ===
What is needed to intelligently re-use an annually resolved marine record could be quite different than what is needed to intelligently re-use an ice core record, for instance. Therefore, these levels are archive-specific.
  
=== Legacy vs Modern datasets ===
The group also recognized that standards need to be more stringent for modern datasets than for legacy datasets, for which some (meta)data are sometimes impossible to procure (think: raw radiocarbon dates from a PI now deceased). Thus, for every archive and across archives, there needs to be a different set of standards for each kind.
What constitutes "legacy" data is also open to interpretation, and requires a formal definition (and a vote).
  
=== Audience and Purpose ===
Finally, it was recognized that all of the above considerations are a function of who is using the data, and for what purpose. As such, it is useful to consider a few science drivers.
  
= Science Drivers =
One possible way to elaborate a standard is to ask oneself: "What pieces of metadata would I require to reproduce this particular dataset, and therefore which ones should I provide for my own datasets?" Some of these metadata will be standard across all archives (e.g., geographic coordinates, publication information). Some will be specific to each archive or observation type. See [[:Category:Marine_Sediment_Working_Group | here]] for an example of such discussions on foraminiferal Mg/Ca.
  
== Querying the datasets ==
Another way to think about the essential/recommended/optional metadata is to think about the kind of queries one would want to enable for one's research. For instance, a recent study on [http://linked.earth/wp-content/uploads/2016/01/Khider_AGUFall16.pdf Testing the Millennial-Scale Solar-Climate Connection in the Indo-Pacific Warm Pool] required the following queries and associated metadata (a sketch of such filters, in Python, appears after the table):
{| class="wikitable"
|-
! Query || Required Metadata
|-
| SST-sensitive proxies (i.e. Mg/Ca, Uk37 and TEX86) || Needed a property describing the proxy observations, as well as a standard way to express the concepts of Mg/Ca, TEX86 and UK37. This need gave rise to [[:Property:ProxyObservationType_(L)]] and terms in the ontology [[:Category:ProxyObservation_(L)]] to describe these concepts.
|-
| Holocene data (0-10 ka) spanning at least 5 kyr || Needed a property describing the concept of [[Age]], giving rise to [[:Property:InferredVariableType (L)]]. An ontology for [[:Category:InferredVariable_(L)]] is in progress. It also became obvious that we needed to standardize the way to represent time and the units of time. Furthermore, since the wiki doesn't read the content of the .csv file, some metadata about the values stored in the .csv file were also needed. We therefore created the following properties: [[:Property:HasMinValue (L)]], [[:Property:HasMaxValue (L)]], [[:Property:HasMeanValue (L)]].
|-
| In the Indo-Pacific Warm Pool || We needed to query the dataset by latitude/longitude.
|}
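As a hedged illustration (not the actual code behind the study), such queries can be expressed as filters over the flattened LiPD time-series objects from the earlier sketch; the key names used below (<code>paleoData_proxyObservationType</code>, <code>geo_meanLat</code>, <code>age</code>, ...) follow common LiPD conventions but should be treated as assumptions.

<syntaxhighlight lang="python">
# Illustrative filters mirroring the queries in the table above.
# `ts_list` is the list of flattened time-series objects from the earlier sketch.

def is_sst_proxy(ts):
    """SST-sensitive proxies: Mg/Ca, Uk37, TEX86 (assumed spellings)."""
    return ts.get("paleoData_proxyObservationType") in {"Mg/Ca", "Uk37", "TEX86"}

def covers_holocene(ts, max_age_ka=10, min_span_ka=5):
    """At least `min_span_ka` kyr of data within 0 to `max_age_ka` ka (ages in yr BP)."""
    ages = ts.get("age")
    if not ages:
        return False
    ages_ka = [a / 1000.0 for a in ages]
    in_window = [a for a in ages_ka if 0 <= a <= max_age_ka]
    return bool(in_window) and (max(in_window) - min(in_window)) >= min_span_ka

def in_warm_pool(ts):
    """Rough Indo-Pacific Warm Pool bounding box (illustrative coordinates only)."""
    lat, lon = ts.get("geo_meanLat"), ts.get("geo_meanLon")
    return (lat is not None and lon is not None
            and -15 <= lat <= 15 and 90 <= lon <= 160)

selected = [ts for ts in ts_list
            if is_sst_proxy(ts) and covers_holocene(ts) and in_warm_pool(ts)]
</syntaxhighlight>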
Other types of basic query include searching for a particular publication (by DOI, title, journal, or authors) and searching by [[:Category:ProxyArchive_(L) | archiveType]]. The latter is currently used to obtain the maps under each archive category (for example, [[:Category:MarineSediment | marine sediments]]). Enter the queries you'd like to perform, and the metadata required to perform them.
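In the same spirit, these basic queries could be sketched as follows; the publication key (<code>pub1_doi</code>) in particular is an assumed name from the flattened metadata, not a guaranteed field.

<syntaxhighlight lang="python">
# Illustrative "basic" queries over flattened LiPD time-series objects.
# Key names ("archiveType", "pub1_doi") are assumptions about the flattening.

def by_archive(ts_list, archive="MarineSediment"):
    return [ts for ts in ts_list if ts.get("archiveType") == archive]

def by_doi(ts_list, doi):
    return [ts for ts in ts_list
            if (ts.get("pub1_doi") or "").lower() == doi.lower()]
</syntaxhighlight>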
== Analyzing the datasets ==
Finally, this problem can be approached from a data analysis point of view: "Given my interests, what information would I need to perform my analysis?"
For instance, if all that is known about a dataset is the geographical coordinates from which the archive was taken, then the only possible analysis is to create a location map of said archive (with or without its nearest neighbors in the database). On the other hand, if an age and a paleo variable are contained within the PaleoDataTable, the resulting time series can be used for correlation and spectral analysis. However, without the raw age data, it would be impossible to recalibrate the record using updated techniques, limiting its usefulness. To ask the most questions of our datasets, and do the best science, we need them to be as complete as possible.
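For the "age + paleo variable" case, a minimal sketch using [https://github.com/LinkedEarth/Pyleoclim_util Pyleoclim] might look like this (assuming a recent version of the package; the series below is synthetic and purely illustrative):

<syntaxhighlight lang="python">
# Minimal sketch: build a time series and run a spectral analysis with Pyleoclim.
# The data below are synthetic; with a real dataset, `time` and `value` would come
# from the age and paleo columns of a PaleoDataTable.
import numpy as np
import pyleoclim as pyleo

time = np.arange(0, 10000, 50)                                   # age, yr BP
value = np.sin(2 * np.pi * time / 1000) + 0.3 * np.random.randn(len(time))

series = pyleo.Series(time=time, value=value,
                      time_name="Age", time_unit="yr BP",
                      value_name="d18O", value_unit="permil")

psd = series.spectral(method="lomb_scargle")   # handles unevenly spaced series too
fig, ax = psd.plot()
</syntaxhighlight>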
 
For the Holocene study, the datasets had to contain the raw radiocarbon measurements for use with [https://cran.r-project.org/web/packages/Bchron/vignettes/Bchron.html Bchron]. [http://linked.earth/doing-science-with-linkedearth/ This blog post] describes how richer, standardized metadata supported the study. This [https://github.com/khider/Holocene-Millennial-Scale-Variability/blob/master/AGU%20Fall%202016/Holocene_MillennialScaleVar.md Jupyter Notebook] integrates code, data and text to walk you through the story. None of this would have been possible without [[Linked_Paleo_Data|LiPD]].

= Process for achieving a paleoclimate data standard =
== Working Groups ==
Attendees of the [[PDS workshop 2016]] proposed that archive-centric [[:Category:Working_Group | working groups]] (WGs; self-assembled coalitions of knowledgeable experts) would be best positioned to elaborate and discuss the components of a data standard for their specific sub-field of paleoclimatology. It is also critical to ensure interoperability between standards to enable longitudinal (multiproxy) investigations.
[[:Category:Working_Group | Working groups]] have been formed and are now being consulted to generate the backbone of a standard, which was presented to the community at the [http://www.pages-osm.org PAGES OSM meeting] in Zaragoza (May 9-13, 2017).
 

Current Working Groups:
<DynamicPageList>
category = Working Group
shownamespace = false
</DynamicPageList>

== International Partners ==
This process contributes to the [http://pastglobalchanges.org/ini/int-act/data-stewardship data stewardship initiative] of our [http://www.pastglobalchanges.org/about/general-overview PAGES/Future Earth] partners. Therefore, we are working together with PAGES to reach out to the broadest cross-section of paleoscientists and invite them to contribute to the process. The end goal is a precisely documented standard adopted by both LinkedEarth and PAGES. The standard will be implemented in all LinkedEarth activities and proposed for adoption by [http://earthcube.org/ EarthCube], the [https://rd-alliance.org/ Research Data Alliance], the [http://www.esipfed.org/ Federation of Earth Science Information Partners], [https://www.ncdc.noaa.gov/data-access/paleoclimatology-data NOAA WDS-Paleo] and [https://www.pangaea.de/ Pangaea].

== Voting ==

=== Wiki-based voting ===
The first phase of input on data standards (up to October 2017), open only to LinkedEarth members, has leveraged the wiki's polling ability.  The results of such polls are summarized in [[PollingStats|Advanced Statistics]].

=== Google Survey ===
The second phase of input, open to all scientists who want to be associated with the Data Standards publication, is available [https://t.co/SUhcYojPJq here] as a Google Survey. It is active until November 10, 2017.

= Standard Publication =
The first iteration of the PaleoClimate reporTing Standard (PaCTS v1.0) was published in ''Paleoceanography and Paleoclimatology'' in December 2019. The paper is open-access and available [https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/2019PA003632 here]. The standard, at present, is best described as aspirational. Next steps were discussed at the December 2019 AGU Fall Meeting. If you want to participate, contact [[D._Khider]] or [[J._Emile-Geay]].
