Difference between revisions of "Paleoclimate Data Standards"

From Linked Earth Wiki
(Voting: included link to Google Survey)
(44 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
= Background =
 
= Background =
 
== What is a standard? ==
 
== What is a standard? ==
[http://www.earthcube.org/document/2015/ecstandardsrecs EarthCube defines] a standard as follows:
+
[http://www.earthcube.org/document/2015/ecstandardsrecs EarthCube defines] a standard as follows:
  
  ''a public specification documenting some practice or technology that is adopted and used by a community. [..] There is a continuum starting with any documented practice in some community. If lots of people use a particular documented practice it could be adopted as a best practice. If almost everyone uses some documented practice, then it is a de facto standard''.
+
  a public specification documenting some practice or technology that is adopted and used by a community. [..] There is a continuum starting with any documented practice in some community. If lots of people use a particular documented practice it could be adopted as a best practice. If almost everyone uses some documented practice, then it is a de facto standard.
  
 
Notice the emphasis on community and on practice. If only one person uses a technical specification, it's not a standard. If it's voted on but not applied in practice, it's worthless as well. Thus, the objective of this EarthCube activity is to propose a standard with broad community appeal and adoption.
 
Notice the emphasis on community and on practice. If only one person uses a technical specification, it's not a standard. If it's voted on but not applied in practice, it's worthless as well. Thus, the objective of this EarthCube activity is to propose a standard with broad community appeal and adoption.
  
 
== Why do we need standards? ==
 
== Why do we need standards? ==
This is a bit like asking why we need water.  Modern life would simply be unlivable without standards. Imagine having to use a separate browser for each web page your visit, or a separate power-transmission system for every appliance you use! You only have to travel to a country that uses a different electric plug than the one your computer and phone employ to appreciate what a nightmare that would be. In science, the ultimate objective of a standard to make data understandable by others (including machines), and the derived analyses reproducible. Thus, a key objective of LinkedEarth is to promote the development of a community standard for paleoclimate data. Indeed, despite some ad-hoc gatherings among communities of interest over many years, until recently there had never been a concerted effort to produce a standard applicable to all paleoclimate observations. Given the increased importance of synthesis work (e.g. PAGES2k, Shakun et al 2012, Marcott et al 2013, MARGO, others), it is increasingly important that a common solution be found.
+
Modern life would simply be unlivable without standards. Imagine having to use a separate browser for each web page you visit, or a separate power-transmission system for every appliance you use! You only have to travel to a country that uses a different electric plug than the one your computer and phone employ to appreciate what a nightmare that would be. In science, the ultimate objective of a standard is to make data understandable by others (including machines), and the derived analyses reproducible. Thus, a key objective of LinkedEarth is to promote the development of a community standard for paleoclimate data. Indeed, despite some ad-hoc gatherings among communities of interest over many years, until recently there had never been a concerted effort to produce a standard applicable to all paleoclimate observations. Given the increased importance of synthesis work (e.g. PAGES2k, Shakun et al. 2012, Marcott et al. 2013, MARGO, others), it is increasingly important that a common solution be found.
  
 
= Prior Work =
 
= Prior Work =
 +
== Pre-2016 ==
 +
Here is a non-exhaustive list of past discussions that we are aware of:
 +
 +
*  [[media: Reporting_Standards_for_Paleoceanographic_PMIP3_Dec2013.docx  | PMIP3 Workshop in Corvallis]] (December 2013)
 +
* [http://www.clivar.org/sites/default/files/documents/TriesteReport_nov5.pdf report for the workshop "REPRESENTING AND REDUCING UNCERTAINTIES IN HIGH-RESOLUTION PROXY CLIMATE DATA"] (Trieste, Italy, June 2008)
 +
 +
We welcome any summaries of prior discussions on data standards you may have had at meetings or workshops. To do so, either link to a document that you have uploaded on the wiki or create a new page and list it here.
 +
 
== LiPD ==
 
== LiPD ==
The Linked Paleo Data ([http://linked.earth/projects/lipd/ LiPD]) format embodies one part of this solution: it offers a container that can wrap tightly around a wide varieties of paleoclimate [[:Category:Dataset ©|datasets]],  providing a vessel for paleoclimate content. Other formats, of course, would be acceptable; however, there is no viable alternative currently in existence, which is why [[J._Emile-Geay | Julien]] and [[Nicholas_McKay | Nick]] had to go through the trouble of inventing such a format. Another reason to adopt it is that '''there is a growing code ecosystem being developed around LiPD''' in Matlab, R and Python: the [https://github.com/nickmckay/LiPD-utilities LiPD utilities] that allow cross-walk between all kinds of commonly-used formats, [http://nickmckay.github.io/GeoChronR/ GeoChronR] for the analysis of time-uncertain paleo data, and [https://github.com/LinkedEarth/Pyleoclim_util Pyleoclim] to visualize and analyze the data.   
+
The [[Linked_Paleo_Data|LiPD]] format embodies one part of this solution: it offers a container that can wrap tightly around a wide variety of paleoclimate [[:Category:Dataset (L)|datasets]], providing a vessel for paleoclimate content. Other formats, of course, would be acceptable; however, there is no viable alternative currently in existence, which is why [[Nicholas_McKay | Nick]] and [[J._Emile-Geay | Julien]] had to go through the trouble of inventing such a format. Another reason to adopt it is that '''there is a growing code ecosystem being developed around LiPD''' in Matlab, R and Python: the [https://github.com/nickmckay/LiPD-utilities LiPD utilities] that allow cross-walk between all kinds of commonly-used formats, [http://nickmckay.github.io/GeoChronR/ GeoChronR] for the analysis of time-uncertain paleo data, and [https://github.com/LinkedEarth/Pyleoclim_util Pyleoclim] to visualize and analyze the data.
Why not just stick with LiPD and call it a day, you ask?  Well, LiPD's infinite flexibility is a double-edge sword: it can accommodate all manner of information, but that information may or may not align with community best practices. It is thus '''necessary for the community to decide on such practices'''.  In other words, if LiDP provides a field-tested answer to the question: ''how should paleoclimate data be stored?'', it says nothing about '''''what''''' should be stored: that decision is up to the community.
+
Why not just stick with LiPD and call it a day, you ask?  Well, LiPD's infinite flexibility is a double-edged sword: it can accommodate all manner of information, but that information may or may not align with community best practices. It is thus '''necessary for the community to decide on such practices'''.  In other words, if LiPD provides a field-tested answer to the question: ''how should paleoclimate data be stored?'', it says nothing about '''''what''''' should be stored: that decision is up to the community.
  
 
== First workshop on paleoclimate data standards ==  
 
== First workshop on paleoclimate data standards ==  
 
The [[PDS_workshop_2016 | 2016 workshop on paleoclimate data standards]] (PDS workshop, for short) served as a stepping stone to initiate a broader process of community engagement and feedback elicitation, with the goal of generating such a community-vetted standard. The workshop identified a need to '''delineate a set of essential, recommended and desired properties for each dataset'''.  
 
The [[PDS_workshop_2016 | 2016 workshop on paleoclimate data standards]] (PDS workshop, for short) served as a stepping stone to initiate a broader process of community engagement and feedback elicitation, with the goal of generating such a community-vetted standard. The workshop identified a need to '''delineate a set of essential, recommended and desired properties for each dataset'''.  
By default, any and all information is desired. A subset of that should be recommended to ensure optimal re-use. Yet a smaller subset of that is '''essential''' in the sense that a paleoclimate data set should not be acceptable without this information (for more details, see the [[PDS_workshop_2016 | PDS workshop page]]). Four additional themes emerged:
+
By default, any and all information is desired. A subset of that should be recommended to ensure optimal re-use. Yet a smaller subset of that is '''essential''' in the sense that a paleoclimate data set should not be acceptable without this information (for more details, see the [[PDS_workshop_2016 | PDS workshop page]]). Four additional themes emerged:
 
   
 
   
 
=== Cross-Archive Standards ===
 
=== Cross-Archive Standards ===
Line 23: Line 31:
  
 
* '''A table with at least two columns''', one representing time, the other a climate indicator of some sort
 
* '''A table with at least two columns''', one representing time, the other a climate indicator of some sort
* '''Geolocation''' : coordinates, polygons, or, in cases where coordinates or polygons cannot be given, general location info)
+
* '''Geolocation''': coordinates, polygons, or, in cases where coordinates or polygons cannot be given, general location info
* '''Source''': PI, contributor, or database (first author of publication if published, or some other person who can speak for the dataset if not published/if the first author is no longer in science/etc)
+
* '''Source''': PI, contributor, or database (first author of publication if published, or some other person who can speak for the dataset if not published/if the first author is no longer in science/etc)
 
* '''Names''' and '''Units''' of the variables in the dataset
 
* '''Names''' and '''Units''' of the variables in the dataset
  
=== Archive-specific standards ===
+
=== Archive-specific standards ===
 
What is needed to intelligently re-use an annually resolved marine record could be quite different from what is needed to intelligently re-use an ice core record, for instance. Therefore, these levels (essential, recommended, desired) are archive-specific.
 
What is needed to intelligently re-use an annually resolved marine record could be quite different from what is needed to intelligently re-use an ice core record, for instance. Therefore, these levels (essential, recommended, desired) are archive-specific.
  
=== Legacy vs Modern datasets ===
+
=== Legacy vs Modern datasets ===
 
The group also recognized that standards need to be more stringent for modern datasets than for legacy datasets, for which some (meta)data are sometimes impossible to procure (think: raw radiocarbon dates from a PI now deceased). Thus, for every archive and across archives, there needs to be a different set of standards for each kind.
 
The group also recognized that standards need to be more stringent for modern datasets than for legacy datasets, for which some (meta)data are sometimes impossible to procure (think: raw radiocarbon dates from a PI now deceased). Thus, for every archive and across archives, there needs to be a different set of standards for each kind.
 
What constitutes "legacy" data is also open to interpretation, and requires a formal definition (and a vote).
 
What constitutes "legacy" data is also open to interpretation, and requires a formal definition (and a vote).
  
 
=== Audience and Purpose ===
 
=== Audience and Purpose ===
Finally, all of the above considerations are a function of who is using the data, and for what purpose. As such, it is useful to consider a few science drivers.
+
Finally, it was recognized that all of the above considerations are a function of who is using the data, and for what purpose. As such, it is useful to consider a few science drivers.
 
+
== Pre-2016 ==
+
 
+
Here is a non-exhaustive list of past discussions that we are aware of:
+
 
+
- [[media: Reporting_Standards_for_Paleoceanographic_PMIP3_Dec2013.docx  | PMIP3 Workshop in Corvallis]] (December 2013) [[User:Khider|Deborah Khider]] ([[User talk:Khider|talk]]) 17:00, 30 September 2016 (PDT)
+
 
+
- [http://www.clivar.org/sites/default/files/documents/TriesteReport_nov5.pdf report for the workshop "REPRESENTING AND REDUCING UNCERTAINTIES IN HIGH-RESOLUTION PROXY CLIMATE DATA"] (June 2008) [[User:Jeg|JEG]] ([[User talk:Jeg|talk]]) 22:46, 21 February 2017 (PST)
+
 
+
We welcome any summaries of prior discussions on data standards you may have had at meetings or workshops. To do so, either link to a document that you have uploaded on the wiki or create a new page and list it below. Sign and date your entry using the following notation: <pre>~~~~</pre>
+
  
 
= Science Drivers =
 
= Science Drivers =
One possible way to elaborate a standard is to ask oneself: "What pieces of metadata would I require to reproduce this particular dataset and therefore which one should I provide for my own datasets?" Some of these metadata will be standard across all archives (i.e., geographic coordinates, publication information). Some will be archive (and even observation)-specific. See [[:Category:Marine_Sediment_Working_Group | here]] for the beginning of a discussion on foraminiferal Mg/Ca.
+
One possible way to elaborate a standard is to ask oneself: "What pieces of metadata would I require to reproduce this particular dataset and therefore which ones should I provide for my own datasets?" Some of these metadata will be standard across all archives (e.g., geographic coordinates, publication information). Some will be specific to each archive or observation type. See [[:Category:Marine_Sediment_Working_Group | here]] for an example of such discussions on foraminiferal Mg/Ca.
  
 
== Querying the datasets ==
 
== Querying the datasets ==
  
Another way to think about the essential/recommended/optional metadata is to think about the kind of queries one would want to enable for their research. For instance, [http://linked.earth/wp-content/uploads/2016/01/Khider_AGUFall16.pdf Testing the Millennial-Scale Solar-Climate Connection in the Indo-Pacific Warm Pool] required the following query and associated metadata:
+
Another way to think about the essential/recommended/optional metadata is to think about the kind of queries one would want to enable for their research. For instance, a recent study on [http://linked.earth/wp-content/uploads/2016/01/Khider_AGUFall16.pdf Testing the Millennial-Scale Solar-Climate Connection in the Indo-Pacific Warm Pool] required the following queries and associated metadata:
  
 
{| class="wikitable"
 
{| class="wikitable"
Line 58: Line 56:
 
! Query || Required Metadata
 
! Query || Required Metadata
 
|-
 
|-
| SST-sensitive proxies (i.e. Mg/Ca, Uk37 and TEX86) || Needed a property describing the proxy observations as well as a standard to express the concept of Mg/Ca, TEX86 and UK37. This need gave rise to [[:Property:OnProxyObservationProperty_©]] and terms in the ontology [[:Category:ProxyObservation_©]] to describe these concepts.
+
| SST-sensitive proxies (i.e. Mg/Ca, Uk37 and TEX86) || Needed a property describing the proxy observations as well as a standard to express the concept of Mg/Ca, TEX86 and UK37. This need gave rise to [[:Property:ProxyObservationType_(L)]] and terms in the ontology [[:Category:ProxyObservation_(L)]] to describe these concepts.
 
|-
 
|-
| Holocene data (0-10ka) spanning at least 5kyr || Needed a property describing the concept of [[Age]], giving rise to [[:Property:OnInferredVariableProperty]]. An ontology for [[:Category:InferredVariable_ ©]] is in progress. It also became obvious that we needed to standardize the way to represent time and the units of time. Furthermore, since the wiki doesn't read the content of the csv file, some metadata about the values stored in the .csv file were also needed. We therefore created the following properties: [[:Property:HasMinValue]], [[:Property:HasMaxValue]],[[:Property:HasMeanValue]].  
+
| Holocene data (0-10ka) spanning at least 5kyr || Needed a property describing the concept of [[Age]], giving rise to [[:Property:InferredVariableType (L)]]. An ontology for [[:Category:InferredVariable_(L)]] is in progress. It also became obvious that we needed to standardize the way to represent time and the units of time. Furthermore, since the wiki doesn't read the content of the .csv file, some metadata about the values stored in the .csv file were also needed. We therefore created the following properties: [[:Property:HasMinValue (L)]], [[:Property:HasMaxValue (L)]], [[:Property:HasMeanValue (L)]].  
 
|-
 
|-
| In the IndoPacific Warm Pool || We needed to query the dataset by latitude/longitude
+
| In the IndoPacific Warm Pool || We needed to query the dataset by latitude/longitude.
 
|}
 
|}
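
The three query criteria in the table can be expressed as a simple filter. The Python sketch below is only an illustration: the dictionary keys, the example datasets, and the latitude/longitude box are hypothetical placeholders (the source does not define the bounds of the Indo-Pacific Warm Pool), and the reading of "spanning at least 5kyr" as "at least 5 kyr of coverage within 0-10 ka" is an assumption.

```python
# Proxy types named in the query table
SST_PROXIES = {"Mg/Ca", "Uk37", "TEX86"}
HOLOCENE_KA = (0.0, 10.0)   # query window, in ka
MIN_SPAN_KYR = 5.0          # minimum coverage inside the window

# Placeholder bounding box, NOT an official definition of the Warm Pool
BOX_LAT = (-15.0, 15.0)
BOX_LON = (100.0, 160.0)


def holocene_coverage(age_min_ka, age_max_ka):
    """Length (kyr) of the overlap between a record and the Holocene window."""
    start = max(age_min_ka, HOLOCENE_KA[0])
    end = min(age_max_ka, HOLOCENE_KA[1])
    return max(0.0, end - start)


def matches(ds, lat_range=BOX_LAT, lon_range=BOX_LON):
    """True if a dataset satisfies all three query criteria."""
    return (
        ds["proxy"] in SST_PROXIES
        and holocene_coverage(ds["age_min_ka"], ds["age_max_ka"]) >= MIN_SPAN_KYR
        and lat_range[0] <= ds["lat"] <= lat_range[1]
        and lon_range[0] <= ds["lon"] <= lon_range[1]
    )


# Made-up records for illustration only
datasets = [
    {"name": "A", "proxy": "Mg/Ca", "age_min_ka": 0.5, "age_max_ka": 9.8, "lat": 5.0, "lon": 125.0},
    {"name": "B", "proxy": "d18O", "age_min_ka": 0.0, "age_max_ka": 10.0, "lat": 2.0, "lon": 130.0},
    {"name": "C", "proxy": "TEX86", "age_min_ka": 0.0, "age_max_ka": 3.0, "lat": -5.0, "lon": 110.0},
    {"name": "D", "proxy": "Uk37", "age_min_ka": 1.0, "age_max_ka": 12.0, "lat": 40.0, "lon": 140.0},
    {"name": "E", "proxy": "Uk37", "age_min_ka": 2.0, "age_max_ka": 15.0, "lat": -8.0, "lon": 118.0},
]

selected = [ds["name"] for ds in datasets if matches(ds)]
print(selected)
```

Each criterion maps onto one piece of metadata: the proxy type, the minimum and maximum age values, and the coordinates. A dataset missing any of them cannot be evaluated by the query at all.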
  
Other types of basic query include querying for a particular publication, using either the DOI, title, journal, authors, and querying by [[:Category:ProxyArchive_ © | archiveType]]. The later is currently used to obtain the maps under each archive category (for example, [[:Category:MarineSediment | marine sediments]]). Enter the queries you'd like to perform, and the metadata required to perform them.
+
Other types of basic query include searching for a particular publication, using the DOI, title, journal, or authors, and searching by [[:Category:ProxyArchive_ (L) | archiveType]]. The latter is currently used to obtain the maps under each archive category (for example, [[:Category:MarineSediment | marine sediments]]). Enter the queries you'd like to perform, and the metadata required to perform them.
  
 
== Analyzing the datasets ==
 
== Analyzing the datasets ==
Line 71: Line 69:
 
Finally, this problem can be approached from a data analysis point of view: "Given my interest, what information would I need to perform my analysis?"
 
Finally, this problem can be approached from a data analysis point of view: "Given my interest, what information would I need to perform my analysis?"
  
For instance, if all is know about a dataset is the geographical coordinates from which the archive was taken, then the only possible analysis is to create a location map of said archive (with or without its nearest neighbor contained in the database.) On the other hand, if an age and inferred variable are contained within the PaleoDataTable, the resulting time series can be used for correlation analysis and spectral analysis. However, without the raw data, it would be impossible to recalibrate the record using updated techniques, further limiting its usefulness.
+
For instance, if all that is known about a dataset is the geographical coordinates from which the archive was taken, then the only possible analysis is to create a location map of said archive (with or without its nearest neighbor contained in the database). On the other hand, if an age and paleo variable are contained within the PaleoDataTable, the resulting time series can be used for correlation analysis and spectral analysis. However, without the raw age data, it would be impossible to recalibrate the record using updated techniques, limiting its usefulness. To ask the most questions of our datasets, and do the best science, we need them to be as complete as possible.
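
The dependence of possible analyses on available content can be summarized in a few lines of Python. This is a minimal sketch of the reasoning in the paragraph above; the function name and its boolean flags are hypothetical and do not correspond to any LiPD or LinkedEarth API.

```python
def possible_analyses(has_coordinates, has_time_series, has_raw_age_data):
    """List the analyses a dataset supports, given which content it provides."""
    analyses = []
    if has_coordinates:
        analyses.append("location map")
    if has_time_series:
        # An age column plus a paleo variable yields a usable time series
        analyses.extend(["correlation analysis", "spectral analysis"])
    if has_raw_age_data:
        # Raw age data are needed to rebuild the age model with updated techniques
        analyses.append("recalibration")
    return analyses


print(possible_analyses(True, False, False))
print(possible_analyses(True, True, True))
```

The more complete the dataset, the longer the returned list, which is the practical argument for richer metadata.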
  
For the Holocene study, the Dataset had to contain the raw radiocarbon measurements for use with Bchron.
+
For the Holocene study, the datasets had to contain the raw radiocarbon measurements for use with [https://cran.r-project.org/web/packages/Bchron/vignettes/Bchron.html Bchron]. [http://linked.earth/doing-science-with-linkedearth/ This blog post] describes how richer and standardized metadata supported the study. This [https://github.com/khider/Holocene-Millennial-Scale-Variability/blob/master/AGU%20Fall%202016/Holocene_MillennialScaleVar.md Jupyter Notebook] integrates code, data and text to walk you through the story. None of this would have been possible without [[Linked_Paleo_Data|LiPD]].
 
+
+
  
 
= Process for achieving a paleoclimate data standard=
 
= Process for achieving a paleoclimate data standard=
 +
== Working Groups ==
 +
Attendees of the [[PDS workshop 2016]] proposed that archive-centric [[:Category:Working_Group | working groups]] (WGs; self-assembled coalitions of knowledgeable experts) would be best positioned to elaborate and discuss the components of a data standard for their specific sub-field of paleoclimatology. It is also critical to ensure interoperability between standards to enable longitudinal (multiproxy) investigations.
 +
[[:Category:Working_Group | Working groups]] have been formed and are now being consulted to generate the backbone of a standard, which has been presented to the community at the [http://www.pages-osm.org PAGES OSM meeting] in Zaragoza (May 9-13, 2017).
  
Attendees of the [[:Category:PDS workshop 2016 | 2016 PDS workshop]] proposed that archive-centric [[:Category:Working_Group | working groups]] (WGs; self-assembled coalitions of knowledgeable experts) would be best positioned to elaborate and discuss the components of a data standard for their specific sub-field of paleoclimatology. It is also critical to ensure interoperability between standards to enable longitudinal (multiproxy) investigations.
+
Current Working Groups:
 +
<DynamicPageList>
 +
category = Working Group
 +
shownamespace = false
 +
</DynamicPageList>
  
This process contributes to the [http://pastglobalchanges.org/ini/int-act/data-stewardship data stewardship initiative] of our [http://www.pages-igbp.org/about/general-overview PAGES/Future Earth] partners. Therefore, we are working together with PAGES to reach out to the broadest cross-section of paleoscientists and invite them to contribute to the process. The end goal is a standard to be precisely documented and adopted by LinkedEarth and PAGES. The standard will be implemented in all LinkedEarth activities and proposed for adoption by [http://earthcube.org/ EarthCube], the [https://rd-alliance.org/ Research Data Alliance], the [http://www.esipfed.org/ Federation of Earth Science Information Partners], [https://www.ncdc.noaa.gov/data-access/paleoclimatology-data NOAA WDS-Paleo] and [https://www.pangaea.de/ Pangaea].
+
== International Partners ==
 +
This process contributes to the [http://pastglobalchanges.org/ini/int-act/data-stewardship data stewardship initiative] of our [http://www.pastglobalchanges.org/about/general-overview PAGES/Future Earth] partners. Therefore, we are working together with PAGES to reach out to the broadest cross-section of paleoscientists and invite them to contribute to the process. The end goal is a standard to be precisely documented and adopted by LinkedEarth and PAGES. The standard will be implemented in all LinkedEarth activities and proposed for adoption by [http://earthcube.org/ EarthCube], the [https://rd-alliance.org/ Research Data Alliance], the [http://www.esipfed.org/ Federation of Earth Science Information Partners], [https://www.ncdc.noaa.gov/data-access/paleoclimatology-data NOAA WDS-Paleo] and [https://www.pangaea.de/ Pangaea].
  
[[:Category:Working_Group | Working groups]] have been formed and are now being consulted to generate the backbone of a standard, which will be presented to the community at the [http://www.pages-osm.org PAGES OSM meeting] in Zaragoza (May 9-13, 2017).
+
== Voting ==
 +
=== Wiki-based voting ===
 +
The first phase of input on data standards (up to October 2017), open only to LinkedEarth members, has leveraged the wiki's polling ability.  The results of such polls are summarized in [[PollingStats|Advanced Statistics]].
 +
=== Google Survey ===
 +
The second phase of input, open to all scientists who want to be associated with the Data Standards publication, is available [https://t.co/SUhcYojPJq here] as a Google Survey. It is active until November 10, 2017.
  
 
= Standard Publication =
 
= Standard Publication =
 
Once the community has spoken on these matters, the decisions will be summarized in a publication.  
 
Once the community has spoken on these matters, the decisions will be summarized in a publication.  
  
  A formal standard is a specification of some practice that is adopted by a recognized standards body. The set of formal standards and set of de facto standards intersect, but are not the same; some formal standards are not very widely used. Nonetheless, because of the community participation and rigor required to formalize the standard we recognize that they merit careful evaluation. [http://www.earthcube.org/document/2015/ecstandardsrecs]  
+
  A formal standard is a specification of some practice that is adopted by a recognized standards body. The set of formal standards and set of de facto standards intersect, but are not the same; some formal standards are not very widely used. Nonetheless, because of the community participation and rigor required to formalize the standard we recognize that they merit careful evaluation. [http://www.earthcube.org/document/2015/ecstandardsrecs]  
 +
 
 +
In the internet age, a standard can be a web-based document that details all the specifications pertaining to a technical matter. To encourage community participation and promote transparency, the LinkedEarth team will lead a '''crowd-sourced peer-reviewed publication''' detailing this standard.
 +
 
 +
==Platform==
 +
The writing process will take place in the collaborative platform [https://www.authorea.com Authorea] and synthesize the decisions taken, and the pertinent discussions that led to such decisions.
  
In the internet age, a standard can be a web-based document that details all the specifications pertaining to a technical matter. However, to encourage participation and promote transparency, the LinkedEarth team decided that the standard should be published  in a '''crowd-sourced peer-reviewed publication'''.  
+
==Authorship==
 +
Pursuant to PAGES policies, authorship will be extremely inclusive and acknowledge all scientific input into the process. Anyone contributing to the discussion on developing standards for paleoclimatology (either during the [[PDS workshop 2016]], on the wiki, or via teleconferences and in-person exchanges) will be invited to join the author list.
  
The writing process will likely take place in [https://www.authorea.com Authorea] and synthesize the decisions taken, and the pertinent discussions that led to such decisions. Pursuant to PAGES policies, authorship will be extremely inclusive and acknowledge all scientific input into the process. Anyone contributing to the discussion on developing standards for paleoclimatology (either during the [[:Category:PDS workshop 2016 | 2016 PDS workshop]], on the wiki, or via teleconferences an in-person exchanges) will be included in the author list.
+
Authorship will be consortium-based (provisional name: "Working Group on Paleo Data Standards").

Revision as of 10:42, 24 October 2017

Background

What is a standard?

EarthCube defines a standard as follows:

a public specification documenting some practice or technology that is adopted and used by a community. [..] There is a continuum starting with any documented practice in some community. If lots of people use a particular documented practice it could be adopted as a best practice. If almost everyone uses some documented practice, then it is a de facto standard.

Notice the emphasis on community and on practice. If only one person uses a technical specification, it's not a standard. If it's voted on but not applied in practice, it's worthless as well. Thus, the objective of this EarthCube activity is to propose a standard with broad community appeal and adoption.

Why do we need standards?

Modern life would simply be unlivable without standards. Imagine having to use a separate browser for each web page you visit, or a separate power-transmission system for every appliance you use! You only have to travel to a country that uses a different electric plug than the one your computer and phone employ to appreciate what a nightmare that would be. In science, the ultimate objective of a standard is to make data understandable by others (including machines), and the derived analyses reproducible. Thus, a key objective of LinkedEarth is to promote the development of a community standard for paleoclimate data. Indeed, despite some ad-hoc gatherings among communities of interest over many years, until recently there had never been a concerted effort to produce a standard applicable to all paleoclimate observations. Given the increased importance of synthesis work (e.g. PAGES2k, Shakun et al. 2012, Marcott et al. 2013, MARGO, others), it is increasingly important that a common solution be found.

Prior Work

Pre-2016

Here is a non-exhaustive list of past discussions that we are aware of:

  • PMIP3 Workshop in Corvallis (December 2013)
  • Report for the workshop "REPRESENTING AND REDUCING UNCERTAINTIES IN HIGH-RESOLUTION PROXY CLIMATE DATA" (Trieste, Italy, June 2008)

We welcome any summaries of prior discussions on data standards you may have had at meetings or workshops. To do so, either link to a document that you have uploaded on the wiki or create a new page and list it here.

LiPD

The LiPD format embodies one part of this solution: it offers a container that can wrap tightly around a wide variety of paleoclimate datasets, providing a vessel for paleoclimate content. Other formats, of course, would be acceptable; however, there is no viable alternative currently in existence, which is why Nick and Julien had to go through the trouble of inventing such a format. Another reason to adopt it is that there is a growing code ecosystem being developed around LiPD in Matlab, R and Python: the LiPD utilities that allow cross-walk between all kinds of commonly-used formats, GeoChronR for the analysis of time-uncertain paleo data, and Pyleoclim to visualize and analyze the data. Why not just stick with LiPD and call it a day, you ask? Well, LiPD's infinite flexibility is a double-edged sword: it can accommodate all manner of information, but that information may or may not align with community best practices. It is thus necessary for the community to decide on such practices. In other words, if LiPD provides a field-tested answer to the question: how should paleoclimate data be stored?, it says nothing about what should be stored: that decision is up to the community.

First workshop on paleoclimate data standards

The 2016 workshop on paleoclimate data standards (PDS workshop, for short) served as a stepping stone to initiate a broader process of community engagement and feedback elicitation, with the goal of generating such a community-vetted standard. The workshop identified a need to delineate a set of essential, recommended and desired properties for each dataset. By default, any and all information is desired. A subset of that should be recommended to ensure optimal re-use. Yet a smaller subset of that is essential in the sense that a paleoclimate data set should not be acceptable without this information (for more details, see the PDS workshop page). Four additional themes emerged:

Cross-Archive Standards

Some essential data/metadata are shared among all conceivable archive types:

  • A table with at least two columns, one representing time, the other a climate indicator of some sort
  • Geolocation: coordinates, polygons, or, in cases where coordinates or polygons cannot be given, general location info
  • Source: PI, contributor, or database (first author of publication if published, or some other person who can speak for the dataset if not published/if the first author is no longer in science/etc)
  • Names and Units of the variables in the dataset
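
As a rough illustration, the four cross-archive essentials listed above could be checked programmatically. The sketch below assumes a dataset represented as a plain Python dictionary; the key names (columns, geo, source) and the function missing_essentials are hypothetical, chosen for this example only, and do not correspond to the LiPD specification or to any LinkedEarth tool.

```python
def missing_essentials(dataset):
    """Return a list of cross-archive essentials that `dataset` lacks.

    `dataset` is a plain dict with hypothetical keys (NOT the LiPD schema).
    """
    problems = []

    # 1. A table with at least two columns (time + a climate indicator).
    columns = dataset.get("columns", [])
    if len(columns) < 2:
        problems.append("table needs at least two columns")

    # 2. Geolocation: coordinates, a polygon, or a general description.
    geo = dataset.get("geo", {})
    if not any(geo.get(key) for key in ("coordinates", "polygon", "description")):
        problems.append("no geolocation information")

    # 3. Source: PI, contributor, or database.
    if not dataset.get("source"):
        problems.append("no source")

    # 4. Names and units for every variable in the table.
    for i, column in enumerate(columns):
        if not column.get("name"):
            problems.append(f"column {i} has no name")
        if not column.get("units"):
            problems.append(f"column {i} has no units")

    return problems


# Illustrative record: the second column is missing its units.
example = {
    "columns": [
        {"name": "age", "units": "yr BP", "values": [100, 200, 300]},
        {"name": "Mg/Ca", "units": "", "values": [3.1, 3.4, 3.2]},
    ],
    "geo": {"coordinates": (-5.0, 120.0)},
    "source": "first author of the publication",
}
print(missing_essentials(example))  # ['column 1 has no units']
```

A check of this kind only covers the cross-archive essentials; archive-specific requirements would need their own rules.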

Archive-specific standards

What is needed to intelligently re-use an annually resolved marine record could be quite different from what is needed to intelligently re-use an ice core record, for instance. Therefore, the essential, recommended and desired levels are archive-specific.

Legacy vs Modern datasets

The group also recognized that standards need to be more stringent for modern datasets than for legacy datasets, for which some (meta)data are sometimes impossible to procure (think: raw radiocarbon dates from a PI now deceased). Thus, for every archive and across archives, there needs to be a different set of standards for each kind. What constitutes "legacy" data is also open to interpretation, and requires a formal definition (and a vote).

Audience and Purpose

Finally, it was recognized that all of the above considerations are a function of who is using the data, and for what purpose. As such, it is useful to consider a few science drivers.

Science Drivers

One possible way to elaborate a standard is to ask oneself: "What pieces of metadata would I require to reproduce this particular dataset, and therefore which ones should I provide for my own datasets?" Some of these metadata will be standard across all archives (e.g., geographic coordinates, publication information). Some will be specific to each archive or observation type. See here for an example of such discussions on foraminiferal Mg/Ca.

Querying the datasets

Another way to think about the essential/recommended/desired metadata is to think about the kind of queries one would want to enable for one's research. For instance, a recent study, "Testing the Millennial-Scale Solar-Climate Connection in the Indo-Pacific Warm Pool", required the following queries and associated metadata:

Query | Required Metadata
SST-sensitive proxies (i.e. Mg/Ca, Uk37 and TEX86) | Needed a property describing the proxy observations as well as a standard to express the concept of Mg/Ca, TEX86 and UK37. This need gave rise to Property:ProxyObservationType_(L) and terms in the ontology Category:ProxyObservation_(L) to describe these concepts.
Holocene data (0-10ka) spanning at least 5kyr | Needed a property describing the concept of Age, giving rise to Property:InferredVariableType (L). An ontology for Category:InferredVariable_(L) is in progress. It also became obvious that we needed to standardize the way to represent time and the units of time. Furthermore, since the wiki doesn't read the content of the .csv file, some metadata about the values stored in the .csv file were also needed. We therefore created the following properties: Property:HasMinValue (L), Property:HasMaxValue (L), Property:HasMeanValue (L).
In the IndoPacific Warm Pool | We needed to query the dataset by latitude/longitude.
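
To make the link between queries and metadata concrete, here is a minimal sketch of how the three queries listed above might be run once the corresponding metadata exist. Everything in it is an assumption made for illustration: the records are invented, the field names (proxy, lat, lon, age) are hypothetical stand-ins for the wiki properties, ages are taken to be in years BP, and the latitude/longitude box is an arbitrary rectangle rather than an agreed definition of the Indo-Pacific Warm Pool.

```python
def summarize(values):
    """Min, max and mean of a column, in the spirit of HasMinValue,
    HasMaxValue and HasMeanValue (precomputed so a query need not
    open the .csv file)."""
    return {"min": min(values), "max": max(values), "mean": sum(values) / len(values)}


def matches(record, proxies, age_window, min_span, lat_range, lon_range):
    """True if `record` satisfies the proxy, age and location queries."""
    window_start, window_end = age_window
    # Portion of the record that overlaps the requested age window.
    start = max(record["age"]["min"], window_start)
    end = min(record["age"]["max"], window_end)
    return (
        record["proxy"] in proxies
        and end - start >= min_span
        and lat_range[0] <= record["lat"] <= lat_range[1]
        and lon_range[0] <= record["lon"] <= lon_range[1]
    )


# Invented records, for illustration only.
records = [
    {"name": "A", "proxy": "Mg/Ca", "lat": -5.0, "lon": 120.0,
     "age": summarize([200, 4000, 9500])},
    {"name": "B", "proxy": "d18O", "lat": 2.0, "lon": 130.0,
     "age": summarize([0, 5000, 10000])},
    {"name": "C", "proxy": "TEX86", "lat": 1.0, "lon": 125.0,
     "age": summarize([6000, 8000, 9000])},
]

selected = [
    r["name"]
    for r in records
    if matches(
        r,
        proxies={"Mg/Ca", "UK37", "TEX86"},  # SST-sensitive proxies
        age_window=(0, 10000),               # Holocene, 0-10 ka
        min_span=5000,                       # at least 5 kyr of coverage
        lat_range=(-15, 15),                 # assumed bounding box
        lon_range=(90, 160),
    )
]
print(selected)  # ['A']
```

Record B is rejected on proxy type and record C on its 3 kyr span, which shows why a controlled vocabulary for proxy names and summary values for the age column are both needed before such a query can work.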

Other types of basic query include searching for a particular publication by DOI, title, journal or authors, and searching by archiveType. The latter is currently used to obtain the maps under each archive category (for example, marine sediments). Enter the queries you'd like to perform, and the metadata required to perform them.

Analyzing the datasets

Finally, this problem can be approached from a data analysis point of view: "Given my interest, what information would I need to perform my analysis?"

For instance, if all that is known about a dataset is the geographical coordinates from which the archive was taken, then the only possible analysis is to create a location map of said archive (with or without its nearest neighbor contained in the database). On the other hand, if an age and a paleo variable are contained within the PaleoDataTable, the resulting time series can be used for correlation analysis and spectral analysis. However, without the raw age data, it would be impossible to recalibrate the record using updated techniques, limiting its usefulness. To ask the most questions of our datasets, and do the best science, we need them to be as complete as possible.
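
The reasoning in the preceding paragraph can be written down as a small lookup from available information to feasible analyses. This is only a sketch: the field names (coordinates, age, paleo_variable, raw_age_data) and the function possible_analyses are hypothetical labels for this example, and the list of analyses is limited to those mentioned in the text.

```python
def possible_analyses(available):
    """Map a set of available fields to the analyses they permit."""
    analyses = []
    # Coordinates alone only allow a location map.
    if "coordinates" in available:
        analyses.append("location map")
    # A time axis plus a paleo variable gives a usable time series.
    if {"age", "paleo_variable"} <= available:
        analyses += ["correlation analysis", "spectral analysis"]
    # Raw age data are needed to rebuild the age model.
    if "raw_age_data" in available:
        analyses.append("age model recalibration")
    return analyses


print(possible_analyses({"coordinates"}))
# ['location map']
print(possible_analyses({"coordinates", "age", "paleo_variable"}))
# ['location map', 'correlation analysis', 'spectral analysis']
```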

For the Holocene study, the datasets had to contain the raw radiocarbon measurements for use with Bchron. This blog post describes how richer and standardized metadata supported the study. This Jupyter Notebook integrates code, data and text to walk you through the story. None of this would have been possible without LiPD.

Process for achieving a paleoclimate data standard

Working Groups

Attendees of the PDS workshop 2016 proposed that archive-centric working groups (WGs; self-assembled coalitions of knowledgeable experts) would be best positioned to elaborate and discuss the components of a data standard for their specific sub-field of paleoclimatology. It is also critical to ensure interoperability between standards to enable longitudinal (multiproxy) investigations. Working groups have been formed and are now being consulted to generate the backbone of a standard, which was presented to the community at the PAGES OSM meeting in Zaragoza (May 9-13, 2017).

Current Working Groups:


International Partners

This process contributes to the data stewardship initiative of our PAGES/Future Earth partners. Therefore, we are working together with PAGES to reach out to the broadest cross-section of paleoscientists and invite them to contribute to the process. The end goal is a standard to be precisely documented and adopted by LinkedEarth and PAGES. The standard will be implemented in all LinkedEarth activities and proposed for adoption by EarthCube, the Research Data Alliance, the Federation of Earth Science Information Partners, NOAA WDS-Paleo and Pangaea.

Voting

Wiki-based voting

The first phase of input on data standards (up to October 2017), open only to LinkedEarth members, has leveraged the wiki's polling ability. The results of such polls are summarized in Advanced Statistics.

Google Survey

The second phase of input, open to all scientists who want to be associated with the Data Standards publication, is available here as a Google Survey. It is active until November 10, 2017.

Standard Publication

Once the community has spoken on these matters, the decisions will be summarized in a publication.

A formal standard is a specification of some practice that is adopted by a recognized standards body. The set of formal standards and set of de facto standards intersect, but are not the same; some formal standards are not very widely used. Nonetheless, because of the community participation and rigor required to formalize the standard we recognize that they merit careful evaluation. [1] 

In the internet age, a standard can be a web-based document that details all the specifications pertaining to a technical matter. To encourage community participation and promote transparency, the LinkedEarth team will lead a crowd-sourced peer-reviewed publication detailing this standard.

Platform

The writing process will take place on the collaborative platform Authorea. The resulting document will synthesize the decisions taken and the pertinent discussions that led to them.

Authorship

Pursuant to PAGES policies, authorship will be extremely inclusive and acknowledge all scientific input into the process. Anyone contributing to the discussion on developing standards for paleoclimatology (either during the PDS workshop 2016, on the wiki, or via teleconferences and in-person exchanges) will be invited to join the author list.

Authorship will be consortium-based (provisional name: "Working Group on Paleo Data Standards")