Difference between revisions of "Linked Paleo Data"
Nick mckay (→LiPD components)
(updated links)
(22 intermediate revisions by 3 users not shown)
Latest revision as of 03:56, 11 August 2021
LiPD: Linked Paleo Data
The LiPD vision
Paleoclimate investigators have made a major effort over the past decade to make their data available to the broader community, largely through online archiving systems like the World Data Center for Paleoclimatology and Pangaea. However, there is no agreed-upon data standard for how to store and exchange such data. As the number of records in these archives has grown, making connections manually has thus become more and more challenging, hampering integrative efforts at the very time they should be flourishing. Paleoclimatologists thus need a common tongue to describe their datasets to each other and to machines.
LiPD (Linked Paleo Data) proposes such a common tongue. It is a universally readable data container that organizes data and metadata in a uniform way, so that they may be web-searchable, and so that a variety of code functionalities may be built once and applied instantly to any dataset that observes the standard.
For more details, see this article (http://www.clim-past-discuss.net/11/4309/2015/cpd-11-4309-2015.html).
LiPD in LinkedEarth
LiPD is a convenient way to store and exchange paleoclimate data, and provides the backbone of the LinkedEarth edifice. LiPD is closely aligned with the LinkedEarth Ontology; changes in one are often reflected in the other (with a small lag).
Structuring data: the LiPD Way
LiPD components
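There are six possible components to a LiPD dataset. To be considered valid, a LiPD file must observe the structure shown in the figure and described below, and contain basic metadata for at least some of these categories.
[Figure: structure of a LiPD file. For an animated presentation see this prezi (http://prezi.com/svmit2ys9_d7/?utm_campaign=share&utm_medium=copy&rc=ex0share)]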
Root Metadata
This describes metadata that applies to the whole dataset. Common examples are:
- Dataset Name
- Investigators
- Link to online dataset
- LiPD version
Geographic Metadata
Here the site location is described following the GeoJSON standard. This includes:
- Coordinates
- Sitename
- Descriptive location, e.g.:
  - Country
  - State
  - Province
  - Ocean
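As a rough illustration of the geographic metadata listed above, the following Python sketch builds a GeoJSON Feature for a single site. The GeoJSON parts (a `Feature` with a `Point` geometry and longitude-first coordinates) follow the GeoJSON standard; the property names (`siteName`, `country`, `state`) and the site itself are assumptions made for this example, not the official LiPD keys.

```python
import json


def make_geo_feature(lat, lon, elevation=None, site_name=None, **location):
    """Build a GeoJSON Feature describing one site.

    GeoJSON orders positions as [longitude, latitude, (elevation)].
    Property names such as "siteName" are illustrative assumptions.
    """
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")

    coordinates = [lon, lat] if elevation is None else [lon, lat, elevation]

    # Keep only the descriptive fields that were actually provided.
    properties = {"siteName": site_name, **location}
    properties = {key: value for key, value in properties.items() if value is not None}

    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": coordinates},
        "properties": properties,
    }


# Hypothetical site, for illustration only.
geo = make_geo_feature(
    46.5, -121.75, elevation=1200,
    site_name="Example Lake", country="United States", state="Washington",
)
print(json.dumps(geo, indent=2))
```

Putting longitude before latitude is the most common source of mistakes here, which is why the sketch takes `lat` and `lon` as named arguments and range-checks both.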
Publication Metadata
Publications, whether traditional or data publications, are described following the BibJSON standard. This includes:
- Authors
- Title
- Journal
- DOI
- Year
- URL
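The publication fields above could be assembled into a BibJSON-style record along the following lines. This is a minimal sketch: the layout (a list of author objects, a `journal` object, `identifier` and `link` lists) follows common BibJSON conventions, but the exact keys that LiPD expects are an assumption here, and the author names, title, and DOI are placeholders rather than a real reference.

```python
def make_publication(authors, title, journal, year, doi=None, url=None):
    """Return a BibJSON-style dictionary for one publication.

    The key names follow common BibJSON usage; treat them as an
    illustration rather than the authoritative LiPD schema.
    """
    record = {
        "type": "article",
        "title": title,
        "author": [{"name": name} for name in authors],
        "journal": {"name": journal},
        "year": str(year),
    }
    if doi:
        record["identifier"] = [{"type": "doi", "id": doi}]
    if url:
        record["link"] = [{"url": url}]
    return record


# Placeholder values, not a real publication.
pub = make_publication(
    authors=["A. Author", "B. Author"],
    title="An example paleoclimate record",
    journal="Example Journal",
    year=2015,
    doi="10.0000/example",
)
print(pub["author"][0]["name"], pub["year"])
```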
Funding Metadata
Includes information about how the research that produced the data was funded, including:
- Funding agency
- Funding grant
PaleoData Table
This includes all of the measured and inferred paleoenvironmental data - essentially all of the measurements and data from the study that were not used to infer the age. This includes:
Measurement Tables
- Column names, units and descriptions
- Interpretation metadata
- Calibration metadata
Models
- Methodology used to produce the model results
- Summary table of model output
- Ensemble tables of data produced by the model
- Distribution tables of data produced by the model
ChronData Table
This mirrors PaleoData but only includes the data from the study that were used to infer the age of the sequence. This includes:
Measurement Tables
- Column names, units and descriptions
Models
- Methodology used to produce the model results
- Summary table of model output
- Ensemble tables of data produced by the model
- Distribution tables of data produced by the model
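The PaleoData and ChronData layout described above can be pictured as nested records: each section holds measurement tables (whose columns carry their own metadata) and models (a method plus summary, ensemble, and distribution tables). The Python dataclasses below are a hedged sketch of that shape; all class, field, file, and variable names are hypothetical and chosen for readability, not taken from the LiPD specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    name: str
    units: str
    description: str = ""
    interpretation: Optional[dict] = None  # PaleoData only in the text above
    calibration: Optional[dict] = None     # PaleoData only in the text above


@dataclass
class Table:
    filename: str  # CSV file that holds the values
    columns: List[Column] = field(default_factory=list)


@dataclass
class Model:
    method: dict  # methodology used to produce the model results
    summary_table: Optional[Table] = None
    ensemble_tables: List[Table] = field(default_factory=list)
    distribution_tables: List[Table] = field(default_factory=list)


@dataclass
class DataSection:
    """One PaleoData or ChronData entry; both share the same shape."""
    measurement_tables: List[Table] = field(default_factory=list)
    models: List[Model] = field(default_factory=list)


def column_names(section):
    """List the names of every measurement-table column in a section."""
    return [col.name for table in section.measurement_tables for col in table.columns]


# Hypothetical dataset: data not used for dating go in paleo, dating data in chron.
paleo = DataSection(measurement_tables=[Table(
    filename="paleo_measurements.csv",
    columns=[
        Column("depth", "cm", "depth below surface"),
        Column("d18O", "permil", "oxygen isotope ratio",
               interpretation={"variable": "temperature"}),
    ],
)])
chron = DataSection(
    measurement_tables=[Table(
        filename="chron_measurements.csv",
        columns=[Column("depth", "cm"), Column("age14C", "yr BP")],
    )],
    models=[Model(method={"name": "hypothetical age model"})],
)

print(column_names(paleo))
print(column_names(chron))
```

Using one `DataSection` type for both components reflects the statement that ChronData mirrors PaleoData; only the role of the data (dating versus paleoenvironment) differs.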
LiPD implementation
LiPD is centered on JSON-LD (http://json-ld.org/), a JSON-based format compliant with the Linked Data paradigm. JavaScript Object Notation (JSON) is an extremely lightweight and flexible way to encode information, and has become the leading format for data exchange on the Web. Linked Data are datasets that observe common rules so that they can be automatically linked through the World Wide Web. A LiPD file (.lpd) is in fact a zipped folder that follows the BagIt standard (https://en.wikipedia.org/wiki/BagIt). It includes:
- The BagIt "payload":
  - One JSON-LD file that describes all of the metadata
  - CSV files that hold the data from all of the tables
- BagIt complementary information, including:
  - a “bag-info.txt” file which details metadata for the bag, using colon-separated key/value pairs
  - a tag manifest file which lists tag files and their associated MD5 checksums
BagIt's checksums make it possible to verify that the files on your computer are bit-for-bit identical to those on the server.
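To make the checksum idea concrete, here is a minimal Python sketch that recomputes MD5 checksums for the payload files of a zipped bag and compares them with the bag's `manifest-md5.txt`. The manifest name and its "checksum, whitespace, path" line format come from the BagIt specification; the folder layout inside the archive and the demo file names are assumptions for this example. The tag manifest mentioned above could be checked in the same way.

```python
import hashlib
import io
import posixpath
import zipfile


def verify_payload_md5(lpd_file):
    """Return {payload path: True/False} by checking a zipped bag's MD5 manifest."""
    results = {}
    with zipfile.ZipFile(lpd_file) as archive:
        names = set(archive.namelist())
        manifests = [n for n in names if posixpath.basename(n) == "manifest-md5.txt"]
        if not manifests:
            raise ValueError("no manifest-md5.txt found in archive")
        manifest = manifests[0]
        bag_root = posixpath.dirname(manifest)  # manifest paths are relative to this

        for line in archive.read(manifest).decode("utf-8").splitlines():
            if not line.strip():
                continue
            expected, rel_path = line.split(None, 1)
            rel_path = rel_path.strip()
            member = posixpath.join(bag_root, rel_path)
            if member not in names:
                results[rel_path] = False  # listed in the manifest but missing
                continue
            actual = hashlib.md5(archive.read(member)).hexdigest()
            results[rel_path] = actual == expected.lower()
    return results


def make_demo_bag():
    """Build a tiny in-memory bag so the sketch can run without a real .lpd file."""
    payload = b"depth,d18O\n1,-3.2\n"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("bag/bagit.txt", "BagIt-Version: 0.97\n")
        archive.writestr("bag/data/table.csv", payload)
        archive.writestr(
            "bag/manifest-md5.txt",
            f"{hashlib.md5(payload).hexdigest()}  data/table.csv\n",
        )
    buffer.seek(0)
    return buffer


check = verify_payload_md5(make_demo_bag())
print(check)
```

A `False` entry means the file was altered or lost somewhere between the server and your computer, which is exactly the situation the manifest is meant to detect.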
Working with LiPD data
LiPD was designed so that it can capture much richer sets of (meta)data than ASCII or Excel files, and to have a fixed backbone around which scientific codes can be built. There is a price to pay for this power: LiPD is undoubtedly more difficult to interact with than a plain text file. Although it is possible to unzip a .lpd file and navigate through the native json-ld and csv files, this is not the best way to harness the power of LiPD files.
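For completeness, the manual route mentioned above (unzipping a .lpd file and reading the native files) can be sketched with the Python standard library alone. The `.jsonld` extension, the demo file names, and the metadata keys are assumptions for this illustration; the dedicated packages listed in this section are the more practical way to work with real files.

```python
import csv
import io
import json
import zipfile


def read_lpd_manually(lpd_file):
    """Return (metadata, tables) from a zipped LiPD-like archive.

    metadata: the parsed JSON-LD document (a dict).
    tables:   {csv file name: list of rows, each row a list of strings}.
    """
    metadata = None
    tables = {}
    with zipfile.ZipFile(lpd_file) as archive:
        for name in archive.namelist():
            lower = name.lower()
            if lower.endswith(".jsonld"):  # assumed extension of the metadata file
                metadata = json.loads(archive.read(name).decode("utf-8"))
            elif lower.endswith(".csv"):
                text = archive.read(name).decode("utf-8")
                tables[name.rsplit("/", 1)[-1]] = list(csv.reader(io.StringIO(text)))
    if metadata is None:
        raise ValueError("no .jsonld metadata file found in archive")
    return metadata, tables


def make_demo_archive():
    """Build a tiny in-memory archive standing in for a real .lpd file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("bag/data/metadata.jsonld", json.dumps({"dataSetName": "Example"}))
        archive.writestr("bag/data/paleo.csv", "1,-3.2\n2,-3.4\n")
    buffer.seek(0)
    return buffer


metadata, tables = read_lpd_manually(make_demo_archive())
print(metadata)
print(tables["paleo.csv"])
```

Note that the CSV values come back as strings with no column names attached; linking each column to its name, units, and description requires the metadata file, which is the bookkeeping that the LiPD-aware tools handle for you.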
This very wiki was designed to allow non-coders to directly access and edit LiPD files. Additionally, a growing number of utilities and software packages can read and write LiPD files, and enable users to readily take advantage of its rich structure:
List of utilities and software that read and write LiPD
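- LiPD Utilities (https://github.com/nickmckay/LiPD-utilities) in:
  - Matlab
  - Python
  - R
- GeoChronR (https://nickmckay.github.io/GeoChronR/), an R package for age-modeling and associated ensemble-based workflows (spectral analysis, PCA, correlations, etc.)
- CSciBox (https://www.cs.colorado.edu/~lizb/cscience.html), see this post by Liz Bradley on how LiPD changed how they do science (http://linked.earth/featured-partnership-cscience-linkedearth/)
- Pyleoclim (https://github.com/LinkedEarth/Pyleoclim_util), a Python package to analyze and visualize paleo data.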
Getting your data into LiPD
Once data are in LiPD, and the LiPD file is valid, everything is awesome.
How do you get paleo data into LiPD, you ask? This is now very easy thanks to the LiPD playground (http://lipd.net/playground), which enables basic quality checks along the way. Soon LiPD creation will leverage the LinkedEarth Ontology and NOAA's controlled vocabulary. Stay tuned for more updates.