Difference between revisions of "Digitization"

From SPNHC Wiki
(Data Management)
m (Contributors: add name to contributor page)
 
(36 intermediate revisions by 4 users not shown)
Line 34: Line 34:
  
 
===Data Aggregators===
 
===Data Aggregators===
Natural history collections commonly contribute to these data aggregators:
+
Natural history collections commonly contribute to data aggregators including those listed here, among ''many'' others:
*[http://ala.org.au/ ALA]
+
*[http://ala.org.au/ Atlas of Living Australia (ALA)]
*[https://bison.usgs.gov/ BISON]
+
*[https://bison.usgs.gov/ Biodiversity Serving Our Nation (BISON)]
*[http://www.cria.org.br/projetos CRIA]
+
*[https://www.canadensys.net/ Canadensys]
 +
*[http://www.cria.org.br/projetos Centro de Referência em Informação Ambiental (CRIA)]
 
*[https://www.idigbio.org/ iDigBio]
 
*[https://www.idigbio.org/ iDigBio]
*[https://www.gbif.org/ GBIF]
+
*[https://www.gbif.org/ Global Biodiversity Information Facility (GBIF)]
 +
*[http://swbiodiversity.org/seinet/ SEINet]
 +
*[https://datos.biodiversidad.co/ Sistema de Información sobre Biodiversidad (SiB) Colombia]
 
*[http://www.vertnet.org/ VertNet]
 
*[http://www.vertnet.org/ VertNet]
  
Line 45: Line 48:
 
* The [https://dataoneorg.github.io/Education/ DataONE Data Management Skillbuilding Hub] contains resources in data management and includes teaching materials, webinars, and a database of best-practices to improve methods for data sharing and management.
 
* The [https://dataoneorg.github.io/Education/ DataONE Data Management Skillbuilding Hub] contains resources in data management and includes teaching materials, webinars, and a database of best-practices to improve methods for data sharing and management.
 
* See this iDigBio Workshop for topics, materials, and presentations relevant to [https://www.idigbio.org/wiki/index.php/Managing_Natural_History_Collections_Data_for_Global_Discoverability Managing Natural History Collections for Global Discoverability].
 
* See this iDigBio Workshop for topics, materials, and presentations relevant to [https://www.idigbio.org/wiki/index.php/Managing_Natural_History_Collections_Data_for_Global_Discoverability Managing Natural History Collections for Global Discoverability].
* Search all iDigBio for materials tagged '''[https://www.idigbio.org/tags/data-management data management]'''
+
* Search all iDigBio for materials tagged [https://www.idigbio.org/tags/data-management data management]
 +
* [https://carpentries.org/ The Carpentries] offers hands-on workshops for building skills related to data literacy and coding.
 +
* Many software tools exist to help clean up and manage data; [http://openrefine.org/ Open Refine] is perhaps one of the most versatile and simplest to learn.
  
 
===Data Mobilization===
 
===Data Mobilization===
 
<p>Consider what needs to be done to get data out of a local collections database and into one or more other online resources. Some of the other categories on this wiki page that relate to this topic are [[Digitization#Data_Standards_and_Mobilization|data standards]], [[Digitization#Data_Management|data management]], [[Digitization#Data_Aggregation|data aggregation]], and [[Digitization#Workflows|workflows]]. Sharing data is often a cyclic process. Once shared, aggregators provide feedback and collections staff need to evaluate which items to address and how. After updates, data can be published again, with the enhancements.</p>
 
<p>Consider what needs to be done to get data out of a local collections database and into one or more other online resources. Some of the other categories on this wiki page that relate to this topic are [[Digitization#Data_Standards_and_Mobilization|data standards]], [[Digitization#Data_Management|data management]], [[Digitization#Data_Aggregation|data aggregation]], and [[Digitization#Workflows|workflows]]. Sharing data is often a cyclic process. Once shared, aggregators provide feedback and collections staff need to evaluate which items to address and how. After updates, data can be published again, with the enhancements.</p>
<p>Data aggregators often differ somewhat in what they expect collections data to look like to simplify aggregation. Overall, the community is moving toward shared aggregation practices. For example, most aggregators today accept darwin core archives (i.e. zippped text files in a specific format) for ingestion.</p>
+
<p>Data aggregators often differ somewhat in what they expect collections data to look like to simplify aggregation. Overall, the community is moving toward shared aggregation practices. For example, most aggregators today accept darwin core archives (i.e. zippped text files in a specific format) for ingestion. Some relevant resources include: </p>
*Darwin Core Hour: [https://github.com/tdwg/dwc-qa/wiki/Webinars#chapter-5-darwin-core-hour-darwin-core-in-practice-introduction-to-the-gbif-ipt Darwin Core in Practice: Introduction to the GBIF IPT]
+
*Demo of using the IPT to publish data from [https://github.com/tdwg/dwc-qa/wiki/Webinars#chapter-5-darwin-core-hour-darwin-core-in-practice-introduction-to-the-gbif-ipt Darwin Core in Practice: Introduction to the GBIF IPT]
*Overview of [https://www.idigbio.org/wiki/images/d/de/Penn_DataToiDigBio.pdf Data Standards and Mobilization] from an iDigBio viewpoint.
+
*Process for [https://www.gbif.org/publishing-data getting data to GBIF].
 +
*Overview of [https://www.idigbio.org/wiki/images/d/de/Penn_DataToiDigBio.pdf data standards and mobilization] from an iDigBio viewpoint.
 
<p>In the process of preparing to share data, there are many known issues to consider. Have a look at the [https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance iDigBio Data Ingestion Guidance] for an idea of the scope of the issues. Some overall topics that will come up include:</p>
 
<p>In the process of preparing to share data, there are many known issues to consider. Have a look at the [https://www.idigbio.org/wiki/index.php/Data_Ingestion_Guidance iDigBio Data Ingestion Guidance] for an idea of the scope of the issues. Some overall topics that will come up include:</p>
 
*Globally unique identifiers
 
*Globally unique identifiers
Line 60: Line 66:
  
 
===Data Standards and Mobilization===
 
===Data Standards and Mobilization===
To share our respective datasets, the data must be '''mapped''' to a single set of terms and concepts. By doing this, we can aggregate data into one searchable resource. It's rather similar to agreeing on a common language. Our collections community currently uses both [https://dwc.tdwg.org/ Darwin Core] and [https://github.com/tdwg/abcd Access to Biological Collections Data (ABCD)] to share biodiversity data. European collections use ABCD more often that Darwin Core. Current discussions are underway to work on merging these standards. [https://www.tdwg.org/standards/ac/ Audubon Core (AC)] standard provides a common language for sharing information about media (2D, 3D, etc.). Note that [https://dwc.tdwg.org/ Darwin Core] is a widely adopted standard for biodiversity data sharing. It was developed by the organization Biodiversity Information Standards ([https://www.tdwg.org/ TDWG]; historically known as the Taxonomic Databases Working Group) in 2009. A number of resources exist for its use:
+
To share our respective datasets, the data must be '''mapped''' to a single set of terms and concepts. By doing this, we can aggregate data into one searchable resource. It's rather similar to agreeing on a common language. Our collections community currently uses both [https://dwc.tdwg.org/ Darwin Core] and [https://abcd.tdwg.org/ Access to Biological Collections Data (ABCD)] to share biodiversity data. European collections use ABCD more often that Darwin Core. Current discussions are underway to work on merging these standards. [https://www.tdwg.org/standards/ac/ Audubon Core (AC)] standard provides a common language for sharing information about media (2D, 3D, etc.). Note that [https://dwc.tdwg.org/ Darwin Core] is a widely adopted standard for biodiversity data sharing. It was developed by the organization Biodiversity Information Standards ([https://www.tdwg.org/ TDWG]; historically known as the Taxonomic Databases Working Group) in 2009. A number of resources exist for its use:
 
* [https://www.gbif.org/darwin-core What is Darwin Core, and why does it matter?]
 
* [https://www.gbif.org/darwin-core What is Darwin Core, and why does it matter?]
 
* [https://dwc.tdwg.org/terms/ Darwin Core quick reference guide]
 
* [https://dwc.tdwg.org/terms/ Darwin Core quick reference guide]
Line 68: Line 74:
  
 
===Data Transcription===
 
===Data Transcription===
Transcription, aka data capture or data entry, is an essential part of the digitization process but can pose a number of challenges. Many institutions enlist the services of 'Transcription Portals', also known as 'Volunteer or Citizen Science Portals' for help in transcribing collections records. Whether that be specimen labels, field notes and diaries or helping describe and annotate some other form of media ie. animals appearing in camera trap images. Search iDigBio for materials tagged [https://www.idigbio.org/tags/transcription transcription]
+
Transcription, or the process of capturing data as text, is an essential part of the digitization process but can be time consumptive and pose a number of challenges. Technologies like [https://en.wikipedia.org/wiki/Optical_character_recognition optical character recognition (OCR)], [https://en.wikipedia.org/wiki/List_of_speech_recognition_software speech recognition] and [https://en.wikipedia.org/wiki/Machine_learning machine learning] offer opportunities to make transcription more efficient and to improve the accuracy of results. Many institutions currently enlist the services of crowdsourcing via citizen science portals for help in transcribing collections records such as specimen labels and field notes, as well as for help adding text annotations to media. Search iDigBio for materials tagged [https://www.idigbio.org/tags/transcription transcription], or see below for examples of additional transcription resources:
 +
 
 +
*all publications in the journal ''Biodiversity Information Science & Standards'' [https://biss.pensoft.net/browse_journal_articles.php?form_name=filter_articles&sortby=0&journal_id=63&search_hidden=machine+learning&search_in_=0&search_in_hidden=0&tAction=Filter related to machine learning]
 +
*all publications in the journal ''Biodiversity Information Science & Standards'' [https://biss.pensoft.net/browse_journal_articles.php?form_name=filter_articles&sortby=0&journal_id=63&search_hidden=optical+character+recognition&search_in_=0&search_in_hidden=0&tAction=Filter related to OCR]
 +
*[https://www.notesfromnature.org/ Notes from Nature] citizen science transcription project hosted on the [https://www.zooniverse.org/ Zooniverse] platform
 +
*Atlas of Living Australia's [https://digivol.ala.org.au/ DigiVol transcription platform]
 +
*[https://transcription.si.edu/ Transcription projects] at the Smithsonian Museums
 +
*[https://wedigbio.org/ WeDigBio], an annual global transcription event celebrating citizen science
 +
*project management support for public transcription projects by [https://biospex.org/ Biospex]
 +
*[https://fromthepage.com/ From the Page], a platform designed for collaboratively transcribing documents
  
 
===Database Software===
 
===Database Software===
Those curating natural history collections are currently using a number of different platforms to capture, track, and share data. Below are a few of the more common database systems :
+
Those curating natural history collections are currently using a number of different platforms to capture, track, and share data. Below are a few of the more common collection management systems (CMS):
* [https://arctosdb.org/ ARCTOS]
+
* [https://arctosdb.org/ Arctos]
 
* [https://alm.axiell.com/collections-management-solutions/technology/emu-collections-management/ Axiell EMu]
 
* [https://alm.axiell.com/collections-management-solutions/technology/emu-collections-management/ Axiell EMu]
* [http://www.collectionspace.org/ Collection Space]
+
* [https://herbaria.plants.ox.ac.uk/bol/brahms BRAHMS]
 +
* [http://www.collectionspace.org/ CollectionSpace]
 +
* [https://earthcape.com/ EarthCape]
 +
* [https://www.irisbg.com/ IrisBG]
 
* [https://www.museumsoftware.com/ Past Perfect]
 
* [https://www.museumsoftware.com/ Past Perfect]
 
* [http://www.sustain.specifysoftware.org/ Specify]
 
* [http://www.sustain.specifysoftware.org/ Specify]
* [http://symbiota.org/docs/symbiota-introduction/symbiota-overview/ Symbiota]
+
* [https://symbiota.org/ Symbiota]
Features may vary widely, including:  
+
* [http://taxonworks.org/ TaxonWorks]
 +
 
 +
Evaluating which CMS to use can be difficult and has major consequences for digitization and data management in your collection. An example of how one institution evaluated their CMS options is [https://www.idigbio.org/sites/default/files/workshop-presentations/spnhc2016/1350_Krimmel_SPNHC_Arctos.pdf available here]. CMS features may vary widely, including:  
 
*Ability to customize
 
*Ability to customize
 
*Ability to easily map to data standards
 
*Ability to easily map to data standards
Line 86: Line 106:
 
*Cost
 
*Cost
 
*Ease of publishing data to aggregators
 
*Ease of publishing data to aggregators
*Georeferencing (built-in tools, or not)
+
*Integrated tools for georeferencing
 
*Linking to media resources (e.g. label images, 2D, 3D media)
 
*Linking to media resources (e.g. label images, 2D, 3D media)
 
*Ways to batch update records
 
*Ways to batch update records
Line 96: Line 116:
 
* [http://www.gbif.org/orc/?doc_id=1288 Biogeomancer Guide to Best Practices]
 
* [http://www.gbif.org/orc/?doc_id=1288 Biogeomancer Guide to Best Practices]
 
* [http://www.geo-locate.org/ GEOLocate]  
 
* [http://www.geo-locate.org/ GEOLocate]  
* [https://www.idigbio.org/wiki/index.php/GWG_Second_Train_the_Trainers_Workshop GWS Second Train the Trainers Workshop]
+
* [https://www.idigbio.org/wiki/index.php/GWG_Second_Train_the_Trainers_Workshop GWG Second Train the Trainers Workshop]
 
It is logical to separate georeferencing of collections locality data into two categories.
 
It is logical to separate georeferencing of collections locality data into two categories.
 
# Georeferencing '''legacy data''' from the text-based locality descriptions for specimens collected before the '''global positioning system (GPS)''' made GPS coordinate collection possible. The references and examples in the list above give many hints on best/better practices for georeferencing this type of data.
 
# Georeferencing '''legacy data''' from the text-based locality descriptions for specimens collected before the '''global positioning system (GPS)''' made GPS coordinate collection possible. The references and examples in the list above give many hints on best/better practices for georeferencing this type of data.
# For new specimens entering collections, best practice would be for the georeference for that item to be included. This keeps the '''legacy data''' pile from growing and speeds access to the specimen data needed for scientific research. For best practice, a collection/institution would have a policy in place about what geospatial information is expected to be submitted with specimen. See the current [http://www.gbif.org/orc/?doc_id=1288 Biogeomancer Guide to Best Practices] for guidance on this topic.
+
# For new specimens entering collections, best practice would be for the coordinate data and metadata associated with the specimen collecting events to be provided by the collector/donor. This eliminates the need to georeference legacy data, increases the accuracy of the coordinates, and speeds access to data for scientific research. For best practice, a collection/institution would have a policy in place about what geospatial information is expected to be submitted with specimen. See the current [http://www.gbif.org/orc/?doc_id=1288 Biogeomancer Guide to Best Practices] for guidance on this topic.
 
If your institution has such a policy and guidance in place, it would be good to share it here.
 
If your institution has such a policy and guidance in place, it would be good to share it here.
  
Line 111: Line 131:
 
* [https://www.idigbio.org/wiki/index.php/Specimen_Image_Capture Image Capture] information from iDigBio
 
* [https://www.idigbio.org/wiki/index.php/Specimen_Image_Capture Image Capture] information from iDigBio
 
* [https://www.idigbio.org/wiki/index.php/Specimen_Image_Processing Image Processing] information from iDigBio
 
* [https://www.idigbio.org/wiki/index.php/Specimen_Image_Processing Image Processing] information from iDigBio
* [https://www.idigbio.org/biblio/filter Search iDigBio] for all available materials regarding imaging
 
  
 
===Key References and Further Reading===
 
===Key References and Further Reading===
* Nelson, G., D. Paul, G. Riccardi, and A.R. Mast. 2012. Five task clusters that enable efficient and effective digitization of biological collections. Zookeys 209:19-45. [https://zookeys.pensoft.net/articles.php?id=2926]
+
* Nelson, G., D. Paul, G. Riccardi, and A.R. Mast. 2012. [https://zookeys.pensoft.net/articles.php?id=2926 '''Five task clusters that enable efficient and effective digitization of biological collections''']. ''Zookeys'' 209:19-45.
* Vollmar, A. J.A. Macklin, and L.S. Ford. Natural History Specimen Digitization: Challenges and Concerns. 2010. Biodiversity Informatics 7:93-112. [https://journals.ku.edu/jbi/article/view/3992]
+
* Vollmar, A. J.A. Macklin, and L.S. Ford. [https://journals.ku.edu/jbi/article/view/3992 '''Natural History Specimen Digitization: Challenges and Concerns''']. 2010. ''Biodiversity Informatics'' 7:93-112.
* [https://zookeys.pensoft.net/browse_journal_issue_documents?issue_id=361 ZooKeys Special Issue] (No specimen left behind: mass digitization of natural history collections (2012)
+
* [https://zookeys.pensoft.net/browse_journal_issue_documents?issue_id=361 ZooKeys 2012 Special Issue] on '''No specimen left behind: Mass digitization of natural history collections'''
 +
* Lendemer, J., B. Thiers, A. K. Monfils, J. Zaspel, E. R. Ellwood, A. Bentley, K. LeVan, et al. 2020. '''The Extended Specimen Network: A Strategy to Enhance US Biodiversity Collections, Promote Research and Education'''. ''BioScience'' 70:1. [https://doi.org/10.1093/biosci/biz140 doi:10.1093/biosci/biz140].
 +
* White, E., E. Baldridge, Z. Brym, K. Locey, D. McGlinn, and S. Supp. 2013. '''Nine Simple Ways to Make It Easier to (Re)Use Your Data'''. ''Ideas in Ecology and Evolution'' 6:2. [https://doi.org/10.4033/iee.2013.6b.6.f doi:10.4033/iee.2013.6b.6.f]
 +
* Goodman, A., A. Pepe, A. W. Blocker, C. L. Borgman, K. Cranmer, M. Crosas, R. Di Stefano, et al. 2014. '''Ten Simple Rules for the Care and Feeding of Scientific Data'''. ''PLoS Computational Biology'' 10:4 e1003542. [https://doi.org/10.1371/journal.pcbi.1003542 doi:10.1371/journal.pcbi.1003542]
 +
* The [https://www.ala.org.au/wp-content/uploads/2011/10/Digitisation-guide-120326.pdf '''Atlas of Living Australia (ALA) digitisation guide''']. Includes key guidance material such as the ‘[https://www.ala.org.au/who-we-are/digitisation-guidance/ digitisation maturity model]'
 
* [https://www.idigbio.org/biblio/filter Search iDigBio] for all available digitization materials
 
* [https://www.idigbio.org/biblio/filter Search iDigBio] for all available digitization materials
* The [https://www.ala.org.au/wp-content/uploads/2011/10/Digitisation-guide-120326.pdf Atlas of Living Australia (ALA) digitisation guide].Includes key guidance material such as the ‘[https://www.ala.org.au/who-we-are/digitisation-guidance/ digitisation maturity model]'
+
 
 +
=====Herbarium resources=====
 +
* Sweeney, P., et al. 2018. [https://doi.org/10.12705/671.10 '''Large-scale digitization of herbarium specimens: Development and usage of an automated, high-throughput conveyor system''']. ''Taxon'' 67(1):165-178.
 +
* Harris, K. M., Marsico, T. D. 2017. [https://doi.org/10.3732/apps.1600125 '''Digitizing specimens in a small herbarium: A viable workflow for collections working with limited resources''']. ''Applications in Plant Sciences'' 5(4).
 +
* Nelson, G., et al. 2015. [https://doi.org/10.3732/apps.1500065 '''Digitization workflows for flat sheets and packets of plants, algae, and fungi''']. ''Applications in Plant Sciences'' 3(9).
  
 
===Webinars===
 
===Webinars===
Access to various webinars (and select recorded presentations) regarding digization is available:
+
Access to various webinars (and select recorded presentations) regarding digitization and digital data management is available:
* [https://www.idigbio.org/tags/webinar iDigBio Webinars]
+
* [https://github.com/tdwg/dwc-qa/wiki/Webinars Darwin Core Hour webinar series]
* [https://github.com/tdwg/dwc-qa/wiki/ Webinars Darwin Core Hours]
+
* [https://www.idigbio.org/wiki/index.php/Paleo_Digitization_Working_Group Paleo Digitization Working Group webinars]
* See this [https://www.idigbio.org/wiki/index.php/Paleo_Digitization_Working_Group Digitization Working Group Page] for Paleo Webinars
+
* [http://smallcollections.net/tags/webinar-recordings Small Collections (SCNet) webinars]
* [http://smallcollections.net/tags/webinar-recordings SCNet Webinars]
+
* [https://arctosdb.org/learn/webinars/ Arctos webinar series]
* [https://vimeo.com/idigbio Presentations] from iDigBio events on Vimeo
+
* [https://www.idigbio.org/wiki/index.php/SWG_Webinar_Series Symbiota webinar series]
 +
* search iDigBio for [https://www.idigbio.org/tags/webinar everything tagged webinar]
 +
* see [https://vimeo.com/idigbio presentations from iDigBio events] on Vimeo
  
 
===Workflows===
 
===Workflows===
 
Various general and discipline-specific materials regarding digitization are available via iDigBio:
 
Various general and discipline-specific materials regarding digitization are available via iDigBio:
* [https://www.idigbio.org/wiki/images/0/09/LFG-idigbio-1-short.pdf Workflows: Perspectives from Peabody Entomology]
+
* [https://www.idigbio.org/wiki/images/6/60/Workflows.pdf Digitization workflows for herbaria (2012 workshop)]
* [https://www.idigbio.org/wiki/images/a/a4/Mast_Crowd_Valdosta_15.pdf Workflows Herbarium Digitization]
+
* [https://www.idigbio.org/content/digitization-workflows Compiled digitization workflows from iDigBio]
* [https://www.idigbio.org/wiki/images/c/c2/Norris_Workflows_for_Digitizing_Vertebrate_Paleo.pdf Workflows for Digitizing Vertebrate Paleontology]
+
* [https://www.idigbio.org/wiki/images/c/c2/Norris_Workflows_for_Digitizing_Vertebrate_Paleo.pdf Workflows for digitizing vertebrate paleontology collections]
* [https://www.idigbio.org/wiki/images/f/f8/Mccormick-UNC-Lessons_Learned.pdf Workflows at University of North Carolina]
+
* [https://www.idigbio.org/content/workflow-modules-and-task-lists Workflow modules and task lists]
 +
* [https://www.idigbio.org/wiki/images/0/09/LFG-idigbio-1-short.pdf Workflow Perspectives from Peabody Entomology]
 
* [https://www.idigbio.org/wiki/images/2/20/2.Lorence_NTBG.pdf We're Virtually There: Digitizing NTBG's Herbarium Collection (PTBG)]
 
* [https://www.idigbio.org/wiki/images/2/20/2.Lorence_NTBG.pdf We're Virtually There: Digitizing NTBG's Herbarium Collection (PTBG)]
* [https://www.idigbio.org/sites/default/files/workshop-presentations/small-herbarium2013/The%20VSU%20Experience.pdf The Valdosta State Herbarium (VSC) Experience: Mobilizing Small Herbaria for Digitization]
 
  
 
===Workshops and Symposia===
 
===Workshops and Symposia===
 
A number of workshops and conference symposia have focused on the subject of digitization:
 
A number of workshops and conference symposia have focused on the subject of digitization:
* [https://www.idigbio.org/wiki/index.php/IDigBio_Workshops iDigBio workshops (2011-2019)]
+
* [https://www.idigbio.org/wiki/index.php/IDigBio_Workshops iDigBio workshops (2011-2020)]
 
* [https://biss.pensoft.net/collection/129/ Biodiversity Next Symposium: Digitisation Next (2019)]
 
* [https://biss.pensoft.net/collection/129/ Biodiversity Next Symposium: Digitisation Next (2019)]
 
* [[Digitization_Workshop_ASIH_2016|SPNHC-sponsored ASIH Workshop (2016)]]
 
* [[Digitization_Workshop_ASIH_2016|SPNHC-sponsored ASIH Workshop (2016)]]
Line 145: Line 174:
 
* [https://www.gbif.org/event/6eOfmA7ApqWcWKWOwYW44o/digitization-and-publication-workshop GBIF Digitization and Publication Workshop (2017)]
 
* [https://www.gbif.org/event/6eOfmA7ApqWcWKWOwYW44o/digitization-and-publication-workshop GBIF Digitization and Publication Workshop (2017)]
 
* [https://www.gbif.org/event/2rfvo6U3Osagmw8Yyog6a8/digitization-and-mobilization-of-biological-data-workshop GBIF Digitization and Mobilization of Biological Data Workshop (2018)]
 
* [https://www.gbif.org/event/2rfvo6U3Osagmw8Yyog6a8/digitization-and-mobilization-of-biological-data-workshop GBIF Digitization and Mobilization of Biological Data Workshop (2018)]
 +
* Biodiversity Informatics 101, held at [https://www.idigbio.org/wiki/index.php/SPNHC_2017_Biodiversity_Informatics_101 SPNHC 2017] and again at [https://github.com/tdwg/curriculum/blob/master/biodiversity-informatics-101/bi101_schedule_2019.md Biodiversity Next in 2019]
  
 
==Contributors==
 
==Contributors==
Current content contributors: SPNHC members [[User:BredaZimkus|Breda Zimkus]], [[User:JessicaCundiff|Jessica Cundiff]], [[User:GenevieveTocci|Genevieve Tocci]], [[User:NicoleFisher|Nicole Fisher]], and [[User:DeborahPaul|Deborah Paul]]. We hope that others will add their names to this list as information is added and updated.
+
Current content contributors: SPNHC members [[User:BredaZimkus|Breda Zimkus]], [[User:JessicaCundiff|Jessica Cundiff]], [[User:GenevieveTocci|Genevieve Tocci]], [[User:NicoleFisher|Nicole Fisher]], [[User:DeborahPaul|Deborah Paul]], [[User:Erica_Krimmel|Erica Krimmel]], and [[User:Katie_Pearson|Katie Pearson]]. We hope that others will add their names to this list as information is added and updated.
  
 
[[Digitization_Workshop_ASIH_2016|Original digitization page content]] now found [[Digitization_Workshop_ASIH_2016|here]] was generated during The American Society of Ichthyologists and Herpetologists (ASIH) Annual Joint Meeting - 2016, during an iDigBio sponsored workshop by the following individuals participating in the "Digitization" working group of the aforementioned workshop: Gil Nelson (Florida State University, Courtesy Faculty), Larry Page (The Florida Museum of Natural History, Ichthyology Curator), Cristina Cox-Fernandes (UMass Amherst Biology, Adjunct Research Associate Professor), Mark Sabaj (ANSP, Ichthyology Collection Manager), Adam Summers (University of Washington, Professor - Friday Harbor Labs), Kevin Love (iDigBio, IT Expert), Ken Thompson (Lock Haven University, Professor; Retired), Randy Singer (Florida Museum of Natural History), and Gregory Watkins-Colwell (Yale Peabody Museum, Herps and Fishes, Collection Manager).
 
[[Digitization_Workshop_ASIH_2016|Original digitization page content]] now found [[Digitization_Workshop_ASIH_2016|here]] was generated during The American Society of Ichthyologists and Herpetologists (ASIH) Annual Joint Meeting - 2016, during an iDigBio sponsored workshop by the following individuals participating in the "Digitization" working group of the aforementioned workshop: Gil Nelson (Florida State University, Courtesy Faculty), Larry Page (The Florida Museum of Natural History, Ichthyology Curator), Cristina Cox-Fernandes (UMass Amherst Biology, Adjunct Research Associate Professor), Mark Sabaj (ANSP, Ichthyology Collection Manager), Adam Summers (University of Washington, Professor - Friday Harbor Labs), Kevin Love (iDigBio, IT Expert), Ken Thompson (Lock Haven University, Professor; Retired), Randy Singer (Florida Museum of Natural History), and Gregory Watkins-Colwell (Yale Peabody Museum, Herps and Fishes, Collection Manager).
Line 154: Line 184:
  
  
[[Category:Best_Practices]][[Category:Digitization and Information Sharing]]
+
[[Category:Best_Practices]][[Category:Digitization and Imaging]]

Latest revision as of 16:32, 10 November 2022

Statement of Purpose

Realizing the import of collections [1][2][3], SPNHC recognizes the need to collaborate to develop, discover, disseminate and update best (better, current, recommended) practices for creating digital collections resources and publishing them for global access. Materials linked here represent the efforts of many collections data mobilization projects worldwide. All in the collections and standards community are encouraged to contribute.

Defining Digitization

In the context of the SPNHC wiki, 'digitize' means converting ALL analog data to digital data according to standard vocabularies such as Darwin Core and Audubon Core. That is, we start with the concept of a specimen that has been accessioned in a collection. We envision these digital data eventually including the entirety of the analog data associated with a particular specimen. This may include but is not limited to:

  • Text data from labels and ledgers associated with specimens
  • Images of specimens
  • DNA and other 'omics
  • Field notes, drawings and images
  • Tomographic imaging data
  • Specimen history (including preservation)
  • Specimen-associated literature and media
  • Collection-level metadata

Digitizing might be accomplished by collections managers, technicians, contractors, volunteers, and other entities, the results of which are included within the institution's collection management system. In many instances these data may be generated off site by investigators.

The process of digitization has been analyzed by Nelson et al. (2012)[4], and five task clusters that comprise the digitization process leading up to data publication have been identified:

  1. Pre-digitization curation and staging
  2. Specimen image capture
  3. Specimen image processing
  4. Electronic data capture
  5. Georeferencing locality descriptions

We expect these groupings to change over time as standards of practice for digitization processes and procedures evolve. For example, the following should likely be added:

  • Data mobilization as a task cluster (aka data publishing), as after data capture in a local database, the data need to be shared outside the local database.
  • Feedback re-integration - after collections data is published, feedback from others needs re-integrating into local collections. This re-integration step requires vetting and usually some policy decision-making. At some point, local changes to a collection management system may be desired/needed.
  • Pro-active data capture - policies and procedures for capturing new specimen data in the field (i.e. "born-digital data"), already mapped to relevant data standards and formatted accordingly.

Digitization Resources

Data Aggregation

Data mobilization (getting the data out of your local collection management database) involves contributing data and media to one or more designated aggregators. These data are then integrated with data from other institutions to provide access to a greater volume of datasets. The scope of an aggregator may be taxonomic (e.g. SCAN), organization- or institution-based (e.g. C. V. Starr Virtual Herbarium), regional (e.g. SEINet), national (e.g. the Atlas of Living Australia), global (e.g. GBIF), or otherwise. Aggregating data offers collections unique opportunities to enhance collections data, facilitate discovery, and increase re-use. The following resources introduce the aggregator's point of view and what to expect.

Getting collections data to an aggregator is a multi-step and often cyclic process. See also: data standards, data management, data mobilization, and workflows.

Data Aggregators

Natural history collections commonly contribute to data aggregators including those listed here, among many others:

Data Management

Data Mobilization

Consider what needs to be done to get data out of a local collections database and into one or more other online resources. Some of the other categories on this wiki page that relate to this topic are data standards, data management, data aggregation, and workflows. Sharing data is often a cyclic process. Once data are shared, aggregators provide feedback, and collections staff need to evaluate which items to address and how. After updates, the data can be published again with the enhancements.

Data aggregators often differ somewhat in the data format they expect from collections in order to simplify aggregation. Overall, the community is moving toward shared aggregation practices. For example, most aggregators today accept Darwin Core Archives (i.e. zipped text files in a specific format) for ingestion. Some relevant resources include:

In the process of preparing to share data, there are many known issues to consider. Have a look at the iDigBio Data Ingestion Guidance for an idea of the scope of the issues. Some overall topics that will come up include:

  • Globally unique identifiers
  • Collection-level metadata
  • Data standard and format issues (e.g. date formats, missing higher taxonomy, geo-coordinate issues, ...)
  • Rights information (e.g. Creative Commons licenses for images). Check out the VertNet Norms for Data Use and Publication for a thorough introduction to the licensing issues pertinent to collections data and media.
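
Several of the topics listed above can be screened for before data are sent to an aggregator. The following is a minimal sketch in Python, not any aggregator's actual validation code. It assumes each record is a dictionary keyed by Darwin Core terms (occurrenceID, eventDate, kingdom, family, decimalLatitude, decimalLongitude); the list of accepted local date formats and the function names are hypothetical and would need to match what your own database exports.

```python
from datetime import datetime

# Hypothetical list of date formats used locally; adjust to your own exports.
# Note: "%m/%d/%Y" assumes month-first dates, which is itself a local assumption.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def to_iso_date(value):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if it cannot be parsed."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def find_issues(record):
    """Return a list of human-readable problems found in one record (a dict)."""
    issues = []
    # Globally unique identifier present?
    if not record.get("occurrenceID"):
        issues.append("missing occurrenceID")
    # Date in one of the recognized formats?
    if to_iso_date(record.get("eventDate") or "") is None:
        issues.append("eventDate not in a recognized format")
    # Higher taxonomy filled in? (Only two ranks are checked in this sketch.)
    for rank in ("kingdom", "family"):
        if not record.get(rank):
            issues.append(f"missing higher taxonomy: {rank}")
    # Coordinates numeric and within valid ranges?
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, ValueError, TypeError):
        issues.append("coordinates missing or not numeric")
    else:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            issues.append("coordinates out of range")
    return issues
```

Running find_issues over an export before publishing gives a quick list of records to review; it does not replace the feedback an aggregator provides after ingestion.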

Data Standards and Mobilization

To share our respective datasets, the data must be mapped to a single set of terms and concepts. By doing this, we can aggregate data into one searchable resource. It's rather similar to agreeing on a common language. Our collections community currently uses both Darwin Core and Access to Biological Collections Data (ABCD) to share biodiversity data. European collections use ABCD more often than Darwin Core. Discussions are underway to work on merging these standards. The Audubon Core (AC) standard provides a common language for sharing information about media (2D, 3D, etc.). Darwin Core is a widely adopted standard for biodiversity data sharing. It is maintained by the organization Biodiversity Information Standards (TDWG; historically known as the Taxonomic Databases Working Group), which ratified it as a standard in 2009. A number of resources exist for its use:

More and more, the process of mapping collections data to Darwin Core or other standards is simplified by the collections software itself.
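
Where the collections software does not do the mapping for you, the idea can be sketched in a few lines of Python. This is an illustration only: the local column names on the left of FIELD_MAP are hypothetical, while the terms on the right (catalogNumber, scientificName, recordedBy, eventDate, locality) are Darwin Core terms. The output is a tab-delimited text table of the kind that goes into a Darwin Core Archive; a complete archive also needs a metadata descriptor, which is not shown here.

```python
import csv
import io

# Hypothetical local column names (left) mapped to Darwin Core terms (right).
FIELD_MAP = {
    "CatalogNo": "catalogNumber",
    "Species": "scientificName",
    "Collector": "recordedBy",
    "DateCollected": "eventDate",
    "Locality": "locality",
}


def to_darwin_core(local_record):
    """Rename mapped fields to Darwin Core terms; unmapped fields are dropped."""
    return {
        dwc: local_record[local]
        for local, dwc in FIELD_MAP.items()
        if local in local_record
    }


def write_occurrences(local_records, out):
    """Write records as tab-delimited text with Darwin Core column headers."""
    writer = csv.DictWriter(
        out,
        fieldnames=list(FIELD_MAP.values()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in local_records:
        # Fields missing from a record are written as empty strings.
        writer.writerow(to_darwin_core(record))


# Usage with one invented record; "Shelf" has no mapping and is left out.
buffer = io.StringIO()
write_occurrences(
    [{"CatalogNo": "12345", "Species": "Quercus alba", "Shelf": "B2"}],
    buffer,
)
print(buffer.getvalue())
```

Keeping the mapping in one dictionary makes it easy to review with colleagues and to extend as more local fields are matched to standard terms.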

Data Transcription

Transcription, or the process of capturing data as text, is an essential part of the digitization process but can be time-consuming and pose a number of challenges. Technologies like optical character recognition (OCR), speech recognition, and machine learning offer opportunities to make transcription more efficient and to improve the accuracy of results. Many institutions currently enlist the help of volunteers through citizen science crowdsourcing portals to transcribe collections records such as specimen labels and field notes, and to add text annotations to media. Search iDigBio for materials tagged transcription, or see below for examples of additional transcription resources:

Database Software

Those curating natural history collections are currently using a number of different platforms to capture, track, and share data. Below are a few of the more common collection management systems (CMS):

Choosing a CMS can be difficult, and the choice has major consequences for digitization and data management in your collection. An example of how one institution evaluated their CMS options is available here. CMS features vary widely and may include:

  • Ability to customize
  • Ability to easily map to data standards
  • Ability to store and track identifiers (e.g. for people, specimens, identifications, ...)
  • Assignment of globally unique identifiers
  • Available fields
  • Cost
  • Ease of publishing data to aggregators
  • Integrated tools for georeferencing
  • Linking to media resources (e.g. label images, 2D, 3D media)
  • Ways to batch update records

Georeferencing

A number of resources are available on georeferencing, the process of defining a location using map coordinates and assigning the coordinate system of the map frame:

It is logical to separate georeferencing of collections locality data into two categories.

  1. Georeferencing legacy data from the text-based locality descriptions for specimens collected before the global positioning system (GPS) made GPS coordinate collection possible. The references and examples in the list above give many hints on best/better practices for georeferencing this type of data.
  2. For new specimens entering collections, best practice would be for the coordinate data and metadata associated with the specimen collecting events to be provided by the collector/donor. This eliminates the need to georeference these records retrospectively, increases the accuracy of the coordinates, and speeds access to data for scientific research. For best practice, a collection/institution would have a policy in place about what geospatial information is expected to be submitted with specimens. See the current BioGeomancer Guide to Best Practices for guidance on this topic.

If your institution has such a policy and guidance in place, it would be good to share it here.
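
As an illustration of what such a policy might check, here is a small Python sketch. It is an assumption for discussion, not a recommended policy: the set of required terms is hypothetical (the term names themselves are Darwin Core terms), and the helper names are invented for this example. It also shows a conversion from degrees, minutes, and seconds to decimal degrees, since collectors often record coordinates in that form.

```python
# Hypothetical policy: Darwin Core terms a collector must supply with each
# new specimen. Your institution's policy may require more or fewer.
REQUIRED_GEO_TERMS = (
    "decimalLatitude",
    "decimalLongitude",
    "geodeticDatum",
    "coordinateUncertaintyInMeters",
)


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus N/S/E/W to signed decimal degrees."""
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # Southern latitudes and western longitudes are negative by convention.
    return -value if hemisphere.upper() in ("S", "W") else value


def missing_geo_terms(record):
    """Return the required terms that are absent or empty in a record (a dict)."""
    return [term for term in REQUIRED_GEO_TERMS if record.get(term) in (None, "")]


# Usage with invented values: the datum and uncertainty were not supplied.
record = {
    "decimalLatitude": dms_to_decimal(29, 39, 0, "N"),
    "decimalLongitude": dms_to_decimal(82, 30, 0, "W"),
}
print(missing_geo_terms(record))
```

A check like this could run when a collector or donor submits a field spreadsheet, so that missing datum or uncertainty values are requested while the collecting event is still fresh.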

iDigBio Digitization Resources Wiki

The iDigBio Digitization Resources wiki page provides resources and information regarding digitization, including training workshops being conducted by iDigBio, digitization information and resources, and links to documents, websites, videos, presentations, and other important information related to biological collection digitization.

Imaging and Media

A number of techniques are available for two-dimensional (2D) and three-dimensional (3D) digitization, including X-ray computed tomography (CT):

Key References and Further Reading

Herbarium resources

Webinars

Access to various webinars (and select recorded presentations) regarding digitization and digital data management is available:

Workflows

Various general and discipline-specific materials regarding digitization are available via iDigBio:

Workshops and Symposia

A number of workshops and conference symposia have focused on the subject of digitization:

Contributors

Current content contributors: SPNHC members Breda Zimkus, Jessica Cundiff, Genevieve Tocci, Nicole Fisher, Deborah Paul, Erica Krimmel, and Katie Pearson. We hope that others will add their names to this list as information is added and updated.

The original digitization page content found here was generated at an iDigBio-sponsored workshop during The American Society of Ichthyologists and Herpetologists (ASIH) Annual Joint Meeting - 2016, by the following participants in the workshop's "Digitization" working group: Gil Nelson (Florida State University, Courtesy Faculty), Larry Page (The Florida Museum of Natural History, Ichthyology Curator), Cristina Cox-Fernandes (UMass Amherst Biology, Adjunct Research Associate Professor), Mark Sabaj (ANSP, Ichthyology Collection Manager), Adam Summers (University of Washington, Professor - Friday Harbor Labs), Kevin Love (iDigBio, IT Expert), Ken Thompson (Lock Haven University, Professor; Retired), Randy Singer (Florida Museum of Natural History), and Gregory Watkins-Colwell (Yale Peabody Museum, Herps and Fishes, Collection Manager).

References

  1. Lawrence M. Page, Bruce J. MacFadden, Jose A. Fortes, Pamela S. Soltis, Greg Riccardi, Digitization of Biodiversity Collections Reveals Biggest Data on Biodiversity, BioScience, Volume 65, Issue 9, 01 September 2015, Pages 841–842, https://doi.org/10.1093/biosci/biv104
  2. Nelson, G., & Ellis, S. (2019, January 7). The history and impact of digitization and digital data mobilization on biodiversity research. Philosophical Transactions of the Royal Society B: Biological Sciences. Royal Society Publishing. https://doi.org/10.1098/rstb.2017.0391
  3. Monfils, A. K., Powers, K. E., Marshall, C. J., Martine, C. T., Smith, J. F., & Prather, L. A. (2017). Natural History Collections: Teaching about Biodiversity Across Time, Space, and Digital Platforms. Southeastern Naturalist, 16(sp10), 47–57. https://doi.org/10.1656/058.016.0sp1008
  4. Nelson G, Paul D, Riccardi G, Mast A (2012) Five task clusters that enable efficient and effective digitization of biological collections. ZooKeys 209: 19-45. https://doi.org/10.3897/zookeys.209.3135