Thursday, December 12, 2013

Guest post: response to "Putting GenBank Data on the Map"

The following is a guest blog post by David Schindel and colleagues, and is a response to the paper by Antonio Marques et al. in Science (doi:10.1126/science.341.6152.1341-a).

Marques, Maronna and Collins (1) rightly call on the biodiversity research community to include latitude/longitude data in database and published records of natural history specimens. However, they have overlooked an important signal that the community is moving in the right direction. The Consortium for the Barcode of Life (CBOL) developed a data standard for DNA barcoding (2) that was approved and implemented in 2005 by the International Nucleotide Sequence Database Collaboration (INSDC; GenBank, ENA and DDBJ) and revised in 2009. All data records that meet the requirements of the data standard include the reserved keyword 'BARCODE'. The required elements include: (a) information about the voucher specimen from which the DNA barcode sequence was derived (e.g., species name, unique identifier in a specimen repository, country/ocean of origin); (b) a sequence from an approved gene region with minimum length and quality; and (c) primer sequences and the forward and reverse trace files. Participants in the workshops that developed the data standard decided to include latitude and longitude as strongly recommended elements but not as strict requirements, for two reasons. First, many voucher specimens from which BARCODE records are generated may have been collected before GPS devices were available. Second, barcoding projects such as the Barcode of Wildlife Project (4) concentrate on rare and endangered species, and publishing the GPS coordinates of their collecting localities would facilitate illegal collecting and trafficking that could contribute to biodiversity loss.

The BARCODE data standard is promoting precisely the trend toward georeferencing called for by Marques, Maronna and Collins. Table 1 shows that there are currently 347,349 BARCODE records in INSDC (3). Of these BARCODE records, 83% include latitude/longitude data. Despite latitude/longitude not being a required element in the data standard, this level of georeferencing is much higher than for all records of the cytochrome c oxidase I gene (COI, the BARCODE region), of 16S rRNA, or of cytochrome b (cytb), another mitochondrial region that was used for species identification prior to the growth of barcoding. Data are also presented on the numbers and percentages of data records that include information on the voucher specimen from which the nucleotide sequence was obtained. In an increasing number of cases, these voucher specimen identifiers in INSDC are hyperlinked to the online specimen data records in museums, herbaria and other biorepositories. Table 2 provides the same data for the time interval used in the Marques et al. letter (1). These tables indicate the clear effect that the BARCODE data standard is having on the community's willingness to provide more complete data documentation.

Table 1. Summary of metadata for GPS coordinates and voucher specimens associated with all data records.
Categories of data records | Total number of GenBank records | With Latitude/Longitude | With Voucher or Culture Collection Specimen IDs
BARCODE | 347,349 | 286,975 (83%) | 347,077 (~100%)
All COI | 751,955 | 365,949 (49%) | 531,428 (71%)
All 16S | 4,876,284 | 461,030 (9%) | 138,921 (3%)
All cytb | 239,796 | 7,776 (3%) | 84,784 (35%)

Table 2. Summary of metadata for GPS coordinates and voucher specimens associated with data records submitted between 1 July 2011 and 15 June 2013.
Categories of data records | Total number of GenBank records | With Latitude/Longitude | With Voucher or Culture Collection Specimen IDs
BARCODE | 160,615 | 132,192 (82%) | 160,615 (100%)
All COI | 302,507 | 166,967 (55%) | 231,462 (77%)
All 16S | 1,535,364 | 232,567 (15%) | 49,150 (3%)
All cytb | 74,631 | 2,920 (4%) | 24,386 (33%)
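The percentages in both tables follow directly from the raw counts. As a sanity check, a short Python snippet (with the counts transcribed from Table 1 above) reproduces the reported figures:

```python
# Counts transcribed from Table 1 (all data records, as of 1 October 2013):
# marker: (total records, with lat/long, with voucher/culture IDs)
table1 = {
    "BARCODE": (347_349, 286_975, 347_077),
    "All COI": (751_955, 365_949, 531_428),
    "All 16S": (4_876_284, 461_030, 138_921),
    "All cytb": (239_796, 7_776, 84_784),
}

def pct(part, whole):
    """Percentage rounded to the nearest integer, as reported in the tables."""
    return round(100 * part / whole)

for marker, (total, latlong, voucher) in table1.items():
    print(f"{marker}: {pct(latlong, total)}% lat/long, "
          f"{pct(voucher, total)}% vouchered")
```

Running this gives 83% georeferenced BARCODE records versus 49% for all COI, 9% for 16S and 3% for cytb, matching the table.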

The DNA barcoding community's data standard is demonstrating two positive trends: better documentation of specimens in natural history collections, and new connectivity between databases of species occurrences and DNA sequences. We believe that these trends will become standard practices in the coming years as more researchers, funders, publishers and reviewers acknowledge the value of, and begin to enforce compliance with, the BARCODE data standard and related minimum information standards for marker genes (5).

  1. National Museum of Natural History, Smithsonian Institution, Washington, DC 20013–7012, USA.
  2. University of Guelph, Ontario, Canada
  3. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA


  1. Marques, A. C., Maronna, M. M., & Collins, A. G. (2013). Putting GenBank Data on the Map. Science, 341(6152), 1341. doi:10.1126/science.341.6152.1341-a
  2. Consortium for the Barcode of Life (2009).
  3. Data in Tables 1 and 2 were drawn from GenBank [data as of 1 October 2013].
  4. Barcode of Wildlife Project (2013).
  5. Yilmaz, P., Kottmann, R., Field, D., Knight, R., Cole, J. R., Amaral-Zettler, L., Gilbert, J. A., et al. (2011). Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnology, 29(5), 415–420. doi:10.1038/nbt.1823

Wednesday, December 04, 2013

Towards BioStor articles marked up using Journal Archiving Tag Set

A while ago I posted BHL to PDF workflow, a sketch of a workflow for generating clean, searchable PDFs from Biodiversity Heritage Library (BHL) content.

I've made some progress on putting this together, as well as expanded the goal somewhat. In fact, there are several goals:
  1. BioStor articles need to be archived somewhere. At the moment they live on my server, and metadata is also served by BHL (as the "parts" you see in a scanned volume). In the long term, PubMed Central may be a possibility (BHL essentially becomes a publisher). Imagine PubMed Central becoming the primary archival repository for biodiversity literature.
  2. BioStor articles could be more useful if the OCR text was cleaned up and marked up (e.g., highlighting taxon names, localities, extracting citations, etc.).
  3. If BioStor articles were marked up to the same extent as ZooKeys then we could use tools developed for ZooKeys (see Towards an interactive taxonomic article: displaying an article from ZooKeys) for a richer reading experience.
  4. Cleaned OCR text could also be used to generate searchable PDFs, which are still the most popular way for people to read articles (see Why do scientists tend to prefer PDF documents over HTML when reading scientific journals?). BioStor already generates PDFs, but these are simply made by wrapping page images in a PDF. Searchable PDFs would be much friendlier.

For BioStor articles to be archived in PubMed Central they would need to be marked up using the Journal Archiving and Interchange Tag Suite (formerly the NLM DTDs). This is the markup used by many publishers, and also the tag suite that TaxPub builds upon.

The idea of having BioStor marked up in JATS is appealing, but on the face of it impossible because all we have is page scans and some pretty ropey OCR. However, because the NLM has also been heavily involved in scanning the historical literature, it is used to dealing with scanned content, and JATS can accommodate articles ranging from bare scans to fully marked up text. For example, take a look at the article "Microsporidian encephalitis of farmed Atlantic salmon (Salmo salar) in British Columbia" which is in PubMed Central (PMC1687123). PMC has basic metadata for the article, scans of the pages, and two images extracted from those pages. This is pretty much what BioStor already has (minus the extracted images).
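To give a feel for what a scan-level JATS record involves, here is a sketch that builds a bare-bones JATS-style article from basic metadata using Python's standard library. The element names (front, journal-meta, article-meta, fpage, lpage and so on) come from the tag suite, but the function name and the placeholder metadata are mine, and a real archival record would also need the JATS DTD declaration, identifiers, contributors, and references to the page scans:

```python
import xml.etree.ElementTree as ET

def minimal_jats(journal, title, volume, fpage, lpage):
    """Build a bare-bones JATS-style <article> for a scanned article.

    Only a sketch: element names follow the Journal Archiving and
    Interchange Tag Suite, but a real record needs much more (DTD
    declaration, article ids, contributors, and the page scans
    referenced from the body).
    """
    article = ET.Element("article", {"article-type": "research-article"})
    front = ET.SubElement(article, "front")

    journal_meta = ET.SubElement(front, "journal-meta")
    title_grp = ET.SubElement(journal_meta, "journal-title-group")
    ET.SubElement(title_grp, "journal-title").text = journal

    article_meta = ET.SubElement(front, "article-meta")
    atitle_grp = ET.SubElement(article_meta, "title-group")
    ET.SubElement(atitle_grp, "article-title").text = title
    ET.SubElement(article_meta, "volume").text = volume
    ET.SubElement(article_meta, "fpage").text = fpage
    ET.SubElement(article_meta, "lpage").text = lpage
    return article

# Placeholder metadata for illustration, not a real BioStor record.
doc = minimal_jats("An Example Journal", "An example scanned article",
                   "12", "34", "56")
print(ET.tostring(doc, encoding="unicode"))
```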

With this in mind, I dusted off some old code, put it together and created an example of the first baby steps towards BioStor and JATS. The code is in github, and there is a live example here.

The example takes BioStor article 65706, converts the metadata to JATS, links in the page scans, and also extracts images from the page scans based on data in the ABBYY OCR files. I've also generated HTML from the DjVu files, and this HTML includes hOCR tags that embed information about the OCR text. This format can be edited by tools such as Jim Garrison's moz-hocr-edit (discussed in Correcting OCR using hOCR in Firefox). This HTML can be processed to output a PDF that includes the page scans but also has the OCR text as "hidden text" so the reader can search for phrases, or copy and paste the text (try the PDF for article 65706).
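hOCR is just ordinary HTML with the OCR information carried in class and title attributes: each recognised word is typically a span with class "ocrx_word" whose title holds a "bbox" with page coordinates. A sketch of pulling word text and bounding boxes out of such a file with Python's standard library (the sample fragment here is invented for illustration, but the ocrx_word/bbox conventions are from the hOCR format):

```python
from html.parser import HTMLParser

class HOCRWords(HTMLParser):
    """Collect (text, bbox) pairs from hOCR 'ocrx_word' spans."""

    def __init__(self):
        super().__init__()
        self.words = []      # list of (word, (x0, y0, x1, y1))
        self._bbox = None    # bbox of the span we are inside, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and "ocrx_word" in a.get("class", ""):
            # title looks like: "bbox 100 120 180 145; x_wconf 93"
            for part in a.get("title", "").split(";"):
                fields = part.split()
                if fields and fields[0] == "bbox":
                    self._bbox = tuple(int(v) for v in fields[1:5])

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None

# A made-up hOCR fragment for illustration.
sample = (
    '<span class="ocrx_word" title="bbox 10 10 60 30; x_wconf 95">Salmo</span> '
    '<span class="ocrx_word" title="bbox 65 10 110 30">salar</span>'
)
parser = HOCRWords()
parser.feed(sample)
print(parser.words)  # [('Salmo', (10, 10, 60, 30)), ('salar', (65, 10, 110, 30))]
```

It is exactly this text-plus-coordinates pairing that makes both editing tools like moz-hocr-edit and "hidden text" searchable PDFs possible.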

I've put the HTML (and all the XML and images) in github, so one obvious model for working on an article is to put it into a github repository, push any edits made to the repository, then push that to a web server that displays the articles.

There are still a lot of rough edges, and I think we can build nicer interfaces than moz-hocr-edit (e.g., using the "contenteditable" attribute in the HTML), although moz-hocr-edit has the nice feature of being able to save the edits straight back to the HTML file (saving edited HTML to disk is a non-trivial task in most web browsers). I also need to add the code for building the initial JATS file (currently this is hidden on the BioStor server). There are also issues about PDF quality. At the moment I output black and white PNGs, which look nice and clean but can mangle plates and photos. I need to tweak that aspect of the process.

One application of these tools would be to take a single journal and convert all of its BioStor articles into JATS, then make them available for people to further clean and mark up as needed. There is an extraordinary amount of information locked away in this literature; it would be nice if we made better use of that treasure trove.