Friday, September 30, 2016

Guest post: It's 2016 and your data aren't UTF-8 encoded?

The following is a guest post by Bob Mesibov.

According to w3techs, seven out of every eight websites in the Alexa top 10 million are UTF-8 encoded. This is good news for us screenscrapers, because it means that when we scrape data into a UTF-8 encoded document, the chances are good that all the characters will be correctly encoded and displayed.

It's not entirely good news, though, for two reasons.

In the first place, one out of eight websites is encoded with some feeble default like ISO-8859-1, which supports even fewer characters than the closely related windows-1252. Those sites will lose some widely-used punctuation when read as UTF-8, unless the webpage has been carefully composed with the HTML equivalents of those characters. You're usually safe (but see below) with big online sources like Atlas of Living Australia (ALA), APNI, CoL, EoL, GBIF, IPNI, IRMNG, NCBI Taxonomy, The Plant List and WoRMS, because these declare a UTF-8 charset in a meta tag in webpage heads. (IPNI's home page is actually in ISO-8859-1, but its search results are served as UTF-8 encoded XML.)

But a second problem is that just because a webpage declares itself to be UTF-8, that doesn't mean every character on the page sings from the Unicode songbook. Very odd characters may have been pulled from a database and written onto the page as-is. In ALA I recently found an ancient rune — the High Octet Preset control character (HOP, hex 81):

http://biocache.ala.org.au/occurrences/6191ca90-873b-44f8-848d-befc29ad7513 http://biocache.ala.org.au/occurrences/5077df1f-b70a-465b-b22b-c8587a9fb626

HOP replaces ü on these pages and is invisible in your browser, but a screenscrape will capture the HOP and put SchHOPrhoff in your UTF-8 document.
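How does a HOP get into data in the first place? One plausible route (a sketch, not a claim about ALA's actual pipeline) is text exported from an old DOS code page, where 0x81 is the single-byte code for ü:

```python
# In the legacy DOS code pages 437/850, ü is stored as the single byte 0x81.
# Pass those bytes on unconverted and a latin-1 (or similar) reader sees the
# invisible C1 control character U+0081, "High Octet Preset".
name = "Schürhoff"                       # the surname mangled in the records above
legacy_bytes = name.encode("cp437")      # ü -> byte 0x81
print(legacy_bytes)                      # b'Sch\x81rhoff'

mangled = legacy_bytes.decode("latin-1")
print([hex(ord(c)) for c in mangled])    # ... '0x81' ... i.e. the HOP
```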

Another example of ALA's fidelity to its sources is its coding of the degree symbol, which is a single-byte character (hex b0) in windows-1252, e.g. in Excel spreadsheets, but a two-byte sequence (hex c2 b0) in UTF-8. In this record, for example:

http://biocache.ala.org.au/occurrences/5e3a2e05-1e80-4e1c-9394-ed6b37441b20

the lat/lon was supplied (says ALA) as 37Â°56'9.10"S 145Â° 0'43.74"E. Or was it? The lat/lon could have started out as 37°56'9.10"S 145° 0'43.74"E in UTF-8. Somewhere along the line the UTF-8 was read as windows-1252, the spurious Â characters were generated, and the result is geospatial gibberish.
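To make the mechanism concrete, here is a minimal sketch (in Python, not ALA's actual processing chain) of how a UTF-8 degree sign read with the wrong charset turns into "Â°":

```python
# The UTF-8 encoding of ° is the two bytes C2 B0; read those bytes with
# windows-1252 and the single character becomes the two characters "Â°".
latlon = '37°56\'9.10"S 145° 0\'43.74"E'
utf8_bytes = latlon.encode("utf-8")          # ° becomes b'\xc2\xb0'
misread = utf8_bytes.decode("windows-1252")  # wrong charset on the way back
print(misread)                               # 37Â°56'9.10"S 145Â° 0'43.74"E
```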

When a program fails to understand a character's encoding, it usually replaces the mystery character with a ?. A question mark is a perfectly valid character in commonly used encodings, which means the interpretation failure gets propagated through all future re-uses of the text, both on the Web and in data dumps. For example,

http://biocache.ala.org.au/occurrences/dfbbc42d-a422-47a2-9c1d-3d8e137687e4

gives N?crophores for Nécrophores. The history of that particular character failure has been lost downstream, as is the case for myriads of other question marks in online biodiversity data.

In my experience, the situation is much worse in data dumps from online sources. It's a challenge to find a dump without question marks acting as replacement characters. Many of these question marks appear in author and place names. Taxonomists with eastern European names seem to fare particularly badly, sometimes with more than one character variant appearing in the same record, as in the Australian Faunal Directory (AFD) offering of Wêgrzynowicz, W?grzynowicz and Węgrzynowicz for the same coleopterist. Question marks also frequently replace punctuation, such as n-dashes, smart quotes and apostrophes (e.g. O?Brien (CoL) and L?Échange and d?Urville (AFD)).

Character encoding issues create major headaches for data users. It would be a great service to biodiversity informatics if data managers compiled their data in UTF-8 encoding or took the time to convert to UTF-8 and fix any resulting errors before publishing to the Web or uploading to aggregators.
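A minimal sketch of the kind of pre-publication check this implies (the file name and the windows-1252 fallback are assumptions for illustration; real pipelines will differ):

```python
# Verify that a data dump is valid UTF-8, convert it if not, and flag the
# replacement characters and stray C1 controls that signal earlier failures.
import re

def check_and_convert(path, fallback="windows-1252"):
    raw = open(path, "rb").read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback)          # assume a legacy 8-bit encoding
    # U+FFFD and C1 controls usually mean something went wrong upstream
    suspects = re.findall(r"[\ufffd\u0080-\u009f]", text)
    if suspects:
        print(path, ":", len(suspects), "suspicious characters to review")
    with open(path + ".utf8", "w", encoding="utf-8") as out:
        out.write(text)

check_and_convert("occurrences.csv")         # hypothetical dump file
```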

This may be a big ask, given that at least one data manager I've talked to had no idea how characters were encoded in the institution's database. But as ALA's Miles Nicholls wrote back in 2009, "Note that data should always be shared using UTF-8 as the character encoding". Biodiversity informatics is a global discipline and UTF-8 is the global standard for encoding.

Readers needing some background on character encoding will find this and especially this helpful, and a very useful tool to check for encoding problems in small blocks of text is here.

GBIF 2016 Ebbe Nielsen Challenge entries

The GBIF 2016 Ebbe Nielsen Challenge has received 15 submissions, and you can view them here. Unlike last year, when the topic was completely open, for the second challenge we've narrowed the focus to "Analysing and addressing gaps and biases in primary biodiversity data". As with last year, judging is limited to the jury (of which I'm a member); however, anyone interested in biodiversity informatics can browse the submissions. Although you can't leave comments directly on the submissions within the GBIF Challenge pages, each submission also appears on the portfolio page of the person/organisation that created the entry, so you can leave comments there (follow the link at the bottom of the page for each submission to see it on the portfolio page).

Wednesday, September 07, 2016

Guest post: Absorbing task or deranged quest: an attempt to track all genus names ever published

This guest post by Tony Rees describes his quest to track all genus names ever published (plus a subset of the species…).

A “holy grail” for biodiversity informatics is a suitably quality controlled, human- and machine-queryable list of all the world’s species, preferably arranged in a suitable taxonomic hierarchy such as kingdom-phylum-class-order-family-genus or similar. To make it truly comprehensive we need fossils as well as extant taxa (dinosaurs as well as dinoflagellates) and to cover all groups from viruses to vertebrates (possibly prions as well, which are, well, virus-like). Linnaeus had some pretty good attempts in his day, and in the internet age the challenge has been taken up by a succession of projects such as the “NODC Taxonomic Code” (a precursor to ITIS, the Integrated Taxonomic Information System - currently 722,000 scientific names), the Species 2000 consortium, and the combined ITIS+SP2000 product “Catalogue of Life”, now in its 16th annual edition, with current holdings of 1,635,250 living and 5,719 extinct valid (“accepted”) species, plus an additional 1,460,644 synonyms (information from http://www.catalogueoflife.org/annual-checklist/2016/info/ac). This looks pretty good until one realises that as well as the estimated “target” of 1.9 million valid extant species there are probably a further 200,000-300,000 described fossils, all with maybe as many synonyms again. That makes a grand total of at least 5 million published species names to acquire into a central “quality assured” system, a task which will take some time yet.

Ten years ago, in 2006, I participated in a regular meeting of the steering committee for OBIS, the Ocean Biogeographic Information System which, like GBIF, aggregates species distribution data (for marine species in this context) from multiple providers into a single central search point. OBIS was using the Catalogue of Life (CoL) as its “taxonomic backbone” (the method for organising its data holdings) and, again like GBIF, had come up against the problem of what to do with names not recognised in the then-latest edition of CoL, which was at the time less than 50% complete (information on 884,552 species). A solution occurred to me: since genus names are maybe only 10% as numerous as species names, and every species name includes its containing genus as the first portion of its binomial name (thanks, Linnaeus!), an all-genera index might be a tractable task (within a reasonable time frame) where an all-species index was not, and would still be useful for allocating incoming “not previously seen” species names to an appropriate position in the taxonomic hierarchy. OBIS, in particular, also wished to know whether species (or more exactly, their parent genera) were marine (to be displayed) or nonmarine (to be hidden), and similarly whether they were extant or fossil. Sensing a challenge, I offered to produce such a list, in my mind estimating that it might require 3 months full-time, or the equivalent of 6 months in part-time effort, to complete and supply back to OBIS for their use.

To cut a long story short… the project, which I christened the Interim Register of Marine and Nonmarine Genera or IRMNG (originally at CSIRO in Australia, now hosted on its own domain “www.irmng.org” and located at VLIZ in Belgium), has successfully acquired over 480,000 published genus names, including valid names, synonyms and a subset of published misspellings, all allocated to families (most) or higher ranks (the remainder) in an internally coherent taxonomic structure. Most have marine/nonmarine and extant/fossil flags, and all have the source from which I acquired them, sources for the flags, and more; for perhaps 50% of genera there are also lists of associated species, from wherever it has been convenient to acquire them (Catalogue of Life 2006 being a major source, but many others also used). My estimated 6 months has turned into 10 years and counting, but I do figure that the bulk of the basic “names acquisition” has been done for all groups (my estimate: over 95% complete), and it is rare (although not completely unknown) for me to come across genus names not yet held, at least for the period 1753-2014 which is the present coverage of IRMNG. Present effort is therefore concentrated on correcting internal errors and inconsistencies, and on upgrading the taxonomic placement (to family) for the around 100,000 names where this is not yet held (also establishing the valid name/synonym status of a similar number of presently “unresolved” generic names).

With the move of the system from Australia to VLIZ, completed within the last couple of months, there is the facility to utilise all of the software and features developed at VLIZ that currently run WoRMS, the World Register of Marine Species, and its many associated subsidiary databases, as well as (potentially) to look at forming a distributed editing network for IRMNG in the future, as is already the case for WoRMS, presuming that others see value in maintaining IRMNG as a useful resource, e.g. for taxonomic name resolution, detection of potential homonyms both within and across kingdoms, and generally acting as a hierarchical view of “all life” to at least genus level. A recently implemented addition to IRMNG is to hold ION identifiers (also used in BioNames) for the subset of names where ION holds the original publication details, enabling “deep links” to both ION and BioNames wherein the original publication can often be displayed, as previously described elsewhere in this blog. Similar identifiers for plants are not yet held in the system but could be (for example, Index Fungorum identifiers for fungi), for cases where the potential linked system adds value in giving, for example, original publication details and onward links to the primary literature.

All in all I feel that the exercise has been of value not only to OBIS (the original “client”) but also to other informatics systems such as GBIF, Encyclopedia of Life, Atlas of Living Australia, Open Tree of Life and others, who have all taken advantage of IRMNG data to add to their systems, either for the marine/nonmarine and extant/fossil flags or as an addition to their primary taxonomic backbones, or both. In addition it has allowed me, the founding IRMNG compiler, to “scratch the taxonomic itch” and finally flesh out what is meant by statements that a certain group contains x families or y genera, and what these actually might be. Finally, many users of the system via its web interface have commented over time on how useful it is to be able to input “any” name, known or unknown, with a good chance that IRMNG can tell them something about the genus (or the possible genus options, in the case of homonyms) and, in many cases, the species as well, and can discriminate extant from fossil taxon names, something not yet offered to any significant extent by the current Catalogue of Life.

Readers of iPhylo are encouraged to try IRMNG as a “taxonomic name resolution service” by visiting www.irmng.org and are, of course, welcome to contact me with suggestions of missed names (concentrating at genus level at the present time) or any other ideas for improvement (which I can then submit for consideration to the team at VLIZ who now provide the technical support for the system).

Tuesday, August 30, 2016

GRBio: A Call for Community Curation - what community?

David Schindel and colleagues recently published a paper in the Biodiversity Data Journal:

Schindel, D., Miller, S., Trizna, M., Graham, E., & Crane, A. (2016). The Global Registry of Biodiversity Repositories: A Call for Community Curation. Biodiversity Data Journal, 4, e10293. http://doi.org/10.3897/bdj.4.e10293

The paper is a call for the community to help grow a database (GRBio) on biodiversity repositories, a database that "will require community input and curation".

Reading this, I'm struck by the lack of a clear sense of what that community might be. In particular: who is this database for, and who is most likely to build it? I suspect that these are two different sets of people.

Who is it for?

It strikes me that the primary target for GRBio is people who care about cleaning up and linking data. This is a very small set of people. While cleaned data is nice, and cleaned and linked data is great, by itself it's not much use until it finds its way into useful tools. Why would taxonomists, curators, and other people working with biodiversity data care about GRBio? What can it give them? Ultimately, we'd like things such as the ability to find any specimen in the world online using simply its museum collection code. We'd like to track usage of that code in other databases, such as GenBank and BOLD, and in the primary literature. These are all nice things, but they won't happen simply because we have a curated list of natural history collections.

Who will curate it?

Arguably there are only two active communities who care about the contents of GRBio on a scale sufficient to actually contribute. One is GBIF, which is building its own registry of collections as more and more natural history institutions move their collections online. GBIF's registry is primarily for digital access points to collection data, which don't necessarily map readily to the physical collections listed by GRBio. If GRBio is to be relevant, it needs to have mappings between its data and GBIF's.

But the community that I suspect will really care about this, to the point that they'd actively engage in editing the data, is not the biodiversity community. Rather, it's people who edit Wikipedia and Wikidata.

I was at one of the GRBio workshops and gave a short presentation, which included this slide:

If you search for a major museum on Google, on the right you will often see a rich "knowledge panel" giving much of the information that GRBio wants to capture (museum name, location, etc.), often with a link to the Wikipedia page for that institution (see http://g.co/kg/m/01t372 for a detailed view of the knowledge panel for the NHM). GRBio can't compete with Wikipedia for richness of content (just think of all the Wikipedia pages in languages other than English). Google's database is mostly hidden, but we can get some of the same data from Wikidata, e.g. the Natural History Museum is entity Q309388 on Wikidata.
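To see how little plumbing this needs, the public Wikidata API will return the full entity record for Q309388 (labels in many languages, identifiers, coordinates and so on); this is just an illustration, not a proposed GRBio implementation:

```python
# Fetch the Wikidata entity for the Natural History Museum (Q309388) and
# print a few of its multilingual labels plus a count of its statements.
import json
from urllib.request import urlopen

url = "https://www.wikidata.org/wiki/Special:EntityData/Q309388.json"
entity = json.load(urlopen(url))["entities"]["Q309388"]

for lang in ("en", "de", "ja"):
    label = entity["labels"].get(lang, {}).get("value")
    print(lang, label)
print("number of statement groups:", len(entity["claims"]))
```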

From my perspective, the smart move is not to appeal to an overstretched community of biodiversity researchers, many of whom are suffering from "project fatigue" as yet more acronyms compete for their attention. Instead, position the project as adding to the already existing database of natural history institutions that is growing in Wikidata. GRBio could either link to the Wikidata and Wikipedia pages for institutions, or simply move its data editing efforts to Wikidata and have GRBio be (at most) a search interface to that data. The notion of having a separate database for the world's collections might not be the best way to achieve GRBio's goals. A lot of people involved in both Wikipedia and Wikidata care about cultural institutions (of which natural history museums and herbaria are examples); those are the people GRBio should be engaging with.

Friday, August 26, 2016

Displaying original species descriptions in BioNames

The goal of my BioNames project is to link every taxonomic name to its original description (initially focussing on animal names). The rationale is that taxonomy is based on evidence, and yet most of this evidence is buried in a non-digitised and/or hard to find literature. Surfacing this information not only makes taxonomic evidence accessible (see Surfacing the deep data of taxonomy), it also surfaces a lot of basic biological information. In many cases the original taxonomic description will be an important source of information about what a species looks like, where it lives, and what it does.

To date I've focussed on linking names to publications, such as articles, on the grounds that this is the unit of citation in science. It's also the unit most often digitised and assigned an identifier, such as a DOI. But often taxonomists cite not an article but the individual page on which the description appears. In web-speak, taxonomists cite "fragment identifiers". Page-level identifiers are not often encountered in the digital world, in part because many digital representations don't have "pages". But this doesn't mean that we can't have identifiers for parts of an article, for example in Fragment Identifiers and DOIs Martin Fenner gives examples of ways to link to specific parts of an online article. His examples work if the article is displayed as HTML. If we are working with XML (say, for a journal published by Pensoft), then we can use XPath to refer to sections of a document. Ultimately it would be nice to have stable identifiers for document fragments linked to taxonomic names, so that we can readily go from name to description (even better if that description was in machine-readable form). You could think of these as locators for "taxonomic treatments", e.g. Miller et al. 2015.
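As a rough illustration of the XPath idea (the file name and element structure below are hypothetical, not Pensoft's actual markup), a document fragment can be addressed and extracted like this:

```python
# A rough illustration of addressing a document fragment with XPath.
# "article.xml" and the <sec>/<title> structure are hypothetical.
from lxml import etree

doc = etree.parse("article.xml")
# e.g. the section whose title mentions the newly described species
fragments = doc.xpath('//sec[contains(title, "Belobranchus segura")]')
for sec in fragments:
    print(etree.tostring(sec, pretty_print=True).decode()[:300])
```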

As a quick and dirty approach to this I've reworked BioNames to be able to show the page where a species name is first published. This only works if a number of conditions are met:

  • The BioName database has the page number ("micro reference") for the name.
  • BioNames has the full text for the article, either from BioStor or a PDF.
  • The taxonomic name has been found in that text (e.g., by the Global Names GNRD service).

If these conditions are met, then BioNames will display the page, like this example (Belobranchus segura Keith, Hadiaty & Lord 2012).

Both the page image and OCR text (if available) are displayed. This is a first step towards (a) making stable identifiers available for these pages, and (b) making the text accessible for machine reading.

For some more examples, try Heterophasia melanoleuca kingi Eames 2002 (bird), Echinoparyphium anatis Fischthal & Kuntz 1976 (trematode), Bathymodiolus brooksi Gustafson, Turner, Lutz & Vrijenhoek 1998 (bivalve), Amolops cremnobatus Inger & Kottelat 1998 (frog), Leptothorax caesari Espadaler 1997 (ant), and Daipotamon minos Ng & Trontelj 1996 (crab).

Thursday, August 18, 2016

GBIF Challenge: €31,000 in prizes for analysing and addressing gaps and biases in primary biodiversity data

In a classic paper Boggs (1949) appealed for an “atlas of ignorance”, an honest assessment of what we know we don’t know:

Boggs, S. W. (1949). An Atlas of Ignorance: A Needed Stimulus to Honest Thinking and Hard Work. Proceedings of the American Philosophical Society, 93(3), 253–258. Retrieved from http://www.jstor.org/stable/3143475

This is the theme of this year's GBIF Challenge: Analysing and addressing gaps and biases in primary biodiversity data. "Gaps" can be gaps in geographic coverage, taxonomic group, or types of data. GBIF is looking for ways to assess the nature of the gaps in the data it is aggregating from its network of contributors.

How to enter

Details on how to enter are on the Challenge website, deadline is September 30th.

Ideas

One approach to gap analysis is to compare what we expect to see with what we actually have. For example, we might take a “well-known” group of organisms and use that to benchmark GBIF’s data coverage. A drawback is that the “well-known” organisms tend to be the usual suspects (birds, mammals, fish, etc.), and there is the issue of whether the chosen group is a useful proxy for other taxa. Another approach is to base the estimate of ignorance on the data itself. For example, OBIS has computed Hurlbert's index of biodiversity for its database (e.g. http://data.unep-wcmc.org/datasets/16). Can we scale these methods to the 600+ million records in GBIF? There are some clever ways of using resampling methods (such as the bootstrap) on large data sets that might be relevant, see http://www.unofficialgoogledatascience.com/2015/08/an-introduction-to-poisson-bootstrap_26.html.
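To make the bootstrap-at-scale idea concrete, here is a small sketch of the Poisson bootstrap with simulated data; the point is that every replicate is computed in a single streaming pass over the records, which is what makes the approach plausible for hundreds of millions of occurrences:

```python
# Poisson bootstrap: give each record an independent Poisson(1) weight in
# every replicate, so all replicates can be accumulated in one streaming pass.
# The "records" here are simulated stand-ins for a very large dataset.
import numpy as np

rng = np.random.default_rng(42)
records = rng.normal(loc=10.0, scale=3.0, size=1_000_000)
n_replicates = 200

sums = np.zeros(n_replicates)
counts = np.zeros(n_replicates)
chunk_size = 100_000
for start in range(0, records.size, chunk_size):        # single pass, in chunks
    chunk = records[start:start + chunk_size]
    weights = rng.poisson(1.0, size=(n_replicates, chunk.size))
    sums += weights @ chunk
    counts += weights.sum(axis=1)

boot_means = sums / counts
print("mean:", records.mean(),
      "bootstrap 95% interval:", np.percentile(boot_means, [2.5, 97.5]))
```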

Another approach might be to compare different datasets for the same taxa, particularly if one data set is not in GBIF. Or perhaps we can compare datasets for the same taxa collected by different methods.

Or we could look at taxonomic gaps. In an earlier post The Zika virus, GBIF, and the missing mosquitoes I noted that GBIF's coverage of vectors of the Zika virus was very poor. How well does GBIF cover vectors and other organisms relevant to human health? Maybe we could generalise this to explore other taxa. It might, for example, be interesting to compare degree of coverage for a species with some measure of the "importance" of that species. Measures of importance could be based on, say, number of hits in Google Scholar for that species, size of Wikipedia page (see Wikipedia mammals and the power law), etc.
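As a sketch of the sort of comparison meant here (the species list is purely illustrative, and the GBIF occurrence API is used only to pull record counts), coverage per species could be tabulated and set against whatever importance measure is chosen:

```python
# Pull GBIF occurrence counts for a few disease-vector mosquitoes as a crude
# coverage measure; an "importance" score (literature hits, Wikipedia page
# size, etc.) would go in a second column. The species list is illustrative.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

species = ["Aedes aegypti", "Aedes albopictus", "Anopheles gambiae"]
for name in species:
    params = urlencode({"scientificName": name, "limit": 0})
    url = "https://api.gbif.org/v1/occurrence/search?" + params
    count = json.load(urlopen(url))["count"]
    print(name, ":", count, "occurrence records in GBIF")
```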

Gaps might also be gaps in data completeness, quality, or type.

Summary

This post has barely scratched the surface of what is possible. But I think one important thing to bear in mind is that the best analyses of gaps are those that lead to "actionable insights". In other words, if you are going to enter the challenge (and please do, it's free to enter and there's money to be won), how does your entry help GBIF and the wider biodiversity community decide what to do about gaps?

BioStor updates: nicer map, reference matching service

BioStor now has 150,000 articles. When I wrote a paper describing how BioStor worked it had 26,784 articles, so things have progressed somewhat!

I continue to tweak the interface to BioStor, trying different ways to explore the articles.

Spatial search

I've tweaked spatial search in BioStor. As blogged about previously I replaced the Google Maps interface with Leaflet.js, enabling you to draw a search area on the map and see a set of articles that mention that area. I've changed the base map to the prettier "terrain" map from Stamen, and added back the layer showing all the localities in BioStor. This gives you a much better sense of the geographic coverage in BioStor. This search interface still needs work, but is a fun way to discover content.


Reference matching

In the "Labs" section of the web site I've added a demonstration of BioStor's reconciliation service. This service is based on the Freebase reconciliation service used by tools such as OpenRefine, see Reconciliation Service API. The goal is to demonstrate a simple way to locate references in BioStor, simply paste references, one per line, click Match and BioStor will attempt to find those references for you.

This service is really intended to be used by tools like OpenRefine, but this web page helps me debug the service.
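For anyone who wants to script it, a reconciliation call is just a small JSON payload POSTed to the service; the sketch below follows the Reconciliation Service API conventions, with the endpoint URL a placeholder rather than BioStor's actual address:

```python
# POST one reconciliation query in the OpenRefine/Freebase API format.
# ENDPOINT is a placeholder; see BioStor's Labs page for the real URL.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

ENDPOINT = "https://biostor.org/reconciliation"          # hypothetical URL
queries = {"q0": {"query": "Paste a reference citation here"}}
payload = urlencode({"queries": json.dumps(queries)}).encode("utf-8")
response = json.load(urlopen(ENDPOINT, data=payload))
for match in response["q0"]["result"]:
    print(match["score"], match["name"], match["id"])
```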

Suggestions?

BioStor is part labour of love, part albatross around my neck. I'm always open to suggestions for improvements, or for articles to add (but remember that all content must first have been scanned and be available in the Biodiversity Heritage Library). If you are involved in publishing a journal and are interested in getting it into BHL, please get in touch.

Wednesday, August 17, 2016

Containers, microservices, and data

Some notes on containers, microservices, and data. The idea of packaging software into portable containers and running them either locally or in the cloud is very attractive (see Docker). Some use cases I'm interested in exploring:

Microservices

In Towards a biodiversity knowledge graph (doi:10.3897/rio.2.e8767) I listed a number of services that are essentially self contained, such as name parsers, reconciliation tools, resolvers, etc. Each of these could be packaged up and made into containers.

Databases

We can use containers to package database servers, such as CouchDB, ElasticSearch, and triple stores. Using containers means we don't need to go through the hassle of installing the software locally. Interested in RDF? Spin up a triple store, play with it, then switch it off if you decide it's not for you. If it proves useful, you can move it to the cloud and scale up (e.g., sloppy.io).
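As a sketch of how low the barrier is (using the Docker SDK for Python; the image tag and port mapping are just examples, and recent CouchDB images may also need admin credentials set):

```python
# Spin up a disposable CouchDB container, poke at it, then throw it away.
# Requires Docker plus the "docker" Python SDK.
import docker

client = docker.from_env()
couch = client.containers.run(
    "couchdb:latest",                 # example image tag
    name="scratch-couchdb",
    ports={"5984/tcp": 5984},         # CouchDB API on http://localhost:5984
    detach=True,
)
print(couch.name, couch.status)

# ...play with it, and if you decide it's not for you:
couch.stop()
couch.remove()
```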

Data

A final use case is to put individual datasets in a container. For example, imagine that we have a large Darwin Core Archive. We can distribute this as a simple zip file, but you can't do much with it unless you have code to parse Darwin Core. But imagine we combine that dataset with a simple visualisation tool, such as VESpeR (see doi:10.1016/j.ecoinf.2014.08.004). Users interested in the data could then play with the data without the overhead of installing specialist software. In a sense, the data becomes an app.

Friday, August 12, 2016

Spatial search in BioStor

I've been experimenting with simple spatial search in BioStor, as shown in the demo below. If you go to the map on BioStor you can use the tools on the left to draw a box or a polygon on the map, and BioStor will search its database for articles that mention localities that occur in that region. If you click on a marker you can see the title of the article; clicking on that title takes you to the article itself.

This is all rather crude (and a little slow), but it provides another way to dive into the content I've been extracting from the BHL. One thing I've been doing is looking at protected areas (often marked on Open Street Map), drawing a simple polygon around that area, then seeing if BioStor knows anything about that area.

For the technically minded, this tool is an extension of the Searching GBIF by drawing on a map demo, and uses the geospatial indexing offered by Cloudant.
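A sketch of the kind of geospatial query involved (the account, database, design document and index names are placeholders rather than BioStor's actual configuration, and the g/relation parameters follow Cloudant's geospatial query API as I understand it):

```python
# Query a Cloudant geospatial index for documents intersecting a polygon.
# All names in the URL are placeholders, not BioStor's actual setup.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

base = "https://ACCOUNT.cloudant.com/biostor/_design/geo/_geo/localities"
polygon = ("POLYGON ((144.5 -38.5, 146.0 -38.5, 146.0 -37.5, "
           "144.5 -37.5, 144.5 -38.5))")
params = urlencode({"g": polygon, "relation": "intersects"})
rows = json.load(urlopen(base + "?" + params))["rows"]
for row in rows[:10]:
    print(row["id"])
```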