Thursday, August 14, 2008

BNCOD 2008 Workshop


The proceedings of the BNCOD 2008 Workshop on "Biodiversity Informatics: challenges in modelling and managing biodiversity knowledge" are online. This workshop was held in conjunction with the 25th British National Conference on Databases (BNCOD 2008) at Cardiff, Wales. The papers make interesting reading.

Exploring International Plant Names Index (IPNI) Data using Visualisation by Nicola Nicolson [PDF]
This paper describes visualisation as a means to explore data from the International Plant Names Index (IPNI). Several visualisations are used to display large volumes of data and to help data standardisation efforts. These have potential uses in data mining and in the exploration of taxon concepts.
Nicky explores some visualisations of the IPNI plant name database. Unfortunately only one of these (arguably the least exciting one) is shown in the PDF. The visualisations of citation history using Timeline, and social networks using prefuse are mentioned, but not shown.

Scratchpads: getting biodiversity online, redefining publication by Vince Smith et al. [PDF]
Taxonomists have been slow to adopt the web as a medium for building research communities. Yet, web-based communities hold great potential for accelerating the pace of taxonomic research. Here we describe a social networking application (Scratchpads) that enables communities of biodiversity researchers to manage and publish their data online. In the first year of operation 466 registered users comprising 53 separate communities have collectively generated 110,000 pages within their Scratchpads. Our approach challenges the traditional model of scholarly communication and may serve as a model to other research disciplines beyond biodiversity science.
This is a short note describing Scratchpads, which are built using the Drupal content management system (CMS). Scratchpads provide a simple way for taxonomists to get their content online. Based in large measure on the success of scratchpads, EOL will use Drupal as the basis of their "Lifedesks". There are numerous scratchpads online, although the amount and quality of content is, um, variable.

Managing Biodiversity Knowledge in the Encyclopedia of Life by Jen Schopf et al. [PDF]
The Encyclopedia of Life is currently working with hundreds of Content Providers to create 1.8 million aggregated species pages, consisting of tens of millions of data objects, in the next ten years. This article gives an overview of our current data management and Content Provider interactions.
This is a short note on EOL itself. I've given my views on EOL's progress (or, rather, lack thereof) elsewhere (here, here and here). The first author on this paper has left the project, and at least one of the other authors is leaving. It seems EOL has yet to find its feet (it certainly has no idea of how to use blogs).


Distributed Systems and Automated Biodiversity Informatics: Genomic Analysis and Geographic Visualization of Disease Evolution by Andrew Hill and Robert Guralnick [doi:10.1007/978-3-540-70504-8_28]
A core mission in biodiversity informatics is to build a computing infrastructure for rapid, real-time analysis of biodiversity information. We have created the information technology to mine, analyze, interpret and visualize how diseases are evolving across the globe. The system rapidly collects the newest and most complete data on dangerous strains of viruses that are able to infect human and animal populations. Following completion, the system will also test whether positions in the genome are under positive selection or purifying selection, a useful feature to monitor functional genomic characteristics such as, drug resistance, host specificity, and transmissibility. Our system’s persistent monitoring and reporting of the distribution of dangerous and novel viral strains will allow for better threat forecasting. This information system allows for greatly increased efficiency in tracking the evolution of disease threats.
This paper was one of two contributions chosen to be published in the BNCOD 2008 proceedings ("Sharing Data, Information and Knowledge", doi:10.1007/978-3-540-70504-8, ISBN 978-3-540-70503-1). Rob Guralnick has put a free version online (see his comment below). It describes the very cool system being developed to provide near real time visualisation of disease spread and evolution, and builds on some earlier work published in Systematic Biology (doi:10.1080/10635150701266848).

LSID Deployment in the Catalogue of Life by Ewen Orme et al. [PDF]
In this paper we describe a GBIF/TDWG-funded project in which LSIDs have been deployed in the Catalogue of Life’s Annual and Dynamic Checklist products as a means of identifying species and higher taxa in these large species catalogues. We look at the technical infrastructure requirements and topology for the LSID resolution process and characteristics of the RDF (Resource Description Framework) metadata returned by the resolver. Such characteristics include the use of concepts and relationships taken from the TDWG (Taxonomic Database Working Group) ontology and how a given taxon LSID relates to others including those issued by database providers and those above and below it in the taxonomic tree. Finally we evaluate the project and LSID usage in general. We also look to the future when the CoL LSID infrastructure will have to deal changing taxonomic information, annually in the case of the Annual Checklist and possibly much more frequently in the case of the Dynamic Checklist.

Although I was an early adopter of LSIDs (in my now defunct Taxonomic Search Engine doi:10.1186/1471-2105-6-48 and the very-much alive LSID Tester, doi:10.1186/1751-0473-3-2), I have some reservations about them. The Catalogue of Life uses UUIDs to generate the LSID identifier, which makes for rather ugly looking LSIDs, as David Shorthouse has complained. For example, the LSID for Pinnotheres pisum is urn:lsid:catalogueoflife.org:taxon:ef0ae064-29c1-102b-9a4a-00304854f820:ac2008 (gack). Why these ugly UUIDs? Well, one advantage is that they can be generated in a distributed fashion and remain unique. This would make sense for a project like the Catalogue of Life, which aggregates names from a range of contributors, but in actual fact all the LSIDs at present are of the form "xxxxxxxx-29c1-102b-9a4a-00304854f820", indicating that they are being generated centrally (by MySQL's UUID function, in this case).
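
For anyone who wants to check this for themselves, here is a minimal Python sketch that splits the LSID quoted above into its parts and inspects the UUID. The function names (parse_lsid, uuid_origin) are made up for illustration, and the parsing only assumes the general urn:lsid:authority:namespace:object[:revision] layout; it is not how the Catalogue of Life resolver works. The point is simply that a time-based (version 1) UUID carries a "node" field identifying the generating machine, so identifiers that all share the same node were very likely minted in one place.

```python
import uuid


def parse_lsid(lsid: str) -> dict:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        # The revision component is optional in an LSID.
        "revision": parts[5] if len(parts) > 5 else None,
    }


def uuid_origin(object_id: str) -> tuple:
    """Return (UUID version, node as 12 hex digits) for a UUID string."""
    u = uuid.UUID(object_id)
    return u.version, f"{u.node:012x}"


lsid = "urn:lsid:catalogueoflife.org:taxon:ef0ae064-29c1-102b-9a4a-00304854f820:ac2008"
info = parse_lsid(lsid)
version, node = uuid_origin(info["object"])

print(info["authority"], info["namespace"], info["revision"])
print("UUID version:", version)  # 1 means time-based
print("node:", node)             # identical across LSIDs minted on one machine
```

Running the same check over a batch of Catalogue of Life LSIDs and counting the distinct node values would show how many machines actually generated them.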

Ironically, when I was talking to Frank Bisby earlier this year, he implied that LSIDs would change with each release if the information about a name changed, thus failing to solve the existing, fundamental design flaw in the Catalogue of Life, namely the lack of stable identifiers! So, at first glance we are stuck with hideous-looking identifiers that may be unstable. Hmmm...

Workflow Systems for Biodiversity Researchers: Existing Problems and Potential Solutions by Russel McIver et al. [PDF]
In this paper we discuss the potential that scientific workflow systems have to support biodiversity researchers in achieving their goals. This potential comes through their ability to harness distributed resources and set up complex, multi-stage experiments. However, there remain concerns over the usability of existing workflow systems and research still needs to be done to help match the functionality of the software to the needs of its users. We discuss some of the existing concerns regarding workflow systems and propose three potential interfaces intended to improve workflow usability. We also outline the software architecture that we have adopted, which is designed to make our proposed workflow interface software interoperable across key workflow systems.
Not sure what to make of this paper. Workflows seem to generate an awful lot of publications, and few tools that people actually use.


Visualisation to Aid Biodiversity Studies through Accurate Taxonomic Reconciliation by Martin Graham et al. [doi:10.1007/978-3-540-70504-8_29]
All aspects of organismal biology rely on the accurate identification of specimens described and observed. This is particularly important for ecological surveys of biodiversity, where organisms must be identified and labelled, both for the purposes of the original research, but also to allow reinterpretation or reuse of collected data by subsequent research projects. Yet it is now clear that biological names in isolation are unsuitable as unique identifiers for organisms. Much modern research in ecology is based on the integration (and re-use) of multiple datasets which are inherently complex, reflecting any of the many spatial and temporal environmental factors and organismal interactions that contribute to a given ecosystem. We describe visualization tools that aid in the process of building concept relations between related classifications and then in understanding the effects of using these relations to match across sets of classifications.
This is the second contribution published in the conference proceedings, but there is also a free version available here from the project's blog. The paper describes TaxVis, a project developing visualisation techniques for comparing multiple taxonomic hierarchies.

The paper discusses taxonomic concepts and the difficulty of establishing what a taxonomist meant when they used a particular name. As much as I understand the argument, I can't shake the feeling that obsessing about taxonomic concepts is ultimately a dead end. It won't scale, and in an age of DNA barcoding, it becomes less and less relevant.

Releasing the content of taxonomic papers: solutions to access and data mining by Chris Lyal and Anna Weitzman [PDF]
Taxonomic information is key to all studies of biodiversity. Taxonomic literature contains vast quantities of that information, but it is under-utilised because it is difficult to access, especially by those in biodiverse countries and non-taxonomists. A number of initiatives are making this literature available on the Web as images or even as unstructured text, but while that improves accessibility, there is more that needs to be done to assist users in locating the publication; locating the relevant part of the publication (article, chapter etc) and locating the text or data required within the relevant part of the publication. Taxonomic information is highly structured and automated scripts can be used to mark-up or parse data from it into atomised pieces that may be searched and repurposed as needed. We have developed a schema, taXMLit that allows for mark-up of taxonomic literature in this way. We have also developed a prototype system, INOTAXA that uses literature marked up in taXMLit for sophisticated data discovery.
This is a nice overview of the challenge of extracting information from legacy literature. There are numerous challenges facing this work, including tasks that are trivial for people, such as determining when an article starts and ends, but which are challenging for computers (see Lu et al. doi:10.1145/1378889.1378918, free copy here -- there is a job related to this question available now). Related efforts are the TaxonX markup being used by Plazi. My own view is that for legacy literature heavy markup is probably overkill; decent text mining will be enough. The real challenge is to stop the rot at source, and enable new taxonomic publications to be marked up as part of the authoring and publishing process.

An architecture to approach distributed biodiversity pollinators relational information into centralized portals based on biodiversity protocols by Pablo Salvanha et al. [PDF]
The present biodiversity distributed solution using DiGIR / TAPIR protocols and the Darwincore2 schema has been very valuable in the centralized portals, which that can provide distributed information in a very quickly way. Using the same concept this paper presents an architecture based on the case study of pollinators to bring the centralization of the relational information to those portals. This architecture is based on a technological structure to facilitate the implementation and extraction from the providers of that relational information, and proposes a model to make this information reliable to be used with the present specimens information on the portal database.
This is a short note on extending DarwinCore to include information about pollination relationships. The wisdom of doing this has been questioned (see Roger Hyam's comment on the proposal).

A Pan-European Species-directories Infrastructure (PESI) by Charles Hussey and Yde de Jong [PDF]
This communication introduces the rationale and aims of a new Europe-wide biodiversity informatics project. PESI defines and coordinates strategies to enhance the quality and reliability of European biodiversity information by integrating the infrastructural components of four major community networks on taxonomic indexing, namely those of marine life, terrestrial plants, fungi and animals, into a joint work programme. This will include functional knowledge networks of both taxonomic experts and regional focal points, which will collaborate on the establishment of standardised and authoritative taxonomic (meta-) data. In addition PESI will coordinate the integration and synchronisation of the European taxonomic information systems into a joint e-infrastructure and the creation of a common user-interface disseminating the pan-European checklists and associated user-services results.
This paper describes PESI, yet another mega-science project in biodiversity, complete with acronyms, work packages, and vacuous, buzzword-compliant statements. Just what the discipline needs...

2 comments:

rpg said...

For anyone interested in the full text of the Hill and Guralnick BNCOD paper on disease monitoring, here is a link:
http://robgur.googlepages.com/BNCOD.pdf

Roderic Page said...

Thanks for making this available Rob, I've edited the post to mention the link (and fixed a few typos).