Wednesday, October 29, 2008

OpenURL for Genbank records

Following on from adding specimens to my OpenURL resolver, I've added support for GenBank records. Either an OpenURL request such as http://bioguid.info/openurl?id=genbank:DQ502033, or the short URL http://bioguid.info/genbank/DQ502033 will resolve the GenBank record for accession number DQ502033.

The HTML isn't much to look at; the real goodness is the JSON (obtained by appending "&display=json" to the OpenURL request, or ".json" to the short form, e.g. http://bioguid.info/genbank/DQ502033.json).

The resolver gets the sequence from NCBI, does a little post-processing, then displays the result. Post-processing includes parsing the latitude and longitude coordinates (something of a mess in GenBank, see my earlier metacrap rant), extracting specimen codes, adding bibliographic GUIDs (such as DOIs, Handles, or URLs), finding uBio namebankIDs for hosts, etc. Note that some records have a key called "taxonomic_group". This provides clues for resolving museum specimens -- often the DiGIR provider needs to know what kind of taxon you are searching for.
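As a rough illustration, here is a minimal PHP sketch of how a client might fetch and decode one of these records. The "taxonomic_group" key is the one described above; the "latitude" and "longitude" keys are my assumptions about how the cleaned-up coordinates are labelled in the JSON.


<?php
# Fetch a cleaned-up GenBank record from the bioGUID resolver
$accession = 'DQ502033';
$json = file_get_contents('http://bioguid.info/genbank/' . $accession . '.json');
if ($json === false) {
    die("Couldn't contact the resolver\n");
}
$record = json_decode($json, true);

# "taxonomic_group" is described above; the coordinate keys are guesses
if (isset($record['taxonomic_group'])) {
    echo 'Taxonomic group: ' . $record['taxonomic_group'] . "\n";
}
if (isset($record['latitude'], $record['longitude'])) {
    echo 'Locality: ' . $record['latitude'] . ', ' . $record['longitude'] . "\n";
}
?>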

The aim is to have a simple service that returns somewhat cleaned up GenBank records that I (and others) can play with.

Monday, October 27, 2008

A Shared Culture

The "A Shared Culture" video, from the Creative Commons web site.

Modelling GUIDs and taxon names in Mediawiki

Thinking more and more about using Mediawiki (or, more precisely, Semantic Mediawiki) as a platform for storing and querying information, rather than writing my own tools completely from scratch. This means I need ways of modelling some relationships between identifiers and objects.

The first is the relationship between document identifiers such as DOIs and metadata about the document itself. One approach which seems natural is to create a wiki page for the identifier, and have that page consist of a #REDIRECT statement which redirects the user to the wiki page on the actual article.
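For example, the wiki page for a DOI would contain nothing but the redirect; a page named after the DOI (both DOI and article title made up here) might consist of just:


#REDIRECT [[Title of the actual article]]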



This seems a reasonable solution because:
  • The user can find the article using the GUID (in effect we replicate the redirection DOIs make use of)
  • The GUID itself can be annotated
  • It is trivial to have multiple GUIDs linking to the same paper (e.g., PubMed identifiers, Handles, etc.).

Taxon names present another set of problems, mainly because of homonyms (the same name being given to two or more different taxa). The obvious approach is to do what Wikipedia does (e.g., Morus), namely have a disambiguation page that enables the user to choose which taxon they want. For example:
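A sketch of what such a disambiguation page might contain (the qualifiers in the links are placeholders, not the actual authorities):


'''Pinnotheres''' may refer to:
* [[Pinnotheres (taxon 1)]]
* [[Pinnotheres (taxon 2)]]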



In this example, there are two taxon names Pinnotheres, so the user would be able to choose between them.

For names which have only one corresponding taxon name we would still have two pages (one for the name string, and one for the taxon name), which would be linked by a REDIRECT:
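For example, the name string page (using the made-up name Aus bus and a made-up authority) would contain just:


#REDIRECT [[Aus bus Smith, 1900]]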



The advantage of this is that if we subsequently discover a homonym we can easily handle it by changing the REDIRECT page to a disambiguation page. In the meantime, users can simply use the name string because they will be automatically redirected to the taxon name page (which will have the actual information about the name, for example, where it was published).

Of course, we could do all of this in custom software, but the more I look at it, the more the ability to edit the relationships between objects, as well as the metadata, and to make inferences over them, makes Semantic Mediawiki look very attractive.

Friday, October 24, 2008

Google Books and Mediawiki

Following on from the previous post, I wrote a simple Mediawiki extension to insert a Google Book into a wiki page. Written in a few minutes, not tested much, etc.

To use this, copy the code below and save in a file googlebook.php in the extensions directory of your Mediawiki installation.


<?php
# rdmp

# Google Book extension
# Embed a Google Book into Mediawiki
#
# Usage:
# <googlebook id="OCLC:4208784" />
#
# To install it put this file in the extensions directory
# To activate the extension, include it from your LocalSettings.php
# with: require("extensions/googlebook.php");

$wgExtensionFunctions[] = "wfGoogleBook";

function wfGoogleBook() {
    global $wgParser;
    # registers the <googlebook> extension with the WikiText parser
    $wgParser->setHook( "googlebook", "renderGoogleBook" );
}

# The callback function for converting the input text to HTML output
function renderGoogleBook( $input, $argv )
{
    # without a book identifier there is nothing to display
    if (!isset($argv["id"]))
    {
        return '';
    }

    # default size of the viewer, in pixels
    $width = 425;
    $height = 400;

    if (isset($argv["width"]))
    {
        $width = $argv["width"];
    }
    if (isset($argv["height"]))
    {
        $height = $argv["height"];
    }

    # load Google's preview library, then insert the embedded viewer
    $output = '<script type="text/javascript" src="http://books.google.com/books/previewlib.js"></script>';
    $output .= '<script type="text/javascript">GBS_insertEmbeddedViewer(\''
        . $argv["id"] . '\',' . $width . ',' . $height . ');</script>';

    return $output;
}
?>


In your LocalSettings.php file add the line


require("extensions/googlebook.php");


Now you can add a Google book to a wiki page by adding a <googlebook> tag. For example:


<googlebook id="OCLC:4208784" />


The id gives the book identifier, such as an OCLC number or an ISBN (you need to include the identifier prefix). By default, the book will appear in a box 425 × 400 pixels in size. You can add optional width and height parameters to adjust this.
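For example, to get a larger viewer (the dimensions here are arbitrary):


<googlebook id="OCLC:4208784" width="500" height="600" />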

Wednesday, October 22, 2008

OCLC and Google books

I've started to come across more taxonomic books in Google Books, such as Catalogue of the specimens of snakes in the collection of the British museum by John Edward Gray. Google Books provides a nice widget for embedding views of books, and there is a tool for generating the Javascript code. Note that in Blogger (which I use to create this blog) you need to make sure that the Javascript occurs on a single line with no line breaks for it to work.



The Javascript used (with line breaks that must be removed before using) is:

<script type="text/javascript"
src="http://books.google.com/books/previewlib.js">
</script>
<script type="text/javascript">
GBS_insertEmbeddedViewer('OCLC:4208784',425,400);
</script>


I stumbled across this book whilst searching for the original record for the snake Enhydris punctata. Confusingly, the Catalogue of Life lists this snake as Enhydris punctata GRAY 1849, implying that Gray's original name still stands, whereas in fact it should be Enhydris punctata (Gray, 1849), as Gray's original name for the snake was Phytolopsis punctata. It's little things like this that drive me nuts, especially as the Catalogue of Life has no obvious, quick means of fixing this (Wiki, anyone?).

I was also interested in using the OCLC number as a GUID for the book, but there are several to choose from (including two related to the Google Book). Unlike DOIs, a book may have multiple OCLC numbers (sigh). Still, it's a GUID, and it's resolvable, so it's a start. Hence, one could link GUIDs for the names published in this book to the book itself.

Tuesday, October 21, 2008

OpenURL for specimens


As part of the slow rebuild of bioguid.info, and as part of the Challenge, I've started making an OpenURL resolver for specimens. Partly this is just a wrapper around DiGIR providers, but it's also a response to the lack of GUIDs for specimens. Just as OpenURL for papers only really makes sense in a world without GUIDs for literature (DOIs pretty much take care of that), the lack of specimen GUIDs leaves us to resolve specimens based on metadata.

For example, the holotype of Pseudacris fouquettei (shown in photo by Suzanne L. Collins, original here) is TNHC 63583. In a digital world, I want the paper describing this taxon, and the specimen(s) assigned to it, to be a click away. In this spirit, here is an OpenURL link for the specimen: http://bioguid.info/openurl/?genre=specimen&institutionCode=TNHC&collectionCode=Herps&catalogNumber=63583. Click on this link and you get a page with some very basic information on the specimen. If you want more, append "&display=json" to the URL to get a JSON response.
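For what it's worth, here is a rough PHP sketch of how a client might build such a request and grab the JSON version. The parameter names are those in the URL above; everything else is an assumption about how such a client might be written.


<?php
# Build an OpenURL request for a specimen and fetch the JSON response
$params = array(
    'genre'           => 'specimen',
    'institutionCode' => 'TNHC',
    'collectionCode'  => 'Herps',
    'catalogNumber'   => '63583',
    'display'         => 'json'
);
$url = 'http://bioguid.info/openurl/?' . http_build_query($params);
$json = file_get_contents($url);
if ($json !== false) {
    $specimen = json_decode($json, true);
    print_r($specimen);
}
?>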

So, armed with this, TNHC 63583 becomes resolvable, and joining the pieces becomes a little easier.

Sunday, October 19, 2008

Monitor envy


Mike Sanderson's wall of monitors is getting some attention. Cool as it looks (and I'm positively green with envy), this strikes me as the LCD equivalent of Science's suggestion that the reader print out a tree on several bits of paper (doi:10.1126/science.300.5626.169). If we can fit the planet on a monitor, can't we fit the tree of life?

Thursday, October 09, 2008

Wired barcoding


The latest issue of Wired has an article on DNA barcoding, entitled "A Simple Plan to ID Every Creature on Earth". The article doesn't say much that will be new to biologists, but it's a nice intro to the topic, and some of the personalities involved.

Tuesday, October 07, 2008

Biodiversity Service Status

The rather frail nature of biodiversity services (some of the major players have had service breaks in the last few weeks) has prompted me to revisit Dave Vieglais's BigDig and extend it to other services, such as uBio, EOL, and TreeBASE, as well as DSpace repositories and tools such as Connotea.

The result is at http://bioguid.info/status/. The idea is to poll each service once an hour to see if it is online. Eventually I hope to draw some graphs for each service, to get some idea of how reliable it is.
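The check itself needn't be anything fancy. Here is a minimal sketch of the idea in PHP (the services and URLs listed are placeholders, and the actual status script differs):


<?php
# Poll each service and record whether we got a sensible HTTP response
$services = array(
    'uBio'     => 'http://www.ubio.org',
    'TreeBASE' => 'http://www.treebase.org'
);
foreach ($services as $name => $url) {
    $status = 'down';
    $headers = @get_headers($url);
    # the first header line looks like "HTTP/1.1 200 OK"
    if ($headers && preg_match('/^HTTP\/\S+\s+[23]\d\d/', $headers[0])) {
        $status = 'up';
    }
    # the real script would store this with a timestamp, so that
    # reliability graphs can be drawn later
    echo date('Y-m-d H:i') . "\t" . $name . "\t" . $status . "\n";
}
?>


Run from cron once an hour, that is essentially all the polling amounts to.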

Much of my own work depends on using web sites and services, and I'm constantly frustrated when they go offline (sometimes for months at a time).

My aim is to be constructive. I'm well aware that reliability is not easy, and some tools that I've developed myself have disappeared. But I think as a community we need to do a lot better if biodiversity informatics is to deliver on its promise.

The list of services is biased by what I use. I'm also aware that some of the DiGIR provider information is out of date (I basically lifted the list from the BigDig; I'll try and edit this as time allows).

Comments (and requests for adding services) are welcome. There is a comment box at the bottom of the web page, which uses Disqus, a very cool comment system that enables you to keep track of your comments across multiple sites. It also supports OpenID.

Monday, October 06, 2008

Global biogeographical data bases on marine fishes: caveat emptor

D. Ross Robertson has published a paper entitled "Global biogeographical data bases on marine fishes: caveat emptor" (doi:10.1111/j.1472-4642.2008.00519.x - the DOI is broken, but you can get the article here). The paper concludes:
Any biogeographical analysis of fish distributions that uses GIS data on marine fishes provided by FishBase and OBIS 'as is' will be seriously compromised by the high incidence of species with large-scale geographical errors. A major revision of GIS data for (at least) marine fishes provided by FishBase, OBIS, GBIF and EoL is essential. While the primary sources naturally bear responsibility for data quality, global online providers of aggregated data are also responsible for the content they serve, and cannot side-step the issue by simply including generalized disclaimers about data quality. Those providers need to actively coordinate, organize and effect a revision of GIS data they serve, as revisions by individual users will inevitably lead to confused science (which version did you use?) and a tremendous expenditure of redundant effort. To begin with, it should be relatively easy for providers to segregate all data on pelagic larvae and adults of marine organisms that they serve online. Providers should also include the capacity for users to post readily accessible public comments about the accuracy of individual records and the overall quality of individual data bases. This would stimulate improvements in data quality, and generate 'selection pressures' favouring the usage of better quality data bases, and the revision or elimination of poor-quality data bases. The services provided to the global science community by the interlinked group of online providers of biodiversity data are invaluable and should not be allowed to be discredited by a high incidence of known serious errors in GIS data among marine fishes, and, likely, other marine organisms. (emphasis added)

As I've noted elsewhere on this blog, and as demonstrated by Yesson et al.'s paper on legume records in GBIF (doi:10.1371/journal.pone.0001124) (not cited by Robertson), there are major problems with geographical information in public databases. I suspect there will be more papers like this, which I hope will inspire database providers and aggregators to take the issue seriously. (Thanks to David Patterson for spotting this paper).