Wednesday, July 30, 2008

iSpecies gets automated tagging

Given that the clones are hot on my heels, I feel the need to add more bells and whistles to iSpecies. The first new feature is automated tagging, and uses Yahoo's Term Extraction API. I send the titles of any papers found, and the Wikipedia snippet, and Yahoo returns keywords ("tags").
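To make the workflow concrete, here is a minimal sketch of the round trip. The endpoint, parameter names, and JSON response shape are assumptions based on how Yahoo's Term Extraction API was documented at the time (the service has since been retired), so treat the details as illustrative only.

```python
import urllib.parse

# Yahoo Term Extraction endpoint as documented at the time -- an assumption.
TERM_EXTRACTION_URL = "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction"

def build_request(app_id, context):
    """Build the POST body for the term-extraction service.

    `context` is the text to mine for keywords -- here, the paper
    titles plus the Wikipedia snippet, joined into one string.
    """
    return urllib.parse.urlencode({
        "appid": app_id,
        "context": context,
        "output": "json",
    })

def extract_tags(response):
    """Pull the keyword list out of the decoded JSON response
    (assumed ResultSet/Result structure)."""
    return response.get("ResultSet", {}).get("Result", [])

# For Helice crassa the tags would come back as a flat list, e.g.:
sample_response = {"ResultSet": {"Result": ["mud crabs", "estuarine", "burrows"]}}
print(extract_tags(sample_response))
```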

As an example, here are the tags for one of my favourite animals, Helice crassa.

mud crabs mangrove estuary muddy sediments mud crab sea coasts mud flats sex ratios habitat preferences activity patterns laboratory conditions estuarine gills burrows endemic respiration original article morphology ventilation dana biology

I think these give a nice sense of what we know about this crab.

I'm storing the tags for future analysis. I think there are some interesting ideas to explore, such as clustering the tags into meaningful groups. I'm also interested in how much we can learn about an organism based on these keywords. Can we automatically infer something about the ecology of the organism?

There is also scope for adding some semantics. Some of these tags are taxon names, and some refer to geographic places. Some are concepts, which could be linked to the relevant page in Wikipedia (Faviki is an example of this approach). At present the tags aren't clickable (i.e., you can't query by tag), but that would be a useful feature. One could get taxa that were tagged with a given term, such as "estuarine". For now, it's a quick way to get a sense of what we know about a taxon.

Wednesday, July 23, 2008

iPhone, barcodes, and natural history museums

One of my pet peeves is how backward natural history museums are in grasping the possibilities the Internet raises. Most electronic displays in museums have low information content, and are doomed to obsolescence. Traditional media (plaques, labels) have limited space, and also date quickly. For example, the Natural History Museum in London has a skeleton of Diplodocus carnegiei (see photo below by EmLah). This is one of many replicas distributed around the world.

The plaque describing this fossil has fairly minimal information. Wikipedia, however, has a nice article on Diplodocus, which includes a public domain image of the replica skeleton being presented to the trustees of the British Museum of Natural History in 1905.

Given the limitations of physical media, museum labels and plaques will always be small, and will often be out of date. Wikipedia, of course, can be kept current, and anybody can contribute.

So, the trick is to link the physical object to the Internet. This is now trivial thanks to mobile tagging. By pointing a mobile phone with a camera at a 2D barcode, one can go from physical object to web site.

Here is a 2D barcode for the URL of the Wikipedia article on Diplodocus. Imagine taking your iPhone, pointing it at this barcode, and being taken to the Wikipedia page. If museums were clever, they could set up their own wiki, and mobilise the combined skills of museum staff, volunteers, and visitors to populate it.
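Generating such a barcode is trivial. As one illustration, a chart-service URL can render any target address as a QR code; the sketch below assumes the Google Chart API's QR chart type (`cht=qr`) and its `chs`/`chl` parameters, so check the current documentation before relying on it.

```python
import urllib.parse

def qr_chart_url(target, size=150):
    """Build a chart-service URL that renders `target` as a QR code.

    Assumes the Google Chart API conventions: cht=qr selects the QR
    chart type, chs gives the image size, chl carries the payload.
    """
    params = urllib.parse.urlencode({
        "cht": "qr",
        "chs": "%dx%d" % (size, size),
        "chl": target,
    })
    return "http://chart.apis.google.com/chart?" + params

print(qr_chart_url("http://en.wikipedia.org/wiki/Diplodocus"))
```

Print the resulting image on the plaque next to the skeleton, and any camera phone with a barcode reader can jump straight to the article.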

Now that the iPhone has applications, imagine creating an application that reads these barcodes. Kevin Chiu at Columbia has made one, and there are others out there. Museums could build on this, brand it with their logo, and greatly enhance the visitor experience.

Wonder if anyone is doing this...?

Thursday, July 17, 2008


Stumbled across Zitgist (via UMBEL), and thought the diagram above was so cool I'd have to blog about it. Zitgist is one of a growing number of Semantic Web companies, specialising in Linked Data. This topic is dear to my heart, so I'll need to keep an eye on what Zitgist and others are up to.

Thursday, July 10, 2008

Why isn't EOL using Wikipedia?

Interesting paper by Huss et al. in PLoS Biology entitled "A Gene Wiki for Community Annotation of Gene Function" (doi:10.1371/journal.pbio.0060175). Essentially, the paper describes using Wikipedia to create a comprehensive gene wiki:
In principle, a comprehensive gene wiki could have naturally evolved out of the existing Wikipedia framework, and as described above, the beginnings of this process were already underway. However, we hypothesized that growth could be greatly accelerated by systematic creation of gene page stubs, each of which would contain a basal level of gene annotation harvested from authoritative sources. Here we describe an effort to automatically create such a foundation for a comprehensive gene wiki. Moreover, we demonstrate that this effort has begun the positive-feedback loop between readers, contributors, and page utility, which will promote its long-term success.
Given that the EOL project seems stalled (i.e., the current content hasn't changed), and the existing Wikipedia content is often much richer than EOL's, one has to ask why EOL doesn't give up its current model and make use of Wikipedia. In other words, create all its taxon pages in Wikipedia.


Brian de Alwis has written a cool AppleScript called OpenDOI that adds support for resolving doi: and hdl: URLs using Safari on a Mac. With it installed, links such as hdl:10101/npre.2008.1760.1 and doi:10.1093/bib/bbn022 become clickable, without having to stick an HTTP proxy in front of them.
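The underlying rewrite is simple: the doi: and hdl: schemes map onto the public HTTP resolvers at dx.doi.org and hdl.handle.net. A minimal sketch (my own, not OpenDOI's code):

```python
def resolve_link(uri):
    """Map doi: and hdl: URIs onto their public HTTP resolvers
    (dx.doi.org and hdl.handle.net); pass anything else through."""
    if uri.startswith("doi:"):
        return "http://dx.doi.org/" + uri[len("doi:"):]
    if uri.startswith("hdl:"):
        return "http://hdl.handle.net/" + uri[len("hdl:"):]
    return uri  # leave ordinary URLs untouched

print(resolve_link("doi:10.1093/bib/bbn022"))
# → http://dx.doi.org/10.1093/bib/bbn022
print(resolve_link("hdl:10101/npre.2008.1760.1"))
# → http://hdl.handle.net/10101/npre.2008.1760.1
```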

Seems that an obvious extension to this would be to add support for LSIDs. Firefox can support LSIDs through the LSID Browser for Firefox, but this won't work with Safari. Something for the to do list.

Wednesday, July 09, 2008

It pays to put things on the Web

Seems obvious in retrospect, but one of the great things about putting stuff online is that it may be useful to other people. What seems like ages ago I developed the Glasgow Taxonomic Name Server to experiment with searching for and displaying taxonomic names and classifications. As part of that work I developed a SOAP web service, and wrote a tutorial on how to use SOAP from within Microsoft Excel. I did this mainly for my own benefit, so that I wouldn't forget how to do it (much Googling was required). This tutorial has been reproduced, updated, and credited by the World Register of Marine Species (WoRMS). I only realised this after browsing the WoRMS site, following a recent conversation on TAXACOM about the proper name of the sperm whale. The take-home message is that you never know who will make use of something you've done, and the chances are that if you've solved a problem, somebody else may well benefit from your solution.

Tuesday, July 08, 2008


Rich Glor brought dechronization to my attention. This is a very active blog "by junior academic scientists whose research focuses on evolution, reconstruction of phylogenetic trees, and comparative methods." There's some nice stuff there, including software reviews, paper appraisals, conference reports, and *cough* porn.

Sunday, July 06, 2008

Library Git

Just a quick note to make a link between David Shorthouse's post about taxonomic consensus and distributed version control (Taxonomic Consensus as Software Creation), and Galen Charlton's article in The Code4Lib Journal (Distributed Version Control and Library Metadata). Some interesting food for thought here. Both mention Git. If you want to know more, watch Linus Torvalds' wonderfully direct talk at Google:

Saturday, July 05, 2008


Stumbled across the cool AgeNames service, described on the blog. AgeNames extracts stratigraphic terms, such as geological time periods, from free text. It's a geological equivalent of uBio's taxonomic name extraction services. It would be fun to play with this as part of the iPhylo project.
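I don't know how AgeNames works internally, but the core idea can be sketched as matching text against a stratigraphic gazetteer. The tiny period list below is just for illustration; a real service would use a much richer controlled vocabulary.

```python
import re

# A tiny gazetteer of geological periods -- purely illustrative;
# AgeNames presumably draws on a full stratigraphic vocabulary.
PERIODS = ["Cambrian", "Ordovician", "Silurian", "Devonian",
           "Carboniferous", "Permian", "Triassic", "Jurassic",
           "Cretaceous", "Paleogene", "Neogene", "Quaternary"]

def extract_strat_terms(text):
    """Return the stratigraphic terms found in `text`, in order of
    first appearance, without duplicates."""
    pattern = re.compile(r"\b(" + "|".join(PERIODS) + r")\b")
    seen, found = set(), []
    for match in pattern.finditer(text):
        term = match.group(1)
        if term not in seen:
            seen.add(term)
            found.append(term)
    return found

print(extract_strat_terms(
    "Diplodocus lived in the Late Jurassic; by the end of the Cretaceous it was long gone."))
# → ['Jurassic', 'Cretaceous']
```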

Charting taxonomic knowledge

Nice paper by Robert Huber and Jens Klump has appeared in Computers & Geosciences entitled "Charting taxonomic knowledge through ontologies and ranking algorithms" (doi:10.1016/j.cageo.2008.02.016). The paper is not open access, but you can get some background from the post How TaxonRank works. Here's the abstract.

Since the inception of geology as a modern science, paleontologists have described a large number of fossil species. This makes fossilized organisms an important tool in the study of stratigraphy and past environments. Since taxonomic classifications of organisms, and thereby their names, change frequently, the correct application of this tool requires taxonomic expertise in finding correct synonyms for a given species name. Much of this taxonomic information has already been published in journals and books where it is compiled in carefully prepared synonymy lists. Because this information is scattered throughout the paleontological literature, it is difficult to find and sometimes not accessible. Also, taxonomic information in the literature is often difficult to interpret for non-taxonomists looking for taxonomic synonymies as part of their research.

The highly formalized structure makes Open Nomenclature synonymy lists ideally suited for computer aided identification of taxonomic synonyms. Because a synonymy list is a list of citations related to a taxon name, its bibliographic nature allows the application of bibliometric techniques to calculate the impact of synonymies and taxonomic concepts. TaxonRank is a ranking algorithm based on bibliometric analysis and Internet page ranking algorithms. TaxonRank uses published synonymy list data stored in TaxonConcept, a taxonomic information system. The basic ranking algorithm has been modified to include a measure of confidence on species identification based on the Open Nomenclature notation used in synonymy list, as well as other synonymy specific criteria.

The results of our experiments show that the output of the proposed ranking algorithm gives a good estimate of the impact a published taxonomic concept has on the taxonomic opinions in the geological community. Also, our results show that treating taxonomic synonymies as part of an ontology is a way to record and manage taxonomic knowledge, and thus contribute to the preservation of our scientific heritage.

Friday, July 04, 2008

How to succeed in evolutionary biology, without really trying

Lab Times has an interesting article by Ralf Neumann that analyses Europe's publications in evolutionary biology for the period 1996-2006. On page 36 there is a table of the 30 most cited authors in Europe, and the top five most cited papers. To my astonishment, I'm there at number 10 (accompanied by a photo taken in New York). What is interesting is that although the top 30 are varied in their interests, and include some well known names in the field, the top five papers in terms of citations are all about phylogenetic methods:
  1. Page, RDM
    TreeView: An application to display phylogenetic trees on personal computers.
    COMPUTER APPLICATIONS IN THE BIOSCIENCES, 12 (4): 357-358 AUG 1996 (doi:10.1093/bioinformatics/12.4.357)

  2. Strimmer, K; von Haeseler, A
    Quartet puzzling: A quartet maximum-likelihood method for reconstructing tree topologies.

  3. Ronquist, F; Huelsenbeck, JP
    MrBayes 3: Bayesian phylogenetic inference under mixed models.
    BIOINFORMATICS, 19 (12): 1572-1574 AUG 12 2003 (doi:10.1093/bioinformatics/btg180)

  4. Yang, ZH
    PAML: a program package for phylogenetic analysis by maximum likelihood.
    COMPUTER APPLICATIONS IN THE BIOSCIENCES, 13 (5): 555-556 OCT 1997 (doi:10.1093/bioinformatics/13.5.555)

  5. Guindon, S; Gascuel, O
    A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.
    SYSTEMATIC BIOLOGY, 52 (5): 696-704 OCT 2003 (doi:10.1080/10635150390235520)

Note also that most of these papers are short application notes. Of course, the number of pages in the publication bears no relation to the effort involved in writing the actual software. The other thing that's interesting is that, of the 30 most cited authors, I have published the second smallest number of papers (40). A quick plot of the number of citations against the number of papers published suggests that while there is a correlation between effort (papers) and impact (citations), it's not perfect (ρ = 0.44, R² = 0.19). You can have a reasonable impact without generating lots of papers.
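For anyone wanting to repeat this kind of back-of-the-envelope analysis, Pearson's ρ is all that's needed (R² is simply ρ squared when regressing one variable on the other). The (papers, citations) pairs below are made-up numbers, NOT the Lab Times data.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical (papers, citations) pairs -- illustrative only.
papers = [40, 60, 80, 100, 120, 150]
citations = [2000, 1500, 2500, 3000, 2200, 4000]

rho = pearson(papers, citations)
print(round(rho, 2), round(rho ** 2, 2))  # R^2 is just rho squared
```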

So, what can we learn from this? Well, it would be tempting to offer advice along the lines of "if you want to succeed in this field, write a piece of software that a lot of people find useful, and make sure you have a publication that they can cite." Oh, and getting in early helps. Of course, this advice should be taken with a pinch of salt. Beware the 100th idiot.

iSpecies clones, and taxonomic intelligence

Mauro Cavalcanti has released e-Species, "a taxonomically intelligent biodiversity search engine" written in Python that mimics much of the functionality of iSpecies. The project is open source, with a SourceForge page, although no files seem to be available yet. This is the second iSpecies clone I've seen, David Shorthouse having written a clone that uses only JSON.

One thing which distinguishes e-Species is the use of Catalogue of Life web services to provide some information on the name. However, it doesn't look like e-Species makes use of synonyms in its searches (i.e., what many refer to as "taxonomic intelligence"). Searching on two alternative names for the sperm whale (Physeter catodon and P. macrocephalus) yields different results (unless the underlying source knows that these names are synonyms, such as NCBI). Presumably, a taxonomically intelligent search would be able to merge results from searches using different names, and present those together.
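A minimal sketch of what such taxonomic intelligence might look like: expand the query name to its known synonyms, search on each, and merge the de-duplicated results. The synonym table and the search backend here are hypothetical stand-ins, not e-Species or Catalogue of Life code.

```python
# Hypothetical synonym table -- a real system would query a service
# such as the Catalogue of Life for this mapping.
SYNONYMS = {
    "Physeter macrocephalus": ["Physeter macrocephalus", "Physeter catodon"],
    "Physeter catodon": ["Physeter macrocephalus", "Physeter catodon"],
}

def fake_search(name):
    """Stand-in for a real search backend keyed by the exact name string."""
    results = {
        "Physeter macrocephalus": ["paper A", "paper B"],
        "Physeter catodon": ["paper B", "paper C"],
    }
    return results.get(name, [])

def intelligent_search(name, search=fake_search):
    """Search under every synonym of `name` and merge the results,
    keeping first-seen order and dropping duplicates."""
    merged, seen = [], set()
    for synonym in SYNONYMS.get(name, [name]):
        for hit in search(synonym):
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged

print(intelligent_search("Physeter catodon"))
# → ['paper A', 'paper B', 'paper C']
```

Either name now returns the same combined result set, which is exactly the behaviour a naive name-string search fails to deliver.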

Merging results requires some thought as to how to merge lists from different sources (e.g., merging lists of publications and images). This has been the subject of much study in the context of merging results from different search engines. Some starting points are:

The last link is a student project and is a Microsoft Word document, which I've uploaded to Scribd and embedded below.
Read this document on Scribd: Tadpole: A Meta search engine
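As a taste of the problem, here is about the simplest possible rank-fusion scheme: round-robin interleaving of the ranked lists, dropping duplicates. The meta-search literature covers more principled approaches (Borda counts, score normalisation, and so on); this is just a baseline sketch.

```python
from itertools import zip_longest

def round_robin_merge(*ranked_lists):
    """Interleave several ranked result lists, keeping first-seen
    order and dropping duplicate items."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for item in tier:
            if item is not None and item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

# Two hypothetical ranked lists from different sources.
source_one = ["paper A", "paper B", "paper C"]
source_two = ["paper B", "paper D"]
print(round_robin_merge(source_one, source_two))
# → ['paper A', 'paper B', 'paper D', 'paper C']
```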

Wednesday, July 02, 2008

The end of science, and the end of taxonomy

Mauro Cavalcanti brought Chris Anderson's The End of Theory article in Wired to my attention, part of the July issue on "The End of Science".

Of course, the end of science is hyperbole of the highest order (as, indeed, is the "end of theory"). It is also ironic that in the same issue Wired confess to having gotten 5 predictions of the death of something hopelessly wrong (including web browsers and online music swapping, no less). However, I guess the reason Mauro sent me the link is this section:

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.

This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.
Leaving aside whether Venter has indeed "advanced biology more than anyone else of his generation" (how, exactly, can one measure that?), it started me thinking about the yawning chasm between efforts such as the Encyclopedia of Life and the Catalogue of Life on one hand, and, say, metagenomics on the other. EoL and CoL have a view of life that is taxon-, indeed species-centric, and that appeals to our sense of what matters (basically those organisms we can see comfortably with the naked eye, and interact with).

But if you browse the NCBI taxonomy, not only do you see an attempt to classify organisms phylogenetically, you will also encounter "taxa" that are metagenomes (e.g., NCBI Taxonomy ID 408169). These metagenomes are the result of shotgun sequencing environmental samples; they comprise multiple taxa. In this way, they resemble large-scale sampling events such as plankton netting or tree fogging, which result in masses of material, much of it unidentified. One difference is that the metagenomes are digitised (i.e., sequenced), and hence can be analysed further (as opposed to a mass of specimens in jars). Indeed, this is one motivation behind DNA barcoding -- the ability to digitise massive samples of organisms.

So, perhaps if we overlook the "end of theory" bit (although this is appealing given that some critiques of DNA barcoding have made overblown claims for taxonomy as hypothesis-driven science), the key here is that much of what in an earlier age might have been provisional knowledge unfit for public consumption (e.g., a bunch of unidentified samples) is now very public. In the past, taxonomists wouldn't describe new taxa without sufficient information for a decent description, now the most actively growing taxonomic database (NCBI) has "taxa" that are aggregates of unidentified, unknown (and possibly, unknowable) organisms.