What’s at the bottom of the biodiversity data mine? – Mike Saunders
The following contribution comes to us from Mike Saunders, Director of Digital Media at Kew Gardens, where he oversees their digital platforms. The views expressed below are his own, and not those of Kew Gardens.
As the quantity of data available online reaches ever greater volumes, the question of what value can be derived from that data is increasingly interesting. Working at the Royal Botanic Gardens, Kew, I’m particularly interested in biodiversity data: what it might be used for, by whom, and to what end. This is both an academic interest and a pressing need, given the impending crisis that threatens biodiversity around the world.
Many people, including Professor Nigel Shadbolt, Professor of Artificial Intelligence at Southampton University, describe the supply of online data as a ‘superabundant’ deluge of information. In a paper on semantic responses, Professor Shadbolt and his colleagues estimated that the amount of data generated in 2010 would be around 1.2 million petabytes. To put that into some context: if you tried to read through this data, assuming an average reading speed of, say, 1,000 characters (or approximately 1 kilobyte) per minute, it would take you about 2 trillion years! So techniques of data mining (not a new term; it has been around for several decades) are increasingly essential in locating and making sense of this incredible data mountain.
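The reading-time figure is easy to check with a back-of-envelope calculation. The sketch below assumes decimal petabytes (10^15 bytes), one byte per character, and non-stop reading all year round; none of these assumptions comes from the paper itself.

```python
# Back-of-envelope check of the reading-time estimate.
# Assumptions (not from the cited paper): decimal petabytes,
# one byte per character, reading non-stop with no leap years.
BYTES_PER_PETABYTE = 10**15

data_bytes = 1.2e6 * BYTES_PER_PETABYTE   # 1.2 million petabytes
bytes_per_minute = 1_000                  # roughly 1 kilobyte read per minute
minutes_per_year = 60 * 24 * 365

years = data_bytes / bytes_per_minute / minutes_per_year
print(f"{years:.2e} years")               # a little over 2 trillion years
```

Under these assumptions the result is a little over 2 trillion years, which agrees with the figure quoted above.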
Biodiversity data is one subset of this deluge. There is no reliable estimate of the quantity of biodiversity data available, but it is huge – GBIF (the Global Biodiversity Information Facility) reported in 2010 that it had 216 million records of primary biodiversity data available through its portal, and it estimates that the data records available at partner institutions run into several billion. Kew alone has tens of millions of data records.
So what is the role of data mining in making sense out of the information we have recorded about the planet’s biodiversity?
At Kew, some important aspects of scientific discovery rely on identifying patterns in large data sets. Biodiversity data usually includes accurate location-based information (for example, the location where a specimen was collected), providing a powerful opportunity to mine data by location. A good example is the work conducted recently to assess the risk to plant life around the world, expressed as the Sampled Red List Index (SRLI) for plants. Researchers at Kew, the Natural History Museum, ZSL and the International Union for Conservation of Nature (IUCN) took a representative sample of 7,000 species from around the world. Using a combination of bespoke tools and existing ones such as Google Earth, they mined data from the partners’ collections, remote sensing data from satellites, and other sources such as GBIF to arrive at the final assessment.
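To give a flavour of what mining by location can mean at its simplest, here is a minimal sketch that groups occurrence records into one-degree grid cells and counts records per species per cell. The records, species names and function names are all hypothetical, and this is not the method used for the SRLI assessment; it only illustrates the general idea.

```python
import math
from collections import Counter

# Hypothetical occurrence records: (species, latitude, longitude).
records = [
    ("Species A", 51.48, -0.29),
    ("Species A", 51.47, -0.30),
    ("Species B", 51.90, -0.10),
    ("Species A", -18.90, 47.50),
]

def grid_cell(lat, lon, size_deg=1.0):
    """Return the south-west corner of the grid cell containing a point."""
    return (math.floor(lat / size_deg) * size_deg,
            math.floor(lon / size_deg) * size_deg)

def count_by_cell(records, size_deg=1.0):
    """Count records per (species, grid cell) pair."""
    counts = Counter()
    for species, lat, lon in records:
        counts[(species, grid_cell(lat, lon, size_deg))] += 1
    return counts

counts = count_by_cell(records)
for (species, cell), n in sorted(counts.items()):
    print(species, cell, n)
```

Counts like these are the raw material for a species heatmap or a map layer. Real collection data would first need the expert cleaning and interpretation discussed later in this piece.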
Another important use of biodiversity data is to derive models that can be used to make ecological predictions, for example when modelling climate scenarios. Projects such as TRY work through a global partnership of institutions that provide primary biodiversity data, which is mined to derive traits used in these models.
More miners – and mines
In a growing number of cases, people are mining data sets that nominally have nothing to do with biodiversity, to reveal new information. A fascinating example is that of a citizen scientist from Maine who used tourist images from Flickr to track the migration journey of a humpback whale from Brazil to Madagascar, publishing her results in the Royal Society’s Biology Letters. With Facebook now reaching over 500 million people, there is bound to be some useful biodiversity information to mine.
From eBird to iSpot, there is no shortage of opportunities for citizen scientists to invest time in documenting biodiversity – and they are doing it in large numbers. Increasingly, data providers are also finding ways of making their data available to these groups. In the UK, the National Biodiversity Network offers a set of web services that enable developers to use its data when creating applications, and GBIF similarly offers a number of services for accessing its global data. The Encyclopedia of Life (EOL) aggregates biodiversity data aimed at a broader range of audiences, and it now has an API (application programming interface) that can be used to build apps.
Although we’re not inundated with applications based on this kind of data, there are signs that both specialists and amateurs are approaching the data more creatively. There are, for example, some good illustrations of what is possible using data visualisation, such as species heatmaps or Google Earth layers showing species distribution. And there are certainly developers keen to get their hands on new data. Take the realm of civic data, for example, where organisations like MySociety create hugely popular apps out of freely available data.
So why are there not more biodiversity apps? Well, the data is certainly harder to decipher, and in some cases includes concepts that simply don’t make sense to a non-specialist. So perhaps closer partnership between data publishers and app developers might stimulate more activity – maybe in the form of hack days or so-called ‘crowd-sourced’ projects.
If this happened, what would they build? Perhaps field guides compiled on the fly for a user-defined region, food-chain or ecological modelling, or visualisation of the effects of man-made structures such as roads on habitats? The possibilities may be endless, and in some cases could prove genuinely insightful.
The value of more diverse communities using this data may be in the serendipity that it creates. The example of whale tracking via Flickr is a case in point. Not only will different communities look to new data sets with which to combine the primary data (even perhaps social networks such as Facebook or Twitter), but they may also approach the problem from new angles.
A partnership of miners
Although I believe that getting a broader base of people interested in biodiversity data could have significant benefits, I suspect that the cutting edge of mining biodiversity data will remain with the specialists.
Without expert involvement, mining data can lead to misinterpretation or false conclusions, especially where the data is complex and opaque. Only within the bioinformatics communities do you find the combination of taxonomic, GIS and regional expertise needed to make major breakthroughs in understanding from this data. In fact, many of the potential apps imagined above would probably need expert input to create genuinely valuable products.
And there is an important footnote – the data does not digitise itself, curate itself and offer itself up for use. It is an expensive (although valuable) function to create and maintain usable datasets that can be mined. Kew and other institutions are having to consider how to ‘biocurate’ their data for future use.
But I think we could all benefit from creating more opportunities for citizen scientists, experts and the public to engage together with this critically important data.