Big data for big ecology

20 December 2012 by Tom Webb, posted in Uncategorized

As buzz words go, ‘big data’ is right up there just now. It seems that every question you care to think of, in every field from public policy to evolutionary biology, can be hit with the big data hammer. Add an ‘omics’ or two too, and you’re laughing.

So I’m slightly ashamed that we decided to call our workshop at the British Ecological Society’s Annual Meeting ‘Big Data for Big Ecology’. But when I say ‘we’ I mean the BES Macroecology Special Interest Group, and Macroecology is – as its name suggests – ‘big ecology’, so it seemed natural to combine this with the buzz word du jour.

And as it turned out, I think we were vindicated. We held the first of two one-hour workshops in a room that could comfortably seat 50. Over 100 squeezed in, and we had to turn some people away. So clearly the interest is there, perhaps at least partly because ecological ‘big data’ differ from the data collected in other fields, and we’re still feeling our way towards how best to deal with issues of storage, access, and analysis. Sequence data, for instance, take a pretty standard form, and it’s relatively straightforward to design a system to collate all of them – GenBank is testament to this. Ecological data are much more heterogeneous – people measure different things in different systems, there’s no universally agreed common unit of measurement, people work at different spatial scales, in different habitats and environments, and so on. There is also the matter of what we mean by ‘big’. Again, there’s a contrast here with genomics, where a million sequences is now almost a trivially small number. I think in ecology we’re much more likely to be dealing with records in the thousands or hundreds of thousands, so again the computational challenges are different: doing something clever with a large quantity of complex data, rather than with an absolutely huge amount of simpler (or at least, relatively standard) data.

The aim of this first workshop was to introduce a couple of major ecological datasets, then to discuss the issues associated with sharing data. Importantly, by involving figshare, we were able to present some solutions rather than simply rehashing the same old (perceived) problems. I posted a storify of this first hour here, but briefly: we heard first from Paula Lightfoot, data access officer for the UK’s National Biodiversity Network Trust. The NBN holds >80 million distribution records from around 150 data providers, spread across almost 800 individual datasets. The data cover a very wide range of taxa, although birds, Lepidoptera and flowering plants make up three-quarters of the records. The NBN Gateway has always been a fantastic public-facing portal to biodiversity data (go and have a play if you want to confirm this), but these data are underused in research. So for me it was particularly interesting to learn about recent improvements to the NBN’s data delivery system, intended to address concerns such as those raised by a BES working group involving several members of the Macroecology group (including myself and group chair Nick Isaac). Some of the data held by the NBN are sensitive or otherwise access-restricted, but now you can trigger a single request which goes to all relevant data owners. Likewise, you can download information from multiple datasets as a single text file – which, as ecological data analysts, is often all that we want.

Charly Griffiths from the Marine Biological Association data team then gave an overview of the data holdings in Plymouth, which I think was really valuable in raising awareness of some of these phenomenal datasets among the overwhelmingly terrestrial community of the BES. These include the Continuous Plankton Recorder data held by SAHFOS, which at more than 80 years is among the longest-running and most spatially extensive ecological time series in existence, and the Western Channel Observatory data, one of the very few long-term datasets to collect information across an entire community (“from genes to fish, from seconds to centuries”).

Then we changed tack, from talking about where we might find data, to what we should do with our own. A quick show of hands revealed that almost everyone in the room had used other people’s data in their work; rather fewer had shared their own data. Mark Hahnel from figshare gave a quick demo to show how easy it can be to share all kinds of outputs – from static figures to code to very large datasets – on the figshare platform, where each item instantly gains a DOI and thus becomes citable.

Given how easy this process is, why don’t more people share their data? Our discussion identified two main objections. First, people remain highly protective of their data, and suspicious that there are armies of people just waiting for it to become public so that they can do the same analyses (only faster) that the data owner had planned. I think this is understandable – ecological data are often collected in pretty extreme environments, involving a huge amount of hard work, and it is natural to want to get the full benefit of this toil before others are able to profit.

There are two counters to this. First, the idealistic one: in most cases you were paid to collect your data, very often with public money; the data are not yours to hoard; you were not funded to advance your career, but to advance science. Second, more pragmatically: it’s unlikely that many people are especially interested in what you do. Only a small fraction of those who are will have both the time to start to work on your data, and the expertise to do anything useful. Fewer still will be inclined to screw you over, especially (and this is important) if you have taken the step of laying out your stall in public (on figshare or wherever). And academic karma will sort them out soon anyway…

The second issue, that of data ownership, is harder to address, regardless of any mandate to make data available. This is a particular problem for someone like me, who uses other people’s data all the time. The value that I add lies in combining existing datasets and analysing them in novel ways. Often I have had to secure various permissions to use the data in the first place, and the extent to which what I have produced is an original data product is not clear. So while my inclination is to share everything, I do have to be very careful that I’m not sharing anything where I have previously signed an agreement to say that I won’t. Even in these cases though it is still possible to share extensive metadata and the code used to access and analyse the data.

Scott Chamberlain, who delivered the second workshop, touched on some of these kinds of issues, as well as potential solutions. Scott and the rest of the rOpenSci team use APIs to access large datasets, and it is perfectly possible for a data provider to restrict access to their data via this API route. In that case, one can publish the R code documenting how the data were accessed, manipulated and analysed, and the analysis can then be replicated by anyone who has the same data access privileges that you do (often gained through personal contact with the data provider). This could be a really neat solution to accessing multiply-owned datasets. Scott’s presentation is online here, and if you have any interest in accessing data using R, it is a must-read, highly endorsed by all of the 100 or so of us who were at the workshop (see some of the comments in my second storify).
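
To make the idea of publishing code without publishing data a little more concrete, here is a minimal sketch, written in Python rather than R purely for illustration. Everything specific in it is an assumption and not taken from Scott's presentation or from any real provider: the endpoint `https://api.example.org/v1/records`, the `DATA_API_KEY` environment variable, the bearer-token header, and the idea that the API returns a JSON list of records with a `year` field are all hypothetical. The point is only that the script can be shared in full while the key that unlocks the data stays with whoever the provider has issued it to.

```python
import json
import os
import urllib.parse
import urllib.request

# Hypothetical endpoint: a stand-in for whatever API a data provider exposes.
API_URL = "https://api.example.org/v1/records"


def build_request(species, api_key=None):
    """Build (but do not send) a request for the records of one species."""
    # The key is read from the environment, so the published script
    # contains no credentials. DATA_API_KEY is an assumed variable name.
    key = api_key or os.environ.get("DATA_API_KEY")
    if not key:
        raise RuntimeError("Set DATA_API_KEY to the key issued by the data provider")
    query = urllib.parse.urlencode({"species": species, "format": "json"})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"Authorization": f"Bearer {key}"},
    )


def fetch_records(species, api_key=None):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(species, api_key)) as response:
        return json.load(response)


def count_by_year(records):
    """Example analysis step: number of records per year.

    Assumes each record is a dict with a "year" entry.
    """
    counts = {}
    for record in records:
        counts[record["year"]] = counts.get(record["year"], 0) + 1
    return counts
```

Someone with their own key would run `count_by_year(fetch_records("Calanus finmarchicus"))` and repeat the access and analysis steps exactly; someone without one can still read every step that was taken.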

So where do we go from here? That’s a genuine question: we clearly hit a nerve and got a huge amount of interest, so we want to take it forward. But how? Should we be writing a set of standards for ecological data? A catalogue of existing datasets? A set of tutorials? I appreciate that we are far from the only people interested in this, and don’t want to replicate the efforts of others – so maybe a list of these other efforts would be a good place to start? Any thoughts gratefully received, either in the comments here or via Twitter (@besmacroecol, @tomjwebb, #besbigdata) or our Facebook group.


5 Responses to “Big data for big ecology”

  1. Karthik Ram

    Hi Tom,
    First off, this is a really nice post and a big thanks to you and the Macroecology Special Interest Group for organizing this workshop at BES. I'm also super excited to see more ecologists (across the pond) supporting efforts to leverage and build upon existing data.

    Here are a few thoughts on this post:

    You're absolutely right that 'big data' is a huge buzz word, and not just in the sciences (where the term is relatively new compared to business). I appreciate that you wrote a bit more about what you meant by big data:

    There is also the matter of what we mean by ‘big’. Again, there’s a contrast here with genomics, where a million sequences is now almost a trivially small number. I think in ecology we’re much more likely to be dealing with records in the thousands or hundreds of thousands, so again the computational challenges are different.

    This is where I find your use of the term quite confusing (and also misleading). First, there is also the problem of big data (used in the canonical sense) in ecology. Several researchers (including me) struggle with leveraging large datasets (to be clear, I mean on the order of millions of data points spanning large temporal and spatial gradients) to answer pressing ecological questions. There are many ongoing efforts to deal with these types of issues, which you seem to have dismissed as not being relevant in ecology. Rather, the problem you describe is really a small-data-scattered-everywhere problem. In other words, heterogeneous and disparate does not equal big data, which happens to be a separate problem in and of itself.

    The really cool thing about recent interest among scientists in sharing and reusing data, and also with data providers making their collections more accessible (via machine-readable APIs), is that we can leverage all these difficult-to-retrieve data to answer novel questions or use them in ways not intended by the original authors. At rOpenSci, we are making this process of data discovery and reuse somewhat easier and more reproducible. We aren't (yet) equipped to go after big data, although that is only a matter of time. I just wanted to clarify that distinction to avoid confounding two separate issues.

    As for future directions, it would be great for the Macroecology group to keep the discussion going (through more talks and workshops for folks that didn't attend this one) and to encourage more scientists to share their data. Another great way to facilitate these types of efforts would be to write to data providers (especially those without APIs) asking them to make their data more accessible, along with researcher-friendly licenses. Doing so lets them know that there is an interest in using the data and also provides a rationale for continuing to maintain such repositories.

    Cheers,
    - Karthik

  2. Ethan White

    Awesome post. It's very exciting to see this area of research becoming so popular in Great Britain.

    With respect to the idea of building a catalogue of existing datasets, I developed the Ecological Data Wiki (http://www.ecologicaldata.org/) a couple of years ago to help build exactly that kind of catalogue, along with discussion and consensus on the strengths and weaknesses of datasets and how best to go about using them. It hasn't really taken off (largely due to my lack of time to help cultivate usage), but it might be a useful resource for some of the things you're interested in working on.

  3. Tom Webb

    Karthik - thanks for your post, and for clearing up a couple of ambiguities. You're absolutely right that really 'big' ecological datasets do exist, and that we can use them to address big questions. I use them myself - some of the big biogeographic databases for example (OBIS, GBIF etc.), with millions of data points, combined with large geospatial datasets too (e.g. global estimates of vegetation, chlorophyll a, bathymetry or whatever, at small (~km) resolution). I should have made that clearer - although I do suspect that the 'ecological' component of the data used by most ecologists working at the 'macro' scale typically is in the region of hundreds of thousands of points, rather than millions or more.

    Regarding the confusion between the cool stuff you're doing at rOpenSci, and hitting really big data - I had a quick chat with Scott about that, and about the limits to the API route at the moment, but as you say I think it's only a matter of time before the two (access + size) are wedded, whether this is through improved web services or by work-arounds (e.g. downloading entire datasets and querying offline).

    Finally, your point about what we could be doing is excellent. If we act as a community to coordinate requests to data providers to build in APIs etc. that will facilitate use of their data, that could be much more powerful than individuals acting alone.

  4. Tom Webb

    Thanks Ethan! As I was writing I was sure that there was already a compilation of data, and indeed your name came to mind, but I was on the train with no way of checking, so thanks for supplying the link. As you say, these things need someone with both the time and inclination to really push them forwards, and maybe that's something we can help to contribute through the macroecology group.
