Life In The Long Tail of Science

13 March 2012 by The Fourth Paradigm

The following essay was written by Rob Fatland, a Program Manager in the Earth, Energy and Environment Theme of the Microsoft Research Connections Team, who is working hard to put Jim Gray’s Laws into practice.

Research is a landscape of sharp topography, toothy peaks and steep terrains of specialization separated by rills, defiles, gullies, ravines and chasms. The metaphor has its limits, but to run with it for a moment: these gaps (traditionally safely ignored) are increasingly the subject of our efforts, particularly in environmental science. First, we like the idea of intra-trans-cross-disciplinary research because it implies a unification of formerly unrelated areas of inquiry. But aesthetics aside, we simply need to build more bridges to make progress. Sure, you can study dissolved organic carbon in sea water as such, but you are going to want a friend in metagenomics to find out whether her microbes are casting off your molecules. So too researchers must build bridges into computer science – our raison d’être here – to deal with data volume, provenance, versioning, curation, visualization, collaboration, and the other usual data-intensive science suspects.

So far so good; and now to invoke Gray IV (my cool shorthand for Gray’s Fourth Law): if we’re going to build a bridge we need some tools, and the first tool that comes to my mind is 20 Questions. The computer scientist says to the Dissolved Organic Carbon researcher: I will build you a data system provided you first formulate for me 20 Questions you want to ask of your data. The idea is this: the nature of the resulting 20 Questions will let the computer scientist design the machinery ‘behind the curtain’ that gives the researcher what they need. No doubt you can see some pitfalls already; but then that’s the nature of steep terrain.

Now if you imagine science domain bridge-building is daunting, wait ’til you get a load of the Grand Canyon between science and public policy. For one thing, two different scientific disciplines may have completely different lexicons… but scientists and public policy people may use the same words to mean completely different things. I refer to a theory, and it can come across as a crazy guess. And this is only the beginning of our language issues. If you, like me, are happy to see the phrase “consider an integer k greater than 5” then you may, like me, run screaming before reaching the end of Exhibit A: “Knowledge-to-action networks within communities of practice could possibly be enabled by including enough well-rounded individuals to implement a successful iterative co-design partnership across stakeholders, decision makers and domain specialists.” I actually heard this from a public policy person who likes environmental scientists. When Exhibit A gets to the word ‘possibly’ I see months and years of my life spent in teleconferences and meeting rooms, time slipping inexorably away under the hum and glow of a projector, never to return. So why go here? In answer I offer two little rays of hope. First: when you cross over into Public Policy there is the chance that your contributions might help improve the world a bit. And second: the word ‘iterative’ snuck into Exhibit A. Iterative we like. We’re scientists and we iterate a lot.

So my remarks here primarily concern my own lesson learned: what I would say to my earlier self if presented with a working time machine. “Listen!” I would say. “Listen, you naïve fool! You’re kidding yourself if you think 20 Questions is a tool, like a hammer. It’s not a tool. It’s a contract. It’s a lot of work.” I’m convinced that (my rather liberal reading of) Gray IV is the subject of a future Book: how to use the 20 Questions idea to build working bridges. The main problem with Gray IV as applied to introverted technologists (like me) is the unanticipated cost in time. With a collaborator it is all well and good to sit down for 45 minutes and hammer out 20 questions or 20 data queries; but to use those 45 minutes of effort as the sole basis for future work (say, designing a data system) is ludicrous. What are some of the gotchas? Well, maybe you have to teach your collaborator what a query looks like; so how far down the garden path of relational databases must you go?

If the computer science person requests the initial set of questions in a very open-ended manner, the result from your collaborator will be about problems (technical or research) that are obvious to them in the moment; but they will not anticipate the problems (again, be they technical or research) that will arise immediately once these first problems are solved. If you do 20 Questions mutually, that is, you each produce a set, then the two sets will probably be at cross purposes. The computer scientist will ask “What charts do you need to draw using your data?” and “How do your datasets reference one another?” Meanwhile the Dissolved Organic Carbonologist will ask “How can we expand our methods to capture more molecular formulas?” and “What simple measurement can we make in the field in five seconds that will be an accurate proxy for two hours of lab work?” So the reconciliation begins; we have to learn big chunks of one another’s trade, no shortcuts, to start talking the same language. And this implies lots of meetings, lots of coffee and donuts, lots of scribbling on whiteboards, lots of iteration. (Woo! A little ray of sunshine.)

The exercise doesn’t have to be 20 Questions, either. It can be “Let’s start writing the Nature paper that you will publish once we have analyzed your data.” Never mind that we don’t really know what the data is going to say; we can start writing that paper today using a crazy guess. This exercise can reveal a lot about how the scientist views the research problems, and can also expose what has become so second nature that it goes unexplained. From the data system building angle, you don’t want a developer to create machinery that misses a vital point. Another interesting exercise is to explain your collaborator’s research to a third party. When I can do that without my collaborator interjecting or shaking his head woefully, I know I’m getting somewhere and that the process (the contract) is paying off.

Underpinning all this process is time, time, time. Collaborating costs time, at the expense of other things I might rather be doing. Without a little bit – OK, a huge bit – of dedication to this process, though, I’ve found that 20 Questions Lite can give me pretty much a null return. This is why I’ve come to the conclusion that Gray IV is a contract more than a tool. The win is finding a person on the other side of a given ravine willing to put just as much time and time and time into the process. If the ravine is all within research territory, the 20 Questions process can be a productive formalism or guide. If the ravine reaches across to Public Policy, perhaps this is a way to respond to the horror that is Exhibit A. “That’s easy for you to say,” we shoot back, “now let’s go get some coffee and donuts and start asking one another some questions.” That’s my personal admonition to myself; if you agree, then let’s go write Gray IV The Book. I’d do that now, today, but I have to think about it some more, and besides my day beckons, and also a glowing portal has just opened before me. “Listen!” it is saying, in a voice that sounds like me.

‡ The Fourth Paradigm, p.6.


2 Responses to “Life In The Long Tail of Science”

  1. Mark Parsons

    Nice post, Rob. I've always said good data management is about building relationships.

  2. Rob Fatland

    Thanks Mark, and nice to hear from you :) I hope the machinery we're building will succeed as the proof of the pudding.
