The Delights of Data

22 May 2013 by Tania Browne, posted in Epidemiology, Statistics

Raimond Spekking CC-BY-SA-3.0

Three little words. The whole of epidemiology, taking in the whole sweep of humanity and what our health is like at any given time across the globe, can be summed up with three little words:

Frequency. Distribution. Determinants.




That's it. Quite technical words, I grant you, but easy enough to get your head around. Frequency just means how much, how many cases of something over a period of time. Distribution - who's getting it? Is it more likely to be someone of a particular age, sex, race, culture, social class? Is it more likely to pop up around where they live, be it a particular country or a particular suburb of a particular town?

And just like there might be lots of things that affect the distribution, there might be lots of determinants. The cause of a disease might be one thing, or it might be a whole big chain of events where that one thing was made worse by a factor in your lifestyle, where you lived, how you worked - all sorts. Health and disease are not random. In previous ages we might have said someone was possessed by demons or struck down for moral lapses by an angry god. Now we know it's a bit more complicated than that.

Epidemiology is about making sense of health with those three little words that cover so much. But all those words rely on one other word:

Data.


Because we can't even begin to figure it all out without data. We can do all the fancy detective work in the world, but unless there is data, we have nothing but a nice theory. Data puts our theories into practice.

So where do we get the data from?

There's no delicate way to put this. Epidemiology likes to spread itself about a bit. It likes to play the fields. Lots of fields will claim they started it and it uses their methods, but in reality it's a bit of a magpie. On the good side, this means it picks up a lot of useful methods and data from all over the place. On the bad side, it can make learning about it seem a bit huge and overwhelming.

Epidemiology can use history and demography - the study of populations in the past or present. It can use sociology, statistics, biochemistry... It depends on how large scale or small scale you want to be. Was that food poisoning outbreak among the Rotary Club members because of the history of Rotary clubs and the human need to be charitable? Because they all belong to the club as a nice social thing? Because they went to the same restaurant for a club dinner? Because a certain percentage of them had the prawn mayonnaise? Because the mayonnaise was made with unsterilised raw eggs? Because the distributor had not checked the eggs were properly processed? Because the salmonella bacteria were present in the egg? Is it because the AvrA toxin released by rod-shaped salmonella bacteria suppresses the immune system, allowing the bacteria to multiply and produce waste relatively unhindered?

It's down in part to all of them, and it shows the many places our data can come from.

Quality Data - Words

Michael Maggs CC-BY-SA-3.0

We can look at data in two different ways. Epidemiology is partly a social thing - that's not to say we have lots of parties, but rather we spend a lot of time dealing with the lives of social groups and the dynamics of societies when looking for health issues. There might not be a lot of numbers in the resulting data, but there's one heck of a lot of description. Qualitative data is the kind of data that mainly describes things in words - interviewing people, doing surveys, observing people interacting, doing their jobs and the like. Qualitative data comes in categories, in boxes. People are hard to quantify, so in some cases, qualitative data might be the best we can do.

But qualitative data has many problems. It's very hard to generalise from, and very hard to assume that a different group of people somewhere would do the same thing. There's a reason we don't have a "Second Law of Whooping Cough Distribution" in the same way we have a Second Law of Thermodynamics. Societies are just too unpredictable. There are too many factors to think about that may or may not happen. There is also the major issue of potential bias, whether from the scientist studying the population, or the population that behaves differently because they know a scientist is watching them.

There are also a lot of potential problems with analysing and reporting qualitative data. At the end of the day, you have a stack of notes with observations and interviews, and very little to show your "analysis" of them and how you avoided your own biases creeping in. The way we get around that is by using codes.

We've already talked a little about how a coding system is used to record illness and report death, and coding systems are also used on a slightly smaller scale to give some meaning to all kinds of qualitative data. If you can break your big mass of data down into coded chunks or files, it makes everything seem a bit less overwhelming, and easier to analyse. You can have a computer programme that pulls everything with a particular code out at once so you can compare, say, all women over 70 who visited their GP with hearing problems last month and were asked about their quality of care. Coding doesn't change all your data to numbers, but it certainly makes it a bit easier to handle objectively and draw theories from.
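To make that a bit more concrete, here is a minimal sketch in Python of what "pulling everything with a particular code" might look like. Everything in it is an assumption made up for illustration: the `Record` fields, the code names such as `GP_VISIT` and `HEARING`, the `pull` helper and the example notes are all hypothetical, not a real coding scheme or a real data set. It also leaves out the "last month" part, which would need a date on each record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One coded interview note. All fields and codes are hypothetical."""
    sex: str
    age: int
    codes: frozenset
    note: str


def pull(records, wanted_codes, sex=None, older_than=None):
    """Return records carrying every wanted code, optionally filtered by sex and age."""
    wanted = frozenset(wanted_codes)
    return [
        r for r in records
        if wanted <= r.codes                      # record has all the wanted codes
        and (sex is None or r.sex == sex)
        and (older_than is None or r.age > older_than)
    ]


# Invented example notes, purely for illustration.
records = [
    Record("F", 74, frozenset({"GP_VISIT", "HEARING", "CARE_QUALITY"}), "Felt listened to."),
    Record("F", 68, frozenset({"GP_VISIT", "HEARING", "CARE_QUALITY"}), "Long wait."),
    Record("M", 81, frozenset({"GP_VISIT", "HEARING", "CARE_QUALITY"}), "Happy with referral."),
    Record("F", 79, frozenset({"GP_VISIT", "MOBILITY"}), "Asked about a walking aid."),
]

# All women over 70 who visited their GP with hearing problems
# and were asked about their quality of care.
matches = pull(records, {"GP_VISIT", "HEARING", "CARE_QUALITY"}, sex="F", older_than=70)
for r in matches:
    print(r.age, r.note)
```

The notes themselves stay as words. The codes just give the computer something to sort and compare them by.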

Quantity Data - Think of a Number

Cjangaritas CC-BY-SA-3.0

Remember when I said earlier that there was no Second Law of Whooping Cough Distribution? That doesn't mean people haven't tried. The beginning of the 19th century sparked a craze for trying to quantify people and their daily lives through the use of statistics. Huge bureaucracies popped up across Northern Europe chasing a noble ideal of quantifying their populations - birth, death, illness, crime, you name it. Even how happy they were. The idea of the "average" citizen became common, as well as the idea of measuring how far real people deviated from this statistical ideal. Studying society through numbers would lead to better governance and help plan for the future, it was thought - and that idea has never really gone away.

Quantitative data represents something you can either measure or count. If you're counting it, it's most likely what we call discrete data - not the kind of data you can trust with secrets, but data that can be counted only in whole numbers. For instance, your household may own 2 cars. You can't have 2 1/2 cars any more than you could have the 2.4 children that's the famous stereotype of UK suburban life. Apart from anything else it would just be a bit messy, with cars or people. Think of the mopping.

But of course, quantitative data can also be continuous. If you watched any of the Olympic Games last year, you will have seen prime examples of continuous data in action. Runners faster than others by tenths or even hundredths of seconds, tiny measurements of time that mean the difference between a medal and nothing. Watching one of your 2.4 children grow is also a prime example of continuous data. They don't just wake up one morning suddenly a centimetre taller, it happens gradually over time, without being obvious. Until one day they're a teenager and start eating everything in the fridge before heading off to meet their friends at the skate park.


Once you start seeing the world in terms of data it can be quite hard to stop. Take my imaginary nursery class, who are gathered on the field for their annual sports day, parents at the sidelines watching. 38 children are there in total, 22 with both parents watching and 16 with one parent there, making a total of 60 parents (discrete quantitative data). 19 of the children are girls and 19 boys (qualitative data) and 21 of the children have blue eyes (qualitative too). There are 23 extra cars on the road outside because of the spectating parents, making parking hard (discrete quantitative), 7 blue, 9 silver, 2 black and 5 red (qualitative). The heights of everyone on the field range between 92cm and 189cm (continuous quantitative data) and the timings for the 100m race were between 13 and 22 seconds (same again).
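If you fancy checking the sports day sums, they can be sketched in a few lines of Python. The numbers all come from the imaginary class above. The variable names, and the idea of labelling each measurement with its kind of data in a little dictionary, are just my own illustrative choices.

```python
# Counts from the imaginary nursery sports day.
children = 38
with_both_parents = 22
with_one_parent = 16

# Discrete quantitative: parents can only be counted in whole numbers.
parents = with_both_parents * 2 + with_one_parent

# Qualitative: each child or car falls into a category (a box).
girls, boys = 19, 19
car_colours = {"blue": 7, "silver": 9, "black": 2, "red": 5}

# Counting how many items sit in the boxes gives a discrete number again.
cars = sum(car_colours.values())

# Continuous quantitative: any value within a range is possible.
height_range_cm = (92.0, 189.0)
race_time_range_s = (13.0, 22.0)

# Illustrative labels for each kind of measurement.
kinds = {
    "number of parents": "discrete quantitative",
    "sex of child": "qualitative",
    "eye colour": "qualitative",
    "number of cars": "discrete quantitative",
    "car colour": "qualitative",
    "height": "continuous quantitative",
    "100m time": "continuous quantitative",
}

print(parents, cars)
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

Notice that the car colours are qualitative, but the moment you count how many cars are in each colour box you are back to discrete numbers.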

Stuff that can be turned into useful data is everywhere. It's just a slightly different way of thinking, and it's the way epidemiologists try to look at the world to investigate our health. Information is everywhere, information can be beautiful. It can also reveal horribly ugly truths and, with prodding, it can reveal secrets. Epidemiologists are people who like to prod, and they're not afraid of the data and analysis that will take them to the answer. So really, those three little words that sum it all up should rightly be four:

Frequency. Distribution. Determinants. Data.






