Binary Moment: A One-on-One Q&A About the US Government and Open Data with Jeanne Holm, Chief Data Wrangler and Open Data Evangelist
Open data is in the midst of a revolution, and this mild-mannered woman is hacking through government red tape like a superhero, pulling the data strings, to bring the numbers to you.
Jeanne M. Holm
How did you come about an appointment as the Data.gov Evangelist and what do you feel are your primary objectives in this role?
Holm: My appointment was from the first US CIO of Data.gov, Vivek Kundra. I am with the General Services Administration, so I am working away from NASA to support Data.gov as the Evangelist. The creation of the position came out of a suggestion from the public. There initially had been an open ideation period with the public about what would help guide an open government and open data site. The number one voted response was to have an Evangelist on board who could get connected with developers, with cities, and others working on open data and to help coordinate their systems.
What was your initial response when hearing about the new legislation regarding open data and open access for governmentally-funded research?
I was jumping up and down for joy! With both the Executive Order and the direction from Director Holdren, I really see this as part and parcel of a larger conversation.
In terms of the President and the Executive Order, I am ecstatic about the fact that we have been able to get the support and leadership at the very highest level here in America for how we expect government agencies to collect data, and that we expect the data to be open.
Everybody understands that there are kinds of data that the government collects that should not be released, for either national security reasons or because it contains personally identifiable information. For example, nobody wants the IRS to release personal records. We protect the rights of people very aggressively and we provide training for those working at Data.gov on these types of issues.
On the other hand, we want to release data that is available on the front end for the public to use. This data is accessible, but it may not yet be usable, understandable, or of acceptable quality at this point. The main thing is that in response to the Executive Order there will be three categories of data inventories. In the first are all the records, all the data sets we collect, that can be found in a database now that is publicly available, and we will explain what that data is and where you can find it. These will be a mix of data sets on Data.gov and data sets found on other related web sites, such as those of an agency, a national lab, a contractor, a university, or a researcher’s home page. Secondly, there are data sets that could be made available to the public but are not yet available, so we need to disclose when and where they will be made available. Finally, there will be data sets that are restricted for national security and privacy reasons.
I am very excited about the explosion of events that will happen as a result of the post-November 9th advocacy work for open data for researchers, but I’m even more excited about the fact that the OSTP and Dr. Holdren have set aside the money for funding open data research.
Speaking from both a librarian’s point of view and that of an active researcher (as I happen to be a professor at UCLA), I really think there is a huge well of data that we, as the government, funded, which is presented in papers and at academic conferences and still needs to be made accessible and available for other researchers to use.
There is often a cost to making that data understandable and accessible to other researchers. These costs include, for example, the time to clean the data by removing partial or questionable data, by marking it with metadata, by creating a legend explaining database field tags, and by removing personally identifiable information such as a specific house address or geo-specific location. However, the opportunity cost of not doing this is huge because of how much this data could invigorate and energize new research as well as avoid duplication of research. It can also build a bridge for longitudinal studies in research that is applicable to the sciences, the social sciences, and science policy issues in areas such as climate change.
We are developing a set of metadata standards. Library scientists will be happy to know it is using MARC, so it can be understood from a library science point of view and because it is compliant with many other existing standards. Dealing with all potential fields in legacy data can be difficult, so we created a core set of required metadata that has been modified recently through Project Open Data, as required in the Open Data Executive Order, and that modification has been opened up for public feedback for anyone to comment and make suggestions about policy as well as code. The standards require that each data steward provide a contact name and email address. This was required in case any questions arise about a data set. If the answers are not documented, researchers have someone to contact directly.
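As a rough illustration of the required-metadata idea described above, the following Python sketch checks a data set record for a small core of fields, including the steward's contact name and email. The field names (`title`, `description`, `contact_name`, `contact_email`) are hypothetical placeholders chosen for this example; they are not the actual Project Open Data schema, and the email check is deliberately loose.

```python
import re

# Hypothetical core fields for illustration; the real schema may differ.
REQUIRED_FIELDS = ("title", "description", "contact_name", "contact_email")

# Loose pattern: something@something.tld, with no whitespace.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def missing_metadata(record):
    """Return a sorted list of required fields that are absent, blank, or invalid."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(field)
    # A non-blank email that does not look like an address is also a problem.
    email = record.get("contact_email")
    if isinstance(email, str) and email.strip():
        if not EMAIL_PATTERN.match(email.strip()):
            problems.append("contact_email")
    return sorted(problems)
```

A record with every field filled in yields an empty list, while a record with no steward contact is flagged before it is published, so that researchers always have someone to ask.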
In what ways do you see your role providing public access to government data to be related to the recent legislation mandating open data for governmentally-funded scientific and medical research?
The intersection of open data and big data is potentially positively explosive! It will create innovations and new insights beyond what we can imagine today. I know it sounds dramatic, but it is true.
Take healthcare, for example. What we see today is data from a specific clinical trial, or potentially across several clinical trials. What we will see in the future is analysis that reaches more broadly, across the whole spectrum of, say, a drug class and how it is being used, and researchers will begin looking at it in different ways.
Early studies are just “thought pieces” about what researchers are starting to put together. As that researcher has an idea about their next breakthrough for, say, pancreatic cancer, there is a limited amount of data, sometimes merely anecdotal data, around that research. That will begin to be coupled with things that are further along in development, such as laboratory studies and clinical trials. Suddenly, then, you will start to see results that cannot be imagined today.
Published journal articles, while they are the fundamental bedrock of academic research, are a conversation between the researcher and the reader. Readers relate these articles to their experiences and come up with novel ideas. However, in some cases, it is not possible to take account of all these conversations, along with the data behind them. Data analytics makes it possible to look across all those resources, build models as data bridges, and start to arrive at insights without having to look at each piece individually and figure out the big picture. By doing so, it is possible to create projections using predictive analysis that would not be possible with published articles alone.
So if you look at some longitudinal studies about populations, say in the social sciences where researchers follow children for 30 or 40 years to assess health risks or the impact of early head start programs on educational performance, it is possible to gain some amazing insights. However, those are very costly and difficult. Now we can track, rather than at the individual level, at the broader population level, and conduct those kinds of longitudinal studies from a different perspective by using historical research data as well as current and future research. This type of study might require some data normalization, but it will give us the same type of insights at a much lower cost and open it up to a much broader set of researchers.
There is the potential that when you take data from different institutions or different projects, or even different instruments, and then you try to put those together, sometimes you can run into some problems with the data causing you to draw the wrong conclusions. Say, for example, if you look at statistical analysis and data from different projects, and you are trying to integrate the data into models, you need to be cautious when trying to combine data or integrate the data of others because you can run into some technical problems. Would you agree this can be a problem, and what are your suggestions to overcome this? In particular, how would you advise a researcher who wants to use data from their own project and combine it with data from Data.gov?
Yes, absolutely, I would agree it could be a problem.
We view every data set on Data.gov as published. There is something called a mosaic effect: published data from one agency is fine on its own but, when you combine it with data from another agency, something happens that you never intended. From a technical perspective, you want to look for the mosaic effect to prevent revealing personally identifiable information or national security information.
Mosaic effect: "The mosaic effect occurs when the information in an individual dataset, in isolation, may not pose a risk of identifying an individual (or threatening some other important interest such as security), but when combined with other available information, could pose such risk. Before disclosing potential PII or other potentially sensitive information, agencies must consider other publicly available data – in any medium and from any source – to determine whether some combination of existing data and the data intended to be publicly released could allow for the identification of an individual or pose another security concern."
(Checklist to help prevent the Mosaic effect).
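The mosaic effect defined above can be sketched with a toy example. The Python below is a minimal illustration, not an official checklist procedure: it counts how many records share each combination of attributes and flags combinations shared by fewer than `k` records. The two "agency" releases, the `row` linking key, and the threshold `k=2` are all made up for this example.

```python
from collections import Counter


def small_groups(records, fields, k=2):
    """Return the value combinations of `fields` shared by fewer than k records."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return {combo for combo, n in counts.items() if n < k}


def join_on(left, right, key):
    """Inner-join two lists of dicts on `key` (assumes keys are unique in `right`)."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


# Made-up releases from two hypothetical agencies, linked by a shared row key.
agency_a = [
    {"row": 1, "zip": "90001"},
    {"row": 2, "zip": "90001"},
    {"row": 3, "zip": "90002"},
    {"row": 4, "zip": "90002"},
]
agency_b = [
    {"row": 1, "birth_year": 1970},
    {"row": 2, "birth_year": 1980},
    {"row": 3, "birth_year": 1970},
    {"row": 4, "birth_year": 1980},
]

# Each release alone has no group smaller than k.
alone_a = small_groups(agency_a, ["zip"])
alone_b = small_groups(agency_b, ["birth_year"])

# Combined, every (zip, birth_year) pair points at exactly one record.
combined = small_groups(join_on(agency_a, agency_b, "row"), ["zip", "birth_year"])
```

In this toy data neither release singles anyone out, yet after the join every record is unique, which is the kind of unintended result the definition warns about.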
In terms of drawing the wrong conclusions, I often see this happening because of the periodicity of data gathering. So perhaps one scientific gathering occurs every day and another scientific gathering occurs every month. A researcher would try to normalize the data. The results are not always exactly true because maybe the insight actually comes from the very correction, so that when they are aggregated, they do not actually show what is happening. I think that when you try to bring together data sets that have some kinds of different elements, whether it is periodicity or population size or maybe the historical period, there may be some contextual considerations. If I were the researcher, what I would do is look at the individual data sets and the associated papers to understand what it is that the visualization of this data tells me in order to spot potential problems resulting from normalization.
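The periodicity problem described above can be seen in a small, made-up example. The Python sketch below aggregates a daily series to monthly means so it can sit alongside monthly data; the 30-day "month" and the values themselves are assumptions for illustration only.

```python
from statistics import mean


def monthly_means(daily_values, days_per_month=30):
    """Average a daily series in fixed-size blocks (a simplified 30-day 'month')."""
    return [
        mean(daily_values[i:i + days_per_month])
        for i in range(0, len(daily_values), days_per_month)
    ]


# Sixty days of flat readings with a single one-day spike in the second month.
daily = [10.0] * 60
daily[44] = 100.0

monthly = monthly_means(daily)
```

The daily series peaks at 100.0, but the second monthly mean is only 13.0, so the event that might carry the insight is nearly invisible after normalization. Plotting the individual data sets before combining them is one way to spot this.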
As we move from an Information Economy to a Knowledge Economy, what are your best pieces of advice for librarians to take advantage of the Open Data legislation and Data.gov? In particular, would you elaborate on some of the differences between what a librarian does versus a knowledge manager? Specifically, what are your recommendations for how librarians working in scientific and medical libraries might learn from knowledge managers in order to better define their role within the institution working with big data? I am especially referring to librarians who might make tools, become data curators, create data management plans, build systems to ensure open access compliance for federally-funded research in their organization, as well as perform outreach to faculty and encourage deposit of their data sets into a central repository like DSpace.
I think that is a really great question. I think the two fields of library science and knowledge management are so interwoven as to be almost indistinguishable in some ways. I think knowledge management professionals and library science professionals are maybe on the same path, and this is especially true of the subset of librarians who are digital librarians.
I think that for specialized libraries, like scientific and medical libraries, they have already chosen to go either way: either to be in a physical location or to augment that with a virtual library. The next step for digital librarians helping their patrons to get access to information is helping them to take advantage of that information. As we start making more data sets publicly available, most scientific libraries have institutional assets that the library could provide access to, or to provide training for, or to show instances where people are able to go in and use these tools — with the goal of looking at, and making sense of, the data that the institution has access to. Helping your audience become smart, savvy, and sophisticated users of this enhanced access to information that libraries provide today is really a great place for librarians to look and learn. Certainly as they do that, data curators, who might otherwise have evaluated journals, can now evaluate the best data sources to augment traditional publications.
The first half of my career I spent as a librarian for over 15 years, and to some extent what was true then is true now. What I would advise librarians to do is to try to get researchers to look up from inside their own cubicle and realize the power they have to help inform and educate others, even if it is just explaining how open data is important to their institution. From Data.gov’s perspective, getting one researcher to share from their department, as well as the institution, is important.
Librarians are generally comfortable in the role of providing reference services, but I am not sure that all institutional librarians would be comfortable in the dynamic outreach role of becoming the “crier” saying, “You need to share this.”
I can tell you there are two things that helped me when I was in that role myself. One of them is making sure they are aware there is a compliance issue. Every institution has different aspects of what needs to be shared out of their research work. If you are going to share that information with the broader scientific community, the institution is going to have a documented process of how they are going to convey the technical details and get that cleared. Help the researcher understand that if they follow your guidance they will not fall afoul of these many, many, many regulations.
The White House’s Office of Management & Budget (OMB) created a helpful implementation guide for agencies and organizations while creating their institutional policies and compliance tools for the new M-13-13 open data regulations. The guide suggests, among other things, a compliance workflow as well as elements like tagging, reference models and controlled vocabulary, mapping of identifiers across collections, documented uses of the data as an indicator of its value to the organization, and a statement on how the data set achieves open data compliance.
The second thing is helping researchers see what is happening after they share their data. In a variety of sectors, from health services to energy, there are very interesting companies that have built innovative, fabulous services and business models, that have come about (at least partially) due to open data. When I have personally gone back to people who are in the Treasury or in Energy Services, it is possible for me to say to them, “Look at what people are doing with this data.” “See the lives that are being saved?” “See how this is benefiting consumers?” Suddenly the lights go on! They realize it is not just about some departmental objective of publishing data with their journal articles, or a report, from a government perspective, but the huge impact that their work is having. I think that is what makes a transformational difference. It is about helping people to understand how others consume the information they provide to make the world better around them.
Since open data is public and reaches those in America as well as those beyond America, do you feel there is a broader impact that Data.gov and this new science and technology policy are having? For example, do you feel it is helping to provide a more open and transparent government? Is it facilitating the implementation of the fundamental principles of democracy at home as well as helping to promote democracy as part of our foreign policy abroad? How might open data policies be relevant for the creation of collaborative international efforts leading to open data partnerships — say with Iceland, the Faroe Islands, and others who are doing large scale population genetics research, for example? Or, perhaps open data could result in collaborations serving to empower developing countries by improving health care and economic prosperity?
Transparency drives Data.gov. It was created as part of the initial open government initiative that President Obama signed on his first day in office. It came up very quickly right after Data.gov had our very first activity function. It has always been, from our perspective, about open government and transparency — transparency as to understanding the government and evaluating the performance of government. Obviously today, in this interview, we are focusing on scientific data, but it is more than just that. It is really trying to make sure citizens, for whatever reason, have better access to information they have paid for that the government is collecting and the research that drives developments in technology, without having to have any special subscription or membership card. So whether you are an individual trying to start a small business and want to make money off of it or if you are a nonprofit and your goal is to use it for social good, say for alleviating poverty, it does not matter … it is still there and accessible for everyone. I think that is the fundamental principle of what drives us.
Thinking beyond the borders of the United States — and you are right, we know that when we open it up to Americans we are opening it up to the world — that is part of the transparency and the Open Government Partnership, which is a collaborative effort that we participate in with about 45 other nations. Most of these, but not all, are open data partners. They are opening up data not just for the benefits of their citizens but all citizens around the world.
I want to mention two projects in terms of working with the developing world. The first is the Open Government Platform (OGPL), a project working with countries like India, Canada, Ghana, and Rwanda. We opened the source code on the back end of Data.gov and modified it for use by these folks to create an open data site as well as provided a guide to best practices. The second thing is an outcome, which is that we frequently have a large conference here in Washington DC, sponsored by the US Department of Agriculture, in conjunction with other activities. This project helps farmers achieve better and more stable crops. You would be surprised, but there are farmers from Africa working with data sets on high-powered laptops to perform data analysis to improve their crop yields. What we found is that working with small groups like iCow, farmers can use their personal mobile phones that have the ability to send text messages, to send a query with their cow’s health symptoms and someone on the other end will find and tell them about the latest treatments for various cow diseases. We are even able to connect open data related to that research to them. Another way open data helps farmers in Africa is using climate modeling to help predict the best planting patterns to improve how crops will perform from year to year.
Overall, open data is leading to better health care and greater economic prosperity in America and around the world.
In what way does the W3C help shape the development and use of standards, ontologies, and taxonomies for the WWW? Does your involvement in both the policy side and the practical, data collection side of things at Data.gov help to inform and shape your input as co-Chair of the W3C group?
I co-chair the W3C eGovernment Interest Group and we hold a monthly meeting. This group has been collaborating with a few other groups to produce a new best practice guideline for data practices on the web. At Data.gov, we are committed to adopting and following accepted standards, like Dublin Core and MARC, that the W3C writes and promotes. There is no better way to guarantee the future interoperability, accessibility, and longevity of open data. It is very difficult to predict what Data.gov will look like in 10 years as technology advances, so working with those supported standards is powerful. The way that people are pulled into the W3C group, it is a community effort and we all live in the world of open data. There are some policy people in the group. While having those policy people helps lead to a richer conversation, it also brings into the dialogue concepts that lean toward an elevation of standards slightly higher than what actually might be implementable.
Briefly describe your interest and involvement in virtual worlds, immersive environments, and online learning, particularly the use of immersive virtual worlds for simulation and modeling of data.
The whole aspect of visualization, whether it is 2D, or 3D, or immersive, has a huge ability for helping us understand the data. When we visualize data, rather than just looking at a spreadsheet, it results in a visual and “gut” understanding, so the power of visualization is important. Now, take that to the next generation, to virtual worlds, and again you are at a completely new experiential level. The ability to render visualizations of these data sets as “experiences” in a virtual world can be very powerful. One of the things that we did before in Second Life with NASA was a visualization of climate models for future forecasts. We had the visualization as graphs and charts, but part of what the immersive virtual world experience offered was that someone could actually walk into a house at sea level, experience a three-degree increase in global temperature, and see the water level rise around them. When you have an avatar in a virtual world, you are able to explore the data and experience the consequences of that data as you might in the real world. I am personally involved in virtual worlds and I am very hopeful that folks who are working on that will keep it moving forward to the next iteration.
Correction: UPDATE (11:25 PM ET, September 27, 2013): This blog post originally cited an OGPL partnership with Kenya, when it should have read Canada. It has now been corrected.