What is E-science and How Should it be Managed?
The definition of E-science and its related standards are still tenuous and flexible, but are about to undergo further delineation in the US. This is occurring as a response by many of the organizations called to action by the OSTP in its most recent efforts. (See Part 1 of this series, “Open Access Advocates Trumpet the Fall of the Paywall,” to learn more about the open access debate). That is, scientific and medical researchers, librarians, and publishers are attempting to answer the important question about how the creation of science policy can ensure that the full benefits of scientific information and data being generated from federal funding will be successfully capitalized upon in today's knowledge economy. Library professionals, particularly those employed in science and medical libraries, have been providing participatory feedback in federal policy changes that will affect retention of author rights and federal funding compliance requirements at their home institutions and beyond.
What does E-science mean to me? E-science is the application of computer technology to the undertaking of modern scientific investigation, including the preparation, experimentation, data collection, results dissemination, and long-term storage and accessibility of all materials generated through the scientific process. These may include data modeling and analysis, electronic/digitized laboratory notebooks, raw and fitted data sets, manuscript production and draft versions, preprints, and print and/or electronic publications. Increasingly, E-science materials have been the subject of scrutiny regarding major questions addressing the process of scientific integrity and ethics. For example, reproducing many experiments may not be feasible, so academic journals may require close examination of documented and published data sets, particularly in areas such as biology, where scientists often lack rigorous statistical training. Additionally, supplying data sets and providing clear and transparent methodologies used in experiments is essential for maintaining professional ethics and data integrity in support of conclusions. E-science also allows the reuse of data, saving time and taxpayer dollars by preventing the need to duplicate data collection, particularly in cases where collection takes place over decades. Finally, E-science allows the computer modeling and simulation of collected data, enabling theoretical hypotheses to be tested that might not be possible otherwise, improving efficiency, and reducing the time-to-market for scientific and medical products. In short, E-science is a key initiative for developing a nation’s ability to explore “blue skies” theoretical possibilities as well as to compete and thrive in a global marketplace attempting to solve problems in health, environmental, energy, and technology industries. Interestingly, many of these critical elements, like laboratory notebooks, were omitted from the OSTP mandate.
Librarians Managing Data as Science Informationists
It may surprise many who hold outdated stereotypical views of what a librarian knows and does that librarians often receive training in information science and the data sciences. A very common trend in academic and research libraries is for librarians in tenured positions to hold not one but two master’s degrees, one in library science or library and information science, and another in a subject area. Some librarians even have doctoral degrees or training in niche areas like bioinformatics or computer science, or earn a specialization in one or more scientific or medical disciplines. So, should librarians be involved in data science, or should this be an area reserved for scientists and computer departments? My former supervisor and the Executive Director of Library and Archives at Cold Spring Harbor Laboratory, Ludmila Pollock, was the primary author, along with other “representatives of an international group of library directors, scientists and research administrators” including Woods Hole, The Rockefeller University, and Memorial Sloan-Kettering Cancer Center, of “Data management: Librarians or science informationists?” Their collective answer in Nature (490, 343 (18 October 2012), doi:10.1038/490343d. Published online 17 October 2012) was that “science librarians have evolved into ‘science informationists’” who are focused on science data management, data curation and training. Why should librarians fill this role? Changes to the profession, they argued, were the result of “increasing volume and complexity of knowledge demands new organizational techniques.” Specifically, these included the ability to educate and train others on the tools and best strategies for “searching, visualization, data mining and analysis.” The publication outlined a few of the new responsibilities of librarians in dealing with the data deluge that accompanies the digital age.
“Science informationists collaborate with scientists to enhance their research by helping them to assess its impact, and to curate and manage data. They make knowledge accessible, for example by using their skills in tailoring vocabularies and ontologies. They preserve and showcase their institutions’ intellectual output by building networked repositories, and they work with publishers to improve standards, platforms, publication models and search facilities in the interests of better communication,” Pollock et al. stated. “Science informationists also build sustainable systems through broad collaborations and seek out the best ways to develop these relationships within their institutions. Like researchers, they understand that science needs risk-takers, innovators and visionaries." These sentiments were echoed again in “Publishing frontiers: The library reboot” (Nature 495, 430–432 (28 March 2013) doi:10.1038/495430a. Published online 17 October 2012), where librarians were referred to as “the new data wranglers.” According to the Nature science writer, Richard Monastersky, “at Johns Hopkins and many other top universities, libraries are aiming to become more active partners in the research enterprise — altering the way scientists conduct and publish their work. Libraries are looking to assist with all stages of research, by offering guidance and tools for collecting, exploring, visualizing, labelling and sharing data.” This is leading to a fundamental change in the way E-science is managed by librarians and how librarians are hoping to rebrand their roles in the age of Google search. As an example, he cites Sarah Thomas, the head of libraries at the University of Oxford, UK. “‘I see us moving up the food chain and being co-contributors to the creation of new knowledge,’” she said.
Laboratory Notebooks and E-science
The OSTP’s sidestepping of open laboratory notebooks in its mandate was not a complete surprise. As the archivist who worked on Nobel laureate Dr. James D. Watson’s laboratory notebooks collection, I am no stranger to the issues surrounding the long-term preservation of laboratory notebooks. Just after I completed the processing and preservation of Dr. Watson’s notebooks, in 2007, two other Nobel laureates in Physiology or Medicine, Sydney Brenner and Rich Roberts, wrote a correspondence to Nature, issuing a plea to scientists to “Save your notes, drafts, and printouts.” They argued that “science is one of the greatest cultural achievements of humankind. And yet...there is little systematic preservation of the workings of scientists.” This is how I opened my talk for “Science Archives and History: Facilitating Discovery through Laboratory Notebooks,” a paper I presented at the Sixth Three Societies Conference, an international gathering of historians of science held at the University of Oxford in 2008. This talk was followed by “Laboratory Notebooks,” a paper presented in the “Preserving Digital Research Data in the Health Sciences” panel at the annual meeting of the Society of American Archivists in 2009. What I pointed out in these talks was that “laboratory notebooks document the day-to-day activities of a scientist. They also [complement] results in journals and patents by demonstrating how experiments were done in detail, tracing each step and thought leading to a discovery…The public display of laboratory notebooks, then, holds the potential to profoundly influence the future of science education and research by advancing discussion and the sharing of ideas amongst scientists through what has been called open notebook science.” Why are laboratory notebooks both a valuable and a particularly difficult topic for open data and E-science?
I mentioned how “Archival materials, due to the way in which professional archivists are trained to treat materials, have also been admissible in a court of law as evidence. Because scientists’ laboratory notebooks are used as evidence in cases involving patent litigation, preserving them in an archives environment enables them to maintain their legal usefulness, perhaps indefinitely." It has been very much in the interests of research institutions and corporations to keep these materials private for commercial purposes relating to potential patents and discoveries. The complicated legal value and legal use of laboratory notebooks, then, may very well be a good reason for their omission in the OSTP mandate. It also helps explain why little has been done to launch a national online notebooks archival depository as I had argued for in my presentations.
Members of the National Academies Hold Public Discussions on Open Data
Not surprisingly, libraries and librarians were represented as both keynote and public commentary speakers in the open data portion of the National Academies' open hall discussions on “Public Access to Federally-Supported Research and Development Data and Publications” held on May 16-17 in Washington DC.
In terms of E-science, Micah Altman, Director of Research and Head Scientist for the Program on Information Science at MIT Libraries, stated, “publications are a summary of portions of the science conducted. Often to fully understand, replicate, and extend the science requires data produced by the science, external data, external publication[s], software, and sometimes even 'lab notebooks,' records of data collection, [and] research conduct.” Once again, lab notebooks were emphasized by a librarian as a key component of E-science. He emphasized “the fragility of digital information,” pointing out the fact that “researchers lack archiving capability” and that “individualized incentives for preserving evidence base are weak.” To improve policy and data management plans, Altman argued, it is necessary to understand that “data fits into a research life cycle. There is design, creation, collection, it gets stored somewhere, that gets processed [and] shared internally among resource groups, analyzed, some publications come out of it, and there is a cycle of reuse [and] long term access which leads [back] to creation and collection...and stakeholders come in at different parts of the process.” Data captured during the entire life cycle need to be preserved, he argued, because different stakeholders are engaged at different points in time and researchers may be interested in the whole picture. The data are important, but so is the associated metadata, which supplies the contextual information necessary to interpret the data, as well as the provenance of that data. Both are needed to guarantee that the data are authentic and trustworthy evidence of the scientific process. Additionally, data preservation is important because it captures the tacit knowledge of scientists for dissemination.
Metadata capture and creation, whether raw or added, are not the only elements adding to the cost of preservation; privacy can also become a cost barrier. Stakeholders must not only prepare and store data in their raw form but are also legally obligated to create and maintain anonymized and redacted versions of that data. This is done in order to provide differing accessibility levels for researchers and the public that comply with laws and regulations regarding data privacy and confidentiality as well as national security. Some of these include HIPAA, FERPA, CIPSEA, state privacy laws, EAR and ITAR, and copyright and trademark laws.
Libraries have been collaborating with one another on the issues of long-term access and digital asset management for some time and as a result have developed a variety of multi-institutional organizations, best practices guidelines, automated data integrity and authenticity tools, as well as repository standards and certifications, all of which have emerged from the combined experiences of many institutions and the problems they have solved. Two of the most well known of these are the Lots of Copies Keep Stuff Safe (LOCKSS) program based at Stanford University Libraries and Trustworthy Repositories Audit & Certification (TRAC).
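To make the idea of an automated integrity check concrete, here is a minimal sketch in Python of what is often called a fixity check: a checksum is recorded alongside a few provenance fields when a data set is deposited, and the same checksum is recomputed later to detect any change. This is only an illustration of the general technique. It is not how LOCKSS or any TRAC-audited repository actually works, and the function names and manifest fields (`make_record`, `verify`, `source`, `recorded_at`) are assumptions made up for this example.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def make_record(name: str, data: bytes, source: str) -> dict:
    """Build a hypothetical manifest entry: checksum plus minimal provenance."""
    return {
        "name": name,
        "sha256": sha256_hex(data),
        "size_bytes": len(data),
        "source": source,  # who or what produced the data (assumed field)
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(record: dict, data: bytes) -> bool:
    """Return True if the data still matches the checksum in the record."""
    return sha256_hex(data) == record["sha256"]


if __name__ == "__main__":
    original = b"sample_id,reading\nA1,0.42\nA2,0.57\n"
    record = make_record("readings.csv", original, "lab instrument export")
    print(json.dumps(record, indent=2))
    print("unchanged copy verifies:", verify(record, original))
    print("altered copy verifies:", verify(record, original + b"A3,0.61\n"))
```

A checksum on its own shows only that the bytes have not changed since the record was made; the provenance fields are what tie those bytes back to a source, which is why the two travel together in the manifest.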
The next keynote speaker was Victoria Stodden, who earned her doctorate in statistics and her law degree at Stanford University. Stodden focuses her research on the reproducibility of computational results and the “epistemology and technology” as Assistant Professor of Statistics at Columbia University.
Stodden’s talk reminded us that the idea of open data is not a new one; indeed, historians and philosophers of science credit Robert Boyle with stressing the concepts of skepticism, transparency, and reproducibility for independent verification in scholarly publishing in the 1660s.
The scientific method later was divided into two major branches, deductive and empirical approaches, she noted. Today, a theoretical revision in the scientific method should include a new branch, Stodden advocated, that of the computational approach, where like the other two methods, all of the computational steps by which scientists draw conclusions are revealed. This is because within the last 20 years, people have been grappling with how to handle changes in high performance computing and simulation. What is often referred to as “big data” has revolutionized science. Some examples of this include the Large Hadron Collider (LHC) at CERN which generates around 780 terabytes per year, the Sloan Digital Sky Survey that recently released 60 terabytes, and computational biology, bioinformatics, and genomics which are also highly data intensive modern fields of science, she concluded.
To continue Boyle's tradition in the computational age, Stodden argued, open data needs to be mandated as part of the scientific process in order to understand how scientists reached their results and to confirm that the methodologies employed were correct. It follows that maintaining the accuracy of the scientific record would also require a dramatic shift in the dissemination of scientific communication, one that includes data and computer code. The usage of this “abundance of data,” the conclusions that can be drawn, and the ability to repurpose data and gain from these benefits are some of the key reasons for the Obama Administration's Executive Order and Executive Memorandum. However, fully understanding the data and simulations that result from computation requires the disclosure of the name and version of the software used, as well as access to it, in order to verify, validate or reuse the data in meaningful ways, she emphasized. Meaningful use and maximized use in scientific and health data is what Jon Claerbout referred to as “Really Reproducible Research.” “Science isn't about knowing the answer,” Stodden said, “it's knowing how you got the answer.” To do this, data and software repositories need to be created that follow a rigorous framework for permanent collection, storage, and software accessibility to the data, and these should be linked to published articles in journals that also disclose data sharing and computer code policies in relation to established standards. Finally, citation standards need to be developed for data sets, just as they are for other types of scholarly publications. While Stodden did not mention it, these could also be incorporated into citation style guides. What this all means, looking at it from a scientist's viewpoint, is that there is a relationship between high impact journals and their comfort in placing nontrivial “extra burdens” on authors by implementing data and code policies, she noted.
However, Stodden has not yet closely examined this relationship to make a comparative determination on whether or not articles published with their data within these journals often result in higher impact factors than those without them.
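Stodden's point about disclosing the name and version of the software used can be illustrated with a short Python sketch that writes down the computing environment next to a result. It is a rough illustration under my own assumptions, not a standard that Stodden or any journal prescribes; the helper names `package_version` and `describe_environment` are invented for this example, and the package names passed in are placeholders.

```python
import json
import platform
from importlib import metadata
from typing import Dict, Iterable


def package_version(name: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def describe_environment(packages: Iterable[str]) -> Dict[str, object]:
    """Collect the interpreter, operating system, and package versions."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
    }


if __name__ == "__main__":
    # Placeholder package names; list whatever the analysis actually imports.
    environment = describe_environment(["numpy", "pandas"])
    print(json.dumps(environment, indent=2))
```

Saving a record like this with each data set or figure gives a later reader the minimum needed to attempt a rerun, although access to the software itself, which Stodden also called for, is a separate matter.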
David Fearon from the Data Management Services at the Sheridan Libraries of Johns Hopkins University addressed data management and archiving interests from the perspective of the American academic library. Researchers and faculty reviewers of grant proposals often consult the library in the creation and evaluation of their Data Management Plans (DMPs). Fearon stated that, in his experience, these individuals in the past have lacked “a clear incentive to produce more than a cursory plan” to meet the minimum compliance requirements of a funding agency rather than to invest in high quality, comprehensive, and effective DMPs that reflect an ideology “among their fellow researchers that DMPs are useful and that data sharing is important.” Fearon stated that he obtained anecdotal reports indicating that, because evaluation criteria for DMPs have not been formulated for faculty reviewers, they have commented to applicants neither about “the quality of their data management plans” nor about “their data sharing efforts.”
As a related aside, DMPs were one of the key areas of an earlier discussion at the sixth Digital Curation Conference (DCC) hosted by the JISC-funded Digital Curation Center in collaboration with the Cambridge University Library, held on 9-11 November 2011. At this three-day intensive conference, which I attended, research institutions from around the UK were already grappling with how to manage open data. One of the key problem areas presented was that while researchers knew about their responsibilities for preserving their data, at that time there were no specific guidelines that institutions had to follow and little to no enforcement by the funders for following the guidelines when publishing. It also became apparent that many researchers were generally reluctant to share their data in addition to the published findings, which is consistent with what Fearon was saying at this meeting. What has emerged from this recent discussion in 2013 has been that even if data are provided openly, the infrastructure is not yet ready to fully support it and make use of it. But let us examine a few of the particular barriers preventing ease of data set publication raised at Cambridge. David Shotton of the Image Bioinformatics Research Group at the University of Oxford pointed out the key findings from The Royal Society’s “Science as an Open Enterprise” study (the final report was published in 2012) wherein barriers for researchers included the time involvement needed for “preparing data for publication,” the fact that “metadata concepts are foreign to most biomedical researchers,” and that little professional prestige and promotion potential hinged upon publishing data sets when compared with peer-reviewed articles.
Elin Strangeland, DSpace@Cambridge Repository Manager at Cambridge University Library, discussed as a case study the university’s infrastructure and actual advances in research data management, emphasizing how “responsibilities and procedures for the storage and disposal of data and samples should be made clear at the commencement of any project.” According to Strangeland, data sharing has been so compartmentalized that sharing even within the same university has become problematic because “research groups tend to run their own little fiefdom.”
R. Michael Tanner, representing the Association of Research Libraries (ARL), emphasized policy formation initiatives that minimize “cost and complexity” for “administrative overhead compliance with grant requirements for both principal investigators and research administration.” He recommended that federal funding agencies coordinate resources to ensure the development of an “infrastructure for [the] curation, description, storage and preservation of digital data” through “well-managed, sustained preservation archives that enable a legal and policy-compliant peer-to-peer model for sharing.” The OSTP goes a long way in addressing many of these concerns. However, the definition of ‘data’ in OMB Circular A-110, Tanner noted, is incomplete because it omits critical aspects related to documenting research that I mentioned earlier in my definition of E-science, such as the preservation and sharing of professional correspondence, manuscript drafts, and, especially, laboratory notebooks.
John L. King, W.W. Bishop Professor in the School of Information and Library Science of the University of Michigan, was also a keynote speaker. As a former Interim University Librarian, he pointed out that the single largest line item in an academic institution’s budget is often the library. He therefore found it problematic that the costs and benefits of the open data OSTP directive have not yet been fully quantified, and he posed the question of where the true burden of cost will fall.
Ginny Steel of the University of California Libraries, like Stodden, directly specified the needs of libraries when it comes to governance and policy-making, specifically as it pertains to providing patrons with data in compliance with copyright law:
Facts are not copyrightable under United States law. However, because not all data users realize this, because facts may be copyrightable in other jurisdictions, and because it is not always clear whether data are purely factual or contain copyrightable expression, copyright concerns can inhibit productive reuse of shared data. Agencies should enable the widest possible reuse of data by recommending clear and permissible reuse terms, such as the CC0 mark. Agencies can encourage and fund working groups to create frameworks for standardized data use agreements. The University of California recognizes that some data, such as patient records, are sensitive or classified and that immediate sharing and reuse of this data is not practical. Agencies can carve out specific exceptions and limited time embargoes for those cases while maintaining an overall standard of openness.
Mark Newton of Columbia University Libraries addressed the issues of power and money when it comes to control and management over the nation’s digital data:
Individual government agencies increasingly have an important role to play in encouraging the benefits of publicly available data, both by creating data-aggregating portals that provide a unified point of access to disparately archived data and by promoting and incentivizing best-practice solutions for data archiving and preservation...Given the variability of agency funding, we believe the wisest policy is to encourage the growth of existing repositories and the development of new ones that will be managed by individual academic institutions, consortia, and/or scholarly societies in partnership with government, rather than by any individual government agency alone. [In terms of funding,] costs into the granting process, while an absolutely vital first step, will not be the end of the story. In addition to setting the stage for further evaluation of the costs and benefits of different data types, agencies must pay attention to the ongoing, unanticipated costs of data stewardship, such as data migration, and create mechanisms for meeting those emergent needs that cannot be integrated into and accounted for in the existing grant funding workflows.
Complete documentation for the events held on May 14-17, 2013, including video recorded lectures, PowerPoint slides, written statements and transcripts for public commentators can be found here.
Publishers Punch Back with their Own Proposed Plan, but does CHORUS Fall Flat on Open Data?
On 4 June, The Association of American Publishers held its own open forum, similar to the one that the National Academies held for their stakeholders, and proposed its response to the OSTP memo: a new framework and website, the Clearinghouse for the Open Research of the United States (CHORUS). The publisher-focused blog “The Scholarly Kitchen,” hosted by the Society for Scholarly Publishing, posted an article and podcast that day on the benefits of CHORUS. However, criticism of the proposed CHORUS solution arrived swiftly from librarians and academics when it came to text and data mining. Library Journal reported that “CHORUS does not yet address the data or text mining portions of the [OSTP] memo. On data, Serene said ‘that’s not what CHORUS is about, CHORUS is about the publication side, though we’re certainly open to the intersection of that as it becomes clearer.’ Pentz told LJ that FundRef ‘is just for publications, but the same principle can apply to data sets. There’s definitely plans to replicate this for data as well’” and that “CHORUS plans to ‘work out the system architecture and technical specifications over the summer and have an initial proof of concept completed by August 30.’” FundRef is an “identification service” that “provides a standard way to report funding sources for published scholarly research. CrossRef facilitates FundRef by encouraging collaboration between funding bodies and scholarly publishers,” and the funders included in the pilot project include the US Department of Energy, US National Aeronautics and Space Administration (NASA), US National Science Foundation, and the Wellcome Trust. The FundRef Registry is “a taxonomy of 4000 funder names” designed to address one challenge of text mining, the “lack of standard funding sources names and metadata [which] makes it difficult to analyze or mine the text.”
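The problem that a funder-name taxonomy addresses is easy to show with a small Python sketch: the same agency appears in acknowledgments under several spellings, and a lookup table maps each variant to one canonical name so that papers can be grouped by funder. This is a toy illustration of the general idea only. The tiny registry below, the variant spellings in it, and the function `canonical_funder` are my own hypothetical examples and do not reflect the contents or interface of the actual FundRef Registry.

```python
import re
from typing import Dict, List, Optional

# Hypothetical mini-registry: canonical funder name -> example variant spellings.
# The canonical names come from the pilot funders named above; the variants
# are invented here purely for illustration.
REGISTRY: Dict[str, List[str]] = {
    "US National Science Foundation": ["NSF", "National Science Foundation"],
    "US Department of Energy": ["DOE", "Department of Energy", "U.S. Dept. of Energy"],
    "Wellcome Trust": ["The Wellcome Trust"],
}


def _key(name: str) -> str:
    """Lowercase a name and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


# Build a lookup table from every normalized spelling to its canonical name.
LOOKUP: Dict[str, str] = {}
for canonical, variants in REGISTRY.items():
    for spelling in [canonical, *variants]:
        LOOKUP[_key(spelling)] = canonical


def canonical_funder(name: str) -> Optional[str]:
    """Return the canonical funder name for a spelling, or None if unknown."""
    return LOOKUP.get(_key(name))


if __name__ == "__main__":
    for raw in ["national science foundation", "U.S. Dept of Energy", "Unknown Agency"]:
        print(repr(raw), "->", canonical_funder(raw))
```

Exact matching after normalization is the simplest possible approach; it will miss spellings that are not listed, which is one reason a large, curated registry is more useful than ad hoc cleanup by each publisher.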
Michael Eisen, co-founder of PLOS, published his response in a blog article entitled “A CHORUS of boos: publishers’ ‘solution’ to public access undermines government mandates and would invariably cost more money.” In it he pointed out “the CHORUS document makes no mention of enabling, let alone encouraging, text [and data] mining of publicly funded research papers, even though the White House clearly stated that these new policies must enable text mining as well as access to published papers. Subscription publishers have an awful track record in enabling reuse of their content, and nobody should be under any illusions that CHORUS will be any different.” Although the article is hosted by The London School of Economics, a disclaimer states that the author’s views do not necessarily reflect those of the LSE. However, by putting the question on the table before some of the world’s top economists, Eisen appears to be seeking a cost-benefit analysis comparing both approaches, particularly when it comes to data, by one or more of LSE's scholars. “The federal government already has PubMed Central–a highly functional, widely used and popular system. This system already does everything CHORUS is supposed to do, and offers seamless full-text searching (something not mentioned in the CHORUS text), as well as integration with numerous other databases at the National Library of Medicine,” Eisen pointed out. “It would not be costless to expand PMC to handle papers from other agencies, and there would be some small costs associated with handling each submitted paper. However, these costs would be trivial compared to the costs of the funding the research in question, and would produce tremendous value for the public. 
What’s more, most of these costs would be eliminated if publishers agreed to deposit their final published version of the paper directly to PMC–something most have steadfastly refused to do.” CHORUS it seems will not be able to achieve data and text mining alone, and will need to rely on a federated effort with FundRef and CrossRef, and possibly additional participants.
SHARE, a Plan Where University Libraries, not Government or Publishers, Retain Control
Another federated response to the OSTP memo, called SHared Access Research Ecosystem (SHARE), was unveiled on 7 June by a consortium of major research universities composed of the Association of American Universities (AAU), the Association of Research Libraries (ARL), and the Association of Public and Land-grant Universities (APLU). The SHARE plan is based on an interconnected system of repositories managed at the state level, in which “University-based digital repositories will become a public access and long-term preservation system for the results of federally funded research…” and will “collaborate with the Federal Government and others to host cross-institutional digital repositories of public access research publications that meet federal requirements for public availability and preservation.” The SHARE plan is designed to be executed in four phases of increasing use of metadata, using tools that are predominantly funded and developed by research libraries. In particular, phases three and four address workflows, bulk harvesting, APIs, semantic data, and linked data.
In the three plans that have been proposed, it is easy to see that the various stakeholders—government, publishers, and academia (including scholars and academic libraries)—are vying for primary control over open access publications and open data. In fact, the entire future of scientific dissemination, as far back as the journal model of disseminating analog information (along with authority and profits) dating to the seventeenth century, as Victoria Stodden emphasized, seems up for grabs. There is surely more to come as stakeholders battle it out to determine their new place in the digital, internet-based E-science publishing realm in the twenty-first century and beyond. The US, the UK, and the EU are all working on similar issues in terms of open access and open data, and different solutions or combinations of solutions may be tested and refined with the emergence of various public policies.
In the third article on open access and open data, I will evaluate new and suggested tools for libraries when it comes to making the most of the open access and open data OSTP mandates. In particular, I will look at some of the tools mentioned at the DCC and how PubMed Central might make use of added metadata from the field of health informatics.
UPDATE (4:27 PM ET, July 3, 2013): This blog post originally misspelled David Fearon's name. This has now been corrected. Apologies.