A Four Part Series on Open Notebook Science (Part 3)

16 January 2014 by Shannon Bohle, posted in Interviews, Open Access, Uncategorized

The second article in this series examined laboratory notebooks and the law, particularly what some open access advocates might call a "missed opportunity" in 2013 by the Office of Science and Technology Policy (OSTP) to mandate open access to laboratory notebooks along with articles and their associated data sets that result from federally-funded (that is to say, taxpayer-funded) research. In this, the third of four articles in the series, I conduct an infrastructure model debate with Associate Professor of Chemistry and E-Learning Coordinator for the College of Arts and Sciences at Drexel University, Dr. Jean-Claude Bradley, looking at different models for managing open notebook science, and ask the question, "Should participation in open notebook science be voluntary or mandated?"

A Controversial Debate

First of all, the concept of open notebook science is a controversial one. Scientists, after all, are people too. Many scientists have very human concerns about making their laboratory notebooks public. They are secretive because they do not want to be scooped when it comes to publishing. They are afraid of facing embarrassment should a mistake be found in their work. Some may even be possessive and "wrongly" consider laboratory notebooks to be their "private property."

Scientists, like most people, also do not like being told what to do. Many scientists hold doctoral degrees and view their positions in the academy as protected by "academic freedom." Like their fellow academics, they are generally resistant to being forced to comply with external standards, including the idea of being mandated to do anything with their research results. Take, for example, the backlash against the new OSTP open access mandates led by Cary Nelson, the former national president of the American Association of University Professors (2006-2012). In a November 2013 article published in Inside Higher Ed he wrote:

Cary Nelson (2007), speaking at Yale as president of the American Association of University Professors. Image credit: Sage Ross / Wikimedia Commons.

The bottom line is that universities should move forward with increased gold and green publishing opportunities, not with mandates, prohibitions, and penalties ... The national American Association of University Professors (AAUP) stands firmly behind the principle that academic freedom guarantees faculty members the right not only to decide what research they want to do and how to do it but also the right to decide how the fruits of their research will be disseminated. Academic freedom does not terminate at the moment when you create a publishable book or essay.

Given these very human reactions, the voluntary model is the most likely to gain the support of the majority of scientists. But there are concerns that, if a voluntary model is selected for open notebook science, scientists (other than the very few who already participate) will voice support for it precisely so that they have an easy excuse to avoid participating altogether: the model is voluntary, and they do not wish to volunteer their notebooks. If a voluntary model really worked, there would already be many scientists participating, right? Well, maybe, but the issue is a bit more complicated.

A Necessary Debate

This debate is really needed, I believe, because right now the open notebooks movement appears to be stalled for several reasons. Firstly, participation is low, both because of the fears I mentioned above and because scientists lack incentives to contribute their notebooks. Secondly, there is a lack of funding for tool creation and for supported, sustainable software. Lastly, a scalable digital infrastructure capable of supporting national and international open notebook science collaboration and communication is not yet available.

There are some very good specialized data repositories in the sciences, but most of them do not yet incorporate laboratory notebooks. (One notable exception is OpenWetWare.) In the future, though, some of them might. Take, for example, Digital Science, which is teaming up with Figshare on its new tool, Projects. According to Julia Giddings of Digital Science:

The ‘Notes’ feature is just simple text entry that is linked to and kept with the files and folders instead of separately. This simple text entry for files/folders with the Notes feature, along with the Timeline which tracks when you’ve added notes to the files, gives you similar functionality to an ELN. However if the question is whether Projects works with other ELNs - then it isn't linked with any cloud-based ELNs specifically at the moment, but is something we might look at in the future.

Such commercially developed tools might indeed help facilitate a voluntary model of participation. One argument to consider, however, is that the mandates are helping to drive commercial development of these repositories and their analytical tools. Industry experts know there will be a steady supply of material available and so they are gearing up to use those raw materials for commercial purposes which will have added benefits to the scientific community as a whole.

"Every year the amount of research data being generated increases by 30%, and yet worryingly a massive 80% of scientific data is lost within two decades. If data continues to be managed poorly then science will ultimately suffer. It is time to start practicing safe science and protect your data. To help, Digital Science created Projects, a simple desktop app that lets you safely manage your research data." Infographic description and credit: Projects (@projects).

Looking at the concept of open notebook science in detail is important because any policies adopted for managing open notebooks could have a global impact on the way laboratory notebooks are kept.

There will certainly also be financial impacts, and the anticipated costs of these changes would need to be covered. However, in the first article of this series I discussed the costs involved if a change toward an open notebooks policy is not implemented. For example, I pointed out how one agency alone, NASA, lost $193.1 million USD due to a simple mistake: a unit conversion error that ultimately resulted in mission failure. Open science can help to prevent simple mistakes through better data analysis and caution in methodology by opening up laboratory notebooks to scrutiny, even if the notebooks are at first available only within an organization and later released to the public. This is a logical, proactive step to prevent the careless loss of millions in U.S. taxpayer dollars in applied sciences and engineering mistakes.

Not every positive or negative consequence of implementing open notebook science broadly and globally can be anticipated. As with any new system, adjustments and transitions are not always easy or inexpensive, and accommodations will need to be made to handle unforeseen troubles as they arise. A plan for helping to pay for better ways to deal with science data and notebooks should be discussed in science policy forums.

A Friendly Debate

Let me begin by saying that Bradley and I both agree that open notebook science is a good idea. In the following debate, Bradley will argue for a voluntary, distributed model while I will argue for a mandated, centralized model. The benefits and drawbacks of each approach to open notebook science are summarized, and each argument aims to be well developed. As the author of the article, however, I acknowledge that despite my efforts to present a balanced argument, it might be a bit slanted to favor my perspective. For those who might like to listen, our 1 hour 48 minute sound recording of this interview/debate is available online.

If any two people are well-suited to debate the theoretical and philosophical concerns, the infrastructure models, and hands-on approaches regarding the management of open laboratory notebooks it is indeed Bradley and I. Both of us were early proponents of open access laboratory notebooks beginning around seven years ago, but we come at the topic with very different viewpoints due to our differing professional objectives and methodologies.

Jean-Claude Bradley Argues for a Voluntary, Distributed Model for ONS Infrastructure Management

Jean-Claude Bradley and his open notebook badge system.
Image credits: ONSClaims, Jean-Claude Bradley (Slideshare), and HubZero.

Bradley is the scientist who coined the term "open notebook science" in a correspondence to Nature Proceedings in June 2007 and described its value from a scientist’s point-of-view.

Bradley came to open notebook science after a turn of irony. He himself had been designing software specifically to prevent the sharing of laboratory notebook information. One of his patents was for creating a knowledge management software program and method for protecting proprietary information designed to keep people from accessing information, like laboratory notebooks. It turns out he did not like what he was doing and so it became his primary motivation for going about the process in an entirely different way.

Conceptually, Bradley’s idea of open notebook science is a decentralized, bottom-up approach that takes advantage of collaborative tools on the internet. It is about real-time sharing of, and collaboration on, notebooks along with all of the raw data associated with those notebooks. Sharing includes what was done in the experiment as well as the raw data that were obtained. The goal, from his perspective, is to give readers the ability to agree or disagree with an interpretation, with all of the data available to do so. Bradley’s UsefulChem project, described in detail in a Wiley publication, managed open notebooks in a blog format, but it was determined that tracking different versions would be better managed through a wiki solution. Today all of the open notebooks in his lab use wikis, specifically WikiSpaces, in addition to machine-readable information.

Bradley is no stranger when it comes to filing for patents. He "has published articles and obtained patents in the areas of synthetic and mechanistic chemistry, gene therapy, nanotechnology and scientific knowledge management." After years of attempting to provisionally patent every paper (essentially everything coming out of his lab), Bradley realized that most of this research was not going to be patented. He stated that in the past, when he filed provisional patents, some of his notebook pages were referenced in the applications, but neither scans of pages nor entire notebooks were required to be submitted. This meant that his institution(s) indeed filed for many patents (or at least provisional patents) and paid the fees. Had submission of notebooks been required, I would argue, he probably would have submitted them rather than forgoing the opportunity to patent. Bradley noted that any USPTO requirement to submit laboratory notebooks would be complicated by the fact that the definition of what constitutes a scientific notebook differs across scientific fields: it may not just be a record of what a scientist did in the lab, but should include all of the associated data, like recordings from all of the scientific instruments, which are of equal, if not greater, importance. Even the question of defining what constitutes “raw data” is subjective and dependent upon the researcher’s judgment and varies from field to field, Bradley said. “There is an enormous difference” between “published data sets versus all data sets,” he emphasized. “Very little of the information generated in a particular lab is ever going to make it as supporting information in a paper.”

Patented or not, Bradley believes his laboratory notebooks hold value, and he has been able to show that some data from his published notebooks have been used by others. In fact, Bradley’s work is a case study in which open notebook science data are being cited and used by educators and students. Bradley said he is receiving about 700 queries a day. In particular, he noted, high school educators and their students are looking for solubility and melting point data. Earlier I cited scientists who undervalued their own notebooks, saying that those of Nobel laureates should be used instead for case studies. This just goes to show that people are indeed interested in re-using open notebook data generated by the average scientist. So too did early critics of my comments on the value of laboratory notebooks of the average scientist argue that students could instead turn to textbooks or other traditional sources for their information. However, Bradley’s case study shows that, from a scientific point-of-view, students can see how melting points of compounds are indeed variable and not fixed, as a labeled table in a textbook or a top result manually inserted by a computer science engineer at Google or Bing might suggest. (And, yes, I knew someone whose job was doing just that.)

As Co-Editor-in-Chief at Chemistry Central Journal, Bradley also has concerns about scientific data and reproducibility issues. He said that the standards for what and how much information journals require vary widely by field. “The reality is,” he admonished, “that the editors have been pretty lenient about how much information should be required before it is accepted.” Essentially, a lack of standards and established benchmarks within and across disciplines is a problem. In the next section, I will argue that a mandated, centralized model can help to fix that problem.

Shannon Bohle Argues for a Mandatory, Centralized Model for ONS Infrastructure Management

Jim Watson in the library at Cold Spring Harbor Laboratory, December 2006.

Since, unlike Bradley, I am not a scientist, I would like to begin by providing some details about my background and professional experience on the subject of laboratory notebooks. I publicly presented the value of public laboratory notebooks at the largest history of science conference, the Three Societies conference, held at the University of Oxford in July 2008, speaking from an information professional and historian’s point-of-view. My motivation derived from working to organize, preserve, and help make available online a collection of laboratory notebooks belonging to a Nobel laureate, James D. Watson. I was filled with wonder and simple amazement at using laboratory notebooks to see "behind the scenes," so to speak, and observe a scientist in action. The experience left me wanting to see more notebooks like this and then to be able to collate that information together in one place and learn from that collection of material. In 2007, the same year that Bradley coined the term “open notebook science,” two scientists, Nobel laureates in Physiology or Medicine, Sydney Brenner and Rich Roberts, wrote a correspondence to Nature with a plea to scientists (pointed out again in Part 1) to “Save your notes, drafts, and printouts,” arguing, “science is one of the greatest cultural achievements of humankind. And yet...there is little systematic preservation of the workings of scientists” (S. Brenner, R. Roberts, Nature 446, 725 (2007)). By linking notebooks, data sets, papers, manuals, and museum images and descriptions online, a fuller picture of the history of science can be achieved, I had argued. Laboratory notebooks also record details about scientific data and may be linked to or contain data summaries.

Next, I would like to provide some specific critiques of Bradley's position. There are both benefits and drawbacks when it comes to a voluntary system. Generally, despite what some surveys might suggest, there is an unspoken culture of fear and overwork that prevents data sharing, even within institutions, between groups within departments, or among researchers inside the same group, such that a voluntary participatory model would require a fundamental shift in the scientific culture and the behavior of scientists. This seems unlikely to occur unless regulations require it. Evidence of this is clear when looking at underutilized institutional article and data repositories, or when talking with managers who have to “pull teeth” to get contributions or who establish embedded librarians who, like secret agents, delve into the scientists’ environment, blend in, and try to convince faculty to cooperate so they can get at relevant papers and data. (Okay, well maybe that was an exaggeration, but you get the idea.) Librarians and institutional repository managers are struggling to get things deposited. If people are asked on a survey, “If you won a million dollars, would you give it to charity?” they might present their best selves and say “yes.” However, that does not mean they would do it in practice. Similarly, if scientists are asked in a survey, “Would you voluntarily donate your data to a repository?” they might say “yes,” but in practice that is not happening, even when repositories are available and librarians are asking for their contributions.

Part of the problem with a voluntary system is that it is not going to involve every organization around the world that is conducting science, or even all members within a single organization. It will just be a few people here and there. Unfortunately, that is not comprehensive enough to really make a difference. A mandated method for open notebooks would very likely affect the way laboratory notebooks are kept within that institution and around the nation, even perhaps around the world, and thus have a greater impact. Implementing mandates would entail all sorts of secondary consequences that would need to be addressed and discussed. Some of these, such as the teaching of personal archiving skills and the creation of data management plans, are complex enough that they lie beyond the scope of this paper other than a brief mention. In addition, of course, there is the problem of funding all these changes in a long, global economic recession and placing the burden of cost in the right places to make it happen.

Despite the fact that Bradley personally opposes mandates, some of his reasoning actually supports them. In an 18 April 2013 article in Chemistry World, he wrote, “Optimally, trust should have no place in science.” Of course, there is some truth to this, but transparency builds trust, and maintaining public trust in science should be of paramount importance to scientists. Similarly, in the book chapter where he was lead author, “Collaborating Using Open Notebook Science in Academia,” he writes, “the ability to share only translates into actual sharing if there is a motivation to do so” (426). If researchers “choose to replicate an experiment, then they can do so with the prior knowledge of what happened in all previous attempts” (427).

“Cherry picking” of data is misleading through exclusion, and it is a problem in both the closed and open models of laboratory notebook keeping. It is a problem, it seems, that is hard to avoid. When the sharing of notebooks is voluntary, Bradley noted, researchers tend to “cherry pick” what they want to share. Bradley opposes both mandated sharing and cherry picking. But mandated sharing and regulatory standards would formally reject “cherry picking” data as bad practice, and under a mandated system these standards might be enforced, which offers some hope of preventing it. The major downside, as Bradley pointed out, is that once scientists know their notebooks will be scrutinized, they will be more selective about what they include in them and perhaps withhold information or be more tempted to falsify it. In the worst-case scenario, Bradley suggested, mandating open notebooks might begin a fraudulent “double bookkeeping system.” As long as these things do not happen, however, problems would be more easily detected and, ideally, corrected. In addition to audits, I am betting there are also ways to design a computer-based laboratory notebook system to prevent this type of fraud.

Scientists and Their Laboratory Notebooks (1912). Image credit: Wikimedia Commons. Source: Harris & Ewing photograph via the Library of Congress website.

One way to ensure lab notebook fraud does not occur is by having the manufacturers of electronic laboratory notebooks (ELNs) build preventative measures into their systems. Rules already exist for regulating the keeping of paper notebooks for research submitted for patents. For example, the pages must be consecutively numbered, pages cannot be ripped out, et cetera. Right now, however, there are no regulations on how to keep notebooks if a patent is not the goal, Bradley countered. Additionally, the expense and complexity of actually having people post their laboratory notebooks would be a secondary consequence. Convincing a scientist to share information when there are associated costs is an uphill battle. Open source software (or at least free software), therefore, is essential to a voluntary system where data is open, argued Bradley. I countered that in a mandated system, costs could be absorbed by the institution rather than by the individual researcher’s personal funds and time or his or her efforts to obtain grants, as they would be in a voluntary system (and as Bradley is having to do personally). As long as there were industry standards ensuring interoperability, commercial products could be purchased and used, and the choice would not be limited to open source. Under a mandated system, institutions that have scientists working for them would be obligated to provide the infrastructure and equipment for their faculty at the expense of the institution and not the researcher’s personal expense, I suggested. In this way, a mandated system could bring everyone on board into the era of massively widespread use of ELNs around the same time, by encouraging everyone to make the switch from paper to digital.
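As a rough illustration of what a built-in preventative measure could look like, the paper-notebook rules mentioned above (consecutively numbered pages, no pages removed) have a loose digital analogue in a hash-chained, append-only log. The sketch below is an assumption added for illustration only; it is not a design proposed in the debate or the API of any real ELN product, and the `Notebook` class and its methods are hypothetical names.

```python
import hashlib
import json


def entry_hash(page: int, text: str, prev_hash: str) -> str:
    """Hash one notebook entry together with the hash of the entry before it."""
    payload = json.dumps({"page": page, "text": text, "prev": prev_hash},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Notebook:
    """Hypothetical append-only notebook (illustrative, not a real ELN API)."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first page

    def __init__(self):
        self.entries = []

    def append(self, text: str) -> dict:
        page = len(self.entries) + 1  # pages are numbered consecutively
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry = {"page": page, "text": text, "prev": prev,
                 "hash": entry_hash(page, text, prev)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Return False if a page was edited, reordered, or removed mid-chain."""
        prev = self.GENESIS
        for expected_page, entry in enumerate(self.entries, start=1):
            if entry["page"] != expected_page or entry["prev"] != prev:
                return False
            if entry["hash"] != entry_hash(entry["page"], entry["text"], prev):
                return False
            prev = entry["hash"]
        return True
```

A check like this only detects after-the-fact edits to stored pages. It would not notice pages trimmed from the end of the chain or a chain rebuilt from scratch unless the latest hash were also published somewhere outside the author's control, and it could do nothing about a separate, parallel notebook of the "double bookkeeping" kind Bradley warned about.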

Now I would like to offer some thoughts as to the benefits of my position. Here is why I believe mandating is better for institutions as well as scientists. As I mentioned, a mandated system would benefit scientists by obligating institutions, either in the form of salary or grant funds, to pay for purchasing and implementing ELNs, as well as for the time that scientists and their graduate students spend entering or scanning information and writing any associated computer programs. Bradley argued that, in some disciplines, using a paper notebook allows a great deal of leniency, whereas with ELNs certain form fields can be made mandatory in order to control data input consistency and keep users from circumventing the system. Commercial ELNs could certainly be designed according to the most stringent requirements, and thereby they could apply the same standards (though not necessarily the exact form fields) in a cross-disciplinary fashion. Arriving at these standards has its own problems, but the responsibility for implementing them in the software design would ultimately fall upon ELN vendors, not scientists, though scientists certainly should be able to have a say in standards development if they choose to. I also noted that with advancements in speech-to-text technologies, inputting much of the notebook information orally using handheld devices in the lab could simplify the process, as could automated uploading of associated raw data from scientific equipment. Simultaneously, ELN vendors would need to build easy-to-use data sharing mechanisms into their software, bridging their software to embrace the new open science philosophy and the new open data (and potentially open notebook) mandates. This change could force all researchers, but particularly graduate students, to bring their notebooks up to a higher quality standard. The automated nature would also help prevent fraud.
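The idea that an ELN can require certain entries before a page is accepted can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `ExperimentEntry` form, its field names, and the choice of which fields are required are all hypothetical and do not come from any standard or vendor product.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ExperimentEntry:
    """Hypothetical ELN entry form; every field name here is an assumption."""
    researcher: str
    performed_on: Optional[date]
    procedure: str
    raw_data_uri: str
    observations: str = ""  # optional free-text field


# Fields a (hypothetical) standard might require in every discipline.
REQUIRED = ("researcher", "performed_on", "procedure", "raw_data_uri")


def missing_fields(entry: ExperimentEntry) -> List[str]:
    """Return the names of required fields that were left empty."""
    missing = []
    for name in REQUIRED:
        value = getattr(entry, name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


def accept(entry: ExperimentEntry) -> bool:
    """An entry is accepted only when no required field is missing."""
    return not missing_fields(entry)
```

A discipline-specific form could extend the same required core with its own optional fields, which is one way to apply common standards without forcing identical forms on every field of science.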

Finally, I will make my case outright. As a library scientist, I believe that centralization as opposed to decentralization is needed, whether that centralization is accomplished through physical proximity (think books on a shelf or data on a single central server) or intellectual arrangement (consider the library catalog or URL links). In a voluntary participatory situation it becomes very difficult to centralize notebooks or even to track the number of software downloads to know how many scientists are actually participating in open notebook science. Project creators would need to rely on Google to locate data since it is distributed, and it is likely that, like Bradley, they will be developing tools without institutional funding. Bradley himself described his struggles to find out, via IP addresses and search queries, who exactly is participating. Distributed systems also make it very difficult for Principal Investigators (PIs), project directors, and software developers to prove the re-use of data and to gather site visitor statistics. Essentially, freely distributed software makes it difficult to establish sufficient metrics on software downloads. Open source software can be copied and put on other websites, which is great for distribution, but in terms of metrics, tracking the number of users of the tool and of adopters of open notebook science within and across the scientific and medical professions becomes a near impossibility.

As a librarian, I also believe that the principles of information management and library science are necessary to develop information architectures. Considering that Bradley espouses a very non-librarian-like model for the management of laboratory notebooks, I found it somewhat ironic that he was invited to speak on 12 November 2013 about “open notebooks” in the session called “Open Science Tools” hosted by the Federal Library and Information Network (FEDLINK) and held at the Library of Congress.

From a librarian’s perspective, unregulated, freely distributed materials rely too heavily on third party search engine algorithms to turn up relevant examples. Sometimes there may be “tricks” to finding the notebooks, but these “tricks” might be known only to a small few or are simply more complex than necessary for undergraduates or cross-disciplinary researchers. For example, Bradley’s project used ChemSpider to look up organic chemistry compound numbers to locate related laboratory notebooks. Then there is the problem of the variety of user-generated search terms, which could cause important studies to be omitted. Linking published data and articles with their associated laboratory notebook records also becomes more difficult. Another benefit of centralization is that centrally stored data can be extracted, modeled and visualized more easily than distributed data. This is why big companies like Wolfram Alpha copy and store open source databases (like the Protein Data Bank) locally. Graphical reports can provide overviews of findings by feature, method, or result for each search term. A wiki, like the one Bradley uses, does provide a wonderful collaborative environment enabling updates in a fluid manner and the means to track revisions, and it theoretically could provide researchers with controlled vocabularies and subject headings that can accommodate a variety of fields. However, it is a highly untraditional format for data management. Free structured databases (built, for example, on SQL-based systems such as Oracle Database Express Edition, often with a PHP front end) usually contain structured controlled vocabularies as well as subject terms, and this is simply the best “tried and true” solution used throughout the IT industry.
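To make the point about user-generated search terms concrete, here is a minimal sketch of how a controlled vocabulary maps variant terms onto one preferred subject heading, so that a search for any variant finds the same records. The vocabulary entries and function names are illustrative assumptions, not terms drawn from an actual thesaurus or from Bradley's wiki.

```python
# Hypothetical controlled vocabulary: preferred heading -> variant terms.
VOCABULARY = {
    "melting point": {"mp", "m.p.", "melting temperature"},
    "solubility": {"solubilities"},
}


def build_lookup(vocabulary):
    """Invert the vocabulary so every variant points at its preferred heading."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


LOOKUP = build_lookup(VOCABULARY)


def normalize(term: str) -> str:
    """Map a user-supplied term to its preferred heading, if one is known."""
    key = " ".join(term.lower().split())  # lowercase and collapse whitespace
    return LOOKUP.get(key, key)


def index_pages(pages):
    """Group notebook page ids under normalized subject headings.

    `pages` is an iterable of (page_id, user_supplied_term) pairs.
    """
    index = {}
    for page_id, term in pages:
        index.setdefault(normalize(term), []).append(page_id)
    return index
```

With this in place, pages tagged "M.P." and "melting temperature" would be filed under the single heading "melting point", while an unrecognized term is kept as entered rather than discarded.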

Dr. Bradley suggests a voluntary, distributed, and non-standardized method for making notebooks openly available using free software, which he claims is more enticing and may therefore gain voluntary researcher participation. Even so, it is ultimately less valuable to researchers and other end users than the mandated system I propose. While ease of use is, of course, important, it is equally or more important to have controlled vocabularies, metadata standards, metrics and feedback mechanisms. It is logical to design a system with built-in controls where monitoring, assessment, and feedback are sufficient to write papers on the tool’s performance as well as to gain or retain grant funding for the creation and expansion of such tools. Regulating where the software can be downloaded, and offering it as a single installation package, aids in metrics collection to show how many researchers are using your tools and where they are from (such as country, city, and institution metrics gleaned from IP addresses), and simplifies installation for the user. Pop-up site surveys can be annoying, but they are useful, and some people do take the time to complete them knowing they are helping to improve the services that ultimately benefit them. One aspect I did like was Bradley’s system of logos for display on websites indicating the project’s level of data sharing in terms of their policies (real-time sharing of all data, real-time sharing of some data, embargoed/delayed sharing of all content, etc.), similar to the display of Creative Commons logos.

Example of a metadata interoperability chart featuring archival standards like METS, MODS, RDF, MARC, OAI-PMH, and Dublin Core. Image credit: Primaver, from Wikimedia Commons.

The ultimate proof that the voluntary system does not work well is the fact that most notebooks are not presently available to the public despite a number of years of advocacy for voluntary participation. To me, mandates seem the only way to ensure that governmentally-funded scientific documents will be properly preserved. It seems, too, that mandates will be needed for widespread public access to laboratory notebooks. With future improvements in automation and machine learning, ideally everyone would eventually rise to the same standards, despite the differences among disciplines in practice and in how much data are generated, especially in the astronomical and biological sciences, where there are big data.

Conclusion: Experts Disagree and the Future is Uncertain

As I mentioned, this is a current debate. The fact is, people are uncertain where the chips should fall. Many are still having difficulty grappling with the existing open access mandates for federally funded research, and there is some backlash going on. So, I would like to close the article with the opinion of a third party. This article is not an attempt at a literature review. However, I think it worthwhile to mention that one of the leading voices on open access science and open data in the library and information science community has been Christine L. Borgman, Professor and Presidential Chair in Information Studies at UCLA. Borgman said (in a personal communication dated 30 November 2013) that she herself has not yet formulated a position on open access mandates. She noted, however, "I certainly have not made any explicit recommendations against OA mandates."

Like most academics, Borgman is sensitive to the subtleties and rippling implications of the open access movement. Overall Borgman said she takes "a nuanced perspective, explaining the difficulties of taking data out of context, the vast diversity of data and data practices across fields, and arguing against a one size fits all approach to open data." For example, she emphasized, "What works for genomics does not work well for ethnography." Generally, Borgman argues that “open access to publications and to data are very different practices for scholarship; open data is not simply an extension of OA pub[lication]s."

Her views on open data are best illustrated in the following four publications: “The conundrum of sharing research data 2011 DRAFT,” Journal of the American Society for Information Science and Technology (2011); "If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology,” PLOS One (2013); “Who’s got the data? Interdependencies in Science and Technology Collaborations” (with Jillian C. Wallis and Matthew S. Mayernik), Journal of Computer Supported Cooperative Work (2012); and her new book from MIT Press, Big Data, Little Data, No Data: Scholarship in the Networked World (2013).

Given that experts are disagreeing on whether open notebook science should be mandated in cases of federally-funded research, at the end of this series of articles, I am going to host a poll and link to an online petition about this issue so that readers may participate by expressing their views.

Disclaimer: The author received a financial incentive from Digital Science for including their infographic in this article.
