Open Data Tools: Turning Data into ‘Actionable Intelligence’
My previous two articles were on open access and open data. They conveyed major changes that are underway around the globe in the methods by which scientific and medical research findings and data sets are circulated among researchers and disseminated to the public. I showed how e-science and ‘big data’ fit into the philosophy of science through a paradigm shift involving a trilogy of approaches: deductive, empirical, and computational. As was pointed out, this shift provides a logical extension of Robert Boyle's tradition of scientific inquiry involving “skepticism, transparency, and reproducibility for independent verification” to the computational age.
There has been strong support for the belief that information should be freely available, a tradition that libraries have advocated since the days of Andrew Carnegie’s campaign to build free public libraries in the United States, Canada, the United Kingdom, and elsewhere, beginning in 1883. That belief is now being backed through legislative policy in the US, UK, and the EU mandating that scientific and medical articles and their data be freely accessible to the public when they are a result of taxpayer funding. Control over the dissemination of this scientific information seems up for grabs, which challenges the traditional model of subscription-based journals as the primary mode by which this information is circulated. Libraries, publishers, commercial entities, and government agencies are all competing with one another for unfettered access to, and publication of, these “free” materials for the potential revenue streams they will bring in. Supposedly, these revenues will come through memberships and advertising in the case of journal publishers, continued state and grant funding in the case of libraries, advertising and new product creation in the case of commercial enterprises, and job continuation or shifted positions with new duties for civil servants and government contractors through compliance with legislative mandates, Congressional requests, and Presidential directives. Competition, as usual, is also occurring within each of these sectors. Hopefully, all this disruption and competition will lead to valuable new economic growth and benefits for society. Making the published articles and credited data openly available is just the first step. How can that data be cleaned and merged with similar data, perhaps from disparate sources? How will it be visualized to make the data clearer? How will the data be stored and preserved so that it is not corrupted over time?
Video credit: Digital Curation Centre. "Managing Research Data."
Produced by Piers Video Production, (Duration: 0:12:36).
This third article on open access and open data evaluates new and suggested tools for making the most of the OSTP open access and open data mandates. Tools matter because, as the title of an article published in The Harvard Business Review’s “HBR Blog Network” puts it, “open data has little value if people can't use it.” Indeed, “the goal is for this data to become actionable intelligence: a launchpad for investigation, analysis, triangulation, and improved decision making at all levels.” Librarians and archivists have key roles to play in not only storing data, but packaging it for proper accessibility and use, including adding descriptive metadata and linking to existing tools or designing new ones for their users. In a comment following the article, the author, Craig Hammer, remarks on the importance of archivists and international standards: “Certified archivists have always been important, but their skillset is crucially in demand now, as more and more data are becoming available. Accessibility—in the knowledge management sense—must be on par with digestibility / 'data literacy' as priorities for continuing open data ecosystem development. The good news is that several governments and multilaterals (in consultation with data scientists and - yep! - certified archivists) are having continuing 'shared metadata' conversations, toward the possible development of harmonized data standards...If these folks get this right, there's a real shot of (eventual proliferation of) interoperability (i.e. a data platform from Country A can 'talk to' a data platform from Country B), which is the only way any of this will make sense at the macro level.”
From a business perspective, the management of open data in the health sciences, for example, holds the potential both to reduce losses and to increase profits. Preserving, storing, and retrieving data in a manageable fashion, then, will affect not only data consumers but also data producers. According to a National Law Review article published in late June this year, the “increased availability of health care data means more oversight and more litigation” because “data is the lifeblood of health care fraud enforcement efforts,” which affects the overall cost structure of service provision. According to a report by the U.S. Federal Bureau of Investigation, “health care fraud costs the country an estimated $80 billion a year…[so] rooting out health care fraud is central to the well-being of both our citizens and the overall economy.” Beyond cutting fraud and waste, data tools also attract investment: U.S. “digital health startups net[ted] $849M in investments in first half of 2013,” and $78M of that was attributed to analytics and ‘big data.’
Figure: What Percentage of Caregivers Conduct the Following Online Health-Related Activities?
Additionally, a report issued in June 2013 by The Pew Research Center and the California HealthCare Foundation indicated that as many as "39% of U.S. adults are caregivers and many navigate health care with the help of technology," but of those, "39% of caregivers manage medications for a loved one; few use tech to do so." Similarly, it reported, "Most caregivers say the internet is helpful to them" and "nine in ten caregivers own a cell phone and one-third have used it to gather health information." It seems, then, that a viable window is open for new open data tools in the area of internet and mobile technologies to provide caregivers with more information about medical tests, medications, and clinical trials using metadata descriptors.
Nature will launch Scientific Data in 2014.
"Scientific Data is a new open-access, online-only publication for descriptions of scientifically valuable datasets. It introduces a new type of content called the Data Descriptor, which will combine traditional narrative content with curated, structured descriptions of research data, including detailed methods and technical analyses supporting data quality. Scientific Data will initially focus on the life, biomedical and environmental science communities, but will be open to content from a wide range of scientific disciplines. Publications will be complementary to both traditional research journals and data repositories, and will be designed to foster data sharing and reuse, and ultimately to accelerate scientific discovery."
Video credit: Nature, "Scientific Data."
In addition to financial rewards, there are also financial incentives sponsored by government agencies, non-profit charities, publishers, and for-profit businesses to develop tools and to create successful commercial projects engaged in data re-use. The Obama administration is offering a “Big Data Research Initiative” backed by $200M for new projects routed through six departments and agencies: DARPA, DOE, DOD, HHS/NIH, NSF, and USGS. The deadline to reply to this year’s call for projects involving big data collaborations is September 2, 2013. Interested individuals can send proposals to the Networking and Information Technology Research and Development (NITRD) program (BigDataprojects@nitrd.gov). A detailed description of the requirements can be found on the program's website. According to a White House press release, Dr. John P. Holdren, Assistant to the President and Director of the White House’s Office of Science and Technology Policy, stated that,
“In the same way that past Federal investments in information-technology R&D led to dramatic advances in supercomputing and the creation of the Internet, the initiative we are launching...promises to transform our ability to use Big Data for scientific discovery, environmental and biomedical research, education, and national security.” This is the second year of the Big Data Initiative, and “the Administration is encouraging multiple stakeholders including federal agencies, private industry, academia, state and local government, non-profits, and foundations, to develop and participate in Big Data innovation projects across the country.”
Victoria Costello (email@example.com) of PLOS manages the ASAP awards program, which is sponsored by 27 global organizations including Google, PLOS, and the Wellcome Trust. “The ASAP Program will award three top awards of $30,000 each” in October 2013 to recognize the best in creative data “reuse, remixing, [and] repurposing—which enables countless clinical translations and subsequent discoveries based on previously published (OA) research.” Similarly, Eli Lilly will be accepting submissions until October 2, 2013, for a "Clinical Trial Revisualization Design" competition, and is giving away $75K in cash and prizes. Its goal is to encourage "designers and developers to re-imagine clinical trial information in a patient-centric way [because] clinical trial information can often be dense and difficult to digest from a patient’s perspective."
Going back to 2011, before the recent open access and open data mandates, the National Library of Medicine sponsored its own contest, "NLM Show Off Your Apps: Innovative Uses of NLM Information", which drew 35 entries using its biomedical data. Wondering if databases such as PubMed Central might make use of added metadata from the field of health informatics, and open data elsewhere, I contacted Betsy L. Humphreys (firstname.lastname@example.org), Deputy Director of the National Library of Medicine, to discuss the feasibility of adding health information metadata tags found in electronic health records (EHRs) to the records in their database, either on their side or on the entrepreneurial side of things.
According to Humphreys, my concept of “connecting EHRs with NLM databases is very sensible,” but directly adding metadata tags is not a very practical approach. Part of the problem lies in the sheer number of PubMed records, more than 22 million of them, and the fact that they are updated nightly. Another major problem is that whenever the Systematized Nomenclature of Medicine—Clinical Terms (SNOMED CT) or the Logical Observation Identifiers Names and Codes (LOINC) (used to identify lab tests and clinical observations) is updated, some of the records with those tags would need updating. Perhaps the best reason why this approach should not be implemented by the NLM, Humphreys informed me, is that within PubMed Central there is already a lot of interoperability, including a correspondence table that includes SNOMED CT using the Unified Medical Language System (UMLS) Metathesaurus, and whenever possible, records are mapped with their synonyms. But, on the entrepreneurial side of things, she said, “there are indeed many EHR vendors who use the MedlinePlus Connect API, which enables the use of SNOMED CT, LOINC, or RxNorm in search arguments, in order to integrate NLM’s data into an electronic health record through a patient portal.” The interview with Humphreys left me with hope that there are indeed ample opportunities for entrepreneurs to create new data tools or physical products through mashups that combine electronic health records, published medical research records in existing databases like MedlinePlus, PubMed, and PubMed Central, and other data.
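As a rough illustration of the kind of request Humphreys describes, the sketch below builds a MedlinePlus Connect query URL from a code system and a code. The base URL, parameter names, and code system identifiers (OIDs) are assumptions drawn from NLM's public documentation and should be verified there before use; `build_connect_url` is a hypothetical helper, and the LOINC code shown is only an example. The sketch constructs the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Assumed service endpoint; confirm against NLM's current documentation.
BASE_URL = "https://connect.medlineplus.gov/service"

# Assumed OIDs that identify each coding system in a request.
CODE_SYSTEM_OIDS = {
    "SNOMED CT": "2.16.840.1.113883.6.96",
    "LOINC": "2.16.840.1.113883.6.1",
    "RxNorm": "2.16.840.1.113883.6.88",
}


def build_connect_url(code_system, code, response_type="application/json"):
    """Return a query URL for one coded concept (hypothetical helper)."""
    params = {
        "mainSearchCriteria.v.cs": CODE_SYSTEM_OIDS[code_system],  # coding system
        "mainSearchCriteria.v.c": code,                            # the code itself
        "knowledgeResponseType": response_type,                    # desired format
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Example: a LOINC code for a glucose lab test, as an EHR portal might send.
    print(build_connect_url("LOINC", "2345-7"))
```

An EHR patient portal would typically issue a request like this when a patient opens a lab result or medication entry, then display the returned consumer health information alongside the record.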
Rating the Quality of Open Data
When working with government data, it may be helpful to keep a few key guidelines in mind. The problem is that there are many guidelines. A working group within OpenGovData.org developed "8 Principles of Open Government Data," which are: "1. Data Must Be Complete... 2. Data Must Be Primary... 3. Data Must Be Timely... 4. Data Must Be Accessible... 5. Data Must Be Machine processable... 6. Access Must Be Non-Discriminatory... 7. Data Formats Must Be Non-Proprietary... 8. Data Must Be License-free." These are very similar to the Sunlight Foundation's "Ten Principles for Opening Up Government Information": "1. Completeness... 2. Primacy... 3. Timeliness... 4. Ease of Physical and Electronic Access... 5. Machine readability... 6. Non-discrimination... 7. Use of Commonly Owned Standards... 8. Licensing... 9. Permanence... 10. Usage Costs." Open government data initiatives could also be held up to a 5-star rating method, which has been proposed by Tim Berners-Lee, the British computer scientist credited with inventing the World Wide Web:
★ Available on the web (whatever format), but with an open licence to be Open Data
★★ Available as machine-readable structured data (e.g. Excel instead of image scan of a table)
★★★ As (2), plus non-proprietary format (e.g. CSV instead of Excel)
★★★★ All the above, plus use W3C open standards (RDF and SPARQL)
★★★★★ All the above, plus link your data to other people’s data to provide context
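Because each star in the scheme above builds on the ones before it, the rating can be sketched as a simple cumulative check. This is a minimal illustration, not an official scoring tool: the `Dataset` fields and the `star_rating` function are hypothetical names, and deciding whether a real dataset meets each criterion still takes human judgment.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical yes/no answers for the five criteria, in star order."""
    on_web_with_open_licence: bool
    machine_readable: bool
    non_proprietary_format: bool
    uses_w3c_standards: bool
    linked_to_other_data: bool


def star_rating(dataset):
    """Count stars cumulatively: stop at the first criterion that is not met."""
    criteria = [
        dataset.on_web_with_open_licence,
        dataset.machine_readable,
        dataset.non_proprietary_format,
        dataset.uses_w3c_standards,
        dataset.linked_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


if __name__ == "__main__":
    # An openly licensed CSV file that does not use RDF or link to other data.
    csv_release = Dataset(True, True, True, False, False)
    print(star_rating(csv_release))  # prints 3
```

The early exit reflects the "all the above, plus" wording: a dataset that links to other data but sits in a proprietary format would still stop at two stars.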
The Open Data Institute has created an "Open Data Certificate" for data and rates it against a checklist. Certificates awarded grade data as "Raw: A great start at the basics of publishing open data; Pilot: Data users receive extra support from, and can provide feedback to the publisher; Standard: Regularly published open data with robust support that people can rely on; and Expert: An exceptional example of information infrastructure." In 2009, The White House created a scorecard by which open data can be evaluated according to 10 criteria: "high value data, data integrity, open webpage, public consultation, overall plan, formulating the plan, transparency, participation, collaboration, and flagship initiative." The U.S. government's simple stoplight-like rating system was as follows: green for data that "meets expectations," yellow for data that demonstrates "progress toward expectations," and red for data that "fails to meet expectations." At the other end of the spectrum, there is an exceptionally complex checklist offered by OPQUAST. On May 9, 2013, President Obama issued an Executive Order, "Making Open and Machine Readable the New Default for Government Information," which established a new "Open Data Policy" that is now being implemented through "Project Open Data" and rests on seven key principles: "public, accessible, described, reusable, complete, timely, and managed post-release." There does not seem to be an associated rating system, however, to evaluate how well the data complies with the principles. Finally, Nature has set up three criteria for data: firstly, "experimental rigor and technical data quality," secondly, "completeness of the description," and lastly, "integrity of the data files and repository record."
Finding Solutions with Data Analysis and Visualization
Personally, while at Cambridge, I veered from historical preservation a bit and spent some time studying the preservation of modern science data. As a librarian and archivist, knowing how to preserve scientific and medical history meant learning about some of the standard file formats in which science data is typically stored and the software tools used to create and analyze scientific data (which may also be required for reproducibility and long-term access to preserved data sets), along with some hands-on training to actually use those software tools.
I completed computer training useful in scientific computing through the UCS service, such as "Programming Concepts for Beginners," "Python: Introduction for Absolute Beginners," "Unix Intro," "Unix: Simple Shell Scripting for Scientists," "Programming Concepts-Pattern Matching," "Emacs," "mySQL," and "Condor and CamGRID" used in parallel, distributed, and grid computing. I did not have the opportunity to work with Cambridge’s COSMOS, “the world's first national cosmology supercomputer,” which was “founded in January 1997 by a consortium of leading UK cosmologists, brought together by Stephen Hawking.” But it would have been pretty exciting to use this data-intensive system.
The truth is, many scientists are routinely employed in hacking their own programs to link measuring equipment, data analysis, and visualization together, and my best guess is that they would benefit from more end-to-end integrative open source software systems development. While by no means exhaustive, I compiled a list linking to 349 subject specific tools and 123 general tools that are useful for all of this newly available open data. Certainly, some if not all of the tools in this list might have their quality rated according to the eight aforementioned standards.
More than 349 Subject Specific Open Data Tools
• Earth, geology, ecology, climate, and weather sciences tools include Microsoft’s SciScope. Metadata standards include EML (The Ecological Metadata Language) and the ISO 19115:2003 International Standard for Geographic Information. Polymaps is a tool for making dynamic, interactive maps.
• GoGeo is a UK-based site that provides a listing of 161 free software products, 50 data services, and 10 search portals relating to geographical information. A standard for this field is the CSDGM (Content Standard for Digital Geospatial Metadata).
• Berkeley compiled a good listing of molecular biology resources including databases and tools for protein and nucleotide sequencing as well as model organisms. Foldit is a game that enables players to solve real problems in protein folding through its simulation software. Some of the metadata standards in the biological fields include: the ABCD (Access to Biological Collections Data) Schema, developed by the Taxonomic Databases Working Group (TDWG) of Australia with the International Union of Biological Sciences; Darwin Core, developed in Australia by TDWG for natural history specimens and observations; Genome Metadata; MIGS/MIMS (Minimum Information About a (Meta)Genome Sequence); and FGDC. GenBank Flat File Format is a standardized format for biological data.
• In astronomy, NASA maintains the FITS file format standard and documentation. A new metadata astronomy thesaurus to provide better linking among scholarly astronomy articles, called the Unified Astronomy Thesaurus (UAT), was recently released. The World Wide Telescope (WWT) “is an application that runs in Windows that utilizes images and data stored on remote servers enabling you to explore some of the highest resolution imagery of the universe available in multiple wavelengths.” An international consortium comprised of the Spitzer Science Center, ESA/Hubble, California Academy of Sciences, IPAC/IRSA, and the University of Arizona established a metadata standard for astronomy called the Astronomy Visualization Metadata Standard.
• CERN developed ROOT, a tool for ‘big data’ analysis in physics, and CASTOR (CERN Advanced STORage Manager), which runs on Scientific Linux, for storage management of ‘big data’ physics files.
• Chemistry tools include PubChem, ChemSpider, Chemical Markup Language (CML), as well as eBank UK, a digital repository for crystallographic data.
• In medicine, several online tools allow patients to better manage their health care through a Personal Health Record (PHR) including: Microsoft HealthVault, PatientsLikeMe, getHealtZ, onpatient, WebMD Health Record, and Patient Ally. Other tools include EM data analysis and visualization of the brain like NeuroTrace. CARMEN (Code, Analysis, Repository and Modeling for e-Neuroscience) in the UK is a virtual neurophysiology laboratory. Amira was designed for data analysis and visualization in the life sciences along with MeVisLab for medical image processing and visualization. ClinicalTrials.gov Protocol Data Element Definitions and Infobuttons are useful when working with data in the health sciences. Standard tools include PubMed, PubMed Central, MedlinePlus, and GenBank.
123 General Open Data Tools (Organized by Topic)
Data Analysis and Calculation, Data Mining, Graphics Plotting, Image Processing, Data Visualization and Simulation, Reports: Avizo, Cave5D, ELKI, Excel, Feko, Fiji, ggplot2, GIMP, Gnuplot, GoogleVis, Graph, IDL, ImageJ, JMP, KaleidaGraph, MATLAB, Multiphysics, NumPy, Omniscope, OpenCV, OpenGL, OpenGL Vizserver, Open Office, OriginLab, ParaView, PCA/PLS plots, Photoshop, RGGobi, SCIDAVIS, SCILAB, SciPy, SenseWeb, Simulink, Stata, Tableplot, Tabplots, VisIt, Wolfram Alpha, Word. (39 tools).
Data Integrity, Data Recovery, Digital Forensics: AIR (Automatic Image and Restore), Autopsy, CRC-32, dc3dd, dcfldd, FITools, FTK Imager, Guymager, LOCKSS Box, MagicRescue, OSFClone, OSForensic, PhotoRec, PyFlag, SCALPEL (Source Code Analysis, Libre and Portable Library), Sleuthkit. (16 tools).
Data Management Planning and Audit Assessment: CARDIO (Collaborative Assessment of Research Data Infrastructure and Objectives), DMP Online, DRAMBORA (Digital Repository Audit Method Based On Risk Assessment), DROID (Digital Record Object Identification). (4 tools).
Instrument Control: LabVIEW. (1 tool).
Metadata Standards, Protocols, Preservation Formats, and Registries: AGLS, AGRkMS (Australian Government Recordkeeping Metadata Standard), DCAT (Data Catalog Vocabulary), DOI (Digital Object Identifier), ISO 15836 DCMI (The Dublin Core Metadata Initiative), EAC-CPF (Encoded Archival Context - Corporate Bodies, Persons, and Families), EAD (Encoded Archival Description), e-GMS (e-Government Metadata Standard) 1.0-2002, GDFR (Global Digital Format Registry), ISO 8:1977 Publishing, ISO 215:1986 Publishing, ISO 23081, ISO International Standards on Archives and Records Management, ISO IT Applications in Science, ISO Medical Science and Health Care Facilities, ISO Natural Sciences (07), JHOVE, JHOVE2, METS (Metadata Encoding and Transmission Standard), MODS (Metadata Object Description Schema), NISO Z39.87-2006 Metadata for Images in XML, OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), OData (Open Data Protocol), PREMIS (Preservation Metadata: Implementation Strategies), PRONOM, RAD (Rules for Archival Description), RDF (Resource Description Framework), RIR (Representation Information Registry Repository), SKOS (Simple Knowledge Organisation System) Core, Standard Archive Format for Europe (SAFE), UDFR (Universal Digital Format Registry), XML Formatted Data Unit (XFDU), XMP. (33 tools).
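Several of the data integrity entries listed above, CRC-32 among them, rest on the same idea: record a checksum when a file is deposited, then recompute it later to detect corruption. The sketch below shows that fixity check using Python's standard `zlib` and `hashlib` modules. The function names and the record layout are illustrative assumptions, not the format used by any of the listed tools, and a stronger digest (SHA-256) is stored alongside CRC-32 because CRC-32 catches accidental damage but is not designed to resist deliberate tampering.

```python
import hashlib
import zlib


def fixity_record(data: bytes) -> dict:
    """Compute the values an archive might store at deposit time (hypothetical layout)."""
    return {
        "size": len(data),
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),  # 8 hex digits
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def is_intact(data: bytes, record: dict) -> bool:
    """Recompute the checksums and compare them with the stored record."""
    return fixity_record(data) == record


if __name__ == "__main__":
    original = b"temperature_c,pressure_hpa\n21.4,1013.2\n"
    record = fixity_record(original)

    # A later audit of an unchanged copy passes.
    print(is_intact(original, record))

    # A copy with a single altered character fails the audit.
    damaged = original.replace(b"21.4", b"21.5")
    print(is_intact(damaged, record))
```

In practice the record would be stored separately from the data, for example in preservation metadata, so that damage to the file does not also damage the evidence used to detect it.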
Will ‘Actionable Intelligence’ Ultimately
be a Result of ‘Artificial Intelligence’?
Although my research proposal to enter the computer science graduate program at Cambridge (below) was not accepted, I think it raised some relevant points about the future applications of artificial intelligence and machine learning for integration and analysis of science data across platforms or in the cloud:
Methodological Approaches in Artificial Intelligence
for the Reuse of "Big Data" in Scientific Computing
Preservation of the science cyberinfrastructure requires an understanding of long-term and archival digital storage formats, data provenance, metadata schemas, and storage repositories essential for both reproducibility and accountability of scientific results and reuse of existing science data sets. These two factors drive mandated availability of prepublication data sets in scientific research projects funded by bodies such as the NSF, NIH, MRC, and Wellcome Trust. Rules mandating data reuse hold the potential to optimize research fiscal spending, spur unprecedented new scientific discoveries in digital repositories by finding relationships across differently funded projects, and promote greater scientific collaboration amongst geographically separated institutions. Yet participants in a 2005 symposium on digital biology cited additional problems, namely “bottlenecks” that “occur owing to our limited capacity to control quality and integrate data from myriad sources, to share data across multiple tasks and to exchange data among different people and organizations...Problems with data integration affect all data tasks, including semantic interpretation, data representation, modeling, data storage and query." Although funders require data set policies, they leave standards to individual institutions; convergence on metadata standards within and across disciplines has not occurred, though some advocate RDF for semantic interoperability. Legacy data are difficult to update to new standards, and backlogs exist: "Even if standards to facilitate data discoverability, access, and use were to be introduced in the future ... huge problems ... applying ... standards retrospectively to the huge amounts of existing data that do not conform to them [were foreseen]." Another problem is that achieving scalability requires supervised, semi-supervised, or autonomous automation.
I would like to explore the use of artificial intelligence, establishing and monitoring probabilistic relationships across legacy and new linked and/or unlinked datasets housed within distributed or cloud-based data repositories, as a more robust way to search legacy data and data with missing or minimal metadata. This approach may hold the potential for innovative breakthrough discoveries in fields such as basic biological research, clinical medical research, genomics, proteomics, climate modeling, and computational astrophysics.
1. Toronto International Data Release Workshop Authors. (9 September 2009). Nature 461, 168-170 doi:10.1038/461168a
2. National Science Foundation. (January 2011). Chapter II - proposal preparation instructions. Retrieved from http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#dmp
3. National Institutes of Health. (April 17, 2007). NIH data sharing policy. Retrieved from http://grants.nih.gov/grants/policy/data_sharing/
4. Medical Research Council. (September 2011). MRC policy on research data-sharing. Retrieved from http://www.mrc.ac.uk/Ourresearch/Ethicsresearchguidance/datasharing/Policy/index.htm
5. Wellcome Trust. (August 2010). Policy on data management and sharing. Retrieved from http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm
6. Morris, R. W., et al. (2006). Digital biology: an emerging and promising discipline. Trends in Biotechnology, doi:10.1016/j.tibtech.2005.01.005
7. Smithsonian Institution. Office of Policy Analysis. (March 2011) Sharing Smithsonian digital scientific research data from biology. Retrieved from http://www.si.edu/content/opanda/docs/Rpts2011/DataSharingFinal110328.pdf
The future of AI lies in the imagination. Fortunately, ideas and concepts can be simulated even if it is not yet possible to actualize them in the real world. In my award-winning Mars environment simulation, Curiosity AI, an embodied artificially intelligent agent could search through academic databases in various subjects like astronomy and provide answers, and robot equipment could perform in-situ analysis of data. The humanoid robot could command and operate expert AI systems and launch a swarm designed to take temperature and pressure readings in the environment. The data I used in my expert system for the Phoenix Lander were raw archived data from the NASA mission (because that was what I had available through the Mars Data Archive), but the concept was designed with the idea in mind that AI would enable the processing and calculation of real-time in-situ mission data directly from instruments on Mars.
The amount of raw scientific and medical data is swelling beyond what can be viewed and interpreted by individuals without the intermediary aid of machines filtering and computing that data deluge for us. As such, the paradigm shift introducing the computational approach mentioned at the beginning of this article may become Kuhnian, sweeping away and gaining dominance over the two older scientific approaches.
I can only conclude this article on open data tools by saying that attempting to be knowledgeable about many things usually results in becoming an expert at nothing. Find the tools that work best for you and stick with them; however, every now and again it is good to see what new tools might be out there, and now is that time.
If you enjoyed this three-part series and would like more information, NISO will be conducting two webinars later this year on “Research Data Curation”: Part 1: E-Science Librarianship (September 11, 2013) and Part 2: Libraries and Big Data (September 18, 2013).
Disclaimer: I own a startup company (not mentioned here) related to computer technology, primarily engaged in library and archives consulting, computer technology R&D, 3-D modeling and simulation, and artificial intelligence research.