Can Big Data Save the Planet?
By Charles Primm
Late in the evening of April 20, 2010, an explosion rocked the Deepwater Horizon drilling rig pumping crude oil from beneath the Gulf of Mexico. The platform burst into flames and sank two days later, spilling thousands of barrels of oil into the water with each passing day.
Soon it became clear that conventional methods of stanching the oil flow were failing. The growing slick presented a huge risk to hundreds of miles of shoreline and multitudes of wildlife. As the potential for disaster increased, a group of environmental researchers took action.
They combined bird observation data compiled by the Cornell Lab of Ornithology with data on Gulf wind and water circulation patterns from the National Oceanic and Atmospheric Administration (NOAA). The resulting data set allowed researchers to identify bird populations that were likely to be affected by the spreading oil and take proactive steps to mitigate the negative impact on those populations.
The tools that facilitated such quick action were developed by the Data Observation Network for Earth (DataONE) project, a massive effort to enable new science and knowledge creation by expanding access to data about life on Earth and the environment that sustains it.
DataONE is a five-year, $20 million National Science Foundation program dedicated to developing the cyber-infrastructure linking research data collected by environmental scientists to libraries and laboratories around the world, while ensuring that data can be used effectively.
In 2009, UT’s College of Communication and Information received a $3.2 million grant—the largest single award in the college’s history—for its portion of the project.
The researchers from UT and Oak Ridge National Laboratory (ORNL) who are integral to making it all happen are School of Information Sciences professors Carol Tenopir, Suzie Allard, and Kimberly Douglass, and post-doctoral researcher Miriam Davis; Bruce Wilson, who holds a joint appointment with UT and ORNL; Maribeth Manoff and Eleanor Read of UT Libraries; and UT research associates Robert Waltz and Mike Frame (who is also a USGS researcher). DataONE’s principal investigator is William Michener of the University of New Mexico.
July 2012 marked the full public release of the initial technology, and things are looking pretty good, according to Wilson.
“So far, we’ve done a lot with how to integrate our efforts across policy, technology, and sociocultural factors in order to understand how to improve the availability and reuse of data sets,” Wilson says. “We’re helping to determine what the world needs in terms of data management, and how we can help scientists to do a better job of how to use, share, and reuse their data.”
Building the Backbone
The DataONE cyber-infrastructure comprises three distinct parts: member nodes, coordinating nodes, and the investigator toolkit.
Member nodes are computing centers that store data generated by earth science researchers, government agencies, and even citizens involved in environmental studies. These data sets can include field notes about animal populations or measurements of temperatures, acidity levels in streams, carbon dioxide levels, or bacteria counts.
UT’s Stokely Management Center houses a member node that is also linked with Trace, the UT Libraries’ digital archive that preserves works by faculty, departments, programs, research centers, and institutes.
Coordinating nodes are regional networks of computers that connect all of the member nodes to each other and allow data in each member node to be indexed, searched, and replicated. Tenopir explains that replication is important for preserving the data into the future.
“Library and information science professionals know that good preservation requires lots of copies,” Tenopir says. “The coordinating nodes will duplicate data from member nodes and help reroute visitors in order to smooth the flow of Internet traffic through the system.”
UT and ORNL jointly host a coordinating node. Similar nodes are located at the University of New Mexico and the University of California, Santa Barbara.
The investigator toolkit gives scientists a way to access the data without having to learn a whole new system, so they can focus on doing and sharing science, rather than dealing with arcane computer incantations.
Environmental science touches many kinds of subjects, such as animals, plants, water, soil, and health effects on humans, including socio-psychological effects. “Because we are building on what exists now, and enabling new data as we go along, the toolkit becomes very important to this process,” Tenopir says.
Engaging the Community
As the infrastructure is being built, DataONE team members also are working on public outreach and engagement efforts.
Suzie Allard and Kimberly Douglass are heading up the project’s sociocultural working group. They are examining the role of DataONE in the scientific community and seeking ways to increase adoption of its investigator toolkit. Each working group is composed of a small team of researchers from around the world who collaborate online and then meet twice a year.
“We’ve been building profiles of potential users, what we call ‘personas,’” Allard says. The personas include educator, academic librarian, and field scientist.
Once these archetypal descriptions are fleshed out with predictive models of the services and functionality each type of user is likely to need, the information is plugged back into the planning process for the investigator toolkit and for the functionality of each member node’s online interface.
Another UT-based working group is the usability and assessment group, headed by Tenopir and Frame.
“We have the charge of making sure the system and the materials meet the needs of the stakeholders,” Tenopir says. The group is conducting baseline assessments to find out how Earth and environmental researchers, librarians, data managers, and publishers share and use data now and how they are planning for the future. This information is then passed on to the cyber-infrastructure design team.
These assessments also help the working group recommend how to deal with sensitive data and manage embargoed releases of research findings and their associated data sets.
In addition, the assessments improve education, Tenopir says. “We increase our engagement with the scientific community when we are able to teach them the best ways of creating metadata as they collect their data. Good metadata, which classifies and describes the data in question, is critical to helping others discover these data sets and use them in novel ways, so it’s an important part of our work.”
Bob Cook, the ORNL lead for the DataONE project, says the response to the Deepwater Horizon oil spill is a prime example of how the DataONE tools can benefit the environment. Scientists were able to help direct beach protection and cleanup efforts to the sites with the greatest concentrations of vulnerable birds and the most important habitats in the Gulf.
“Once researchers see they can combine and integrate data sets in these new ways, they really start exploring the data in ways that have never been done before,” Cook says. “It improves science and helps us manage our valuable natural resources.”