FROM: NATIONAL SCIENCE FOUNDATION
Developing infrastructure for data sharing around the world
NSF-supported organization coordinates US participation in global data-sharing and infrastructure-building effort
How can we support agricultural productivity around the world? How can we develop public health models that leverage social data, health data and environmental data? What are best practices to ensure the stewardship of research data today and tomorrow?
Solutions to these and other critical challenges are being advanced through the sharing and exchange of research data. To increase data sharing and overcome the critical challenges associated with making data accessible, an international group of leaders in the data community joined together in 2013 to form the Research Data Alliance (RDA).
With support from the U.S. National Science Foundation (NSF), the European Commission and the Australian government, RDA has grown in just two years from a core group of committed agencies to a community that now comprises more than 2,600 members from more than 90 countries, all dedicated to pragmatically removing the barriers to data sharing and raising awareness of those challenges among regions, disciplines, and professions.
NSF supports U.S. participation in RDA as part of a grant to promote coordination and develop infrastructure for data sharing.
Twice a year, RDA members meet face to face at plenary meetings held in various locations worldwide to coordinate activities and advance their efforts. At these meetings, researchers, policymakers and representatives from funding agencies speak on pressing data issues, and members meet to collaborate on projects through interest and working groups.
RDA held its 5th Plenary, which ran through March 11 in San Diego and was hosted by the U.S. members of RDA. The event featured an "Adoption Day" focusing on the use of new RDA-developed products and guidelines across various domains.
"RDA continues to experience tremendous growth in response to global interest," said Bob Chadduck, a program officer at the National Science Foundation. "RDA-developed tools will have a tremendous impact throughout science, and the plenary provides a place where interested communities from around our world get an opportunity to test-drive the tools."
RDA members collaborate to develop, coordinate and adopt data sharing infrastructure, addressing a broad spectrum of challenges. RDA's working and interest groups design and implement specific tools, recommendations or products within a 12 to 18 month time frame, and these products are adopted and used by other organizations and communities within the alliance. Leveraging diverse perspectives, these groups tackle data sharing challenges pertaining to interoperability, stewardship, sustainability, policy and use.
"Impact is a primary focus for RDA," said Fran Berman, chair of RDA/U.S. "In only two years, RDA has begun fulfilling its mission to build the social and technical bridges that enable the open sharing of data. It's exciting to see the start of a pipeline of adopted infrastructure efforts that will accelerate data sharing and data-driven innovation."
A good example of such infrastructure is RDA's Data Type Registries. The registries make it easier to create machine-readable and researcher-accessible data by designing an archive of common data structures that researchers can turn to when deciding how to organize their data. The creation of such a registry will support the accurate use of data to reproduce experiments, confirm findings and interoperate among data sets.
Formed at the first RDA Plenary in early 2013, the Data Type Registries working group has spent the past year developing and testing its new system. The group's infrastructure products are already being adopted by the European Data Infrastructure (EUDAT), the U.S. National Institute of Standards and Technology and other groups that are applying them to their own research activities.
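To make the idea concrete, the sketch below shows, in Python, the kind of record a data type registry might hold and how a consumer could look a type up before interpreting a dataset. The field names, identifier and resolve_type helper are illustrative assumptions for this article, not the actual RDA registry interface.

# Minimal sketch of the data-type-registry idea; the record fields and
# lookup function are illustrative assumptions, not the RDA registry's API.

# A registry maps a persistent type identifier to a machine-readable
# description of the structure researchers agreed on.
TYPE_REGISTRY = {
    "example.org/type/temperature-series-v1": {
        "name": "TemperatureTimeSeries",
        "description": "Hourly air temperature readings",
        "fields": [
            {"name": "timestamp", "type": "ISO-8601 datetime"},
            {"name": "temperature", "type": "float", "unit": "degC"},
        ],
    },
}

def resolve_type(type_id):
    """Return the registered structure for a type identifier, if any."""
    record = TYPE_REGISTRY.get(type_id)
    if record is None:
        raise KeyError(f"Unknown data type: {type_id}")
    return record

# A consumer checks a dataset's declared type before parsing it,
# which is what makes the data machine-actionable across groups.
schema = resolve_type("example.org/type/temperature-series-v1")
print([f["name"] for f in schema["fields"]])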
Another effort underway, still in its early stages, is RDA's Wheat Data Interoperability working group. Made up of members from the French National Institute for Agricultural Research, the International Maize and Wheat Improvement Center and other agriculture-related organizations, the group aims to build an integrated wheat information system for the international community of wheat researchers, growers and breeders. With approximately three-quarters of all U.S. grain products made from wheat flour, advancing and sustaining wheat-related science is critical, and improving the sharing of wheat data is an important first step.
As RDA enters its third year, its community of data researchers continues to grow. The organization is working closely with countries, communities and agencies to expand the alliance to include new participants. These include partners in Japan, Brazil, Canada and South Africa, as well as U.S. projects and organizations such as CENDI (Commerce, Energy, NASA, Defense Information Managers Group), the National Data Service, EarthCube and the Sustaining Digital Repositories group.
Throughout its expansion, the alliance's focus will remain on the development of products that promote data sharing and exchange and the establishment of diverse collaborations.
"With its tremendous success in the first two years, its growing reputation as a gathering place for the global research data community and its targeted focus on impact and infrastructure, RDA is capitalizing on its momentum to reach a broader community, and fulfill its goal of research sharing without barriers," Berman said.
-- Aaron Dubrow, NSF
-- Yolanda Meleco, Research Data Alliance
Investigators
Francine Berman
Beth Plale
Mark Parsons
Laurence Lannom
Related Institutions/Organizations
Rensselaer Polytechnic Institute
NSF SUPPORTS SCIENCE BIG DATA SHARING THROUGH SCISERVER
FROM: NATIONAL SCIENCE FOUNDATION
SciServer: Big Data infrastructure for science
Research team from Johns Hopkins extends tools from Sloan Digital Sky Survey to new scientific communities
Big Data comes naturally to science. Every year, scientists in every field, from astronomy to zoology, make tremendous leaps in their ability to generate valuable data.
But all of this information comes at a price. As datasets grow exponentially, so do the problems and costs associated with accessing, reading, sharing and processing them.
A new project called SciServer, supported by the National Science Foundation (NSF), aims to build a long-term, flexible ecosystem to provide access to the enormous data sets from observations and simulation.
"SciServer will help meet the challenges of Big Data," said Alex Szalay of Johns Hopkins University, the principal investigator of the five-year NSF-funded project and the architect for the Science Archive of the Sloan Digital Sky Survey. "By building a common infrastructure, we can create data access and analysis tools useful to all areas of science."
SciServer's heritage: Big Data in astronomy
SciServer grew out of work with the Sloan Digital Sky Survey (SDSS), an ambitious, ongoing project to map the entire universe.
"When the SDSS began in 1998, astronomers had data for less than 200,000 galaxies," said Ani Thakar, an astronomer at Johns Hopkins who is part of the SciServer team. "Within five years after SDSS began, we had nearly 200 million galaxies in our database. Today, the SDSS data exceeds 70 terabytes, covering more than 220 million galaxies and 260 million stars."
The Johns Hopkins team created several online tools for accessing SDSS data. For instance, using the SkyServer website, anyone with a web browser can navigate through the sky, getting detailed information about stars or searching for objects using multiple criteria. The site also includes classroom-ready educational activities that allow students to learn science using cutting-edge data.
To allow users--scientists, citizen scientists and students--to run longer-term analyses of the Sloan data, the team created CasJobs, an online workbench where registered users can run queries for up to eight hours and store the results in a personal "MyDB" database for later analysis.
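For readers unfamiliar with this workflow, the hedged sketch below shows the kind of SQL query a SkyServer or CasJobs user might run, submitted here to SkyServer's public SQL search service. The endpoint URL and parameter names are assumptions modeled on the DR12 service and may differ; the long-running CasJobs/MyDB workflow additionally requires a registered account.

# Hedged sketch: querying SDSS data the way SkyServer users do.
# The endpoint URL and parameter names are assumptions modeled on the
# public DR12 SQL search service and may need adjusting.
import requests

SKYSERVER_SQL_URL = "https://skyserver.sdss.org/dr12/en/tools/search/x_sql.aspx"

# A short synchronous query; in CasJobs the same SQL could run for up to
# eight hours and write its results into the user's personal MyDB database.
sql = """
SELECT TOP 10 objID, ra, dec, r
FROM PhotoObj
WHERE r BETWEEN 15 AND 16
"""

response = requests.get(SKYSERVER_SQL_URL, params={"cmd": sql, "format": "csv"})
response.raise_for_status()
print(response.text)

In CasJobs, a query like this would typically be written into a MyDB table rather than returned directly, which is what makes multi-hour analyses practical.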
With each new tool, the community of users grew, leading to more and more scientific discoveries.
The problem: data without infrastructure
One major challenge in managing and extracting value from Big Data is simply preserving the data as file formats change and scientists retire. Another challenge is that most datasets are stored in an ad hoc manner with insufficient metadata for describing how the data should be interpreted and used. Yet another challenge is unequal access to data and expertise among researchers.
Even when individual datasets are well-preserved, the difficulty of combining data for joint analysis means that researchers miss opportunities for new insights. The result is that scientists work inefficiently and miss chances to grow their research projects in new directions.
A variety of projects have developed approaches to preserving and managing datasets, but providing easy access so all researchers can compare, analyze and share them remains a problem. The SciServer team has spent the last two decades addressing these problems, first in astronomy and then in other areas of science.
From SkyServer to SciServer: the new approach
Led by Szalay, the team began work on SciServer in 2013 with funding from NSF's Data Infrastructure Building Blocks program.
Set to launch in phases over the next four years, SciServer will deliver significant benefits to the scientific community by extending the infrastructure developed for SDSS astronomy data to many other areas of science.
"Our approach in designing SciServer is to bring the analysis to the data. This means that scientists can search and analyze Big Data without downloading terabytes of data, resulting in much faster processing times," Szalay said. "Bringing the analysis to the data also makes it much easier to compare and combine datasets, allowing researchers to discover new and surprising connections between them."
Szalay and his team are working in close collaboration with research partners to specify real-world use cases to ensure that the system will be most helpful to working scientists. In fact, they have already made significant progress in two fields: soil ecology and fluid dynamics.
To help ease the burden on researchers, the team developed "SciDrive," a cloud data storage system for scientific data that allows scientists to upload and share data using a Dropbox-like interface. The interface automatically reads the data into a database, making it searchable online and easy to cross-correlate with other data sources.
SciServer will extend this capability to a new citizen science project called GLUSEEN (Global Urban Soil Ecological & Educational Network), which aims to gather worldwide distributed data on soil ecology across a range of climatic conditions. SciDrive will offer extensive new collaborative features and will allow individuals to connect remote sensor measurements to weather and other datasets that are available from external worldwide providers.
"Our approach with SciDrive and citizen science immediately will be useful to many other areas of science where datasets managed by individual researchers must be combined with larger publicly-available datasets," said Szalay.
SciServer also has a major initiative underway to develop an "open numerical laboratory" for the access and processing of large simulation databases. Working with the Turbulence Simulation group at Johns Hopkins, they are developing a pilot system to integrate data sets and processing workflows from simulation of turbulence into SciServer.
As the SciServer system becomes more mature, the team will expand to benefit other areas of science including genomics--where researchers must cross-correlate petabytes of data to understand entire genomes--and connectomics--where researchers explore cellular connections across the entire structure of the brain. These collaborations will be spread over a five-year period from 2013 to 2018, and will allow SciServer to be incrementally architected and developed to support its growing capabilities.
"Our conscious strategy of 'going from working to working'--building tools by adapting existing, working tools--is a key factor in ensuring the success of our project," Szalay said. "The tools we build will create a fully-functional, user-driven system from the beginning, making SciServer an indispensable tool for doing science in the 21st century."
-- Mike Rippin, Johns Hopkins University (202) 431-7217 mike.rippin@jhu.edu
-- Aaron Dubrow, NSF (703) 292-4489 adubrow@nsf.gov
Investigators
Alexander Szalay
Randal Burns
Michael Rippin
Steven Salzberg
Aniruddha Thakar
Charles Meneveau
Related Institutions/Organizations
Johns Hopkins University