
Sunday, December 7, 2014

NSF SUPPORTS SCIENCE BIG DATA SHARING THROUGH SCISERVER

FROM:  NATIONAL SCIENCE FOUNDATION 
SciServer: Big Data infrastructure for science

Research team from Johns Hopkins extends tools from Sloan Digital Sky Survey to new scientific communities

Big Data comes naturally to science. Every year, scientists in every field, from astronomy to zoology, make tremendous leaps in their ability to generate valuable data.

But all of this information comes at a price. As datasets grow exponentially, so do the problems and costs associated with accessing, reading, sharing and processing them.

A new project called SciServer, supported by the National Science Foundation (NSF), aims to build a long-term, flexible ecosystem that provides access to the enormous datasets generated by observations and simulations.

"SciServer will help meet the challenges of Big Data," said Alex Szalay of Johns Hopkins University, the principal investigator of the five-year NSF-funded project and the architect for the Science Archive of the Sloan Digital Sky Survey. "By building a common infrastructure, we can create data access and analysis tools useful to all areas of science."

SciServer's heritage: Big Data in astronomy

SciServer grew out of work with the Sloan Digital Sky Survey (SDSS), an ambitious, ongoing project to map the entire universe.

"When the SDSS began in 1998, astronomers had data for less than 200,000 galaxies," said Ani Thakar, an astronomer at Johns Hopkins who is part of the SciServer team. "Within five years after SDSS began, we had nearly 200 million galaxies in our database. Today, the SDSS data exceeds 70 terabytes, covering more than 220 million galaxies and 260 million stars."

The Johns Hopkins team created several online tools for accessing SDSS data. For instance, using the SkyServer website, anyone with a web browser can navigate through the sky, getting detailed information about stars or searching for objects using multiple criteria. The site also includes classroom-ready educational activities that allow students to learn science using cutting-edge data.

To allow users--scientists, citizen scientists, students--to run longer-term analyses of the Sloan data, they created CasJobs, an online workbench where registered users can run queries for up to eight hours and store results in a personal "MyDB" database for later analysis.
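As a rough illustration of that workflow, the sketch below submits a long-running SQL query and stores the result in a personal MyDB table. The table and column names follow the public SDSS schema, but the submission endpoint shown is a hypothetical placeholder, not the actual CasJobs interface.

```python
# A minimal sketch of a CasJobs-style job, assuming a hypothetical REST
# endpoint; table and column names follow the public SDSS schema.
import requests

CASJOBS_SUBMIT_URL = "https://example.org/casjobs/submit"  # hypothetical endpoint

# Select bright galaxies in a small patch of sky and store the result
# server-side in a personal MyDB table instead of downloading the catalog.
query = """
SELECT objID, ra, dec, u, g, r, i, z
INTO mydb.bright_galaxies
FROM PhotoObj
WHERE type = 3            -- 3 = galaxy in the SDSS photometric catalog
  AND r BETWEEN 15 AND 17 -- bright in the r band
  AND ra BETWEEN 180 AND 181
  AND dec BETWEEN 0 AND 1
"""

response = requests.post(CASJOBS_SUBMIT_URL, data={"query": query})
print("Job submitted, status:", response.status_code)
```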

With each new tool, the community of users grew, leading to more and more scientific discoveries.

The problem: data without infrastructure

One major challenge in managing and extracting value from Big Data is simply preserving the data as file formats change and scientists retire. Another challenge is that most datasets are stored in an ad hoc manner with insufficient metadata for describing how the data should be interpreted and used. Yet another challenge is unequal access to data and expertise among researchers.

Even when individual datasets are well-preserved, the difficulty of combining data for joint analysis means that researchers miss opportunities for new insights. The result is that scientists work inefficiently and miss chances to grow their research projects in new directions.

A variety of projects have developed approaches to preserving and managing datasets, but providing easy access so all researchers can compare, analyze and share them remains a problem. The SciServer team has spent the last two decades addressing these problems, first in astronomy and then in other areas of science.

From SkyServer to SciServer: the new approach

Led by Szalay, the team began work on SciServer in 2013 with funding from NSF's Data Infrastructure Building Blocks program.

Set to launch in phases over the next four years, SciServer will deliver significant benefits to the scientific community by extending the infrastructure developed for SDSS astronomy data to many other areas of science.

"Our approach in designing SciServer is to bring the analysis to the data. This means that scientists can search and analyze Big Data without downloading terabytes of data, resulting in much faster processing times," Szalay said. "Bringing the analysis to the data also makes it much easier to compare and combine datasets, allowing researchers to discover new and surprising connections between them."

Szalay and his team are working in close collaboration with research partners to specify real-world use cases to ensure that the system will be most helpful to working scientists. In fact, they have already made significant progress in two fields: soil ecology and fluid dynamics.

To help ease the burden on researchers, the team developed "SciDrive," a cloud data storage system for scientific data that allows scientists to upload and share data using a Dropbox-like interface. The interface automatically reads the data into a database, which users can then search online and cross-correlate with other data sources.
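A hypothetical sketch of that pipeline, assuming an illustrative client API rather than SciDrive's actual interface: a CSV file dropped into cloud storage becomes a queryable database table.

```python
# Hypothetical SciDrive-style upload: the endpoint and paths are
# illustrative placeholders, not SciDrive's real API.
import requests

SCIDRIVE_URL = "https://example.org/scidrive"  # hypothetical endpoint

# Drop a CSV into cloud storage, Dropbox-style.
with open("soil_samples.csv", "rb") as f:
    resp = requests.put(f"{SCIDRIVE_URL}/soil/soil_samples.csv", data=f)
print("Upload status:", resp.status_code)

# After automatic ingestion, the same file is addressable as a database
# table and can be cross-correlated with other sources, e.g.:
#   SELECT site_id, ph, organic_matter
#   FROM soil.soil_samples
#   WHERE ph < 5.5
```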

SciServer will extend this capability to a new citizen science project called GLUSEEN (Global Urban Soil Ecological & Educational Network), which aims to gather soil-ecology data from distributed sites worldwide, spanning a range of climatic conditions. SciDrive will add extensive new collaborative features and will let individuals connect remote sensor measurements to weather and other datasets available from external providers around the world.

"Our approach with SciDrive and citizen science immediately will be useful to many other areas of science where datasets managed by individual researchers must be combined with larger publicly-available datasets," said Szalay.

SciServer also has a major initiative underway to develop an "open numerical laboratory" for accessing and processing large simulation databases. Working with the Turbulence Simulation group at Johns Hopkins, the team is developing a pilot system to integrate datasets and processing workflows from turbulence simulations into SciServer.
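In such a numerical laboratory, a researcher asks the archive for interpolated field values at chosen points instead of downloading the whole simulation. The sketch below is illustrative only; the endpoint and request format are assumptions, not the actual interface of the Johns Hopkins turbulence databases.

```python
# Illustrative only: query a simulation archive for velocities at a few
# space-time points rather than downloading terabytes of raw output.
# The endpoint and payload format are assumptions.
import requests

TURBULENCE_URL = "https://example.org/turbulence/getvelocity"  # hypothetical

points = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]   # (x, y, z) sample locations
payload = {"time": 0.25, "points": points}    # one snapshot time

resp = requests.post(TURBULENCE_URL, json=payload)
velocities = resp.json()  # e.g. [[u, v, w], [u, v, w]]
print(velocities)
```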

As the SciServer system becomes more mature, the team will expand to benefit other areas of science including genomics--where researchers must cross-correlate petabytes of data to understand entire genomes--and connectomics--where researchers explore cellular connections across the entire structure of the brain. These collaborations will be spread over a five-year period from 2013 to 2018, and will allow SciServer to be incrementally architected and developed to support its growing capabilities.

"Our conscious strategy of 'going from working to working'--building tools by adapting existing, working tools--is a key factor in ensuring the success of our project," Szalay said. "The tools we build will create a fully-functional, user-driven system from the beginning, making SciServer an indispensable tool for doing science in the 21st century."

-- Mike Rippin, Johns Hopkins University (202) 431-7217 mike.rippin@jhu.edu
-- Aaron Dubrow, NSF (703) 292-4489 adubrow@nsf.gov
Investigators
Alexander Szalay
Randal Burns
Michael Rippin
Steven Salzberg
Aniruddha Thakar
Charles Meneveau
Related Institutions/Organizations
Johns Hopkins University

Saturday, April 19, 2014

BOOSTING COMPUTER PROCESSING POWER

FROM:  NATIONAL SCIENCE FOUNDATION 
Shaving nanoseconds from racing processors
University of Wisconsin researcher finds hidden efficiencies in computer architecture

The computer is one of the most complex machines ever devised and most of us only ever interact with its simplest features. For each keystroke and web-click, thousands of instructions must be communicated in diverse machine languages and millions of calculations computed.

Mark Hill knows more about the inner workings of computer hardware than most. As Amdahl Professor of Computer Science at the University of Wisconsin, he studies the way computers transform 0s and 1s into social networks or eBay purchases, following the chain reaction from personal computer to processor to network hub to cloud and back again.

The layered intricacy of computers is intentionally hidden from those who use--and even those who design, build and program--computers. Machine languages, compilers and network protocols handle much of the messy interactions between various levels within and among computers.

"Our computers are very complicated and it's our job to hide most of this complexity most of the time because if you had to face it all of the time, then you couldn't get done what you want to get done, whether it was solving a problem or providing entertainment," Hill said.

During the last four decades of the 20th century, as computers grew faster and faster, it was advantageous to keep this complexity hidden. However, in the past decade, the exponential growth in processing power that we'd grown used to (often referred to as "Moore's law") has started to level off. It is no longer possible to double computer processing power every two years simply by making transistors smaller and packing more of them onto a chip.
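The arithmetic behind that doubling is simple to sketch (toy numbers, for illustration only):

```python
# Back-of-the-envelope Moore's-law arithmetic: doubling roughly every two
# years compounds into exponential growth. Numbers are illustrative.
def transistors(base_count: float, years: float, doubling_period: float = 2.0) -> float:
    return base_count * 2 ** (years / doubling_period)

# Ten doublings in twenty years is roughly a thousandfold increase:
print(transistors(1e6, 20))  # ~1.02e9, i.e. a million becomes a billion
```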

In response, researchers like Hill and his peers in industry are reexamining the hidden layers of computing architecture and the interfaces between them in order to wring out more processing power for the same cost.

Ready, set...compute

One of the main ways that Hill and others do this is by analyzing the performance of computer tasks. Like a coach with a stopwatch, Hill times how long it takes an ordinary processor to, say, analyze a query from Facebook or perform a web search. He's not only interested in the overall speed of the action, but how long each step in the process takes.
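In software, that stopwatch style of measurement might look like the sketch below (a simplification; real architecture studies use hardware counters and cycle-accurate simulators):

```python
# A toy "stopwatch": time each step of a task separately rather than only
# the end-to-end total, so the slow step stands out.
import time

def timed(label, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.2f} ms")
    return result

data = timed("generate", list, range(1_000_000))
total = timed("sum", sum, data)
```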

Through careful analysis, Hill uncovers inefficiencies, sometimes major ones, in the workflows by which computers operate. Recently, he investigated inefficiencies in the way that computers implement virtual memory and determined that these operations can waste up to 50 percent of a computer's execution cycles. (Virtual memory is a memory management technique that maps the memory addresses used by a program, called virtual addresses, to physical addresses in computer memory, in part so that every program can run as if it were alone on the computer.)
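A toy model of that mapping, for illustration (real hardware walks multi-level page tables and caches recent translations in a TLB):

```python
# Toy paged address translation: split a virtual address into page number
# and offset, look up the physical frame, and recombine.
PAGE_SIZE = 4096                     # 4 KiB pages, a common size

page_table = {0: 7, 1: 3, 2: 11}     # virtual page -> physical frame (toy data)

def translate(virtual_address: int) -> int:
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[page]         # a missing entry would fault in real hardware
    return frame * PAGE_SIZE + offset

print(hex(translate(0x1ABC)))        # virtual page 1 -> frame 3 => 0x3abc
```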

The inefficiencies he found were due to the way computers had evolved over time. Memory had grown a million times bigger since the 1980s, but the way it was used had barely changed at all. A legacy method called paging, created when memory was far smaller, was preventing processors from achieving their peak potential.

Hill designed a solution that uses paging selectively, adopting a simpler address translation method for key parts of important applications. This all but eliminated the problem, cutting the execution cycles wasted on address translation to less than 1 percent. In the age of the nanosecond, fixing such inefficiencies pays dividends. For instance, with such a fix in place, Facebook could buy far fewer computers to handle the same workload, saving millions.
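Schematically, the selective approach looks like the sketch below: one large, contiguous region of an important application is translated with simple base/limit/offset registers, and everything else falls back to conventional paging. The register names and values are toy illustrations, not the actual hardware design.

```python
# Selective paging, schematically: a fast path for one big contiguous
# region (one addition, no table walk), a paged slow path for the rest.
# All values are toy numbers.
PAGE_SIZE = 4096
page_table = {0: 7, 1: 3}                            # fallback page table

BASE, LIMIT, OFFSET = 0x100000, 0x900000, 0x4000000  # toy segment registers

def translate(virtual_address: int) -> int:
    if BASE <= virtual_address < LIMIT:
        return virtual_address + OFFSET              # fast path: no lookup, no miss
    page, offset = divmod(virtual_address, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset     # slow path: paged lookup

print(hex(translate(0x200000)))  # inside the segment  -> 0x4200000
print(hex(translate(0x1ABC)))    # outside: paged, frame 3 -> 0x3abc
```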

"A small change to the operating system and hardware can bring big benefits," he said.

Hill and his colleagues reported the results of their research at the International Symposium on Computer Architecture (ISCA) in June 2013.

Computer companies like Google and Intel are among the richest in the world, with billions in their coffers. So why, one might ask, should university researchers, supported by the National Science Foundation (NSF), have to solve problems with existing hardware?

"Companies can't do this kind of research by themselves, especially the cross-cutting work that goes across many corporations," said Hill. "For those working in the field, if you can cross layers and optimize, I think there's a lot of opportunity to make computer systems better. This creates value in the U.S. for both the economy and all of us who use computers."

"The National Science Foundation is committed to supporting research that makes today's computers more productive in terms of performance, energy-efficiency and helping solve problems arising from the entire spectrum of application domains, while also studying the technologies that will form the basis for tomorrow's computers," said Hong Jiang, a program director in the Computer Information Science and Engineering directorate at NSF.

"In the process of expanding the limits of computation, it's extremely important to find both near-term and long-term solutions to improve performance, power efficiency and resiliency. Professor Mark Hill's pioneering research in computer memory systems is an excellent example of such efforts."

The "divide and conquer" approach to computer architecture design, which kept the various computing layers separate, helped accelerate the industry, while minimizing errors and confusion in an era when faster speeds seemed inevitable. But Hill believes it may be time to break through the layers and create a more integrated framework for computation.

"In the last decade, hardware improvements have slowed tremendously and it remains to be seen what's going to happen," Hill said. "I think we're going to wring out a lot of inefficiencies and still get gains. They're not going to be like the large ones that you've seen before, but I hope that they're sufficient that we can still enable new creations, which is really what this is about."

Most recently, Hill has been exploring how graphics processing units (GPUs), which have become common in personal and cloud computing, can process big-memory tasks more efficiently.

Writing for the proceedings of the International Symposium on High-Performance Computer Architecture, Hill, along with Jason Power and David Wood (also from the University of Wisconsin), showed that it is possible to design virtual memory protocols that are easier to program without slowing down overall performance significantly. This opens the door to GPU-accelerated systems that can compute faster than those with only traditional central processing units.

Accelerating during a slow-down

Improvements to virtual memory and GPU performance are a few examples of places where cross-layer thinking has improved computer hardware performance, but they are also emblematic of a wholesale transformation in the way researchers are thinking about computer architecture in the early 21st century.

Hill led the creation of a white paper, authored by dozens of top U.S. computer scientists, that outlined some of the paradigm-shifts facing computing.

"The 21st century is going to be different from the 20th century in several ways," Hill explained. "In the 20th century, we focused on a generic computer. That's not appropriate anymore. You definitely have to consider where that computer sits. Is it in a piece of smart dust? Is it in your cellphone, or in your laptop or in the cloud? There are different constraints."

Among the other key findings of the report: a shift in focus from the single computer to the network or datacenter; the growing importance of communications in today's workflows, especially relating to Big Data; the growth of energy consumption as a first-order concern in chip and computer design; and the emergence of new, unpredictable technologies that could prove disruptive.

These disruptive technologies are still decades away, however. In the meantime, it's up to computer scientists to rethink what can be done to optimize existing hardware and software. For Hill, this effort is akin to detective work, where the speed of a process serves as a clue to what's happening underneath the cover of a laptop.

"It's all about problem solving," Hill said. "People focus on the end of it, which is like finishing he puzzle, but really it's the creative part of defining what the puzzle is. Then it's the satisfaction that you have created something new, something that has never existed before. It may be a small thing that's not well known to everybody, but you know it's new and I just find great satisfaction in that."

--

NSF has been crucial in supporting Mark D. Hill's research throughout his career. Over more than 26 years, he has been the recipient of 19 NSF grants, which supported not only Hill and his collaborators, but also three dozen PhD students from his group, who themselves have trained more than 100 scientists. Hear his distinguished lecture at NSF from December 2013.

-- Aaron Dubrow, NSF (512) 820-5785 adubrow@nsf.gov
Investigators
Mark Hill
David Wood
James Larus
Gurindar Sohi
Michael Swift
Related Institutions/Organizations
University of Wisconsin-Madison
