[info] Supercomputing's Next Revolution

Eugen Leitl <eugen at leitl.org> on Mon Mar 19 20:43:27 UTC 2007

http://www.wired.com/news/technology/computers/1,72090-0.html

By Paul Tulloch 02:00 AM Nov, 09, 2006

Video gamers' cravings for ever-more-realistic play have spawned a
technological arms race that could help cure cancer, predict the next big
earthquake in San Francisco and crack many other mathematical puzzles
currently beyond the reach of the world's most powerful computers.

At the SuperComputing 2006 conference next week in Tampa, Florida,
researchers from the University of North Carolina at Chapel Hill will release
benchmark tests showing how specialized graphics processing units, or GPUs,
developed for the games industry over the past few years compare with
all-purpose central processing units, or CPUs, that currently bear the brunt
of most computing tasks.

The lab tests come amid growing efforts to harness the GPU for general
high-performance computing, and the UNC paper promises to be something of a
showstopper at the weeklong gathering of the supercomputing elite: According
to the Chapel Hill team, a low-cost parallel data processing GPU system can
conservatively surpass the latest CPU-based systems by two to five times in a
wide variety of tasks.

Those results follow on the heels of a major GPU experiment by Stanford
University's Folding@Home project, which last month opened a public beta test
of software aimed at harnessing otherwise unused graphics processing power in
PCs and game consoles connected over the internet. As of Tuesday, data in
that test showed breathtaking performance gains of 20 to 40 times over CPUs:
An array of 536 GPUs donated to the project significantly outperformed some
17,485 CPUs from Linux boxes, with the GPUs producing 35 trillion
calculations per second compared to 21 trillion calculations per second for
the CPUs.
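
The aggregate figures in the preceding paragraph can be reduced to a rough
per-device comparison. The sketch below is only back-of-the-envelope
arithmetic on the numbers quoted; the variable and function names are
illustrative, and the article does not say how the project derived its own
20-to-40-times range.

```python
# Figures quoted in the article: device counts and aggregate throughput
# in trillions of calculations per second.
GPU_COUNT = 536
GPU_TOTAL_TFLOPS = 35.0
CPU_COUNT = 17_485
CPU_TOTAL_TFLOPS = 21.0


def per_device_gflops(total_tflops: float, device_count: int) -> float:
    """Average throughput per device, in billions of calculations per second."""
    return total_tflops * 1000.0 / device_count


gpu_each = per_device_gflops(GPU_TOTAL_TFLOPS, GPU_COUNT)
cpu_each = per_device_gflops(CPU_TOTAL_TFLOPS, CPU_COUNT)
ratio = gpu_each / cpu_each

print(f"average per GPU: {gpu_each:.1f} gigaflops")
print(f"average per CPU: {cpu_each:.2f} gigaflops")
print(f"per-device ratio: about {ratio:.0f}x")
```

Simple averaging gives a ratio of about 54, above the quoted range. That
range presumably reflects a different basis of comparison, such as
measurements on individual tasks.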

Signs of a breakthrough are coming as Nvidia and ATI, the two dominant GPU
makers, are opening up their technology for non-graphics-related
applications.

On Wednesday, Nvidia announced the industry's first C-compiler development
environment for the GPU, called CUDA, a move that will make it easier to tap
the GPU for custom applications, from product design to number crunching.
Nvidia general manager for GPU computing, Andy Keane, said the company
created a completely new architecture for its newest GPU, the GeForce 8800,
adding a cache that allows the chip to work in two modes--one for graphics
that uses "stream processing" and a second so-called load-store mode for
more complex logic-based operations.

"The GPU now looks like a CPU," Keane said. "CUDA provides a very flexible
and accessible way to access the amazing performance inside the GPU in a way
people can actually use."

ATI, meanwhile, is preparing to release some of its proprietary technology to
the public domain in order to help drive third-party development of
non-graphics-related GPU applications. A major announcement on this front is
expected soon, ATI spokesman Chris Evenden told Wired News.

"ATI believes that in order to maximize the potential of stream processing, a
necessary ecosystem must be established," he said. "ATI is committed to
realizing and enabling this ecosystem with various innovators within the
stream processing environment." However, Evenden gave no firm date and did
not reveal specifics of the technology to be released.

Fifty years after the Maniac II debuted at the Los Alamos lab in New Mexico,
experimental high-performance computing is reaching new heights on the back
of the consumer gaming industry. This summer, IBM announced the Roadrunner,
based on 16,000 AMD Opteron dual-core chips and the same number of IBM Cell
processors (which are at the heart of Sony's new PlayStation 3 console due to
be released later this month). When completed, the device will generate 1,000
trillion calculations per second, or one petaflop.

Such machines can tackle complex problems that until now have been
computationally intractable. Another leap in performance would bring within
reach even the most challenging calculations, potentially spawning entirely
new fields of research that have been impractical until now.

A small group of researchers believe those gains can be made by tapping the
processing power of graphics processors developed by the consumer video
gaming industry. "There's a real revolution in the works," said Folding@Home
director Vijay Pande in an e-mail to Wired News.

The GPU is a number-crunching workhorse that for the past five years has
offered computing improvements at a fantastic clip in the form of
ever-crisper graphics coveted by video game fans. High-end devices can run up
to $600, which generally limits them to the more expensive gaming machines
and devices, although they are still much cheaper than top CPU products based
on processors such as the $2,150 AMD Opteron 8220 SE.

ATI and Nvidia have battled relentlessly for dominance in this market,
producing a competitive environment with such rapid and robust innovation
cycles that the two companies are now served up as models for the tech
industry. In a sign of the growing importance of graphics processors,
chipmaker Advanced Micro Devices inked a deal in July to acquire ATI for $5.4
billion, and then unveiled plans to develop a new "fusion" chip that combines
CPU and GPU functions.

Academic interest has picked up over the past two years, but the real spur
for GPU innovation has been intense competition for high-volume and commodity
applications like computer gaming, says Dinesh Manocha of UNC Chapel Hill's
Gamma Research Team, which will present some of its GPU performance findings
next week in Tampa.

"Their peak throughput power of GPUs for rasterization appears to grow as a
factor of two (or more) every year, because of the video gaming industry,
which provides the economic motivation," he wrote in a reply to e-mailed
questions. "Whether the GPUs are widely used for (high-performance computing)
or not, they will continue to grow."

How Fast Is Fast?

There are four basic things you need to know about GPUs. First, they are fast
and about to get a whole lot faster. Second, they are cheap, measured on a
performance-per-dollar basis. Third, they use a lot less power than CPUs when
compared on a performance-per-watt basis.

So you're probably wondering, if a GPU is faster, cheaper and uses less power
than a CPU, why doesn't your computer run on one? That brings us to the
fourth thing you need to know about GPUs, namely their limitations.

GPUs are only good for tasks that perform some type of number crunching. As a
result, you will not be running your word processor on a GPU; that is the job
of the more serially logic-oriented CPU. The GPU operates within a parallel
processing environment, which is quite conducive to fast computation but not
to branching or complex, layered decision-making algorithms.

The GPU was designed specifically to process graphics, and that means
processing streams of data. What it gives up in flexibility it makes up in
speed. To deliver the graphics required by the latest games means it has to
process data really fast.
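
To make the contrast between stream processing and branching logic concrete,
here is a minimal sketch in plain Python. It is an illustration only: real
GPU code is written against a graphics or compute API, and the function names
(saxpy_kernel, run_stream, clamped_running_total) are invented for this
example. The first pattern applies the same small operation to every element
independently, which is the kind of work parallel hardware can spread across
many units. The second carries a result from step to step and branches on it,
so it has to run in order.

```python
from typing import Callable, List, Sequence


def saxpy_kernel(a: float, x: float, y: float) -> float:
    """One element of a*x + y. No element depends on any other."""
    return a * x + y


def run_stream(kernel: Callable[[float, float, float], float],
               a: float, xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Apply the kernel to each pair of elements.

    The loop runs sequentially here, but because the iterations are
    independent they could run in any order, or all at once, on
    parallel hardware.
    """
    return [kernel(a, x, y) for x, y in zip(xs, ys)]


def clamped_running_total(values: Sequence[float], limit: float) -> List[float]:
    """A serial counterexample: each step needs the previous total and
    branches on it, so the iterations cannot simply run side by side."""
    totals = []
    total = 0.0
    for v in values:
        if total + v > limit:  # data-dependent branch
            total = limit
        else:
            total += v
        totals.append(total)
    return totals


xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]
print(run_stream(saxpy_kernel, 2.0, xs, ys))   # [12.0, 24.0, 36.0, 48.0]
print(clamped_running_total(xs, limit=5.0))    # [1.0, 3.0, 5.0, 5.0]
```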

How fast?

This is the subject of quite a bit of speculation. ATI provided the following
"hockey stick" chart comparing GPU and CPU performance, although this is
subject to important caveats described below:

The graph compares the latest X1900 series of GPUs manufactured by AMD/ATI to
the latest dual-core AMD Opteron CPUs produced by the same company.
Performance is measured in gigaflops, or billions of floating-point
calculations per second.

As you can see, the current GPUs have rocketed ahead of CPUs on pure, raw
processing power. Going by the graph, one would expect GPUs to be at least
four to five times faster than CPUs. However, rumors are circulating that peg
the latest dual ATI X1900 GPUs running in CrossFire mode near the
one-teraflop range, so it is a safe bet that the four- to fivefold speed
increase shown in the graph should be viewed as a conservative estimate.

That is simply an amazing amount of processing power for less than a thousand
dollars. Just a few short years ago, one gigaflop of processing power running
in a Beowulf cluster setup would have run you about $30,000.
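
As a rough illustration of that price comparison, the arithmetic can be
written out as follows. The inputs are assumptions taken loosely from the
text: $1,000 is treated as an upper bound on the GPU system's price, and one
teraflop is the rumored, unconfirmed peak figure, so the result is a sketch
of the scale involved and not a measurement.

```python
GPU_SYSTEM_COST_USD = 1_000.0         # "less than a thousand dollars" (upper bound)
GPU_SYSTEM_GFLOPS = 1_000.0           # rumored ~1 teraflop = 1,000 gigaflops
BEOWULF_COST_PER_GFLOP_USD = 30_000.0  # figure quoted for a few years earlier

gpu_cost_per_gflop = GPU_SYSTEM_COST_USD / GPU_SYSTEM_GFLOPS
improvement = BEOWULF_COST_PER_GFLOP_USD / gpu_cost_per_gflop

print(f"GPU system: about ${gpu_cost_per_gflop:,.2f} per peak gigaflop")
print(f"Earlier Beowulf figure: about {improvement:,.0f} times that")
```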

On paper this comparison seems to put the GPU in the stratosphere of
processing power; however, in reality many variables can influence the final
performance of processors embedded within a system to perform a given task.
Measurements based on flops alone can sometimes be misleading. So although
these new GPUs out of the box have some of the highest measures of raw
processing power ever witnessed, how do they perform when embedded within a
system?

The UNC Chapel Hill Gamma Research Team under laboratory-type conditions put
an Nvidia 7900 GTX GPU up against two different leading-edge optimized
CPU-based implementations running on high-end, dual-3.6-GHz Intel Xeon
processors or dual AMD Opteron 280 processors. The research team, which
included Manocha, Naga K. Govindaraju and Scott Larsen from UNC and Jim Gray
from Microsoft Research, put these systems through three fairly standard
numerical algorithms: sorting, FFT (fast Fourier transform) and matrix
multiplication.

The results they recorded show that the GPU performed at anywhere from two to
five times the speed of the CPU-based systems on these specific applications.
Naga Govindaraju, the main developer of these algorithms, will present the
results at the SuperComputing conference in Tampa.

Earlier this year, some of the Gamma group researchers, in collaboration with
Microsoft's Gray, developed GPUTeraSort, which sorted 590 million records in
644 seconds on a system equipped with an Nvidia 7800 GT and costing less than
$1,200. It was enough to win the coveted PennySort benchmark for sorting.

The co-lead of the Gamma group, Ming C. Lin, is leading the development of
many new GPU-based technologies for physics simulation -- including collision
detection, motion planning and deformable simulations -- with speeds in many
cases increasing 10 to 20 times beyond previous methods.

Gamma group members have received very strong support from Nvidia in
developing these new GPU-based technologies over the last three to four
years.

The Gamma Research Team's work would seem to align well with the ATI
comparisons. There is, however, plenty of variance in the outcomes when
comparing GPU and CPU performance. This has a lot to do with the nature of
the processing involved in the computation.

Some algorithms fit nicely with the programming environment the GPU offers
and some do not. A lot of this has to do with the design of the GPU and the
parallel processing environment from which it gets its speed. Recall that the
entire technology from head to toe was designed for the gaming industry, not
general purpose mathematical computing.

There are ways to trick the processing system into performing general-purpose
computation. However, these deceptions can only take you so far before the
GPU runs up against a wall in its ability to accommodate the requirements
of a particular algorithm. So it would seem, based on the Gamma work, that
the limiting factor in many cases is not the GPU's raw processing power but
how well a particular algorithm fits the design of the GPU's computation
hardware and its parallel processing environment. This gets a bit technical,
but it goes back to the old adage: square pegs do not fit into round holes.
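
One way to picture an algorithm that has been shaped to fit parallel hardware
is a sorting network. The sketch below implements a bitonic sort in plain
Python. It is an assumption chosen for illustration: the article does not say
which sorting method the UNC team used, and the function name is invented
here. What matters is the structure. The pattern of comparisons is fixed in
advance, and within each stage every element belongs to exactly one pair, so
all of a stage's compare-and-swap steps could in principle run at the same
time.

```python
from typing import List, Sequence


def bitonic_sort(data: Sequence[int]) -> List[int]:
    """Sort a power-of-two-length sequence with a bitonic sorting network."""
    a = list(data)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")

    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            # One stage: the pairs (i, i ^ j) are disjoint, so this inner
            # loop is the part parallel hardware could run all at once.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    out_of_order = (a[i] > a[partner]) if ascending \
                        else (a[i] < a[partner])
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([5, 3, 8, 1, 9, 2, 7, 4]))   # [1, 2, 3, 4, 5, 7, 8, 9]
```

A comparison sort driven by data-dependent branching does more varied work
per step. The network trades that flexibility for regularity, which is the
kind of fit the paragraph above describes.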

Lab benchmarks are one thing, and field research is another.

Folding@Home director Pande says early results in his group's GPU experiment
confirm some speed gains for specific tasks, but, similar to the UNC results,
some variance was experienced.

The Folding@Home project is an extremely large computational research project
dedicated to modeling protein folding behavior and its relationship to
different diseases such as Alzheimer's, Huntington's, Parkinson's and various
forms of cancer. It is exactly the type of project for which GPU technology
could provide a low-cost, high-performance computing solution.

The highly complex mathematics involved in modeling protein folding requires
many millions upon millions of calculations. Even today's largest
supercomputers, assuming Pande's team could afford the processing time, would
not be adequate to perform these calculations in a timely manner. So, as an
alternative, Pande distributed a software package over the internet to people
across the world to allow participants to run small portions of the
calculations on their home desktop computers.

This established distributed supercomputing capacity through the internet by
utilizing the spare processing capacity of the world's home computers. The
capacity is determined by the number of users participating in the project,
and at peak times Pande's team commands more computational power than several
supercomputers.
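
The general shape of such a volunteer-computing arrangement can be sketched
in a few lines of Python. This is not the project's actual software: the
function names are hypothetical, worker threads stand in for participants'
home computers, and a sum of squares stands in for the real science. It only
shows the idea of cutting one large job into independent work units, handing
them out and combining what comes back.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_into_work_units(total_steps: int, unit_size: int) -> List[Tuple[int, int]]:
    """Cut one long job into independent (start, stop) ranges."""
    return [(start, min(start + unit_size, total_steps))
            for start in range(0, total_steps, unit_size)]


def volunteer_compute(unit: Tuple[int, int]) -> int:
    """Stand-in calculation: one participant's machine processes one unit
    and sends back a small result."""
    start, stop = unit
    return sum(i * i for i in range(start, stop))


def run_project(total_steps: int, unit_size: int, volunteers: int) -> int:
    units = split_into_work_units(total_steps, unit_size)
    # Each worker thread plays the role of one volunteer's computer.
    with ThreadPoolExecutor(max_workers=volunteers) as pool:
        partial_results = list(pool.map(volunteer_compute, units))
    return sum(partial_results)  # the central server combines the results


total = run_project(total_steps=10_000, unit_size=250, volunteers=8)
print(total == sum(i * i for i in range(10_000)))   # True
```

Adding volunteers in this sketch only raises the number of units in flight at
once, which mirrors the point that capacity depends on how many people take
part.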

Not satisfied with that, the team expanded the reach of the computational
capacity, enlarging the project to include tapping into idle GPUs sitting on
people's home computers as well. It is one of the first large-scale
applications of non-graphics GPU technology in the world.

I arranged to meet up with Pande to discuss the team's experiences so far
with the GPU technology.

When we met, there were two things that immediately struck me about Pande.
First, he is a man obsessed with understanding the biological process of
protein folding. Second, he is a man obsessed with extracting every last
spare computational processing cycle in the world to model the behavior of
protein folding.

When he started reading about the huge potential of raw number-crunching
capabilities developing within the GPU chipset, he acted quickly to find out
how much.

Members of his project team started researching this potential a couple of
years back, he said, and are now in the midst of beta testing the rollout of
their work.

"We have been quite pragmatic about what technology we use and where it comes
from for the Folding@Home project," he said. "In fact we are again looking
into the gaming industry at some of the developments occurring with the
physics engine GPU-based technology for gaming. We also are working quite
hard on the multi-GPU technology. We could see some amazing results from both
initiatives."

Pande indicated that in some cases where his team spent upward of a year
grooming the code, it achieved a 40-fold increase in speed. In other cases
where less time was spent preparing the code and the nature of the numerical
processing task was not well-suited for GPU processing, the researchers
witnessed no performance gain at all. Overall, they typically registered
gains on the order of 10 to 20 times.

They spent a large amount of time grooming the code necessary for getting
GPUs to perform tasks unrelated to the graphics processing they are designed
for, Pande said. With the latest release of graphics cards, the process was
somewhat easier to program, but still required some extra effort.

Not only are the programmers required to basically trick the GPU into
performing non-graphics-based computations, but the GPU further challenges
the programmer with its parallel processing environment. Both of these tasks
are made more difficult by the fact that much of the team's understanding of
the inner workings of the GPU was gained through trial and error.

This is due to proprietary knowledge being kept under lock and key by the two
main suppliers of GPUs, ATI and Nvidia. Attempting to understand the inner
workings of the GPU formed a major roadblock in harnessing this technology,
Pande said.

Manocha said that although the hardware end of things has produced a
legitimate platform to begin the quest of harnessing GPU processing power, on
the software end of the equation, developing the required infrastructure to
bring this technology to maturity has a long way to go.

One of the first organized commercial software efforts to take up the GPU
challenge comes from a company called PeakStream, which aims to make it
possible "to easily program new high-performance processors such as
multicore CPUs, graphics processing units and cell processors," according to
a published statement from the company. Another startup tackling this field
is RapidMind.

One other wild card is the extent to which ATI and Nvidia plan to support
the development of non-graphics GPU processing. A lack of such support has
been one of the larger issues preventing the diffusion of this technology.

ATI's and Nvidia's commitment to making their technology publicly accessible
will be pivotal in developing the potential of GPU technology and would be a
major innovation for the future, Manocha believes. Moreover, game physics has
the potential to become the killer application of the technology.

"By opening up the GPU, the vendors will greatly increase the pace of
research, development and application of this technology," he said. "After
that the target will be for somebody to develop the killer app, and that may
be the last pillar needed to see the non-graphics GPU technology attract the
economic interests required to launch it into the mainstream."

Nvidia did not return calls seeking comment.
