More Japanese Vector Supercomputer
Robert Myers
08-20-2003, 08:40 PM
On Wed, 20 Aug 2003 22:26:42 GMT, "Felger Carbon" <fmrfne@jps.net>
wrote:
Thanx to Robert Myers for the URL of the Dongarra presentation on the Japanese Earth Simulator. This is a synopsis of the Dongarra 1.7meg PDF (36 slides):
The first thing is to ignore slide 4, which is the general spec of the NEC SX-7 (proposed) computer. Totally unrelated to the Earth Simulator (ES), which is an SX-6.
The fundamental unit of the ES is a silicon chip that runs at 500MHz and contains one scalar processor and 8 vector processors. This chip is called the Arithmetic Processor (AP). 8 vector processors at 500MHz, performing multiply-adds, is 8GFlops/sec per chip.
So, to get back to your original question, why would Japan/NEC endure
the stunning development costs of a completely new vector processor
that only produces 8GFlops/sec/chip when AMD has an evolutionary
design that can deliver 3.8GFlops/sec/chip?
The answer I think, relative to your original post is in two parts:
1. The market is bigger than you think. This is more of a comp.arch
question than a csiphc question, but problems that can be run
efficiently on many parallel processors can in general also be
efficiently vectorized. The general case is that there is a
formulation in which, for some part of the computation, the entire
problem breaks up into N independent parallel processes. Index those
processes j=1...N and that becomes the variable you vectorize on, not
any variable that naturally would index what you would normally
associate with a vector.
In order for this to work, the data have to be laid out in memory in
some predictable way, which occurs often enough naturally or can be
forced with a scatter/gather, and the processor usually needs to be
able to access data in a vector fashion on a non-unit stride in
memory. The ability to access memory on a non-unit stride is what
separates what I would call a genuine vector unit from SIMD, which
requires either a unit stride or time-consuming pack/unpack
operations.
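A minimal sketch of the stride idea above, in Python for readability. The interleaved layout (N independent processes, each with FIELDS values stored back to back) and the names `strided_load` and `pack` are assumptions made for illustration, not anything from the posts. A vector unit with strided loads can pull one field for every process directly; a unit-stride SIMD unit needs something like the explicit pack step first.

```python
# Hypothetical layout: N independent processes, each owning FIELDS
# consecutive values, so field k of process j lives at j * FIELDS + k.
N = 4
FIELDS = 3
memory = [10 * j + k for j in range(N) for k in range(FIELDS)]


def strided_load(mem, start, stride, count):
    """Model a vector load that steps through memory by `stride` elements."""
    return mem[start:start + stride * count:stride]


def pack(mem, start, stride, count):
    """Model the element-by-element gather a unit-stride unit needs first."""
    packed = []
    for j in range(count):
        packed.append(mem[start + j * stride])
    return packed


# Field 1 of every process: the process index j is what gets vectorized on.
field1 = strided_load(memory, start=1, stride=FIELDS, count=N)
print(field1)  # [1, 11, 21, 31]
```

Both functions return the same values; the difference on real hardware would be that the pack loop costs extra instructions and memory traffic before any arithmetic can start.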
But, you might still say, who cares? So what if Japan controls the
market for supercomputers? What's ultimately at stake here is the
highest-stakes human enterprise of all, which is
molecular biology. We're a *long* way from being able to do molecular
biology with confidence on computers, but the day will come, and if
the US wants to be a player, it has to be a player in the
supercomputer business. Designing atom bombs and space shuttles, by
comparison, is kid's stuff, and the "earth simulator" part of it all
is just an excuse for Japan to provide public financing for a major
move in the direction of technology leadership.
2. Once you've endured the stunning up-front costs, then *you* can get
into the evolutionary design business. The NEC processor is
delivering its 8GFlops/sec/chip lumbering along at 500MHz. If you
only got it up to Madison speeds, 1.5GHz, it would be delivering
24GFlops/sec/chip.
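The arithmetic behind those two figures can be sketched as follows. The only assumption beyond the thread is that a multiply-add counts as 2 floating-point operations per vector pipe per clock, which is what makes 8 pipes at 500MHz come out to 8 GFlops; `peak_gflops` is a name invented for this sketch.

```python
def peak_gflops(vector_pipes, flops_per_pipe_per_clock, clock_ghz):
    """Peak rate = pipes x flops issued per pipe per clock x clock rate."""
    return vector_pipes * flops_per_pipe_per_clock * clock_ghz


# Earth Simulator arithmetic processor as described in the thread.
es_chip = peak_gflops(8, 2, 0.5)
# Same design, hypothetically clocked at the 1.5GHz Madison speed.
at_madison_clock = peak_gflops(8, 2, 1.5)
print(es_chip, at_madison_clock)  # 8.0 24.0
```

These are peak numbers only; sustained rates depend on memory bandwidth and on how well the code vectorizes.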
The NEC Vector processor has the potential to more or less blow the US
off the map permanently as far as the supercomputer business is
concerned, and that's why the U.S., which hitherto had been happy
enough to let PC users finance processor development, suddenly got
back into the business of buying specialized high-performance
processors, including a new vector processor from Cray.
_________
The processor is only one piece of the puzzle and maybe not even the
most important piece. The high-speed Black Widow interconnect fabric
being developed for Red Storm is the real magic in that machine, not
the Opteron. Once the fabric is developed, any processor with a
low-latency interconnect to the processor core can become a player.
Since the earth simulator is so geographically spread out, it must
involve some significant interconnect engineering, to put it mildly,
and that has to have gotten the attention of Washington, which had
previously let the Cray T3E team just evaporate, as well.
Once Washington was roused from its permanent pre-retirement afternoon
nap, it also probably realized that supercomputers weren't the only
business the US was about to be blown out of the water in, and that we
were becoming an also-ran in the high-speed interconnect business as
well, all the hooha about hypertransport and infiniband
notwithstanding. That is to say, Washington couldn't rely on PC users
to pay for state-of-the-art high speed interconnect development,
either.
All told, PC users have been paying for a lot of R&D.
RM
chrisv
08-21-2003, 05:17 AM
On Thu, 21 Aug 2003 09:39:59 GMT, "Felger Carbon" <fmrfne@jps.net>
wrote:
"Robert Myers" <rmyers@rustuck.com> wrote in message
news:q3g8kvklo9f3oh9ff3p2v0ec9kclp8e2v2@4ax.com...
<snip>
Since the earth simulator is so geographically spread out, it must involve some significant interconnect engineering
Uh, Robert, you went through those Dongarra slides too quickly. The geographically spread out system illustrated on slide 21 is for a proposed Japanese national grid system, not the Earth Simulator Computer (ESC).
Slide 2 shows the ESC fully contained in a single dedicated building, and the final comment on that slide lists *centralized* as one of the outstanding merits of the ESC.
BTW: Slide 3 shows the very rapid development of the NEC vector computer line:
1995 SX-4 2GFlops 148 LSI chips
1998 SX-6 8GFlops 32 LSI chips
2002 SX-6 8GFlops 1 chip
(The ESC is a 640-node SX-6)
I don't think the U.S. will overtake Japan's vector processor developments easily. For one thing, Japan is willing to spend $400 million on one vector processor. The U.S. is not. ;-(
Aren't supercomputers obsolete now anyways, with clustering and all?
I'm sure there's a few applications where it's better to have one big
machine, but at what cost?
Guest
08-21-2003, 06:05 AM
chrisv <chrisv@nospam.invalid> wrote: Aren't supercomputers obsolete now anyways, with clustering and all? I'm sure there's a few applications where it's better to have one big machine, but at what cost?
It's often far easier to code for the shared-memory supercomputers than
for the distributed-memory clusters, and in many cases it is necessary to
optimize the human side of things too, and not just the computers.
Also, the memory bandwidth of the typical supercomputer is generations
ahead of most clusters, and most of my problems are memory-bound, not
cpu-bound.
--
Bjørn-Ove Heimsund
Robert Myers
08-21-2003, 07:27 AM
On Thu, 21 Aug 2003 09:39:59 GMT, "Felger Carbon" <fmrfne@jps.net>
wrote:
"Robert Myers" <rmyers@rustuck.com> wrote in message
news:q3g8kvklo9f3oh9ff3p2v0ec9kclp8e2v2@4ax.com...
<snip>
Since the earth simulator is so geographically spread out, it must involve some significant interconnect engineering
Uh, Robert, you went through those Dongarra slides too quickly. The geographically spread out system illustrated on slide 21 is for a proposed Japanese national grid system, not the Earth Simulator Computer (ESC).
Well, actually, I didn't even know about the proposed Japanese
national grid system (that is to say, I looked into the briefing only
far enough to assure myself that it contained the information you were
requesting), otherwise I might have been careful not to leave the
impression that I was referring to it. The one dedicated building the
Earth Simulator occupies is, what, the size of a couple of
basketball courts? At 500MHz, one CPU clock = 2 ns = 60 cm of light travel. Takes a lot
of dribbling to get from one end of the court to the other.
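A quick check of that figure, as a sketch. It uses the vacuum speed of light; signals in real cable travel slower, which the `velocity_factor` parameter stands in for. The 28 m court length and the function names are assumptions made for this example.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # in vacuum


def cm_per_clock(clock_hz, velocity_factor=1.0):
    """Distance a signal covers in one clock period, in centimetres."""
    period_s = 1.0 / clock_hz
    return SPEED_OF_LIGHT_M_PER_S * velocity_factor * period_s * 100.0


def clocks_to_cross(distance_m, clock_hz, velocity_factor=1.0):
    """Clock periods that elapse while a signal crosses `distance_m`."""
    return distance_m * 100.0 / cm_per_clock(clock_hz, velocity_factor)


per_clock = cm_per_clock(500e6)
print(round(per_clock, 1))  # 60.0

# Assumed court length of 28 m; roughly 47 clocks end to end at best.
court = clocks_to_cross(28.0, 500e6)
```

With a velocity factor below 1.0 the per-clock distance shrinks and the clock count grows accordingly.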
Wiring the backplane of a Cray involves carefully measured wire
lengths. A basketball court sized computer must involve either an
awful lot of measured wire or some very careful measurements and a lot
of tuning.
RM
Felger Carbon
08-21-2003, 12:22 PM
"chrisv" <chrisv@nospam.invalid> wrote in message
news:ehh9kvkis9i7bfoakatctfdmctntmj8i2q@4ax.com... Aren't supercomputers obsolete now anyways, with clustering and all? I'm sure there's a few applications where it's better to have one
big machine, but at what cost?
You are correct, there are some applications that really need a
low-latency vector machine. At what cost? What are you willing to
pay? The Japanese government was willing to pay $400 million to get
the Earth Simulator vector machine. This machine was actually
delivered over a year ago, and has been busily crunching numbers ever
since.
The U.S. government has other priorities. It is willing to _study_
vector machines, but is not willing to actually buy a full-scale
vector machine in the ES class.
Felger Carbon
08-21-2003, 12:22 PM
"Robert Myers" <rmyers@rustuck.com> wrote in message
news:pko9kvkacilddrldpi0v4t3236ik7e16pl@4ax.com... The one dedicated building the Earth Simulator occupies is, what, what, the size of a couple of basketball courts? At 500MHz, one CPU clock=2 ns=60 cm. Takes alot of dribbling to get from one end of the court to the other. Wiring the backplane of a Cray involves carefully measured wire lengths. A basketball court sized computer must involve either an awful lot of measured wire or some very careful measurements and
alot of tuning.
Today, all supercomputers, even the Earth Simulator (ES) vector
machine, involve forms of clustering. The fundamental unit of a
P4-based cluster is one P4. Of an Opteron-based cluster, 4 or 8
Opterons. The fundamental unit of the ES is the node, which has a
vector length of 64. Two nodes can fit into one cabinet in the ES
building. So the measured wire problem is contained in half a
cabinet, not in the entire ES building.
The ES is a cluster of nodes. Intercommunication among the cluster
fundamental units is a problem of all clusters, not just the ES.
Again, all supercomputers these days are clusters.
Robert Myers
08-21-2003, 03:45 PM
On Thu, 21 Aug 2003 20:22:32 GMT, "Felger Carbon" <fmrfne@jps.net>
wrote:
<snip>
Today, all supercomputers, even the Earth Simulator (ES) vector machine, involve forms of clustering. The fundamental unit of a P4-based cluster is one P4. Of an Opteron-based cluster, 4 or 8 Opterons. The fundamental unit of the ES is the node, which has a vector length of 64. Two nodes can fit into one cabinet in the ES building. So the measured wire problem is contained in half a cabinet, not in the entire ES building.
I don't know how you get the settling time for the entire cluster down
to a reasonable number without considering physical path delays over
the entire building. It was the first question on my mind when I
heard about the earth simulator. Letting the entire thing run
asynchronously without considering physical location and path delays
between clusters would give decent performance only for problems that
didn't require much global communication. Unless you want the time
step to be set by the speed of sound, fluid mechanical calculations
require global communication at every time step.
RM
Robert Myers
08-21-2003, 05:04 PM
On Thu, 21 Aug 2003 20:22:32 GMT, "Felger Carbon" <fmrfne@jps.net>
wrote:
<snip>
The U.S. government has other priorities. It is willing to _study_ vector machines, but is not willing to actually buy a full-scale vector machine in the ES class.
That actually may not be the dumbest move the U.S. has ever made,
since vector processors may be yesterday's technology. I'd certainly
want to look hard at things like Cell and processor in memory before I
dumped even a measly 400 mill on a super duper new vector processor.
Since superfast processors have national security implications, it
would be reasonable to wonder if we know everything about what the
U.S. is funding in the advanced processor department. My guess would
be that we don't.
RM
Felger Carbon <fmrfne@jps.net> wrote:
: "chrisv" <chrisv@nospam.invalid> wrote in message
: news:ehh9kvkis9i7bfoakatctfdmctntmj8i2q@4ax.com...
::
:: Aren't supercomputers obsolete now anyways, with clustering and all?
:: I'm sure there's a few applications where it's better to have one big
:: machine, but at what cost?
:
: You are correct, there are some applications that really need a
: low-latency vector machine. At what cost? What are you willing to
: pay? The Japanese government was willing to pay $400 million to get
: the Earth Simulator vector machine. This machine was actually
: delivered over a year ago, and has been busily crunching numbers ever
: since.
:
: The U.S. government has other priorities. It is willing to _study_
: vector machines, but is not willing to actually buy a full-scale
: vector machine in the ES class.
Well, you have to admit. When our government is busy spending
$1,000,000,000 (that's billion) a week getting our sons and daughters
killed, it's a foregone conclusion that there will be NO money for
anything else. Agreed? :-(
J.
--
--------
The end to "Personal Computing" as we know it is just around the corner.
TCPA will take away ALL rights from you, the consumer. Learn more
about it here: http://www.againsttcpa.com/what-is-tcpa.html and
here: http://www.againsttcpa.com/tcpa-faq-en.html
Piotr Sawuk
08-22-2003, 11:26 AM
In article <2llakv0fk3qp2b2de2of4okbvnogcmq1rq@4ax.com>,
Robert Myers <rmyers@rustuck.com> writes:
On Thu, 21 Aug 2003 20:22:32 GMT, "Felger Carbon" <fmrfne@jps.net> wrote:
<snip>
Today, all supercomputers, even the Earth Simulator (ES) vector machine, involve forms of clustering. The fundamental unit of a P4-based cluster is one P4. Of an Opteron-based cluster, 4 or 8 Opterons. The fundamental unit of the ES is the node, which has a vector length of 64. Two nodes can fit into one cabinet in the ES building. So the measured wire problem is contained in half a cabinet, not in the entire ES building.
I don't know how you get the settling time for the entire cluster down to a reasonable number without considering physical path delays over the entire building. It was the first question on my mind when I heard about the earth simulator. Letting the entire thing run asynchronously without considering physical location and path delays between clusters would give decent performance only for problems that didn't require much global communication. Unless you want the time step to be set by the speed of sound, fluid mechanical calculations require global communication at every time step.
as far as I have learned in a MP-course the whole point of
vector-computers is to have a single command being applied
to many consecutive array-positions. not global communication,
but communication in the sense of "the third next array-pos
needs to send data" is required. then it also isn't too
difficult to predict the next steps of the execution-path.
I guess that's why they got abandoned: there aren't many
applications which would need vectors with hundreds of
dimensions (since 3 or 4 dimensions can already be handled
by a simple 64-bit-processor with some smart use of the
registers). for example if I would try to simulate the
earth (or similarly complex system) then I could imagine
that a lot of variables need a similar treatment whenever
something changes (like each atom needs to be moved in the
same direction when the object gets moved). Somehow I
suspect that the uses for vector-computers are foreign
for us simply because there aren't many such computers
around. At least I could think of some nice games I
would wish to play on a vector-computer...
did anybody actually look up what this japanese
vector-supercomputer has been used for?
--
Better send the eMails to netscape.net, as to
evade useless burthening of my provider's /dev/null...
before complaining because of my rudeness, read
http://www.unet.univie.ac.at/~a9702387/en/adl/liar-faq.txt
and killfile me...
P
Robert Myers
08-23-2003, 09:06 AM
On 22 Aug 2003 19:26:12 GMT, piotr5@unet.univie.ac.at (Piotr Sawuk)
wrote:
In article <2llakv0fk3qp2b2de2of4okbvnogcmq1rq@4ax.com>, Robert Myers <rmyers@rustuck.com> writes:
<snip>
I don't know how you get the settling time for the entire cluster down to a reasonable number without considering physical path delays over the entire building. It was the first question on my mind when I heard about the earth simulator. Letting the entire thing run asynchronously without considering physical location and path delays between clusters would give decent performance only for problems that didn't require much global communication. Unless you want the time step to be set by the speed of sound, fluid mechanical calculations require global communication at every time step.
as far as I have learned in a MP-course the whole point of vector-computers is to have a single command being applied to many consecutive array-positions. not global communication, but communication in the sense of "the third next array-pos needs to send data" is required. then it also isn't too difficult to predict the next steps of the execution-path. I guess that's why they got abandoned: there aren't many applications which would need vectors with hundreds of dimensions (since 3 or 4 dimensions can already be handled by a simple 64-bit-processor with some smart use of the registers). for example if I would try to simulate the earth (or similarly complex system) then I could imagine that a lot of variables need a similar treatment whenever something changes (like each atom needs to be moved in the same direction when the object gets moved). Somehow I suspect that the uses for vector-computers are foreign for us simply because there aren't many such computers around. At least I could think of some nice games I would wish to play on a vector-computer...
You're conversing with a veteran Cray programmer who is still trying
to get used to the idea of cache and who cut his teeth on the notion
of chime or chain slot time. Want a look at what a real computer
looks like? Check out
http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html
The issues of vector processing and global communication sometimes get
tangled up because of data locality problems, but in general the two
issues are related only weakly.
It's hard to know from your description whether you are referring to
data dependency or a branch embedded in an inner loop as obstacles to
vectorization, but there are ways around both of those obstacles for
many cases of great interest. There is a general strategy for
vectorizing most multi-dimensional simulations of physical phenomena
that I tried to describe in an earlier post, and so the number of
problems to which vector methods are adaptable is quite large.
Vector processors and serious funding for the high-speed interconnect
fabrics were the victims of post-cold-war US DoD and DoE self-delusion
about something called "COTS", or "commercial off-the-shelf," not for
any reason having to do with the class of problems that vector
processing could be applied to.
The Cray 1 was capable of 133 megaflops, cost, as I recall, about $13
million, and required extensive plumbing. A P4 system capable of
delivering performance in the gigaflop range fits into an ATX tower,
can be had for about $1000, and requires no plumbing. Anybody who
could tell a bit from a byte knew this was coming by the time the
Berlin Wall came down, and the US DoD and the DoE decided that letting
PC users fund computer R&D was not such a bad deal.
The NSF had a supercomputer effort going through the nineties, and it
got a lot of press, but the NSF doesn't have the funding clout of the
DoD and the DoE, and things got so bad that Cray could not survive
independently as a manufacturer of computers.
The Earth Simulator changed all that. Cray is back in business and
building vector processors again.
That does not mean that the COTS problem has gone away. This very
month IBM had to tell the US government, once again, that it was not
interested in funding R&D for computers that could not be
commercialized, and that, if the US wanted cutting edge supercomputers
for special purposes, it would have to come up with the money.
did anybody actually look up what this japanese vector-supercomputer has been used for?
The earth simulator is billed as being used for earth sciences.
Atmospheric modeling, aka weather prediction, has long been one of the
most demanding applications for high-performance computing. Weather
prediction is useless if the model doesn't run faster than real time,
and that's a real challenge.
RM
Piotr Sawuk
08-25-2003, 08:04 AM
In article <lj4fkvck59ol4fjl115q44h1ueuqlnkqia@4ax.com>,
Robert Myers <rmyers@rustuck.com> writes:
On 22 Aug 2003 19:26:12 GMT, piotr5@unet.univie.ac.at (Piotr Sawuk) wrote:
<snip>
You're conversing with a veteran Cray programmer who is still trying to get used to the idea of cache and who cut his teeth on the notion of chime or chain slot time. Want a look at what a real computer
that I didn't understand. how do you think a cache would
cause damage to the acceptance of vector-computers? and
what do the other 2 notions mean? I'm just a beginning
programmer (assembler) and from this POV I am merely
interested in chip-design and am quite ignorant on this
topic, while you seem to know a lot in this area...
looks like? Check out http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html The issues of vector processing and global communication sometimes get tangled up because of data locality problems, but in general the two issues are related only weakly.
yes, sorry, my error. I should have said that no MP-programming
is required for vector-supercomputers since their Instruction-set
does already contain commands which can easily get parallelized
when the data-locality is handled smartly enough. i.e. vector
commands could get spread onto multiple processors without the
programmer even noticing a difference. if you are doing conscious
MP-programming on such a computer then of course global communication
is an issue, but otherwise (when all the MP-stuff is handled by
the processor internally) the problem is well known from cpu-design
where multiple execution-units work in parallel on some pre-fetched
commands with branch-prediction and stuff. I'm just saying that
theoretically the whole multi-processor stuff could be hidden
in a supercomputer with vector-computer's instruction-set simply
because data is represented as vectors and not memory-positions...
It's hard to know from your description whether you are referring to data dependency or a branch embedded in an inner loop as obstacles to vectorization, but there are ways around both of those obstacles for many cases of great interest. There is a general strategy for vectorizing most multi-dimensional simulations of physical phenomena
Of course, it's just that I was referring to the similarity
between a vector-computer's capabilities and general MP-strategies.
for example when I have 4 bytes and need to double each of
them, then I load them into a single 32-bit register and
shift that, with some bit-masking afterwards for the
overflow. that's what we all do nowadays, we use the 32-bit
wide execution-unit as if it were 4 8-bit-processors...
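The byte-doubling trick described here can be sketched in a few lines. This is one common way to do it, not necessarily what the poster had in mind; the helper names and the little-endian packing are choices made for the example. Shifting left by one doubles every byte at once, and the mask 0xFEFEFEFE clears the bit that each byte's overflow pushed into its neighbour (and, in Python, the bit that left the 32-bit word entirely).

```python
def pack4(values):
    """Pack four byte values (0..255) into one 32-bit word, little-endian."""
    return int.from_bytes(bytes(values), "little")


def unpack4(word):
    """Split a 32-bit word back into its four bytes, little-endian."""
    return list(word.to_bytes(4, "little"))


def double_bytes(word):
    """Double all four bytes in one shift; each lane wraps modulo 256."""
    # The shift carries every lane's top bit into bit 0 of the next lane;
    # the mask zeroes those four bit-0 positions and drops bit 32.
    return (word << 1) & 0xFEFEFEFE


doubled = unpack4(double_bytes(pack4([1, 2, 128, 255])))
print(doubled)  # [2, 4, 0, 254]
```

Lanes that overflow simply wrap; saturating instead would need extra masking steps.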
that I tried to describe in an earlier post, and so the number of problems to which vector methods are adaptable is quite large.
basically I was just repeating your argument that global
communication (actually the need for synchronization of
the current execution-process to match procedural execution
instead of asynchronous use of the processor-power currently
available) does destroy decent performance, but not just in
some super-computer, but in some well-designed vector-computer
as well. C just isn't the language in which vector-computers
should be programmed...
Vector processors and serious funding for the high-speed interconnect fabrics were the victims of post-cold-war US DoD and DoE self-delusion about something called "COTS", or "commercial off-the-shelf," not for any reason having to do with the class of problems that vector processing could be applied to.
The Cray 1 was capable of 133 megaflops, cost, as I recall, about $13 million, and required extensive plumbing. A P4 system capable of delivering performance in the gigaflop range fits into an ATX tower, can be had for about $1000, and requires no plumbing. Anybody who could tell a bit from a byte knew this was coming by the time the Berlin Wall came down, and the US DoD and the DoE decided that letting PC users fund computer R&D was not such a bad deal.
The NSF had a supercomputer effort going through the nineties, and it got a lot of press, but the NSF doesn't have the funding clout of the DoD and the DoE, and things got so bad that Cray could not survive independently as a manufacturer of computers.
The Earth Simulator changed all that. Cray is back in business and building vector processors again.
That does not mean that the COTS problem has gone away. This very month IBM had to tell the US government, once again, that it was not interested in funding R&D for computers that could not be commercialized, and that, if the US wanted cutting edge supercomputers for special purposes, it would have to come up with the money.
I understand this, you are certainly more experienced than me,
I just think that users paying for R&D of vector-computers could
have been possible too. in the mid-eighties it was quite clear
for many people that MP is the future and I'm still wondering
why no one did come up with vector-computers as a basis for that.
I always envisioned a computer where I plug in a processor, and
then another processor into that and so on, until I have a real
super computer, but somehow my dream didn't come true. not technical
obstacles did block this path, lack of research in vector-computers
did. Just my humble opinion...
did anybody actually look up what this japanese vector-supercomputer has been used for?
The earth simulator is billed as being used for earth sciences. Atmospheric modeling, aka weather prediction, has long been one of the most demanding applications for high-performance computing. Weather prediction is useless if the model doesn't run faster than real time, and that's a real challenge.
I guess Japan is merely interested in earthquakes and maybe
in hurricanes, they don't seem to have much agriculture... :-)
but seriously, earth-sciences is much bigger than mere
Atmospheric modeling, there are enough computers already
working on weather-prediction, prediction of ocean-behaviour,
consequences from global warming and volcanic activities
are much less researched areas of earth-sciences. therefore
I ask again: are you sure that weather is a major application
of this particular Vector Supercomputer (as opposed to
supercomputers in general)?
--
Better send the eMails to netscape.net, as to
evade useless burthening of my provider's /dev/null...
before complaining because of my rudeness, read
http://www.unet.univie.ac.at/~a9702387/en/adl/liar-faq.txt
and killfile me...
P
Robert Myers
08-25-2003, 10:59 AM
On 25 Aug 2003 16:04:20 GMT, piotr5@unet.univie.ac.at (Piotr Sawuk)
wrote:
In article <lj4fkvck59ol4fjl115q44h1ueuqlnkqia@4ax.com>, Robert Myers <rmyers@rustuck.com> writes:
<snip>
You're conversing with a veteran Cray programmer who is still trying to get used to the idea of cache and who cut his teeth on the notion of chime or chain slot time. Want a look at what a real computer
that I didn't understand. how do you think a cache would cause damage to the acceptance of vector-computers?
It wouldn't necessarily, but the whole mentality of Cray-1 programming
was that the entire machine operated synchronously. If you set things
up correctly, the machine could chain a vector load from memory,
multiply, add, and store to memory, with one new result popping out
each clock cycle, with memory being physically addressed as 64-bit
words and not as bytes, and certainly not as cache lines, because
there was no cache.
For someone who got used to a machine like that, cache seems like a
very odd notion. For certain kinds of problems, cache can actually
slow things down. If your data are stored on non-unit stride in
memory, half of every 128 bit cache line load is useless if you're
doing 64-bit floating point, and you may not be able to keep data
around in cache long enough to offset the extra latency of loading
into cache and then into a register.
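To put a number on the wasted half of each line, here is a rough sketch. It assumes the access starts on a line boundary and that every line is touched exactly once; `useful_fraction` is a name made up for this example. The first case is the one from the paragraph above: a 128-bit (16-byte) line, 8-byte floats, every second element used.

```python
def useful_fraction(line_bytes, elem_bytes, stride_elems):
    """Fraction of each loaded cache line that a strided sweep really uses.

    Assumes the sweep starts aligned to a line and touches each line once.
    """
    elems_per_line = line_bytes // elem_bytes
    # Ceiling division: how many of the line's elements fall on the stride.
    used = max(1, -(-elems_per_line // stride_elems))
    return used * elem_bytes / line_bytes


print(useful_fraction(16, 8, 2))  # 0.5
print(useful_fraction(16, 8, 1))  # 1.0
print(useful_fraction(64, 8, 8))  # 0.125
```

The third call shows how the waste grows with longer lines: with a hypothetical 64-byte line and a stride of 8 elements, seven eighths of every line fetched is never used.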
I *still* don't always know for sure what Stream benchmarks mean on
microprocessors because they don't always tell you if they've done
something funny with the cache, like skipping over it. Stream tests
the ability of a microprocessor to do the kinds of streaming
calculation (fetch, multiply, add, store) that the Cray-1 would have
excelled at and that show up very frequently in engineering and
scientific work.
Stream-type calculations and vector machines naturally go together,
and cache is generally just an obstacle for Stream-type calculations.
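For readers who haven't seen it, the fetch, multiply, add, store pattern looks like the sketch below. It follows the shape of the STREAM "triad" kernel, a[i] = b[i] + q * c[i]; the byte counting assumes 8-byte floats and no cache reuse, and the function names are invented for the example.

```python
def triad(b, c, q):
    """One streaming pass: fetch b[i] and c[i], multiply, add, store a[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def bytes_per_flop(elem_bytes=8):
    """Memory traffic per floating-point operation for the triad."""
    traffic = 3 * elem_bytes  # two loads and one store per element
    flops = 2                 # one multiply and one add per element
    return traffic / flops


a = triad([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], 0.5)
print(a)                 # [6.0, 12.0, 18.0]
print(bytes_per_flop())  # 12.0
```

Each element is touched once and never again, which is why a cache adds latency without saving any traffic for this kind of loop.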
and what do the other 2 notions mean?
From the Cray Hardware Manual cited in my previous post:
"V register reservations
The term "reservation" describes the register condition when a
register is in use and therefore not available for use as a result or
as an operand register for another operation. During execution of a
vector instruction, reservations are placed on the operand V registers
and on the result V register. These reservations are placed on the
registers themselves, not on individual elements of the V register."
"A reservation for a result register is lifted during "chain slot"
time. Chain slot time is the clock period that occurs at functional
unit time plus two clock periods. During this clock period, the result
is available for use as an operand in another vector operation. Chain
slot time has no effect on the reservation placed on operand V
registers. A V register may serve only one vector operation as the
source of one or both operands."
That means that, in doing a floating multiply-add, you could use the
result of a vector operation almost immediately by doing the multiply
and initiating the add during the chain slot time. If you missed the
chain slot time, you had to wait for the entire vector multiply to
complete (typically taking as many cycles as the length of the vector)
before you could initiate the add operation. Modern vector units have
chaining built right in, but it was a novelty on the Cray-1 and had to
be coded by hand at just the right time.
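A toy timing model shows what was at stake. It is a deliberate simplification: it ignores memory, instruction issue, and register conflicts, and the functional unit times used (7 clocks for multiply, 6 for add) are assumed values for illustration. The "plus two clock periods" comes from the manual passage quoted above.

```python
def unchained_cycles(n, mul_unit_time, add_unit_time):
    """Add waits until all n multiply results have been produced."""
    multiply_done = mul_unit_time + n
    return multiply_done + add_unit_time + n


def chained_cycles(n, mul_unit_time, add_unit_time):
    """Add is issued at chain slot time: multiply unit time + 2 clocks."""
    chain_slot = mul_unit_time + 2
    return chain_slot + add_unit_time + n


# 64-element vectors, with assumed unit times of 7 (multiply) and 6 (add).
print(unchained_cycles(64, 7, 6))  # 141
print(chained_cycles(64, 7, 6))    # 79
```

In this model, missing the chain slot costs roughly one extra vector length of clocks for every multiply-add pair.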
"chime" is a jargon-shortened synonym for chain slot time.
I'm just a beginning programmer (assembler) and from this POV I am merely interested in chip-design and am quite ignorant on this topic, while you seem to know a lot in this area...
I have to bring it up frequently to justify my relative igorance about
modern microprocessors with cache. :-).
looks like? Check out http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html
The issues of vector processing and global communication sometimes get tangled up because of data locality problems, but in general the two issues are related only weakly.
yes, sorry, my error. I should have said that no MP-programming is required for vector-supercomputers since their instruction-set does already contain commands which can easily get parallelized when the data-locality is handled smartly enough. i.e. vector commands could get spread onto multiple processors without the programmer even noticing a difference.
Well, yes and no. There are similarities and important differences
between machines with vector processing units and multi-processor
parallel machines. To quote
http://www.nersc.gov/aboutnersc/pubs/revolution.html
"...look at the industry's history from 1993 to 1996. Cray Research,
the historic leader in supercomputing technology, was unable to
survive financially as an independent company and was acquired by
Silicon Graphics. Two ambitious new companies that introduced new
technologies in the late 1980s and early 1990s -- Thinking Machines
and Kendall Square Research -- were commercial failures. And Intel
discontinued production of its Paragon supercomputer only a few years
after it was introduced."
"During the same time frame, scientists who had finished the laborious
task of writing scientific codes to run on vector parallel
supercomputers learned that those codes would have to be rewritten if
they were to run on the next-generation, highly parallel
architecture."
That's me! :-(. It also says
"Scientists who are not yet involved in high-performance computing are
understandably hesitant about committing their time and energy to such
an apparently unstable enterprise."
That could be you. ;-).
I can't find a good pedagogical explanation of the similarities and
differences between vector and multiprocessor computation (which are
many, and even a cursory discussion would be lengthy), but
http://parallel.hpc.unsw.edu.au/HPCAsia/papers/33.pdf
presents a recent head-to-head comparison. That paper will also give
you an idea of some of the issues involved and give you an idea of why
vector units haven't been pursued aggressively of late.
if you are doing conscious MP-programming on such a computer then of course global communication is an issue, but otherwise (when all the MP-stuff is handled by the processor internally) the problem is well known from cpu-design where multiple execution-units work in parallel on some pre-fetched commands with branch-prediction and stuff. I'm just saying that theoretically the whole multi-processor stuff could be hidden in a supercomputer with vector-computer's instruction-set simply because data is represented as vectors and not memory-positions...
If only life were so simple.
Changing the representation is only a matter of changing the
programming language. In fact, whole libraries full of all the vector
type operations you might want to do are available to hide from you
the ugly details of how the machine is actually doing its dirty work.
If we had good enough languages and/or compilers, then computational
scientists wouldn't have to go through the agonizing code re-writes
that some of us have been through. The amazing thing is that, with
all of the different tricks of modern microprocessors: cache, vector,
SMP, ILP, SIMD, pipelining, super-scalar, (with the notable exceptions
of OoO and SMT), the obstacles to fast computation always come down to
the same thing:
You don't know far enough ahead of time the exact path the code will
take as it threads its way through code with branches (control
indeterminacy) and exactly what data you will need (data
indeterminacy) to keep operational units busy and the pipelines from
getting stalled.
There are probably much more elegant statements around, but getting
around control and data indeterminacy is a big part of the agenda of
computer architecture and code optimization. OoO is such a big win
because it allows you to decide in what order to execute instructions
at the very last moment, without having resolved control and data
dependency issues ahead of time. SMT (hyperthreading) to some extent
accomplishes the same thing.
The Cray-1 in vector mode was designed for and excelled at problems
where data were accessed in one very predictable way (constant stride
in memory) and where control indeterminacy could be finessed often
enough to make it unimportant. As it happens, the Cray-1 also
excelled in scalar mode where practically nothing was known ahead of
time because its main memory was essentially one big cache.
In most problems, there is much more exploitable parallelism than
programmers can manage to make obvious enough to currently available
compilers that the compiler can actually exploit the parallelism. We
need either smarter programmers or smarter compilers, or both...
Or we need some tools that will allow programmers to help the compiler
find exploitable parallelism without being so smart. That's a problem
I'm working on.
It's hard to know from your description whether you are referring to data dependency or a branch embedded in an inner loop as obstacles to vectorization, but there are ways around both of those obstacles for many cases of great interest. There is a general strategy for vectorizing most multi-dimensional simulations of physical phenomena
Of course, it's just that I was referring to the similarity between a vector-computer's capabilities and general MP-strategies. for example when I have 4 bytes and need to double each of them, then loading them into a single 32-bit register and shifting that, with some bit-masking afterwards for the overflow. that's what we all do nowadays, we use the 32-bit wide execution-unit as if it were 4 8-bit-processors...
Your instincts are right in that if you know a lot about the problems
of any kind of parallelism, you have a big head start on understanding
the problems of any other kind of parallelism, because they are all
pretty much the same.
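The four-bytes-in-one-register trick mentioned in the quoted text can be sketched as follows. This is just an illustration of the shift-and-mask idea; the function name `double_packed_bytes` and the little-endian packing are choices made for the example, and each byte is doubled modulo 256 (overflow is discarded, not saturated).

```python
def double_packed_bytes(word):
    """Double four 8-bit lanes packed in one 32-bit word.

    Shifting the whole word left by one doubles every lane, but the top
    bit of each lane leaks into bit 0 of its neighbour. Masking with
    0xFEFEFEFE clears those leaked bits and drops the bit shifted out
    of the top lane, so the result stays within 32 bits.
    """
    return (word << 1) & 0xFEFEFEFE


if __name__ == "__main__":
    lanes = bytes([1, 2, 0x80, 0x7F])
    packed = int.from_bytes(lanes, "little")
    doubled = double_packed_bytes(packed).to_bytes(4, "little")
    print(list(doubled))  # [2, 4, 0, 254]
```

One shift and one mask do the work of four byte-wide operations, which is the sense in which a 32-bit unit acts like four 8-bit processors.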
<snip>
C just isn't the language in which vector-computers should be programmed...
If you *really* understand the shortcomings of C as a language for
vector processing, you will rapidly come to the conclusion that it
usually isn't a very good language for modern microprocessors, because
it makes it very hard for a compiler to resolve control and data-flow
uncertainties.
Vector processors and serious funding for the high-speed interconnect fabrics were the victims of post-cold-war US DoD and DoE self-delusion about something called "COTS", or "commercial off-the-shelf," not for any reason having to do with the class of problems that vector processing could be applied to.
<snip> The Earth Simulator changed all that. Cray is back in business and building vector processors again. That does not mean that the COTS problem has gone away. This very month IBM had to tell the US government, once again, that it was not interested in funding R&D for computers that could not be commercialized, and that, if the US wanted cutting edge supercomputers for special purposes, it would have to come up with the money.
I understand this, you are certainly more experienced than me, I just think that users paying for R&D of vector-computers could have been possible too. in the mid-eighties it was quite clear for many people that MP is the future and I'm still wondering why no one did come up with vector-computers as a basis for that.
I think the mind-bending cost of the earth simulator should give you a
clue to that.
I always envisioned a computer where I plug in a processor, and then another processor into that and so on, until I have a real super computer, but somehow my dream didn't come true. not technical obstacles did block this path, lack of research in vector-computers did. Just my humble opinion...
Do you know about www.beowulf.org? If not, you should pay a visit.
You, too, can dream of owning a supercomputer, or at least your own
computer with all the nasty problems of a supercomputer. ;-).
did anybody actually look up what this japanese vector-supercomputer has been used for?
The earth simulator is billed as being used for earth sciences. Atmospheric modeling, aka weather prediction, has long been one of the most demanding applications for high-performance computing. Weather prediction is useless if the model doesn't run faster than real time, and that's a real challenge.
I guess Japan is merely interested in earthquakes and maybe in hurricanes, they don't seem to have much agriculture... :-)
Up until rather recently, Japan would not allow imported rice, and I
think Japan experiences typhoons rather than hurricanes, but if you
consider weather prediction to be a stand-in for fluid mechanics and
earthquake science to be a stand-in for solid mechanics, you have just
accounted for a huge chunk of all the computational work that is done
for scientific or technical purposes. Both are areas that are
generally well-suited to vector processors.
but seriously, earth-sciences is much bigger than mere Atmospheric modeling, there are enough computers already working on weather-prediction, prediction of ocean-behaviour, consequences from global warming and volcanic activities are much less researched areas of earth-sciences. therefore I ask again: are you sure that weather is a major application of this particular Vector Supercomputer (as opposed to supercomputers in general)?
The more serious questions are (and the only ones to which I think a
crisp answer is possible) will Japan use the Earth Simulator for
routine weather prediction, and if so, how much of the computer's time
does that require? I suspect but do not know that the answer to the
first question is yes, and the answer to the second question is that
weather prediction has to use significantly less than 100% of a
computer's available time to be useful. Asking what is to be done
with the rest would be like asking what is the Hubble Space Telescope
used for.
RM
Piotr Sawuk
08-26-2003, 08:42 PM
In article <dtgkkv4psch0k3fkput1ja14num0270a2l@4ax.com>,
Robert Myers <rmyers@rustuck.com> writes:
On 25 Aug 2003 16:04:20 GMT, piotr5@unet.univie.ac.at (Piotr Sawuk) wrote:
In article <lj4fkvck59ol4fjl115q44h1ueuqlnkqia@4ax.com>, Robert Myers <rmyers@rustuck.com> writes:
<snip>
It wouldn't necessarily, but the whole mentality of Cray-1 programming was that the entire machine operated synchronously. If you set things up correctly, the machine could chain a vector load from memory, multiply, add, and store to memory, with one new result popping out each clock cycle, with memory being physically addressed as 64-bit words and not as bytes, and certainly not as cache lines, because there was no cache. For someone who got used to a machine like that, cache seems like a very odd notion. For certain kinds of problems, cache can actually slow things down. If your data are stored on non-unit stride in memory, half of every 128 bit cache line load is useless if you're doing 64-bit floating point, and you may not be able to keep data around in cache long enough to offset the extra latency of loading into cache and then into a register.
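The "half of every cache line is useless" arithmetic can be checked with a small sketch. It assumes an aligned array whose elements never straddle a line and a cache that always fetches whole lines; the function name `useful_fraction` and its parameters are invented for the example.

```python
def useful_fraction(line_bytes, elem_bytes, stride_elems, n=1024):
    """Fraction of fetched bytes actually used by a strided sweep.

    Walks n elements spaced stride_elems apart, records which cache
    lines are touched, and compares the bytes used with the bytes that
    had to be brought in as whole lines.
    """
    lines_touched = set()
    for i in range(n):
        address = i * stride_elems * elem_bytes
        lines_touched.add(address // line_bytes)
    return (n * elem_bytes) / (len(lines_touched) * line_bytes)


if __name__ == "__main__":
    # 128-bit (16-byte) lines, 64-bit (8-byte) floating point values.
    print(useful_fraction(16, 8, 1))  # unit stride: 1.0
    print(useful_fraction(16, 8, 2))  # stride 2: 0.5
```

With wider lines the waste on non-unit stride grows: under the same assumptions a 64-byte line swept at a stride of 8 elements yields only one useful element in eight.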
yes, it forces the programmer (actually compiler) to actually
think how optimization should be done best...
I *still* don't always know for sure what Stream benchmarks mean on microprocessors because they don't always tell you if they've done something funny with the cache, like skipping over it. Stream tests the ability of a microprocessor to do the kinds of streaming calculation (fetch, multiply, add, store) that the Cray-1 would have excelled at and that show up very frequently in engineering and scientific work. Stream-type calculations and vector machines naturally go together, and cache is generally just an obstacle for Stream-type calculations.
not necessarily, but I guess for current x86 that's the case.
one could use an additional cache for Stream-type operations,
and most importantly one could alter the way the cache does
work for the benefit of the operation. for example in a
vector-computer you have vector-registers which are processed,
a cache is exactly the same as a vector-register since at
each clock-tick data can get transferred into the calculation
(if it did get loaded into cache beforehand). also the waste
of time caused by 128-bit cache-lines I perceive as the same
thing as if my computer would only allow addressing and loading
of 128-bit values into registers. the cache was thought to
work as a "compatible" replacement for vast amounts of registers...
"chime" is a jargon-shortened synonym for chain slot time.
it's a funny concept (although the need for such a thing is
not understandable for me since register and operand are
synonymous for me and there seems to be no reason why the
missed vector-parts should not get processed after the part
of the vector which arrived in-time) but certainly isn't
appropriate anymore since storing data in timing does reduce
a program's flexibility. the CPU might get simpler through
that concept, but self-altering programs get much more
complicated. so I ask you: why did you see the need to
cut your teeth on the notion of chain-slot-time?
BTW, as an obvious rant I have to include here that
such things like Cache-state or chain-slot-time,
the state of all registers and obvious variables
stored in memory should all be viewable in assembler
(while programming) at least to the degree they
are predictable. however, I have yet to find some
editor which does even attempt to calculate them...
"During the same time frame, scientists who had finished the laborious task of writing scientific codes to run on vector parallel supercomputers learned that those codes would have to be rewritten if they were to run on the next-generation, highly parallel architecture." That's me! :-(. It also says
that's the whole problem with chain-slot-time, it depends on
the architecture. much better than my above editor-idea would
be a programming-language which can get programmed to handle
such platform-dependent stuff. the real problem with optimizing
some C-output is that code-growth of the compiler is too slow
when compared to the growth of new generations of architectures.
once upon a time x86-optimization was on par with the development
of new designs, now nobody does actually have a good overview
on all the different types of P4-processors and the differences
in optimization required for each of them, even less on the
new possibilities of optimization arising from a re-design
of the CPU-internals. only the very rough and highly effective
optimizations are applied nowadays, no one even tries to spread
the workload onto separate operation-units as suggested in
the Cray-1 manual you referenced...
"Scientists who are not yet involved in high-performance computing are understandably hesitant about committing their time and energy to such an apparently unstable enterprise." That could be you. ;-). I can't find a good pedagogical explanation of the similarities and differences between vector and multiprocessor computation (which are many, and even a cursory discussion would be lengthy), but http://parallel.hpc.unsw.edu.au/HPCAsia/papers/33.pdf presents a recent head-to-head comparison. That paper will also give you an idea of some of the issues involved and give you an idea of why vector units haven't been pursued aggressively of late.
before I read it: I am aware of differences, I just got the
impression that everything where a single vector-computer
can outperform the ordinary single-processor, a multi-processor
architecture can outperform even that without much change
of code. thereby I did conclude that some extension of the
vector-computer idea might replace traditional SMP-programming...
if you are doing conscious MP-programming on such a computer then of course global communication is an issue, but otherwise (when all the MP-stuff is handled by the processor internally) the problem is well known from cpu-design where multiple execution-units work in parallel on some pre-fetched commands with branch-prediction and stuff. I'm just saying that theoretically the whole multi-processor stuff could be hidden in a supercomputer with vector-computer's instruction-set simply because data is represented as vectors and not memory-positions...
If only life were so simple. Changing the representation is only a matter of changing the programming language. In fact, whole libraries full of all the vector type operations you might want to do are available to hide from you the ugly details of how the machine is actually doing its dirty work. If we had good enough languages and/or compilers, then computational scientists wouldn't have to go through the agonizing code re-writes that some of us have been through. The amazing thing is that, with all of the different tricks of modern microprocessors: cache, vector, SMP, ILP, SIMD, pipelining, super-scalar, (with the notable exceptions of OoO and SMT), the obstacles to fast computation always come down to the same thing: You don't know far enough ahead of time the exact path the code will take as it threads its way through code with branches (control indeterminacy) and exactly what data you will need (data indeterminacy) to keep operational units busy and the pipelines from getting stalled.
that's a good reason to replace cache by registers
and to add the possibility of pre-programming how
program-execution is predicted. 100% prediction
might be impossible, but the programmer certainly
does know more than the CPU. the problem here is
that with each new invention of how a computer
could speed up (for example Cache) the programming
languages would need an additional command to
provide additional info for the compiler to use
this facility more effectively (like for example
some tip on which values should get pre-loaded
into cache in the time of the computer being
idle), but such changes to all programming-languages
at once does not happen very often (it's astounding
that something did move in the area of SMP-programming)...
There are probably much more elegant statements around, but getting around control and data indeterminacy is a big part of the agenda of computer architecture and code optimization. OoO is such a big win because it allows you to decide in what order to execute instructions at the very last moment, without having resolved control and data dependency issues ahead of time. SMT (hyperthreading) to some extent accomplishes the same thing.
the problem with hyperthreading is the question: how many
processors should get emulated for a given CPU to use
a big part of its resources most of the time? this
question needs to be answered before the program gets
written, since a processor offering 80 threads to be
executed at the same time would not benefit much from
a program only using 4 threads...
The Cray-1 in vector mode was designed for and excelled at problems where data were accessed in one very predictable way (constant stride in memory) and where control indeterminacy could be finessed often enough to make it unimportant. As it happens, the Cray-1 also excelled in scalar mode where practically nothing was known ahead of time because its main memory was essentially one big cache. In most problems, there is much more exploitable parallelism than programmers can manage to make obvious enough to currently available compilers that the compiler can actually exploit the parallelism. We need either smarter programmers or smarter compilers, or both...
or a smarter CPU. parallel execution of program-parts was
nice in the eighties where a computer was too expensive
to keep adding new processors every now and then, but
now with multi-processor-programming in some network
it isn't certain anymore how many computers will take
part in some calculation and if this number will stay
the same all the time. now we have Objects instead of
procedures and thereby we are able to tell the CPU
which data does belong together, which code is related
how to other code, or even which values can get stored
in some variable. since the CPU is lacking the functionality
to make use of such data we are bending the compilers
to do all the work. that's where I see the concept of
vector-computers to be more advanced, since it makes
use of the info of which data is thought to be an array-object...
If you *really* understand the shortcomings of C as a language for vector processing, you will rapidly come to the conclusion that it usually isn't a very good language for modern microprocessors, because it makes it very hard for a compiler to resolve control and data-flow uncertainties.
yes, Assembler is better fit for that... :-)
but seriously, I do not see how C could have a problem
in predicting Cache-hits or hyperthreading-problems.
I do see a problem with transforming a loop into
some vector-operation, but the problems of predicting
data and execution-path could easily be solved by
the compiler giving feedback on where it is uncertain
and the programmer (or tool) filling in the gaps.
Where do you see a problem with C?
I think the mind-bending cost of the earth simulator should give you a clue to that.
your Cray-1 manual seems to tell me the problem:
large parts of the CPU aren't doing much useful
work most of the time. Vector-registers stay
empty, byte-operations are executed as 24-bit
operations, synchronization because of some
vector being marked as "reserved", and so on.
for the same price a much better RISC-type
of computer could get created, and this one
would then allow much better control on which
resources are used when...
however, for me this is not explanation enough
for why Vector-computers didn't catch on. one
could have built some more flexibility into
those machines, and the problem would be solved.
larger instruction-set, smaller vector-size,
less registers, and it would be pretty much
the same as the usual 8086 for nearly the
same price...
The more serious questions are (and the only ones to which I think a crisp answer is possible) will Japan use the Earth Simulator for routine weather prediction, and if so, how much of the computer's time does that require? I suspect but do not know that the answer to the first question is yes, and the answer to the second question is that weather prediction has to use significantly less than 100% of a computer's available time to be useful. Asking what is to be done with the rest would be like asking what is the Hubble Space Telescope used for.
I understand how sensitive a Cray-based computer might be
towards timing, but I suspect that at least with solid
mechanics it would easily be possible to evade global
synchronisation and data-exchange. that's why I asked
the question, to check if my guess was correct that
this is the main use of Japan's new computer. prediction
of weather is profitable since it can be sold to other
countries, but I think that with increased use of oceanic
and underground-resources also the need for earth-quake
prediction might rise in the future, since earth-quakes
are sort of the storm of the underground. also the rising
amount of nuclear powerplants makes info on which powerplant
will blow up next (through earthquake or volcanic activity)
quite valuable. just my opinion, I was wondering if Japan
did have similar ideas...
--
Better send the eMails to netscape.net, as to
evade useless burthening of my provider's /dev/null...
before complaining because of my rudeness, read
http://www.unet.univie.ac.at/~a9702387/en/adl/liar-faq.txt
and killfile me...
P