Smart memory hubs being proposed
Yousuf Khan
12-27-2003, 01:56 PM
Both AMD and Intel are proposing a separate but similar new approach to
memory interconnection design for the future. They are dubbing it smart
memory hubs right now, but the details are a little sketchy. It involves
putting some sort of intelligence right into the memory modules.
http://www.eet.com/semi/news/OEG20030508S0023
The initial efforts are aimed at increasing memory density in servers. I'm
not sure how exactly these hubs are supposed to be "smart". I also fail to
see how adding another layer of circuitry in between the memory controller
and memory itself would speed up memory accesses, since it adds another hop
into the equation. However, perhaps these are the successors to the current
SPD ROM that is implanted on every DIMM to describe its architecture to the
memory controller on initialization? Perhaps these hubs send additional
information that SPDs can't send by themselves?
Yousuf Khan
daytripper
12-27-2003, 03:55 PM
On Sat, 27 Dec 2003 21:56:56 GMT, "Yousuf Khan"
<removethisspam.bjsk90.removethispam@hotmail.com> wrote:
Both AMD and Intel are proposing a separate but similar new approach to memory interconnection design for the future. They are dubbing it smart memory hubs right now, but the details are a little sketchy. It involves putting some sort of intelligence right into the memory modules. http://www.eet.com/semi/news/OEG20030508S0023 The initial efforts are aimed at increasing memory density in servers. I'm not sure how exactly these hubs are supposed to be "smart". I also fail to see how adding another layer of circuitry in between the memory controller and memory itself would speed up memory accesses, since it adds another hop into the equation. However, perhaps these are the successors to the current SPD ROM that is implanted on every DIMM to describe its architecture to the memory controller on initialization? Perhaps these hubs send additional information that SPDs can't send by themselves? Yousuf Khan
FB-DIMMs....Might be a lot less there than meets the eye of the article.
FB-DIMMs translate a narrow but very fast memory interconnect into ddr2 sdram
transactions, with each FB-Dimm having an asic (the "hub") doing all of the
things discrete registers and plls used to do - PLUS the memory interconnect
actually passes through the hub on one dimm to get to the next dimm/hub,
through that one to the next, and so on. It's quite extensible, which
addresses the problem of hooking a bunch of dimms to *anything* these days
while maintaining interconnect speed.
Note, however, that memory latency is clearly not addressed in a positive
manner - sticking n pass-thru elements between the nth dimm's drams and the
host chipset rarely results in quicker memory response ;-)
One can surmise the era of (up to) 6MB on-chip caches is expected to reduce
typical miss ratios down to where the even-longer-than-before latency isn't a
significant hit to overall platform performance...
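A back-of-the-envelope sketch of the trade-off described above: every pass-through hub adds to the miss penalty, and a bigger cache only helps to the extent that it cuts the miss ratio. All of the numbers here (hit time, base DRAM latency, per-hop delay, miss ratios) are illustrative assumptions, not FB-DIMM figures.

```python
def miss_penalty_ns(base_ns, hops, per_hop_ns):
    """Latency to reach the DRAMs sitting behind `hops` pass-through hubs."""
    return base_ns + hops * per_hop_ns


def avg_access_ns(hit_ns, miss_ratio, penalty_ns):
    """Average memory access time: hit time + miss ratio * miss penalty."""
    return hit_ns + miss_ratio * penalty_ns


if __name__ == "__main__":
    HIT_NS = 5.0       # assumed on-chip cache hit time
    BASE_NS = 100.0    # assumed latency to a directly attached DIMM
    PER_HOP_NS = 5.0   # assumed delay added by each hub in the chain

    for hops in (1, 4, 8):
        penalty = miss_penalty_ns(BASE_NS, hops, PER_HOP_NS)
        for miss_ratio in (0.05, 0.01):
            amat = avg_access_ns(HIT_NS, miss_ratio, penalty)
            print(f"hops={hops} miss_ratio={miss_ratio:.2f} "
                  f"penalty={penalty:.0f} ns average={amat:.2f} ns")
```

With these made-up inputs, cutting the miss ratio matters far more to the average than the extra hops do, which is the bet the post is describing. It says nothing about workloads whose miss ratio a larger cache cannot reduce.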
And in any case, some powerful marketing forces will be brought to bear to
discourage any thoughts of "This is another iRDRAM marketing disaster waiting
to happen"...
/daytripper (wait for it ;-)
Robert Myers
12-28-2003, 06:02 AM
On Sat, 27 Dec 2003 23:55:30 GMT, daytripper
<day_trippr@REMOVEyahoo.com> wrote:
<snip>
FB-DIMMs....Might be a lot less there than meets the eye of the article. FB-DIMMs translate a narrow but very fast memory interconnect into ddr2 sdram transactions, with each FB-Dimm having an asic (the "hub") doing all of the things discrete registers and plls used to do - PLUS the memory interconnect actually passes through the hub on one dimm to get to the next dimm/hub, through that one to the next, and so on. It's quite extensible, which addresses the problem of hooking a bunch of dimms to *anything* these days while maintaining interconnect speed.
Presumably solving the problems inherent in a multi-drop bus?
Note, however, that memory latency is clearly not addressed in a positive manner - sticking n pass-thru elements between the nth dimm's drams and the host chipset rarely results in quicker memory response ;-) One can surmise the era of (up to) 6MB on-chip caches is expected to reduce typical miss ratios down to where the even-longer-than-before latency isn't a significant hit to overall platform performance...
The 6mb cache is an act of desperation on Intel's part. I don't
_think_ their strategy is to keep increasing cache size. It's a
losing strategy, anyway, unless you go to COMA. Itanium's in-order
architecture is just too inflexible, and the problem is still cache
misses.
Intel will, I gather, move the memory controller onto the die. Other
than that, the strategy of the day (and for the foreseeable future) is
to hide latency, not to address it directly.
RM
Robert Redelmeier
12-28-2003, 07:32 AM
In comp.sys.ibm.pc.hardware.chips Robert Myers <rmyers@rustuck.com> wrote: The 6mb cache is an act of desperation on Intel's part. I don't
Agreed. yet ...
_think_ their strategy is to keep increasing cache size. It's a losing strategy, anyway, unless you go to COMA. Itanium's in-order architecture is just too inflexible, and the problem is still cache misses.
Then how do you explain the _dismal_ performance of the
Celeron4 with only 128 KB L2 and poor showing of the first
P4 with 256 versus the current P4 at 512 KB? These are
all the same P7 core with the same small L1s.
I can't blame Intel for wanting to try more cache.
This is obviously a game of diminishing returns, and the
P4EE seems to be past the point where more cache pays off. 512 KB seems
optimal for current datasets/problems/benchmarques. Cache MATTERS.
Notice also how the AMD K7 improved from 256 to 512.
The Duron, with its tiny 64 KB L2, performs amazingly well.
Decent L1s and the excellent organization of L2 (16 way,
exclusive) save it from the Celeron4's fate.
-- Robert
Robert Myers
12-28-2003, 09:01 AM
On Sun, 28 Dec 2003 15:32:00 GMT, Robert Redelmeier
<redelm@ev1.net.invalid> wrote:
In comp.sys.ibm.pc.hardware.chips Robert Myers <rmyers@rustuck.com> wrote: The 6mb cache is an act of desperation on Intel's part. I don't Agreed. yet ... _think_ their strategy is to keep increasing cache size. It's a losing strategy, anyway, unless you go to COMA. Itanium's in-order architecture is just too inflexible, and the problem is still cache misses. Then how do you explain the _dismal_ performance of the Celeron4 with only 128 KB L2 and poor showing of the first P4 with 256 versus the current P4 at 512 KB? These are all the same P7 core with the same small L1s. I can't blame Intel for wanting to try more cache. This is obviously a game of diminishing returns, and the P4EE seems to be past. 512 KB seems optimal for current datasets/problems/benchmarques. Cache MATTERS. Notice also how the AMD K7 improved from 256 to 512. The Duron, with the tiny 64 KB L2 performs amazingly well. Decent L1s and the excellent organization of L2 (16 way, exclusive) saves it from the Celeron4's fate.
Well of course cache matters, and if the latency is fixed, the
increase in cache size with the speed at which you are retiring
instructions (not clock speed) has to be superlinear, no matter how
you get there. That is to say, cache size will keep increasing,
assuming that processors are able to retire instructions at increasing
speeds.
My only point was that latency still matters. Superficial examination
of early results from the HP Superdome showed that Itanium is
apparently not very tolerant of increased latency, and HP engineers
with whom I've corresponded have not disagreed; i.e., there is a
substantial payoff to be had from a better memory subsystem.
I don't think it needs to be explained to you, but I will make the
point anyway: increased cache does no good if you have no way of
triggering memory fetches far enough ahead of time to make use of the
cache. An OoO processor can just juggle more instructions, but
Itanium currently retires instructions in order. Sooner or later,
Intel has to do something for Itanium other than to increase the cache
size.
RM
Robert Redelmeier
12-28-2003, 10:08 AM
In comp.sys.ibm.pc.hardware.chips Robert Myers <rmyers@rustuck.com> wrote: My only point was that latency still matters. Superficial examination of early results from the HP Superdome showed that Itanium is apparently not very tolerant of increased latency, and HP engineers
Oh, fully agreed. For some apps, latency is _everything_
(linked-lists, TP DB). If the app hopscotches randomly
thru RAM (SETI?) nothing else matters much.
Modern systems have done wonders to deliver bandwidth.
Dual-channel DDR at high clocks. But has much been done
to improve latency from ~130 ns? (old number)
I thought the main idea behind on-CPU memory controllers
was to reduce this to ~70 ns by reduced buffering/queuing.
A smart hub might be able to detect patterns like 2-4-6-8,
4-8-16-20-24 or 5-4-3-2-1 but cannot possibly do anything
with data-driven pseudo-randoms except add latency.
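To make the point above concrete, here is a minimal sketch of constant-stride detection over a history of request addresses. It is an illustration only: the function names and the `min_confirmations` threshold are made up for this example, and nothing here describes what an actual hub does.

```python
def detect_stride(history, min_confirmations=2):
    """Return the constant stride of the most recent requests, or None.

    A stride is reported only when the last `min_confirmations + 1`
    addresses are all separated by the same non-zero difference.
    """
    needed = min_confirmations + 1
    if len(history) < needed:
        return None
    recent = history[-needed:]
    deltas = {b - a for a, b in zip(recent, recent[1:])}
    if len(deltas) == 1:
        stride = deltas.pop()
        return stride if stride != 0 else None
    return None


def predict_next(history, min_confirmations=2):
    """Predict the next address from a detected stride, or None if no pattern."""
    stride = detect_stride(history, min_confirmations)
    return None if stride is None else history[-1] + stride


if __name__ == "__main__":
    print(predict_next([2, 4, 6, 8]))      # ascending stride of 2
    print(predict_next([5, 4, 3, 2, 1]))   # descending stride of 1
    print(predict_next([17, 3, 42, 8]))    # data-driven, no stride to find
```

The first two histories yield a prediction; the third returns None, so a detector like this has nothing to prefetch for pointer-chasing traffic and can only sit in the path.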
Itanium currently retires instructions in order. Sooner or later, Intel has to do something for Itanium other than to increase the cache size.
Are you suggesting Out-of-Order retirement???
Intriguing possibility with a new arch.
Of course, SMT is just a different solution -- keep the CPU
busy with other work during the ~300 clock read stalls.
Good if there are parallel threads/tasks. Useless if not.
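A rough sketch of the arithmetic behind that: convert a memory latency into stall clocks, then estimate how busy the core stays if each thread alternates a burst of work with one stall. The 3 GHz clock, the 100 ns latency and the 100 clocks of work per miss are assumptions picked to land near the ~300 clock figure; thread switching is assumed to be free.

```python
def stall_clocks(latency_ns, clock_ghz):
    """Clocks lost to one memory access of the given latency."""
    return latency_ns * clock_ghz


def utilization(threads, work_clocks, stall):
    """Fraction of clocks doing useful work with `threads` hardware threads.

    Each thread is modelled as `work_clocks` of compute followed by one
    stall; other threads fill the stall if they have work available.
    """
    return min(1.0, threads * work_clocks / (work_clocks + stall))


if __name__ == "__main__":
    stall = stall_clocks(100.0, 3.0)   # assumed 100 ns at an assumed 3 GHz
    for n in (1, 2, 4):
        print(f"{n} thread(s): {utilization(n, 100.0, stall):.0%} busy")
```

With one thread the model core is busy a quarter of the time; extra threads help only if they exist, which is the "useless if not" case.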
-- Robert
Bill Todd
12-28-2003, 10:19 AM
"Robert Redelmeier" <redelm@ev1.net.invalid> wrote in message
news:QHCHb.4150$Lr2.2258@newssvr23.news.prodigy.com... In comp.sys.ibm.pc.hardware.chips Robert Myers <rmyers@rustuck.com> wrote: The 6mb cache is an act of desperation on Intel's part. I don't Agreed. yet ... _think_ their strategy is to keep increasing cache size. It's a losing strategy, anyway, unless you go to COMA. Itanium's in-order architecture is just too inflexible, and the problem is still cache misses. Then how do you explain the _dismal_ performance of the Celeron4 with only 128 KB L2
Market segmentation: Celeron isn't *meant* to perform at levels comparable
to Pentium - else why would people shell out more for the latter?
and poor showing of the first P4 with 256 versus the current P4 at 512 KB?
Compilers have gotten a lot better at optimizing for P4 too over the past
couple of years - the difference from the early P4s is not *just* cache
size.
These are all the same P7 core with the same small L1s.
The above doesn't necessarily mean that P4 may not be somewhat more
sensitive to cache size than its predecessor - but it clearly doesn't
require many MB of cache to perform well, unlike Itanic.
....
Notice also how the AMD K7 improved from 256 to 512.
Doubling cache size usually helps. But doubling cache size from 256 KB to
512 KB is a hell of a lot less expensive (in terms of chip area) than
doubling cache size from 6 MB to 12 MB.
The Duron, with the tiny 64 KB L2 performs amazingly well. Decent L1s and the excellent organization of L2 (16 way, exclusive) saves it from the Celeron4's fate.
Er, no: having 128 KB of L1 cache plus an exclusive L2 that makes the total
cache size effectively 192 KB (vs. the older Athlon's effective cache size
of 128 KB + 256 KB = 384 KB), plus significantly better IPC, is what saves
it from being a dud like Celeron.
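The effective-size arithmetic above can be written out as a small sketch. The Duron and older Athlon sizes are the ones quoted in the post; the inclusive example uses hypothetical sizes chosen only to show the contrast, and the helper name is made up for this illustration.

```python
def effective_cache_kb(l1_kb, l2_kb, exclusive):
    """Unique data an L1 + L2 pair can hold, in KB.

    Exclusive: L2 holds only lines that are not in L1, so capacities add.
    Inclusive: L1 contents are duplicated in L2, so the larger level
    bounds the total.
    """
    return l1_kb + l2_kb if exclusive else max(l1_kb, l2_kb)


if __name__ == "__main__":
    print("Duron:", effective_cache_kb(128, 64, exclusive=True), "KB")
    print("Older Athlon:", effective_cache_kb(128, 256, exclusive=True), "KB")
    # Hypothetical inclusive design with a small L1, for contrast only.
    print("Inclusive example:", effective_cache_kb(16, 128, exclusive=False), "KB")
```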
- bill
Yousuf Khan
12-28-2003, 11:26 AM
"Robert Myers" <rmyers@rustuck.com> wrote in message
news:01otuv89vjgildn0eubp1rutfonno04p2o@4ax.com... The 6mb cache is an act of desperation on Intel's part. I don't _think_ their strategy is to keep increasing cache size. It's a losing strategy, anyway, unless you go to COMA. Itanium's in-order architecture is just too inflexible, and the problem is still cache misses.
What's COMA?
Intel will, I gather, move the memory controller onto the die. Other than that, the strategy of the day (and for the foreseeable future) is to hide latency, not to address it directly.
Yes, but AMD is also proposing something similar, and they've already moved
the memory controller onboard.
Yousuf Khan
Robert Redelmeier wrote:
In comp.sys.ibm.pc.hardware.chips Robert Myers <rmyers@rustuck.com> wrote: The 6mb cache is an act of desperation on Intel's part. I don't Agreed. yet ... _think_ their strategy is to keep increasing cache size. It's a losing strategy, anyway, unless you go to COMA. Itanium's in-order architecture is just too inflexible, and the problem is still cache misses. Then how do you explain the _dismal_ performance of the Celeron4 with only 128 KB L2 and poor showing of the first P4 with 256 versus the current P4 at 512 KB? These are all the same P7 core with the same small L1s. I can't blame Intel for wanting to try more cache. This is obviously a game of diminishing returns, and the P4EE seems to be past. 512 KB seems optimal for current datasets/problems/benchmarques. Cache MATTERS.
The non-Intel crowd has known that for years. But cache is
also expensive.
Notice also how the AMD K7 improved from 256 to 512. The Duron, with the tiny 64 KB L2 performs amazingly well. Decent L1s and the excellent organization of L2 (16 way, exclusive) saves it from the Celeron4's fate. -- Robert
--
After being targeted with gigabytes of trash by the "SWEN" worm, I have
concluded we must conceal our e-mail address. Our true address is the
mirror image of what you see before the "@" symbol. It's a shame such
steps are necessary. ...Charlie
Robert Myers
12-28-2003, 01:00 PM
On Sun, 28 Dec 2003 19:26:26 GMT, "Yousuf Khan"
<removethisspam.bjsk90.removethispam@hotmail.com> wrote:
"Robert Myers" <rmyers@rustuck.com> wrote in message news:01otuv89vjgildn0eubp1rutfonno04p2o@4ax.com... The 6mb cache is an act of desperation on Intel's part. I don't _think_ their strategy is to keep increasing cache size. It's a losing strategy, anyway, unless you go to COMA. Itanium's in-order architecture is just too inflexible, and the problem is still cache misses. What's COMA?
Cache-only memory architecture. The original Crays were effectively
COMA because Seymour used for main memory what everybody else used for
cache. That's why some three-letter-agencies with no use for vector
architectures bought the machines.
Intel will, I gather, move the memory controller onto the die. Other than that, the strategy of the day (and for the foreseeable future) is to hide latency, not to address it directly. Yes, but AMD is also proposing something similar, and they've already moved the memory controller onboard.
Geez, Yousuf, not _everything_ is Intel vs. AMD. ;-). Sometimes a
technical issue is just a technical issue.
I cannot for the life of me get inside the head of whoever makes the
technical calls at Intel, because Intel seems to want to do everything
the hard way. Why, I do not know.
As it happens, Intel's bone-headed approach to computer architecture
works well enough for the kinds of problems I am most interested in,
which involve doing the same thing over and over again in ways that
are stupefyingly predictable and you just want to find a way to do it
very fast. I've often wondered if the secret of the origins of the
Itanium architecture isn't that the engineers who designed it didn't
adequately take into account that most of the world isn't doing
technical computing. That, and the fact that nothing works really
well for the application that matters the most, which is OLTP
(on-line transaction processing).
Itanium happens to interest me also as an intellectual sandbox in
which I can come to grips with things that may be completely obvious
to some people, but not to me. It does well enough for the problems
that interest me, and over the long haul, I expect Intel's bulldozer
approach to architecture and marketing to win. Those things together
are why you think I am an Itanium bigot.
RM
Robert Myers
12-28-2003, 01:20 PM
On Sun, 28 Dec 2003 18:08:07 GMT, Robert Redelmeier
<redelm@ev1.net.invalid> wrote:
In comp.sys.ibm.pc.hardware.chips Robert Myers <rmyers@rustuck.com> wrote: My only point was that latency still matters. Superficial examination of early results from the HP Superdome showed that Itanium is apparently not very tolerant of increased latency, and HP engineers Oh, fully agreed. For some apps, latency is _everything_ (linked-lists, TP dB). If the app hopscotches randomly thru RAM memory (SETI?) nothing else matters much. Modern systems have done wonders to deliver bandwidth. Dual channel DDR at high clocks. But has much been done to improve latency from ~130 ns? (old number)
Close enough.
I thought the main idea behind on-CPU memory controllers was to reduce this to ~70 ns by reduced buffering/queuing.
And it has.
A smart hub might be able to detect patterns like 2-4-6-8, 4-8-16-20-24 or 5-4-3-2-1 but cannot possibly do anything with data-driven pseudo-randoms except add latency.
As David Wang has shrewdly observed, you lose a lot of information once
you are outside the processor. All you have left is the history of
memory requests. How about a Bayesian network to try to infer the
underlying pattern? Lame joke. Doesn't even warrant a smiley.
Itanium currently retires instructions in order. Sooner or later, Intel has to do something for Itanium other than to increase the cache size. Are you suggesting Out-of-Order retirement??? Intriguing possibility with a new arch.
Just another example of what another poster in another group would
call my non-standard use of language. I had already started using the
word retirement and stuck with it for no better reason than that I had
already started using it. I'm not making any bold new proposals for
computer architecture. Just at the moment, my brain is frazzled from
trying to consume an entire branch of mathematics in a very short
time, so I wouldn't recognize a good new idea if I saw one.
On the other hand, there is no reason why the only information to make
it across a memory hub has to be memory requests, and there is no
reason why the only thing a memory location knows about itself is that
it corresponds to a particular address.
Of course, SMT is just a different solution -- keep the CPU busy with other work during the ~300 clock read stalls. Good if there are parallel threads/tasks. Useless if not.
There is another way to use SMT, which is to execute speculative
slices, and there are papers in the literature for simulated-Itania
that show a dramatic improvement. Until we see the details of how SMT
is implemented in Montecito, it won't be obvious whether SMT can
actually be used that way in Montecito or not. If it can, it goes a
long way toward making up for a lack of true run-time scheduling,
since the speculative slice, whose only purpose in life is to trigger
memory requests, is operating in the actual run-time environment, not
one assumed by the compiler.
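A toy timing model of the speculative-slice idea, assuming a linked-list walk where the main thread does some compute per node and the slice does nothing but chase pointers on a second hardware thread. The cycle counts are invented, cache hits are treated as free, and contention between the two threads is ignored; this is not a model of Montecito or of any published simulation.

```python
def run_cycles(n_nodes, compute, miss, use_slice):
    """Cycles for the main thread to walk `n_nodes` nodes.

    Without a slice, every node costs a full miss plus the compute.
    With a slice, node i reaches the cache at (i + 1) * miss because the
    slice issues the dependent loads back to back; the main thread waits
    for that arrival only if it gets there first.
    """
    t = 0
    for i in range(n_nodes):
        if use_slice:
            arrival = (i + 1) * miss
            t = max(t, arrival) + compute
        else:
            t += miss + compute
    return t


if __name__ == "__main__":
    plain = run_cycles(1000, compute=100, miss=300, use_slice=False)
    sliced = run_cycles(1000, compute=100, miss=300, use_slice=True)
    print(plain, sliced)
```

In this model the slice overlaps the compute with the misses, so the walk is bounded by whichever of the two is larger per node rather than by their sum. It cannot beat the serial chain of dependent loads itself.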
RM
Robert Myers wrote:
On Sun, 28 Dec 2003 19:26:26 GMT, "Yousuf Khan" <removethisspam.bjsk90.removethispam@hotmail.com> wrote: "Robert Myers" <rmyers@rustuck.com> wrote in message news:01otuv89vjgildn0eubp1rutfonno04p2o@4ax.com... The 6mb cache is an act of desperation on Intel's part. I don't _think_ their strategy is to keep increasing cache size. It's a losing strategy, anyway, unless you go to COMA. Itanium's in-order architecture is just too inflexible, and the problem is still cache misses. What's COMA? Cache-only memory architecture. The original Crays were effectively COMA because Seymour used for main memory what everybody else used for cache. That's why some three-letter-agencies with no use for vector architectures bought the machines. Intel will, I gather, move the memory controller onto the die. Other than that, the strategy of the day (and for the foreseeable future) is to hide latency, not to address it directly. Yes, but AMD is also proposing something similar, and they've already moved the memory controller onboard. Geez, Yousuf, not _everything_ is Intel vs. AMD. ;-). Sometimes a technical issue is just a technical issue. I cannot for the life of me get inside the head of whoever makes the technical calls at Intel, because Intel seems to want to do everything the hard way. Why, I do not know.
I've wondered whether they're groping around for something they can
patent -- obvious and previously tried solutions don't meet that
criterion, leaving "the hard way" perhaps the preferred way from
their standpoint.
As it happens, Intel's bone-headed approach to computer architecture works well enough for the kinds of problems I am most interested in, which involve doing the same thing over and over again in ways that are stupefyingly predictable and you just want to find a way to do it very fast. I've often wondered if the secret of the origins of the Itanium architecture isn't that the engineers who designed it didn't adequately take into account that most of the world isn't doing technical computing. That, and the fact that nothing works really well for the applications that matters the most, which is OLTP (on-line transaction processing). Itanium happens to interest me also as an intellectual sandbox in which I can come to grips with things that may be completely obvious to some people, but not to me. It does well enough for the problems that interest me, and over the long haul, I expect Intel's bulldozer approach to architecture and marketing to win. Those things together are why you think I am an Itanium bigot. RM
--
...Charlie
Yousuf Khan
12-28-2003, 02:45 PM
"Robert Myers" <rmyers@rustuck.com> wrote in message
news:18huuvo6mfusk4umdvbpns2q7t7prol8dj@4ax.com... Of course, SMT is just a different solution -- keep the CPU busy with other work during the ~300 clock read stalls. Good if there are parallel threads/tasks. Useless if not. There is another way to use SMT, which is to execute speculative slices, and there are papers in the literature for simulated-Itania that show a dramatic improvement. Until we see the details of how SMT is implemented in Montecito, it won't be obvious whether SMT can actually be used that way in Montecito or not. If it can, it goes a long way toward making up for a lack of true run-time scheduling, since the speculative slice, whose only purpose in life is to trigger memory requests, is operating in the actual run-time environment, not one assumed by the compiler.
That's an interesting way of using SMT, but I suspect we won't see such a
sophisticated use of SMT until at least 65nm, possibly 45nm.
SMT in the form of P4's Hyperthreading was done without really adding too
many transistors. However, it looks like any other architecture that
wants to implement SMT will need to add to its transistor count. I hear that
the IBM Power5 will implement SMT, and that it's added 25% to their
transistor count. Probably more of a reflection of the IPC inefficiency
of the P4 architecture than a reflection on Power5's.
Yousuf Khan
Robert Myers
12-28-2003, 02:58 PM
On Sun, 28 Dec 2003 22:21:43 GMT, CJT <abujlehc@prodigy.net> wrote:
Robert Myers wrote:
<snip> I cannot for the life of me get inside the head of whoever makes the technical calls at Intel, because Intel seems to want to do everything the hard way. Why, I do not know. I've wondered whether they're groping around for something they can patent -- obvious and previously tried solutions don't meet that criterion, leaving "the hard way" perhaps the preferred way from their standpoint.
That is probably the correct explanation.
RM
Yousuf Khan
12-28-2003, 06:00 PM
"CJT" <abujlehc@prodigy.net> wrote in message
news:3FEF577B.4010608@prodigy.net... I've wondered whether they're groping around for something they can patent -- obvious and previously tried solutions don't meet that criterion, leaving "the hard way" perhaps the preferred way from their standpoint.
Yeesh, if that were the case at Intel, I wonder if they have teams of
managers reviewing and shooting down ideas that are too radical, yet not
proprietary enough? :-)
Yousuf Khan
Bill Todd
12-28-2003, 08:57 PM
"Yousuf Khan" <removethisspam.bjsk90.removethispam@hotmail.com> wrote in
message news:t2JHb.192556$ea%.1351@news01.bloor.is.net.cable.rogers.com...
....
SMT in the form of P4's Hyperthreading was done without really adding too many transistors. However, it looks like any other architectures if they want to implement SMT will need to add to the transistor count.
IIRC the reported impact on EV8 (for either 4- or 8-way SMT, I'm not sure
which) was only a few percent.
I hear that the IBM Power5 will implement SMT, and that it's added 25% to their transistor count.
I've seen that number as well and am curious what it actually refers to
(given the EV8 experience plus comments from the SMT researchers at UWash
about the minimal added chip-area costs of SMT). It's possible that Px's
use of instruction groups aggravates the problem, or that IBM is quoting the
impact of side-effects rather than just SMT per se (e.g., additional cache
to accommodate the increased use by having additional threads), or that IBM
is referring only to the impact on the size of the processor core rather
than to the overall chip area (which includes not only significant amounts
of L2 cache but memory control and inter-chip routing logic plus, for P5,
reportedly some kinds of on-chip offload engines for specific tasks).
- bill
Yousuf Khan
12-28-2003, 10:19 PM
"Bill Todd" <billtodd@metrocast.net> wrote in message
news:F_6dnfreQqiuKXKiRVn-vg@metrocast.net... IIRC the reported impact on EV8 (for either 4- or 8-way SMT, I'm not sure which) was only a few percent.
Perhaps Alpha is closest in philosophy to P4, except from an earlier
generation? That is, high frequencies but low IPCs. After all, Alpha was the
MHz king of processors for years prior to the crown being taken over by x86
processors. During Alpha's reign on the MHz pile, its contemporaries (Sparc,
MIPS, Power, PA-RISC, etc.) seemed to be relatively competitive still,
despite not producing the high MHz that Alpha did.
I hear that the IBM Power5 will implement SMT, and that it's added 25% to their transistor count. I've seen that number as well and am curious what it actually refers to (given the EV8 experience plus comments from the SMT researchers at UWash about the minimal added chip-area costs of SMT). It's possible that Px's use of instruction groups aggravates the problem, or that IBM is quoting
the impact of side-effects rather than just SMT per se (e.g., additional cache to accommodate the increased use by having additional threads), or that
IBM is referring only to the impact on the size of the processor core rather than to the overall chip area (which includes not only significant amounts of L2 cache but memory control and inter-chip routing logic plus, for P5, reportedly some kinds of on-chip offload engines for specific tasks).
I don't have that information, but I was also just working from the
assumption that they were talking about a 25% increase in the size of just
the inner core, not the overall die size.
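A quick sketch of why the distinction matters: if the 25% applies to the core only, the growth of the whole die scales with the share of the die that the core occupies. The 30% core share below is a purely hypothetical figure for illustration, not a Power5 number.

```python
def die_growth(core_growth, core_fraction):
    """Fractional growth of the whole die when only the core grows.

    `core_growth` is the relative increase of the core (0.25 for 25%);
    `core_fraction` is the share of the original die taken by the core.
    """
    return core_growth * core_fraction


if __name__ == "__main__":
    # Hypothetical: core occupies 30% of the die before SMT is added.
    print(f"{die_growth(0.25, 0.30):.1%} growth in overall die size")
```

Under that assumption a 25% larger core is a single-digit increase for the chip as a whole, which would put it closer to the "few percent" figure quoted for EV8.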
Yousuf Khan
Bill Todd
12-28-2003, 10:37 PM
"Yousuf Khan" <bbbl67.spam@yahoo.com.nospam> wrote in message
news:gIPHb.225446$%TO.27995@twister01.bloor.is.net.cable.rogers.com... "Bill Todd" <billtodd@metrocast.net> wrote in message news:F_6dnfreQqiuKXKiRVn-vg@metrocast.net... IIRC the reported impact on EV8 (for either 4- or 8-way SMT, I'm not
sure which) was only a few percent. Perhaps Alpha is closest in philosophy to P4, except from an earlier generation? That is high frequencies, but low IPCs.
Nope. While that characterization might have had some validity in early
Alphas, by the time EV6 appeared Alpha's IPC was competitive with anyone's
(and better than most) - and, unfortunately, soon thereafter Compaq lost
interest in pushing Alpha clock-rates up (after Capellas took over and
reversed Pfeiffer's intention to market Alpha against the expected Itanic)
so the *only* thing that kept Alpha ahead of the pack was its IPC (until it
fell a full process generation behind as well more recently).
EV8, by virtue of its 8-way issue and even greater number of in-flight
instructions, would have had significantly better IPC than the rest of the
world - leaving aside the impact of SMT on effective IPC.
....
I hear that the IBM Power5 will implement SMT, and that it's added 25% to their transistor count. I've seen that number as well and am curious what it actually refers to (given the EV8 experience plus comments from the SMT researchers at
UWash about the minimal added chip-area costs of SMT). It's possible that
Px's use of instruction groups aggravates the problem, or that IBM is quoting the impact of side-effects rather than just SMT per se (e.g., additional
cache to accommodate the increased use by having additional threads), or that IBM is referring only to the impact on the size of the processor core rather than to the overall chip area (which includes not only significant
amounts of L2 cache but memory control and inter-chip routing logic plus, for
P5, reportedly some kinds of on-chip offload engines for specific tasks). I don't have that information, but I was also just working from the assumption that they were talking about a 25% increase in the size of just the inner core, not the overall die size.
Depending on what the EV8 percentages referred to, that might be possible:
my impression is that the POWERx core itself is pretty compact.
- bill
Keith R. Williams
12-29-2003, 05:23 PM
In article <t2JHb.192556$ea%.1351@news01.bloor.is.net.cable.rogers.com>,
removethisspam.bjsk90.removethispam@hotmail.com says... SMT in the form of P4's Hyperthreading was done without really adding too many transistors. However, it looks like any other architectures if they want to implement SMT will need to add to the transistor count. I hear that the IBM Power5 will implement SMT, and that it's added 25% to their transistor count. Probably more of a reflection about the IPC inefficiency of the P4 architecture than a reflection on Power5's.
Me thinks apples <> oranges.
--
Keith
Tony Hill
12-29-2003, 06:07 PM
On Sun, 28 Dec 2003 18:08:07 GMT, Robert Redelmeier
<redelm@ev1.net.invalid> wrote: In comp.sys.ibm.pc.hardware.chips Robert Myers <rmyers@rustuck.com> wrote: My only point was that latency still matters. Superficial examination of early results from the HP Superdome showed that Itanium is apparently not very tolerant of increased latency, and HP engineers Oh, fully agreed. For some apps, latency is _everything_ (linked-lists, TP dB). If the app hopscotches randomly thru RAM memory (SETI?) nothing else matters much. Modern systems have done wonders to deliver bandwidth. Dual channel DDR at high clocks. But has much been done to improve latency from ~130 ns? (old number)
Actually yes, though it wasn't anything obvious. More just that
reducing latency was a major emphasis of current memory controllers.
nVidia and Intel were the first to get it right, and they both did a
bang-up job with their nForce2 and i875 chipsets respectively.
Latency numbers have dropped down to ~100ns on both chipsets (though
I've seen all sorts of different latency numbers depending on just how
this is being measured).
I thought the main idea behind on-CPU memory controllers was to reduce this to ~70 ns by reduced buffering/queuing.
On-chip memory controllers reduce latency in a few ways, and it
works. Even with the greatly improved memory controllers from nVidia
and Intel (and now SiS and VIA have more or less caught up), the
Athlon64 and Opteron still have noticeably less latency. In fact,
even with registered memory the Opteron has lower latency than a P4
with unbuffered memory.
Unfortunately there is only so much that can be done here. When you
get right down to it, DRAM has high latency, and nothing you do on the
memory controller side of things can change that. The real solution
to latency is to replace DRAM with something new.
Itanium currently retires instructions in order. Sooner or later, Intel has to do something for Itanium other than to increase the cache size. Are you suggesting Out-of-Order retirement??? Intriguing possibility with a new arch.
I think he's merely suggesting out-of-order execution. I don't know
how well this would work with the IA-64 instruction set, but I suppose
it should be possible.
-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
Tony Hill
12-29-2003, 06:07 PM
On Sun, 28 Dec 2003 13:19:13 -0500, "Bill Todd"
<billtodd@metrocast.net> wrote: "Robert Redelmeier" <redelm@ev1.net.invalid> wrote in message news:QHCHb.4150$Lr2.2258@newssvr23.news.prodigy.com... Then how do you explain the _dismal_ performance of the Celeron4 with only 128 KB L2 Market segmentation: Celeron isn't *meant* to perform at levels comparable to Pentium - else why would people shell out more for the latter?
That may be part of it, but I don't think that Intel intended the
Celeron to be as terrible as it is. I'm also quite certain that Intel
does NOT want people to know about this, as I'm sure most customers
would rather purchase a $35 Duron running at 1.6GHz if they knew it
was consistently faster than Intel's $130 Celeron 2.8GHz processor.
The Celeron is clearly being marketed at the uninformed at this point
in time, because its performance is absolutely abysmal. It's only
with about the 2.6GHz (P4-style) Celeron that they've finally managed
to equal the performance of the old 1.4GHz (PIII-style) Celeron.
and poor showing of the first P4 with 256 versus the current P4 at 512 KB? Compilers have gotten a lot better at optimizing for P4 too over the past couple of years - the difference from the early P4s is not *just* cache size.
It's not *just* cache size, but cache size alone did make a noticeable
difference, to the tune of 10% on most applications. The P4 is a very
cache-hungry processor. This is not surprising as its whole design
is based around having a constant stream of instructions (and data for
those instructions) to execute. Any time that stream is interrupted,
performance starts dropping real fast.
These are all the same P7 core with the same small L1s. The above doesn't necessarily mean that P4 may not be somewhat more sensitive to cache size than its predecessor - but it clearly doesn't require many MB of cache to perform well, unlike Itanic.
That is true, the Itanium is even more cache sensitive than the P4,
however the P4 seems to be more cache sensitive (err, L2 cache
sensitive at least) than the Athlon64 or AthlonXP.
Notice also how the AMD K7 improved from 256 to 512. Doubling cache size usually helps. But doubling cache size from 256 KB to 512 KB is a hell of a lot less expensive (in terms of chip area) than doubling cache size from 6 MB to 12 MB.
Yup, and that's the problem with ever-increasing cache: you really
need to double the size for it to be worthwhile and even then you are
often talking about a case of diminishing returns. For the P4, the
difference between the 128KB cache Celerons and the 256KB cache
"Willamette" P4 is HUGE (often 25% or more). Going from 256K to the
512K "Northwood" resulted in a smaller but still very noticeable
improvement (~10% as mentioned above). It's likely that doubling
again for "Prescott" P4s will result in a smaller improvement but
probably still ~5% (though the Prescott will contain other
enhancements, so this will be very tough to measure).
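
To put those three doublings side by side, the rough percentages above can be compounded into one cumulative factor. This is just back-of-the-envelope arithmetic on the figures quoted in this post (about 25%, 10%, and a guessed 5%), not a measurement, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Compound a list of per-step gains (0.25 means +25%) into one overall
   speedup factor relative to the starting point. */
static double cumulative_speedup(const double *gains, size_t count)
{
    assert(gains != NULL || count == 0);
    double factor = 1.0;
    for (size_t i = 0; i < count; i++)
        factor *= 1.0 + gains[i];
    return factor;
}
```

Feeding it the gains 0.25, 0.10 and 0.05 gives 1.25 x 1.10 x 1.05 = 1.44375, so going from 128KB all the way to 1MB would be worth roughly 44% overall on these estimates, with more than half of that coming from the first doubling alone.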
-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
James Boswell
01-08-2004, 04:55 PM
Robert Redelmeier <redelm@ev1.net.invalid> wrote: In comp.sys.ibm.pc.hardware.chips Robert Myers <rmyers@rustuck.com> wrote: My only point was that latency still matters. Superficial examination of early results from the HP Superdome showed that Itanium is apparently not very tolerant of increased latency, and HP engineers Oh, fully agreed. For some apps, latency is _everything_ (linked-lists, TP dB). If the app hopscotches randomly thru RAM memory (SETI?) nothing else matters much. Modern systems have done wonders to deliver bandwidth. Dual channell DDR at high clocks. But has much been done to improve latency from ~130 ns? (old number)
the Athlon64 3400+ returns memory latency numbers in the region of 45ns,
that's single channel of course, but.. damn.
-JB
Robert Myers
01-08-2004, 10:46 PM
On Fri, 9 Jan 2004 00:55:32 +0000 (UTC), "James Boswell"
<JamesBoswell@Btopenworld.com> wrote:
<snip> the Athlon64 3400+ returns memory latency numbers in the region of 45ns, that's single channel of course, but.. damn.
Yeah, that's hot. Intel's approach to latency is <expletive-deleted>
annoying.
RM