Question: memory layout multi proc. for Opteron systems??


Guest
11-10-2003, 08:26 PM
I've asked this before but I don't think I got an answer.

What is the memory layout on dual Opteron systems?
For example:
http://www.tyan.com/products/html/thunderk8w.html

Do I put memory in both banks? Is there any OS function that controls
what processor gets what (the "closest") memory? I am just running vanilla
(32bit) Linux.

Also, does anyone have "stream" benchmarks for this machine? I'm curious
what improvement results from the on-chip memory controller and, if
applicable, what degradation results from HyperTransport.

Thanks for any info,
Richard

Edmund
11-10-2003, 09:08 PM
Download the motherboard manual (Acrobat PDF file):
http://www.tyan.com/support/html/manuals.html

Thunder K8W, section 2.07 - Installing The Memory (pages 19-22).

Ed

Guest
11-10-2003, 09:58 PM
Thanks for the information. I just checked. OK I guess the chips will access
memory more or less transparently, no matter where it is.

But my questions still remain:

- What is the performance of accessing memory directly versus going through
HyperTransport? I would like to see both numbers, since I expect
direct memory accesses to improve due to the on-chip controller too.
I guess it is possible that HyperTransport adds very little to the
access time, compared to the memory itself. But I'd like to see
some numbers.

- Will Linux (or any other OS) make sensible decisions, e.g., allocating
new memory on the CPU it is running on, and, if possible, keeping the
thread/process on that processor in the future? If so, this architecture
could scale very well.

Richard

Rob Stow
11-11-2003, 12:45 AM
There is a PDF at AMD that, among other things, discusses your last
question. Basically, by default, when a *processor* needs memory,
it will try to allocate from the banks "attached" to it before it uses
memory "attached" to other processors.

However, it is *possible* for someone to create an OS that micro-manages
the memory allocations instead of letting the processors decide. AFAIK,
Linux and 64 bit Windows will *not* override the default. 32 bit versions
of Windows pretty much had their feature sets fixed before this became
something for Microsoft to worry about, so they will of course simply let
the processors take care of this.

And yes, this architecture scales *very* well. Scaling from one
to 2, or 2 to 4 processors with Opterons gives *much* better results
than the same kind of scaling with Itanic or Xeon. Opty vs Xeon
benchmarks that show the benefits of the Opteron's scaling are easy
to find on the web, while Opty vs Itanic takes a lot more searching.

Note also that AMD left the door open for any chipset/motherboard
manufacturer who wants to provide their own memory controller instead
of using the ones in the processors. AFAIK, no one has done this yet.



--
Reply to newsgroup only please. This e-mail account is real
but effectively abandoned because of excessive spamming.

Tony Hill
11-11-2003, 01:43 AM
On 11 Nov 2003 00:58:48 -0500, Mannr@uwaterloo.ca wrote:
> Thanks for the information. I just checked. OK I guess the chips will access
> memory more or less transparently, no matter where it is.

Yup, that about covers it.

> But my questions still remain:
>
> - What is the performance of accessing memory directly and going through
>   hypertransport? I would like to see both numbers, since I expect
>   direct memory accesses to improve due the onchip controller too.
>   I guess it is possible that hypertransport adds very little to the
>   access time, compared to the memory itself. But I'd like to see
>   some numbers.

AMD had some numbers at various times. I believe their saying was
that the difference between a local and remote memory access was less
than the difference between a page hit and a page miss. My
understanding is that the latency of local memory access is roughly
115 clock cycles at 2GHz, and something on the order of 150 clock
cycles for remote memory. I'm afraid that I don't have the documents
on hand to back up those numbers though, but they're at least in the
right ball-park.

For comparison, with an external memory controller you would tend to
be looking at about 170 or 180 clock cycles latency at 2GHz.
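
The cycle counts above are easier to compare once converted to nanoseconds. The following is only a worked sketch: it takes the approximate figures quoted in this post (115, 150, and 170 to 180 cycles at 2GHz) at face value, and the function name is made up for illustration.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in CPU clock cycles to nanoseconds."""
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz


CLOCK_GHZ = 2.0  # clock speed assumed in the post

# Approximate figures from the post, not measured values.
latency_cycles = {
    "local memory": 115,
    "remote memory (one HyperTransport hop)": 150,
    "external controller, low estimate": 170,
    "external controller, high estimate": 180,
}

for label, cycles in latency_cycles.items():
    print(f"{label}: {cycles} cycles = {cycles_to_ns(cycles, CLOCK_GHZ):.1f} ns")

# Extra cost of a remote access relative to a local one.
remote_penalty = (150 - 115) / 115
print(f"remote penalty: {remote_penalty:.0%}")
```

On these figures a remote access costs roughly 30% more than a local one, yet still comes in under the external-controller estimate.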

> - Will Linux (or any other OS) make sensible decisions, eg., allocating
>   new memory on the CPU it is running on, and, if possible, keeping the
>   thread/process on that processor in the future? If so, this
>   architecture could scale very well.

What you are talking about is NUMA optimizations (having your OS know
about local and remote memory) and processor affinity (having
processes stick to one processor so that they don't thrash your
cache).

The Linux 2.4 kernel contains no NUMA optimizations and very little
(none?) in the way of processor affinity optimizations. The upcoming
2.6 kernel has a significantly improved scheduler with much
better processor affinity. I believe that WinXP also has some
processor affinity.
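
Besides whatever the scheduler does on its own, a process can also be pinned explicitly. The sketch below uses Python's `os.sched_getaffinity` and `os.sched_setaffinity`, which exist only on platforms that provide the underlying scheduler calls (Linux among them). This is a present-day interface shown for illustration, not something described in this thread, and the helper name `pin_to_one_cpu` is hypothetical.

```python
import os


def pin_to_one_cpu(pid: int = 0):
    """Restrict a process to a single CPU.

    Returns the resulting set of allowed CPUs, or None where the
    affinity calls are not available. pid 0 means the calling process.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform does not expose affinity control

    allowed = os.sched_getaffinity(pid)  # CPUs the process may run on now
    target = min(allowed)                # arbitrary choice: lowest-numbered CPU
    os.sched_setaffinity(pid, {target})  # restrict the process to that CPU
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    print("allowed CPUs after pinning:", pin_to_one_cpu())
```

Pinning only keeps the process on one processor so its cache stays warm; it does not by itself say anything about where the kernel places the process's memory.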

NUMA is a bit trickier. Again, the 2.6 Linux kernel does have NUMA
support, but according to one kernel developer, it doesn't do much of
anything for the Opteron.

As for scalability, the Opteron is already showing some very promising
signs of scaling well, and it should continue to improve as the
software becomes better optimized. This can most easily be seen when
comparing SPEC CPU2000 "rate" numbers. You can find identical systems
being used in 1, 2 and 4 processor mode for both an AMD Opteron system
(Einux A4800, listed under AMD's name) and an Intel XeonMP system
(Dell PowerEdge 6650, listed under Dell's name). In CINT_rate for
single processor systems, the 2.8GHz XeonMP is 5% faster than the
1.8GHz Opteron. For the dual-processor systems, the difference drops
to less than 1%, and for the 4-processor systems, the Opteron is just
over 2% faster.

Of course, that scalability is nothing compared to what you see in
SPEC CFP2000_rate, where the same two systems are again tested in 1,
2 and 4 processor modes. Here, the Opteron starts out 8% faster in a
1P system, then extends its lead to 38% in a 2P system. The real
kicker though is in the 4 processor system, where the Opteron is no
less than 94% faster than the XeonMP system!
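
The percentage leads quoted above can be turned into a statement about scaling itself. If the Opteron/Xeon throughput ratio is r1 at 1P and r4 at 4P, then the Opteron's 1P-to-4P speedup divided by the Xeon's is r4 / r1. The sketch below applies that to the rounded percentages from this post; the function name is illustrative, and "just over 2%" is taken as exactly 2%.

```python
def relative_scaling(lead_small: float, lead_large: float) -> float:
    """Opteron speedup divided by Xeon speedup between two system sizes.

    Each argument is an Opteron/Xeon throughput ratio, so 1.08 means the
    Opteron is 8% faster and 1/1.05 means the Xeon is 5% faster.
    """
    return lead_large / lead_small


# CINT_rate: Xeon 5% faster at 1P, Opteron about 2% faster at 4P.
cint = relative_scaling(1 / 1.05, 1.02)

# CFP_rate: Opteron 8% faster at 1P, 38% at 2P, 94% at 4P.
cfp_2p = relative_scaling(1.08, 1.38)
cfp_4p = relative_scaling(1.08, 1.94)

print(f"CINT_rate 1P->4P: Opteron scales {cint:.2f}x as well as the Xeon")
print(f"CFP_rate  1P->2P: {cfp_2p:.2f}x")
print(f"CFP_rate  1P->4P: {cfp_4p:.2f}x")
```

Read this way, the integer results show only a modest scaling advantage, while in floating point the Opteron gains close to 1.8 times as much as the Xeon when going from one processor to four.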

As you suggest, the Opteron does indeed scale very well, at least up
to 4 processor systems (no one sells anything bigger at this point in
time). Now, to be fair, comparing it to a Xeon doesn't say much,
because the Xeon is well known not to scale well at all. However, the
scaling that the Opteron shows is pretty darn close to that of the Sun
UltraSparcIII and the IBM Power4, which puts it in pretty good
company.

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca

Rui Pedro Mendes Salgueiro
11-11-2003, 02:25 AM
Mannr@uwaterloo.ca wrote:
> I've asked this before but I don't think I got an answer.
> What is the memory layout on dual Opteron systems? For example:
> http://www.tyan.com/products/html/thunderk8w.html
> Do I put memory in both banks?

Yes. So you will need 4 DIMMs (2 banks, each 128 bits wide,
so 2 DIMMs per bank).
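
The DIMM count follows from simple arithmetic, sketched below. It assumes the standard 64-bit DIMM data width (ECC bits not counted) and one 128-bit bank per processor, as described above; the function name is made up for illustration.

```python
DIMM_DATA_WIDTH_BITS = 64  # assumed: standard DIMM data width, excluding ECC


def dimms_needed(bank_width_bits: int, banks: int) -> int:
    """Minimum number of DIMMs needed to populate every bank at full width."""
    dimms_per_bank, remainder = divmod(bank_width_bits, DIMM_DATA_WIDTH_BITS)
    if remainder:
        raise ValueError("bank width must be a multiple of the DIMM width")
    return dimms_per_bank * banks


# Dual Opteron: one 128-bit bank per processor.
print(dimms_needed(128, 2))
```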

> Is there any OS function that controls what processor gets what
> (the "closest") memory?

Check the links found by:
http://www.google.com/search?q=linux+NUMA

(NUMA = Non-Uniform Memory Access)

Linux Scalability Effort: NUMA Group Homepage
(links to information about the various Linux on NUMA projects)
lse.sourceforge.net/numa/

Linux: NUMA Awareness Added To Scheduler
(posted by jeremy on Wednesday, January 22, 2003)
kerneltrap.org/node/view/562

The second one seems to show that the Linux Scheduler is aware of
NUMA (probably only the bleeding-edge versions).

> I am just running vanilla (32bit) Linux.

Suse 9.0 64 bit is out now.
SuSE Linux 9.0 Professional 64 Bit Edition
http://www.amazon.co.uk/exec/obidos/ASIN/B0000UI2WS/

http://shop.mensys.nl/catalogue/mns_SuSELinux.html

I am waiting for an Opteron processor (and memory) to test it (all
the other hardware is already in the shop). Maybe today or tomorrow.

> Also, does anyone have "stream" benchmarks for this machine?

http://www.cs.virginia.edu/stream/#PeeCeeResults
I think there are no Opteron results there (or I didn't find them).
But the code is easy to compile and run, so you can do it yourself.

Some days ago I found this:

http://wwwseminars.web.cern.ch/wwwseminars/2003/2003-OtherFormats/t-20030903.ppt

In slide 13 there is:
                        1x Stream:       2x Stream:       4x Stream:
2x Opteron, 1.8 GHz,
HyperTransport:         1006 1671 MB/s   975 1178 MB/s    924 1133 MB/s
2x Xeon, 2.4 GHz,
400 MHz FSB:            1202 1404 MB/s   561  785 MB/s    365  753 MB/s

I found these numbers a bit suspect, because I don't know if the benchmark
works well when 2 copies or more are run at the same time.

> I'm curious what improvement results from the onchip memory controller has,

For comparison, results from a P4 2.8 GHz with an Intel 875 chipset:

Function    Rate (MB/s)   RMS time   Min time   Max time
Copy:       2666.6667     0.0802     0.0600     0.0900
Scale:      2666.6667     0.0792     0.0600     0.1000
Add:        3000.0000     0.1011     0.0800     0.1200
Triad:      3000.0000     0.1022     0.0800     0.1300

> and if applicable, what degradation results from the hypertransport.

If you have the time, you could run it once with both banks filled and
once with only one. But as STREAM reports the best time, it might not
show anything interesting.
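
To make the "best time" point concrete, here is a rough sketch of how a STREAM rate is derived from the minimum time: bytes moved by the kernel divided by the fastest run. The array length of 10,000,000 doubles is not stated anywhere in this thread; it is inferred because it reproduces the P4 figures above exactly. The names in the code are illustrative, not taken from the STREAM source.

```python
BYTES_PER_DOUBLE = 8

# Number of arrays each kernel reads or writes per element.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Assumption: array length inferred from the reported rates, not from the post.
N_ELEMENTS = 10_000_000


def stream_rate_mb_s(kernel: str, n_elements: int, min_time_s: float) -> float:
    """Rate in MB/s (1 MB = 10**6 bytes) based on the fastest timed run."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_DOUBLE * n_elements
    return 1.0e-6 * bytes_moved / min_time_s


# Minimum times from the P4 table above.
min_times = {"Copy": 0.06, "Scale": 0.06, "Add": 0.08, "Triad": 0.08}

for kernel, t in min_times.items():
    print(f"{kernel}: {stream_rate_mb_s(kernel, N_ELEMENTS, t):.4f} MB/s")
```

Because only the minimum time enters the rate, slower runs (for instance ones that happened to hit remote memory) would not show up in the headline number. The suspiciously round rates also suggest the timer in that run only resolved 0.01 s.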

--
http://www.mat.uc.pt/~rps/

.pt is Portugal| `Whom the gods love die young'-Menander (342-292 BC)
Europe | Villeneuve 50-82, Toivonen 56-86, Senna 60-94

Felger Carbon
11-11-2003, 02:30 AM
"Rob Stow" <rob.stow@sk.sympatico.ca> wrote in message
news:vr185t9f7n9g82@corp.supernews.com...
> And yes, this architecture scales *very* well. Scaling from one
> to 2, or 2 to 4 processors with Opterons give *much* better results
> than the same kind of scaling with Itanic or Xeon. Opty vs Xeon
> benchmarks that show the benefits of Opterons scaling are easy
> to find on the web, while Opty vs Itanic takes a lot more searching.

_Generally_ correct. However...

The reason that the Opteron (generally) scales better than Xeon etc. is
that the Opteron has a NUMA memory structure, with each processor having
its own memory, while MP Xeons have one central shared memory. Let me
emphasize that I definitely prefer the Opteron's memory structure.
However, there are some algorithms/applications, including some
commercially important applications, where a shared memory SMP system is
best.

So the answer to the question of whether the Opteron or Xeon MPs are
better is, "usually the Opteron is better, but sometimes the Xeon is
better."

Yousuf Khan
11-11-2003, 07:28 AM
<Mannr@uwaterloo.ca> wrote in message
news:yua3ccv78fb.fsf@tapir.uwaterloo.ca...
> - What is the performance of accessing memory directly and going through
>   hypertransport? I would like to see both numbers, since I expect
>   direct memory accesses to improve due the onchip controller too.
>   I guess it is possible that hypertransport adds very little to the
>   access time, compared to the memory itself. But I'd like to see
>   some numbers.

Well, I think others have already provided the numbers. One thing to keep in
mind is that Hypertransport is so fast that AMD doesn't want developers to
spend an inordinate amount of time taking NUMA effects into account. They
have invented a new moniker they call SUMO (Sufficiently Uniform Memory
Organization, or something like that). I think they claim that accessing
memory through the Hypertransport link (HTL) is faster than the traditional
shared bus memory controllers anyways, because HTL doesn't have any bus
contention issues (they are all dedicated point-to-point links). So worrying
about optimizing an OS kernel is without merit, at least when it comes to
processors connected via HTL. I think the original idea of HTL was for it to
be the traditional full front side bus of the processor, including the link
to the external memory controller, but when they decided to integrate the
memory controller into the processor, then the HTL was freed up to become
simply the link to the I/O chips and the other Opterons.

They would rather that NUMA tuning efforts be geared towards external
interconnects. For example, with HTL one could place up to 8 Opterons in a
SUMO configuration inside a single systemboard, but once you want to use
more than 8 Opterons, you have to add additional system boards, and connect
each of the systemboards to each other via some interconnect other than HTL.
So they suggest treating SUMO simply like SMP, while slower systemboard to
systemboard interconnects can be treated like NUMA.

> - Will Linux (or any other OS) make sensible decisions, eg., allocating
>   new memory on the CPU it is running on, and, if possible, keeping the
>   thread/process on that processor in the future? If so, this
>   architecture could scale very well.

I think they've already discovered that the architecture can scale extremely
well, even without taking NUMA into account.

Yousuf Khan

Guest
11-11-2003, 09:21 PM
Thanks for the detailed information!

