On the Feasibility of the Broadband Engine


PS3
04-16-2004, 05:10 PM
Broadband Engine is the reported Cell-based CPU of PlayStation 3

http://www.beyond3d.com/forum/viewtopic.php?t=11600&postdays=0&postorder=asc&start=0

QUOTE:
_____________________________________________________________________
With all the discussion on the feasibility of the Broadband Engine as seen
in the well-known Suzuoki/SCE and IBM/STI patents and in a few threads in
which some (you can guess who) used BlueGene L as a base, I had a feeling
something didn't add up. So, I did some looking around and have an
alternative concept of the situation which I'm not saying is totally
accurate, but I propose that we have a central, civil, discussion on this.

I'm going to present some numbers, show where they're from, and then open it
up for people to draw their own conclusions and comments. We realize that
these numbers are ballpark, but instead of shunning the variance, let's use
it as a pseudo-constant, a fudge factor if you will, that will allow us to
have a rough estimate of what can be done, what can't and where it all
stands. So, without further ado, here's the basic premise.

I propose a strict interpretation of the above patents, use of recent
comments which are certain, and past precedent as a guide when in doubt.
Let's try to keep this as plausible as possible. We'll add up the knowns and
leave the overhead and fudgefactors lumped together until the end. Then,
depending on the magnitude, we'll see what can be done.


James Kahle, DesignChain.com wrote:
Kahle also had to remain aware of the eventual manufacturability of
the chip at this point, but elected to put the main burden of this part of
the effort on the implementation phase that would follow.



So, we'll start by assuming the BE needs to be a commercially replicable
part and as such will have definite upper bounds on its area. I figure
that area is a much better metric than, say, density, because it is
invariant - I feel we'll all agree. I then propose that the 250nm Graphic
Synthesizer and its 279mm2 physical size will provide a good reference
point. We will also state that the BE will be fabricated on the 65nm process,
which is supportable by facts like this, or that I know Sony has bought
rights to all Cell based 65nm chips produced for an undisclosed period.
I think we can all agree on this.

We'll start with the PE core, which we shall assume to be the PowerPC
440 that Sony just licensed from IBM as seen here. Some background
information on this core can be seen above, and there will be some links
below for more information.



PowerPC 440 Core Features wrote:

a.. 0-667MHz performance
b.. 9.8 mm2 hard core size
c.. 7-stage pipeline, out-of-order issue, out-of-order execution and
completion
d.. Dynamic branch prediction
e.. Parity error detection and recovery
f.. Static design with extensive clock and power management support
g.. 32x32 multiply, with single-cycle throughput

Memory Management Unit

a.. 64-entry, fully associative unified TLB
b.. Separate I- and D-side micro-TLBs
c.. Flexible TLB management
d.. Variable page sizes (1KB-256MB)
e.. 32K/32K instruction cache/data cache controllers with parity
f.. Write-back, write-through, non-blocking operation
g.. Cache line locking (I and D)



The PowerPC 440 hard core is quite compact at 9.8mm2 on the 130nm process.
Given our previous assumption that the IC will be produced on the 65nm node
and that area scales linearly with feature size (65/130 = 0.5), we're left
with a PowerPC 440 core that's 4.9mm2 in size. As per the other assumptions
above, we assume there will be four of them on a single die, which brings us
to 19.6mm2 in utilized area (or down to roughly 259mm2 in free area against
our 279mm2 reference).

* In the original 9.8mm2 figure, there is a good chance that the L1 caches
were not included. But we are only talking about 64 KB per PowerPC 440 core,
so this might be reflected in an increase of the 19.6mm2 number to something
like 20.8583mm2 (64 KB * 4 cores and a cell size for SRAM of 0.6um2 at 65nm
would mean 1.2583mm2) -- which is negligible next to the variance and
accumulated error in this estimate.
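
For anyone who wants to check the arithmetic so far, here is a minimal Python sketch of the two estimates above. The function names are my own, and it assumes 1 KB = 1024 bytes and an SRAM array made of nothing but 0.6um2 cells (no decoders or sense amps). Note that the "linear" rule shrinks area in proportion to feature size; an ideal shrink goes with the square of the feature size, so the linear figure is the pessimistic one of the two.

```python
GS_BUDGET_MM2 = 279.0  # Graphic Synthesizer die size, used as the area ceiling


def scale_linear(area_mm2, from_nm, to_nm):
    """Area shrinks in proportion to feature size (the rule used in the post)."""
    return area_mm2 * to_nm / from_nm


def scale_quadratic(area_mm2, from_nm, to_nm):
    """Ideal shrink: area goes with the square of feature size (comparison only)."""
    return area_mm2 * (to_nm / from_nm) ** 2


def sram_area_mm2(num_bytes, cell_um2=0.6):
    """Raw cell area only; peripheral circuitry is ignored (assumption)."""
    return num_bytes * 8 * cell_um2 / 1e6  # 1 mm2 = 1e6 um2


core_mm2 = scale_linear(9.8, 130, 65)    # one PowerPC 440 core at 65nm
cores_mm2 = 4 * core_mm2                 # four cores on the die
l1_mm2 = sram_area_mm2(4 * 64 * 1024)    # 64 KB of L1 per core, four cores

print(f"core at 65nm (linear):    {core_mm2:.2f} mm2")
print(f"core at 65nm (quadratic): {scale_quadratic(9.8, 130, 65):.2f} mm2")
print(f"four cores:               {cores_mm2:.2f} mm2")
print(f"L1 SRAM for four cores:   {l1_mm2:.4f} mm2")
print(f"free area left:           {GS_BUDGET_MM2 - cores_mm2:.1f} mm2")
```

If you prefer the ideal shrink, swap `scale_linear` for `scale_quadratic` and every logic block in the tally gets smaller, so the linear numbers can be read as an upper estimate.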

The PowerPC 440 core needs an FPU, so we'll utilize this one. Here's the
background information:



PowerPC 440 FPU Core Specifications wrote:


Performance:

a.. 1050 megaflops @ 525 MHz (Nominal)

b.. 734 megaflops @ 367 MHz (Worst case)

Frequency:

a.. 0-525 MHz nominal silicon, 1.8V, 55°C

b.. 0-367 MHz slow silicon, 1.65 V, 85°C



Architecture: 32-bit PowerPC Book E compliant, supports IEEE-754
floating-point
Precision: IEEE Single-precision and Double-precision
Superscalar: 2 way
Issue: Out of order
Core size: 3.7 mm2
Technology: 0.18 µm (drawn) CMOS copper technology (CMOS 7SF)
Power Supply: 1.8 ± 0.15 V
Transistors: < 800K
Temperature range: -40 to 125 °C



This will work well for our FPUs. At 180nm, it's 3.7mm2 - which would yield
roughly 1.33mm2 at 65nm with the same assumptions. With one FPU per core,
the four of them add about 5.3mm2. This brings us to 24.9mm2 in utilized
area thus far and 254mm2 in possible area left.
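
The same check for the FPU step, as a rough Python sketch. The helper name is hypothetical, and the "one FPU per PowerPC 440 core" count is my reading of how the 24.9mm2 total is reached (19.6mm2 of cores plus four scaled FPUs), not something the spec sheet states.

```python
def scale_linear(area_mm2, from_nm, to_nm):
    """Area shrinks in proportion to feature size (the post's assumption)."""
    return area_mm2 * to_nm / from_nm


BUDGET_MM2 = 279.0   # Graphic Synthesizer reference area
CORES_MM2 = 19.6     # four PowerPC 440 cores at 65nm, from the previous step

fpu_mm2 = scale_linear(3.7, 180, 65)   # one 440 FPU, 180nm -> 65nm
core_fpus_mm2 = 4 * fpu_mm2            # assumption: one FPU per core
used_mm2 = CORES_MM2 + core_fpus_mm2
remaining_mm2 = BUDGET_MM2 - used_mm2

# The spec sheet quotes 1050 megaflops at 525 MHz, i.e. 2 flops per cycle.
flops_per_cycle = 1050 / 525

print(f"one FPU at 65nm: {fpu_mm2:.3f} mm2")
print(f"used so far:     {used_mm2:.2f} mm2")
print(f"remaining:       {remaining_mm2:.2f} mm2")
print(f"flops per cycle: {flops_per_cycle:.0f}")
```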



So, next we need to factor in the APUs, of which there's a plurality (32, 8
per Power440) and which contain 4 FPUs, 4 FXUs, Registers, and all these
assorted things we'll get to in a bit. First let's start with the FPUs.

There are 32 FPUs, of which the above type should suffice as they'll yield
the necessary performance and are a good rough indicator of IBM's
microarchitecture. So, 32 FPUs will yield 42.75mm2 in necessary area -
bringing the grand total up to 67.65mm2 utilized, with 211.35mm2 of the area
left.

There are also 32 FXUs, for which I couldn't find a definitive area
requirement. So, if someone knows and can help out, that would be awesome. I
assumed just for this that each FXU is around 150% the size of an FPU and
rounded to 2mm2. So, that'll yield 64mm2 total, bringing the area count used
up to 131.65mm2, and the amount of potential area down to 147.35mm2.

Within each APU (which we assume is reminiscent of the following patent on
unified scalar and SIMD datapaths), we can assume that there will need to be
control logic as well as Instruction Fetch, Decode, and an Issue/Branch unit
where instructions are queued, plus overhead for all of this. We propose a
conservative fudgefactor of a full 2mm2 per APU for this, which comes to
64mm2 and brings us to a total used area of 195.65mm2 and roughly 83mm2
worth of potential area remaining.

Just something to keep in mind when looking at the remaining area of 83mm2:
the entire die areas of the AMD Thoroughbred A and B are 80mm2 and 84mm2
respectively on a 130nm process. You can fit two, perhaps three of them
normalized to 65nm in the area remaining according to this.
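
To keep the running totals honest, the APU steps can be tallied in a few lines of Python. This is only a sketch: the inputs are the rounded figures from the text, the 2mm2 FXU and control sizes are the guesses stated above, and `running_totals` is a name I made up.

```python
BUDGET_MM2 = 279.0  # Graphic Synthesizer reference area

# (block, area in mm2) in the order the post adds them up
ledger = [
    ("4x PowerPC 440 cores + FPUs", 24.9),
    ("32 APU FPUs", 42.75),
    ("32 APU FXUs (guessed 2mm2 each)", 32 * 2.0),
    ("32x APU control/fetch/decode (2mm2 each)", 32 * 2.0),
]


def running_totals(items, budget):
    """Return (name, used so far, area left) after each block is added."""
    used = 0.0
    rows = []
    for name, area in items:
        used += area
        rows.append((name, round(used, 2), round(budget - used, 2)))
    return rows


rows = running_totals(ledger, BUDGET_MM2)
for name, used, left in rows:
    print(f"{name:<42} used {used:7.2f}  left {left:7.2f}")
```

By this tally the area left is 211.35mm2 after the APU FPUs, 147.35mm2 after the FXUs, and 83.35mm2 after the control logic.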


We then have the SRAM based storage to account for: these being the
Registers and LS. The SRAM will likely be derived from this. A brief
summation:


Embedded SRAM cell wrote:
SRAM is sometimes used as cache memory in SoC systems. The Hi-NA 193-nm
lithography with alternating phase shift mask and the slimming process
combined with the non-slimming trim mask process will achieve the world's
smallest embedded SRAM cell in the 65nm generation, with an area of only
0.6um2.



Knowing this, Pana first found the area allocated to the APU registers. The
register file in each APU is 128*128 bits in size, giving us an aggregate
size of 512Kbits of SRAM across the 32 APUs.

We then used the patent as a guide and found the area taken up by the
aggregate 32Mbits of local storage (LS). Combined, we have 32Mbits (LS) and
512Kbits (Registers) of SRAM, which comes to 20.448mm2. This brings our
total area used up to around 216.098mm2 and the potential area remaining to
be tapped down to about 63mm2.
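
The SRAM figure can be reproduced with a short Python sketch. It assumes binary units (1 Kbit = 1024 bits, 1 Mbit = 2^20 bits) and counts raw cell area only, with no allowance for sense amps, decoders or redundancy; the constant names are mine.

```python
SRAM_CELL_UM2 = 0.6   # 65nm embedded SRAM cell size from the quoted release
APUS = 32

register_bits = APUS * 128 * 128      # 128*128 bits of registers per APU
local_store_bits = 32 * 2**20         # 32 Mbit of LS in aggregate
total_bits = register_bits + local_store_bits

# 1 mm2 = 1e6 um2
sram_mm2 = total_bits * SRAM_CELL_UM2 / 1e6

print(f"registers:   {register_bits // 1024} Kbit")
print(f"local store: {local_store_bits // 2**20} Mbit")
print(f"SRAM area:   {sram_mm2:.3f} mm2")
```

This lands at about 20.45mm2, in line with the figure in the text to two decimal places.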

Abstracted above the APUs and SRAM in the memory hierarchy is a layer of
eDRAM. The patent states 64MB, several press releases state 32MB (256MBit)
or more. I suggest that we start at 32MB and move up from there depending on
the size of the fudgefactor at the end. So, again, Pana used the released
specs of Sony and Toshiba's 65nm process for embedded RAM as found above.
Here's the relevant part:


Embedded DRAM cell wrote:
High-speed data processing requires a single-chip solution integrating
a microprocessor and embedded large volume memory. Toshiba is the only
semiconductor vendor able to offer commercial trench-capacitor DRAM
technology for 90nm-generation DRAM-embedded System LSI. Toshiba and Sony
have utilized 65nm process technology to fabricate an embedded DRAM with
a cell size of 0.11um2, the world's smallest, which will allow DRAM with a
capacity of more than 256Mbit to be integrated on a single chip.



So, using this as a guide, 32MB of eDRAM will take up at least 29.528mm2 of
die area (again, don't freak out with these numbers - there will be a
fudgefactor to eat up inefficiencies), which brings us to 245.626mm2 and
leaves us with roughly 33mm2 of potential area to utilize.


With those 33mm2, we need to fit 4 DMACs and the TLBs, and then go from
there to work around the remaining control logic and wiring, inefficiencies
which weren't eaten up in the conservative 2mm2/APU factored in earlier, as
well as places in which the highest density hasn't been used (which we'd
hope [assume?!] is kept to a minimum on a modular processor like this). I'm
not sure of the sizes of a DMAC and don't know where to start really, but
there would appear to be ample room, at least theoretically, as 33mm2 on
65nm is a lot of room indeed. For a general idea, that's roughly the same
amount of area as an AMD Thoroughbred if it was built on the 65nm process.
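
Going back to the eDRAM estimate, here is a small Python sketch that works out the 32MB case and, for comparison, the 64MB case from the patent. It assumes 1 MB = 2^20 bytes and raw 0.11um2 cells with no array overhead; `edram_area_mm2` is a hypothetical helper, and the 216.098mm2 starting point is the running total from the SRAM step.

```python
EDRAM_CELL_UM2 = 0.11   # Sony/Toshiba 65nm embedded DRAM cell size
BUDGET_MM2 = 279.0      # Graphic Synthesizer reference area
USED_BEFORE_MM2 = 216.098


def edram_area_mm2(megabytes, cell_um2=EDRAM_CELL_UM2):
    """Raw cell area for the given capacity; array overhead is ignored."""
    bits = megabytes * 2**20 * 8
    return bits * cell_um2 / 1e6  # 1 mm2 = 1e6 um2


free_before = BUDGET_MM2 - USED_BEFORE_MM2
for mb in (32, 64):
    area = edram_area_mm2(mb)
    print(f"{mb} MB eDRAM: {area:6.3f} mm2, leaves {free_before - area:6.3f} mm2")
```

On these assumptions the 64MB option would eat nearly all of the remaining area, which is one reason to start at 32MB and only move up if the fudgefactor allows.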



Before we get into inefficiencies, I do believe there is a strong case for
this to be a highly area efficient design. Between it being inherently
customized for a specific application (unlike, say, the Power5) and its
high degree of modularity, it should be possible to make each
corpuscle highly optimized with a very attractive equilibrium between area,
power, and computational ability.


James Kahle, DesignChain.com wrote:
He directed some of his engineers to work from the bottom up, drafting
circuit designs for specific functional elements of the chip, while other
engineers were working from the top down, modeling the high-level
architecture and providing the definition of the chip instruction set. As
with the high-level design, the implementation process is highly iterative,
with the engineers building simulations of how each section of the chip will
perform under real-life circumstances. "The trick is to achieve the right
stability of the chip's smaller pieces," explains Kahle, "and then to build
up bigger and bigger chunks, thereby improving the overall quality and
stability of the design."



We've heard that tools such as IBM's EinsTuner have been used since the
beginning, with theoretical gains of 1/3rd in area savings and 15% in
circuit performance. But one would still expect inefficiencies. Personally,
I see variance in these numbers of ~10-25mm2 easily in each direction. But,
again, the purpose of this isn't to show exactly what it will or may or
might be, but what it could be: just to add up a bunch of relatively known
numbers and see what we have remaining - which is quite a bit indeed.
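
To put the ~10-25mm2 variance in perspective, here is a quick Python sketch that applies it to the area left over after the eDRAM step. `free_band` is a made-up helper, and treating the variance as a simple plus/minus swing on the total is my simplification.

```python
BUDGET_MM2 = 279.0    # Graphic Synthesizer reference area
USED_MM2 = 245.626    # cores, FPUs, APUs, SRAM and 32MB of eDRAM

nominal_free = BUDGET_MM2 - USED_MM2


def free_band(nominal, swing):
    """Free area if the tally is off by `swing` mm2 in either direction."""
    return nominal - swing, nominal + swing


for swing in (10, 25):
    low, high = free_band(nominal_free, swing)
    print(f"+/-{swing} mm2: between {low:.1f} and {high:.1f} mm2 left")
```

Even with the full 25mm2 going the wrong way, the blocks counted so far still fit inside 279mm2, though that case leaves little room for the DMACs and wiring.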

If anything, this shows that with the commercial production of ICs
approaching the order of 300mm2, be it the Graphic Synthesizer or the
recently announced NV40, the Broadband Engine is highly feasible from a
strictly area PoV. The numbers used here are all stand-alone and don't
reflect the efficiencies gained by designing a modular and set-piece
architecture, nor do they reflect the inefficiencies or other area drains -
although, holistically, they should cancel each other out for the most part,
perhaps leaving a slight bias towards the inefficiency.

The bounds regarding this design would appear to rest with the power
attributes, be they the direct problem of wiring and propagation or the
thermal and power intake issues. Levied against this are process
technologies like Low-K dielectrics and PD-SOI, which have worked
amazingly well in AMD's recent use of them with Opteron. Apparently, word
on the street is that the BE implementation is limited by the thermal
aspects, not size - which may be indicative of the speeds which they seek
to attain. But, this is another thread, at another time.


_____________________________________________________________________

