IBM’s Possible Designs For Power10 Systems
August 31, 2020 Timothy Prickett Morgan
In the past two weeks, we have been telling you about the future Power10 processor that will eventually be able to support the IBM i platform as well as AIX, Big Blue’s flavor of Unix, and Linux, the open source operating system that is commercially exemplified by IBM’s Red Hat Enterprise Linux distribution. The leap in performance with Power10 is akin to the jumps we saw across the generations from Power6 through Power9.
This week, we want to contemplate the systems that will be using the Power10 chip and how they will be similar to and different from past and current Power Systems machines. Then we are going to take a deeper dive into performance, clustering systems through their memories rather than their I/O – perhaps the most exciting new thing in the Power architecture – and then do a side foray into machine learning inference performance, which is going to be important for future commercial application workloads.
So let’s start at the top by looking at how the Power10 sockets will be lashed together into shared memory systems. Here are the two different sockets and two different NUMA scaling techniques you need to think about:
As we explained in last week’s issue, the Power10 chip has 16 fat SMT8 cores or 32 skinny SMT4 cores, with the mode set by IBM at packaging time so it cannot be altered by users or even by IBM itself. With the shrink from the 14 nanometer process from GlobalFoundries used to etch Power9 down to the 7 nanometer process from Samsung, IBM probably could not have doubled up the SMT4 and SMT8 cores with variants of the Power10 chip on a monolithic die. But it could probably have gotten pretty close. Somewhere just north of 800 square millimeters of die space, the mask for the die exceeds the reticle limit of the lithography machine, and you just can’t make a bigger chip. But if IBM can cram 16 cores into 600 square millimeters, it could have gotten 20 cores into 800 square millimeters if it did not add any more L3 cache or SerDes circuits for the PowerAXON interconnect and the OMI memory subsystem. That would have dialed back the cache, memory bandwidth, and I/O bandwidth per core a little bit, and the design might not have been in as good a balance. Moreover, the yields would have been poor anywhere close to the full core count, and thus IBM would be right back to selling processors with fewer cores anyway.
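For those who want to check our math, here is a quick back-of-the-envelope sketch in Python. The 600 square millimeter die size and the 800 square millimeter reticle limit are the round numbers we used above, and they are our estimates, not IBM-published specs:

# Back-of-the-envelope reticle math using the round numbers cited above.
power10_cores = 16         # SMT8 cores on the Power10 die
power10_area_mm2 = 600     # our estimate of the Power10 die area
reticle_limit_mm2 = 800    # roughly where the mask blows past the reticle

# Cores scale linearly with area only if L3 cache and SerDes do not grow, too.
area_per_core_mm2 = power10_area_mm2 / power10_cores       # 37.5 mm2, uncore included
max_cores = int(reticle_limit_mm2 // area_per_core_mm2)    # 21, hence "20 cores" with some margin

print(f"{area_per_core_mm2:.1f} mm2 per core -> about {max_cores} cores at the reticle limit")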
It is better to make a smaller chip that has inherently better yields in the first place, and then put multiple chips running at a slower speed into a single socket. IBM knows this better than any other server vendor, having used dual-chip modules (DCMs) in the Power5+, Power6+, Power7+, and Power8 generations – sometimes without even telling customers, as was the case with Power6+ and Power8. (But we figured it out just the same.)
With the process shrink, you can radically cut the thermals for any given amount of circuit functionality. In this particular case, the shrink from 14 nanometers to 7 nanometers, plus architectural changes in the “Cirrus” chip such as improved clock gating, better branch prediction, and a simply more efficient core, cut the raw wattage per core by 50 percent. So there is headroom to drop the clocks a little bit and drop the heat dissipation even faster (power falls off much more steeply than frequency does), which is just a funny way of saying that you can crank the clocks down a little, preserve a fair amount of performance – and then double up the chips in each socket. And this is precisely what IBM did with Power10.
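To put rough numbers on that clocks-versus-watts tradeoff, here is a sketch using the textbook dynamic power approximation: power scales with frequency times voltage squared, and voltage tracks frequency, so power falls roughly with the cube of the clock. The model and the ratios below are ours, not IBM’s:

# Textbook dynamic power approximation: P ~ C * V^2 * f, with V scaling
# roughly in step with f, so P scales roughly with f^3. Our model, not IBM data.
def relative_power(clock_ratio: float) -> float:
    """Power draw of a chip at clock_ratio times the baseline frequency."""
    return clock_ratio ** 3

ratio = 3.5 / 4.0                  # the 12.5 percent clock cut discussed below
one_chip = relative_power(ratio)   # about 0.67X the power per chip
dcm = 2 * one_chip                 # about 1.34X the power for two chips in one socket

print(f"one chip at 3.5 GHz: {one_chip:.2f}X power")
print(f"DCM at 3.5 GHz: {dcm:.2f}X power for 2X the cores and I/O")

If a single Power10 SCM at 4 GHz burned something like 300 watts – a hypothetical number on our part – then a DCM at 3.5 GHz would land right around the 400 watt high end we speculate about below.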
To be precise, as you can see from the chart above, the single-chip module (SCM) implementation of Power10, which has 15 active SMT8 cores or 30 active SMT4 cores, has a design frequency of somewhere around 4 GHz. IBM offered Power9 chips that ran at 190 watts (which had 16 cores or 20 cores running at 2.25 GHz) and at 250 watts (which had a 20-core part geared down to 2.8 GHz), and we think the 22-core parts running at a baseline of 3.2 GHz that could turbo up to 3.9 GHz were probably in the range of 350 watts. That’s a guess on that latter number. And on the very high-end NUMA machines, we would not be surprised if IBM pushed it all the way up to 400 watts per socket. But here is the trick, as Brian Thompto, the Power10 core architect, explained it to us. By dropping the Power10 chip frequency to a target of 3.5 GHz – only a 12.5 percent decrease – IBM is able to put two Power10 chips in a DCM, doubling up the cores and I/O, and still not bust through the high end of its per-socket thermal limits.
And that, as you can see from the chart above, is precisely what IBM is going to do.
On the top right of the chart, you see the same server architecture that was code-named “Fleetwood” and “Mack” by IBM, with the former being the server node and the latter being the node controller that links up to four four-socket machines together into the single system image we know as the Power E980, launched in August 2018. That NUMA interconnect architecture will be essentially the same with Power10, topping out at 16 sockets, and this machine will likewise use the 15-core SCM variants of the Power10 die to yield a maximum of 240 cores and 64 TB of main memory across a single system image. The prior Power E980 had a dozen Power9 SMT8 cores per socket, for a total of 192 cores. Clock for clock, the Power E1080 – if the big NUMA box is called that when it is announced a little more than a year from now – should deliver around 1.63X the performance of the Power E980.
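That 1.63X comes out of simple arithmetic if you assume, as we have, roughly 1.3X more per-core throughput for Power10 over Power9 at the same clock speed. That uplift is our working assumption, not an IBM-published figure, but the sketch below shows how the estimate falls out:

# Where our ~1.63X Power E1080 versus Power E980 estimate comes from.
e980_cores = 192         # 16 sockets x 12 Power9 SMT8 cores
e1080_cores = 240        # 16 sockets x 15 Power10 SMT8 cores
per_core_uplift = 1.30   # assumed Power10 over Power9 gain, clock for clock

speedup = (e1080_cores / e980_cores) * per_core_uplift
print(f"estimated Power E1080 versus Power E980: {speedup:.3f}X")   # 1.625X, call it 1.63X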
Now, if you squint your eyes at the system configuration at the lower right, that is a funky variant of the Power E950 server launched two years ago, only the future Power E1050, as this machine might be called, has Power10 DCMs in it instead of Power9 SCMs with a dozen SMT8 cores per processor. So this machine will have 30 SMT8 cores running at 3.5 GHz or so in each socket, which will make it a little less than half as powerful as the Power E1080, but in half the space and without the need for the external node controllers. (Topologically, the Power E1050 will be like half of a Power E1080.)
What cannot happen, of course, is for customers to lash together four Power E1050s to create an even more powerful Power E1080-Plus machine, because there are not enough NUMA interconnects without adding more hops to the system. (IBM could do this where scale matters more than anything else, of course, but performance tuning would be problematic.) The point is, you can have a machine that is spread out like the E1080 with 16 sockets, or a machine that is half the size, puts eight Power10 chips into four sockets, and offers a little less than half the performance, depending on the clock speeds and core counts IBM actually delivers in the Power E1050. With 120 cores and 960 threads and 32 TB of maximum memory capacity (theoretically, anyway), this Power E1050 is going to be a real screamer, with 2.5X the cores of the top-end Power E950 based on Power9 SMT8 engines. Normalized for the 4 GHz design clock speed being geared down to the 3.5 GHz target clock speed, the Power E1050 will have about 2.85X the performance of the Power E950, 2X the memory capacity, and perhaps as much as 4X the memory bandwidth if it can use 3.2 GHz DDR4 memory compared to the 1.6 GHz DDR4 memory of the Power E950.
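The same kind of arithmetic is behind those Power E1050 versus Power E950 numbers. In the sketch below, the 1.3X per-core uplift and the assumption that a DCM doubles the memory channels per socket are ours; the core counts, clocks, and DDR4 speeds are the figures cited above:

# Reconstructing the Power E1050 versus Power E950 estimates above.
core_ratio = 120 / 48        # E1050 cores versus top-end E950 cores, or 2.5X
per_core_uplift = 1.30       # assumed Power10 over Power9 gain, clock for clock
clock_ratio = 3.5 / 4.0      # 3.5 GHz DCM target versus 4 GHz SCM design point

performance = core_ratio * per_core_uplift * clock_ratio
print(f"estimated performance: {performance:.2f}X")   # 2.84X, call it 2.85X

data_rate_ratio = 3.2 / 1.6  # DDR4 at 3.2 GHz versus 1.6 GHz
channel_ratio = 2.0          # assuming the DCM doubles memory channels per socket
print(f"estimated memory bandwidth: {data_rate_ratio * channel_ratio:.0f}X")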
Many shops that might have needed a Power E870, Power E880, or Power E980 in the past are going to be able to do just fine with a Power E1050 – provided IBM allows enough I/O to hook into it.
We have no indication of what IBM is thinking about the entry servers that dominate the IBM i market. It is a safe assumption that Big Blue will have machines with one or two sockets, as it has had for years, and will also have geared down processors – perhaps even with a single core variant – aimed specifically at IBM i customers.
While this IBM i entry systems strategy is interesting, it is getting old. The trick is not to keep giving such customers a single core that doubles in performance every three to four years, but to give them a reason to buy more cores and consolidate many of their X86 workloads onto Power cores – and to make them all cheap enough that the Power Systems faithful unplug those Intel Xeon machines and don’t even think about those AMD Epyc machines. IBM owns the QuickTransit emulator that helped Apple move off PowerPC and onto Intel Core processors for its desktops and laptops; Big Blue took control of the technology after Apple had licensed it because it saw the existential threat that QuickTransit posed to all architectures. It is high time that IBM deployed this technology to move Windows Server applications to Power and deployed Red Hat Enterprise Linux to consolidate Linux workloads onto Power. I would go so far as to say that it may be high time to fully emulate System z on Power using QuickTransit, too, and eliminate that cost. This is hard for me to say because I have great respect for the System/360 and its progeny. But if we are entering a world where IBM really can only afford one processor during the System z17 and Power11 generations in 2025 or so, I know which way I would consolidate.
Up next, we will talk about the expected performance of these Power10 systems, as best as we can reckon from what IBM has said so far.
RELATED STORIES
Power Systems Slump Is Not As Bad As It Looks
The Path Truly Opens To Alternate Power CPUs, But Is It Enough?
What Open Sourcing Power’s ISA Means For IBM i Shops
IBM’s Plan For Etching Power10 And Later Chips
The Road Ahead For Power Is Paved With Bandwidth
IBM Puts Future Power Chip Stakes In The Ground
“… We have no indication of what IBM is thinking about the entry servers that dominate the IBM i market. It is a safe assumption that Big Blue will have machines with one or two sockets, as it has had for years, and will also have geared down processors – perhaps even with a single core variant – aimed specifically at IBM i customers. …”
I think features are historically lacking in the IBM i OS and programming languages because of the decision to lock so many shops to a single core. The more features, the more code needed to support the feature, the more CPU needed to run the code.
A recent example: I had to troubleshoot a problem where the WMS shipped an order and left an inventory allocation hanging. I used the DSPJRN command to find which program created the inventory allocation record. Very helpful. The only problem was that the journal reported the program name as a service program. It did not contain the name of the procedure in the service program that inserted the record into the inventory table. That was a crucial piece of info that was missing.
Troubleshooting application problems on a production system is a big deal. It is very important to be able to explain why inventory does not match, and to pinpoint which department or partner caused the problem. The more info that IBM i can store as transactions run through the system, the better. If IBM i shops routinely had multiple cores instead of just one, maybe the OS could store more info – like the full program call stack and a snapshot of the program variables involved when an update is applied to the database.