Power Systems And The Spectre And Meltdown Threats
January 10, 2018 Timothy Prickett Morgan
Speculative execution has been part of modern processors for well over a decade, and while it is hard to quantify how much of a performance benefit this collection of techniques has delivered, it is obviously significant enough that all CPUs, including IBM Power and System z chips, use it. And that, as the new Spectre and Meltdown security holes that were announced by Google on January 3 show, turns out to be a big problem.
Without getting too deep into the technical details, there are many different ways to implement speculative execution, which is used to keep the many instruction pipelines and layers of cache in a processor busy doing what is hoped will be useful work. So much of what a computer does is an IF-THEN-ELSE kind of branch, and being able to pre-calculate the answers to multiple possible branches in an instruction stream is more efficient than following each path independently and calculating the answers in series. The speculative part of the execution involves using statistics to analyze patterns in data and instructions underneath an application and guessing which branches and data will be needed. If you guess right a lot of the time, then the CPU does a lot more work than it might otherwise. There are no modern processors (except for the PowerPC A2 chips used in the BlueGene/Q supercomputers from IBM) that we can find that don’t have speculative execution in some form or another, and there is no easy way to quantify how much of a performance boost it gives.
It is a pity, then, that the Spectre and Meltdown security vulnerabilities, which allow user-level applications to see data they are not authorized to see in the privileged kernel memory space of operating systems, go right to the heart of modern processors. The fixes to these issues, which Google has documented here and which the search engine giant and the rest of the CPU and operating system industry have been working on since last June without any of us knowing about it, do not require turning off speculative execution. (We are pretty sure no one can do this, which is why these vulnerabilities are so insidious.) But the fixes do place some overhead on systems as user-level memory addresses are blocked off from kernel-level memory to keep the one from seeing the other. We expect, in the fullness of time, that CPU makers will add hardware to perform these functions and that the performance impact will be negligible, but this will require time, money, and some head-scratching to accomplish.
Every CPU in servers seems to be potentially affected by these speculative execution exploits. Here is what the exploits are called and the security notices related to them:
- Variant 1, CVE-2017-5753: Bounds check bypass. This vulnerability affects specific sequences within compiled applications, which must be addressed on a per-binary basis.
- Variant 2, CVE-2017-5715: Branch target injection. This variant may either be fixed by a CPU microcode update from the CPU vendor, or by applying a software mitigation technique called Retpoline to binaries where concern about information leakage is present. This mitigation may be applied to the operating system kernel, system programs and libraries, and individual software programs, as needed.
- Variant 3, CVE-2017-5754: Rogue data cache load. This may require patching the system’s operating system. For Linux there is a patchset called KPTI (Kernel Page Table Isolation) that helps mitigate Variant 3. Other operating systems may implement similar protections – check with your vendor for specifics.
Variant 1 and Variant 2 are collectively called Spectre, and Variant 3 is known as Meltdown. Meltdown seems to largely affect Intel Xeon and Core processors and their predecessors back to 2009 or so, when the “Nehalem” architecture cores came out and first used speculative execution and a new cache structure that previous chips did not have. It looks like Spectre vulnerabilities can affect different processors to varying degrees.
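For the Linux machines in the shop, here is a minimal sketch of how you might see what the kernel itself reports for each of the three variants. It assumes a kernel new enough (4.15 or later, or a distribution kernel with the feature backported) to expose the `/sys/devices/system/cpu/vulnerabilities` directory; the function name and the labels are our own, and on older kernels or other operating systems the files will simply not be there.

```python
from pathlib import Path

# Assumption: the running Linux kernel exposes per-vulnerability status files
# here (upstream since 4.15, backported by some distributions).
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")

# Map the variant names used in the article to the kernel's file names.
VARIANTS = {
    "Variant 1 (Spectre, CVE-2017-5753)": "spectre_v1",
    "Variant 2 (Spectre, CVE-2017-5715)": "spectre_v2",
    "Variant 3 (Meltdown, CVE-2017-5754)": "meltdown",
}


def read_mitigation_status(vuln_dir=VULN_DIR):
    """Return a dict of variant label -> status string reported by the kernel."""
    status = {}
    for label, filename in VARIANTS.items():
        path = Path(vuln_dir) / filename
        try:
            status[label] = path.read_text().strip()
        except OSError:
            # Older kernel, non-Linux system, or restricted permissions.
            status[label] = "unknown (file not present)"
    return status


if __name__ == "__main__":
    for label, state in read_mitigation_status().items():
        print(f"{label}: {state}")
```

The exact wording of each status line depends on the kernel and the processor, so treat the output as a hint about what has been applied rather than as a replacement for the vendor security notices.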
The fixes to the security issues thus far involve a combination of patches to operating system kernels and in some cases firmware running on the system. This is the case, for instance, with IBM’s Power Systems iron.
Back on January 3, when the bits hit the fan and everyone else was putting out statements about these speculative execution vulnerabilities, IBM said that it would have firmware patches for Power Systems iron using Power7+, Power8 (and that implies the Power8+ with NVLink), and Power9 processors on January 9, and that it would also make patches available for Linux on Power at the same time. If you look at IBM’s Product Security Incident Response Team (PSIRT) statement, which was revised that day, it now says that Power7+ and Power8 fixes are available through its FixCentral firmware and software updating service, and that patches for Power9 system firmware will be available on January 15. IBM is now saying that Linux shops have to go through Red Hat, Canonical, or SUSE Linux to get kernel patches. The IBM i and AIX operating systems will get their fixes on February 12. IBM has yet to say what happens to Power6, Power6+, and Power7 iron, which, as far as we know, includes some speculative execution features. It is interesting to see that IBM’s statement does not say whether its patches will cover all three variants of the vulnerabilities or what the exposure to each is for the various generations of processors. (Other chip vendors did offer such insight, particularly Intel and AMD.)
IBM i shops don’t just run Power Systems iron and IBM i, of course. They are largely also Windows Server shops with X86 servers, and they often have Linux running on Power and some AIX on Power, too. Once in a while, they even have Linux on X86 iron. The Linux kernel has its patches, something that Google and the Linux community have been working on since last summer. Many of the most popular Windows Server releases have also been patched against these vulnerabilities, which you can see in Microsoft’s statement; you can get patches for Windows Server 2008 R2, Windows Server 2012 R2, and Windows Server 2016 but you cannot get them for the base (R1) releases of Windows Server 2008 or Windows Server 2012. The Citrix XenServer, VMware ESXi, and Red Hat KVM hypervisors have also been patched. We presume that as part of its patches, IBM is also patching the PowerVM and OpenKVM hypervisors used on Power Systems iron.
The thing everyone wants to know is what kind of performance impact the fixes for Spectre and Meltdown will have. We don’t know, because it depends on the nature of the CPU architecture, the way the memories are isolated and checked to keep users out of kernel space, and the way the applications make use of speculative execution. Google has said that on its own internal workloads, the impact of the fixes has been negligible, but then again, it controls its own Linux distribution and wrote the fixes using some of the smartest software engineers on the planet – who also discovered the vulnerabilities. We shall see.
Interestingly, Red Hat ran a series of benchmark tests on various workloads using the fixes for its Enterprise Linux 7 distribution running on the past three generations of Intel Xeon processors. In heavily virtualized environments and those involving online transaction processing, the overhead of the fixes was between 8 percent and 19 percent, which is pretty significant. (Both Red Hat and Google caution that these “microbenchmarks,” which stress only parts of a system, can show greater performance degradation than real world applications, so be careful making assumptions.) Java virtual machines and database analytics and decision support systems took a more moderate performance hit after the fixes, on the order of 3 percent to 7 percent, because they often aggregate requests between the kernel and user spaces. More generic kinds of raw calculations saw a small impact of 2 percent to 5 percent, and of course any function that bypasses the kernel is going to be unaffected.
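To get a feel for why workloads that cross between user space and the kernel a lot take the bigger hit, here is a rough sketch of a microbenchmark of our own devising (it is not the Red Hat test suite). It times a loop that makes a cheap system call on every pass against a loop that stays entirely in user space. The helper names are hypothetical, the choice of `os.getppid()` as the system call is ours, and the same caution about microbenchmarks applies here.

```python
import os
import time


def time_loop(fn, iterations=200_000):
    """Return the wall-clock seconds taken to call fn() `iterations` times."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return time.perf_counter() - start


def syscall_work():
    # Asks the operating system for the parent process ID, so each call
    # crosses from user space into the kernel and back.
    os.getppid()


def user_space_work():
    # Pure arithmetic inside the interpreter; no kernel involvement.
    sum(range(10))


if __name__ == "__main__":
    print(f"syscall-heavy loop: {time_loop(syscall_work):.3f} s")
    print(f"user-space loop:    {time_loop(user_space_work):.3f} s")
```

Run it before and after patching on the same machine. If the kernel page table isolation overhead is showing up, you would expect the first number to move more than the second.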
In other words, you are going to have to apply the fixes and see. Red Hat did have one warning, which may apply to all operating systems but was certainly the case for its Linux. Any system that is CPU bound or memory bound is going to thrash after the fixes are applied. Our advice is to test the throughput of your system for some time before applying the patches, then apply the patches and run the tests again. Then, you will know what the performance impact will be for sure. This is data you may need later. For instance, it will come in handy when you are arguing for discounts on Power9 servers. If somewhere around 10 percent or 20 percent of the capacity is going to go up the chimney because of memory space management for the kernel and user spaces, that’s not your problem. That’s IBM’s problem. At the very least, Big Blue can split the difference with you.
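The before-and-after arithmetic is simple enough, but it is worth writing down so everyone computes it the same way. This is a minimal sketch that assumes you have collected a handful of throughput samples (transactions per second, jobs per hour, whatever your shop measures) before and after patching; the function name and the sample numbers are made up for illustration.

```python
from statistics import mean


def overhead_percent(before, after):
    """Percent of throughput lost after patching.

    `before` and `after` are lists of throughput samples taken with the
    same workload on the same system (higher is better).
    """
    baseline = mean(before)
    patched = mean(after)
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return 100.0 * (baseline - patched) / baseline


if __name__ == "__main__":
    # Hypothetical samples, not measurements from any real system.
    before = [1000, 1020, 980]
    after = [900, 910, 890]
    print(f"Capacity lost to the fixes: {overhead_percent(before, after):.1f}%")
```

A negative result would mean throughput went up after patching, which most likely says more about noise in the test than about the patches, so take enough samples to trust the averages.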