How Much Does NVM-Express Flash Really Boost IBM i Performance?
November 9, 2020 Timothy Prickett Morgan
With the NVM-Express protocol, a storage overlay for the PCI-Express peripheral bus that allows flash to be addressed in parallel as flash in its own right rather than as emulated disk storage behind a SATA or SCSI protocol, the idea is to get those vintage storage drivers out of the way and let the operating system kernel speak directly to the flash. This is done so the impressive – and seemingly always growing – I/O bandwidth of flash can actually be brought to bear to speed up applications.
Flash in general, and NVM-Express flash in particular, has been a boon to many applications, particularly database and data analytics workloads that chew on data stores that do not fit into main memory and that would normally have their data dumped onto arrays of disk drives. Modern disk drives may have lots of capacity – we are pushing up to 16 TB and 20 TB of capacity on recent disks – but they only deliver on the order of 60 to 200 I/O operations per second, or IOPS, depending on whether they are reading or writing and whether they have their own SRAM or flash cache. If you need a lot of I/O, you therefore need a lot of disk drives.
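To put that gap in perspective, here is a minimal back-of-the-envelope sketch in Python, using the 60 to 200 IOPS per drive range cited above and, purely as an illustrative target, the 250,000-plus mixed IOPS figure IBM quotes further down in this story for a pair of NVM-Express flash cards; the math is ours, not an IBM sizing tool.

```python
# Back-of-the-envelope: how many disk arms does it take to match flash IOPS?
# Per-drive figures are the 60 to 200 IOPS range cited above; the 250,000
# IOPS target is the mixed read/write figure quoted later in this article
# for a pair of NVM-Express flash cards. Illustrative arithmetic only.
import math

TARGET_IOPS = 250_000
DISK_IOPS_LOW, DISK_IOPS_HIGH = 60, 200

best_case = math.ceil(TARGET_IOPS / DISK_IOPS_HIGH)   # fast drives, friendly I/O mix
worst_case = math.ceil(TARGET_IOPS / DISK_IOPS_LOW)   # slow drives, hostile I/O mix

print(f"Disk drives needed to hit {TARGET_IOPS:,} IOPS: "
      f"{best_case:,} to {worst_case:,}")
# Disk drives needed to hit 250,000 IOPS: 1,250 to 4,167
```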
We would love to know the precise performance impact of shifting from disk to flash to NVM-Express flash for batch and transaction processing workloads, and others, running on Power Systems machines equipped with the IBM i operating system. We don’t know how sensitive the Db2 for i database is to IOPS, but we suspect that for certain kinds of work it has been tweaked to do lots of caching in main memory and processor L3 cache simply because so many IBM i machines have fairly modest hardware with a half dozen, a dozen, or two dozen disk drives. That’s not a lot of IOPS, as the chart below, which we found in a presentation IBM gave to business partners earlier this year, makes clear:
This is a very handy chart for performance specs and relative price/performance of various media, including DDR4 DRAM and 3D XPoint main memory in DIMM form factors, as well as 3D XPoint and flash in various NVM-Express, SATA, and SAS form factors and different flash types. Thus far, Intel has kept 3D XPoint memory to itself with its Optane branded memory, and IBM does not yet have PCM main memory sticks or SSDs available from partners for Power Systems that provide persistence – something that main memory itself lacks, of course. It is called Dynamic Random Access Memory for a reason: if you turn off the power, the data is gone. Not so with flash or 3D XPoint, which are both persistent memories. Flash cannot be used in DRAM DIMM form factors because it is too slow for the system to accept it in this role. 3D XPoint is roughly halfway between DRAM and flash in terms of performance, and 3D XPoint is roughly 10X as expensive as flash while DRAM is roughly 2X again as expensive as 3D XPoint. And anyone who has bought a server in the past two decades knows why no one fully configures a machine with big, fat memory sticks to accelerate performance – it does not take long before getting the memory capacity (by using fat memory sticks) and the memory bandwidth (by fully populating the memory slots in a system) costs a lot more than the processors in a Power Systems machine. And those processors are not cheap by any stretch of the imagination.
NVM-Express is a way to help boost the performance of the system without having to go all the way to an all-memory system, which is not even close to economically feasible.
It can be a pain in the neck to keep track of which features are used for what machines, so we did some digging around to find the original NVM-Express flash devices, based on the PCI-Express 3.0 protocol, that IBM announced with the Power9 systems two years ago, and to characterize the new PCI-Express 4.0 devices that were rolled out this past summer in conjunction with the updated Power9 “G” series scale-out machines, which have more PCI-Express 4.0 slots. This chart will help:
IBM used to sell flash for Power Systems machines in three different physical form factors: the M.2 gum stick form factor, often used for base operating system installs; the U.2 form factor, which is a standard SSD size and resembles a 2.5-inch disk drive package; and the add-in card (AIC), which plugs directly into a physical PCI-Express slot in a machine. IBM killed off the M.2 devices originally sold in Power9 machines back in March. Here are the feeds and speeds of the original PCI-Express 3.0 U.2 flash devices, which were not supported by IBM i but rather only by AIX and Linux:
Now, there was IBM i support for the NVM-Express add-in card flash devices in Power Systems based on the PCI-Express 3.0 interconnect, as shown here:
There are some caveats, according to Big Blue, to using these NVM-Express flash cards with IBM i, as outlined below:
- IBM i supports virtualized NVM-Express (only for the PCI-Express add-in card) via VIOS and requires the use of the VIOS LVM (Logical Volume Manager). This supports IBM i 7.2, 7.3, or 7.4. Since this puts more “layers” between IBM i and the storage, it will not perform the same as native NVM-Express, and thus is not the recommended option if the best performance is required.
- PCI-Express card NVM-Express service/repair is similar to other PCIe card slot concurrent maintenance, but with extra steps such as described in this document.
- Encryption is not supported today. IBM i development is aware of the desire for hardware encryption support.
- IBM i treats NVM-Express as the same tier as SSDs, so currently there isn’t a way to tier (say using the Trace ASP Balance (TRCASPBAL) command) between them.
- NVM-Express devices are supported, as of the April 14 announcement, as direct attached devices for IBM Db2 Mirror for i.
For the sake of completeness in our feeds and speeds wrap-up, here are the details on the NVM-Express add-in card flash devices in Power Systems based on the faster PCI-Express 4.0 interconnect, which has twice as much bandwidth per PCI-Express I/O lane:
In general, these newer devices have lower latency on reads and writes and higher IOPS, thanks to the jump from PCI-Express 3.0 to PCI-Express 4.0 and to improvements in the devices themselves. But raw feeds and speeds do not necessarily translate into higher batch and OLTP database performance.
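To give a rough sense of what that doubling works out to in raw numbers, here is a quick sketch assuming the standard per-lane signaling rates of 8 GT/s for PCI-Express 3.0 and 16 GT/s for PCI-Express 4.0, both with 128b/130b encoding, and an x4 device width; the x4 lane count is our assumption for a typical NVM-Express device, not a spec for any particular IBM feature card.

```python
# Rough per-lane and per-device bandwidth for PCI-Express 3.0 versus 4.0.
# Assumes 128b/130b line coding (used by both generations) and an x4 link,
# which is typical for NVM-Express devices; actual cards may use wider links.
GT_PER_SEC = {"PCIe 3.0": 8.0, "PCIe 4.0": 16.0}   # giga-transfers per second per lane
ENCODING = 128 / 130                               # 128b/130b coding efficiency
LANES = 4

for gen, gt in GT_PER_SEC.items():
    per_lane_gb = gt * ENCODING / 8                # usable gigabytes per second per lane
    print(f"{gen}: {per_lane_gb:.2f} GB/s per lane, "
          f"{per_lane_gb * LANES:.2f} GB/s for an x{LANES} device")
# PCIe 3.0: 0.98 GB/s per lane, 3.94 GB/s for an x4 device
# PCIe 4.0: 1.97 GB/s per lane, 7.88 GB/s for an x4 device
```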
In our digging around, we found a very interesting benchmark test that Big Blue has run on a Power S924 system pitting SAS drives with either RAID 5 or RAID 10 data protection against a similar capacity of mirrored NVM-Express flash. IBM used its Batch Workload (BatchWL) generator to tickle the Db2 for i database using a mix of SQL or native I/O queries. The database used keyed files, with small databases using 256 B records and large databases using 32 KB records, and the files were all over 3 GB in size to reduce caching effects and to really stress the storage systems. Each run thread does its I/O operations against its own unique file, which eliminates locking overhead. Sequential I/O operations in the BatchWL test access each record by incremental key, and random I/O operations access each record by a random key. The performance data is gathered per thread and aggregated in the performance report. The BatchWL benchmark had six different tests: four of them are “corner cases” pairing small random or large sequential operations with either 100 percent reads or 100 percent writes, and the other two are small random updates (with a 50 percent read/write ratio) and large sequential updates.
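We do not have IBM’s BatchWL code, but to make those access patterns concrete, here is a minimal sketch of a BatchWL-style read test in Python: each thread hammers its own keyed file either sequentially (incremental key) or randomly (random key), and the per-thread rates are aggregated. The record size and the guidance to push files past 3 GB come from IBM’s description above; the file layout, thread count, and function names are our own illustrative assumptions, not IBM’s implementation.

```python
# Illustrative stand-in for a BatchWL-style read test: each thread owns its
# own file of fixed-size records (so there is no lock contention) and reads
# every record either by incremental key (sequential) or random key (random).
# Record sizes and the >3 GB file guidance come from IBM's description; the
# rest (names, thread count, file sizes here) is assumed for illustration.
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

RECORD_SIZE = 32 * 1024      # "large" records per IBM's description; use 256 for "small"
RECORDS_PER_FILE = 10_000    # scale up so each file exceeds 3 GB to defeat caching
THREADS = 4

def run_thread(path: str, sequential: bool) -> float:
    """Read every record in this thread's private file; return ops per second."""
    keys = list(range(RECORDS_PER_FILE))
    if not sequential:
        random.shuffle(keys)                 # random access by key
    start = time.perf_counter()
    with open(path, "rb") as f:
        for key in keys:
            f.seek(key * RECORD_SIZE)        # keyed access maps to a byte offset here
            f.read(RECORD_SIZE)
    return len(keys) / (time.perf_counter() - start)

def run_test(paths: list[str], sequential: bool) -> float:
    """One thread per file; aggregate the per-thread rates, as BatchWL reports do."""
    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        return sum(pool.map(lambda p: run_thread(p, sequential), paths))

if __name__ == "__main__":
    files = [f"batchwl_{i}.dat" for i in range(THREADS)]
    for path in files:                       # build each thread's private file
        with open(path, "wb") as f:
            for _ in range(RECORDS_PER_FILE):
                f.write(os.urandom(RECORD_SIZE))
    print(f"Sequential reads: {run_test(files, True):,.0f} ops/sec aggregate")
    print(f"Random reads:     {run_test(files, False):,.0f} ops/sec aggregate")
```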
Here are the two different Power S924 configurations tested:
Now here is the data we have always wanted to see: actual IBM i database benchmarks contrasting SAS flash drives with NVM-Express flash. We would love to see additional data for machines with physical disk drives here, but we suspect the performance would be quite poor indeed by comparison. Anyway, here is how the Power S924s configured with the two types of flash stack up:
It is important to remember that both of these configurations have intense data protection: either RAID 5 or RAID 10 for the SAS flash drives, and mirroring through the IBM i operating system across a pair of NVM-Express flash cards.
It is also important to see that, with the exception of small random writes (which clog up the CPU because the mirroring runs on the processors, not on a SAS or SCSI card), the NVM-Express mirror offers significantly higher read and write throughput and significantly lower latency. The latency gains are not as large as we expected, but the NVM-Express flash drives are running through a thick driver stack based on the Virtual I/O Server (VIOS).
One funny thing we noticed: in the examples where a big PCI-Express flash card is used as primary storage, it is actually carved up into namespace regions – essentially virtual disk drives – because if you don’t do that with IBM i (and, for all we know, other operating systems, too) you get “suboptimal performance,” as IBM put it. It’s just strange that it takes virtual disk arms to boost performance. Plus ça change, plus c’est la même chose.
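IBM i does that carving through its own configuration support, but for a generic illustration of what splitting one card into several namespaces looks like, here is a sketch that works out equal-sized namespace block counts and prints the corresponding nvme-cli commands a Linux administrator might run. The device path, namespace count, and block size are assumptions for illustration, and none of this is IBM i’s actual procedure.

```python
# Carve one NVM-Express card into N equal namespaces ("virtual disk arms").
# Prints generic Linux nvme-cli commands for illustration only; IBM i does
# the equivalent through its own configuration tooling. The device path,
# namespace count, and 4 KB logical block size below are assumptions.
CARD_CAPACITY_BYTES = 3_200_000_000_000   # a 3.2 TB add-in card, as in IBM's examples
NAMESPACES = 4                            # how many "virtual arms" to carve out
LBA_SIZE = 4096                           # logical block size in bytes

blocks_per_ns = CARD_CAPACITY_BYTES // NAMESPACES // LBA_SIZE

for ns in range(1, NAMESPACES + 1):
    # create-ns takes sizes in logical blocks; --flbas picks the LBA format
    # index the drive reports for the chosen block size (0 is just a placeholder)
    print(f"nvme create-ns /dev/nvme0 --nsze={blocks_per_ns} "
          f"--ncap={blocks_per_ns} --flbas=0")
    print(f"nvme attach-ns /dev/nvme0 --namespace-id={ns} --controllers=0")

print(f"{NAMESPACES} namespaces of {blocks_per_ns:,} blocks "
      f"({blocks_per_ns * LBA_SIZE / 1e12:.2f} TB) each")
```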
Now, let’s talk about money for a second. Let’s start with raw capacity and then move on to the configurations used in the benchmarks.
Say you want around 1.6 TB of capacity for IBM i. In the examples we have seen from Big Blue, with six 283 GB 15K RPM disk drives (we are talking maybe 1,200 IOPS, which is nothing in the modern world), the disks would cost $3,750 at list price and the expanded function storage backplane, which holds a dozen drive bays, would cost $4,099, for a total of $7,849. If you wanted a pair of 1.6 TB flash cards instead (and used OS mirroring instead of RAID 5 data protection), it would cost less money at $5,198, a savings of $2,651, or 33.8 percent, while delivering a crazy amount more IOPS (more than 250,000 IOPS on mixed read/write workloads at a 70/30 ratio). Interestingly, the savings compared to flash SSDs are even bigger, even if the aggregate IOPS may not be substantially different (in theory). If you literally think of flash as a funky disk drive and buy four 387 GB SAS flash SSDs for $7,796 plus $4,099 for the expanded function backplane, you are in for $11,895 in total. The pair of mirrored flash cards is actually a whopping 56.3 percent less expensive, although we suspect the performance in the aggregate is not a lot different in terms of raw IOPS. But the flash cards are running the NVM-Express protocol while the bunch of flash SSDs look like virtual disks, and we suspect that the actual, realized IOPS is much higher for the flash cards than for the array of SSDs.
Now, you can double up the capacity on the disk drives and flash SSDs to try to amortize the cost of that expanded function storage backplane a bit. Take a 3.2 TB example. With a dozen disk drives, you pay $7,500 for the disks and $4,099 for the backplane, for a total of $11,599. A pair of 3.2 TB NVM-Express flash cards mirrored by the operating system costs $9,198, which is 20.7 percent cheaper and delivers more than two orders of magnitude more IOPS. With eight 387 GB SAS flash SSDs, it costs $15,592 for the flash SSDs and $4,099 for the backplane, for a total of $19,691. The pair of mirrored NVM-Express flash cards at 3.2 TB costs 53.3 percent less.
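To make that price math easy to check, or to rerun with different list prices, here is a small sketch that reproduces the comparisons above; the per-unit prices are derived from the list prices quoted in this story, and the savings() helper is our own convenience, not an IBM configurator.

```python
# Reproduce the storage price comparisons above from the quoted list prices.
BACKPLANE = 4_099          # expanded function storage backplane, a dozen bays
DISK_283GB = 625           # 283 GB 15K RPM SAS disk ($3,750 for six)
SSD_387GB = 1_949          # 387 GB SAS flash SSD
NVME_1_6TB_PAIR = 5_198    # two 1.6 TB NVM-Express flash cards, OS mirrored
NVME_3_2TB_PAIR = 9_198    # two 3.2 TB NVM-Express flash cards, OS mirrored

def savings(baseline: int, alternative: int) -> str:
    pct = (baseline - alternative) / baseline * 100
    return f"${baseline:,} vs ${alternative:,}: {pct:.1f} percent less"

# Roughly 1.6 TB of capacity
print(savings(6 * DISK_283GB + BACKPLANE, NVME_1_6TB_PAIR))   # 33.8 percent less
print(savings(4 * SSD_387GB + BACKPLANE, NVME_1_6TB_PAIR))    # 56.3 percent less

# Roughly 3.2 TB of capacity
print(savings(12 * DISK_283GB + BACKPLANE, NVME_3_2TB_PAIR))  # 20.7 percent less
print(savings(8 * SSD_387GB + BACKPLANE, NVME_3_2TB_PAIR))    # 53.3 percent less
```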
It is hard to argue for the SAS flash SSDs on the economics or for the disks on price. You can have high performance and a good price with the NVM-Express flash cards.
Now, let’s suss out the cost of the storage in the benchmark that IBM actually ran. Forget the cost of the system for the moment. The feature #EJ14 PCI-Express SAS RAID controller with 12 GB of cache costs $9,169 a pop, and IBM put a pair of them in the box. Forget the cost of the disk and SSD enclosure (because we can’t easily get our hands on it) and add in the cost of 18 of the 387 GB flash SSDs at $1,949 apiece to give a pair of mirrored RAID 5 arrays, and you are talking about spending $53,420 for the storage. For the pair of 3.2 TB NVM-Express flash cards, you are talking about spending $9,198. No enclosure needed, no controllers needed, the IBM i operating system just talks to them directly over the PCI-Express bus – and that is a cost reduction of 82.8 percent along with a significant increase in the raw performance of the storage subsystem and, as the BatchWL test shows, the application performance.
This is a no-brainer for the user, but resellers who live on margin are going to be nervous about now having to make up their revenues in volume. But given this performance increase, at least they have something to sell. Drive up the utilization of those expensive cores and get work done quicker. That’s worth something.
This is just the beginning of our search for good information about NVM-Express performance on IBM i. Stay tuned, and if you have other data, please share.
RELATED STORIES
Tweaks To Power System Iron Complement TR Updates
IBM Revamps Entry Power Servers With Expanded I/O, Utility Pricing
IBM Doubles Up Memory And I/O On Power Iron To Bend The Downturn
The Skinny On NVM-Express Flash And IBM i
Power Systems Refreshes Flash Drives, Promises NVM-Express For IBM i
‘One funny thing we noticed.’ : Zoned Namespaces (ZNS) SSDs
https://www.golem.de/news/wd-ultrastar-dc-zn540-western-digitals-ssd-verhaelt-sich-wie-eine-hdd-2011-152003.html
https://blog.westerndigital.com/what-is-zoned-storage-initiative/
https://zonedstorage.io/introduction/zns/
https://www.youtube.com/watch?v=9yVWb3rbces (no longer need 1GB RAM/1TB flash!)
https://blocksandfiles.com/2020/11/09/western-digital-zn540-zoned-ssd/ “The host server manages data placement for the ZN540, a design that optimises performance and prolongs usable life by minimising writes.”
Once this gets PTF-ed into IBM i, things will get even better!
M.