NAND flash memory cells degrade slightly every time they’re written to, eventually failing. So, after getting a replacement for a failing drive, I decided to look up whether there was something I could do to help it outlast its predecessor.

The replacement was an Intel 545s, so I was happy to find a recent Intel whitepaper among my search results. The paper said that a drive’s endurance can be improved through ‘over-provisioning’ - leaving a part of the drive unpartitioned or otherwise unavailable to the host. What’s more, the paper said, over-provisioning would also improve random write performance.

Curious, I decided to check how much of a difference it would make for my budget consumer device, compared to the NVMe datacenter part serving as an example in the document. I found a sysbench usage example for testing random write performance, adjusted the total size of the files it creates down to 64GiB, because I didn’t want my benchmarking to affect the drive’s endurance too much, and ran the test both with and without over-provisioning. The over-provisioned configuration performed 4% worse. My disappointment was immeasurable.
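
For reference, a minimal version of such a test looks roughly like this - a sketch assuming sysbench’s fileio mode and the same legacy option style used further down, not my exact invocation:

# Lay down 64GiB of test files in the current directory
sysbench --test=fileio --file-total-size=64G prepare
# Random writes against those files; run length depends on your sysbench version's default time limit
sysbench --test=fileio --file-total-size=64G --file-test-mode=rndwr run
# Delete the test files afterwards
sysbench --test=fileio --file-total-size=64G cleanup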

Where are my performance gains?

In real-world scenarios. Or, to put it more accurately, in scenarios where more of the drive is in use as far as the drive controller is concerned. Over-provisioning doesn’t let a drive perform writes faster; it lets it perform the background tasks that shuffle data around faster.

To understand why a drive has to do these tasks even when it’s under heavy load, one has to understand the peculiarities of NAND flash. NAND flash memory cells are organized into collections called ‘pages’, which in turn are organized into ‘blocks’. Writes happen at the page level, but - unlike on a magnetic platter - data cannot be written over existing data. The existing data needs to be erased first, and that happens at the block level.

This means that, to change the contents of a single page in a full block, the controller would end up writing to every page in that block - both to put the new data in and to restore the pages that didn’t need to be altered. This is called ‘write amplification’, and if the controller did nothing to minimize it, your drive’s memory cells would fail extremely quickly. Instead, the controller marks the page being modified as invalid and writes the new data to an empty page. Eventually, to free the space taken up by invalid pages, the valid ones need to be moved elsewhere, and this juggling of data happens quicker and with less wear to the memory when there are plenty of empty blocks - which is why you also see recommendations to never run an SSD filled to near-capacity.
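
To put a rough number on the worst case, here’s some back-of-the-envelope shell arithmetic with made-up but plausible geometry (real page and block sizes vary between flash generations and aren’t exposed to the host):

# Hypothetical geometry: 16KiB pages, 256 pages per erase block
page_kib=16
pages_per_block=256
block_kib=$((page_kib * pages_per_block))   # a whole 4MiB block has to be erased and rewritten
# Changing one 16KiB page in place would thus cost a block's worth of writes
echo "worst-case write amplification: $((block_kib / page_kib))x"   # 256x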

But why leave space unpartitioned?

When a host deletes a file, it just deletes whatever information it uses to keep track of that file. It does not inform the drive that the blocks the file took up are now free - not until you run the fstrim command. So, as long as the host has access to a block, it’s liable to not be seen as empty by the drive controller and therefore not usable for data shuffling. This is why an unpartitioned, unaddressable area works better.
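
If you want to poke at this on your own system, these are the relevant commands (assuming an ext4 filesystem on /dev/sdb1 mounted at /mnt, as in the demonstration below):

# Does the device advertise TRIM/discard support at all?
lsblk --discard /dev/sdb
# Tell the drive which blocks the filesystem considers free; -v reports how much was trimmed
fstrim -v /mnt
# Many distributions already schedule this weekly via a systemd timer
systemctl status fstrim.timer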

Understanding all that, it’s clear why I saw no improvement in my benchmarks: my drive controller had no shortage of empty blocks and so gained nothing from over-provisioning.

How to demonstrate the advantage of over-provisioning?

# Sysbench measures file sizes in GiB, so 140G here is ~150GB out of our 256
sysbench --test=fileio --file-total-size=140G prepare
#Copying some of the test files to take up more of the flash blocks
for i in test_file.[1-8]? ; do cp $i copyof$i ; done
#Then removing the copies, leaving plenty of free space in the filesystem
rm copyof*
#Remounting to flush the write cache, so the copies actually hit the flash
cd ..
umount /mnt
mount /dev/sdb1 /mnt
#Running multiple shorter benchmarks instead of one long one, to show how performance changes over time
cd /mnt
for i in `seq 1 36`; do sysbench --test=fileio --file-total-size=140G --file-test-mode=rndwr --max-requests=0 run ; done | grep written
#Trimming free space, then shrinking the filesystem and partition
fstrim /mnt #This takes a long time
umount /mnt
fsck -f /dev/sdb1
resize2fs /dev/sdb1 186G #200GB, resize2fs works with GiB
parted -s /dev/sdb resizepart 1 200GiB
#Filling the flash blocks again
mount /dev/sdb1 /mnt
cd /mnt
for i in test_file.[1-4]? ; do cp $i copyof$i ; done
#Remounting to flush the write cache again
cd ..
umount /mnt
mount /dev/sdb1 /mnt
#Removing the extra files, leaving the same 150GB used in the filesystem
cd /mnt
rm copyof*
#And running the benchmark over-provisioned
for i in `seq 1 36`; do sysbench --test=fileio --file-total-size=140G --file-test-mode=rndwr --max-requests=0 run ; done | grep written

Sysbench’s output is a bit verbose. Here are some graphs I derived from it.

[Graphs: random write speed over time and total data transferred, for both configurations]
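
In case you want to reproduce them: all the numbers you need are on the ‘written, MiB/s’ lines that the grep keeps, so something like this (assuming sysbench 1.0’s output format) prints one plottable figure per run:

for i in `seq 1 36`; do sysbench --test=fileio --file-total-size=140G --file-test-mode=rndwr --max-requests=0 run ; done | awk '/written, MiB\/s/ {print $3}'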

And there you go. During both tests, according to the filesystem there was plenty of free space on the partition - roughly 100GB and 50GB respectively. According to the controller, though, only ~5GB worth of blocks on the drive were actually empty. And, like clockwork, the non-over-provisioned configuration started choking shortly after passing 5GB written, while the over-provisioned configuration did not, as it still had the over-provisioned area to work with to mitigate write amplification and manage wear-leveling.
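
Where does that ~5GB figure come from? A rough estimate, assuming sysbench’s default of 128 test files (~1.17GB each) and the copy loops above grabbing 80 and 40 of them respectively:

# Run 1: 150GB of test files + ~94GB of copies touched nearly all of the ~250GB of addressable flash
# Run 2: 150GB of test files + ~47GB of copies touched nearly all of the shrunken ~200GB partition
echo "$((150 + 80 * 150 / 128))GB and $((150 + 40 * 150 / 128))GB written before each benchmark"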

It’s an exaggerated example - I don’t expect to see that many untrimmed blocks on my drive, considering Ubuntu runs fstrim for me weekly - but it shows that over-provisioning can lead to improved performance. Anyway, thanks for reading, and if there’s anything you’d like to add to my understanding of the subject, address it to ‘comments’ at this domain.