Editors' note: This article is frequently updated to reflect changes in technology and in the marketplace.
The computing world runs on information, and handling it is crucial. So it's important that you select the best storage device to not only hold your data, but also distribute it. In this guide, I'll explain the basics of storage and list the features that you should consider when shopping. If you're ready to head to the store right now, though, I've also listed my top picks.
Power users hoping to get the most out of a home storage system should consider a network-attached storage (NAS) server such as a four- or five-bay NAS server from Synology, QNAP, Asus, Netgear, Western Digital or Seagate. Alternatively, if you want your new computer to run at its top speed, a solid-state drive (SSD) such as the Samsung 850 Pro or the Toshiba OCZ VX500, or an M.2 drive (if your computer supports it) will make that happen. But if you have an older machine and budget is an issue, there are more affordable SSDs, like the Samsung SSD 850 Evo or the OCZ Trion.
Want more SSD options? Check out this list.
If you just want to boost your laptop's storage space or find a quick way to back up your data, an affordable portable drive such as the WD My Passport Ultra or the Seagate Backup Plus Ultra Slim will do the trick. Again, more excellent portable storage drives can be found on this list.
There are three main areas you should consider when picking a storage device: performance, capacity and data safety. I'll explain them briefly here. After you're finished, I encourage you to check out this article for an even deeper dive into the world of storage.
Storage performance refers to the speed at which data transfers within a device or from one device to another. Currently, the speed of a single consumer-grade internal drive is largely defined by the Serial ATA interface standard (SATA). This determines how fast internal drives connect to a host (such as a personal computer or a server) or to one another. There are three generations of SATA -- the latest and most popular, SATA 3, caps at 6 gigabits per second (roughly 750 megabytes per second). The earlier SATA 1 (largely obsolete) and SATA 2 standards cap data speeds at 1.5Gbps and 3Gbps, respectively.
So what do those data speeds mean in the real world?
Consider this: At top speed, a SATA 3 drive can transfer a CD's worth of data (about 700MB) in less than a second. The actual speed of a hard drive may be slower because of mechanical limitations and overheads, but that should give you an idea of what's possible. A hard drive's real-world speed tends to be around one-tenth of the SATA 3 standard. SSDs, on the other hand, offer speeds much closer to the SATA 3 ceiling. Most existing internal drives and host devices (such as computers) now support SATA 3, and are backward-compatible with previous revisions of SATA.
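If you want to play with those numbers yourself, here's a rough back-of-the-envelope sketch in Python. It treats the interface ceilings as simple bits-to-bytes conversions and uses the article's own rough "one-tenth of SATA 3" figure for a hard drive, which is an estimate rather than a measurement.

```python
# Rough sketch: interface ceilings vs. a 700MB CD's worth of data.
SATA_GBPS = {"SATA 1": 1.5, "SATA 2": 3.0, "SATA 3": 6.0}
CD_MB = 700

for name, gbps in SATA_GBPS.items():
    mb_per_s = gbps * 1000 / 8          # interface ceiling in megabytes per second
    print(f"{name}: ~{mb_per_s:.0f}MB/s ceiling, "
          f"{CD_MB / mb_per_s:.2f}s for a 700MB CD at full speed")

# Real-world drives fall short of the ceiling: a hard drive manages roughly
# one-tenth of SATA 3, while a good SATA SSD gets close to it.
hdd_real_mb_s = 6.0 * 1000 / 8 * 0.1    # ~one-tenth of the SATA 3 ceiling
print(f"Typical hard drive: ~{hdd_real_mb_s:.0f}MB/s, "
      f"{CD_MB / hdd_real_mb_s:.1f}s for the same 700MB")
```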
Since 2015, there's been a newer standard called M.2, which is available only for SSDs. M.2 lets the storage device connect to a computer via PCI Express (the type of connection once used only to connect a video card to a motherboard) and is therefore much faster than SATA. Currently, only high-end desktop motherboards support M.2, and they tend to come with two slots. Some ultracompact laptops also have an M.2 slot instead of SATA. About the size of a stick of system memory, an M.2 SSD is much more compact than a regular SSD, yet it's much faster and can deliver the same amount of storage space. In the future, M.2 is expected to replace regular SATA drives completely.
Since internal drives are used in most other types of storage devices, including external drives and network storage, the SATA standard is the common denominator of storage performance. In other words, a single-volume storage device -- one that has only one internal drive inside -- can be as fast as 6Gbps. In multiple-volume setups, there are techniques that aggregate the speed of each individual drive into a faster combined data speed, but I'll discuss that in more detail in the RAID section below.
Capacity is the amount of data that a storage device can handle. Generally, we measure the total capacity of a drive or a storage system in gigabytes. On average, 1GB can hold about 500 iPhone photos or about 200 iTunes songs.
Currently, the highest-capacity 3.5-inch (desktop) internal hard drive can hold up to 10 terabytes (TB) or roughly 10,000GB. On laptops, the top hard drives as well as SSDs can offer up to 2TB.
While a single-volume storage device's capacity will max out at some point, there are techniques that make it possible to combine several drives to offer dozens of TBs and even more. I'll discuss that in more detail as well in the RAID section below.
The safety of your data depends on the durability of the drive on which it's stored. And for single drives, you have to consider both the drive's quality and how you'll use it.
Generally, hard drives are more susceptible to shocks, vibration, heat and moisture than SSDs. Durability isn't a big issue for a desktop since you won't be moving your computer very often (one hopes). For a laptop, however, I'd recommend an SSD or a hard drive that's designed to withstand falls and other sudden movement.
When it comes to portable drives, you can opt for a product that comes with layers of physical protection, such as the Glyph Blackbox Plus or the G-Tech G-Drive ev ATC. These drives are generally great for people working in rough environments.
But even when you've chosen the optimal drive for your needs, you mustn't forget to use backup, redundancy or both. Not even the best drive is designed to last forever -- and there's no guarantee against failure, loss or theft.
The easiest way to back up your drive is to regularly put copies of your data on multiple storage devices. Most external drives come with automatic backup or sync software for Windows. Mac users, on the other hand, can take advantage of Apple's Time Machine feature. All external drives work with both Windows and Macs, as long as they're formatted in the right file system: NTFS for Windows or HFS+ for Macs. The reformatting takes just a few seconds. If you're on a budget or want to quickly find the best portable storage system, here's our list of top portable drives.
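If you'd rather script that "copy your data to another device" routine yourself instead of relying on bundled software, here's a minimal Python sketch of the idea. The source folder and the external drive's mount point are hypothetical placeholders you'd replace with your own.

```python
# Minimal sketch of a manual backup: copy a folder tree to a dated folder
# on an external drive. Paths below are hypothetical examples.
import shutil
from datetime import datetime
from pathlib import Path

SOURCE = Path.home() / "Documents"            # what you want to protect
BACKUP_DRIVE = Path("/Volumes/MyPassport")    # hypothetical mount point of an external drive

def backup():
    target = BACKUP_DRIVE / f"backup-{datetime.now():%Y-%m-%d}"
    shutil.copytree(SOURCE, target)           # copy the whole folder tree
    print(f"Copied {SOURCE} to {target}")

if __name__ == "__main__":
    backup()
```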
But be warned -- this process isn't foolproof yet. Besides taking time, backing up your drive can leave small windows in which data may be lost. That's why for professional and real-time data protection, you should consider redundancy.
The most common approach to data redundancy is RAID, which stands for "redundant array of independent disks." RAID requires two or more internal drives, and depending on the setup, a RAID configuration can offer faster speeds, more storage space or both. Just note that standard RAID setups generally require drives of the same capacity. Here are the most common RAID setups.
RAID 1: Also called mirroring, RAID 1 requires at least two internal drives. In this setup, data writes identically to both drives simultaneously, resulting in a mirrored set. What's more, a RAID 1 setup continues to operate safely even if only one drive is functioning (thus allowing you to replace a failed drive on the fly). The drawback of RAID 1 is that no matter how many drives you use, you get the capacity of only one. RAID 1 also suffers from slower writing speeds.
RAID 0: Like RAID 1, RAID 0 requires at least two internal drives. Unlike RAID 1, however, it combines the capacity of the drives into a single volume while delivering maximum bandwidth. The only catch is that if one drive dies, you lose the information on all of the drives. So while more drives in a RAID 0 setup means higher bandwidth and capacity, there's also a greater risk of data loss. Generally, RAID 0 is used mostly for dual-drive storage setups. And should you choose RAID 0, backup is a must. RAID 0 is the only RAID setup that doesn't provide data protection.
For a storage device that uses four internal drives, you can use a RAID 10 setup, which is the combination of RAID 1 and RAID 0, for both performance and data safety.
RAID 5: This setup requires at least three internal drives, but it distributes data on all drives. Though a single-drive failure won't result in the loss of any data, performance will suffer until you replace the broken device. Still, because it balances storage space (you lose the capacity of only one drive in the RAID), performance and data safety, RAID 5 is the preferred setup.
RAID 6: This array is similar to RAID 5, but it can survive two of its internal drives failing at the same time. RAID 6 is generally used in storage devices that have five internal drives or more. In a RAID 6 setup, you lose the capacity of two internal drives.
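To see how those trade-offs play out in usable capacity, here's a small sketch of the arithmetic for the RAID levels above, assuming identical drives as standard RAID generally requires. The drive count and size are just example values.

```python
# Sketch: usable capacity for common RAID levels with identical drives.
def usable_capacity(level: str, drives: int, drive_tb: float) -> float:
    if level == "RAID 0":    # striping: all capacity, no protection
        return drives * drive_tb
    if level == "RAID 1":    # mirroring: capacity of one drive
        return drive_tb
    if level == "RAID 10":   # striped mirrors: half the total (needs 4+ drives)
        return drives * drive_tb / 2
    if level == "RAID 5":    # lose one drive's worth to parity (needs 3+ drives)
        return (drives - 1) * drive_tb
    if level == "RAID 6":    # lose two drives' worth to parity
        return (drives - 2) * drive_tb
    raise ValueError(level)

for level, n in [("RAID 0", 2), ("RAID 1", 2), ("RAID 10", 4), ("RAID 5", 4), ("RAID 6", 6)]:
    print(f"{level} with {n} x 4TB drives: {usable_capacity(level, n, 4):.0f}TB usable")
```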
Most RAID-capable storage devices come with the RAID setup pre-configured, so you don't need to set that up yourself.
Now that you've learned how to balance performance, capacity and data safety, let's consider the three main types of storage devices: internal drives, external drives and network-attached storage (NAS) servers.
Though they share the same SATA interface, the performance of internal drives can vary sharply. Generally, hard drives are much slower than SSDs, but SSDs are much more expensive than hard drives, gigabyte for gigabyte.
That said, if you're looking to upgrade your system's main drive -- the one that hosts the operating system -- it's best to get an SSD. You can get an SSD with a capacity of 256GB (currently costing around $150 or less), which is enough for a host drive. You can always add more storage with an external drive or, in the case of a desktop, another regular secondary hard drive.
Though not all SSDs offer the same performance, the differences are minimal. To make it easier for you to choose, here's our list of the best internal drives.
External storage devices are basically one or more internal drives put together inside an enclosure and connected to a computer using a peripheral connection.
There are four main peripheral connection types: USB, Thunderbolt, FireWire and eSATA. Most, if not all, new external drives now use just USB 3.0 or Thunderbolt or both. There are good reasons why.
USB 3.0 offers a cap speed of 5Gbps and is backward-compatible with USB 2.0. Thunderbolt caps at 10Gbps (or 20Gbps with Thunderbolt 2.0), and you can daisy-chain up to six Thunderbolt drives together without degrading the bandwidth. Thunderbolt also makes RAID possible when you connect multiple single-volume drives of the same capacity. Note that more computers support USB 3.0 than Thunderbolt, especially among PCs. All existing computers support USB 2.0, which also works with USB 3.0 drives (though at USB 2.0 data speeds).
Generally, speed is not the most important factor for non-Thunderbolt external drives. That may seem counterintuitive, but the reason is that the USB 3.0 connectivity standard, which is the fastest among all non-Thunderbolt standards, is slower than the speed of SATA 3 internal drives.
Capacity, however, is a bigger issue. USB external drives are the most affordable external storage devices on the market, and they come with a wide range of capacities to fit your budget. Make sure to get a drive that offers at least the same capacity as your computer. Check out our list of best external drives for more information.
There's no difference in terms of performance between bus-powered (a data cable is also used to draw power) and non-bus-powered (a separate power adapter is required) external drives. Generally, only single-volume external drives that are based on a laptop 2.5-inch internal drive can be bus-powered, and these drives offer around 2TB of storage space. Non-bus-powered external storage devices mostly use 3.5-inch internal drives and can combine multiple internal drives, so they can offer more storage space.
Currently, Thunderbolt storage devices are more popular for Macs, and unlike other external drives, deliver very fast performance. They are significantly more expensive than USB 3.0 drives with prices fluctuating a great deal depending on the number of internal drives you use. Here's our list of the top Thunderbolt drives.
A NAS device (aka NAS server) is very similar to an external drive. But instead of connecting to a computer directly, it connects to a network and offers storage space to all devices on the network at the same time.
As you might imagine, NAS servers are ideal for sharing a large amount of data between devices. Besides storage, NAS servers offer many more features, like being capable of streaming digital content to network players, downloading files, backing up files from a network computer and sharing data over the internet.
If you're in the market for a NAS server, you should focus on the capacities of the internal drives used. Also, it's a good idea to get hard drives that use less energy and are designed to work 24-7 since NAS servers are generally left on all the time.
A final consideration when purchasing a storage device is the connection. Currently, it all comes down to USB vs. Thunderbolt, since other types are largely obsolete. Obviously you'll want to get a drive that can work with your computer. So if your machine has a Thunderbolt or Thunderbolt 2 port (as most Macs do) then you'll want to get a Thunderbolt drive. On the other hand, since most computers have at least one USB port, getting a USB-based drive is a safe bet. Some portable drives support both Thunderbolt and USB.
However, if you want your drive to be future-proof -- meaning it will not only work with your current computer, but also the computer you'll buy a year or three from now -- then you need one with a USB-C port. A USB-C portable drive will work with all existing computers when you use a USB-C-to-USB-A cable. If you have a computer that has a USB-C port, such as the MacBook, you can connect a USB-C drive to it by using a regular USB-C-to-USB-C cable.
Currently, all new computers with Thunderbolt will also support USB-C. This is because the latest version of Thunderbolt 3 has moved to use the same port type and cable as those of USB-C. In other words, every Thunderbolt 3 port will also function as a normal USB-C port and every Thunderbolt 3 cable will also work as a USB-C cable.
This guide, sponsored by StorPool Storage, was written by Marc Staimer, president and CDS of Dragon Slayer Consulting in Beaverton, OR, since 1998.
The 2024 Block Data Storage Buyer’s Guide
A Pragmatic Process to Selecting Primary Shared Block Data Storage
It was created for CEOs, CTOs, CFOs, and business leaders who are looking for a comprehensive understanding of a better approach to storage for their business. An approach that will help to gain control of overall operational costs, ensure the right storage for business needs, and position infrastructure to handle any amount of data growth today and in the future.
Additionally, IT team leaders, storage architects, and DevOps leaders will gain a comprehensive understanding of evaluating the right storage based on need-to-haves and want-to-haves to not only ensure a modern block storage system but a system with the right automation to give them the confidence in their daily routines that the storage is always working and delivering data when and where it is needed allowing IT teams to have the time to focus on the next innovation and the real business at-hand.
Introduction
There are many storage buyer's guides readily available on the internet. So why publish a new one? An extensive analysis of those guides reveals they're not very useful. They tend to be too rudimentary: they assume quite a bit, are too general, and fail to provide the down-to-earth steps necessary to determine the best, most reliable, and most cost-effective storage for the buying organization.
They’re generally organized into 3 sections:
Then there is the first and most useful section. Section 1 is useful in that it counsels what is generally needed for IT preparations in purchasing and utilizing a new storage system. That is the only similarity between other storage buyer’s guides and this one.
This demands specific knowledge.
Instead of devoting a complete section to this fundamental, task to basic knowledge such as knowing:
• The capacity requirements for both current data being stored and retained in addition to all projected future storage for the life of the system.
• The type of workloads that will be using the storage. They will vary. This essentially breaks down into application response times to the user. Response time is affected by the application server's performance and available resources, the networking between the user and the application server, the networking between the application server and the storage, and the storage system's performance and resources.
• Each workload's requirements for performance (latency, IO/s, and throughput), capacity, and data protection – RPO (the amount of data that can be lost; RPOs reflect the time between data protection or backup events) and RTO (how fast the workload needs to be back up and running).
• Government regulations, of which there are many. Some are industry-specific, like HIPAA or Basel II and III. Others are more general, such as GDPR or CCPA, and increasingly common requirements cover data privacy and data sovereignty – data kept in a specific country, geographic area, or on-premises.
• Storage security processes for preventing or at least mitigating breaches, data theft, malware infiltration, ransomware attacks, unauthorized access, accidental deletions, software corruptions, and malicious employees.
• Current or planned IT infrastructure on-premises and/or in the public cloud requirements. If the application workloads require NVMe/TCP at 25GbE and the current network is primarily 1Gb/s or even 10Gb/s, something has to change.
• Current and desired timeframe objectives when provisioning storage for new applications and DevOps – i.e., how long do the application owners and DevOps people have to wait? Too long and they will find their own storage. This is called 'Shadow IT'. Shadow IT is a problem for IT organizations: support eventually falls to them even when they had no say in the selection of the storage, no training, and no staff for it.
An outage is commonly the event that pulls IT into taking over the support.
• A frank assessment of internal storage knowledge, skills, and experience of the IT administrators responsible for the new storage system. The key word here is … frank.
• Timeframes required for the new system to be up and running. As the Covid-19 pandemic made clear, supply chains can be and are disrupted at times. Some vendors are better prepared than others.
• The budget for the new storage system – including all software and hardware; implementation costs; professional service costs; ongoing costs for maintenance, subscription, or STaaS; supporting infrastructure costs such as rack units (RU) and the allocated fixed data center overhead assigned per RU, power, cooling, UPS, switch ports, cables, conduit, transceivers, cable management, and their associated maintenance or subscription costs. One other cost that must be budgeted is the tech refresh cost at the storage system’s end-of-life. That includes the data migration cost, professional services cost, full time employee (FTE) cost spent on the tech refresh, and the co-resident cost for both storage systems while the tech refresh takes place. This cost is often left out of the budget.
The budget decision should not be limited to just a price per terabyte measure. Price per terabyte assumes that all other costs are equal. That is a very inaccurate assumption. Price per terabyte is horribly misleading and leads to poor decisions. It assumes there is only a single performance tier and all media is the same cost. That’s a fantasy. Every type of media has different performance characteristics, capacities, and cost.
That’s led many storage buyers to use 3 different measures per storage tier:
1. TCO over the storage system’s life per terabyte.
2. TCO per IO/s.
3. TCO per throughput in bytes per second.
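A minimal sketch of how those three measures can be calculated is shown below. Every figure in it is a hypothetical placeholder to be replaced with your own quotes, requirements, and facility costs.

```python
# Sketch of the three per-tier TCO measures. All numbers are hypothetical placeholders.
system_life_years = 5
tco = {                                   # total cost of ownership over the system's life
    "hardware_and_software": 250_000,
    "implementation_and_services": 30_000,
    "maintenance_or_subscription": 100_000,
    "power_cooling_rack_network": 60_000,
    "tech_refresh_and_migration": 40_000,
}
total_tco = sum(tco.values())

usable_tb = 500                           # usable capacity of the tier
rated_iops = 400_000                      # sustained IO/s of the tier
throughput_gbps = 8                       # sustained throughput in GB/s

print(f"TCO over {system_life_years} years: ${total_tco:,}")
print(f"1. TCO per usable TB: ${total_tco / usable_tb:,.0f}/TB")
print(f"2. TCO per IO/s:      ${total_tco / rated_iops:,.2f} per IO/s")
print(f"3. TCO per GB/s:      ${total_tco / throughput_gbps:,.0f} per GB/s of throughput")
```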
Objective: To Provide Buyers a Rational Block Storage Buyer’s Guide
Making this guide more pragmatic compels considerable differences from all the others currently available. It has to be simple and useful.
The 1st difference is the focus on block storage. It’s in the title. Why block? Because block storage is the principal choice for performance. Most primary storage is in fact block. That doesn’t mean this buyer’s guide ignores file and object storage. On the contrary, it explains the differences in storage types, use cases, and problems they solve. However, there is much more detail on the block storage problems, workarounds, capabilities storage buyers should be looking for – depending on their use cases, and how they should evaluate each vendor’s storage offerings.
The 2nd difference is the effort to educate and debunk storage myths. There are a lot of them.
The 3rd major difference is the emphasis on the 'why'. Why certain capabilities and features? The 'why' is the detail of the problems solved by block storage, why workarounds to some block storage problems fail, are unsustainable, and ultimately carry very high costs.
The 4th extremely valuable difference is a useful, simple, and pragmatic process tool that empowers storage buyers to compare and contrast different vendor block storage systems. It is designed to compare how well each block storage system solves the block storage problems, workarounds, and TCO.
The 5th fundamental difference is the detailing of a simple, intuitive process for calculating the real TCO for each block storage system. Too many storage buyer's guides emphasize the purchase price, subscription fee, or cloud storage-as-a-service fees. Those costs are often referred to as price/TB.
Differences Between Block, File, Object, and Unified Storage
Block storage
It is the universal underlying storage infrastructure. All data – block, file, or object – is ultimately stored in a block format on the storage media. That data is saved to storage media – SSD, HDD, tape, or optical – in fixed-size chunks called blocks. A data block is a sequence of bytes or bits with a fixed length known as the block size, and it generally contains a whole number of records. Each block has a unique address, and that address is the only metadata assigned to the block. Block management is handled by software that controls how blocks are placed, organized, and retrieved, correlated to that metadata.
Data is stored either on the internal media of a server or workstation, or on an external block storage system connected over a network – historically called a SAN. Note: the more common name today is storage network.
In block storage systems, each individual storage volume acts as an individual HDD configured by a storage administrator. These systems are connected to application or database servers in multiple ways:
1. DAS – i.e. external SAS, FC, or iSCSI ports on the storage controller.
2. High performance storage switched networking a.k.a. NVMe_oF:
• NVMe/RoCE – layer 2
• NVMe/FC – layer 2
• NVMe/IBA – layer 2
• NVMe/TCP – layer 3
• NVMe/NFS – layer 3
3. Standard storage switch networking a.k.a. SAN:
• SCSI over FC – layer 2
• iSCSI over Ethernet – layer 3
• SCSI over IBA – layer 2
Block storage is ideal for high-performance, mission-critical, data-intensive, and enterprise applications needing consistent low-latency, high-I/O performance. Applications such as relational databases, OLTP, eCommerce, or any application that demands subsecond or real-time response times. It is also ideal for high-speed analytics. In essence, when low latency, high IO/s, or very high throughput (measured as bytes per second) are the requirement, block storage is more often than not the fastest shared storage.
Block storage is data agnostic, working well for all structured and unstructured data types, hypervisor virtual drives, and persistent container storage. The strengths of block storage are performance and flexibility. The weaknesses tend to be technical complexity and cost.
File storage
It is a hierarchical storage methodology. It can use either SSDs or HDDs, and it's also known as file-level or file-based storage. This type of storage writes, organizes, and stores data as files, with the files organized in folders and the folders organized in a hierarchy of directories and subdirectories. Files are located via a path from directory to subdirectory to file. This organization occurs on the internal media of a server or workstation, or on a NAS system. That connection is via layer 3 networks – TCP/IP or NVMe/NAS over Ethernet or IB.
File storage is very well suited for unstructured file data, PACS DICOM health imaging data, backup data, cool data, and even cold data. It is ideal for unstructured data. However, it works adequately for structured data including many database types, hypervisor virtual drives, and persistent container storage. Its biggest strength is its simplicity.
File storage historically had trouble scaling to very large file counts. That’s changed in the past several years. It can now scale to many billions even trillions of files through clever metadata management.
The downside to file storage is its I/O performance is noticeably lower than block storage. It’s hampered by the additional file storage and management software layer latencies. That performance limitation reveals itself in reduced application response time, concurrent IO, and total throughput.
The exception to total throughput being lower is the parallel file system (PFS) storage. PFS aggregates or bonds multiple file storage servers simultaneously enabling very high throughput. It’s complicated and is primarily utilized in HPC and large unstructured data ingestion.
Object storage
Object storage organizes, stores, and manages data as discrete units or objects. Unlike files, they’re stored in a single repository. There is no need for nesting of files inside a folder that may be inside other folders. In other words, there is no hierarchical structure. All data is stored into a flat address space known as a storage pool or bucket. The blocks that make up an object are kept together while adding all of its associated metadata. One object storage advantage is the ability to add extensive and unique metadata to the objects. Users can define their own metadata. File storage generally cannot. (There are some limited exceptions.) Every object – file, photo, video, etc. – has a unique identifier. That metadata is an essential key to its value enabling extensive data analytics of the object data.
Similar to file storage, object storage is connected via layer 3 networks – TCP/IP over Ethernet. It can use both SSDs and HDDs, but primarily HDDs, because object storage is not known for being much of a performance play but rather a lower-cost storage play. SSDs only nominally improve the performance of object storage. Object storage I/O latency is notably higher than file and especially block storage.
Object storage is primarily utilized as secondary or tertiary storage for cool and cold data. It is mostly used for unstructured data found in data lakes – large amounts of cool data used in analytics and machine learning, public clouds, backups, and archives. It competes with file storage while making inroads in the DICOM PACS imaging healthcare markets as well as the media and entertainment markets.
Unified storage
It is the combination of all 3 storage types, or 2 of the 3. Combinations of block and file storage or of file and object storage are the most common unified storage types.
There is a problem with most proprietary and open-source unified storage systems. They are predominantly developed for one type of storage; the other types are gatewayed. Gateways add latency and reduce application performance. One proprietary file storage system converts blocks to files, and the files are then ultimately written as blocks. Remember, both file and object storage ultimately write data to the media as blocks. Converting blocks to files – in order to leverage file storage services – adds another layer of latency. Another open-source object storage system converts both files and blocks to objects before eventually writing the data to the media as blocks. Again, 2 layers of latency.
Unified storage works best when the majority of the data is written to its primary storage type. The other storage types are mainly for convenience, not performance purposes.
Differences between SDS and storage systems
Storage systems
Buying storage systems or renting them via STaaS means getting a complete integrated and tested system. Everything is integrated, burned in, and optimally aligned. Everything works. Support is a single vendor meaning one throat to choke when there’s a problem. And there are always problems.
The downsides are much higher costs; vendor lock-in (you can't buy software, parts, or maintenance from anyone else); and a significant lag of months or years behind servers in adopting CPU, networking, and storage media innovations – frequently skipping generations. The innovation lag exists because the storage vendor needs to get an ROI on each released model, which typically requires a minimum of 3 years.
SDS
It extracts the core storage operational software from the physical hardware. This enables the software to run on bare-metal commercial off-the-shelf (COTS) x86 servers (a.k.a. white boxes) or brand-name x86 servers, VMs, containers, or HCI. Gartner describes SDS as "storage controller software abstracted from the underlying hardware, so it can run on any hardware, any hypervisor, or on any cloud."
This offers several advantages over the storage system including:
• No hardware vendor lock-in.
• Hardware can be upgraded at any time.
• Current server contracts and discounts can be leveraged.
• The same media is available at much lower server prices, not storage system prices.
• Better control.
• Much more flexibility.
The downside of SDS is hardware troubleshooting, unless the SDS vendor takes ownership of the problem regardless of whether it lies in the software or the hardware.
Setting the Record Straight on Storage Media
There are 4 generally available storage media as of 2023. They are SSD, HDD, LTO tape, and optical.
Neither LTO tape nor optical is much used beyond archiving today. Optical is slow with low capacities. LTO tape usually resides in tape libraries, often offline, making access slow.
With the demise of Intel Optane, there are no commercially available NVM storage media beyond flash NAND or flash NOR – NAND is by far the dominant flash media. There are several NVM technologies in development; however, none have made it to volume production yet. That leaves SSDs and HDDs as the prevalent interactive storage media today.
SSDs
Flash NAND SSDs come in several shapes, form factors, bits-per-cell, capacities, and write-life ratings, referred to as drive writes per day (DWPD). DWPD is the number of complete writes of the drive's total capacity per day guaranteed by the manufacturer. NAND cells are arranged in program/erase blocks. Once data is written to one of these blocks, no additional data can be written to it, even when there is unwritten capacity on that block. Writing any additional data to the program/erase block requires a layer of material to be erased, which is a destructive process. That is why NAND flash has a limited number of writes.
Manufacturers deal with this by concatenating writes to consume entire program/erase blocks and by overprovisioning the SSDs with additional capacity to account for cells that fail or reach the end of their write life. That overprovisioning is not reflected in the SSD's usable capacity.
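As a quick illustration of what a DWPD rating implies, here's a small sketch. The capacity, DWPD value, and warranty period are hypothetical examples, not any particular vendor's specification.

```python
# Sketch: what a DWPD rating works out to over a warranty period (hypothetical specs).
def total_bytes_written_tb(capacity_tb: float, dwpd: float, warranty_years: int) -> float:
    """Total writes (in TB) guaranteed over the warranty period."""
    return capacity_tb * dwpd * 365 * warranty_years

capacity_tb = 7.68      # hypothetical TLC enterprise SSD
dwpd = 1.0              # one full drive write per day
warranty_years = 5

tbw = total_bytes_written_tb(capacity_tb, dwpd, warranty_years)
print(f"{capacity_tb}TB drive rated at {dwpd} DWPD for {warranty_years} years "
      f"=> ~{tbw:,.0f}TB of guaranteed writes")
```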
The most common flash NAND SSD is in the standard 2.5″ drive form factor. The other common form factor is the 3.5″ standard. But because flash NAND are chips, they’re not limited to standard drive form factors, which are artifacts from HDDs. There are M.2, pencil, EDSFF (E1.L, etc.) and custom form factors. A few storage system vendors use their own custom flash NAND SSD form factors. They assert it lowers cost and provides better write life. Both claims are highly debatable.
Flash NAND SSD capacity, performance, error correction, and write life are all affected by the number of bits in each cell. Each additional bit per cell reduces the write life – a.k.a. program/erase (P/E) cycles – reduces write/read performance, increases the power required to write or read, and increases the errors per write/read, requiring more error-correcting code.
There are 4 currently commercially available flash NAND SSD types:
1. SLC 1 bit per cell: It is by far the fastest and has recently been reclassified as storage class memory (SCM) by several storage vendors. SLC has the highest write life at around 100,000 writes per cell, lowest need for error correcting code, and highest price per terabyte.
2. MLC 2 bits per cell: It was the first attempt to reduce flash NAND SSD cost. Write life decreased to ~10,000 to 30,000 writes per cell. It has largely fallen out of favor because of more bits per cell, lower cost alternatives.
3. TLC 3 bits per cell: It appears to have achieved the correct balance between performance, write life, and cost even with significantly more overprovisioning than SLC or MLC. Write life per cell is ~1,000 to 3,000 writes per cell. An important flash NAND manufacturing innovation has been 3D layering that’s become common for TLC SSDs. This has enabled much larger capacities per TLC SSD – as high as 100TB in a 3.5” form factor – while lowering the price per terabyte.
4. QLC 4 bits per cell: It is primarily used as a competitor for nearline HDDs. It has a limited write life of ~100 to 300 writes per cell meaning the number of DWPD are severely limited. It also takes advantage of 3D layering to deliver higher capacities – as high as 64TB in a 3.5″ form factor and one coming out at 128TB in a 2.5” form factor – at lower price per terabyte than even TLC. Performance is also noticeably lower than TLC.
Several storage vendors pushing all-flash arrays claim their QLC drives are cost-equivalent to nearline HDDs. Even when considering HDDs' greater consumption of power, cooling, and rack units for the same capacities, those claims should be viewed with significant skepticism. Make them prove it, and make them spell out specifically what they're comparing. Several recent studies put QLC SSDs at 5 to 7x more expensive than high-capacity HDDs. That difference is forecast to decline to only 4 to 5x more expensive by decade's end.
Every realistic, honest comparison shows nearline HDDs are still considerably less costly. This is critically important since the amount of an organization's cool or cold data is typically ≥9x that of its hot data. QLC will definitely outperform HDDs in transactional read/write performance, but sequential read/write is a different story, where the performance differences are negligible.
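When a vendor makes the cost-parity claim, a simple side-by-side like the sketch below is one way to make them prove it. All prices here are hypothetical placeholders standing in for real street quotes and your own facility costs.

```python
# Sanity-check sketch for the "QLC is as cheap as nearline HDD" claim.
# All dollar figures are hypothetical placeholders; substitute real quotes.
hdd_price_per_tb = 15                    # hypothetical nearline HDD $/TB (raw)
qlc_price_per_tb = 90                    # hypothetical QLC SSD $/TB (raw), ~6x premium
hdd_power_cooling_ru_per_tb = 10         # hypothetical lifetime facility premium for HDD, $/TB

hdd_effective = hdd_price_per_tb + hdd_power_cooling_ru_per_tb
print(f"HDD effective cost: ${hdd_effective}/TB")
print(f"QLC effective cost: ${qlc_price_per_tb}/TB")
print(f"QLC premium: {qlc_price_per_tb / hdd_effective:.1f}x even after HDD facility costs")
```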
HDDs
They have been whittled down to 3.5″ form-factor nearline drives. SSDs killed off 2.5″ 15,000rpm HDDs and, for the most part, 10,000rpm HDDs. Nearline HDDs spin at 7,200rpm or less. These are known as high-capacity or fat drives and are mostly used for cool and cold data – backup and archive. Current capacities top out at 22TB in a 3.5″ form factor, with a road map to 100TB in the next few years.
Will QLC SSDs or possibly future PLC (penta-level-cell or 5 bits per cell) SSDs replace HDDs? Not now, but possibly sometime down the road. It’s a bit unlikely in the next few years, contrary to some vendor assertions.
Fundamental Application Problem Uniquely Solved by Block Storage
Application response time. Storage performance metrics do not refer to application response times, but they should. They talk only about storage performance. Performance metrics are measured as latency (i.e. delay), IO/s, and throughput (bytes per second, or Bps). These performance metrics ultimately affect application response times.
Applications that demand lower latency and high IO/s for faster response times will generally connect to block storage. No other storage type matches block storage in latency and I/O performance. Throughput is debatable specifically when comparing parallel file storage that binds multiple ports and storage controllers/servers. But there is no question that block storage is preferred when low latency, high IO/s, and greater transactions per second are required. In other words, structured data applications. This is why the cloud storage for most transactional applications is block storage.
Some may question why they may need storage with high performance, low latency, and faster IO/s. They might think their current storage performance is good enough. They're not thinking about application response time. More specifically, sub-second application response times. They don't realize or understand that sub-second response times lead to much greater user productivity, higher morale, faster time-to-market, faster time-to-actionable-insights, faster time-to-revenues and unique revenues, lower headcount, and higher profits. IBM research demonstrated this unequivocally in 1982 with their Red Book called The Economic Value of Rapid Response Time (https://jlelliotton.blogspot.com/p/the-economic-value-of-rapid-response.html).
“When an application and its users interact at a pace that ensures that neither has to wait on the other, productivity soars, the cost of the work done on the application’s computer infrastructure tumbles, users get more satisfaction from their work, and their quality improves.”
The research further revealed that as application response times reached ≤400ms, which is four tenths of a second, productivity soared, and users became addicted to that response time. This is called the Doherty Threshold. It’s cleverly explained in a clip of an episode of the television series “Halt and Catch Fire” on YouTube.
More detail about this IBM research can be found in Appendix A.
The most important takeaway from that research is the incredible importance of low-latency, high-performance I/O block storage. It leads to lower costs, faster time-to-market, faster time-to-actionable-insights, faster time-to-unique-revenues, and faster time-to-substantial-profits.
Failure to utilize such storage can and often does result in slower application response times, lower productivity, late-to-market products and services, reduced revenues, and increased costs. That translates into a competitive disadvantage for both users and hosting providers.
Keep in mind there are several shortcomings and problems with many block storage systems and software. Working around or solving those problems is crucial to leveraging the power of block storage.
Block Storage Shortcomings, Workarounds, Consequences, and How to Better Solve
The biggest block storage problems center decidedly around performance restrictions, non-scalable performance, capacity scalability boundaries, compromised data protection, extensive software inefficiencies, unnecessary complexity/lack of automation, and excessive cost.
Performance restrictions
Block storage system and software performance has a history of declining when capacity utilization exceeds 50%. It falls off a cliff at 80% utilization. It's why most vendors recommend never exceeding 80% capacity utilization regardless of the amount of capacity installed. Avoiding that threshold frequently requires the addition of more capacity. What it really means is that at least 20% of your capacity is wasted and can never be consumed. Please note that the 80% capacity utilization threshold is typically the same for file and object storage.
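A quick sketch of what that threshold really does to your effective price per terabyte, using hypothetical numbers:

```python
# Sketch: effective $/TB when only 80% of purchased capacity can safely be consumed.
# The capacity and price figures are hypothetical placeholders.
raw_usable_tb = 1000          # capacity you paid for
price_per_tb = 400            # hypothetical street price per usable TB
utilization_ceiling = 0.80    # vendor-recommended maximum before performance falls off

consumable_tb = raw_usable_tb * utilization_ceiling
effective_price_per_tb = (raw_usable_tb * price_per_tb) / consumable_tb

print(f"You bought {raw_usable_tb}TB but can safely consume {consumable_tb:.0f}TB")
print(f"Quoted price: ${price_per_tb}/TB, effective price: ${effective_price_per_tb:.0f}/TB")
# A system that can run at >=90% utilization yields ~$444/TB on the same quote.
```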
Workaround: buy extra capacity that you won’t actually use
This workaround alleviates capacity utilization performance degradation, but it increases the amount of unusable capacity. There is a hard limit to this workaround with every storage system or SDS. It ultimately is a costly and unsustainable workaround.
Workaround: buy block storage systems that scale beyond 80% of your perceived requirements
The problem with this workaround is that it adds significant cost without addressing the performance decline above 50% utilization. It only pushes out the 80% cliff.
Better solution: buy block storage systems that utilize ≥90% of capacity without performance degradation
There are a few block storage systems or software that can do this today. They all should.
Non-scalable performance
Most block storage systems and software are limited to 2 storage controllers or servers in an active-active or active-passive configuration. In either case, each controller/server has access only to the data it specifically stored. Only when one of the controllers/servers fails can the other access its stored data. There are some exceptions to this norm.
What it means is the active-active block storage aggregate performance is somewhat restricted to that of a single controller/server or at best 2. That might be adequate for your current and perceived growth rates. If it’s not, then how do you solve it?
Workaround: buy more block storage systems
There are several problems with buying more block storage systems. It’s not just the cost, although cost is a major issue. A bigger issue is the management. Each new storage system adds more than 100% more management tasks. These are duplicate tasks that humans are not very good at. And then there are the additional tasks that grow increasingly complicated as more and more systems are added to address performance.
Workaround: Go all-flash or faster flash
If the storage media is the performance bottleneck, that can work. NVMe flash SSDs have lower latency, higher IO/s, and higher throughput than SAS or SATA SSDs. However, flash SSD performance is rarely the root cause of the performance problem. Need proof? Add up the aggregate IO/s and throughput performance of all of the flash drives in an AFA. It's likely greater than the rated performance of the AFA. Although NVMe bypasses SAS or SATA controllers, reducing overall latencies, the bigger and more consistent performance bottleneck is the storage controller/server.
The reason behind the storage controller/server performance bottleneck is the de facto standardization on the x86 CPU. The x86 CPU became the de facto standard because of Moore's Law: doubling the transistors and performance every two years. It enabled storage software to be inefficient, because the inefficiency was covered up by the exponential growth in x86 performance every couple of years. But all good things come to an end, and Moore's Law has been slowing substantially. Now, instead of performance doubling, the trend is to double or at least increase the number of cores with only marginal improvements in the performance of each core. And the timeframe between new CPU generations has stretched beyond 2 years. All of this leads to storage controllers/servers that are not improving much year over year.
The non-scalable storage performance culprit is more likely the storage controller/server, not the media, nor whether or not the storage array is all-flash. Then take into consideration that storage controllers are generally limited to 2 sockets (CPUs), most block storage systems or software are limited to 2 storage controllers/servers, and it does not take a rocket scientist to identify the bottleneck.
Better Solution: buy scale-out block storage
Scale-out block storage solves the limited number of storage controllers/servers and eliminates that performance bottleneck. Scale-out can be shared nothing or shared everything. Shared nothing scale-out block storage tends to have better performance with lower latency and higher IO/s than shared everything. Shared everything will have lower scalability limits than shared nothing. Both solve the limited controller performance bottleneck problem.
Compromised data protection
Storage data protection is table stakes today. But what that encompasses varies considerably among storage systems and software.
Snapshots
A good example of compromised data protection is snapshots. Snapshots have been touted as an excellent way to protect data on a storage system. But the number of snapshots supported per volume has become inadequate in the modern data center. That snapshot number is critical in determining RPO a.k.a. the amount of data that can be lost. RPOs are the time between snapshot events. When snapshots are taken once a day then the RPO is 24h. That’s how much data will be lost in an outage requiring a recovery. When it’s 4x a day, the RPO is 6h. More snapshots per volume enables smaller RPOs. Fewer snapshots mean larger RPOs. Too many block storage systems or software have limited snapshots per volume. This becomes critical when snapshots need to be retained for very long time periods.
A common snapshot limitation is approximately 256 per volume. While that may seem like a lot, a little math shows why it's not. If the snapshot retention policy is a year, 256 snapshots only allow one snapshot approximately every 34 hours. That's less than a snapshot a day, with a very high RPO. Even when the retention is just 30 days, it works out to at best one snapshot every 2 hours, 48 minutes, and 45 seconds. Not exactly a small RPO, especially if you need RPOs in single-digit minutes.
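The arithmetic above is easy to reproduce; here's a small sketch of the snapshot-limit math:

```python
# Sketch: best-case RPO given a per-volume snapshot limit and a retention policy.
def best_case_rpo_hours(snapshot_limit: int, retention_days: int) -> float:
    """Smallest snapshot interval (in hours) that still fits the retention window."""
    return retention_days * 24 / snapshot_limit

for retention in (365, 30):
    interval = best_case_rpo_hours(256, retention)
    hours = int(interval)
    rem_min = (interval - hours) * 60
    minutes = int(rem_min)
    seconds = round((rem_min - minutes) * 60)
    print(f"256 snapshots, {retention}-day retention: "
          f"one snapshot every ~{hours}h {minutes}m {seconds}s (best-case RPO)")
```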
Why the limitation? It comes down to storage snapshot inefficiencies. When snapshot software is inefficient, and most are, each snapshot consumes a large amount of storage controller/server resources. Snapshots are extremely fast, but enough of them can cause a noticeable decrease in storage I/O performance. That's the foremost reason snapshots are limited.
Workaround: Use a 3rd party data protection product and services
There are several 3rd party data protection products and services that provide small RPOs. However, they are essentially fixing a storage system feature deficit for an oversized cost.
Better Solution: block storage that provides lots of snapshots
A block storage system that delivers a large or unlimited number of snapshots enables smaller RPOs. Done efficiently, this will not negatively impact I/O performance.
Ransomware and malware
Ransomware has become a considerable threat to all organizations. Attackers can and will delete storage snapshots and backups. This has made snapshot immutability important. Immutability prevents the data from being corrupted, changed, or deleted. Preventing malicious ransomware from stealing admin credentials that could override that immutability calls for 2-factor authentication (2FA). Few block storage systems provide both today.
Malware and ransomware can both copy and steal your most sensitive and mission-critical data. Data breaches will hurt your organization in litigation, painful regulatory fines, reputation, lost customers, revenues, and massive repair costs. Storage can mitigate these events with 2FA and internal data encryption. 2FA prevents the ransomware or malware from copying out your data without having compromised both devices. The internal encryption prevents physical access to the data stored on the media or unauthorized reading of the data. This is especially important when a drive fails and is retired.
One more storage ransomware mitigation feature is anomaly detection. This capability essentially detects an anomalously high change rate in the data, which occurs when ransomware is encrypting the data. The storage system can provide an alert that this is going on. The storage admin can then take steps to stop the encryption, block the offending ransomware, and initiate a recovery from the latest immutable snapshot.
Good Solution: Use a 3rd party data protection product or services
There are several 3rd party data protection products or services that can provide small RPOs, immutable backups, 2FA, and encryption. Many database applications will also store their data encrypted. The best 3rd party data protection applications even proactively scan the backed up data to eliminate infected, but not yet detonated ransomware and malware.
Good Solution: block storage with built-in snapshot immutability, 2FA, and internal encryption
Block storage snapshot immutability can prevent ransomware from corrupting, changing, or deleting snapshots. 2FA prevents ransomware or other malware from accessing, stealing, or breaching your data. Internal storage encryption prevents physical access to the data. This last defense prevents a disgruntled or malicious employee from taking the media and the data on it. It also provides security of the data when a drive fails and has to be disposed of. And it prevents unauthorized software applications from reading the data.
Failed drive rebuilds decimate performance
RAID has been the block storage array’s failed drive protection since the early 2000s. RAID was effective when drive capacity was measured in megabytes or gigabytes. Now that drive capacity ranges up to 100TB, RAID has become a problem.
Unfortunately, RAID decimates I/O performance during a drive rebuild. A single failed drive can reduce I/O performance by as much as half. Two concurrent drive failures being rebuilt in RAID-6, will reduce storage system performance by as much as 80%. Larger capacity drives (15TB, 30TB, 100TB, 128TB SSDs and 18TB, 20TB, 22TB HDDs) take much more time to rebuild. Time that often stretches into weeks.
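A rough sketch of why those rebuild times stretch so far is shown below. The sustained rebuild rate used here is a hypothetical assumption; real-world rates vary widely and drop further when rebuilds run in the background.

```python
# Sketch: minimum rebuild time for a single large drive at a hypothetical sustained rate.
REBUILD_MB_PER_S = 200   # hypothetical sustained rebuild rate; real rates vary widely

def rebuild_hours(drive_tb: float, rebuild_mb_per_s: float = REBUILD_MB_PER_S) -> float:
    return drive_tb * 1_000_000 / rebuild_mb_per_s / 3600

for drive_tb in (4, 22, 100):
    print(f"{drive_tb}TB drive at {REBUILD_MB_PER_S}MB/s sustained: "
          f"~{rebuild_hours(drive_tb):.0f} hours minimum to rebuild")
```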
General Just Ok workaround: Run the drive rebuilds in background
Running a drive rebuild in the background keeps the reduction in I/O performance minimal, though much less so when two failed drives are being rebuilt concurrently. The problem with running RAID rebuilds in the background is that it doubles, triples, or even quadruples the rebuild time. That exposes the data being rebuilt to a greater risk of loss should another drive in the RAID-5 group, or 2 more drives in the RAID-6 group, fail. That is statistically more likely to happen than most IT professionals recognize.
General Just Ok workaround: Recover any RAID group failures with snapshots
Snapshots are generally an ok workaround. Most IT organizations would prefer not to experience a preventable outage because of the snapshot RPO issue. There will be lost data. And an outage means there is some level of downtime. Downtime is very costly in productivity, lost revenues, customers, stock value, and reputation.
Much Better Solution: block storage with fast rebuilds, low resource intensity, erasure coding
Erasure coding has been touted as the next generation of RAID. It breaks blocks of data into chunks and places them on different drives behind different data storage controllers/servers. The number of concurrent tolerated drive failures is often configurable and flexible. It also protects the data from one or more storage controller/server failures. Erasure coding does not rebuild the drive, it rebuilds the data on available capacity on multiple drives. That shortens rebuild times considerably. Data rebuilds are accelerated as more storage controllers/servers are utilized for the rebuilds.
The problem with erasure coding is that it is resource intensive. It needs a lot of storage controller/server resources to write, read, and rebuild the data that was on a failed drive or storage controller/server. That in turn greatly reduces I/O and throughput performance, which has historically relegated it to secondary storage for cool and cold data, such as scale-out file or object storage, but not block storage.
That is no longer the situation. A few block storage suppliers have solved the erasure coding resource requirements, making it much more efficient. In this new generation of erasure coding there is little to no impact on I/O and throughput performance. Additionally, multiple concurrent drive rebuilds occur without reducing I/O and throughput performance or requiring downtime. It's a very good idea to pick a storage system that provides highly efficient erasure coding.
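For reference, the basic erasure-coding arithmetic looks like this. The k+m layouts shown are illustrative examples, not any specific vendor's implementation.

```python
# Sketch: k data chunks plus m parity chunks spread across different drives/controllers
# tolerate m concurrent failures, at the cost of (k+m)/k raw capacity.
def ec_profile(k: int, m: int, usable_tb: float):
    raw_tb = usable_tb * (k + m) / k          # raw capacity needed for the usable amount
    overhead_pct = (raw_tb / usable_tb - 1) * 100
    return raw_tb, overhead_pct

for k, m in [(4, 2), (8, 2), (8, 3)]:
    raw, overhead = ec_profile(k, m, 100)
    print(f"{k}+{m}: tolerates {m} concurrent failures, "
          f"needs {raw:.0f}TB raw for 100TB usable ({overhead:.0f}% overhead)")
```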
Time consuming and costly block storage tech refresh
Tech refresh is a must for all storage. Whether it be SDS or a complete storage system, the hardware and drives must be refreshed to take advantage of ongoing technology improvements. Improvements in interconnect, NICs, switches, faster and/or denser drives, latest PCIe generation, CPUs, memory, controllers, etc. Storage technologies are constantly innovating. Leveraging that innovation requires some level of tech refresh, a.k.a. replacing the hardware. Doing so is non-trivial.
Block storage tech refresh is the bane of every data center. It takes extensive effort, skill, expertise, time, and money. It’s complicated and commonly requires at least one outage – downtime. Ongoing research by DSC found the average time to complete a block storage system tech refresh is approximately 9 months. That’s 9 months of duplicate storage systems on the data center floor being powered, cooled, managed, maintained, and paid for concurrently.
Another Costly Workaround: Buy a storage system that auto-replaces controllers
There are several storage vendors that offer this type of package on their systems. It provides assurance that their active-active controllers will be replaced every 3 to 5 years as they come out with new releases. There are several flaws with this workaround.
The minimum timeframe for the controller replacement is 3 years, and a lot of innovation occurs in those 3 years. The cost of this additional insurance policy is very high: the math shows the total cost of replacing those storage controllers is 3x more than just buying new controllers, although doing the latter may incur additional tech refresh time and costs. Additionally, the bigger cost of a tech refresh is replacing the drives. The drives are typically warrantied for 3-5 years and will eventually wear out and fail. Replacing the storage controllers and not the media is very risky.
Always remember, the first rule of storage is to "do no harm to the data". The second rule: remember the first rule.
Just an Ok Workaround: Buy and use HCI
HCI utilizes its own SDS as its storage. The upside of this approach is that HCI SDS allows new server nodes to be added to the cluster without a significant effort in tech refresh.
The downsides include the inability of the storage to be shared outside of the cluster. It generally can't be shared by outside application servers or other HCI clusters. Another downside is the limitation on data protection: for most HCI SDS, snapshots are limited, RAID is the standard, and protection against node failures requires triple-copy mirroring at a minimum.
Better Solution: shared block storage with simple tech refresh
This would allow new storage controller or server nodes to be added without an outage. Active-active systems would replace the first controller as if it were a failover event; after completion, the next controller would be replaced the same way. Drives could be replaced as if they had failed and been rebuilt. Or, if the drawers are not full, put in the new drives, add them to the same volume and RAID group, copy the data from the old drives, and then remove the old drives from the volume and RAID group.
Scale-out shared block storage is simpler. Add a new controller with internal drives to the cluster; replicate the configuration from the node to be retired including volumes, data protection, permissions; retire the old node from the cluster. Adding just new drives is also simple. Place the drives in empty drive slots anywhere in the scale-out cluster, configure them into the desired volumes, data protection, permissions, etc. Done.
Extensive software inefficiencies
Storage controllers/servers have fixed resource constraints. They have limited CPU cycles, memory, and internal bandwidth. And as previously noted, the vast majority of block storage controllers/servers are active-active or active-passive with each one generally dual socket (2 CPUs).
The problem occurs because storage software is inefficient. It specifically took advantage of Moore’s law, which as previously discussed has been rapidly slowing down. It was assumed that any software inefficiencies would be taken care of by ongoing CPU releases that doubled in performance each time. But those software inefficiencies are no longer covered up by new CPUs. The only way to fix this is to redevelop the storage software efficiently. Few software developers have any desire to redevelop their code, which is what they have to do to make it efficient. This is one of the reasons you only occasionally see erasure coding in block storage software.
A Temporary Workaround: Buy and implement more of the same block storage systems
The most common workaround is to buy and implement more of the same data storage systems. It’s NOT a good idea and doesn’t work out well. It’s based on the misguided perception that adding more of the same types of storage systems will eliminate additional training requirements while adding nominally more management tasks. That perception is incorrect. Not only do all of the management tasks have to be duplicated, but it also adds plenty of complications. Complications like setting up duplicate volumes, load balancing, and data replication between systems to protect crucial data in the event of a storage system failure. That complexity and its cost increase exponentially as more and more systems are added. This workaround is ultimately unsustainable.
Better Solution: efficient block storage software
Much more efficient block storage software solves a lot of problems: performance problems, data protection problems, and the need for complicated workarounds.
What’s better is to marry that efficient block storage software with non-disruptive scale-out. That combination gets more performance out of less hardware while enabling flexible scaling.
Unnecessary complexity/lack of automation
Far too many block storage systems require administration by experts: storage admins with knowledge, skill, and experience. That paradigm is problematic as baby boomers retire and fewer new hires can replace those capabilities. The learning curve is steep, and the gap is increasingly problematic.
A Very Costly Workaround: put all your application workloads and storage in a public cloud
Moving to the public cloud means someone else has to manage the storage infrastructure. No expertise needed. Lower cost. At least that’s what many IT pros believe before they move to the public cloud.
Although they don't manage the storage, they still need to know its strengths and limitations. Take AWS, the largest worldwide public cloud services provider, with its 3 provisioned-IO/s flash SSD block volume types:
Any application workload requiring more than 256,000 IO/s is completely out of luck. A transactional database would have to be sharded and split across multiple instances, each with its own io2 volume, to get around the ceiling. Besides being extremely labor-intensive and time-consuming, it's also quite expensive: additional database instances for each shard, added hardware infrastructure to run each shard, and the additional provisioned-IO/s SSD volumes. It adds up quickly and delivers quite the bill shock.
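To see how quickly sharding adds up, here is a rough Python sketch. The 256,000 IO/s ceiling is the figure cited above; the per-shard instance and volume costs are placeholder assumptions, not AWS list prices.

import math

# Placeholder assumptions -- not AWS list prices.
MAX_IOPS_PER_VOLUME = 256_000      # per-volume ceiling cited above
COST_PER_DB_INSTANCE = 5_000.0     # hypothetical monthly cost per database shard instance
COST_PER_VOLUME = 8_000.0          # hypothetical monthly cost per provisioned-IO/s volume

def sharding_estimate(required_iops):
    """Shards needed to satisfy a workload, and a rough monthly cost for them."""
    shards = math.ceil(required_iops / MAX_IOPS_PER_VOLUME)
    monthly_cost = shards * (COST_PER_DB_INSTANCE + COST_PER_VOLUME)
    return shards, monthly_cost

if __name__ == "__main__":
    for iops in (200_000, 600_000, 1_500_000):
        shards, cost = sharding_estimate(iops)
        print(f"{iops:>9,} IO/s -> {shards} shard(s), ~${cost:,.0f}/month (placeholder pricing)")

Even with made-up prices, the cost scales linearly with every additional shard, which is the bill-shock effect described above.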
A big myth about public cloud data storage is that it's cheaper than on-premises storage. That myth is persistent and still not true. It rests largely on S3 object storage pricing rather than block or file storage, and it doesn't hold up even for object storage. A thorough DSC analysis, conducted on three different occasions over 4 years, compared data storage like-to-like: block-to-block, file-to-file, and object-to-object. The results showed that if you have your own data center or are in a co-lo, cloud data storage is less costly only for roughly the first 18 months, based on street pricing from multiple storage vendors and public clouds. After that, public cloud storage is much more costly. When compared with STaaS programs for on-premises data storage, where the storage is provided in a cloud-like pricing model, public cloud storage is more costly from day one.
An Unsustainable Workaround: Scripts
Scripts are a very common fallback for any system or application that lacks automation. Scripts use the tools and API of the data storage system to provide some of the automation they may need. There are many problems with scripts.
Scripts are rarely documented, either on paper or within the code. Then there's the lack of both initial and ongoing quality assurance (QA). Scripts are rarely updated when the data storage software is patched or updated; mostly they get updated only when they break in production, and once again those changes are unlikely to be documented or go through QA testing. The script's user interface is usually not intuitive to anyone other than the author.
Then there's the issue of what happens when the person who wrote the script changes roles or leaves the organization. If the script isn't documented, and it likely is not, the new storage administrator may have no idea how it works or what it does. That leads to the script being completely rewritten, which takes considerable effort and time, and once again the documentation, QA, patching, and testing gaps tend to repeat themselves.
Better Solution: High levels of automation or STaaS
Extensive automation puts the expertise into the block data storage software and not the administrator. It enables the non-expert and expert alike to get the most optimized system with minimal effort. It greatly simplifies implementations, operations, management, scaling, and tech refresh. The greater the automation the easier and more intuitive it becomes.
STaaS means someone or something else – like automation or AI – handles and manages everything from implementation and operations to management, scaling, and tech refresh. The responsibility for the data storage infrastructure sits with the service provider.
Excessive cost
Cost is one of the most misunderstood aspects of block data storage. Too many buyers focus on a single cost metric: the net purchase price per terabyte. That focus rests on false assumptions, chief among them that a large discount means a good deal:
Enterprise-class hardware averages 75% discounts in the market, whereas mid-tier class hardware averages only 60%. Those look like very large discounts; they are not. Enterprise and mid-tier storage vendors generally inflate MSRP so they can appear to be giving you a great deal. They're not. Keep in mind that the media is the largest hardware cost in a storage system, and the average storage system media street price, after all discounts, is minimally 3x the average street price for the same drive in a server.
Better Solution: Calculate the TCO
The first 2 parts of TCO are “hard” and “soft” costs.
Hard costs are directly measurable and include:
• Purchase price, subscription, or monthly STaaS fees;
• All other required hardware infrastructure – servers, NICs, adapters, switch ports, switches – purchase price, subscription, or as-a-service fees;
• RUs consumed multiplied by the data center fixed overhead allocated per RU, plus the cables, transceivers, and conduit supporting the storage.
Soft costs are indirect costs that include:
• Personnel such as storage admins.
• Personnel churn.
• Training costs.
• Productivity costs – costs directly tied to application response times that exceed the Doherty Threshold. See Appendix A.
The third set of costs are opportunity costs: lost revenues that result from inadequate storage performance.
These include (a simple cost-model sketch follows this list):
• Late versus early time-to-actionable-insights.
• Time-to-action.
• Time-to-market.
• Time-to-unique-revenues/profits.
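Below is a minimal Python cost-model sketch that adds up the three buckets just described: hard costs, soft costs, and opportunity costs. All figures and field names are illustrative placeholders; plug in your own measured values.

from dataclasses import dataclass

@dataclass
class HardCosts:
    storage_purchase_or_fees: float     # purchase price, subscription, or STaaS fees
    other_infrastructure: float         # servers, NICs, adapters, switch ports, switches
    data_center_overhead: float         # RUs x allocated overhead per RU, cables, transceivers, conduit

@dataclass
class SoftCosts:
    personnel: float                    # storage admin FTE cost attributable to this storage
    churn_and_training: float           # turnover, new-hire, and training costs
    productivity: float                 # cost of response times above the Doherty Threshold

@dataclass
class OpportunityCosts:
    lost_revenue: float                 # late time-to-insight, time-to-action, time-to-market

def total_cost_of_ownership(hard, soft, opportunity):
    """Sum the three TCO buckets described above."""
    return (hard.storage_purchase_or_fees + hard.other_infrastructure + hard.data_center_overhead
            + soft.personnel + soft.churn_and_training + soft.productivity
            + opportunity.lost_revenue)

if __name__ == "__main__":
    tco = total_cost_of_ownership(
        HardCosts(400_000, 120_000, 60_000),        # placeholder 3-year figures
        SoftCosts(225_000, 40_000, 90_000),
        OpportunityCosts(150_000),
    )
    print(f"3-year TCO (placeholder inputs): ${tco:,.0f}")

The point of the structure is that the soft and opportunity buckets are first-class inputs, not footnotes.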
Evaluating and Comparing Block Data Storage Systems or Software
Know thyself
As discussed in the introduction, it starts with knowing all of your needs, requirements, and current and future hardware infrastructure. Below is an important application workload worksheet.
Next up is making sure you know what your current infrastructure looks like in servers, networking, and current storage systems or software.
Features/Functions/Capabilities
Unless a feature, function, or capability serves a specific purpose, such as meeting one of your requirements or solving one of your problems, it is only technology. It may be interesting, but it is not particularly useful if it does not meet or exceed a requirement, solve a problem, or advance your organization in a meaningful way. If it does none of those things, it's a waste of time and money. Many storage systems have features meant to differentiate them from competitors, whether those features are useful or not. Often the unique feature exists simply to fix a product problem.
Here is a great example: several years ago, a major storage vendor used Microsoft Windows as part of its storage OS. Then a major Windows malware attack occurred worldwide. When the malware was cleared out of all of the infected servers, something kept reinfecting them. It was tracked down to the storage system. The vendor had to install anti-virus software on all of its storage systems, then turned that negative into a marketing positive by saying, "Our storage systems have anti-virus software, does yours?" Never mind that no other storage required it.
Caveats
Some vendors push de-dupe and/or compression as a way to make their cost/TB appear lower. De-dupe and compression are only effective for some types of data such as backups, archives, or various types of unstructured data. They’re not nearly as effective with most database data, video, audio, or encrypted data. In fact, many of the applications that produce this type of data have their own much more effective compression and de-dupe built-in. And many encrypt the data before it’s written to storage.
Take Oracle Database as an example. The best-case de-dupe and compression of Oracle Database data is ≤2:1, while Oracle's built-in hybrid columnar compression achieves 10:1 or higher in some cases. Storage-based de-dupe and compression add latency on both writes and reads, noticeably reducing performance. Just remember: they are primarily aimed at reducing capacity cost at the expense of performance.
Another caveat is how vendors measure their performance. For latency, is it measured from the application server to the storage and back? Is it based on reads or writes? For IO/s, what is the payload – 4K blocks, 8K blocks, 16K blocks? Are the IO/s sequential or random reads or writes, or a mix of random reads and writes? Throughput is measured in bytes per second, and the key is how it's calculated: bidirectionally or unidirectionally? And for all of these measurements, are they reported per workload, per node, or per system?
Because vendors’ performance statistics tend to use different methodologies, or in some cases do not report any performance statistics, comparisons can be difficult. This needs to be normalized for effective and correct comparisons. The best way to normalize performance comparisons is to provide each of them with a benchmark or test that emulates your workloads. Have each of the vendors run it in their labs based on the configuration they want to sell to you.
Remember that published performance statistics will likely not resemble what you will see. A lot of that has to do with mixed workloads, inefficient storage software, not enough nodes and lots of different demands happening concurrently.
Comparing Products and/or Services: Step 1
Start by creating general categories for performance, scalability, data protection (includes security), manageability (ease of use or simplicity), and cost. Then under each category, list all of the features, functions, and capabilities you require or think you need.
Next, mark every feature, function, and capability with one of 3 classifications. Items with the first classification are the must-haves; failure to provide any one of them is a deal breaker. Good examples are specific performance or data protection capabilities: if an OLTP database workload needs 1 million IO/s, you can't accept block storage that doesn't provide at least that much, and probably more. The 2nd classification marks the important capabilities. These are perceived as borderline necessities, but you can live without some of them. A common example is RTO: you may want instantaneous RTOs, but can you live with a minute or two?
The 3rd classification identifies the nice to haves. These are advantageous, have value, but are not necessarily a requirement. Yet they can be the difference maker.
Then give each storage system or SDS you are evaluating its own column. Rank each feature, function, and capability between 1 and 5, with 1 being the lowest and 5 the highest. You can optionally add weights, but that adds complexity the 3 classifications make unnecessary. You should be looking for 100% of the must-haves, 60-85% of the important items, and 25-50% of the nice-to-haves.
See example spreadsheet below.
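In addition to the spreadsheet, here is a minimal Python sketch of the same scoring approach. The feature names, ranks, and the rule that a rank of 3 or higher counts as "provided" are assumptions for illustration; the pass thresholds follow the guidance above.

# Illustrative scoring sketch for Step 1. Feature names, scores, and the
# "rank >= 3 counts as provided" rule are assumptions for demonstration only.

CLASSES = ("must_have", "important", "nice_to_have")

def coverage(scores):
    """scores: {feature: (classification, rank 1-5)} -> percent of each class ranked >= 3."""
    totals = {c: 0 for c in CLASSES}
    provided = {c: 0 for c in CLASSES}
    for _, (cls, rank) in scores.items():
        totals[cls] += 1
        if rank >= 3:
            provided[cls] += 1
    return {c: (100.0 * provided[c] / totals[c]) if totals[c] else 100.0 for c in CLASSES}

def passes(cov):
    """Apply the guidance above: 100% must-haves, >=60% important, >=25% nice-to-haves."""
    return cov["must_have"] == 100.0 and cov["important"] >= 60.0 and cov["nice_to_have"] >= 25.0

if __name__ == "__main__":
    vendor_a = {
        "1M IO/s OLTP":           ("must_have", 5),
        "Immutable snapshots":    ("must_have", 4),
        "Instant RTO":            ("important", 2),
        "Non-disruptive refresh": ("important", 4),
        "QoS per volume":         ("nice_to_have", 3),
    }
    cov = coverage(vendor_a)
    print(cov, "->", "meets criteria" if passes(cov) else "does not meet criteria")

Repeat the same scoring for each candidate column and the comparison falls out directly.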
If none of the vendors meet your criteria, you have 2 choices. Either look for a storage vendor that manages to meet the criteria or reevaluate and modify the criteria.
TCO
TCO is much more than simply the purchase price plus maintenance, leasing/financing, or STaaS fees. There are many costs that need to be counted. There also needs to be accounting for the savings from application software license or subscription consolidation and from hardware consolidation, both made possible when faster storage delivers faster application response times. Those faster application response times also convert into user productivity gains, FTE time freed for more strategic tasks, reduced FTE turnover and training, and unique revenue gains.
All of these positive credits against cost are considered "soft" by many CFOs, who will disregard them or leave them out. That's a huge mistake, because the credits can and commonly do dwarf the costs, and there are significant, dramatic differences between vendors. Here's how to calculate those credits.
When the application is waiting on the storage, faster storage translates into faster application response times, and faster response times enable application and database consolidation. Application consolidation savings are the easiest to calculate. Applications are typically licensed or subscribed on an instance, socket, or core basis; this is definitely true for databases, and even open source software support is priced per instance. Applications that charge per seat tend to be SaaS, and per-seat licensing will not see any license savings from consolidation. Both cases still see hardware consolidation savings, because the application software runs on fewer virtual or physical servers. The credits therefore come from reduced license, maintenance, or SaaS fees plus reduced hardware.
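As a rough illustration, here is a Python sketch of the consolidation credit. The instance counts, cores per instance, per-core license fee, and per-instance server cost are placeholder assumptions, not any vendor's pricing.

def consolidation_credit(instances_before, instances_after,
                         cores_per_instance, license_per_core,
                         server_cost_per_instance):
    """Annual credit from running the same workloads on fewer instances/servers.
    All inputs are placeholder assumptions for illustration."""
    removed = instances_before - instances_after
    license_savings = removed * cores_per_instance * license_per_core
    hardware_savings = removed * server_cost_per_instance
    return license_savings + hardware_savings

if __name__ == "__main__":
    credit = consolidation_credit(
        instances_before=12, instances_after=8,     # 4 database instances consolidated away
        cores_per_instance=16, license_per_core=3_000,
        server_cost_per_instance=15_000,
    )
    print(f"Annual consolidation credit (placeholder inputs): ${credit:,.0f}")

For per-seat SaaS licensing, set license_per_core to zero and the hardware term is what remains.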
Productivity savings credit calculations are also straightforward. See Appendix A to calculate productivity cost savings.
Calculating FTE time cost savings requires knowing the average block storage admin's fully loaded annual cost. It varies by region, but a good rule of thumb is approximately $150,000 including all benefits. Next, determine the number of hours saved because of automation, intuitive simplicity, and/or STaaS. This requires some research: talk to the current in-house block storage admins, then estimate the total hours saved per week, month, and year. Multiply those hours by the average FTE cost per hour: 252 working days x 8 hours = 2,016 hours per year, and $150,000 / 2,016 ≈ $74.40 per hour.
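The same arithmetic expressed as a small Python helper. The $150,000 loaded cost, 252 working days, and 8-hour days follow the rule of thumb above; the hours saved per week is whatever your own research turns up (the 6 hours in the example is an assumed figure).

def fte_time_savings(hours_saved_per_week, annual_fte_cost=150_000.0,
                     working_days_per_year=252, hours_per_day=8):
    """Annual credit from admin hours freed by automation, simplicity, or STaaS."""
    hours_per_year = working_days_per_year * hours_per_day       # 252 * 8 = 2,016
    hourly_cost = annual_fte_cost / hours_per_year               # ~$74.40 per hour
    weeks_per_year = 52
    return hours_saved_per_week * weeks_per_year * hourly_cost

if __name__ == "__main__":
    print(f"${fte_time_savings(6):,.0f} per year")   # assumed 6 admin hours saved per week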
There are also turnover, new-hire, and training savings, which are likewise tied to freeing up the admin's time. This requires estimating how improved morale will decrease turnover. Start by estimating the cost of hiring a new admin, which is much higher when executive recruiters are used. Then add the training cost for a new block storage admin, including any vendor system training and any other tools they need to come up to speed on. Multiply that cost by the number of admin FTE turnovers avoided per year.
The last credit is unique revenue gained as a result of faster application response times. This is the most difficult to calculate, but that doesn't mean it shouldn't be calculated; it definitely should. Start with the new product or service revenue projections based on time-to-market, and determine how much of that schedule is accelerated by the productivity gains from faster application response times. Then apply the projected growth curve from the new, earlier starting point and compare the 3-year and 5-year revenues to the original projected revenue curve. The difference between the new and old curves is the net gain.
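Here is a simple Python sketch of that comparison. The launch revenue, monthly growth rate, horizon, and months of acceleration are placeholder assumptions; the logic simply shifts the same growth curve to an earlier start and takes the difference over the same horizon.

def projected_revenue(monthly_start, monthly_growth_rate, months):
    """Simple compounding monthly revenue projection."""
    total, monthly = 0.0, monthly_start
    for _ in range(months):
        total += monthly
        monthly *= (1 + monthly_growth_rate)
    return total

def unique_revenue_gain(monthly_start, monthly_growth_rate, horizon_months, months_accelerated):
    """Revenue over the horizon with an earlier start minus the original projection."""
    accelerated = projected_revenue(monthly_start, monthly_growth_rate,
                                    horizon_months + months_accelerated)
    baseline = projected_revenue(monthly_start, monthly_growth_rate, horizon_months)
    return accelerated - baseline

if __name__ == "__main__":
    # Placeholder: $2M/month at launch, 3% monthly growth, 36-month horizon, 3 months earlier.
    gain = unique_revenue_gain(2_000_000, 0.03, 36, 3)
    print(f"Unique 3-year revenue gain (placeholder inputs): ${gain:,.0f}")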
These credits against cost all depend on different block storage delivering different performance, both now and later at scale. If performance were identical, the credits would not matter; performance is nowhere near the same.
Many block storage systems will not reduce FTE time, improve user productivity, or increase unique revenues, because they do not accelerate application response times in any material way.
Recommendations
Choosing the best block storage for your organization always comes down to your needs now and in the future. Do not sacrifice the future for the immediate present, as too frequently happens. A specific block storage product might meet your needs up front but fail to keep up as those needs grow, especially in performance, and a lower cost up front does not mean lower cost over 3 to 5 years. With a little effort and planning, you can meet your needs now as well as in the future, and it will pay off in a very large way.
Keeping that in mind, here are the block storage capabilities you should look for to solve these problems and make your life much easier.
1. Best block storage type
• Scale-out – because it solves several difficult problems.
• Eliminates the limited storage controller/node/server problem.
• Provides highly scalable performance and capacity.
• Delivers non-disruptive patching, upgrading, and tech refresh.
• Examples: StorPool, Nvidia-Excelero, Dell PowerFlex, iXsystems TrueNAS, NetApp Solidfire
2. Most flexible block storage implementation – SDS
• Radically reduces the cost of hardware.
• Runs on COTS servers, so you can leverage your current server vendors.
• Enables faster adoption of faster CPUs, networks, and media for competitive advantages.
• Media is more than 67% less costly.
3. Ideal performance
• Very low latency and high IO/s
• Empowers sub-second response times for mission-critical and business-critical applications.
• These applications are not waiting on the storage.
• High throughput
• Enables analytics, ML, deep ML, process AI, or generative AI to deliver faster results.
• Extremely efficient capacity utilization that does not impact performance.
• Aim for 90% or better without performance degradation.
• Snapshots that do not degrade performance while being taken.
• Drive or data rebuilds that do not degrade performance while occurring.
4. Essential performance and capacity scalability
• A good rule of thumb for future performance scalability needs is the 2x exponential factor.
• Take the performance requirements required in year 1 and double it for year 2.
• Take the performance requirements required in year 2 and double it for year 3.
• Take the performance requirements required in year 3 and double it for year 4.
• Is that more than what’s required? Possibly, but just as likely not.
• The historical capacity planning rule of thumb – take the 3-year estimate and double it – is no longer valid.
• Stored data is doubling approximately every 2 years, and the rate is accelerating.
Because of ML, DML, pAI, gAI, analytics, data lakes, data lakehouses, IoT, and more.
• Plan for 4-10x the current amount of data stored over the next 3 to 5 years (see the sketch after this list).
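A short Python sketch of the two rules of thumb above: double the performance requirement each year, and plan for 4-10x today's stored data over the next 3 to 5 years. The starting IO/s and capacity figures are placeholders.

def performance_plan(year1_iops, years=4):
    """Double the year-1 performance requirement each subsequent year."""
    return [year1_iops * (2 ** (y - 1)) for y in range(1, years + 1)]

def capacity_plan(current_tb, multiplier_low=4, multiplier_high=10):
    """Plan for 4-10x today's stored data over the next 3 to 5 years."""
    return current_tb * multiplier_low, current_tb * multiplier_high

if __name__ == "__main__":
    print("IO/s required by year:", performance_plan(250_000))   # placeholder year-1 requirement
    low, high = capacity_plan(500)                                # placeholder: 500 TB stored today
    print(f"Capacity to plan for: {low}-{high} TB")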
5. Better Data Protection
• Enough snapshots per volume to meet mission-critical and business-critical application RPOs
Now and over the next 3-5 years.
Expect more workloads requiring a minimum RPO as time goes on.
• Immutable snapshots to prevent unauthorized deletions or modifications.
• 2FA to prevent unauthorized access or changes.
• Erasure coding across storage drives and nodes.
As long as it does not negatively impact performance.
Rebuilds data, not drives: faster rebuilds, reduced risk, and reduced capacity needs (see the capacity-efficiency sketch after this list).
At a minimum, RAID-5, -6, -1, -10, -50, or -60, although erasure coding should be preferred.
• Triple-copy mirroring or more between storage nodes.
Delivering continuous data access for multiple hardware failure types.
• End-to-end data encryption.
Prevents unauthorized access of data in-flight or at-rest.
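To see why erasure coding reduces capacity needs relative to triple-copy mirroring, here is a small Python capacity-efficiency sketch. The 8+2 layout and the 1,000 TB raw figure are illustrative assumptions.

def usable_fraction_mirroring(copies=3):
    """Triple-copy mirroring keeps 1 usable copy out of N total copies."""
    return 1.0 / copies

def usable_fraction_erasure_coding(data_chunks=8, parity_chunks=2):
    """k+m erasure coding keeps k usable chunks out of k+m stored chunks."""
    return data_chunks / (data_chunks + parity_chunks)

if __name__ == "__main__":
    raw_tb = 1_000   # placeholder raw capacity
    print(f"3-copy mirroring:  {raw_tb * usable_fraction_mirroring():.0f} TB usable")
    print(f"8+2 erasure coding: {raw_tb * usable_fraction_erasure_coding():.0f} TB usable")

The trade-off, as noted above, is that erasure coding only makes sense when it does not degrade performance.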
6. Simpler manageability
• Non-disruptive patching, upgrades, drive replacements, software changes, or tech refresh.
Disruptions have to be scheduled for non-production hours in a 7x24x365 world; they run on tight schedules, are pressure-filled, and are often error-prone, requiring do-overs.
Disruptive patching delays implementation of vulnerability patches.
Non-disruptive changes can be made online during production hours without taking an outage.
• Intuitive GUI
Enables a novice storage admin to use and figure out everything with little to no training.
• CLI or RESTful API
For integration with other data center management tools.
• Threshold alerts and alarms
Optional predetermined actions such as removing a drive from a volume based on errors.
Useful in dealing with problems promptly.
7. High degree of Automation
• More automation equals fewer human errors, less required expertise, and reduced admin workloads.
• Reduces storage admin turnover.
• Saves time and money.
8. Lowest TCO/performance
• Make sure you're looking at the lowest TCO, not the lowest price.
That is, TCO per IO/s and TCO per bit per second of throughput.
9. Choice of payment methods
• STaaS consumption model: pay for what you use.
Reduces risk, management, and overpaying upfront.
But it can cost more in the long run.
Per Gartner, STaaS can decrease IT spend by up to 40%.
• Purchase upfront + warranty + premium maintenance
The traditional method.
Risk of overbuying upfront to take advantage of the upfront discount.
Premium maintenance is based on MSRP, not net price.
Final Thoughts
Choosing the right block storage is non-trivial. It will have repercussions for years and greatly affect your organization’s ability to compete. Do not make your buying decision lightly or cavalierly. It is too important.
Appendix A: Rapid Response Time Economic Value
Calculating performance productivity impact:
• Performance has a substantial and measurable impact on productivity.
Response time correlates directly with user productivity, quality-of-work, and time-to-market.
The maximum application response time before user productivity declines precipitously is 3s; anything over 2s causes user attention to wander.
Application response times below 3s quickly increase user productivity, quality-of-work, and time-to-market.
Reducing response time to ~0.3s more than doubles productivity versus 2s.
Productivity gains are substantially greater for more experienced users.
• Faster response times mean shortened project schedules and higher work quality.
• ≤0.4s equates to what is called the Doherty Threshold: the point at which system response becomes addictive. Above 0.4s, users' attention begins to stray and productivity decreases rapidly.
• Determine application response times for each service under consideration.
• Compare productivity rates.
• Divide FTE costs by productivity to calculate FTE cost per transaction.
• One alternative is to compare the time required to complete a defined number of transactions.
• Multiply the time saved by the average FTE hourly cost (see the sketch after this list).
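A Python sketch of the per-transaction comparison above. The transactions-per-hour figures are placeholder assumptions meant to reflect the roughly 2x productivity difference between ~0.3s and 2s response times discussed earlier; the $150,000 FTE cost follows the rule of thumb used in this document.

def fte_cost_per_transaction(annual_fte_cost, transactions_per_hour,
                             working_days=252, hours_per_day=8):
    """Divide FTE cost by productivity (transactions) to get cost per transaction."""
    hours_per_year = working_days * hours_per_day
    hourly_cost = annual_fte_cost / hours_per_year
    return hourly_cost / transactions_per_hour

if __name__ == "__main__":
    # Placeholder productivity rates for ~0.3s vs. 2s response times.
    fast = fte_cost_per_transaction(150_000, transactions_per_hour=180)
    slow = fte_cost_per_transaction(150_000, transactions_per_hour=80)
    print(f"~0.3s service: ${fast:.2f}/transaction, 2s service: ${slow:.2f}/transaction")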
Time-to-market revenue acceleration increases top-line revenues and bottom-line profits.
Based on current schedules, estimate the following:
• Amount of revenue for each week or month schedule is moved up.
• Project how much the reduced application response times will accelerate time-to-market. This can be derived from the productivity increase tied to application response time: if developers can more than double their productivity, they can cut the time to complete their project by more than half.
• Apply the projected market growth rate to that revenue for a set period, anywhere from 1-10 years. Compare the total revenues to what they would have been had the schedule not been accelerated. The differences are the unique gains. If the chosen storage or database service instead delays time-to-market, then the differences are unrecoverable losses.
• Example from a large microchip manufacturer:
By accelerating delivery of their chip to market by one quarter, they were able to realize more than $100 million in unique revenue upfront and 5x that amount over 3 years.