Virtualized environments are growing fast, and with them the need to manage large numbers of virtual objects. Storage systems are part of every environment, whether virtualized or non-virtualized, public or private cloud, small or large datacentre. The performance and scalability of storage systems are receiving a lot of attention from professionals, vendors and organizations. This whitepaper focuses on performance and scalability testing of storage systems.
Meeting the performance and scalability requirements of storage systems is one of the most important objectives of product teams. The list of parameters that can affect performance and scalability results is long. A fully scaled environment involves a large number of hardware devices, network devices, storage nodes, servers and software components, so finding bottlenecks can become very complex. Setting up a scalability environment accurately is usually time consuming and requires careful adherence to industry best practices for optimized usage of storage systems. Designing and executing test cases, analysing results and drawing conclusions require a systematic approach.
Before diving into the topic, this whitepaper explains the basic terminology associated with performance and scalability testing of storage systems. With some examples, it will help you find the scalability limits of storage systems. Along the way you will learn what a performance baseline is and how it can be established. The paper explains how results from baseline and scaled environments can be compared and analysed, describes some key factors that can affect storage performance, and discusses guidelines to ensure system stability in a scaled environment.
2. Performance and Scalability terminology
The primary objective of performance and scalability testing is to find out how far your storage systems can scale without performance degradation. Some of the important terms are discussed briefly below.
Performance is the speed at which a storage system operates. IOPS, throughput and latency are the important measurements of storage system performance.
Scalability is the ability of a storage system to continue to function with little or no drop in performance when it changes in size, volume or any other parameter.
IOPS refers to input/output operations per second. Analysed in isolation, IOPS can be misleading. IOPS is meaningful only when considered together with latency and workload (e.g. block size, sequential or random access, read or write). For example, if a storage system produces 2000 IOPS with a 4K block size and 1000 IOPS with an 8K block size, it does not mean that the storage system performs better with the 4K block size. In fact, when all other factors are kept constant, both cases transfer the same amount of data per second (2000 x 4K = 1000 x 8K = 8000 KBps), so throughput is equal.
Throughput is the amount of data transferred in a unit of time and is measured in kilobytes per second (KBps) or megabytes per second (MBps).
Latency is the time taken to complete an IO request and is usually measured in milliseconds (ms).
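The relationship between IOPS, block size and throughput in the 4K and 8K example above can be checked with a few lines of Python. This is a minimal sketch; the function name `throughput_kbps` is ours and is used only for illustration.

```python
def throughput_kbps(iops: float, block_size_kb: float) -> float:
    """Throughput in KBps for a given IOPS rate and block size in KB."""
    return iops * block_size_kb


# The two cases from the IOPS example: different IOPS, same data rate.
small_blocks = throughput_kbps(iops=2000, block_size_kb=4)
large_blocks = throughput_kbps(iops=1000, block_size_kb=8)
print(small_blocks, large_blocks)
```

Both calls return the same value, which is why IOPS figures should only be compared at the same block size.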
- IO Workload
IO workload defines block size, read/write percentage and percentage of random/sequential access.
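One possible way to record an IO workload so that it can be reused across test runs is a small data structure. The sketch below is an assumption on our part, not part of any tool: the class name `IOWorkload` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IOWorkload:
    block_size_kb: int
    read_pct: int    # write percentage is 100 - read_pct
    random_pct: int  # sequential percentage is 100 - random_pct

    def __post_init__(self):
        # Reject values that cannot describe a real workload.
        if self.block_size_kb <= 0:
            raise ValueError("block size must be positive")
        for name in ("read_pct", "random_pct"):
            value = getattr(self, name)
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be between 0 and 100")

    @property
    def write_pct(self) -> int:
        return 100 - self.read_pct

    def describe(self) -> str:
        return (f"{self.block_size_kb}K, {self.read_pct}% read / "
                f"{self.write_pct}% write, {self.random_pct}% random")


mixed = IOWorkload(block_size_kb=8, read_pct=60, random_pct=100)
print(mixed.describe())
```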
- Establishing Baseline
Establishing a baseline is the process of defining acceptable performance results on a predefined hardware and software configuration. This requires executing different workloads and arriving at an acceptable level of performance that is agreed by stakeholders. As the scalability acceptance criterion, you can also define how much performance degradation is acceptable with respect to the baseline.
The hardware and software configuration should be defined according to the test case and the measurements required. In this process, you will finalize the number of storage nodes, the number of volumes, the number of CPUs and the memory for virtual machines, the volume size, the cache size, etc.
For example, a LUN or volume scalability test might require 8 storage nodes in the cluster with 16 volumes created as the baseline configuration. As the test is expected to verify volume scalability, the count of storage nodes in the cluster and all configuration parameters other than the number of volumes remain constant in the scaled configuration.
For node scalability, the baseline configuration might require just 2 storage nodes. As the system scales, the storage node count increases, while the number of volumes and all configuration other than the storage node count remain constant.
In the case of volume scalability, IOPS in the scaled configuration is compared with the baseline configuration at various queue depths, whereas for node scalability the IOPS comparison is performed at various node counts.
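The rule that only the parameter under test may change between the baseline and the scaled configuration can be verified before a run. The following is a minimal sketch: the 8 nodes and 16 volumes come from the volume scalability example above, while the remaining keys and values (including the scaled volume count of 256) are hypothetical.

```python
def changed_parameters(baseline: dict, scaled: dict) -> set:
    """Return the names of parameters whose values differ between two configurations."""
    keys = baseline.keys() | scaled.keys()
    return {key for key in keys if baseline.get(key) != scaled.get(key)}


# Hypothetical configurations for a volume scalability test.
baseline_config = {"storage_nodes": 8, "volumes": 16,
                   "volume_size_gb": 100, "cache_size_gb": 64}
scaled_config = dict(baseline_config, volumes=256)

# Only the parameter under test should differ.
print(changed_parameters(baseline_config, scaled_config))
```

If the returned set contains anything other than the parameter being scaled, the comparison with the baseline is no longer like for like.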
3. Finding scalability limits with some examples
When a storage system scales in terms of the number of storage nodes, LUNs/volumes, etc. and its capacity is almost full, you might not get the expected performance. Monitoring is required at the storage, compute, network and virtualization levels. You can finalize the list of workloads to be simulated depending upon the application you are going to run in the production environment. Charts and graphs based on periodically collected IOPS, throughput and latency are very useful for finding deviations in performance after the storage system scales.
The first step in finding scalability limits is to compare the performance measured in the baseline configuration with that measured in the scaled configuration. The following two examples demonstrate performance deviation. The first example shows a deviation within acceptable limits, whereas the second shows performance degrading significantly after the system scales. The examples are explained with the help of IOPS and latency at multiple queue depths, which are important parameters from a scalability testing perspective. Please note that these graphs and results are for illustration purposes only.
To test the scalability of the system, the storage cluster was prefilled to 90% of its capacity. The LUNs/volumes were distributed equally across the storage nodes that are members of the clustered storage system. The setup configuration is given in Table 1.
Workload configuration used in examples:
Example 1: Block size = 8K, Read = 60%, Write = 40% and Access = random
Example 2: Block size = 8K, Read = 0%, Write = 100% and Access = random
Table 1. Setup configuration
Figure 1 shows the IOPS and latency comparison of the baseline environment and the scaled environment in a case where no scalability or performance issues were found. At all queue depths, IOPS and latency in the scaled environment are within the acceptable range (+/- 5% in this example) of the baseline environment. At higher queue depths, IOPS saturates and does not increase even if the queue depth is increased. Because the block size is fixed, throughput is directly proportional to IOPS and follows the same trend. The workload (8K random 60% read 40% write) scales well and meets the scalability requirements.
Figure 2 shows the IOPS and latency comparison of the baseline environment and the scaled environment in a case where scalability and performance issues were observed. At all queue depths, IOPS in the scaled environment shows significant degradation when compared with the baseline environment, and latency in the scaled environment is higher. The deviation measured in the scaled environment is outside the acceptable range (more than +/- 5% in this example) of the baseline environment. The workload (8K random 100% write) does not meet the scalability requirements or acceptance criteria. In such circumstances, we need to find the bottleneck that limits IOPS in the scaled environment.
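The comparison behind both examples can be sketched as a per-queue-depth deviation check against the +/- 5% tolerance. The IOPS numbers below are invented purely for illustration and are not taken from the figures; the function names are ours.

```python
def deviation_pct(baseline: float, scaled: float) -> float:
    """Signed deviation of the scaled result from the baseline, in percent."""
    return (scaled - baseline) / baseline * 100.0


def out_of_tolerance(baseline_iops: dict, scaled_iops: dict,
                     tolerance_pct: float = 5.0) -> dict:
    """Return {queue_depth: deviation_pct} for queue depths outside the tolerance."""
    failures = {}
    for queue_depth, base in baseline_iops.items():
        dev = deviation_pct(base, scaled_iops[queue_depth])
        if abs(dev) > tolerance_pct:
            failures[queue_depth] = round(dev, 1)
    return failures


# Hypothetical IOPS per queue depth (illustration only).
baseline_iops = {1: 1000, 4: 3800, 16: 9000, 64: 12000}
well_scaled = {1: 980, 4: 3750, 16: 8800, 64: 11700}
degraded = {1: 900, 4: 3000, 16: 6500, 64: 8000}

print(out_of_tolerance(baseline_iops, well_scaled))
print(out_of_tolerance(baseline_iops, degraded))
```

An empty result means every queue depth is within the acceptance criteria; any entry points to a queue depth where the bottleneck investigation should start. The same check can be applied to latency.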
4. Factors affecting performance
The factors which affect the performance and scalability of a storage system are listed below.
- Disk configuration
It is well known that SSD is a much faster medium than HDD. As the total number of drives in a storage pool increases, performance also increases. 10K RPM HDDs are common these days and give lower latency than 7.2K RPM drives. Performance varies with the RAID configuration and the level of virtualization, so both the underlying hardware configuration and the software configuration that virtualizes the storage hardware need to be set up properly. In the case of software defined storage, local disks and the disks contributing to the cluster or storage pool need to be connected to separate HBAs, so that local IOs do not influence measurements taken at cluster level.
SSDs are also used for caching. The size and number of SSDs used for caching play a role in storage performance. To improve read performance, write-through or read-ahead caching is used, whereas write-back caching is used to improve write performance. If you experience a sudden drop in write performance, it could be because the cache is 100% full. If data is written at a consistently high rate, even flushing will not help. Statistics such as cache hits and misses need to be monitored.
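The point about a full cache comes down to simple arithmetic: when data arrives faster than it is flushed, the cache fills at the difference between the two rates. The sketch below assumes constant rates, which real systems rarely have, and uses hypothetical function names and numbers.

```python
def seconds_until_cache_full(free_cache_gb: float, write_rate_mbps: float,
                             flush_rate_mbps: float):
    """Return seconds until a write-back cache fills, or None if flushing keeps up.

    Rates are in megabytes per second; constant rates are assumed.
    """
    net_fill_rate = write_rate_mbps - flush_rate_mbps
    if net_fill_rate <= 0:
        return None
    return free_cache_gb * 1024 / net_fill_rate


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests served from cache."""
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical numbers: 100 GB free cache, 900 MBps incoming, 500 MBps flushed.
print(seconds_until_cache_full(100, 900, 500))
print(cache_hit_ratio(hits=900, misses=100))
```

A short fill time relative to the test duration suggests that the later part of a write test measures the backing disks rather than the cache.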
- CPU and Memory Resources
It is necessary to monitor the CPU and memory usage of hypervisors, virtual machines (VMs), storage nodes and servers in order to find bottlenecks. Servers which perform IOs on SSDs might require more CPU resources than servers performing IOs on HDDs. In a virtualized environment, the total number of virtual sockets and the cores per socket assigned to VMs are important settings from a scalability point of view. There are other settings such as CPU affinity, page sharing and memory ballooning. We recommend reading the hypervisor documentation for advanced CPU and memory configuration, which is outside the scope of this document.
In general, read operations are faster than write operations. On HDDs, sequential IOs perform better than random IOs because of the seek time incurred for each block of random IO. On SSDs too, random writes are slower than sequential writes.
- Background Operations
When performance measurements are taken, background operations such as disk zeroing, RAID rebuilding/recovery due to disk failure or replacement, and restriping due to a storage node shutdown/reboot or a change in the RAID level should not be running.
- Full Stroke
In full stroke, the data read or written during a performance test is spread across all disks and all clustered storage nodes. Create or pick LUNs to perform IO in such a way that all disks of the storage pool or cluster are used. Short stroke is performed on a small portion of an HDD and, because of the low seek time, it might give you a good performance (lower latency) result, which can lead to an incorrect conclusion. Full stroke performance results are more reliable than short stroke results.
Multipathing policies configured on initiator side determine how IOs are distributed across multiple paths. For example, in case of dm-multipathing on Linux (multipath.conf), ‘path_grouping_policy’ will decide how many paths are used to transfer data and ‘path_selector’ will decide how to distribute data/IOs across paths.
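The combined effect of path grouping and path selection can be illustrated with a simplified model. This sketch is not how dm-multipath is implemented: it only models two grouping policies (`failover` and `multibus`) and a plain round-robin selector, and the device names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def active_paths(paths, grouping_policy):
    """Simplified model: return the paths that carry IO under a grouping policy."""
    if grouping_policy == "failover":
        return list(paths[:1])  # one path per group, only the first group is active
    if grouping_policy == "multibus":
        return list(paths)      # all paths placed in a single active group
    raise ValueError(f"policy not covered by this sketch: {grouping_policy}")


def distribute_round_robin(io_count, paths):
    """Count how many IOs each path receives under round-robin selection."""
    selector = cycle(paths)
    return Counter(next(selector) for _ in range(io_count))


paths = ["sdb", "sdc", "sdd", "sde"]  # hypothetical path devices
multibus = distribute_round_robin(8, active_paths(paths, "multibus"))
failover = distribute_round_robin(8, active_paths(paths, "failover"))
print(dict(multibus))
print(dict(failover))
```

In this model the multibus policy spreads IOs evenly over all four paths, while failover sends every IO down a single path, which is worth keeping in mind when comparing results between differently configured initiators.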
- Command queuing
Multiple SCSI commands can be active on a LUN at the same time. Queue depth is the number of commands that can be active at a time, and it is configurable at the SCSI driver level. If the hypervisor issues more commands than the configured queue depth, queuing takes place at the hypervisor level. Under normal circumstances, a command issued to a disk in the storage array is executed immediately. It is not recommended for the hypervisor (that is, the VMs running on it) to consistently issue more commands than the LUN queue depth, because this might result in disk queuing on the storage array as well.
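A rough way to reason about hypervisor-level queuing is to add up the commands the VMs keep outstanding and compare the total with the LUN queue depth. This is a simplified sketch with hypothetical numbers; real hypervisors apply their own scheduling and throttling.

```python
def commands_queued_at_hypervisor(outstanding_per_vm, lun_queue_depth):
    """Commands that exceed the LUN queue depth and must wait at the hypervisor."""
    total_outstanding = sum(outstanding_per_vm)
    return max(0, total_outstanding - lun_queue_depth)


# Hypothetical case: three VMs each keep 16 commands outstanding on one LUN
# whose queue depth is 32.
print(commands_queued_at_hypervisor([16, 16, 16], lun_queue_depth=32))
```

A consistently non-zero result indicates that the VMs are issuing more commands than the LUN can have active, so added latency comes from waiting in the queue rather than from the storage itself.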
- Network Configuration
Incorrect network configuration sometimes contributes to degraded storage performance. Industry best practices should therefore be followed, such as keeping the management network separate from the data-path/VM network and using VLANs to control broadcast traffic. NIC teaming and NICs that support a TCP offload engine (TOE) can be used to enhance iSCSI performance. If you decide to use jumbo frames, make sure that an MTU of 9000 is configured on all network equipment along the data path.
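A jumbo frame setup is only as good as its weakest hop, so it helps to check the MTU of every device before measuring. The sketch below assumes the MTU values have already been collected by some other means; the device names are hypothetical.

```python
def mtu_mismatches(device_mtus: dict, expected_mtu: int = 9000) -> dict:
    """Return the devices whose MTU differs from the expected jumbo frame MTU."""
    return {name: mtu for name, mtu in device_mtus.items() if mtu != expected_mtu}


# Hypothetical inventory of MTU settings along the iSCSI data path.
collected = {
    "host-nic": 9000,
    "access-switch": 9000,
    "core-switch": 1500,
    "storage-node-nic": 9000,
}
print(mtu_mismatches(collected))
```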
- Virtualized Environment
In a virtualized environment, as far as storage performance is concerned, LUNs should be created thick provisioned and eager zeroed. Thin provisioned LUNs lead to misleading performance results because disk zeroing takes place just before the actual write.
5. System stability in scaled environment
The continuing problem with storage systems is how to deal with escalating requirements in a manageable, smooth and non-disruptive manner. Multiple storage admins work on large environments simultaneously and perform operations such as creating volumes, taking snapshots and assigning LUNs. When the system scales to a large extent, multiple requests from multiple hosts reach the storage system at the same time, so the reliability of the system must be maintained.
The level of concurrency defines how these incoming requests are distributed across storage nodes for processing. Tests that verify how efficiently concurrent requests are processed must be carried out. Concurrent requests are sent to the storage nodes and the response time of each request is measured. Neither the level of concurrency nor the response time should degrade as the system scales, and the storage system should respond gracefully to all requests.
Stability testing can be performed by perturbing the system. Shutting down, rebooting, removing or adding nodes in the cluster while IOs are being performed are some of the tests that can be carried out.
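A concurrency test of this kind can be sketched with a thread pool that sends requests in parallel and records how long each one takes. The request function here, `fake_create_volume`, is a hypothetical stand-in that only sleeps; in a real test it would call the management interface of the storage system under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(request_fn, request_id):
    """Run one request and return (request_id, elapsed seconds for that call)."""
    start = time.perf_counter()
    request_fn(request_id)
    return request_id, time.perf_counter() - start


def run_concurrent(request_fn, request_count, concurrency):
    """Send request_count requests using up to `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_call, request_fn, i)
                   for i in range(request_count)]
        return dict(future.result() for future in futures)


def fake_create_volume(request_id):
    time.sleep(0.01)  # stand-in for a real management request


timings = run_concurrent(fake_create_volume, request_count=20, concurrency=5)
print(f"slowest request: {max(timings.values()):.3f} s")
```

Repeating the run at increasing concurrency levels, on both the baseline and the scaled setup, shows whether response time degrades as the system grows. Note that the timing covers only the execution of each call, not the time a request waits for a free worker.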
5.1. Test automation and tools
Repeatedly creating a large setup and generating a load of concurrent requests require test automation and tools to be in place. Without them, human errors are often introduced, leading to misleading results and more time spent preparing the setup and executing the test strategy.
As a storage system scales, it is important to find out whether performance degrades and, if it does, by how much. Degradation can be determined and reduced through a systematic approach of comparing against baseline results and constantly monitoring resources. With this approach, system stability can be maintained even when storage systems scale to a large extent.
By Mahesh Kamthe and Shubhada Savdekar