Hybrid system analysis and design for scale-out storage environments
Date of Issue: 2018-11-01
School of Electrical and Electronic Engineering
Centre for Modelling and Control of Complex Systems
Data center storage architectures today face rapidly growing data volumes and rising Quality of Service (QoS) expectations. Hybrid storage systems have emerged as one of the most popular choices for meeting these requirements: mixing various types of storage devices and structures lets architects address users' performance and capacity concerns within one storage infrastructure. A hybrid storage system admits many architectures, such as tiered storage and caching storage, and many policies, such as caching and replacement policies. It is difficult for a general mathematical model to capture all the features of these architectures, so simulation is necessary to analyze system performance and guide system design. In a hybrid storage system, storage capacity is as important as performance. Shingled Magnetic Recording (SMR) is one technique for enlarging storage capacity at minimal additional cost. Batch processing is applied to improve SMR drive performance, and the choice of system parameters and policies affects both performance and capacity efficiency. An analytical model based on a queuing model is therefore developed to study SMR drive performance, and simulation is conducted to validate the model and examine the performance impact of various drive parameters. Because an SMR drive is append-only for writes, data updates are routinely handled in a log-structured manner, which leaves the original data invalid (dirty). Various Garbage Collection (GC) methods have therefore been introduced to clean the dirty data and release disk space. However, whether the GC process needs to be performed remains an open issue, as it may increase energy consumption and degrade performance.
Thus, an analytical model and a simulator are built to study when to perform the GC process under various workloads and system settings, with the aim of reducing overall power consumption and saving disk space. Building on the performance analysis of individual devices, the system-level performance of the hybrid storage system can then be studied. A flexible hybrid storage system simulator is designed and developed to simulate various hybrid storage architectures, including the Solid State Drive (SSD) tiering, SSD caching and SSD hybrid methods; various caching policies, such as read-only and write-back; and various hot data identification and data migration policies. The performance of these architectures and algorithms is evaluated and compared under different types of workloads. The comparison results can serve as a benchmark for analyzing other types of hybrid storage systems that extend these basic architectures. Conventional hybrid storage systems do not fully utilize the sequential access properties and the unlimited write cycles of the Hard Disk Drive (HDD). An innovative approach is proposed to configure HDDs and SSDs in a hybrid structure such that the advantages of both can be fully utilized, i.e., the fast IO access of SSDs (in particular for random access) and the unlimited write cycles of HDDs. By carefully designing the disk data stripes, the (sequential) performance requirements of HDDs and SSDs can be matched to a certain degree, so the two can be placed in the same array/pool without regard to high/low tiers or fast/slow caches. In a hybrid storage system, caching policies such as hot/cold identification and data migration play a critical role in system performance. In particular, the migration size is one of the key factors.
A fixed data migration size typically cannot provide good performance when workload properties change significantly and frequently. We design a hybrid caching algorithm based on fuzzy control and decision trees that adaptively adjusts the data migration policies according to the workload properties. The fuzzy rules can be generated automatically from the training results of the decision tree classification and regression algorithms.
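To make the idea of an adaptive migration size concrete, the following is a minimal sketch of a fuzzy rule base that maps two workload properties to a migration size. It is not the algorithm developed in the thesis: the two inputs (`sequentiality` and `hotness`, both normalized to [0, 1]), the membership functions, the rule table `RULES` and the sizes in KB are all hypothetical, and the rules are written by hand here, whereas the abstract describes generating them from decision tree training results.

```python
def clamp01(v):
    """Clamp a value to the interval [0, 1]."""
    return max(0.0, min(1.0, v))

# Hypothetical membership functions over a normalized input in [0, 1].
def low(x):
    return clamp01(1.0 - 2.0 * x)

def medium(x):
    return clamp01(1.0 - abs(2.0 * x - 1.0))

def high(x):
    return clamp01(2.0 * x - 1.0)

# Hypothetical hand-written rules:
# (sequentiality membership, hotness membership, migration size in KB).
RULES = [
    (low,    low,    0),     # cold data: do not migrate
    (medium, low,    0),
    (high,   low,    0),
    (low,    medium, 16),
    (medium, medium, 64),
    (high,   medium, 256),
    (low,    high,   64),    # hot random data: small migration unit
    (medium, high,   256),
    (high,   high,   1024),  # hot sequential data: large migration unit
]

def migration_size_kb(sequentiality, hotness):
    """Weighted average of rule outputs (zero-order Sugeno-style inference)."""
    num = 0.0
    den = 0.0
    for seq_mf, hot_mf, size_kb in RULES:
        # Rule firing strength: fuzzy AND taken as the minimum.
        weight = min(seq_mf(sequentiality), hot_mf(hotness))
        num += weight * size_kb
        den += weight
    return num / den if den > 0 else 0.0
```

In this sketch the output varies smoothly between the rule sizes as the workload shifts, which is the behaviour a fixed migration size cannot provide; a trained decision tree could, under the same assumptions, supply the entries of `RULES` instead of a hand-written table.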
DRNTU::Engineering::Computer science and engineering::Data::Data storage representations