Partially collaborative storage codes for distributed storage systems
Date of Issue2015
School of Physical and Mathematical Sciences
This thesis contributes to the study of designing erasure codes for distributed storage systems. Erasure codes have traditionally been studied in the context of communication theory, but have also increasingly been used to ensure the reliability of distributed storage systems, since they provide a good trade-off between storage overhead (or rate) and reliability in case of storage node failures. There is however a critical difference between erasure codes for communication and for storage application: in the context of storage, reliability needs to be ensured over time. Failures of storage nodes keep happening, and the lost data must be rebuilt, to guarantee resiliency over time. This process is often called maintenance, or repair. Over the past years, there has been a vibrant research activity around the problem of designing good storage erasure codes, amenable to repair. A network coding inspired line of works was introduced by Dimakis et al. , who used a max-flow min-cut analysis to characterize the communication cost needed for repairing one failed node as a function of the code storage overhead. This setting has been generalized independently in [4, 5] to allow the simultaneous repair of several nodes failures thanks to node collaboration during the repair process, resulting in a better trade-off in terms of repair communication cost versus storage overhead than in the case of one failure repair at a time. The main contribution of this thesis is to introduce a family of storage codes that generalizes the framework of collaborative repairs, which subsumes the previously known constructions with either no collaboration (one failure at a time) or full collaboration (several failures at a time, involving all repair nodes in the collaboration as proposed in [4,5]). The motivation is twofold: to allow more degrees of freedom in the code design to best suit possible storage system requirements, and to trade gain in repair bandwidth with security impairments that typically appear in the presence of collaboration. More precisely, this thesis contains a trade-off between storage overhead and repair bandwidth as a function of the collaboration degree of the repair nodes involved, and code constructions: codes for the special cases where either the storage overhead or the repair bandwidth is minimized, and codes derived from group actions on Grassmannian spaces, known as orbit codes. Finally, the degree of collaboration is studied as a function of the confidentiality one would like to obtain in the presence of eavesdroppers.