Workshop: The 8th International Workshop on Data Analysis and Reduction for Big Scientific Data (DRBSD-8) in Conjunction with SC22
Authors: Peng Zhou, Wen Xia, and Xiangyu Zou (Harbin Institute of Technology, China)
Abstract: With the massive upsurge in data, systems that combine deduplication with distributed storage continue to suffer from a low deduplication ratio while sustaining the required throughput. This is because distributed storage requires sharding data across different nodes, whereas global deduplication requires eliminating redundancy from a unified view. In this paper, we present D-Shard, a clustering-based sharding method for distributed deduplication storage systems that achieves deduplication efficiency comparable to that of a single system while supporting high throughput. First, D-Shard uses a Dynamic K-Means approach to cluster super-blocks and extracts the feature of each cluster center as an anchor point for sharding. Second, it constructs a secondary deduplication index based on the Compact Hamming Index. Preliminary results show that super-block clustering converges, that the routing strategy based on anchor points achieves a higher deduplication ratio than the state-of-the-art approach, and that system throughput is greatly improved.
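The anchor-point routing idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each super-block is summarized by a fixed-width binary feature (represented here as a Python integer), that each storage node owns one anchor taken from a cluster center, and that similarity is measured by Hamming distance. The names `hamming` and `route` and the 8-bit example features are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two equal-width binary features."""
    return bin(a ^ b).count("1")


def route(feature: int, anchors: Sequence[int]) -> int:
    """Return the index of the node whose anchor is closest to `feature`.

    Assumption: one anchor per node, compared by Hamming distance.
    Ties are resolved in favor of the lowest node index.
    """
    if not anchors:
        raise ValueError("at least one anchor is required")
    return min(range(len(anchors)), key=lambda i: hamming(feature, anchors[i]))


# Hypothetical 8-bit anchors for three nodes.
anchors = [0b00000000, 0b11110000, 0b00001111]

# This feature differs from anchor 1 in a single bit, so it goes to node 1.
print(route(0b11100000, anchors))
```

Under these assumptions, super-blocks with similar features land on the same node, so most duplicates can be detected by that node's local index without consulting a global one.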