Dynamic Clustering-Based Sharding in Distributed Deduplication Systems
Description
With the massive upsurge in data, systems that combine deduplication with distributed storage continue to suffer from a low deduplication ratio while sustaining the required throughput. This is because distributed storage requires sharding data across different nodes, while global deduplication requires eliminating redundancy under a unified view. In this paper, we present D-Shard, a clustering-based sharding method for distributed deduplication storage systems that achieves deduplication efficiency comparable to that of a single system while supporting high throughput. First, we use a Dynamic K-Means approach to cluster super-blocks and extract the feature of each cluster center as an anchor point for sharding. Second, we construct a secondary deduplication index based on the Compact Hamming Index. Preliminary results show that super-block clustering converges, and that the routing strategy based on anchor points achieves a higher deduplication ratio than the state-of-the-art approach while greatly improving system throughput.
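The anchor-point routing idea can be illustrated with a minimal sketch. The abstract does not specify how super-block features are represented or which distance is used, so everything here is an assumption made for illustration: features are modeled as fixed-width bit fingerprints stored in Python integers, distance is Hamming distance (suggested only by the mention of a Hamming-based index), and a cluster center is recomputed by bitwise majority vote as a stand-in for the K-Means update step. The function names `hamming`, `route`, and `majority_center` are hypothetical and are not taken from D-Shard.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints (assumed metric)."""
    return bin(a ^ b).count("1")


def route(feature: int, anchors: List[int]) -> int:
    """Return the index of the node whose anchor is closest to the feature.

    Ties go to the lowest node index, because min() keeps the first minimum.
    """
    return min(range(len(anchors)), key=lambda i: hamming(feature, anchors[i]))


def majority_center(features: List[int], bits: int) -> int:
    """Recompute a cluster center by per-bit majority vote.

    Hypothetical analogue of the K-Means centroid update for bit vectors;
    a bit is set in the center only if more than half the members set it.
    """
    center = 0
    for b in range(bits):
        ones = sum((f >> b) & 1 for f in features)
        if 2 * ones > len(features):
            center |= 1 << b
    return center


# Two hypothetical storage nodes, each represented by one anchor fingerprint.
anchors = [0b00001111, 0b11110000]
node_for_a = route(0b00001101, anchors)   # close to the first anchor
node_for_b = route(0b11100000, anchors)   # close to the second anchor
new_center = majority_center([0b1100, 0b1110, 0b0110], bits=4)
```

Under these assumptions, similar super-blocks land on the same node, so each node can deduplicate locally against content that is likely to be redundant with it.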
Event Type
Workshop
Time
Sunday, 13 November 2022, 4pm - 4:30pm CST
Location
C141
Recorded