SC22 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Revisit Data Partitioning in Data-Intensive Workflows

Workshop: PDSW22: 7th International Parallel Data Systems Workshop

Authors: Radita Liem (RWTH Aachen University, IT Center) and Shadi Ibrahim (French Institute for Research in Computer Science and Automation (INRIA))

Abstract: In this work-in-progress paper, we present a comprehensive analysis of current state-of-the-art solutions for data skew mitigation in several environments. Our experiments and evaluation cover several data-intensive workflows running on Spark on the Grid’5000 testbed. The workflows range from a highly optimized WordCount application and an iterative application, PageRank, to TPC-H, an SQL-based decision support benchmark, run with various sizes and configurations. We also discuss our ongoing efforts toward heterogeneity-aware multi-stage data partitioning.
