Revisit Data Partitioning in Data-Intensive Workflows
Description
In this work in progress, we present a comprehensive analysis of current state-of-the-art solutions for data skew mitigation in several environments. Our experiments and evaluation cover several data-intensive workflows running on Spark on the Grid’5000 testbed. The workflows range from a highly optimized WordCount application and an iterative application such as PageRank to TPC-H, an SQL-based decision support benchmark, run at various sizes and configurations. We will also discuss our current efforts toward heterogeneity-aware, multi-stage data partitioning.
Time
Monday, 14 November 2022, 2:50pm - 2:55pm CST