Revisit Data Partitioning in Data-Intensive Workflows
Description: In this work in progress, we present a comprehensive analysis of current state-of-the-art solutions for data skew mitigation across several environments. Our experiments and evaluation cover several data-intensive workflows running on Spark on the Grid’5000 testbed. The workflows range from a highly optimized WordCount application and an iterative application, PageRank, to TPC-H, an SQL-based decision support benchmark, at various sizes and configurations. Going forward, we will discuss our current efforts toward heterogeneity-aware multi-stage data partitioning.