Large Scale Data Management with a Data Lake in a Federated Environment

Authors: Stefano Gorini (Swiss National Supercomputing Centre (CSCS)), Alex Upton (Swiss National Supercomputing Centre (CSCS)), Julian Kunkel (GWDG, Germany)

Abstract: In this BoF, we present a vision from industry and academia of establishing a new forum that addresses the need for Next Generation Interfaces in data management in a federated environment. We will first introduce perspectives on this vision and the pressing challenges. We will then discuss promising approaches that address a subset of the vision, namely heterogeneous storage and compute environments.

Long Description: Large-scale data management is a challenge for both users and data centres. Users struggle to organise the millions of files involved in their scientific workflows and the relevant software. Data centres, in turn, struggle with the complexity of providing and optimising storage environments without knowing the exact intentions of their users. Creating data management plans and clearly defining the information life cycle and workflows can help by increasing reproducibility and portability. In addition, many workflows integrate user-specific metadata into search engines that allow users to navigate data. Concepts such as data lakes and data lakehouses have become popular as centralised storage; in a federated environment, data lakes aim to integrate data from diverse sources into a unified management system, retaining data in its original format. The underlying concept is to ‘dump’ scientific data into the lake and provide tools to set up user-friendly workflows that enable sharing in accordance with all necessary security measures, thereby ensuring data preservation and helping scientists comply with good scientific practices. As good data management is often difficult and domain-specific, interaction among users with similar challenges can accelerate the development of practical solutions. The aim of this BoF is therefore to aid community building in this area and to initiate a discussion with the audience in order to find common problems and their individual solutions. In the first part of the session, several speakers from industry, data centres, and academia will give lightning talks on large-scale data management, with a particular focus on data lakes, workflows and federation (e.g. FENIX). Building on this, the second part will focus on interaction with the audience and will involve an on-site survey, discussions and community building in order to identify promising approaches moving forward.


