Workshop: The 17th Workshop on Workflows in Support of Large-Scale Science (WORKS22)
Authors: Hongwei Jin and Krishnan Raghavan (Argonne National Laboratory (ANL)), George Papadimitriou (University of Southern California (USC)), Cong Wang and Anirban Mandal (Renaissance Computing Institute (RENCI)), Patrycja Krawczuk and Loic Pottier (University of Southern California (USC)), Mariam Kiran (Energy Sciences Network (ESnet)), Ewa Deelman (University of Southern California (USC)), and Prasanna Balaprakash (Argonne National Laboratory (ANL))
Abstract: Reliable execution of scientific workflows is a fundamental concern in computational campaigns. Detecting and diagnosing anomalies is therefore both important and challenging for workflow executions that span complex, distributed computing infrastructures. We model a scientific workflow as a directed acyclic graph and apply graph neural networks (GNNs) to identify anomalies at both the workflow level and the individual job level. In addition, we generalize our GNN model to detect anomalies across a set of workflows jointly rather than for one specific workflow. By learning hidden representations not only from the job features but also from the topological information of the workflow, our GNN models demonstrate higher accuracy and better runtime efficiency than conventional machine learning models and other convolutional neural network approaches.
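
The idea of treating a workflow as a graph and learning from both job features and topology can be illustrated with a minimal sketch. This is not the authors' model: it performs a single, weight-free mean-aggregation step (a simplified stand-in for a GNN layer) followed by mean pooling. The job names, edges, and feature values below are hypothetical and chosen only for illustration.

```python
# Hypothetical toy workflow (DAG); names and edges are made up for illustration.
edges = [
    ("stage_in", "process_a"),
    ("stage_in", "process_b"),
    ("process_a", "merge"),
    ("process_b", "merge"),
]

# Assumed per-job features: [runtime in seconds, data read in GB].
features = {
    "stage_in": [2.0, 1.0],
    "process_a": [10.0, 4.0],
    "process_b": [30.0, 4.0],
    "merge": [5.0, 8.0],
}


def neighbors(edges, nodes):
    """Build an undirected neighbor map with self-loops from the DAG edges."""
    nbrs = {n: {n} for n in nodes}
    for src, dst in edges:
        nbrs[src].add(dst)
        nbrs[dst].add(src)
    return nbrs


def propagate(features, nbrs):
    """One mean-aggregation message-passing step (no learned weights)."""
    out = {}
    for node, ns in nbrs.items():
        dim = len(features[node])
        out[node] = [sum(features[m][k] for m in ns) / len(ns) for k in range(dim)]
    return out


def readout(node_embeddings):
    """Mean-pool the job embeddings into a single workflow-level vector."""
    vecs = list(node_embeddings.values())
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(len(vecs[0]))]


nbrs = neighbors(edges, features)
job_embeddings = propagate(features, nbrs)     # would feed a job-level classifier
workflow_embedding = readout(job_embeddings)   # would feed a workflow-level classifier
```

In a trained GNN, each propagation step would also apply learned weight matrices and a nonlinearity, and a classifier on top of the job or workflow embeddings would output the anomaly label. The sketch only shows how topology mixes a job's features with those of its parents and children.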