The 1st Workshop on Workflow Monitoring, Observability, and in situ Analytics
August 12th, 2024
Gotland, Sweden

WOWMON 2024

In cooperation with IEEE Computer Society and ACM

Held in conjunction with ICPP 24: International Conference on Parallel Processing

Topics

In recent years, there has been an emergence of high-performance workflow systems to address large-scale, complex scientific applications involving the interoperation of heterogeneous parallel tasks, massive data storage, and computational resources. From scientific workflows coupling HPC simulations to distributed data-driven analytics frameworks to many-task ensemble applications, workflow execution environments place significant demands on the system platform for their scalable and efficient operation. In contrast to the performance optimization of traditional HPC applications that can rely on post-mortem performance analysis, dynamic performance monitoring and in situ analysis are crucial for evaluating and guiding resource management strategies for modern workflows. The workshop will focus on research topics related to deploying high-performance workflow systems and integrating state-of-the-art technologies in this field. Topics of interest include, but are not limited to, performance modeling, resource management, fault tolerance, programming models, observability, and other aspects of workflow systems coupling in-situ/near-real-time analytics with HPC simulations/experiments.

The workshop is relevant to multiple aspects of parallel processing – HPC, workflows, runtime systems, performance measurement and analysis, scalability, and scheduling – that come together in high-performance workflow-driven applications where dynamic behavior and performance variability present challenges to effective workflow execution, resource scheduling, and tuning. The target audience would encompass researchers and practitioners across the parallel processing spectrum, from HPC experts and workflow system developers to runtime system engineers, performance analysts, and scalability specialists.

The main goal of the workshop is to bring together researchers across these areas to discuss the challenges facing the development of observability, online monitoring, and in situ analytics capabilities for scalable workflow operation and optimization.

Workshop Program

WOWMON Agenda: Monday, August 12, 2024

12:00 Lunch Break
13:30 Welcome
13:35 Session A
Position statement (45 minutes) from WOWMON organizers
  • Shantenu Jha, Rutgers University, USA
  • Ana Gainaru, Oak Ridge National Laboratory, USA
  • Silvina Caino-Lores, Inria, France
14:20 Talk 1 (25 minutes) - Scheduling Scientific Workflows in DFG CRC FONDA
Presenter: Svetlana Kulagina, Humboldt University of Berlin, Germany
Abstract: The DFG-funded collaborative research center (CRC) FONDA investigates methods for increasing productivity in the development, execution, and maintenance of Data Analysis Workflows for large scientific data sets. FONDA seeks to improve human productivity when creating and executing scientific software, rather than to optimize machine resource utilization. The aim of subproject B1 in FONDA is to enhance the portability of software and develop novel scheduling and load balancing algorithms, using data gathered from the execution environments about workflows and the infrastructure. I will present the main achievements of B1 in architecture discovery and my own in the development of scheduling algorithms for large scientific workflows.
Bio: Svetlana Kulagina is a PhD student in the group "Modeling and Analysis of Complex Systems" at the Institute of Computer Science at Humboldt-Universität zu Berlin. Her research is focused on scheduling and load balancing in heterogeneous distributed execution environments, especially memory-aware algorithms for scheduling.
14:45 Talk 2 (25 minutes) - DaYu: A Two-level I/O Performance Tool for Distributed Scientific Workflows
Presenter: Xian-He Sun, Illinois Institute of Technology, USA
Abstract: Big data, AI, and other data-driven applications generate massive amounts of data and create new data-discovery demands. Data access has become a killer performance bottleneck for AI and data-driven applications. Many solutions have been developed to optimize I/O performance. However, current solutions have their limitations in a distributed workflow environment, where different processing phases exist, and each phase may have its own I/O access patterns. Data can be reused across phases in a workflow environment, and optimizing I/O in one phase may not lead to global optimization. In this talk, we introduce DaYu, a method and toolset designed to address I/O bottlenecks in distributed workflows. DaYu analyzes semantic relationships between logical datasets and file addresses, translates dataset operations into I/O patterns, and considers optimization across entire workflows. DaYu’s visualization aids in identifying critical bottlenecks, leading to up to 3.7x performance improvements in I/O time for obscure bottlenecks. We will motivate the I/O issues, describe the DaYu methodology, and propose optimization guidelines.
Bio: Dr. Xian-He Sun is a University Distinguished Professor, the Ron Hochsprung Endowed Chair of Computer Science, and the director of the Gnosis Research Center for accelerating data-driven discovery at the Illinois Institute of Technology (Illinois Tech). Before joining Illinois Tech, he worked at DoE Ames National Laboratory, at ICASE, NASA Langley Research Center, at Louisiana State University, Baton Rouge, and was an ASEE fellow at Navy Research Laboratories. Dr. Sun is an IEEE fellow and is known for his memory-bounded speedup model, also called Sun-Ni’s Law, for scalable computing. His research interests include high-performance data processing, memory and I/O systems, and performance evaluation and optimization. He has over 300 publications and 6 patents in these areas and is currently leading multiple large software development projects in HPC I/O systems. Dr. Sun is the Editor-in-Chief of the IEEE Transactions on Parallel and Distributed Systems, and a former department chair of the Computer Science Department at Illinois Tech. He received the Golden Core award from IEEE CS society in 2017, the ACM Karsten Schwan Best Paper Award from ACM HPDC in 2019, the Ron Hochsprung Endowed Chair from Illinois Tech in 2020, and the first prize best paper award from ACM/IEEE CCGrid in 2021. More information about Dr. Sun can be found at his web site www.cs.iit.edu/~sun/
15:10 Break
15:40 Session B
Keynote (45 minutes) - Predicting Heterogeneity and Serverless Principles of Converged HPC, AI, and Workflows
Presenter: Dejan Milojicic, Hewlett Packard Labs, USA
Abstract: Traditional HPC and modern AI computing are converging with workflows as a common paradigm. We predict nine principles of heterogeneity and serverless for this convergence, from high-level programming to low-level hardware. Workflows enable a higher level of abstraction that is easier to develop, (re)use, and operate. Both HPC and AI depend heavily on accelerators, and they both adopt serverless computing. Similarly to workflows, serverless also raises the level of abstraction and simplifies DevOps. The principles and approaches we describe strive towards enabling seamless scalability and fluidity for end users; increased productivity of developers; and improved performance efficiency of providers.
Bio: Dejan Milojicic is an HPE Fellow and VP at Hewlett Packard Labs, Palo Alto, CA [1998-present]. Previously, he worked at the OSF Research Institute, Cambridge, MA [1994-1998] and Institute "Mihajlo Pupin", Belgrade, Serbia [1983-1991]. He received his Ph.D. from the University of Kaiserslautern, Germany (1993); and his MSc/BSc from Belgrade University, Serbia (1983/86). His research interests include systems software, distributed computing, systems management, and HPC. Dejan has over 240 papers, 2 books, and 86 granted patents. Dejan is an IEEE Fellow (2010), ACM Distinguished Engineer (2008), and HKN and USENIX member. Dejan was on 9 Ph.D. thesis committees, and he mentored over 90 interns. Dejan was president of the IEEE Computer Society (2014), an IEEE presidential candidate in 2019, editor-in-chief of IEEE Computing Now and Distributed Systems Online, and he has served on many editorial boards and TPCs. Dejan led large industry-government-university collaborations, such as Open Cirrus (2007-2011) and New Operating System (2014-2017).
16:25 Talk 3 (25 minutes) - Workflow Smarter, Not Harder
Presenter: Dewi Yokelson, Lawrence Livermore National Laboratory, USA
Abstract: The rise of workflows as the new paradigm for HPC applications presents many opportunities when it comes to building tools to monitor their performance. With independent tasks being scheduled and completed at different times, traditional HPC application monitoring tools typically lack the ability to adapt to the proper scope and lifetime of the workflow. From the perspective of a performance monitoring tool developer, there are four significant challenges. First, workflows introduce new performance data domains and scopes. Second, another layer of compatibility must be taken into account when encouraging tool adoption. Third, online analysis and modeling must now be adjusted for systems consisting of heterogeneous tasks and architectures. Finally, determining how and when to make online changes to the workflow becomes a more complex problem. Recently, we embarked on a project for monitoring workflows that are managed by the RADICAL-Pilot system. The framework we used is called SOMA, which stands for Service-based Observability, Monitoring, and Analytics. Integrating SOMA and RADICAL-Pilot allowed us to collect and analyze data online from two different workflows. We discuss how the approach in this project addressed the first two challenges, and where opportunities still exist with the remaining two challenges. Capitalizing on these opportunities for effective workflow performance monitoring tools will enable smarter workflows.
Bio: Dewi Yokelson is a Postdoctoral Researcher at Lawrence Livermore National Lab in Livermore, California. She received her PhD in Computer Science at the University of Oregon in Eugene, Oregon under Dr. Allen Malony. Her research interests include monitoring and analyzing the performance of HPC applications. She is also interested in post-mortem performance observation, analysis, and visualization, performance modeling, and benchmarking.
16:50 Talk 4 (25 minutes) - EPCC’s Use of Monitoring for Resource Usage Optimisation
Presenter: Michele Weiland, EPCC, Scotland
Abstract: EPCC is the UK’s largest supercomputing centre, hosting both national HPC infrastructure and smaller data analytics and AI/ML focussed systems in a large state-of-the-art datacentre. Operating all our systems as efficiently as possible is a primary goal, and close monitoring of the infrastructure (e.g. from system workload to power and cooling) is key in achieving this. This talk will give an overview of the types of monitoring we do at EPCC and how the monitoring data is used to optimise our day-to-day operation. I will also discuss the challenges around creating a comprehensive monitoring environment in a datacentre with a large number of diverse systems.
Bio: Professor Michele Weiland is the Met Office Joint Chair at EPCC, the supercomputing centre at the University of Edinburgh. She leads EPCC’s technical work in the UKRI funded ASiMoV Strategic Prosperity Partnership with Rolls-Royce, which is developing high-fidelity multi-physics simulations of aircraft engines. She also leads a large collaborative effort with the Met Office on optimising their next generation weather modelling systems, and she is a PI on the AI for Net Zero project “Real-time Digital Optimisation and Decision Making for Energy and Transport Systems”. She also leads research efforts in an international collaboration “CONTINENTS” between EPCC, the National Centre for Atmospheric Sciences in the UK, and the NSF National Centre for Atmospheric Research in the US, which aims to transform the state-of-the-art in sustainability and power/energy efficiency of computational modelling and simulation in weather and climate.
17:15 Closing

Call For Papers

WOWMON, the 1st Workshop on Workflow Monitoring, Observability, and in situ Analytics, will be held during ICPP 2024, the International Conference on Parallel Processing (https://icpp2024.org). ICPP is one of the oldest computer science conferences; ICPP 2024 is the 53rd edition of ICPP.

Workshop Theme

In recent years, there has been an emergence of high-performance workflow systems to address large-scale, complex scientific applications involving the interoperation of heterogeneous parallel tasks, massive data storage, and computational resources. From scientific workflows coupling HPC simulations to distributed data-driven analytics frameworks to many-task ensemble applications, workflow execution environments place significant demands on the system platform for their scalable and efficient operation. In contrast to the performance optimization of traditional HPC applications that can rely on post-mortem performance analysis, dynamic performance monitoring and in situ analysis are crucial for evaluating and guiding resource management strategies for modern workflows. The workshop will focus on research topics related to deploying high-performance workflow systems and integrating state-of-the-art technologies in this field. Topics of interest include, but are not limited to, performance modeling, resource management, fault tolerance, programming models, observability, and other aspects of workflow systems coupling in-situ/near-real-time analytics with HPC simulations/experiments.

Workshop Topic Relevance and Goals

The workshop is relevant to multiple aspects of parallel processing (HPC, workflows, runtime systems, performance measurement and analysis, scalability, and scheduling) that come together in high-performance workflow-driven applications where dynamic behavior and performance variability present challenges to effective workflow execution, resource scheduling, and tuning. The target audience would encompass researchers and practitioners across the parallel processing spectrum, from HPC experts and workflow system developers to runtime system engineers, performance analysts, and scalability specialists.

The main goal of the workshop is to bring together researchers across these areas to discuss the challenges facing the development of observability, online monitoring, and in situ analytics capabilities for scalable workflow operation and optimization.

ICPP 2024 will be held in Gotland, Sweden, from August 12 - 15, 2024. Topics of interest for the WOWMON workshop include, but are not limited to:

Description of Target Audience

The landscape for large HPC workflows is changing rapidly, and traditional workflow models relying on post-mortem analysis after completion often offer limited guidance for optimization. In situ analytics techniques embedded within the workflow are gaining traction, enabling real-time analysis of performance and intermediate results. This allows for dynamic adjustments to scheduling, resource allocation, and execution strategies based on the unfolding data. The concept for the workshop grew out of two Dagstuhl seminars in 2023 in which the workshop organizers participated. From our experiences there, we believe that the workshop will be of interest to a cross-section of people attending those seminars, as well as those interested in these topics.

Submission

Important Dates

Abstract submission deadline: June 14, 2024 (AoE), extended from May 24, 2024

Full paper submission deadline: June 14, 2024 (AoE), extended from May 31, 2024

Author notification: July 3, 2024, moved from June 21, 2024

Camera-ready final papers submission deadline: TBD (AoE)

Submissions

• Paper submissions should not exceed 10 pages (including references) and all submissions must be made electronically through the ICPP conference submission portal (https://ssl.linklings.net/conferences/icpp/) in PDF format printable on US letter size (8.5" x 11") paper. Please use the ACM format located at: https://www.acm.org/publications/proceedings-template. More specifically, we recommend using the \documentclass[sigconf,review,anonymous]{acmart} configuration for submissions prepared in LaTeX. Changes to the template (e.g., margin, font size) could lead to automatic rejection.

• Submissions should represent original research results and cannot already be under review or accepted for publication in another venue.

• Paper submissions should be in single-blind ACM format.

• Submitted papers will be evaluated by at least 3 reviewers based upon technical merits. The accepted papers will be published with IEEE/ACM.

• All accepted papers that are presented at the conference will be published in the ACM Digital Library.

• Accepted papers will also need to follow the conference registration policy to be included in the conference proceedings.

• Rejected ICPP submissions are welcome to submit to the workshop if the authors choose to do so.

Committee Members

Organizing Committee

Allen D. Malony, University of Oregon

Shantenu Jha, Rutgers University / Brookhaven National Laboratory

Ana Gainaru, Oak Ridge National Laboratory

Kevin Huck, University of Oregon

Silvina Caino-Lores, Inria

Technical Program Committee

Michael Ott, Leibniz Supercomputing Centre

Florina Ciorba, University of Basel

Srinivasan Ramesh, NVIDIA

Anthony Kougkas, Illinois Institute of Technology

Luan Teylo, Inria

Sean Wilkinson, Oak Ridge National Lab

Cyrus Harrison, Lawrence Livermore National Lab

Iacopo Colonnelli, Università di Torino

Douglas Thain, University of Notre Dame

David Marchant, University of Copenhagen

Tapasya Patki, Lawrence Livermore National Lab

Ulf Leser, Humboldt University of Berlin

Ivan Rodero, University of Utah

Jakob Luettgau, Inria

Raul Sirvent, Barcelona Supercomputing Center