I will be giving my talk about recent achievements in our ongoing project entitled “R&D of A Quantum-Annealing-Assisted Next Generation HPC Infrastructure and its Applications.” My talk will start with performance evaluation results of SX-Aurora TSUBASA as a vector computing platform, and then go into hybrid computing using the VH and VE together. Finally, I will discuss two types of hybridization, at the computing-mechanism and hardware-platform levels: annealing on SX-Aurora TSUBASA itself, and annealing across SX-Aurora TSUBASA and a quantum annealer.
The coming ten years of HPC will be dominated by three main developments. First, the end of Moore’s law is already showing its impact. Second, a new application, Artificial Intelligence, is forcing computing centers to adapt their strategies. Finally, Quantum Computing is knocking on the door of HPC without providing any hints about the direction in which QC is headed. HLRS has to adapt to these challenges, and in the talk we will present challenges and opportunities as well as strategies to cope with them.
10:30 – 11:00
Software Methods for Product Virtualization Sabine Roller, DLR (Deutsches Zentrum für Luft- und Raumfahrt e.V.)
This talk will give an update on the EuroHPC projects EuroCC and CASTIEL, which are implementing 33 National Competence Centres in Europe. It will also give an update on the FF4EuroHPC project, with insights into the first (now closed) open call for industrial experiments.
In this contribution we look into the efficiency and scalability of our Lattice Boltzmann implementation Musubi when using OpenMP threads within an MPI-parallel computation on Hawk. The Lattice Boltzmann method enables explicit computation of incompressible flows, and the mesh discretization can be generated automatically, even for complex geometries. The basic Lattice Boltzmann kernel is fairly simple and involves only a few floating-point operations per lattice node. A simple loop over all lattice nodes in each partition of the MPI-parallel setup lends itself to straightforward loop parallelization with OpenMP. With increasing core counts per compute node, the use of threads on the shared-memory nodes is gaining importance, as it avoids overly small partitions with many outbound communications to neighboring partitions. We briefly discuss the hybrid parallelization of Musubi and investigate how the use of OpenMP threads affects performance when running simulations on the Hawk supercomputer at HLRS.
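The claim that fewer, larger partitions reduce communication can be made concrete with a back-of-envelope sketch (illustrative only, not Musubi code): splitting an n³ lattice into k³ cubic partitions yields a total halo surface of 6·k·n² cells, growing linearly with the number of cuts per dimension.

```python
def total_halo_cells(n, k):
    """Face cells exchanged when an n^3 lattice is split into k^3 cubic
    partitions of side n/k: k^3 partitions with 6 faces of (n/k)^2 cells
    each, i.e. 6*k*n^2 in total."""
    assert n % k == 0
    return 6 * k**3 * (n // k)**2

# Splitting a 512^3 lattice over 8 hybrid MPI ranks (OpenMP threads fill
# each node) versus 512 pure-MPI ranks:
hybrid = total_halo_cells(512, 2)    # 8 ranks
pure_mpi = total_halo_cells(512, 8)  # 512 ranks: 4x the boundary traffic
```

The hybrid decomposition thus trades many small, communication-heavy partitions for a few large ones, with threads providing the intra-node parallelism.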
We experience earthquakes frequently, and therefore buildings need to be earthquake resistant. A code that simulates the time-dependent response of buildings to earthquakes was parallelized using a hybrid execution model on the VH and VEs of SX-Aurora TSUBASA. Computational performance results will be presented.
09:35 – 10:05
Forecasting Intensive Care Unit Demand during the COVID-19 Pandemic: A Spatial Age-structured Microsimulation Model Ralf Schneider, HLRS, Sebastian Klüsener, Matthias Rosenbaum-Feldbrügge
Background: The COVID-19 pandemic poses the risk of overburdening health care systems, in particular intensive care units (ICUs). Non-pharmaceutical interventions (NPIs), ranging from wearing masks to (partial) lockdowns, have been implemented as mitigation measures around the globe. However, especially severe NPIs are used with great caution due to their negative effects on the economy, social life, and mental well-being. Thus, understanding the impact of the pandemic on ICU demand under alternative scenarios reflecting different levels of NPIs is vital for political decision-making. Our aim is to support such decision-making by forecasting COVID-19-related ICU demand under alternative scenarios of COVID-19 progression reflecting different levels of NPIs.
Methods: In this talk we will present our implementation of a spatial age-structured microsimulation model of the COVID-19 pandemic, which extends the Susceptible-Exposed-Infectious-Recovered (SEIR) framework. The model accounts for regional variation in population age structure and in spatial diffusion pathways. In a first step, we calibrate the model against hospital data on ICU patients with COVID-19 using a genetic optimization algorithm. In a second step, we forecast COVID-19-related ICU demand under alternative scenarios of COVID-19 progression reflecting different levels of NPIs. The third step is the automation of the procedure for the provision of weekly forecasts, in which the model’s parameters are estimated by means of Random-Forest regression.
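For readers unfamiliar with the SEIR framework, a minimal aggregate sketch of the underlying compartment dynamics follows (illustrative only; the actual model is a spatial, age-structured microsimulation, and the parameter values below are stand-ins, not the calibrated ones).

```python
def seir_step(S, E, I, R, beta, sigma, gamma, dt=1.0):
    """One explicit step of aggregate SEIR dynamics.
    beta: transmission rate, sigma: 1/incubation time, gamma: 1/infectious time."""
    N = S + E + I + R
    new_exposed = beta * S * I / N * dt      # S -> E
    new_infectious = sigma * E * dt          # E -> I
    new_recovered = gamma * I * dt           # I -> R
    return (S - new_exposed,
            E + new_exposed - new_infectious,
            I + new_infectious - new_recovered,
            R + new_recovered)

# One month of daily steps from a small seed of cases:
state = (82_000_000.0, 1000.0, 500.0, 0.0)
for _ in range(30):
    state = seir_step(*state, beta=0.3, sigma=1 / 5.5, gamma=1 / 10)
# Total population is conserved by construction.
```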
Results: We will show the application of the model to Germany and demonstrate state-level forecasts over a two-month period, which can be updated daily based on the latest data on the progression of the pandemic. To illustrate the merits of our model, we present “forecasts” of ICU demand for different stages of the pandemic during 2020 and 2021. Our forecasts for a quiet summer phase with low infection rates identified considerable variation across the federal states in the potential for relaxing NPIs. By contrast, our forecasts during a phase of quickly rising infection numbers in autumn (the second wave) suggested that all federal states should implement additional NPIs, although the identified need for additional NPIs again varied across federal states. In addition, our model suggests that during large infection waves ICU demand would quickly exceed supply if there were no NPIs in place to contain the virus.
Some recent engineering applications will be presented in which the turbulent flow in technical devices with embedded droplets and particles is simulated on the HPC platform Hawk installed at HLRS. The flow field in an internal combustion engine is investigated to analyze the mixing of air with various injected biofuels. The distribution of the evaporated fuel in the engine has a large influence on pollutant emissions and engine efficiency. Since biofuels possess quite different fluid properties, it is important to accurately predict the concentration of the evaporated fuels in order to optimize engine performance. The second application concerns electrical discharge and electro-chemical machining processes, which are used to manufacture workpieces such as turbine blades made of high-strength material. In these processes, fluid flow plays an important role in the transport of removed material and thus in the quality of the final product. The simulations are based on a solver formulated for hierarchical Cartesian meshes, in which a Lagrangian particle solver is used to track the motion of droplets of the fuel spray or the transport of removed material. Several aspects of the numerical methods, their parallelization, dynamic load balancing, and the implementation on high-performance computing platforms will be presented in this contribution.
We have performed three-dimensional particle-in-cell (PIC) plasma simulations to investigate the filamentary coherent structures known as blobs/holes.
Impurity ion transport by these structures is revealed: the impurity ion profile in the blob/hole becomes a dipole structure, and this propagates with the blob/hole.
The performance of the three-dimensional PIC code on the new supercomputer based on NEC SX-Aurora TSUBASA at the National Institute for Fusion Science will also be presented.
Assuming a non-unitary chiral superconductivity as the bulk state of Sr₂RuO₄, we show, by computational simulation of the Ginzburg-Landau equation, the field-induced chiral stability generating the paramagnetic current in the eutectic Sr₂RuO₄-Ru. The paramagnetic coupling with the chiral magnetization causes the field-induced chiral transition and the paramagnetic current. The field-induced chiral stability is consistent with the field dependence of the zero-bias anomaly in tunneling spectroscopy. This good agreement with the experimental result indicates that the non-unitary chiral spin-triplet state is one of the candidates for the superconducting state of Sr₂RuO₄, in addition to the chiral spin-singlet state. High-performance computing with code optimized for SX-Aurora makes it possible to analyze the field dependence and spatial variation of the chiral state and supercurrent in more detail.
Direct numerical simulations of turbulent flows require high computational power, only available on supercomputers such as those provided at HLRS. At the Institute of Aerodynamics and Gas Dynamics, an adapted in-house finite-difference code of high-order accuracy is used for the analysis of canonical transitional and turbulent wall-bounded flows and for (in-)stability investigations. The talk will present some recent results regarding supersonic mixing of two gases and show performance data for the massively parallel ‘Hawk’ system and the ‘SX-Aurora’ vector system.
With the development of AI/BDA, the number of users of HPC systems is increasing. The computational power required by users is also increasing, while on-premise HPC systems are difficult to expand due to limitations in power, space, and budget. In this session, we will introduce cloud bursting, a job scheduler function that temporarily expands computational power by using cloud computing resources. In addition, we will introduce the function that enables the HPC system to be used transparently under the job scheduler.
Molecular docking simulations are widely used in computational drug discovery to predict molecular interactions at close distances. Specifically, these simulations aim to predict the binding poses between a small molecule and a macromolecular target, referred to as the ligand and the receptor, respectively. The purpose of drug discovery is to identify ligands that effectively inhibit the harmful function of a certain receptor. In that context, molecular docking simulations are critical: by using them, the time-consuming preliminary task of identifying potential drug candidates can be significantly shortened. Subsequent wet-lab experiments can then be carried out using only a narrowed list of promising ligands, reducing the overall cost of experiments.
AutoDock is one of the most widely used software applications for molecular docking simulations. Its main engine is a Lamarckian Genetic Algorithm (LGA), which combines a genetic algorithm and a local-search method to explore several molecular poses. The prediction of the best pose is based on the score, a function that evaluates the free energy (kcal/mol) of a ligand-receptor system. AutoDock is characterized by nested loops with variable upper bounds and divergent control structures. Moreover, the time-intensive score evaluations are typically invoked a couple of million times within each LGA run. Due to this computational intensity, AutoDock suffers from long runtimes, mainly because the original implementation cannot exploit the embarrassing parallelism of the algorithm. In recent years, an OpenCL-based implementation of AutoDock has been developed to accelerate its execution on a variety of devices including multi-core CPUs, GPUs, and even FPGAs.
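The LGA structure described above can be sketched as follows. This is an illustrative toy, not AutoDock code: the scoring function and genetic operators are stand-ins for the free-energy evaluation and pose representation. The defining trait is that local-search refinements are written back into the genotype (the "Lamarckian" part).

```python
import random

def score(x):                        # stand-in for the free-energy evaluation
    return sum((xi - 3.0) ** 2 for xi in x)

def local_search(x, step=0.1, iters=20):
    """Simple hill climbing; the improved genotype is returned (Lamarckian)."""
    for _ in range(iters):
        cand = [xi + random.uniform(-step, step) for xi in x]
        if score(cand) < score(x):
            x = cand
    return x

def lga(dim=3, pop_size=20, generations=50):
    random.seed(0)
    pop = [[random.uniform(-10, 10) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop = [local_search(ind) for ind in pop]   # refine and write back
        pop.sort(key=score)
        parents = pop[: pop_size // 2]             # selection
        children = []
        for _ in range(pop_size - len(parents)):   # uniform crossover + mutation
            a, b = random.sample(parents, 2)
            child = [random.choice(g) for g in zip(a, b)]
            child[random.randrange(dim)] += random.uniform(-0.5, 0.5)
            children.append(child)
        pop = parents + children
    return min(pop, key=score)

best = lga()
```

In AutoDock the equivalent of `score` dominates the runtime, which is why it is the focus of the vectorization effort described below.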
In this work, we present our experiences porting and optimizing the OpenCL-based AutoDock onto the SX-Aurora Vector Engine. The OpenCL code is composed of host and device parts, which are maintained in the NEC VEO version. As the API functions of OpenCL and VEOffload resemble each other, porting the host code was very smooth. While the device part was easily ported too, extra effort was required to increase performance on SX-Aurora. For this, we used hardware-specific techniques involving appropriate data types for wider vectors, leveraging the multiple cores of the SX-Aurora, pushing outer loops into inner loops in the score calculations and local search, and using multi-process VEO to overcome OpenMP limitations in NUMA mode. Our evaluations were done on VE10B and VE20B models and compared to modern multi-core CPUs and GPUs.
In October 2020, we started the operation of the supercomputer AOBA, which employs the second-generation SX-Aurora TSUBASA as its main computing resource. In this talk, I would like to share performance evaluation results to discuss the potential of the second-generation SX-Aurora TSUBASA through comparison with previous generations. I will also introduce our recent research activities on making good use of its performance while keeping the code portable.
Structural analysis using the finite element method (FEM) is widely used in the field of engineering.
Recently, NEC has introduced SX-Aurora TSUBASA, which has vector accelerator boards (Vector Engine, VE).
One VE has a high-speed memory with a bandwidth of about 1.2 TB/s and eight high-performance vector cores.
Each core has three fused multiply-add (FMA) arithmetic units, each of which can operate on 32 double-precision floating-point elements simultaneously.
The host CPU is called VH.
FrontISTR is one of the highly parallelized open-source FEM software programs for nonlinear structural analysis.
This software first generates a stiffness matrix using the FEM and then solves the linear equations for the resulting sparse matrix.
The stiffness matrix generation is not suitable for the VE because it does not access data contiguously and involves many integer operations.
There is an API for transferring data between VH and VE called Another/Alternative/Awesome VE Offloading (AVEO), which can be used to execute compute-intensive portions of an entire program on VE.
We accelerate the overall structural analysis program by running the linear equation solver on the VE using AVEO.
We chose the JAD format as the sparse matrix storage format and the conjugate gradient (CG) method as the linear solver.
In this study, we evaluate the accelerated FrontISTR in terms of the following three parts:
(1) the generation time of stiffness matrices on the VH, (2) the transfer time of sparse matrices and vectors from the VH to the VE using AVEO, and (3) the solution time of the linear equations by the CG method on the VE.
We describe the effectiveness of accelerated structural analysis execution on NEC SX-Aurora TSUBASA.
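To illustrate why the JAD (jagged diagonal) format suits a vector engine, here is a minimal sketch (not FrontISTR code): rows are permuted by decreasing nonzero count, and the j-th nonzero of every row is stored contiguously, so the SpMV inner loop runs across nearly all rows at once, producing the long, regular loops a vector engine needs.

```python
def to_jad(rows):
    """Convert a row-wise sparse matrix (list of [(col, val), ...] per row)
    into jagged diagonals: rows sorted by decreasing nonzero count, then the
    j-th nonzero of every row collected contiguously."""
    perm = sorted(range(len(rows)), key=lambda r: -len(rows[r]))
    ndiag = max((len(r) for r in rows), default=0)
    return [[(r, rows[r][j][0], rows[r][j][1])
             for r in perm if len(rows[r]) > j]
            for j in range(ndiag)]

def jad_spmv(diags, x, n):
    """y = A @ x. The outer loop is short (max nonzeros per row); the inner
    loop over a jagged diagonal is long and regular, hence vectorizable."""
    y = [0.0] * n
    for diag in diags:
        for r, c, v in diag:
            y[r] += v * x[c]
    return y

# 3x3 example: A = [[4, 1, 0], [1, 3, 0], [0, 0, 2]]
rows = [[(0, 4.0), (1, 1.0)],
        [(0, 1.0), (1, 3.0)],
        [(2, 2.0)]]
y = jad_spmv(to_jad(rows), [1.0, 2.0, 3.0], 3)  # -> [6.0, 7.0, 6.0]
```

In a CG iteration this SpMV dominates, so on the VE its inner-loop length directly determines how well the 1.2 TB/s memory bandwidth is utilized.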
Developing efficient graph algorithm implementations is an extremely important problem of modern computer science, since graphs are frequently used in various real-world applications. Graph algorithms typically belong to the data-intensive class, and thus using architectures with high-bandwidth memory potentially allows many graph problems to be solved significantly faster than on modern multi-core CPUs. Among other supercomputer architectures, vector systems, such as the SX family of NEC vector supercomputers, are equipped with high-bandwidth memory. However, the highly irregular structure of many real-world graphs makes it extremely challenging to implement graph algorithms on vector systems, since these implementations are usually bulky and complicated, and a deep understanding of the hardware features of vector architectures is required. We present the world’s first attempt to develop an efficient and simultaneously simple graph processing framework for modern vector systems. Our vector graph library (VGL) framework targets NEC SX-Aurora TSUBASA as the primary vector architecture and provides relatively simple computational and data abstractions. These abstractions incorporate many vector-oriented optimization strategies into a high-level programming model, allowing quick implementation of new graph algorithms with a small amount of code and minimal knowledge of the features of vector systems. The provided comparative performance analysis demonstrates that VGL-based implementations achieve significant acceleration over existing high-performance frameworks and libraries: up to 14 times speedup over multi-core CPUs (Ligra, Galois, GAPBS) and up to 3 times speedup over NVIDIA GPU implementations (Gunrock, NVGRAPH).
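The flavor of such abstractions can be conveyed with a toy frontier/advance sketch (names and signatures here are illustrative, not the actual VGL API): the user supplies a per-edge operation, while the framework owns the traversal loop, which is where the vector-oriented optimizations would live on real hardware.

```python
def advance(graph, frontier, edge_op):
    """Apply edge_op to every edge leaving the frontier; return the set of
    vertices whose state changed (the next frontier). In a real vector
    framework this loop is where vectorization happens."""
    next_frontier = set()
    for u in frontier:
        for v in graph[u]:
            if edge_op(u, v):
                next_frontier.add(v)
    return next_frontier

def bfs(graph, source):
    """BFS expressed purely through the advance abstraction."""
    dist = {source: 0}
    def relax(u, v):                 # user-defined edge operation
        if v not in dist:
            dist[v] = dist[u] + 1
            return True
        return False
    frontier = {source}
    while frontier:
        frontier = advance(graph, frontier, relax)
    return dist

g = {0: [1, 2], 1: [3], 2: [3], 3: []}
d = bfs(g, 0)  # -> {0: 0, 1: 1, 2: 1, 3: 2}
```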
This presentation introduces optimizations of stencil computations on SX-Aurora TSUBASA, focusing on the distinct bandwidth characteristics of SX-Aurora TSUBASA.
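For context (an illustrative sketch, not the optimized code from the talk): a stencil reads several neighbors per output element, so the computation is memory-bandwidth bound, and its long unit-stride inner loop is exactly what a vector engine pipelines.

```python
def stencil_3pt(u, a=0.25, b=0.5, c=0.25):
    """One sweep of a 1D three-point stencil with fixed boundary values.
    The inner loop is unit-stride over the whole array -- the shape of
    loop that vectorizes well and whose speed is set by memory bandwidth."""
    return ([u[0]]
            + [a * u[i - 1] + b * u[i] + c * u[i + 1]
               for i in range(1, len(u) - 1)]
            + [u[-1]])

u = [0.0, 0.0, 4.0, 0.0, 0.0]
u1 = stencil_3pt(u)  # -> [0.0, 1.0, 2.0, 1.0, 0.0]
```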
This presentation will share our efforts and experiences in evaluating performance and scalability of CODA on current HPC architectures. CODA is a CFD solver for external aircraft aerodynamics developed by DLR, ONERA, and Airbus, and one of the key next-generation engineering applications represented in the European Centre of Excellence for Engineering Applications (EXCELLERAT).
An Energy-aware Cache Control Mechanism for Deep Cache Hierarchy Ryusuke Egawa (Tokyo Denki University), Liu Jiaheng (Tohoku University)
To overcome the memory wall problem, the cache hierarchies of modern microprocessors have become deeper and larger as the number of cores increases. As a result, the power and energy consumption of the deep cache hierarchy has become non-negligible. In this talk, we present a mechanism to improve cache energy efficiency by adapting the cache hierarchy to individual applications, together with its evaluation results.
Using MPI in combination with asynchronous task-based programming models can be daunting. Applications typically have to manage a dynamic set of active operations, fall back to a fork-join model, or rely on middleware to coordinate the interaction between MPI and the task scheduler. In this talk, I will propose an extension to MPI, called MPI Continuations, that provides a callback-based notification mechanism to simplify the use of MPI inside asynchronous tasks.
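The callback-based notification idea can be illustrated with a Python analogy (this is not the proposed MPI Continuations API, which is an MPI extension in C): instead of blocking in a wait call, a continuation is registered on the operation and invoked by the runtime when it completes, so a task scheduler can resume the dependent task rather than spin.

```python
import concurrent.futures as cf
import threading

done = threading.Event()
results = []

def continuation(fut):
    # Runs when the "communication" completes; a task runtime would use
    # this hook to mark the waiting task runnable instead of blocking it.
    results.append(fut.result())
    done.set()

with cf.ThreadPoolExecutor() as pool:
    fut = pool.submit(lambda: 42)      # stand-in for a nonblocking operation
    fut.add_done_callback(continuation)
    done.wait()                        # a scheduler would run other tasks here
```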
Basics on Quantum Computation Thomas Kloss, University Grenoble
With the announcement of quantum supremacy in 2019, Google claimed to have solved the first real-world problem out of reach for classical computers. Since then at the latest, quantum computing has moved into the political and economic spotlight. In this talk I will present some very basic slides on quantum computers and what makes them different from classical computers. I will also show new work which calls Google’s supremacy claim into question.
The growth of artificial intelligence (AI) is accelerating. AI has left research and innovation labs and nowadays plays a significant role in everyday life. The impact on society is tangible: autonomous cars produced by Tesla, voice assistants such as Siri, and AI systems that beat renowned champions in board games like Go. All these advancements are facilitated by powerful computing infrastructures based on HPC and advanced AI-specific hardware, as well as highly optimized AI codes. For several years, HLRS has been engaged in big data and AI-specific activities around HPC. In this talk, I will give a brief overview of our AI-focused research project CATALYST, which engages with researchers and industry, present selected case studies, and outline our journey over recent years with respect to the convergence of AI and HPC from both a software and a hardware point of view.
Many codes developed during the vector supercomputing era from the 1970s to the 1990s are still in use, with vector-friendly constructs in their codebases. The recently released NEC Vector Engine provides an opportunity to exploit this vector heritage: it can potentially provide state-of-the-art performance without a complete rewrite of the codebase. Given the time and cost required to port or rewrite codes, this is a potentially attractive solution. This presentation will assess how the NEC Vector Engine’s performance compares with existing architectures using traditional benchmarks and the legacy CFD program FDL3DI, as well as the effort required to take full advantage of the architecture.