In my talk, I would like to share some experiences with NEC’s new vector system, SX-Aurora TSUBASA. I will also present our ongoing project, named Quantum Annealing-Assisted Next-Generation HPC Infrastructure, which extends SX-Aurora TSUBASA toward the future.
This talk will summarize the research and development activities of HLRS and will highlight future activities and directions.
10:15 – 10:45
Break
10:45 – 11:15
Status of HPC in Siegen
Sabine Roller, Simulationstechnik & Wissenschaftliches Rechnen, Universität Siegen
11:15 – 11:45
HPC, HPDA, Machine Learning, Deep Learning… a glance at the evolution of traditional HPC Centers
Bastian Koller, HLRS
In this talk, the current evolution of HPC from a single resource offering towards a set of solutions will be presented and analysed, and some thoughts about the future use of such systems will be given.
This talk will present the Singular Value Decomposition (SVD) algorithm and give results for smaller data sets that use SVD as a data reduction algorithm capable of performing feature extraction on “raw” data.
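As a brief, hedged illustration of this idea (not the speaker’s actual implementation), a truncated SVD can project raw samples onto a few dominant directions; the data matrix and rank below are placeholders.

```python
import numpy as np

# Illustrative only: random stand-in for a "raw" data matrix
# (rows = samples, columns = raw measurements).
X = np.random.rand(200, 50)

# Center the data, then compute the thin SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep the k strongest singular directions as extracted features.
k = 5
features = Xc @ Vt[:k].T          # reduced representation (200 x k)
X_approx = features @ Vt[:k]      # rank-k reconstruction for error checks

print("captured variance fraction:",
      (s[:k] ** 2).sum() / (s ** 2).sum())
```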
The main interest of the climate models running at DKRZ focuses on two major directions. On the one hand, high-resolution grids are being used in order to resolve small-scale physical processes. In this way, parametrisation and its inherent uncertainty can be avoided, thus significantly improving climate change projections. On the other hand, addressing questions related to climate variability also involves simulations running over long time periods, e.g. modeling of a complete glacial cycle on grids with coarser resolution.
Such simulations are computationally very intensive, and high sustained performance is vital in order to be able to conduct real-world experiments. Both single-node performance and good scaling capabilities of the soft- and hardware are important.
The NEC Aurora TSUBASA System promises high sustained performance by combining high floating point performance of vector processors with extremely high memory bandwidth. The presentation will show first results from our tests with earth system models on the NEC Aurora system at DKRZ.
A new vector supercomputer, SX-Aurora TSUBASA, has been released. It has a newly developed Vector Engine (VE) processor to achieve high sustained performance through powerful vector processing and high memory bandwidth. This presentation examines the basic potential of SX-Aurora TSUBASA through performance evaluations.
Direct numerical simulations (DNSs) of turbulence at high Reynolds number are very important to understand the behavior of turbulent flow and to establish turbulence models. We have carried out large-scale DNSs for more than a decade on the Earth Simulator and the K computer and would like to execute DNSs with larger numbers of grid points, beyond the present ones, on future supercomputers. In this talk, the first evaluation results of a DNS code’s performance on SX-Aurora TSUBASA will be presented. In particular, off-loaded I/O performance for checkpoints of instantaneous velocity fields from a vector engine (VE) to a vector host (VH), as well as computing performance on VEs, will be included.
MPI parallel simulations provide some challenges when dealing with user interaction. We present:
* a method to obtain configuration settings from Lua scripts in a scalable way,
* a strategy to manage logging output during the parallel execution with configurable level of detail,
* a concept to deal with errors detected by the application at runtime.
These components are put together into a Fortran library to build a basic infrastructure for parallel simulation applications. It relies on Fypp as a pre-processing tool, which allows the use of Python to generate Fortran code.
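The library itself is Fortran with Lua configuration via Fypp; the minimal mpi4py sketch below only illustrates the underlying scalable pattern (one rank parses the configuration, everyone else receives it via broadcast), with JSON standing in for Lua and made-up file names.

```python
from mpi4py import MPI
import json

comm = MPI.COMM_WORLD

# Only one rank touches the file system; all other ranks receive the
# parsed settings via a broadcast, which keeps the approach scalable.
config = None
if comm.Get_rank() == 0:
    with open("settings.json") as f:   # placeholder for the Lua script
        config = json.load(f)
config = comm.bcast(config, root=0)

# Configurable level of detail for logging: only rank 0 writes here.
if comm.Get_rank() == 0 and config.get("log_level", "info") != "quiet":
    print("simulation configured with", config)
```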
With limited processor frequency, any performance increase results from parallelism. But the startup time needed to fill the operating units decreases the effective performance if the executed kernels are not large. We discuss this effect using measurements on Aurora. A remedy could be to implement bricks of kernels in hardware.
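One common way to quantify this effect is a Hockney-style startup model; the short sketch below (my own illustration, not from the talk, with invented numbers) shows how effective performance collapses for short kernels.

```python
# Pipeline-fill model: a kernel of length n reaches only a fraction
# n / (n + n_half) of peak, where n_half is the kernel length needed
# to reach half of peak performance. All numbers are illustrative.
def effective_performance(peak_gflops, n, n_half):
    return peak_gflops * n / (n + n_half)

for n in (64, 256, 1024, 16384):
    print(n, effective_performance(2150.0, n, n_half=512))
```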
In this talk, we present a novel data-based approach to turbulence modelling for Large Eddy Simulation by artificial neural networks. We define the exact closure terms including the discretization operators and generate training data from direct numerical simulations of decaying homogeneous isotropic turbulence. We design and train artificial neural networks based on local convolution filters to predict the underlying unknown non-linear mapping from the coarse grid quantities to the closure terms without a priori assumptions. All investigated networks are able to generalize from the data and learn approximations with a cross correlation of up to 47% and even 73% for the inner elements, leading to the conclusion that the current training success is data-bound. We further show that selecting both the coarse grid primitive variables as well as the coarse grid LES operator as input features significantly improves training results. Finally, we construct a stable and accurate LES model from the learned closure terms. To this end, we translate the model predictions into a data-adaptive, pointwise eddy viscosity closure and show that the resulting LES scheme performs well compared to current state-of-the-art approaches. This work represents the starting point for further research into data-driven, universal turbulence models.
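One common way to turn a predicted closure term into a pointwise eddy viscosity is a least-squares projection onto the resolved strain rate; the sketch below illustrates that step on synthetic arrays and is my own assumption about the mapping, not the authors’ code.

```python
import numpy as np

def eddy_viscosity_from_closure(tau_pred, S):
    """Project a predicted closure tensor onto an eddy-viscosity form,
    pointwise: nu_t = max(0, tau:S / (2 S:S)). Shapes: (npoints, 3, 3)."""
    num = np.einsum("pij,pij->p", tau_pred, S)
    den = 2.0 * np.einsum("pij,pij->p", S, S) + 1e-12
    return np.clip(num / den, 0.0, None)

# Synthetic stand-ins for the network output and the resolved strain rate.
tau_pred = np.random.randn(1000, 3, 3)
S = np.random.randn(1000, 3, 3)
nu_t = eddy_viscosity_from_closure(tau_pred, S)
```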
A simulation framework for a generalized multiphysics simulation concept is introduced, based on an implementation of various solvers formulated for Cartesian hierarchical meshes. The implementation features a generalized mesh object which communicates with various solvers based on finite-volume or discontinuous Galerkin methods. Since all solution methods share a common mesh, solution-adaptive meshes with dynamic load balancing are straightforward to implement. Interleaved time stepping of the solvers for the different physics allows an efficient implementation on HPC systems. Examples are presented for the coupling of various solution methods for flow simulations, acoustic fields, and level sets used for tracking moving surfaces.
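Purely as an illustration of the interleaving idea (the solver objects and step sizes below are placeholders, not the framework’s API), a solver needing a smaller time step can be sub-cycled inside each step of the other solver while both act on the shared mesh.

```python
class Solver:
    """Placeholder for an FV or DG solver operating on the shared mesh."""
    def __init__(self, name):
        self.name, self.t = name, 0.0
    def advance(self, dt):
        self.t += dt

def run_coupled(flow, acoustics, dt_flow, n_sub, n_steps):
    # Interleaved time stepping: the acoustic solver is sub-cycled with a
    # smaller step inside each flow step, both acting on the same mesh.
    for _ in range(n_steps):
        flow.advance(dt_flow)
        for _ in range(n_sub):
            acoustics.advance(dt_flow / n_sub)

run_coupled(Solver("flow"), Solver("acoustics"),
            dt_flow=1e-3, n_sub=4, n_steps=10)
```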
10:00 – 10:30
Moving geometries in high-order discontinuous Galerkin discretization
Neda Ebrahimi Pour, Uni Siegen
Representing geometries in high-order schemes is a crucial task with special requirements. An attractive solution to this problem is the use of penalizing terms to represent the geometry within elements. This approach also allows for convenient movement of obstacles through flows, for example, as it avoids the need for expensive remeshing and interpolation. We present this concept in our high-order discontinuous Galerkin solver Ateles and show first results for compressible flows.
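Penalization methods of this kind (shown here as a generic Brinkman-type penalty, my own illustration rather than the exact Ateles formulation) add a stiff source term that drives the solution toward the obstacle state inside masked regions.

```python
import numpy as np

def penalized_rhs(rhs, u, chi, u_obstacle, eta=1e-4):
    """Add a Brinkman-type penalty term: inside the obstacle (chi = 1) the
    state u is driven toward u_obstacle on the fast time scale eta."""
    return rhs - chi / eta * (u - u_obstacle)

# Toy 1D example: the obstacle occupies the middle third of the domain.
n = 90
u = np.ones(n)                       # free-stream state
chi = np.zeros(n); chi[30:60] = 1.0  # geometry mask
rhs = np.zeros(n)
rhs = penalized_rhs(rhs, u, chi, u_obstacle=0.0)
```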
A significant part of a homogeneous supercomputer consists of an extensive number of general-purpose processors (CPUs), which are connected with each other over a high-performance network. The degree of parallelism of these high-performance computing (HPC) platforms is generally limited to the number of processor cores. Scalability and performance enhancement almost exclusively stem from the growing number of CPU cores, although that number no longer meets the constantly expanding HPC requirements. The growing number of CPU cores is to be seen not only as a simple increase in the number of cores but also as additional overhead in distributing the remaining hardware resources, such as the network, last-level cache, memory channels, power budget, etc., among the growing number of cores and hence also among the parallel processes and threads. Particularly in combination with the capability of the processor to change its operating voltage and frequency, so-called “Dynamic Voltage and Frequency Scaling” (DVFS), the analysis of the scalability and energy efficiency of a multi-core processor, even on the basis of existing models such as the “Roofline Performance Model” or the “Execution-Cache-Memory Model” (ECM), is additionally complicated. The performance and power dissipation of CPU and DRAM are in complex interaction with the number of active cores and the CPU frequency. This talk presents an extension of the ECM model (hereinafter referred to as DTM, “Data Transfer Model”), which describes the performance taking into consideration the various frequencies of the hardware components. The evaluation of the model using the “STREAM” kernels (with temporal memory accesses) is performed on different hardware architectures.
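The DTM model itself is not given in the abstract; the sketch below is only a generic roofline-flavoured estimate of my own construction, showing how in-core time can be made to depend on the core frequency while memory time is set by the achievable bandwidth. All parameter values are invented.

```python
def stream_triad_time(n, f_core_ghz, bw_gbs, cycles_per_iter=1.0):
    """Rough roofline-style estimate for a STREAM-triad-like kernel: the
    kernel is limited either by in-core execution (scales with the core
    frequency) or by memory transfers (set by the achievable bandwidth).
    Illustrative numbers only, not measured DTM parameters."""
    t_core = n * cycles_per_iter / (f_core_ghz * 1e9)
    bytes_per_iter = 3 * 8 + 8        # a[i] = b[i] + s*c[i], with write-allocate
    t_mem = n * bytes_per_iter / (bw_gbs * 1e9)
    return max(t_core, t_mem)

for f in (1.2, 1.8, 2.4):             # hypothetical DVFS operating points
    print(f, stream_triad_time(10**8, f, bw_gbs=100.0))
```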
Numerical simulations of phase change processes require a precise reconstruction of the interface between two phases. Based on the Volume of Fluid (VoF) method for multiphase flows, the height function technique is able to reconstruct the sharp interface accurately and enables simulations with complex interface deformations. But these calculations increase the computational load for cells containing the phase interface, so an equidistant domain decomposition leads to an imbalanced workload distribution. In order to perform investigations with high spatial and temporal resolution, it is necessary to use the available HPC resources efficiently. The challenge of parallelization is to distribute the workload homogeneously among the cores. For simulations of solidification processes with the multiphase code Free Surface 3D (FS3D), a load-balanced domain decomposition is presented. The first part is the decomposition of the structured computational domain by recursive bisection. The second part is the corresponding process communication, which enables nearest-neighbor communication through non-blocking MPI calls. The transport of the diagonal element is realized via a communication sequence, and thus an exchange of small amounts of data is avoided. A measure for the load imbalance is presented based on test cases. Finally, advantages and limitations of load balancing are discussed based on the tracing of the calculations for one time step.
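As a minimal one-dimensional sketch of the recursive bisection idea (not FS3D’s actual decomposition), per-slab workloads can be split so that each half receives load roughly proportional to its share of processes.

```python
import numpy as np

def bisect(weights, lo, hi, nprocs):
    """Recursively split the index range [lo, hi) among nprocs processes so
    that the summed cell weights per process are as even as possible."""
    if nprocs == 1:
        return [(lo, hi)]
    n_left = nprocs // 2
    target = weights[lo:hi].sum() * n_left / nprocs
    cut = lo + int(np.searchsorted(np.cumsum(weights[lo:hi]), target))
    cut = min(max(cut, lo + 1), hi - 1)   # keep both halves non-empty
    return (bisect(weights, lo, cut, n_left)
            + bisect(weights, cut, hi, nprocs - n_left))

# Interface cells (weight 5) are far more expensive than bulk cells (weight 1).
w = np.ones(1000)
w[480:520] = 5.0
print(bisect(w, 0, len(w), nprocs=4))
```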
Simulations help to improve the development of new high-performance materials with tailored microstructures and defined properties. The process of sintering is of high interest in order to produce defined ceramic materials as needed for a broad range of applications, e.g. healthcare, electronics, automotive, and aerospace. The phase-field method makes it possible to efficiently investigate the microstructure evolution in large-scale 3D domains during the sintering process. A phase-field model based on the grand potential approach is implemented in the massively parallel phase-field solver framework PACE3D. It is optimized on various levels, starting from the model and parameters down to the hardware. The solver allows an arbitrary number of individual particles in the green body to be resolved and calculated by using a local reduction technique and a material-class-based parametrization concept. The evolution equations for the phase fields and the concentration are explicitly vectorized using vector intrinsics. Performance results on a single core, a single node, and with up to 96,100 processes on the German supercomputers Hazel Hen, SuperMUC, and ForHLR II are shown and discussed. Besides an optimized voxel-based format to store the simulation data for checkpointing with MPI-IO, an efficient and reduced mesh-based output is used.
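A minimal mpi4py sketch of the checkpointing pattern mentioned above: each rank writes its local voxel block at its own offset with a collective MPI-IO call. The field size and file name are placeholders, not PACE3D internals.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Placeholder local block of the voxel field (same size on every rank here).
local = np.full(1_000_000, float(rank))

fh = MPI.File.Open(comm, "checkpoint.bin",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
offset = rank * local.nbytes          # simple contiguous layout by rank
fh.Write_at_all(offset, local)        # collective write
fh.Close()
```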
This talk first introduces the MPI shared-memory model and then illustrates its usage via two use cases. One is hybrid RMA, which is fully employed in DART-MPI. The other is hybrid collectives (to be specific, the allgather operation). These two use cases highlight the performance benefit brought by applying the MPI shared-memory model. However, extra synchronization operations must be added to guarantee deterministic behaviour.
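A minimal mpi4py example of the MPI-3 shared-memory model (not the DART-MPI code): ranks on the same node allocate one shared window, rank 0 fills it, and the others read it directly after the synchronization mentioned above.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
# Split off a communicator of ranks that can share memory (same node).
node = comm.Split_type(MPI.COMM_TYPE_SHARED)

n = 1024
itemsize = MPI.DOUBLE.Get_size()
# Rank 0 of the node provides the memory; the others attach with size 0.
win = MPI.Win.Allocate_shared(n * itemsize if node.Get_rank() == 0 else 0,
                              itemsize, comm=node)
buf, _ = win.Shared_query(0)
shared = np.ndarray(buffer=buf, dtype='d', shape=(n,))

win.Fence()                      # extra synchronization is required for
if node.Get_rank() == 0:         # deterministic behaviour, as noted above
    shared[:] = np.arange(n)
win.Fence()
print(node.Get_rank(), shared[:3])
```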
The current usage of MPI communication operations leads to global synchronization across many processes and compute nodes. The problem becomes more severe when combining MPI with a thread-parallel programming model such as OpenMP: synchronization latencies are paid manyfold by all threads within an MPI process. We present ongoing work to address this problem by implementing a task-based programming model which allows dependencies to be expressed across MPI processes. This kind of fine-grained synchronization can replace global MPI synchronization in many cases and thus result in substantially improved communication efficiency.
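The task-based model itself is not shown here; the sketch below only illustrates the underlying point with plain nonblocking MPI: a rank waits for the specific neighbours it depends on rather than for a global collective or barrier.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank exchanges a halo only with its left/right neighbour and waits
# only on those requests -- no global barrier or collective is involved.
left, right = (rank - 1) % size, (rank + 1) % size
send = np.full(10, float(rank))
recv_l, recv_r = np.empty(10), np.empty(10)

reqs = [comm.Isend(send, dest=left,  tag=0),
        comm.Isend(send, dest=right, tag=1),
        comm.Irecv(recv_r, source=right, tag=0),
        comm.Irecv(recv_l, source=left,  tag=1)]
MPI.Request.Waitall(reqs)      # fine-grained: depends only on the neighbours
```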
One of the most intensive I/O operations of a scientific simulation is so-called checkpointing, which saves the state of a running simulation into a checkpoint file so that the simulation can be resumed from the file upon a system failure. Generally, it is difficult to increase the I/O performance of a system at the same pace as its computational performance. Moreover, it will be necessary for a future system to perform checkpointing more frequently, because the system will consist of more hardware components and hence the probability of encountering a system failure during a simulation will significantly increase. As a result, the relative overhead of checkpointing will grow and could dominate the total simulation time. In this talk, therefore, I will discuss the possibility and potential benefit of employing automatic parameter tuning for reducing the checkpointing overhead.
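One classical starting point for such tuning is Young’s approximation for the optimal checkpoint interval; the numbers in this small sketch are invented and the talk’s actual tuning approach may differ.

```python
import math

def optimal_checkpoint_interval(write_time_s, mtbf_s):
    """Young's approximation: checkpoint every sqrt(2 * C * MTBF) seconds,
    where C is the time needed to write one checkpoint."""
    return math.sqrt(2.0 * write_time_s * mtbf_s)

# Illustrative numbers only: 5-minute checkpoint write, 24-hour system MTBF.
print(optimal_checkpoint_interval(300.0, 24 * 3600.0) / 3600.0, "hours")
```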
The NEC SX-Aurora TSUBASA is a high-performance vector CPU for sustained simulation performance. The existing compiler toolchain for the SX-Aurora is comprehensive but also proprietary, restricting its use in research and confining its development to internal teams at NEC. In recent years, the open-source LLVM compiler infrastructure has seen significant support and contributions from major players such as NVIDIA, AMD, ARM, Intel, Apple, and Google. These employ LLVM in their official toolchains, GPU driver stacks, and mission-critical infrastructure. Likewise, many compiler research labs have adopted LLVM for its accessibility, robustness, and permissive license. Recently, the LLVM community has been discussing an extension for scalable vector architectures (LLVM-SVE), which feature an active vector length just as the SX-Aurora does. In this talk, we will discuss the potential of LLVM for the NEC SX-Aurora. The Compiler Design Lab at Saarland University is working with NEC on an LLVM-SVE backend for the SX-Aurora.
The role of supercomputer storage at JAMSTEC is becoming increasingly important, not only for simulation but also for data-driven science. Information such as storage I/O performance and file-system characteristics may improve usability for users and lead to new scientific knowledge and innovation. In this talk, we introduce our approach to providing I/O information to users.