Scientific / Educational projects that are using the infrastructure provided by the Intel / Unesp Modern Code Program
The contribution of meta-heuristics – in particular evolutionary algorithms – to the area of optimization is extremely important, as they help find optimized solutions for complex real-life problems and offer great flexibility in the modelling of those problems. This work proposes a model for optimizing the job shop schedule that searches both for the best sequence of operations and for the lot sizes into which each operation can be independently subdivided within the same order. The possibility of using alternative resources, operations with two or more resources, and unavailability intervals are features of the model, which lend it great robustness and applicability. In addition, execution with parallel tasks can provide better performance in the search for solutions.
Leandro Mengue (Master Student)
Arthur Tórgo Gómez (Supervisor)
Optimization of complex numerical modelling applications
This project focuses on the performance analysis of a complex numerical modelling application that evaluates the coupling between geomechanics and multiphase flows. The idea is to evaluate two-phase immiscible flow in a strongly heterogeneous deformable carbonate underneath a rock salt composed of halite and anhydrite displaying creep behaviour, with the viscous strain ruled by a nonlinear constitutive law of power-law type. This application is fundamental in reservoir engineering to detect and explore deeper formations. In this project, we propose a detailed performance analysis of the application using the VTune and Advisor tools from Intel and a further parallelization of the code. Our goal is to provide a more efficient code that runs on multicore processors accelerated by an Intel Xeon Phi processor. The main idea is to detect the hot spots in the code and propose parallel solutions that include the use of OpenMP and vectorization.
Leandro Pereira (Undergraduate Student)
Cristiana Bentes (Supervisor)
Protein Structure Prediction (PSP) is one of the most important topics in the field of bioinformatics, and several important applications in medicine (such as drug design) and biotechnology (such as the design of novel enzymes) are based on PSP methods. Profrager is a fragment library generation tool developed at the Brazilian National Laboratory for Scientific Computing (LNCC) that aims to improve the performance of PSP by generating fragment libraries that minimize the PSP search space. Profrager experiments can be computationally intensive, and a possible approach is to rely on parallel architectures to improve scalability. Current trends in the design of parallel computing architectures are towards increasing the computational power of multi-core processor servers by aggregating many-core coprocessors or accelerators. Such hybrid architectures have the potential to speed up and improve the throughput of applications, but it is challenging to use all the processing power offered by the heterogeneous resources efficiently. The objective of this project is to evaluate how to achieve high levels of performance using Intel multi-core and many-core architectures.
Silvio Luiz Stanzani (Research Associate)
Raphael Mendes de Oliveira Cóbe (Research Associate)
Rogério Luiz Iope (Research Associate)
Institution: Unesp / Núcleo de Computação Científica
Parallel Programming Marathon
This project was part of the II Regional School of High Performance Computing of Rio de Janeiro (ERAD-RJ 2016), which aimed at putting students in touch with the area and increasing their interest in HPC. The marathon gives students the opportunity to test their parallel programming skills using Xeon and Xeon Phi processors. Students are grouped into teams of up to three people. The competition has two stages: a warm-up, where they get familiar with the computational environment, and the contest itself, where they have three hours to parallelize a set of applications.
Judging is strict. At the beginning of the contest, teams receive problem descriptions and sequential (serial) solutions. Scoring considers not only the correctness of each solution but also the performance speedup of the parallel (or distributed) version, measured according to criteria defined by the committee for the current contest. The winning team is the one with the greatest accumulated speedup over all applications.
Optimization of Geant
UNESP Intel Parallel Computing Center is mainly involved in R&D efforts to transform high-energy physics (HEP) software tools, in particular the simulation framework known as Geant, towards using them with modern computing architectures that support multi-threading and other parallel processing techniques to make data processing more cost effective. Geant is a toolkit for the simulation of the passage of particles through matter using Monte Carlo methods. It is one of the most important software tools for the HEP community, incorporating physics knowledge from modern particle and nuclear physics experiments and theory, and it has been designed to model all the elements associated with detector simulation: the geometry of the system, the electromagnetic fields inside the materials, the physics processes governing particle interactions, the response of sensitive detector components, the storage of events and tracks, the visualization of the detector and the particle trajectories, and the capture and analysis of simulation data at different levels of detail and refinement. It is an open source project, founded in 1994, and developed and maintained by an international collaboration of around one hundred physicists and computer scientists. Fully coded in C++, it is considered both a toolkit and a framework: users can choose any of its software libraries to use within a specific application, and its functionality can be expanded through its many interface points. Geant4, the current version, is the re-engineered, object-oriented successor of Geant3, which was written in Fortran. Geant is in general associated with long calculation times, and it is ideal for compute-bound workloads that may be well suited for execution on Intel Xeon Phi coprocessors.
The plan for code performance improvements at the UNESP Parallel Computing Center includes the development of the necessary tools and metrics to evaluate the performance of multi-threaded HEP applications running on Intel Xeon Phi coprocessors. The researchers intend to test vector-coprocessor prototypes in a hybrid computing system such as the one made available by the Intel/Unesp Modern Code project and analyze the performance of the next generation of the Intel MIC architecture (Knights Landing), evaluating any redesign efforts that may be necessary for adopting the new technology. These activities are closely related to the development of Geant-V, a new generation of the Geant simulation engine, which will include massive parallelism natively.
Institution: Unesp / Núcleo de Computação Científica
Optimization/Modernization of PDE solvers applied to flow dynamics (Oil & Gas)
In this project our objective is to analyze the potential for optimization of a Godunov-type semi-discrete central scheme for a particular hyperbolic problem arising in porous media flow, targeting the Intel Xeon architecture, in particular the Haswell processor family, which brings a new and more advanced instruction set to support vectorization.
Institution: Laboratório Nacional de Computação Científica (LNCC)
Accelerating Weather Forecast microphysics using Heterogeneous Parallel Computing
The objectives of this project are: to understand the complexity of weather forecasting using traditional CPU solutions; to increase the resolution of weather forecasts run for Colombian climate conditions; and to determine the real effort necessary to rewrite, migrate, or adapt legacy solutions, written mainly in Fortran, to new massively parallel processor architectures.
This research aims to use heterogeneous accelerators (GPUs and vector processors) to offload the most computationally intensive processes, such as microphysics (cloud and precipitation), in a WRF model, in order to use higher resolutions and thus increase the accuracy of weather forecasts for Colombian conditions.
The use of accelerators in weather forecasting increases computational capacity at low cost for Colombian meteorological agencies such as IDEAM (National Agency of Weather Forecasting and Hydrology Studies).
Esteban Hernandez (PhD Student)
Carlos Montenegro Marin (Supervisor)
Institution: District University of Bogotá
Optimizing BRAMS for Multicore Processors
Over the last two years Jairo Panetta and Simone Shizue Tomita Lima isolated the dynamics from the remaining BRAMS code. This stand-alone code is the basis for future versions of BRAMS. A long process eliminated coding practices that may lead to race conditions. Global scratch areas were eliminated. All procedures have explicit interfaces at their points of call. The intent of each procedure argument is declared. Use association is restricted to procedure interfaces. This BRAMS subset code is named the isolated dynamics. Besides dynamics, it contains input, output and initialization.
The isolated dynamics is the base for OpenMP parallelism and memory hierarchy experimentation, as described in the next paragraph. But it also serves as the base for experimenting with higher order approximation of dynamics processes. The current mathematical formulation of dynamic processes uses a first order approximation of derivatives that requires a small integration time-step. The original BRAMS code hardwired the ghost zone length to one, preventing exploitation of higher order approximations. The isolated dynamics was recoded to allow a user-defined ghost zone length. Coding and coupling of higher order approximations are currently under way on this version of the isolated dynamics, conducted by Saulo R. Freitas and researchers from Germany. In this work, 5th order transport schemes combined with 3rd order time integration methods will greatly enhance model accuracy, which is essential to improve the forecast of rainfall for the next generation of CPTEC/INPE operational products at cloud scales (~ km) that will run on future supercomputer systems.
Recently an MSc thesis at INPE compared two forms of exploiting OpenMP parallelism on scalar advection, which is a small part of the isolated dynamics. The first form was the classical parallelization of each loop nest in the code. Nests run through the entire MPI subdomain. The second form was tiling the horizontal MPI subdomain. Tiling changed the original 3D array of atmospheric fields into a 2D array of pointers to 3D tiles. These tiles occupy consecutive memory positions to avoid cache conflicts. Since advection on distinct tiles is mutually independent, OpenMP parallelism is trivially implemented by parallelizing the outermost loop that runs through tiles, dispatching advection at each tile. While the second form of parallelism is potentially more efficient, it requires work replication, since advection requires computing fluxes at cell boundaries. Fluxes at cells on tile boundaries are computed twice, once for each tile.
As expected, the speed-up of the second form was close to perfect and higher than the speed-up of the first form. But the execution times of the second form were higher than those of the first form, mainly at small core counts, due to work replication. As the core count increases, the execution times of the second form become lower than those of the first form.
It is not at all clear whether these results propagate to the entire dynamics. Work replication leads to communication-avoiding algorithms and higher speed-ups, but not necessarily to lower execution times. Storing each tile at consecutive memory positions makes effective use of the cache hierarchy and the NUMA architecture, but the optimal tile size varies with the dynamic process, due to the distinct number of atmospheric fields used by each process.
Experimentation is clearly required.
We propose two tracks of activities. The first track is dedicated to the isolated dynamics; the second track is dedicated to the physics. This structure accommodates the differences in the current stage of the two code packages. It allows experimentation on data structures, vectorization and parallelism in the section of the code that is ready for OpenMP parallelism (the dynamics), while the elimination of potential race conditions is performed on the part of the code that is not yet ready for OpenMP parallelism (the physics). The two tracks synchronize when the physics is ready for OpenMP parallelism.
There are four activities on the isolated dynamics:
- Initial performance evaluation of the dynamics (M1-M3)
Performance evaluation measures how efficiently the base code uses the memory hierarchy and measures the vectorization ratio. The first performance evaluation will consider a single core. A second performance evaluation measures memory hierarchy usage and parallelism interference among cores using multiple MPI processes. The results of these evaluations guide the optimization effort of the next activities. Performance analysis will be based on Intel Parallel Studio tools, mostly VTune and Advisor.
- Experimenting coding strategies for the dynamics (M4-M6)
This activity experiments with data structure layout strategies to enhance memory hierarchy usage and with forms of parallelism exploitation through OpenMP coding. The activity essentially replicates the advection experiment on selected parts of the remaining dynamic processes. Intel Parallel Studio tools, mostly VTune, will be used to measure memory hierarchy usage.
- Implementing selected coding strategy on the full dynamics (M7-M12)
This step applies the selected data structure and form of OpenMP parallelism to the entire dynamics. The result of this activity is an OpenMP-coded dynamics with a potentially improved vectorization ratio and potentially enhanced memory hierarchy usage. Execution time as a function of the number of OpenMP threads is the central measure of optimization. Intel VTune may be used to explain performance details.
- Final dynamics performance evaluation (M9-M12)
The performance of the final dynamics code and its OpenMP scalability on a full node are measured and compared with the corresponding base code performance. Execution time as a function of the number of OpenMP threads is the final performance measure.
There are also four activities on physics modules:
- Select which physics modules to include (M1)
The current version of BRAMS has too many physics modules, some of them outdated and no longer used. This activity selects which physics modules should be included in the desired code.
- Make selected modules thread-safe and enhance their vectorization (M2-M12)
Visit each selected physics module and eliminate coding practices that prevent the introduction of OpenMP parallelism. We intend to use Intel Inspector in this activity.
Concerning vectorization, some physics modules receive a single atmospheric column at each invocation, while others receive a set of independent atmospheric columns at each invocation. In the former case, vectorization is limited by dependencies among computations on consecutive atmospheric levels within a single column. In the latter case, the set of independent columns is the desired vectorization direction. Intel Advisor should help to find the vectorization ratio and potential candidates for improvement.
- Insert OpenMP on physics modules and couple with dynamics (M7-M12)
Build a driver for each physics module, introduce OpenMP in the driver, and couple the driver with the dynamics. The driver's structure varies with the number of atmospheric columns that each module operates on simultaneously. In any case, the driver loops over atmospheric columns, invoking the physics module at each loop iteration. If the module accepts a set of atmospheric columns at a time, the driver's loop partitions the domain of atmospheric columns into sets. If the module accepts a single column at a time, the driver's loop runs over all columns of its domain. Introducing OpenMP is trivial in both cases.
- Performance evaluation of the coupled model (M9-M12)
Measure the performance of the coupled model on a single core and its OpenMP scalability on a full node. Execution time as a function of the number of OpenMP threads is the final measure of performance. Intel VTune should help to explain the details.
Simone Shizue Tomita Lima
Daniel Massaru Katsurayama
Analysis of Energy Consumption and Performance Efficiency on the Intel MIC architecture
Heterogeneous architectures composed of CPUs and accelerators (GPUs or Xeon Phi) are present in most supercomputers listed in the Top500 ranking (which lists the 500 most powerful computing machines) and, consequently, represent a trend in the future of high performance computing. The Intel Xeon Phi coprocessor is a manycore architecture based on energy-efficient cores. These architectures provide local memory and a high-speed bus connection, allowing native application execution. In this new scenario, it is important to consider programming techniques and models that enable the efficient use of computing resources to improve the performance and power consumption of parallel applications.
This work aims to analyze the relationship between performance and power consumption in manycore architectures, considering specifically the Intel MIC (Many Integrated Core) Xeon Phi coprocessor platforms. The main approach is to define a set of performance and energy consumption counters that will be collected using micsmc while running benchmarks.
There are several benchmarks and algorithms for high-performance homogeneous architectures; for heterogeneous or hybrid computing, benchmarks are more restricted. In this work we will initially consider: 1) HPLinpack (High-Performance Linpack), to solve systems of linear equations with double precision (64-bit); 2) SGEMM and DGEMM; 3) the NAS-OpenMP Parallel Benchmark, a version of NAS specific to manycore architectures, focused on the MIC architecture and the Xeon Phi.
The tests and assessments carried out in this work should consider not only performance but mainly energy efficiency. To achieve this, it is important to identify and understand which factors may influence energy consumption and their impact on performance. The results will be extracted from micsmc counters during benchmark execution in different scenarios. For example, for each benchmark we will consider: CPU and one Xeon Phi; CPU and two Xeon Phi; CPU only; and Xeon Phi only.
Robson Gonçalves (Master Student)
Márcia Cera (Supervisor)
Development of a parallel cellular automaton using OpenCL
Maelso Bruno Pacheco Nunes Pereira (Master Student)
Alisson Brito (Supervisor)
Dataflow Resiliency and Scalability
The Trebuchet Dataflow Runtime enables programmers to write parallel code for multi and manycore architectures by describing their algorithms in terms of dataflow graphs. In this project we aim at providing experimental proof of the performance and resiliency achieved with Trebuchet. In the performance-oriented experiments, we show that applications parallelized with Trebuchet achieve excellent performance (equal or better than traditional approaches such as OpenMP) and scale in accordance with our theoretical model for scalability in dataflow graphs. The resiliency portion of the project shows that Trebuchet is also able to recover from transient faults using the Dataflow Error Recovery (DFER) model.
Institution: UERJ and UFRJ