No.134 Advances in Heterogeneous Computing from Hardware to Software


NII Shonan Meeting Seminar 134

Abstracts

Keynote: A Domain-Specific Architecture for Autonomous Driving Technology

Dr. Shinpei Kato

Autonomous driving is becoming a key technology for the automotive industry. From highways to city roads as well as geofenced areas, highly convenient, cost-effective mobility as a service will be provided by autonomous driving technology. The computing challenges to this end stem from the simultaneous requirements of high performance and low power.

Autonomous driving modules, such as object detection, localization, mapping, planning, and prediction, often require high-performance computing capabilities, whereas the computing platform employed in autonomous vehicles needs to be low-power.

In this talk, a domain-specific architecture for autonomous driving is introduced, which addresses the trade-off between performance and power through a heterogeneous architecture approach.

This architecture implements specific autonomous driving functions in hardware logic, while the main threads are still executed on traditional multi- and many-core CPUs. A prototype of this architecture is being developed within the open-source “Autoware” project.


Keynote: From TSUBAME3 to Post-K: Massive Scaling not only in HPC but also in Big Data and AI/ML

Dr. Satoshi Matsuoka

With the rapid rise of Big Data and AI as a new breed of high-performance workloads on supercomputers, we need to accommodate them at scale, and thus we need R&D on hardware and software infrastructures where traditional simulation-based HPC and BD/AI converge in a BYTES-oriented fashion. The TSUBAME3 supercomputer at the Tokyo Institute of Technology, which came online in August 2017, embodies various BYTES-oriented features to allow such convergence to happen at scale, including significant scalable horizontal bandwidth, support for a deep memory hierarchy and large capacity, and high flops in low-precision arithmetic for deep learning.

TSUBAME3’s technologies have been commoditized to construct one of the world’s largest BD/AI-focused open and public computing infrastructures, ABCI (AI Bridging Cloud Infrastructure), hosted by AIST-AIRC (AI Research Center), the largest publicly funded AI research center in Japan. Although not designed as an HPC supercomputer, ABCI ranks No. 1 in Japan and No. 5 in the world on Linpack, embodies 550 AI-Petaflops for AI, and is extremely energy efficient thanks to a novel warm-water-cooled pod design.

Finally, Post-K is the flagship next-generation national supercomputer being developed by Riken and Fujitsu in collaboration. Post-K will have hyperscale-class resources in one exascale machine, with well more than 100,000 nodes and a number of server-class Arm CPU cores approaching 10 million. Post-K is slated to perform 100 times faster on some key applications compared to its predecessor, the K-Computer, but will also likely be the premier big data and AI/ML infrastructure. Currently, we are conducting research to scale deep learning to more than 10,000 nodes on Post-K, where we would obtain near top-GPU-class performance on each node.


Toward Near Data Processing Service Computing

Dr. Marco Aldinucci

In the realm of HPC, message passing has remained the programming paradigm of choice for over twenty years, and, by extension, for the fairly new area of high-performance data processing. In message passing, each communication is orchestrated by the developer, based on precise knowledge of code, overhead, and data partitions. PGAS (Partitioned Global Address Space) programming aims to tackle this complexity by (at least) abstracting data decomposition and the mapping of processes onto the hardware, and by promoting distributed in-memory processing for large data sets. There exists a variety of choices for PGAS languages and implementations, ranging from brand-new languages to extensions of existing approaches such as MPI and OpenMP. None of them is (yet) in the mainstream of parallel programming or able to support interoperability with legacy code. In this talk we review these approaches and discuss the possibility of generalising PGAS into the mainstream C++ STL, and then into service computing by way of a Near Data Processing (NDP) approach.
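
As a language-agnostic sketch of the PGAS idea (written here in Python, with every name being a hypothetical placeholder rather than any real PGAS library), a single global index space is physically partitioned across places, and the owner of each index performs the accesses:

    # PGAS sketch: one global index space, partitioned across "places".
    # All names are illustrative; in a real PGAS runtime each partition
    # would live in a different node's memory, and remote accesses would
    # become communication generated by the runtime, not the programmer.
    class PartitionedArray:
        def __init__(self, size, num_places):
            self.chunk = (size + num_places - 1) // num_places
            self.partitions = [dict() for _ in range(num_places)]

        def place_of(self, i):
            return i // self.chunk             # owner-computes mapping

        def __setitem__(self, i, value):
            self.partitions[self.place_of(i)][i] = value

        def __getitem__(self, i):
            return self.partitions[self.place_of(i)][i]

    a = PartitionedArray(size=8, num_places=2)
    for i in range(8):
        a[i] = i * i
    print(a[5], "owned by place", a.place_of(5))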


Navigating the Real-time 3D Scene Understanding Landscape

Dr. Bruno Bodin

The visual understanding of 3D environments in real-time and at low power is a huge computational challenge. It is central to applications such as industrial robotics and autonomous vehicles. In this presentation we will discuss the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable the delivery of SLAM (Simultaneous Localisation and Mapping), by supporting application specialists in selecting and configuring the appropriate algorithm and the appropriate hardware to meet their performance, accuracy, and energy consumption goals.


e2eML: High Performance, Power Efficient Application of End to End DNN for Heterogeneous Architectures

Dr. Mauricio Breternitz

Machine Learning (ML), and specifically Deep Neural Networks (DNNs), enjoys successful and widespread application to a growing number of relevant problems. Algorithmic developments in DNN training and in computing (GPUs, FPGAs, data centers), plus the availability of large data sets, enable the application of machine learning solutions to key societal problems. These include speech recognition, anomaly detection in online transactions, security, and image classification and recognition, to name a few.

End-to-end application of DNNs, an approach in which the deep network handles most of the processing steps and avoids ancillary pre-processing and preparation steps, has demonstrated promise for efficient deployment. A growing body of evidence suggests that end-to-end approaches to machine learning applications result in highly efficient implementations, due to cross-boundary partitioning of the multiple processing steps. However, more research is needed to overcome challenges related to adapting to data characteristics and taming a large design space.

Finally, understanding the resulting network’s structure and operation is a key challenge in deploying optimized instantiations of a trained network on the most desirable computational resources. Recent developments in introspection and understanding of deep neural networks enable the identification of key processing sub-components. One key goal of this research is to identify such components for instantiation in the most cost-effective, efficient, and resilient implementation. Cost effectiveness comes from choosing the most appropriate computation environment, ranging from distributed cloud implementations to hardware accelerators such as GPUs, programmable hardware (FPGAs), and/or ASICs. Resiliency and reliability are also key dimensions: resilience ensures the resulting network’s immunity to spurious data artifacts, while reliability ensures trustworthy operation in the presence of hardware failures.

We explore a full-stack end-to-end approach whereby key aspects of the whole-application processing are compiled to and assigned to the most efficient software/hardware realization.

Slides


ePython: An Implementation of Python for Novel Heterogeneous Architectures

Dr. Nick Brown

Whether it be FPGAs or micro-cores such as the Epiphany, programming novel heterogeneous architectures is difficult, requiring a considerable investment in time and skills. Whilst many of these architectures exhibit significant advantages, such as energy efficiency and performance, the high barrier to entry driven by poor programmability can severely limit their uptake.

We believe that working towards making these technologies trivial to use is worthwhile and, as such, that some of the programming hurdles associated with current-generation heterogeneous architectures are worthy of challenge. To this end we developed ePython, a tiny (24KB) implementation of Python, initially for the Epiphany architecture but now also ported to the MicroBlaze. Pre-installed on every Epiphany shipped, ePython makes it trivial to quickly write parallel codes, often for prototyping and experimentation with the hardware. In addition to running codes directly, we have also developed support for decorating kernels in existing Python codes so that these can be seamlessly offloaded via ePython.
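
As a rough illustration of this programming model, the Python sketch below shows what decorator-based offload looks like; the decorator here is a local placeholder, not ePython’s actual API.

    # Illustrative sketch of decorator-based kernel offload; "offload"
    # is a placeholder, not necessarily ePython's real API. A real
    # implementation would ship the kernel to the Epiphany cores and
    # marshal its arguments and results.
    def offload(fn):
        def wrapper(*args, **kwargs):
            print(f"offloading {fn.__name__} to the device cores")
            return fn(*args, **kwargs)   # here: just run locally
        return wrapper

    @offload
    def dot(xs, ys):
        return sum(x * y for x, y in zip(xs, ys))

    print(dot([1, 2, 3], [4, 5, 6]))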

In addition to describing ePython, I will also discuss a simple machine learning code we developed for detecting lung cancer in 3D CT scans, where our decorators are used to offload the neural network and accelerate it, via ePython, on the Epiphany. We are currently working towards implementing ePython in VHDL as an IP block that can be dropped into other FPGA designs; we believe that this ability to easily and quickly interact with other aspects of an FPGA design, without requiring regeneration of the bitstream, will be useful.


NUMA Optimizations for Algorithmic Skeletons

Dr. Christian Fensch

To address NUMA performance anomalies, programmers often resort to application-specific optimizations that are not transferable to other programs, or to generic optimizations that do not perform well in all cases. Skeleton-based programming models allow NUMA optimizations to be abstracted on a pattern-by-pattern basis, freeing programmers from this complexity.

As a case study, we investigate computations that can be implemented with stencil skeletons. We present an analysis of the behavior of a range of simple and complex stencil programs from the NAS and Rodinia benchmark suites under state-of-the-art NUMA-aware page placement (PP) schemes. We show that even though an application (or skeleton) may have implemented the correct, intuitive scheduling of data and work to threads, the resulting performance can be disrupted by an inappropriate PP scheme. In contrast, we show that a NUMA PP-aware stencil implementation scheme can achieve speedups of up to 2x over a similar scheme which uses the Linux default PP, and that this works across a set of complex stencil applications. Furthermore, we show that a supposed PP performance optimization in the Linux kernel never improves, and in some cases degrades, stencil performance by up to 0.27x, and should therefore be deactivated by stencil skeleton implementations. Finally, we show that further speedups of up to 1.1x can be achieved by addressing a work imbalance issue caused by poor conventional understanding of NUMA PP.
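
To make the pattern concrete, a stencil skeleton owns the sweep over the grid and the work-to-thread mapping, which is exactly where NUMA-aware placement decisions can be hidden from the application. The Python sketch below is a minimal, sequential stand-in: a static block distribution of rows models matching the work partition to a first-touch placement of pages.

    # Minimal stencil-skeleton sketch: the application supplies only the
    # per-point kernel; the skeleton owns iteration order and the
    # work-to-"worker" mapping, where NUMA placement would be decided.
    def stencil_skeleton(grid, kernel, num_workers=4):
        n, m = len(grid), len(grid[0])
        out = [row[:] for row in grid]
        rows = n - 2                              # interior rows only
        chunk = (rows + num_workers - 1) // num_workers
        for w in range(num_workers):              # block distribution:
            lo = 1 + w * chunk                    # worker w owns a band of
            hi = min(1 + (w + 1) * chunk, n - 1)  # rows, matching placement
            for i in range(lo, hi):
                for j in range(1, m - 1):
                    out[i][j] = kernel(grid, i, j)
        return out

    def jacobi(g, i, j):                          # 5-point stencil kernel
        return 0.25 * (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1])

    grid = [[float(i * j) for j in range(6)] for i in range(6)]
    grid = stencil_skeleton(grid, jacobi)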


AnyDSL: A Partial Evaluation Framework for Programming High-Performance Libraries

Dr. Sebastian Hack

Writing performance-critical software productively is still a challenging task because performance usually conflicts with genericity. Genericity makes programmers productive as it allows them to separate their software into components that can be exchanged and reused independently from each other. To achieve performance, however, it is mandatory to instantiate the code with algorithmic variants and parameters that stem from the application domain, and to tailor the code towards the target architecture. This requires pervasive changes to the code that destroy genericity.

In this talk, I advocate programming high-performance code using partial evaluation and present AnyDSL, a clean-slate programming system with a simple, annotation-based, online partial evaluator. I will show that AnyDSL can be used to productively implement high-performance codes from various different domains in a generic way and to map them to different target architectures (CPUs with SIMD units, GPUs, FPGAs). Thereby, the code generated using AnyDSL achieves performance that is in the range of multi-man-year, industry-grade, manually-optimized expert codes and highly-optimized code generated from domain-specific languages.
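
AnyDSL has its own front-end language, so the Python sketch below is only a rough analogy of the idea: a generic function is specialized with respect to the parameters that are known early, and the residual function is what remains to run on the actual data.

    # Rough analogy of partial evaluation: bind the statically known
    # filter weights early; the residual function closes over them.
    # (In AnyDSL, the online partial evaluator would additionally
    # unroll the loop over taps in the generated code.)
    def specialize_conv(weights):
        taps = list(enumerate(weights))        # known at "compile time"
        def residual(signal, i):               # runs on the actual data
            return sum(w * signal[i + k] for k, w in taps)
        return residual

    blur3 = specialize_conv([0.25, 0.5, 0.25])     # specialization step
    signal = [1.0, 2.0, 4.0, 8.0, 16.0]
    print([blur3(signal, i) for i in range(len(signal) - 2)])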

Slides


OSCAR Compiler and OSCAR API for Heterogeneous Computing

Dr. Keiji Kimura

The OSCAR compiler is an automatically parallelizing source-to-source compiler developed at Waseda University. It takes C and Fortran programs and generates parallelized and low-power-optimized code for heterogeneous platforms as well as homogeneous multicores. To support multiple platforms, the OSCAR compiler inserts compiler directives defined in the OSCAR API into the parallelized code. In this talk, an overview of the OSCAR compiler and the OSCAR API is presented. Our recent accelerator project is also introduced.

Slides


Reinforcement Learning-Based Adaptive Power Management for Energy Harvesting IoT Devices

Dr. Masaaki Kondo

Energy harvesting IoT (Internet of Things) devices are expected to operate perpetually and reliably without any regular maintenance by users. Energy autonomy is a necessary condition for such operation and must be addressed by the power manager integrated in IoT nodes. The power manager inherently needs adaptivity to adjust the behavior of the IoT nodes according to expected energy harvesting opportunities. This is not an easy task, since there is a wide variety of IoT devices and working environments. In this talk, we present an adaptive power management strategy using reinforcement learning for energy harvesting IoT nodes, which train themselves from historical data. We show that our power manager is capable of adapting to changes in weather, climate, and battery degradation, while ensuring maximum performance without depleting or overcharging its battery.
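
A toy sketch of the underlying idea is given below in Python: tabular Q-learning over a coarse battery/harvest state, choosing a duty cycle and being rewarded for work done without draining or overcharging the battery. All state, action, and reward definitions here are illustrative assumptions, not the scheme presented in the talk.

    import random

    # Toy tabular Q-learning power manager (illustrative assumptions).
    ACTIONS = [0.1, 0.5, 1.0]           # candidate duty cycles
    Q = {}                              # (state, action) -> value
    alpha, gamma, eps = 0.1, 0.9, 0.1   # learning rate, discount, exploration

    def choose(state):
        if random.random() < eps:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

    def step(battery, harvest, duty):
        battery = min(100.0, max(0.0, battery + harvest - 10 * duty))
        reward = duty if 10 < battery < 90 else -1.0   # penalize extremes
        return battery, reward

    battery = 50.0
    for t in range(10_000):
        harvest = random.choice([0, 2, 8])   # fluctuating energy income
        state = (int(battery) // 10, harvest)
        a = choose(state)
        battery, r = step(battery, harvest, a)
        nxt = (int(battery) // 10, harvest)
        best = max(Q.get((nxt, b), 0.0) for b in ACTIONS)
        old = Q.get((state, a), 0.0)
        Q[(state, a)] = old + alpha * (r + gamma * best - old)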


A Network Simulator for On/Off Links of Large-Scale Interconnection Networks

Dr. Takatsugu Ono

Reducing the power consumption of the interconnection network in HPC systems is an important issue. One approach is to transition links to a low-power mode during periods in which no packets are being processed (so-called On/Off links). In this talk, we present Trace RP, an interconnection network simulator supporting On/Off links.
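
The mechanism such a simulator has to model can be sketched as a per-link state machine: the link powers down after an idle timeout and pays a wake-up latency on the next packet, trading delay for power. The Python sketch below is purely illustrative and is not Trace RP itself.

    # Illustrative on/off link model (not Trace RP): the link sleeps
    # after `idle_timeout` without traffic and pays `wakeup` latency
    # on the next packet.
    def simulate_link(arrivals, service=1.0, idle_timeout=5.0, wakeup=3.0):
        t_free, last_busy, on, sleep_time = 0.0, 0.0, True, 0.0
        for t in arrivals:                      # sorted arrival times
            if on and t - last_busy > idle_timeout:
                on = False                      # link went to sleep
                sleep_time += (t - last_busy) - idle_timeout
            start = max(t, t_free)
            if not on:
                start += wakeup                 # wake-up penalty
                on = True
            t_free = start + service
            last_busy = t_free
        return t_free, sleep_time               # makespan, time spent asleep

    print(simulate_link([0.0, 1.0, 20.0, 21.0]))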


There and Back Again – Will the Human Ever Get Out of the Optimization Loop?

Dr. Antoniu Pop

This is a story about cyclical optimistic and pessimistic answers to this question, as we dig deeper and also leverage more powerful techniques for automation. We focus on the last three steps in our attempt to optimize the execution of task data-flow programs on NUMA, heterogeneous, distributed systems. We first succeed in dynamic univariate analysis and optimization on NUMA systems, which we approximately replicate on heterogeneous and distributed systems; we then get to grips with the limitations of this approach, “disillusionedly” seeking to bring the human back into the loop under the guise of “helping programmers’ productivity”; and we conclude with a path forward: machine learning is certainly not the panacea – yet – but it does help with, or even solve, some problems.


Customized Polyhedral Compilation for Low-Power High-Level SoC Synthesis

Dr. Louis-Noel Pouchet

Polyhedral compilation is a framework to represent and transform regular loop nests and array-based computations. In this talk we present the general design principles of a SystemC generation and design space exploration flow for fixed functions in systems-on-chip, alleviating several productivity and design-time issues by focusing only on polyhedral program regions. We show that key properties of affine computations enable the design of a fast and accurate latency and power characterization flow, reducing power analysis time by several orders of magnitude. This work was conducted during a 3-year project funded by Intel ISRA, in collaboration with Prof. Deming Chen.

Slides


Polyhedral Based Intermediate Representation for Inter-Procedural Code Regions

Dr. Fabrice Rastello

Profiling feedback is an important technique used by developers for performance debugging, where it usually serves to pinpoint performance bottlenecks and to find optimization opportunities. Assessing the validity and potential benefit of a program transformation requires accurate knowledge of the data flow and data dependencies, which can be uncovered by profiling a particular execution of the program.

In this work we develop an end-to-end infrastructure for dynamic binary analysis, which produces feedback about the potential to apply structured transformations to uncover non-trivial parallelism and data locality via complex program rescheduling.

Our tool can handle both inter- and intra-procedural aspects of the program in a unified way, thus enabling structured inter-procedural transformations. It is based on QEMU and uses dynamic binary translation to instrument arbitrary programs at runtime. The design of this tool was driven by the goal of achieving portability, not only in terms of targeted CPU architectures but also in terms of the programming environment, including the use of third-party libraries for which no source code is available.


On Finding a Sweet Spot Between Productivity and Performance and Portability – Experiences from the SaC Compiler Project

Dr. Sven-Bodo Scholz

The SaC project aims to make high-performance computing readily available to a wider audience of application programmers. At the project’s core is an auto-parallelising compiler tool chain with support for heterogeneous parallel platforms. This talk discusses the challenges and potential solutions that we identified when aiming for high performance on a range of different parallel architectures, looking at several different application domains and their needs.


Hardware-Software Codesign for Efficient Computing

Dr. Magnus Själander

With the end of Dennard Scaling and the slowdown of Moore’s Law, computing systems have become increasingly power constrained, and we can no longer rely on technology scaling for improved performance. Going forward, improvements in energy efficiency are, therefore, a requirement for improving the performance of a system. These new hardware trends challenge established assumptions and force us to rethink how we construct programs and build the systems that they run upon. Energy efficiency can mainly be improved by 1) co-optimization of the hardware and software, e.g., by conveying more static information to the hardware and exposing dynamic behavior to the compiler, or 2) specialization, where the hardware is optimized for a specific task. In both cases, there is a need for extracting and representing program properties to enable such co-optimization and specialization.

In this talk, I will present examples of 1) co-optimization as a cross-layer redesign of the way compilers and the underlying micro-architecture interact, 2) specialization as a software programmable bit-serial matrix-multiply accelerator, and 3) program property representation as a new intermediate representation (IR) for optimizing compilers.
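
As an aside on the second example, the arithmetic behind a bit-serial matrix multiply can be sketched in a few lines; the Python below illustrates the add-or-skip principle only and is in no way the accelerator’s actual design.

    # Bit-serial matrix multiply (unsigned): process one bit-plane of B
    # per step; a set bit contributes A shifted by the bit's position,
    # so no full multiplier is needed, only adders.
    def bit_serial_matmul(A, B, bits=8):
        n, k, m = len(A), len(A[0]), len(B[0])
        C = [[0] * m for _ in range(n)]
        for b in range(bits):                      # one "cycle" per bit
            for i in range(n):
                for j in range(m):
                    acc = 0
                    for l in range(k):
                        if (B[l][j] >> b) & 1:     # bit b of the operand
                            acc += A[i][l]         # add-or-skip
                    C[i][j] += acc << b            # weight by significance
        return C

    assert bit_serial_matmul([[3, 1], [2, 4]], [[1, 2], [3, 4]]) == [[6, 10], [14, 20]]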

Ultimately, we will need a holistic approach where software and hardware are optimized together across the complete stack.

Slides


Is it Time for RISC and CISC to Die?

Dr. Aaron Smith

Specialization, accelerators, and machine learning are all the rage. But most of the world’s computing today still uses conventional RISC or CISC CPUs, which expend significant energy to achieve high single-thread performance. Von Neumann ISAs have been so successful because they provide a clean conceptual target to software while running the complete gamut of algorithms reasonably well. We badly need clean new abstractions that utilize fine-grain parallelism and run energy efficiently. Prior work (such as the UT-Austin TRIPS EDGE ISA and others) showed how to form blocks of computation containing limited-scope dataflow graphs, which can be thought of as small structures (DAGs) mapped to silicon. In this talk I will describe work that addresses the limitations of early EDGE ISAs, and how those extensions can provide energy-efficient execution for single threads compared to conventional out-of-order superscalars. I will describe two specific microarchitectures and early results based on placed-and-routed RTL for 10nm FinFET. This is work in collaboration with Qualcomm Research.


Generating Performance Portable Code with Lift

Dr. Michel Steuwer and Dr. Christophe Dubach

Traditionally, high performance is achieved via low-level code optimisations applied by an expert programmer. Such low-level optimisations often exploit peculiarities of a particular hardware device, which renders them non-portable and requires the manual re-optimisation of code for every new device.

The Lift project approaches this problem of performance portability from a different angle. Programs are expressed by composing high-level primitives which describe computations in an abstract, purely functional way. An automated exploration process optimises these programs by rewriting the high-level programs into low-level programs which encode implementation and optimisation choices explicitly. Lift has been shown to achieve performance portability by generating high-performance code across a range of parallel architectures, including multi-core CPUs as well as mobile and server-class GPUs.
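
To illustrate the style (in Python rather than Lift’s own functional language), a program is a composition of generic primitives, and a rewrite rule such as map/reduce fusion turns the high-level form into a lower-level one without changing its meaning:

    from functools import reduce

    # Primitives: programs are compositions of these.
    def p_map(f):
        return lambda xs: [f(x) for x in xs]

    def p_reduce(op, init):
        return lambda xs: reduce(op, xs, init)

    def compose(g, f):
        return lambda xs: g(f(xs))

    # High-level program: sum of squares, with an intermediate list.
    high_level = compose(p_reduce(lambda a, b: a + b, 0),
                         p_map(lambda x: x * x))

    # One rewrite step: fuse the map into the reduce, which removes
    # the intermediate list (an implementation choice made explicit).
    def fuse_map_reduce(f, op, init):
        return p_reduce(lambda acc, x: op(acc, f(x)), init)

    low_level = fuse_map_reduce(lambda x: x * x, lambda a, b: a + b, 0)
    assert high_level(range(5)) == low_level(range(5)) == 30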

In this talk we will present the design and implementation of Lift. We will present encouraging performance results and sketch our ongoing and future research, which embraces new and emerging architectures.

Slides


Undervolting Off-the-Shelf FPGAs for Energy-Efficiency

Dr. Osman Unsal

In general, I will be reporting on our new European research project LEGaTO, which aims to develop a toolset for energy efficiency in heterogeneous compute fabrics. In particular, I will be detailing the conclusions we obtained from our preliminary experiments with undervolting off-the-shelf FPGAs. The most significant of these conclusions are: there exists a significant voltage guardband for FPGA BRAM on-chip memories; the “fault map” resulting from undervolting is different for each FPGA, even for different chips of the same product, due to variability; the fault map at the same low voltage for a particular chip does not change; and an increase in temperature decreases the undervolting-induced fault rate.


Reducing Memory Requirements of Scientific Computations through Multi-stage Execution

Dr. Wim Vanderbauwhede

Accelerators often have limited on-board memory, so it can be desirable to trade performance for memory utilisation. I will present a set of program transformations aimed at transforming scientific code (e.g. for weather simulation) so that intermediate arrays are eliminated. The approach involves loop analysis in terms of maps and folds, analysis of array accesses, and a transformation of the code to allow multi-stage execution of subgraphs of the dataflow graph ending in folds. This approach was originally developed for use on FPGAs but is effective for memory reduction on GPUs and manycore systems as well.
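
The heart of the transformation can be sketched as classical map/fold fusion (Python sketch, assuming the loops have already been classified as maps and folds):

    from functools import reduce

    xs = list(range(1_000))

    # Before: the map materializes an intermediate array `tmp`.
    tmp = [3 * x + 1 for x in xs]                 # map
    before = reduce(lambda a, b: a + b, tmp, 0)   # fold

    # After: the map is fused into the fold's combining function, so
    # each element is consumed as it is produced and no intermediate
    # storage proportional to the array size is needed.
    after = reduce(lambda acc, x: acc + (3 * x + 1), xs, 0)

    assert before == after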


Parallelizing Compilation and Task Scheduling for Heterogeneous Platforms

Dr. Yasutaka Wada

To utilize heterogeneous multicore systems, we have to parallelize and optimize an application while assigning the tasks inside it to the processor cores and accelerator cores in the target system, considering their characteristics, compatibilities, and the dependencies among tasks. Task assignment and scheduling play essential roles in parallelizing compilation because they have a significant impact on both the performance and the energy efficiency of heterogeneous platforms. In this talk, we will introduce a parallelizing compilation scheme for heterogeneous platforms, including a compilation flow that supports various platforms and a task scheduling scheme that targets both performance and energy efficiency.
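
As a generic illustration of the problem (and not of the talk’s own scheduling scheme), a heterogeneity-aware list scheduler assigns each dependence-ready task to whichever core type finishes it earliest:

    # Earliest-finish-time list scheduling over heterogeneous cores.
    # Each task has a per-core-type cost and a list of predecessors;
    # the task set and costs are made up for illustration.
    tasks = {
        "load":   {"cost": {"cpu": 4, "acc": 6}, "deps": []},
        "filter": {"cost": {"cpu": 9, "acc": 2}, "deps": ["load"]},
        "stats":  {"cost": {"cpu": 3, "acc": 5}, "deps": ["load"]},
        "merge":  {"cost": {"cpu": 2, "acc": 4}, "deps": ["filter", "stats"]},
    }

    core_free = {"cpu": 0, "acc": 0}         # one core of each type
    finish = {}
    for name in ["load", "filter", "stats", "merge"]:   # topological order
        ready = max((finish[d] for d in tasks[name]["deps"]), default=0)
        best = min(core_free, key=lambda c:             # earliest finish
                   max(ready, core_free[c]) + tasks[name]["cost"][c])
        start = max(ready, core_free[best])
        finish[name] = start + tasks[name]["cost"][best]
        core_free[best] = finish[name]
        print(f"{name} -> {best}, finishes at t={finish[name]}")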


An OS-level Approach for Attaining Dependability of In-Memory Databases

Dr. Hiroshi Yamada

In-memory databases (DBs) such as RocksDB and VoltDB play an important role in large-scale web services as well as big data analytics. It is difficult for conventional methods to recover them efficiently from software bugs and hardware errors, due to their unique characteristic of managing numerous running states in a huge memory region. This talk introduces an operating system (OS)-level approach to attaining the dependability of in-memory DBs. It specifically shows approaches for recovering in-memory DBs from software bugs and hard memory errors.


Optimizing for Heterogeneous Locality, or Homogeneous Parallelism?

Dr. Ayal Zaks

There are two distinct and complementary transformations that can be applied to the iterations of a loop: aligning them, to achieve data-level parallelism (also known as vectorization), and pipelining them, either according to an iteration initiation interval, to achieve instruction-level parallelism at fine grain, or via double-buffering, to achieve memory-level parallelism at coarse grain. Vectorization, also related to loop coarsening, can handle uniform branches and certain dependencies. Pipelining can handle dependencies but not branches. Both are applied to countable loops, or to loops whose trip count is known a few iterations ahead of time.

Aligning iterations is supported by data-parallel heterogeneous programming models such as OpenCL’s ND-range, facilitating both SIMD execution and dynamic load balancing across massively parallel GPUs. Pipelining and double-buffering, however, stitch all iterations together and lead to the static allocation of all iterations on one device. As a result, there is a mismatch between data-parallel models and deeply pipelined devices, such as FPGAs, which we seek to resolve.
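
The contrast can be sketched in Python: vectorization groups independent iterations for lockstep execution, while pipelining with double buffering overlaps the transfer of chunk i+1 with the compute on chunk i, binding all iterations to one device.

    data = list(range(16))
    VEC = 4

    # 1) Vectorization / ND-range style: iterations are aligned into
    #    independent groups, each of which could go to a different
    #    device or SIMD lane, enabling dynamic load balancing.
    vectorized = [[x * x for x in data[i:i + VEC]]
                  for i in range(0, len(data), VEC)]

    # 2) Pipelining with double buffering: one buffer is "loaded" with
    #    chunk i+1 while the other is computed on, stitching all
    #    iterations onto a single deeply pipelined device.
    def pipelined(chunks):
        out, buf = [], None
        for nxt in chunks + [None]:          # one extra step to drain
            if buf is not None:
                out.extend(x * x for x in buf)   # compute on chunk i
            buf = nxt                        # overlap: load chunk i+1
        return out

    chunks = [data[i:i + VEC] for i in range(0, len(data), VEC)]
    assert pipelined(chunks) == [x * x for x in data]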