No.031 Many-cores and On-chip Interconnects

NII Shonan Meeting Seminar 031

Building Block Networks with Wireless Inductive Coupling Through-Chip Interface (Amano)

The Inductive Coupling Through-Chip Interface (TCI) connects stacked chips through coils built solely from existing IC interconnect. A data transfer rate of over 1 Gb/s can be achieved with less than 10 mW of power dissipation, and parallel data bits can be multiplexed onto a single coil and burst-transferred. With TCI, a high-speed network can be formed simply by stacking multiple chips in various configurations. A heterogeneous multi-core system called Cube-1, consisting of an embedded CPU and multiple accelerators, is now available. By using TCI, a building block network, whose properties lie between those of Networks-on-Chip and wireless ad-hoc networks, can be formed by combining chips in various structures.
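
Below is a minimal sketch (in Python, with an assumed 32-bit word width) of the idea of multiplexing a parallel word onto a single coil as a serial burst and reassembling it on the stacked chip; it is illustrative only, not the actual TCI circuitry.

    # Toy model of time-multiplexing a parallel word onto one coil channel.
    WORD_WIDTH = 32  # parallel bits per transfer (assumption)

    def to_burst(word: int) -> list:
        """Serialize a parallel word into a bit burst sent over a single coil."""
        return [(word >> i) & 1 for i in range(WORD_WIDTH)]

    def from_burst(burst: list) -> int:
        """Reassemble the parallel word on the receiving chip."""
        word = 0
        for i, bit in enumerate(burst):
            word |= bit << i
        return word

    word = 0xDEADBEEF
    assert from_burst(to_burst(word)) == word  # round trip: 32 parallel bits, one coil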

Does light speed affect topologies? (Fujiwara)

A massively parallel application running on a future supercomputer is expected to require very low end-to-end latencies. Most off-chip interconnection topologies do not take cable delay (signals propagate at near light speed) into account, because switch delay (a few hundred nanoseconds) dominates the end-to-end latency. So what if an ultra-low-delay switch (a few tens of nanoseconds) becomes available in the near future? Would traditional off-chip topologies still work well in that situation? We would like to discuss topology design for future supercomputing systems, as well as for future on-chip networks.
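
As a back-of-the-envelope illustration of the question (a Python sketch with assumed numbers, not figures from the talk), compare the two per-hop latency components:

    # Rough per-hop latency model: switch delay + cable propagation delay (assumed numbers).
    C = 3.0e8                # speed of light in vacuum, m/s
    PROP_SPEED = 0.7 * C     # typical signal speed in a cable (assumption)

    def hop_latency_ns(switch_delay_ns, cable_m):
        cable_ns = cable_m / PROP_SPEED * 1e9
        return switch_delay_ns + cable_ns

    for switch_ns in (300.0, 30.0):  # current switch vs. a hypothetical ultra-low-delay switch
        print(f"switch {switch_ns:5.0f} ns + 10 m cable -> {hop_latency_ns(switch_ns, 10.0):6.1f} ns per hop")
    # With a 300 ns switch, the ~48 ns cable delay is noise; with a 30 ns switch it dominates,
    # so topologies that shorten cables (fewer, shorter hops) start to pay off.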

Integrating Hardware Network Stack and Database Processing Engines for Big Data (Matsutani)

We are facing two competing trends in ICT: Big data and green datacenters. Since data reuse and repurposing are now expected to drive innovation, IT equipment will be rapidly expanded for Big data, while energy saving is essential for datacenters from the standpoint of preventing global warming. To bridge the gap between these trends, we are studying FPGA-based database processing engines that support various structured storages, or polyglot persistence. Since the main bottleneck of conventional software-based memcached is the TCP/IP stack, we are now considering integrating a hardware-based network stack with these database processing engines on FPGA-based platforms.
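
For illustration, the software path that such an engine would offload looks roughly like the following Python sketch of a memcached text-protocol GET; the host and port are assumptions, and every request pays socket, system-call, and kernel TCP/IP costs that a hardware network stack avoids.

    # Software-path memcached GET over TCP (the path a hardware network stack would offload).
    # Assumes a memcached server at 127.0.0.1:11211; illustrative only, not the proposed design.
    import socket

    def memcached_get(key, host="127.0.0.1", port=11211):
        with socket.create_connection((host, port)) as sock:  # kernel TCP/IP stack: the bottleneck
            sock.sendall(f"get {key}\r\n".encode())
            response = sock.recv(4096)
        if response.startswith(b"VALUE"):
            _header, _, rest = response.partition(b"\r\n")
            value, _, _ = rest.partition(b"\r\n")
            return value
        return None  # response was "END\r\n" only: cache miss

    print(memcached_get("example_key"))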

Challenges for Dependable Many-Core Processors (Kise)

I will discuss dependability issues, in particular soft errors and timing errors, for many-core processors.
One of our proposals is a NoC-based DMR (dual modular redundancy) mechanism named SmartCore that detects transient errors on a many-core processor.
It is unique in that the packet-level comparison for error detection is performed by a newly designed NoC router.
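
A rough sketch of the packet-level comparison idea (illustrative Python, not the actual SmartCore router logic): the router holds packets from a core and its replica and forwards them only when they match.

    # Sketch of packet-level DMR comparison at a router (illustrative, not the SmartCore design).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Packet:
        dest: int       # destination node
        payload: bytes  # data produced by a core

    def dmr_match(primary, shadow):
        """True if the replicated cores produced identical packets; a mismatch flags a transient error."""
        return primary == shadow

    p = Packet(dest=3, payload=b"\x2a")
    s = Packet(dest=3, payload=b"\x2b")  # a bit flip in the shadow core's output
    if not dmr_match(p, s):
        print("transient error detected at the router: discard and request re-execution")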

Layout evaluation and a hierarchical layout method for MPSoCs (Nakamura)

This talk presents how to lay out many-core processor SoCs. Currently, layout design time is quite significant for large-scale LSIs because of the complexity of verifying timing and signal-integrity constraints. The size and complexity of many-core SoCs limit how far hierarchical layout design can be applied; nevertheless, various methods for hierarchical layout design exist. In general, a strictly hierarchical design method provides ease of reconfigurability, but it results in worse area and timing than a flat layout method, which, on the other hand, offers no reconfigurability. To address these problems and questions about hierarchical layout design, trials and evaluations are carried out on the Network-on-Chip (NoC), a typical implementation style for many-core processor SoCs. A NoC, which connects IP cores through network interfaces, can easily be reconfigured during place and route and exhibits strong regularity. In this talk, several NoC layout evaluation results are presented and discussed; for example, it is confirmed that a strictly hierarchical design method can actually produce poor layout results for a NoC. Furthermore, a reconfigurable layout method for NoCs based on partial re-layout is introduced. I also welcome discussions on layout issues for many-core processor SoCs.
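
As a rough illustration of why NoC regularity suits hierarchical layout (a Python sketch with assumed mesh and block sizes, not the evaluation flow of the talk):

    # Partition a regular 4x4 mesh NoC into 2x2 hierarchical blocks (sizes are assumptions).
    MESH = 4   # routers per side of the mesh
    BLOCK = 2  # routers per side of one hierarchical block

    def block_of(x, y):
        """Map a router coordinate to the block that is placed and routed as a unit."""
        return (x // BLOCK, y // BLOCK)

    blocks = {}
    for x in range(MESH):
        for y in range(MESH):
            blocks.setdefault(block_of(x, y), []).append((x, y))

    for b, routers in sorted(blocks.items()):
        print(b, routers)
    # Because every block is structurally identical, one block layout can be reused or
    # partially re-laid-out instead of repeating a flat place and route of the whole chip.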

Mathematical modeling of many-cores (Ginosar)

Many-cores come in many flavors: mesh-NoC tiled arrays (e.g.
Tilera), hierarchical multi-cores (e.g. Rigel), hierarchical multi-threading
(e.g. Nvidia), SIMD, and associative processors. Comparing them on
performance, power, area, and ease of programming is a fuzzy art at best,
typically requiring the construction of complete applications, optimizing
them separately for each architecture, and executing or simulating them. The
results are not always convincing, and we often end up just where we started.
We attempt to extend mathematical analysis of architecture to this field. A
model accounts for the performance and power of cores as a function of area and
other parameters. On-chip memories are also modeled. Basic axioms such as
Amdahl’s law and Pollack’s rule are employed to formulate the model.
However, adapting the model to a variety of many-core architectures remains
a challenge.
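
The following is a minimal sketch of this style of model, combining Amdahl’s law with Pollack’s rule (per-core performance roughly proportional to the square root of core area); the area budget and parallel fraction are assumed values, not numbers from the talk.

    # Area-constrained many-core model from Amdahl's law and Pollack's rule (assumed parameters).
    import math

    TOTAL_AREA = 256.0        # chip area budget in base-core units (assumption)
    PARALLEL_FRACTION = 0.95  # Amdahl parallel fraction f (assumption)

    def speedup(core_area, f=PARALLEL_FRACTION, area=TOTAL_AREA):
        n = area / core_area          # cores that fit in the budget
        perf = math.sqrt(core_area)   # Pollack's rule: per-core performance ~ sqrt(area)
        serial = (1.0 - f) / perf     # serial part runs on a single core
        parallel = f / (n * perf)     # parallel part spreads over all n cores
        return 1.0 / (serial + parallel)

    for core_area in (1.0, 4.0, 16.0, 64.0):
        print(f"core area {core_area:5.1f} -> {TOTAL_AREA / core_area:4.0f} cores, speedup {speedup(core_area):6.2f}")
    # Small cores favor the parallel term, large cores the serial term; the model exposes
    # this trade-off analytically instead of requiring full application ports and simulation.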

GPU Acceleration and Performance Optimization (Liang)

Graphics processing units (GPUs) are increasingly important for general-purpose parallel processing performance. GPU hardware is composed of many streaming multiprocessors, each of which employs the single-instruction multiple-data (SIMD) execution style. This massively parallel architecture allows GPUs to execute tens of thousands of threads in parallel. Thus, GPU architectures efficiently execute heavily data-parallel applications.
However, the performance of GPU applications depends critically on compiler optimization; if it is not done right, performance suffers severely. In this talk, I will first present a case study of accelerating 3D sound localization using GPUs. Then, I will present the modeling and optimization techniques we have developed, including control-flow divergence modeling, register and thread structure optimization, and cache bypassing optimization.
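
A minimal sketch (Python, assuming 32-thread warps) of the control-flow divergence effect that such modeling captures: when threads of a warp take different sides of a branch, the SIMD hardware executes both sides serially.

    # Toy model of SIMD branch-divergence cost (assumes 32-thread warps).
    WARP_SIZE = 32

    def passes_for_branch(taken_mask):
        """Serialized passes a warp needs for a two-way branch: 1 if uniform, 2 if divergent."""
        return int(any(taken_mask)) + int(not all(taken_mask))

    uniform = [True] * WARP_SIZE                         # every thread takes the branch: 1 pass
    divergent = [i % 2 == 0 for i in range(WARP_SIZE)]   # half the threads each way: 2 passes
    print(passes_for_branch(uniform), passes_for_branch(divergent))
    # Divergence modeling estimates these extra passes from the branch structure so that
    # threads or data can be reorganized to keep warps uniform.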

Highly-scalable and light-weight design of the Tofu interconnect (Ajima)

The Tofu interconnect is an interconnection network designed for the K computer and its commercial version, the Fujitsu PRIMEHPC FX10. Tofu interconnects tens of thousands of nodes. The network topology of the Tofu interconnect is a highly scalable six-dimensional mesh/torus. Some dimensions are configured as rings and contribute to the availability and serviceability of the system. The packet delivery system and the endpoint system are designed holistically to keep the communication protocol of the Tofu interconnect lightweight; the protocol relies on guaranteed, in-order packet delivery.
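
For illustration, hop counting on a six-dimensional mesh/torus can be sketched as below (the dimension sizes and which dimensions wrap around are assumptions, not the actual Tofu configuration):

    # Hop count between two nodes of a 6-D mesh/torus (dimension sizes are assumptions).
    DIMS = (8, 8, 8, 2, 3, 2)                          # illustrative sizes of the six dimensions
    WRAPS = (True, True, True, False, True, False)     # which dimensions are rings (illustrative)

    def hops(src, dst):
        total = 0
        for s, d, size, ring in zip(src, dst, DIMS, WRAPS):
            dist = abs(s - d)
            if ring:                       # a ring can route the short way around
                dist = min(dist, size - dist)
            total += dist
        return total

    print(hops((0, 0, 0, 0, 0, 0), (7, 4, 1, 1, 2, 1)))  # 9 hops with wrap-around
    # Ring dimensions shorten paths and, as noted above, let traffic route around nodes
    # that are down for service, improving availability.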

Test-Delivery Optimization in Manycore SOCs (Chakrabarty)

A network-on-chip (NOC) enables the integration of hundreds or even thousands of cores in a manycore system-on-chip (SOC). Efficient testing and design-for-testability techniques must be developed for such “monster” chips. I will describe test-data delivery optimization algorithms for manycore SOCs with hundreds of cores, where an NOC is used as the interconnection fabric. I will first present an optimization algorithm based on a subset-sum formulation to solve the test-delivery problem in NOCs with arbitrary topology that use dedicated routing. Next, I will propose an algorithm for the important class of NOCs with grid topology and XY routing. The proposed algorithm is the first to co-optimize the number of access points, access-point locations, pin distribution to access points, and the assignment of cores to access points for optimal test-resource utilization in such NOCs. Test-time minimization is modeled as an NOC partitioning problem and solved with dynamic programming in polynomial time. Both proposed methods yield high-quality results and scale to large SOCs with many cores. Test scheduling under power constraints is also incorporated in the optimization framework.
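
To give the flavor of the subset-sum formulation (a generic Python sketch with made-up channel widths, not the algorithm from the talk): deciding which cores' test channels to pack onto a limited access-point pin budget reduces to asking whether a subset of channel widths exactly fills that budget.

    # Generic subset-sum check (illustrates the formulation style, not the proposed algorithm).
    def subset_fills_budget(widths, budget):
        """Can some subset of per-core test-channel widths exactly use the access-point pin budget?"""
        reachable = {0}
        for w in widths:
            reachable |= {r + w for r in reachable if r + w <= budget}
        return budget in reachable

    core_channel_widths = [3, 5, 7, 8, 11]  # made-up test-channel widths per core
    print(subset_fills_budget(core_channel_widths, budget=16))  # True, e.g. 5 + 11
    # Packing cores tightly onto the available pins minimizes idle test bandwidth; the talk's
    # optimization extends this to access-point placement, pin distribution, and scheduling.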

Evolutionary and Revolutionary Technologies for Low Power On-Chip Communication (Bertozzi)

The advent of networks-on-chip is far from stabilizing the domain of on-chip communication architectures for multi- and many-core systems. In the high-performance computing domain, NoCs are a non-negligible source of power dissipation. In the embedded computing domain, the NoC design point stems from a trade-off between maximum resource utilization and communication performance, ultimately becoming a system-level energy optimization problem.
Low-power on-chip communication can be achieved via evolutionary design techniques (e.g., by removing the clock and implementing clockless switching) or by means of disruptive technologies such as on-chip optical links. This talk will present the latest research findings on these technologies, from the viewpoint of cross-benchmarking them against (aggressive) reference NoC implementations. The ultimate subjects of debate will be where, when, and how such technologies will become viable for actual industrial design, and how to accelerate this process.