NO.249 Fast, furious, and yet correct tensor processing
May 18 - 22, 2026 (Check-in: May 17, 2026)
Organizers
- Albert Cohen, Google, France
- Yukiyoshi Kameyama, University of Tsukuba, Japan
- Sven-Bodo Scholz, Radboud University, The Netherlands
- Oleg Kiselyov, Graduate School of Information Sciences, Tohoku University, Japan
Overview
Description of the Meeting
Tensor processing is an advanced form of basic linear algebra (the domain of BLAS, the Basic Linear Algebra Subprograms), which is the foundation of numeric programming in general – such as computational fluid dynamics and various simulations, from tsunamis to epidemics. Tensor processing is also the foundation of machine learning, in particular Large Language (Transformer) Models – and of neural networks and probabilistic programming in general. The highest performance is crucial. What is advanced about tensor processing is the number of dimensions, the sizes, the variety of storage/access formats, and the performance requirements.
To satisfy the performance requirements – to accomplish operations on huge tensors (huge in the number of dimensions and especially in size) in reasonable time – tensor processing often has to rely on special hardware, from GPUs to Tensor Processing Units (TPUs), Neural Processing Units (NPUs), and dedicated computer architectures such as Groq.
Background
Although dedicated tensor processing hardware can deliver great performance, programming it is just as great a challenge. The first public and best-known framework for tensor processing is TensorFlow, whose programming model – building explicit computational graphs – is quite different from conventional programming. There have been many attempts since to bridge the gap, among which one should mention PyTorch, which allows writing programs in a subset of Python in a more or less idiomatic style and then 'compiling' them for GPU(s) or other backends.
The great variety and idiosyncrasy of tensor processors urgently called for some level of abstraction – hence the emergence and proliferation of intermediates: intermediate layers, DSLs, and intermediate representations (IRs). Examples include JAX (a framework for tensor transformation and compilation for GPUs and TPUs), Futhark (a functional array language for GPU programming) [2], Pallas (an elaboration of JAX mainly targeting TPUs), and MLIR (a multi-level intermediate representation that brings JAX-style transformations and ordinary compiler optimizations into the same framework) [4], which even has 'IR' in its name.
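To make the role of such intermediates concrete, the following is a minimal sketch in JAX (the function saxpy and its inputs are invented for illustration): a high-level tensor expression is traced into JAX's own IR (a jaxpr) and then lowered to MLIR (StableHLO) text, the layer at which backend-specific compilation for GPU or TPU takes over.

    import jax
    import jax.numpy as jnp

    # A high-level tensor expression; saxpy is a made-up example function.
    def saxpy(a, x, y):
        return a * x + y

    x = jnp.arange(4, dtype=jnp.float32)
    y = jnp.ones(4, dtype=jnp.float32)

    # JAX's own intermediate representation (a jaxpr) of the traced computation.
    print(jax.make_jaxpr(saxpy)(2.0, x, y))

    # The same computation lowered to MLIR (StableHLO) text; from here,
    # backend-specific compilation for GPU/TPU takes over.
    print(jax.jit(saxpy).lower(2.0, x, y).as_text())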
Main Problem: Little Trust, Few Guarantees
The intermediate DSLs and IRs undoubtedly help: that is why they are widely used. Yet they have not eliminated the need to keep writing and re-writing the lowest-level (e.g., CUDA) kernels by hand. One of the main reasons is the lack of trust that the tensor-processor code compiled/transformed from a higher-level description is really correct and, especially, is fast.
One sees an urgent need for correctness by construction. Correctness has many aspects:
- at the very least, the generated tensor processing code should be well-formed and well-typed: its preparation and (up)loading to the processor should assuredly be error-free;
- shape correctness: all array accesses must be provably within bounds; the dimensions of tensors to be added, reduced, etc. must match;
- handling of sparse tensors and ragged shapes;
- correctly dealing with padding and the related masking (appropriately masking padded data, etc.) – see the sketch after this list;
- reasoning about and ensuring alignment;
- assuring race-freedom;
- assuring constraints on kernel scheduling and memory transfers;
- user-friendliness: although often overlooked, user-friendliness is crucial. If programmers cannot understand what the compiler is warning about, they will give up;
- formal, provable, mechanized correctness: although one can hardly expect a compiler to automatically prove that the generated code meets its specification, at the very least it should help with such a proof by reporting the assertions and assumptions it used (which could then be used in an offline, mechanized or paper proof);
- some performance guarantees, such as assuring the lack of contention, of bank conflicts, etc.
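As one concrete illustration of the padding/masking point above, here is a minimal sketch in JAX (the function masked_mean and the data are invented for this example): a naive reduction over a padded axis silently includes the padding, while a masked reduction does not.

    import jax.numpy as jnp

    # A batch of variable-length rows padded to a common length with zeros.
    x = jnp.array([[1., 2., 3., 0.],     # true length 3
                   [4., 5., 0., 0.]])    # true length 2
    lengths = jnp.array([3., 2.])

    def masked_mean(x, lengths):
        # Build a mask that is True only at positions before each row's length.
        mask = jnp.arange(x.shape[-1])[None, :] < lengths[:, None]
        # Zero out padded positions, then divide by the true lengths.
        return jnp.sum(jnp.where(mask, x, 0.0), axis=-1) / lengths

    print(jnp.mean(x, axis=-1))     # [1.5  2.25] -- padding skews the result
    print(masked_mean(x, lengths))  # [2.   4.5 ] -- padded entries masked out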
Approaching key problems
The problem of correctness, broadly understood, is urgent. There are several approaches under development, such as:
- Rank-polymorphism [5] (see the sketch after this list)
- tensor comprehensions and their transformations [6]
- biproducts [3]
- patterns of array programming discovered in APL
- exterior algebra: a general approach to tensor and tensor manipulation [1]
- Shape calculi/size types [2]
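As a loose illustration of the flavor of rank polymorphism (a sketch in JAX, not the SaC formulation of [5]; the function normalize is invented for this example), a function defined once over the last axis applies unchanged to arrays of any rank:

    import jax
    import jax.numpy as jnp

    # Defined once over the last axis; works for arrays of any rank.
    def normalize(x):
        return x / jnp.linalg.norm(x, axis=-1, keepdims=True)

    v = jnp.array([3., 4.])          # rank 1: a single vector
    m = jnp.ones((5, 2))             # rank 2: five row vectors
    t = jnp.ones((7, 5, 2))          # rank 3: a batch of matrices

    print(normalize(v).shape, normalize(m).shape, normalize(t).shape)
    # (2,) (5, 2) (7, 5, 2)

    # jax.vmap lifts a function written for one rank to a higher rank
    # explicitly -- another way the same code is reused across ranks.
    print(jax.vmap(normalize)(m).shape)   # (5, 2)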
Although all these approaches have shown promise, they are still under development – in isolated communities. The goal of the seminar is to bring the developers together to talk about common problems and the insights they have found, to stimulate further development and perhaps convergence.
Structure of the workshop
We anticipate that the workshop participants will consist of representatives of several communities: advanced users of tensor processing frameworks; developers of many kinds of tensor processing frameworks, DSLs and IRs; algebraists; advanced APL users and developers – in short, practitioners, computer scientists and mathematicians.
We intend the meeting to consist primarily of discussion and cross-fertilization rather than standard conference-style presentations. We have identified a few themes (above), and will cluster presentations and discussions around them.
We do not expect to publish a book on the topic, but will collect key findings in the seminar report.
Related Workshops
- Programming Language Support for Emerging Memory Technologies (Shonan Seminar 181, May 2024)
- Staging and High-Performance Computing: Theory and Practice (Shonan Seminar 056, May 2014)
Also closely related is the Seminar on Tensor Computation organized by Jeremy Gibbons and Peter Braam and conducted at Oxford's Computer Science department in 2020-2022. Alas, it took place during the COVID-19 pandemic, mostly online, and the opportunities for discussion were limited.
References
[1] John M. Browne. Grassmann Algebra: Exploring Applications of Extended Vector Algebra with Mathematica. https://web.archive.org/web/20090219180241/http://grassmannalgebra.info/grassmannalgebra/book/index.htm, 2007.
[2] Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. In Albert Cohen and Martin T. Vechev, editors, Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, pages 556–571. ACM, 2017.
[3] Hugo Daniel Macedo and José Nuno Oliveira. Typing linear algebra: A biproduct-oriented approach. Sci. Comput. Program., 78(11):2160–2191, 2013.
[4] Multi-level IR compiler framework. https://mlir.llvm.org/.
[5] Sven-Bodo Scholz. Why rank-polymorphism matters. In Thomas Noll and Ira Justus Fesefeldt, editors, KPS 2023: 22. Kolloquium Programmiersprachen und Grundlagen der Programmierung, volume AIB-2023-03 of Aachener Informatik-Berichte, 2023. https://doi.org/10.18154/RWTH-2023-10034.
[6] Sven-Bodo Scholz and Artjoms Šinkarovs. Tensor comprehensions in SaC. In Jurriën Stutterheim and Wei-Ngan Chin, editors, IFL '19: Implementation and Application of Functional Languages, Singapore, September 25-27, 2019, pages 15:1–15:13. ACM, 2019.