Tendrils Compute

Mission Statement

Current Architecture Landscape

Since the early 2000s, the free lunch of single-core CPU performance has been over. Because transistor sizes and clock frequencies can no longer be scaled easily, progress has been slow, with only occasional architectural improvements. The main driver of CPU performance in the last two decades has been going multi-core, but CPUs have largely kept an architecture designed when everything ran on a single core: one large memory space shared by all cores. This makes every load and store instruction a global operation that requires all caches to be kept coherent, creating large inter-core communication overhead and limiting scalability to a few cores.

GPUs scale to a large number of cores and can make better use of high-bandwidth memory, but have a more restrictive programming model that works well only for a narrow class of algorithms, chiefly regular, data-parallel ones.

Engineers, especially in academia, have tried to innovate with new general-purpose architectures. The results are architectures that are more efficient or performant, but that rely either on the programmer managing all the clever architectural elements themselves, which hinders adoption, or on a magic compiler doing this work for the programmer, which rarely works well in practice.

Interaction Net Processor

We are building a novel computer architecture based on interaction nets, a model of computation that is inherently parallel, local, and general. These properties allow the design of a massively parallel architecture that can be programmed straightforwardly with a high-level programming language.
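To make these properties concrete, here is a toy interaction-net evaluator in Python. It is purely illustrative, not our compiler or hardware: the agents (Z, S, Add for unary addition), the port representation, and the sequential scan loop are our own expository choices. The point it demonstrates is real, though: each rewrite touches only the two agents of an active pair and the wires hanging off them, so disjoint pairs can be rewritten in any order, or in parallel, without coordination.

```python
class Node:
    """An agent with one principal port (ports[0]) and `arity` auxiliary ports."""
    def __init__(self, kind, arity):
        self.kind = kind                   # 'Z', 'S', 'Add', or 'Root'
        self.ports = [None] * (arity + 1)  # each entry is a (node, slot) peer

def link(a, i, b, j):
    """Wire port i of node a to port j of node b."""
    a.ports[i] = (b, j)
    b.ports[j] = (a, i)

# Interaction rules for unary addition. Each rule only touches the two
# nodes of the active pair and their adjacent wires: this locality is
# what lets disjoint pairs be rewritten in parallel without coherence.

def rule_z_add(z, add):
    # Z >< Add(y, r): the result is just the second addend.
    y, r = add.ports[1], add.ports[2]
    link(y[0], y[1], r[0], r[1])

def rule_s_add(s, add):
    # S(n) >< Add(y, r): r = S(r'), and the Add moves down to n.
    n = s.ports[1]             # predecessor of the S
    r = add.ports[2]           # where the result goes
    link(add, 0, n[0], n[1])   # Add now consumes the predecessor
    link(s, 0, r[0], r[1])     # the S wraps the result...
    link(s, 1, add, 2)         # ...of the smaller addition

def reduce_net(nodes):
    """Repeatedly fire active pairs (nodes joined principal-to-principal)."""
    fired = True
    while fired:
        fired = False
        for node in list(nodes):
            peer, slot = node.ports[0]
            if slot != 0:
                continue  # not an active pair
            if 'Add' not in {node.kind, peer.kind}:
                continue  # e.g. a numeral facing the Root: normal form
            num = node if node.kind != 'Add' else peer
            add = peer if node.kind != 'Add' else node
            if num.kind == 'Z':
                rule_z_add(num, add)
                nodes -= {num, add}  # both agents are consumed
            else:
                rule_s_add(num, add)
            fired = True
            break  # sequential scan; real hardware fires disjoint pairs at once

def numeral(n, nodes):
    """Build S(S(...Z)) and return the (node, slot) of its principal end."""
    z = Node('Z', 0); nodes.add(z)
    end = (z, 0)
    for _ in range(n):
        s = Node('S', 1); nodes.add(s)
        link(s, 1, *end)
        end = (s, 0)
    return end

def read(endpoint):
    """Decode a fully reduced numeral back to a Python int."""
    node, _ = endpoint
    n = 0
    while node.kind == 'S':
        n += 1
        node, _ = node.ports[1]
    assert node.kind == 'Z'
    return n

# 2 + 3: the Add agent faces the first addend on its principal port.
nodes = set()
add = Node('Add', 2); nodes.add(add)
link(add, 0, *numeral(2, nodes))
link(add, 1, *numeral(3, nodes))
root = Node('Root', 0); nodes.add(root)
link(add, 2, root, 0)
reduce_net(nodes)
print(read(root.ports[0]))  # -> 5
```

The `break` in the scan loop is where this sketch diverges from the hardware: a software loop must pick one redex at a time, whereas a machine built for interaction nets can fire every disjoint active pair in the same step.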

The generality of interaction nets means we can use a straightforward programming model for the chips, as interaction nets are well suited as a compilation target for a high-level programming language. The crucial change in our compiler is that instead of discarding the dependency information in the program and linearizing everything into a sequence of instructions, we preserve this information all the way down so that it is accessible to the hardware. So unlike CPU cores, which at run time try to re-extract some of this information through out-of-order scheduling, our hardware can use dependency information directly to distribute the workload over all available cores.

Furthermore, the locality of operations allows us to design the memory model with local SRAM memories, eliminating global cache coherency protocols and reducing access latency, bus contention, and energy usage.

Combining these properties allows the architecture to scale effectively to thousands of cores.

Economic Viability

With the key advantages of our architecture, we expect it to ultimately improve both performance and energy efficiency for most workloads currently running on CPUs. For the first version of our chip, this will not yet be the case for all workloads, both because of the enormous engineering effort that has already gone into CPU design and because we will start on a slightly older process node. Still, some applications will benefit enormously from our chips even in the beginning: algorithms that are highly parallel yet run poorly on GPUs due to complex control flow or irregular memory access patterns, including ones found in financial portfolio optimization, mixed-integer programming, shipping route optimization, graph neural networks, and telecom base stations.

This is similar to FPGAs, which are built from coarser building blocks and are not produced on the latest process node, yet still outperform CPUs and GPUs in applications where those are ill-suited.

This means we can start commercializing with applications that already benefit from the early versions of our chips and potentially even the FPGA implementation. Unlike hardware accelerators, our chips are general-purpose, so as we continuously improve the architecture and manufacturing, we can take over more and more workloads currently running on CPUs.

Where We’re At

We have already built Vine, our high-level language and compiler stack, which compiles directly down to interaction nets and allows programming of arbitrary algorithms. Moreover, we have implemented the first version of our core on FPGAs and demonstrated the first-ever interaction net program running natively in hardware at the FR8 demo day.

What’s Next

There is still lots of work to be done: optimizing the cores, scaling the design up to our large FPGAs, and extending the Vine standard library. This will allow us to run the first benchmarks against the current state of the art in the applications we are targeting first.