Vitalik's new article: Glue and Coprocessor Architectures, a New Concept to Enhance Efficiency and Security
Original Title: "Glue and coprocessor architectures"
Author: Vitalik Buterin, Founder of Ethereum
Translated by: Deng Tong, Jinse Finance
Special thanks to Justin Drake, Georgios Konstantopoulos, Andrej Karpathy, Michael Gao, Tarun Chitra, and various Flashbots contributors for their feedback and comments.
If you analyze any resource-intensive computation happening in the modern world with a moderate level of detail, one characteristic you will repeatedly find is that computation can be divided into two parts:
- A relatively small amount of complex but low-computation "business logic";
- A large amount of dense but highly structured "expensive work".
These two forms of computation are best handled in different ways: the former with an architecture that may be less efficient but needs to be highly general-purpose, and the latter with an architecture that may be less general-purpose but needs to be highly efficient.
What are some examples of this different approach in practice?
First, let’s look at the environment I am most familiar with: the Ethereum Virtual Machine (EVM). Here is a geth debug trace of a recent Ethereum transaction: updating the IPFS hash of my blog on ENS. The transaction consumed a total of 46,924 gas, which can be categorized as follows:
- Base cost: 21,000
- Call data: 1,556
- EVM execution: 24,368
- SLOAD opcode: 6,400
- SSTORE opcode: 10,100
- LOG opcode: 2,149
- Other: 6,719
EVM trace of the ENS hash update. The second-to-last column is gas consumption.
The moral of the story is that the bulk of the execution (about 73% if you look at the EVM alone, about 85% if you include the computational portion of the base cost) is concentrated in a very small number of structured expensive operations: storage reads and writes, logs, and cryptography (the base cost includes 3,000 paid for signature verification, and the EVM execution includes 272 paid for hashing). The rest of the execution is "business logic": shuffling the bits of the calldata to extract the ID of the record I am trying to set and the hash I am setting it to, and so on. In a token transfer, this would include adding and subtracting balances; in more advanced applications, it might include loops, and so on.
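For readers who want to reproduce this kind of breakdown, here is a minimal sketch, assuming a geth node with the debug API enabled (the RPC URL and transaction hash below are placeholders), that groups the gas reported in a debug_traceTransaction trace by opcode:

```python
# Minimal sketch: sum per-opcode gas from a geth debug trace.
# Assumes a geth node with the debug API enabled; the URL and
# transaction hash below are placeholders.
import requests
from collections import Counter

RPC_URL = "http://localhost:8545"
TX_HASH = "0x..."  # placeholder: hash of the transaction to inspect

trace = requests.post(RPC_URL, json={
    "jsonrpc": "2.0",
    "id": 1,
    "method": "debug_traceTransaction",
    "params": [TX_HASH, {}],
}).json()["result"]

gas_by_opcode = Counter()
for step in trace["structLogs"]:
    gas_by_opcode[step["op"]] += step["gasCost"]

# Print the most gas-hungry opcodes (SLOAD, SSTORE, LOG*, etc. dominate).
for op, gas in gas_by_opcode.most_common(10):
    print(f"{op:10s} {gas}")
```

Note that this only covers the EVM execution portion; the base cost and calldata cost are charged outside the opcode trace.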
In the EVM, these two forms of execution are handled in different ways. The high-level business logic is written in a higher-level language, typically Solidity, which compiles to EVM bytecode. The expensive work is still triggered by EVM opcodes (like SLOAD), but over 99% of the actual computation is done in dedicated modules written directly in client code (or even in libraries).
To reinforce the understanding of this pattern, let’s explore it in another context: AI code written in Python using torch.
Forward pass of a block in a transformer model
What do we see here? We see a relatively small amount of "business logic" written in Python that describes the structure of the operations being performed. In practical applications, there would also be another type of business logic that determines details such as how to fetch inputs and what operations to perform on the outputs. However, if we dive into each individual operation itself (self.norm, torch.cat, +, *, various steps inside self.attn…), we see vectorized computations: the same operation performed in parallel on a large number of values. Similar to the first example, a small portion of the computation is for business logic, while the majority is for executing large structured matrix and vector operations—indeed, most of it is just matrix multiplication.
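To make the split concrete, here is a minimal sketch of a transformer block in torch. The names (Block, norm1, attn, mlp) and layer choices are illustrative rather than the exact code from the figure, but the shape is the same: a few lines of Python "business logic" describe the order of operations, while each call dispatches to large vectorized kernels.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # Illustrative transformer block: names and layer choices are assumptions,
    # not the exact model from the figure above.
    def __init__(self, dim, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # "Business logic": a handful of Python statements describing structure.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # heavy, vectorized: mostly matmuls
        x = x + attn_out                   # large element-wise add
        x = x + self.mlp(self.norm2(x))    # MLP: again mostly matmuls
        return x

x = torch.randn(2, 16, 64)                 # (batch, sequence, embedding dim)
y = Block(dim=64, n_heads=4)(x)
print(y.shape)
```

Almost all of the floating-point work happens inside the calls on the right-hand side of those few lines; the Python itself contributes a negligible share of the runtime.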
Just like in the EVM example, these two types of work are handled in two different ways. The high-level business logic code is written in Python, which is a highly general-purpose and flexible language but also very slow, and we just accept the inefficiency because it only involves a small portion of the total computation cost. Meanwhile, the intensive operations are written in highly optimized code, often CUDA code running on GPUs. We are even increasingly starting to see LLM inference being performed on ASICs.
Modern programmable cryptography, such as SNARKs, again follows a similar pattern at two levels. First, the prover can be written in a high-level language, where the heavy lifting is done through vectorized operations, just like the AI example above. My circle STARK code here illustrates this. Second, the programs executed inside the cryptography can themselves be written in a way that splits general business logic from highly structured expensive work.
To understand how this works, we can look at one of the latest trends in STARK proofs. To be general and easy to use, teams are increasingly building STARK provers for widely adopted minimal virtual machines (like RISC-V). Any program that needs to prove execution can be compiled into RISC-V, and then the prover can prove the execution of that RISC-V code.
Chart from RiscZero documentation
This is very convenient: it means we only need to write the proof logic once, and from then on, any program that needs proving can be written in any "traditional" programming language (for example, RiscZero supports Rust). However, there is a problem: this approach incurs significant overhead. Programmable cryptography is already very expensive; adding the overhead of running code inside a RISC-V interpreter is too much. Therefore, developers have come up with a trick: identify the specific expensive operations (usually hashes and signatures) that constitute most of the computation, and then create dedicated modules to prove these operations very efficiently. You then simply combine the inefficient but general RISC-V proving system with the efficient but specialized proving systems, and get the best of both worlds.
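The sketch below is a toy illustration of this trick, not a real prover: a stand-in for the slow, general RISC-V proving path handles arbitrary steps, while steps identified as an expensive hash operation are routed to a stand-in for a specialized module. All class and function names are hypothetical.

```python
# Toy illustration of the "general prover + specialized modules" pattern.
# Nothing here is real cryptography; the classes are placeholders.
import hashlib

class GeneralVMProver:
    """Stands in for the inefficient but fully general RISC-V proving path."""
    def prove(self, step):
        return ("general-proof", step)

class HashCoprocessorProver:
    """Stands in for a dedicated circuit that proves hash evaluations cheaply."""
    def prove(self, data):
        return ("hash-proof", hashlib.sha256(data).hexdigest())

def prove_execution(trace):
    general, hash_unit = GeneralVMProver(), HashCoprocessorProver()
    proofs = []
    for step in trace:
        if step[0] == "sha256":           # identified expensive operation
            proofs.append(hash_unit.prove(step[1]))
        else:                             # everything else: slow general path
            proofs.append(general.prove(step))
    return proofs

print(prove_execution([("add", 1, 2), ("sha256", b"hello"), ("store", 3)]))
```

In a real system, the proofs produced by the two paths would then be linked together so that the combined statement covers the whole execution.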
Other forms of programmable cryptography beyond ZK-SNARKs, such as multiparty computation (MPC) and fully homomorphic encryption (FHE), may also use similar methods for optimization.
Overall, what is the phenomenon?
Modern computation increasingly follows what I call glue and coprocessor architectures: you have some central "glue" component that is highly general-purpose but inefficient, responsible for transferring data between one or more coprocessor components that are less general-purpose but highly efficient.
This is a simplification: in practice, the trade-off curve between efficiency and generality almost always has more than two layers. GPUs and other chips commonly referred to as "coprocessors" in the industry are less general-purpose than CPUs but more general-purpose than ASICs. The trade-offs of specialization are complex, and depend on predictions and intuitions about which parts of an algorithm will remain unchanged in five years and which parts will change in six months. In ZK proving architectures, we often see similar multi-layer specialization. But as a broad mental model, two layers is sufficient. Similar patterns appear across many areas of computation, including the examples discussed here:
- Ethereum: the EVM is the glue; precompiles (and dedicated opcodes) are the coprocessors.
- AI: Python (often with torch) is the glue; GPUs running CUDA code, and increasingly ASICs, are the coprocessors.
- Programmable cryptography: a general-purpose prover (for example, over RISC-V) is the glue; dedicated modules for proving hashes and signatures are the coprocessors.
From the examples above, it might seem like a law of nature that computation can be split this way, and indeed you can find examples of computational specialization going back decades. However, I believe this separation is increasing, for a few reasons:
- We have only recently hit the limits of CPU clock speed improvements, so further gains can only come from parallelization. However, parallelization is hard to reason about, so for developers it is often more practical to keep reasoning sequentially and let the parallelization happen in the backend, wrapped inside dedicated modules built for specific operations.
- Computation has only recently become fast enough that the computational cost of business logic is truly negligible. In this world, it also makes sense to optimize the VM running the business logic for goals other than computational efficiency: developer friendliness, familiarity, security, and similar goals. Meanwhile, the dedicated "coprocessor" modules can continue to be designed for efficiency, and derive their security and developer friendliness from their relatively simple "interface" with the glue (a small illustration of this cost gap follows this list).
- It is becoming increasingly clear what the most important expensive operations are. This is most evident in cryptography, where the specific expensive operations most likely to be used include modular arithmetic, elliptic curve linear combinations (also known as multi-scalar multiplication), fast Fourier transforms, and so on. It is also becoming increasingly evident in artificial intelligence, where for over two decades most computation has been "primarily matrix multiplication" (albeit at different levels of precision). Similar trends are emerging in other fields. Compared to 20 years ago, there are far fewer unknown unknowns in (computationally intensive) computation.
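A small, hedged illustration of this cost gap (the numbers will vary by machine): the same matrix multiplication run through a vectorized "coprocessor-style" call is orders of magnitude faster than the equivalent pure-Python loop, which is why it pays to keep only the orchestration in the slow, flexible glue layer.

```python
# Illustrative benchmark: vectorized matmul vs. the same computation in pure
# Python. Sizes and timings are arbitrary; the point is the orders-of-magnitude gap.
import time
import numpy as np

n = 128
a, b = np.random.rand(n, n), np.random.rand(n, n)

t0 = time.time()
c_fast = a @ b                            # dispatched to optimized native code
t_fast = time.time() - t0

t0 = time.time()
c_slow = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]              # same arithmetic, done in the "glue"
t_slow = time.time() - t0

print(f"vectorized: {t_fast:.5f}s, pure Python: {t_slow:.2f}s")
```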
What does this mean?
One key point is that the glue should be optimized to be a good glue, and the coprocessor should also be optimized to be a good coprocessor. We can explore the implications of this in several key areas.
EVM
Blockchain virtual machines (like the EVM) do not need to be efficient; they just need to be familiar. By simply adding the right coprocessors (also known as "precompiles"), the computation in an inefficient VM can actually be made as efficient as in a natively efficient VM. For example, the overhead produced by the EVM's 256-bit registers is relatively small, while the benefits of the EVM's familiarity and existing developer ecosystem are huge and lasting. Teams optimizing the EVM have even found that the lack of parallelization is often not the main barrier to scalability.
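To see the precompile path in action, here is a hedged sketch that calls the SHA-256 precompile (which lives at address 0x02) via a plain eth_call against any Ethereum JSON-RPC endpoint (the URL below is a placeholder). The hash is computed by native client code, not by EVM opcodes:

```python
# Hedged sketch: invoke the SHA-256 precompile at address 0x02 via eth_call.
# The endpoint URL is a placeholder; any Ethereum JSON-RPC node will do.
import requests

RPC_URL = "http://localhost:8545"  # placeholder endpoint

payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "eth_call",
    "params": [{
        "to": "0x0000000000000000000000000000000000000002",  # SHA-256 precompile
        "data": "0x" + b"hello".hex(),                        # raw input bytes
    }, "latest"],
}
digest = requests.post(RPC_URL, json=payload).json()["result"]
print(digest)  # 32-byte SHA-256 digest of "hello", computed in client code
```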
The best way to improve the EVM may simply be (i) to add better precompiles or dedicated opcodes, such as some combination of EVM-MAX and SIMD, and (ii) to improve the storage layout; for example, the Verkle tree changes, as a side effect, greatly reduce the cost of accessing adjacent storage slots.
Storage optimization in the Ethereum Verkle tree proposal, placing adjacent storage keys together and adjusting gas costs to reflect this. Optimizations like this, along with better precompiles, may be more important than adjusting the EVM itself.
Secure computation and open hardware
One major challenge in improving the security of modern computation at the hardware level is its overly complex and proprietary nature: chips are designed for efficiency, which requires proprietary optimizations. Backdoors are easily hidden, and side-channel vulnerabilities are constantly being discovered.
Efforts continue from multiple angles to push for more open and secure alternatives. Some computations are increasingly being done in trusted execution environments, including on users' phones, which has already improved user security. The movement towards more open-source consumer hardware continues, with some recent victories, such as RISC-V laptops running Ubuntu.
RISC-V laptop running Debian
However, efficiency remains an issue. The author of the linked article above writes:
"Newer open-source chip designs like RISC-V cannot compete with processor technologies that have existed and been refined for decades. Progress always has to start somewhere."
More paranoid ideas, like designs that build a RISC-V computer on top of an FPGA, face even greater overhead. But what if the glue and coprocessor architecture means that this overhead does not actually matter? What if we accept that open and secure chips will be slower than proprietary chips, if necessary even giving up common optimizations like speculative execution and branch prediction, but try to compensate by adding (if needed, proprietary) ASIC modules for the most intensive specific types of computation? Sensitive computations could be done in a "main chip" that is optimized for security, open-source design, and resistance to side-channel attacks. More intensive computations (like ZK proofs and AI) would be done in the ASIC modules, which would learn less information about the computations being performed (possibly, through cryptographic blinding, even zero information in some cases).
Cryptography
Another key point is that all of this is very optimistic for cryptography, especially for programmable cryptography becoming mainstream. We have already seen some highly optimized implementations of specific structured computations in SNARKs, MPC, and other settings: the overhead of certain hash functions is only a few hundred times that of running the computation directly, and the overhead for AI (primarily matrix multiplication) is also very low. Further improvements like GKR may reduce this even further. Fully general VM execution, especially when run inside a RISC-V interpreter, may continue to incur roughly ten-thousand-fold overhead, but for the reasons described in this article, this does not matter much: as long as the most intensive parts of the computation are handled separately with efficient dedicated techniques, the total overhead is manageable.
A simplified diagram of dedicated MPC for matrix multiplication, which is the largest component in AI model inference. See this article for more details, including how to keep the model and inputs private.
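A back-of-the-envelope sketch of the "total overhead is manageable" point, with purely illustrative numbers (not measurements): suppose 99% of the underlying work is structured computation proven with a roughly 100x-overhead specialized module, and the remaining 1% is business logic proven in a roughly 10,000x-overhead general interpreter.

```python
# Illustrative arithmetic only: the fractions and overhead factors are assumptions.
structured_fraction, structured_overhead = 0.99, 100      # specialized modules
general_fraction, general_overhead = 0.01, 10_000         # general RISC-V path

total_overhead = (structured_fraction * structured_overhead
                  + general_fraction * general_overhead)
print(total_overhead)  # -> 199.0: overall ~200x, dominated by the efficient path
```

Under these assumptions the blended overhead is around 200x rather than 10,000x, which is why it is acceptable for the glue layer to stay slow and general.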
The idea that "the glue layer only needs to be familiar, not efficient" has one exception: latency, and to a lesser extent, data bandwidth. If the computation involves performing heavy operations on the same data dozens of times (as in cryptography and AI), then any latency caused by the inefficient glue layer could become the main bottleneck in runtime. Therefore, the glue layer also has efficiency requirements, although these requirements are more specific.
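To make the latency caveat concrete, here is a small illustration with made-up numbers: when the same data is shuttled back and forth many times, a fixed per-call delay in the glue can dominate the total runtime even if the coprocessor itself is very fast.

```python
# Illustrative numbers only: per-call times and the round-trip count are assumptions.
coprocessor_time_per_call = 0.5e-3   # 0.5 ms of actual heavy computation
glue_latency_per_call = 2e-3         # 2 ms of dispatch/copy overhead in the glue
num_round_trips = 1000               # the same data crosses the boundary repeatedly

total = num_round_trips * (coprocessor_time_per_call + glue_latency_per_call)
glue_share = glue_latency_per_call / (coprocessor_time_per_call + glue_latency_per_call)
print(f"total: {total:.2f}s, share spent in glue latency: {glue_share:.0%}")  # 2.50s, 80%
```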
Conclusion
Overall, I believe the trends described above are very positive developments from multiple perspectives. First, this is a sensible way to maximize computational efficiency while maintaining developer friendliness, getting more of both for everyone. In particular, specialization on the client side to improve efficiency enhances our ability to run sensitive and performance-demanding computations (like ZK proofs and LLM inference) locally on user hardware. Second, it creates a huge window of opportunity to ensure that the pursuit of efficiency does not compromise other values, most notably security, openness, and simplicity: side-channel security and openness in computer hardware, reduced circuit complexity in ZK-SNARKs, and reduced complexity in virtual machines. Historically, the pursuit of efficiency has pushed these other factors into the back seat. With glue and coprocessor architectures, this no longer needs to be the case: one part of the machine optimizes for efficiency, while another part optimizes for generality and other values, and the two work in synergy.
This trend is also very favorable for cryptography, since cryptography itself is a prime example of "expensive structured computation," and the trend accelerates its development. It also adds another opportunity to improve security. In the blockchain world, improvements in security likewise become possible: we can worry less about optimizing the virtual machine and focus more on optimizing the precompiles and other functionality that coexists with it.
Third, this trend provides opportunities for smaller, newer participants to get involved. If computation becomes less monolithic and more modular, the barrier to entry drops substantially: even an ASIC for just one type of computation can make a difference, and the same goes for ZK proving and EVM optimization. Writing code with near state-of-the-art efficiency becomes easier and more accessible, as does auditing and formally verifying such code. Finally, as these very different areas of computation converge towards common patterns, there is more room for collaboration and learning between them.