Vitalik's new article: The possible future of Ethereum, The Surge
Original Title: 《Possible futures for the Ethereum protocol, part 2: The Surge》
Author: Vitalik Buterin
Compiled by: Karen, Foresight News
Special thanks to Justin Drake, Francesco, Hsiao-wei Wang, @antonttc, and Georgios Konstantopoulos.
Initially, the Ethereum roadmap included two scaling strategies. One (see an early paper from 2015) was "sharding": each node only needs to verify and store a small portion of transactions, rather than verifying and storing every transaction on the chain. Other peer-to-peer networks (like BitTorrent) already work this way, so we can certainly make blockchains work the same way. The other was Layer 2 protocols: networks built on top of Ethereum that fully benefit from its security while keeping most data and computation off the main chain. "Layer 2 protocols" meant state channels in 2015, Plasma in 2017, and then Rollups in 2019. Rollups are more powerful than state channels or Plasma, but they require a lot of on-chain data bandwidth. Fortunately, by 2019, sharding research had solved the problem of verifying "data availability" at scale. As a result, the two paths merged, and we got the Rollup-centric roadmap, which remains Ethereum's scaling strategy today.
The Surge, 2023 roadmap version
The Rollup-centric roadmap proposes a simple division of labor: Ethereum L1 focuses on being a robust and decentralized base layer, while L2 takes on the task of helping the ecosystem scale. This model is ubiquitous in society: the existence of the court system (L1) is not to pursue ultra-high speed and efficiency, but to protect contracts and property rights, while entrepreneurs (L2) build on this solid foundation, leading humanity to (both literally and metaphorically) Mars.
This year, the Rollup-centric roadmap achieved significant results: with the launch of EIP-4844 blobs, Ethereum L1's data bandwidth has greatly increased, and multiple Ethereum Virtual Machine (EVM) Rollups have reached Stage 1. Each L2 exists as a "shard" with its own internal rules and logic, and diversity and pluralism in how shards are implemented are now a reality. But as we have seen, following this path also brings some unique challenges. Our task now is therefore to complete the Rollup-centric roadmap and address these issues, while preserving the robustness and decentralization that make Ethereum L1 special.
The Surge: Key Objectives
- Future Ethereum can achieve over 100,000 TPS through L2;
- Maintain the decentralization and robustness of L1;
- At least some L2 fully inherit Ethereum's core attributes (trustlessness, openness, censorship resistance);
- Ethereum should feel like a unified ecosystem, rather than 34 different blockchains.
Chapter Contents
- Scalability Trilemma
- Further Progress on Data Availability Sampling
- Data Compression
- Generalized Plasma
- Mature L2 Proof Systems
- Cross-L2 Interoperability Improvements
- Scaling Execution on L1
Scalability Trilemma
The scalability trilemma is an idea proposed in 2017, which posits that there is a contradiction between three characteristics of blockchain: decentralization (more specifically: low cost of running nodes), scalability (high number of transactions processed), and security (attackers need to compromise a large portion of nodes in the network to cause a single transaction to fail).
It is worth noting that the trilemma is not a theorem, and the post introducing it does not come with a mathematical proof. It does give a heuristic mathematical argument: if a decentralization-friendly node (like a consumer-grade laptop) can verify N transactions per second, and you have a chain that processes k*N transactions per second, then either (i) each transaction is only seen by 1/k of the nodes, meaning an attacker only needs to compromise a few nodes to push a bad transaction through, or (ii) your nodes have to become powerful and your chain does not stay decentralized. The purpose of this article is not to prove that breaking the trilemma is impossible; rather, it aims to show that breaking the trilemma is hard and requires stepping outside the thinking framework implied by the argument.
Over the years, some high-performance chains have claimed to solve the trilemma without fundamentally changing their architecture, often by applying software engineering techniques to optimize nodes. This is always misleading, as running a node on these chains is much more difficult than running a node on Ethereum. This article will explore why this is the case and why Ethereum cannot scale solely based on L1 client software engineering.
However, the combination of data availability sampling with SNARKs does solve the trilemma: it allows clients to verify that a certain amount of data is available and that a certain number of computational steps were executed correctly, while downloading only a small portion of the data and performing only a small amount of computation. SNARKs are trustless. Data availability sampling has a subtle few-of-N trust model, but it preserves the fundamental property of non-scalable chains: even a 51% attack cannot force the network to accept bad blocks.
Another way to solve the trilemma is through the Plasma architecture, which cleverly incentivizes users to take on the responsibility of monitoring data availability. Back in 2017-2019, when we only had fraud proofs to scale computational capacity, Plasma was very limited in secure execution, but with the rise of SNARKs (zero-knowledge succinct non-interactive arguments), the Plasma architecture has become much more viable for a wider range of use cases than ever before.
Further Progress on Data Availability Sampling
What Problem Are We Solving?
Since March 13, 2024, when the Dencun upgrade went live, the Ethereum blockchain has had 3 blobs of approximately 125 kB each per 12-second slot, or about 375 kB of data availability bandwidth per slot. Assuming transaction data is published directly on-chain, an ERC20 transfer is about 180 bytes, so the maximum TPS for Rollups on Ethereum is: 375000 / 12 / 180 = 173.6 TPS.
If we add Ethereum's calldata (theoretical maximum: 30 million Gas per slot / 16 gas per byte = 1,875,000 bytes per slot), this becomes 607 TPS. With PeerDAS, the number of blobs could increase to 8-16, which would give 463-926 TPS from blob data.
This is a significant improvement for Ethereum L1, but it is not enough. We want more scalability. Our mid-term goal is 16 MB per slot, which, combined with improvements in Rollup data compression, would bring approximately 58,000 TPS.
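As a rough sanity check on these figures, here is a minimal Python sketch of the throughput arithmetic used in this section. The 12-second slot, ~125 kB blob size, and 180-byte ERC20 transfer footprint come from the text above; the helper function and variable names are purely illustrative.

```python
# Back-of-the-envelope Rollup throughput from a data availability budget.
SLOT_SECONDS = 12
BLOB_BYTES = 125_000            # ~125 kB per blob, as quoted above
ERC20_TRANSFER_BYTES = 180      # approximate on-chain footprint of an ERC20 transfer

def tps(bytes_per_slot: float, bytes_per_tx: float = ERC20_TRANSFER_BYTES) -> float:
    """Transactions per second supported by a given per-slot data budget."""
    return bytes_per_slot / SLOT_SECONDS / bytes_per_tx

print(tps(3 * BLOB_BYTES))                          # ~173.6 TPS with today's 3 blobs per slot
print(tps(8 * BLOB_BYTES), tps(16 * BLOB_BYTES))    # ~463-926 TPS with PeerDAS (8-16 blobs)
print(tps(16_000_000))                              # ~7,407 TPS at the 16 MB mid-term target
# The ~58,000 TPS figure above additionally assumes gains from Rollup data compression.
```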
What Is It? How Does It Work?
PeerDAS is a relatively simple implementation of "1D sampling." In Ethereum, each blob is a degree-4096 polynomial over a 253-bit prime field. We broadcast "shares" of the polynomial, where each share consists of 16 evaluations at 16 adjacent coordinates taken from a total set of 8192 coordinates. Any 4096 of the 8192 evaluations (per the currently proposed parameters: any 64 out of 128 possible samples) can recover the blob.
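The claim that any 4096 of the 8192 evaluations suffice to recover the blob is just polynomial interpolation: a polynomial of degree less than k is uniquely determined by any k of its evaluations. Below is a hedged toy sketch over a tiny prime field using naive Lagrange interpolation; the real protocol uses a 253-bit field, KZG commitments, and far more efficient recovery algorithms.

```python
# Toy erasure coding: a "blob" of k field elements defines a polynomial of degree < k,
# which is extended to n > k evaluations; any k of those evaluations recover the data.
# Real parameters are k = 4096, n = 8192 over a 253-bit prime field; here k = 4, n = 8.
P = 257  # toy prime modulus

def eval_poly(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x, mod P (Horner's rule)."""
    y = 0
    for c in reversed(coeffs):
        y = (y * x + c) % P
    return y

def interpolate(points, x):
    """Lagrange-interpolate the unique degree-<k polynomial through `points`, evaluate at x."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

k, n = 4, 8                                           # stand-ins for 4096 and 8192
blob = [17, 42, 7, 99]                                # original data = polynomial coefficients
evals = [(x, eval_poly(blob, x)) for x in range(n)]   # the 2x-extended evaluations

kept = [evals[1], evals[3], evals[4], evals[6]]       # any k surviving samples
recovered = [interpolate(kept, x) for x in range(k)]  # re-derive the lost evaluations
assert recovered == [y for _, y in evals[:k]]
```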
PeerDAS works by having each client listen to a small number of subnets, where the i-th subnet broadcasts the i-th sample of any blob and requests the blobs it needs from other subnets by querying peers in the global p2p network (who will listen to different subnets). A more conservative version, SubnetDAS, uses only the subnet mechanism without additional peer-layer queries. The current proposal is for staking nodes to use SubnetDAS, while other nodes (i.e., clients) use PeerDAS.
Theoretically, we can scale "1D sampling" quite far: if we increase the maximum number of blobs to 256 (with a target of 128), we reach the 16 MB goal, with each node's data availability sampling costing 16 samples * 128 blobs * 512 bytes per sample per blob = 1 MB of data bandwidth per slot. This is just barely within our tolerance: it is feasible, but it means bandwidth-constrained clients cannot sample. We could optimize this somewhat by reducing the number of blobs and increasing the blob size, but that would make reconstruction more expensive.
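A quick check of the 1 MB-per-slot figure, using the parameters quoted above (16 evaluations per sample and 32 bytes per field element, which is where the 512-byte sample size comes from):

```python
# Per-node sampling bandwidth at the 16 MB target, using the quoted parameters.
FIELD_ELEMENT_BYTES = 32          # one evaluation of the ~253-bit field fits in 32 bytes
EVALS_PER_SAMPLE = 16
SAMPLE_BYTES = EVALS_PER_SAMPLE * FIELD_ELEMENT_BYTES   # 512 bytes per sample

SAMPLES_PER_NODE = 16
BLOB_TARGET = 128                 # target blob count (maximum 256 in the text)

print(SAMPLES_PER_NODE * BLOB_TARGET * SAMPLE_BYTES)    # 1,048,576 bytes ~= 1 MB per slot
```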
Thus, we ultimately want to go further and implement 2D sampling, which not only samples randomly within blobs but also samples randomly between blobs. By leveraging the linear properties of KZG commitments, we can expand the set of blobs in a block with a new set of virtual blobs that redundantly encode the same information.
2D sampling. Source: a16z crypto
Crucially, computing the expansion of the commitments does not require having the blobs themselves, so this scheme is fundamentally friendly to distributed block building. The node that actually builds a block only needs the blob KZG commitments and can rely on data availability sampling (DAS) to verify that the blobs are available. One-dimensional DAS (1D DAS) is also inherently friendly to distributed block building.
What Are the Links to Existing Research?
- Original post introducing data availability (2018): https://github.com/ethereum/research/wiki/A-note-on-data-availability-and-erasure-coding
- Follow-up paper: https://arxiv.org/abs/1809.09044
- Explainer post on DAS from Paradigm: https://www.paradigm.xyz/2022/08/das
- 2D availability with KZG commitments: https://ethresear.ch/t/2d-data-availability-with-kate-commitments/8081
- PeerDAS on ethresear.ch: https://ethresear.ch/t/peerdas-a-simpler-das-approach-using-battle-tested-p2p-components/16541 and paper: https://eprint.iacr.org/2024/1362
- EIP-7594: https://eips.ethereum.org/EIPS/eip-7594
- SubnetDAS on ethresear.ch: https://ethresear.ch/t/subnetdas-an-intermediate-das-approach/17169
- Nuances of data recoverability in data availability sampling: https://ethresear.ch/t/nuances-of-data-recoverability-in-data-availability-sampling/16256
What Still Needs to Be Done? What Are the Trade-offs?
The next step is to complete the implementation and rollout of PeerDAS. After that, increasing the blob count on PeerDAS is a gradual process of carefully monitoring the network and improving the software to ensure safety. At the same time, we want more academic work on formalizing PeerDAS and other versions of DAS, and on their interactions with issues such as the safety of the fork choice rule.
In the further future, we need to do more work to determine the ideal version of 2D DAS and prove its security properties. We also hope to eventually transition from KZG to a quantum-safe and trustless alternative. Currently, it is unclear which candidates are friendly to distributed block building. Even using expensive "brute force" techniques, such as using recursive STARKs to generate validity proofs for reconstructing rows and columns, is not sufficient, because while technically a STARK's size is O(log(n) * log(log(n))) hashes (using STIR), in practice, a STARK is almost as large as the entire blob.
I believe the realistic long-term options are:
- Implement the ideal 2D DAS;
- Stick with 1D DAS, sacrificing sampling bandwidth efficiency for simplicity and robustness while accepting a lower data ceiling;
- (Hard pivot) Abandon DA and fully embrace Plasma as our primary Layer 2 architecture.
Note that even if we decide to scale execution directly at the L1 layer, this option exists. This is because if the L1 layer is to handle a large amount of TPS, L1 blocks will become very large, and clients will want an efficient way to verify their correctness, so we will have to use the same techniques at the L1 layer as with Rollups (like ZK-EVM and DAS).
How Does It Interact with Other Parts of the Roadmap?
If data compression is implemented, the need for 2D DAS will be reduced, or at least delayed; if Plasma is widely used, the demand will decrease further. DAS also poses challenges for distributed block building protocols and mechanisms: while DAS is theoretically friendly to distributed reconstruction, in practice this needs to be combined with inclusion list proposals and the fork choice mechanics around them.
Data Compression
What Problem Are We Solving?
Each transaction in a Rollup takes up a significant amount of on-chain data space: an ERC20 transfer requires about 180 bytes. Even with ideal data availability sampling, this limits the scalability of Layer 2 protocols. With 16 MB per slot, we get:
16000000 / 12 / 180 = 7407 TPS
What if we could not only solve the numerator problem but also the denominator problem, allowing each transaction in a Rollup to occupy fewer bytes on-chain?
What is it, and how does it work?
In my view, the best explanation is this diagram from two years ago:
In zero-byte compression, each long sequence of zero bytes is replaced with two bytes indicating how many zero bytes there are. Beyond that, we can exploit specific properties of transactions (a sketch of these techniques follows the list below):
- Signature aggregation: We switch from ECDSA signatures to BLS signatures, which have the property that many signatures can be combined into a single signature attesting to the validity of all the original signatures. On L1, BLS signatures are not being considered because verification is computationally expensive even with aggregation. But in a data-scarce environment like L2, using BLS signatures makes sense. The aggregation feature of ERC-4337 provides one path to achieving this.
- Replacing addresses with pointers: If an address has appeared before, we can replace the 20-byte address with a 4-byte pointer to its location in history.
- Custom serialization of transaction values: most transaction values have very few significant digits; for example, 0.25 ETH is represented as 250,000,000,000,000,000 wei. Maximum base fees and priority fees look similar. We can therefore represent most currency values with a custom decimal floating-point format.
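To make these techniques concrete, here is a small, hedged Python sketch of zero-byte run-length compression, address-to-pointer substitution, and a toy decimal floating-point encoding for round currency values. The exact byte layouts here are illustrative only and do not correspond to any Rollup's actual wire format.

```python
def compress_zero_bytes(data: bytes) -> bytes:
    """Replace each run of zero bytes with two bytes: 0x00 followed by the run length."""
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == 0:
            run = 1
            while i + run < len(data) and data[i + run] == 0 and run < 255:
                run += 1
            out += bytes([0, run])
            i += run
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

def address_to_pointer(address: bytes, seen: dict) -> bytes:
    """Replace a 20-byte address with a 4-byte index once it has appeared before.
    (A real format would also need a flag bit to tell the two cases apart.)"""
    if address in seen:
        return seen[address].to_bytes(4, "big")
    seen[address] = len(seen)
    return address          # first occurrence: publish the full 20-byte address

def encode_value_wei(value: int) -> bytes:
    """Toy decimal floating point: one exponent byte (power of ten) plus a short mantissa."""
    exponent = 0
    while value and value % 10 == 0:
        value //= 10
        exponent += 1
    mantissa = value.to_bytes((value.bit_length() + 7) // 8 or 1, "big")
    return bytes([exponent]) + mantissa

# 0.25 ETH = 250,000,000,000,000,000 wei compresses to 2 bytes: exponent 16, mantissa 25.
print(encode_value_wei(250_000_000_000_000_000).hex())   # '1019'
```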
What Are the Links to Existing Research?
- Exploration from sequence.xyz: https://sequence.xyz/blog/compressing-calldata
- L2 Calldata optimization contracts: https://github.com/ScopeLift/l2-optimizoooors
- Validity proof-based Rollups (aka ZK rollups) publish state diffs instead of transactions: https://ethresear.ch/t/rollup-diff-compression-application-level-compression-strategies-to-reduce-the-l2-data-footprint-on-l1/9975
- BLS wallet - implementing BLS aggregation through ERC-4337: https://github.com/getwax/bls-wallet
What Still Needs to Be Done? What Are the Trade-offs?
The next major task is to actually implement the above solutions. The main trade-offs include:
- Switching to BLS signatures requires significant effort and reduces compatibility with trusted hardware chips that could enhance security. ZK-SNARK wrappers around other signature schemes can be used instead.
- Dynamic compression (e.g., replacing addresses with pointers) complicates client code.
- Publishing state diffs on-chain instead of transactions reduces auditability and breaks many software tools (such as block explorers).
How Does It Interact with Other Parts of the Roadmap?
Adopting ERC-4337, and eventually enshrining parts of it in the L2 EVM, can significantly accelerate the deployment of aggregation techniques. Enshrining parts of ERC-4337 on L1 can speed up its deployment on L2s.
Generalized Plasma
What Problem Are We Solving?
Even with 16 MB blobs and data compression, 58,000 TPS may not be sufficient to fully meet the demands of consumer payments, decentralized social, or other high-bandwidth domains, especially when we start considering privacy factors, which could reduce scalability by 3-8 times. For high transaction volume, low-value use cases, one current option is Validium, which keeps data off-chain and employs an interesting security model: operators cannot steal users' funds, but they may temporarily or permanently freeze all users' funds. But we can do better.
What Is It? How Does It Work?
Plasma is a scaling solution in which an operator publishes blocks off-chain and puts only the Merkle roots of those blocks on-chain (unlike Rollups, which put the complete blocks on-chain). For each block, the operator sends each user a Merkle branch proving what happened to that user's assets, or that nothing changed. Users can withdraw their assets by providing a Merkle branch. Importantly, the branch does not need to be rooted in the latest state, so even if data availability fails, users can still recover their assets by withdrawing the latest state available to them. If a user submits an invalid branch (for example, withdrawing an asset they have already sent to someone else, or one the operator created out of thin air), legitimate ownership of the asset can be settled through an on-chain challenge mechanism.
Plasma Cash chain diagram. The transaction spending coin i is placed at the i-th position in the tree. In this example, assuming all previous trees are valid, we know Eve currently owns token 1, David owns token 4, and George owns token 6.
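The core mechanic here is that the operator posts only a Merkle root on-chain and users exit with a Merkle branch. Below is a minimal, hedged sketch of building a root over per-coin leaves and verifying a branch against it; the hashing scheme and leaf layout are simplifications, not any production Plasma implementation.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Build a binary Merkle tree over the leaves; leaf i tracks coin i."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])               # duplicate last node if the level is odd
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_branch(leaves: list[bytes], index: int) -> list[bytes]:
    """Collect the sibling hashes needed to prove leaf `index` against the root."""
    level = [h(leaf) for leaf in leaves]
    branch = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        branch.append(level[index ^ 1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return branch

def verify_branch(leaf: bytes, index: int, branch: list[bytes], root: bytes) -> bool:
    """What the on-chain exit contract would check before honoring a withdrawal."""
    node = h(leaf)
    for sibling in branch:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

# Toy block: leaf i records the latest transaction spending coin i (or "no change").
leaves = [b"coin0:no-change", b"coin1:Alice->Eve", b"coin2:no-change", b"coin3:Bob->David"]
root = merkle_root(leaves)                    # this is all the operator posts on-chain
proof = merkle_branch(leaves, 1)              # the operator hands this branch to Eve
assert verify_branch(b"coin1:Alice->Eve", 1, proof, root)
```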
Early versions of Plasma could only handle payment use cases and could not effectively scale further. However, if we require each root to be verified with SNARKs, then Plasma becomes much more powerful. Each challenge game can be greatly simplified because we eliminate most possible paths for the operator to cheat. At the same time, it opens up new paths for Plasma technology to scale to a broader range of asset classes. Finally, when the operator does not cheat, users can withdraw funds immediately without waiting for a week-long challenge period.
One way (not the only way) to create an EVM Plasma chain: use a ZK-SNARK to build a parallel UTXO tree that reflects the balance changes made by the EVM and defines a unique mapping of what counts as "the same coin" at different points in history. A Plasma construction can then be built on top of this.
A key insight is that Plasma systems do not need to be perfect. Even if you can only protect a subset of assets (for example, only tokens that have not moved in the past week), you have already greatly improved the current state of ultra-scalable EVMs (i.e., Validium).
Another class of constructions is hybrid Plasma/Rollup, such as Intmax. These constructions put a tiny amount of data per user on-chain (e.g., 5 bytes), which yields properties somewhere between Plasma and Rollup: in the case of Intmax, you get very high scalability and privacy, although even with 16 MB per slot the theoretical ceiling is about 16,000,000 / 12 / 5 = 266,667 TPS.
What Are the Links to Existing Research?
- Original Plasma paper: https://plasma.io/plasma-deprecated.pdf
- Plasma Cash: https://ethresear.ch/t/plasma-cash-plasma-with-much-less-per-user-data-checking/1298
- Plasma Cashflow: https://hackmd.io/DgzmJIRjSzCYvl4lUjZXNQ?view#🚪-Exit
- Intmax (2023): https://eprint.iacr.org/2023/1082
What Still Needs to Be Done? What Are the Trade-offs?
The main remaining task is to put Plasma systems into actual production use. As mentioned, Plasma and Validium are not an either/or choice: any Validium can improve its security properties by incorporating Plasma features into its exit mechanism. Research focuses on obtaining the best possible properties for the EVM (in terms of trust requirements, worst-case L1 Gas costs, and DoS resistance), as well as on alternative application-specific constructions. Additionally, Plasma is conceptually more complex than Rollup, which needs to be addressed directly through research and by building better general frameworks.
The main trade-off of Plasma designs is that they rely more on operators and are harder to make "based", although hybrid Plasma/Rollup designs can often avoid this weakness.
How Does It Interact with Other Parts of the Roadmap?
The more effective the Plasma solution, the less pressure there will be on L1 to have high-performance data availability features. Moving activity to L2 can also reduce MEV pressure on L1.
Mature L2 Proof Systems
What Problem Are We Solving?
Currently, most Rollups are not truly trustless. There exists a security committee that has the power to override (optimistic or validity) proof system behavior. In some cases, the proof system may not even run at all, or if it does run, it may only have a "consultative" function. The most advanced Rollups include: (i) some trustless application-specific Rollups, like Fuel; (ii) as of the writing of this article, Optimism and Arbitrum are two full EVM Rollups that have achieved a milestone known as "Stage 1" of partial trustlessness. The reason Rollups have failed to make greater progress is the concern over bugs in the code. We need trustless Rollups, so we must confront and solve this issue head-on.
What Is It? How Does It Work?
First, let’s revisit the "stage" system introduced earlier in this article.
Stage 0: Users must be able to run nodes and synchronize the chain. If verification is fully trusted/centralized, that is acceptable.
Stage 1: There must be a (trustless) proof system ensuring that only valid transactions are accepted. A security committee that can overturn the proof system may exist, but only with a 75% vote threshold. Additionally, the quorum-blocking portion of the committee (i.e., 26%+) must be outside the main company building the Rollup. A weaker upgrade mechanism (like a DAO) is allowed, but it must have a long enough delay that if it approves a malicious upgrade, users can withdraw their funds before the upgrade goes live.
Stage 2: There must be a (trustless) proof system ensuring that only valid transactions are accepted. The security committee is only allowed to intervene when there are provable errors in the code, such as if two redundant proof systems are inconsistent with each other, or if one proof system accepts two different post-state roots for the same block (or does not accept anything for a sufficiently long time, like a week). An upgrade mechanism is allowed, but it must have a long delay.
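Since the stage definitions above are essentially a checklist, a hedged sketch of them as a classification function may help. The criteria are paraphrased from the text; the field names are illustrative, and real assessments (such as L2Beat's) are considerably more nuanced.

```python
from dataclasses import dataclass

@dataclass
class RollupProperties:
    has_running_proof_system: bool          # a proof system that actually rejects invalid state
    council_override_threshold: float       # vote share the council needs to override it
    council_quorum_outside_team: bool       # can a blocking 26%+ come from outside the team?
    upgrade_delay_long_enough: bool         # users can exit before a malicious upgrade activates
    council_limited_to_provable_bugs: bool  # council may only act on provable errors (Stage 2)

def stage(r: RollupProperties) -> int:
    """Rough classification into Stage 0/1/2 per the criteria described above."""
    if not r.has_running_proof_system:
        return 0
    stage1 = (r.council_override_threshold >= 0.75
              and r.council_quorum_outside_team
              and r.upgrade_delay_long_enough)
    if not stage1:
        return 0
    return 2 if r.council_limited_to_provable_bugs else 1

example = RollupProperties(True, 0.75, True, True, False)
print(stage(example))   # 1
```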
Our goal is to reach Stage 2. The main challenge in reaching Stage 2 is gaining sufficient confidence that the proof system is actually trustworthy enough. There are two main ways to execute this:
- Formal verification: We can use modern mathematical and computational techniques to prove that (optimistic and validity) proof systems only accept blocks that conform to the EVM specification. These techniques have existed for decades, but recent advances (like Lean 4) have made them more practical, and advancements in AI-assisted proofs may further accelerate this trend.
- Multi-provers: Build multiple proof systems and put funds behind a combination of those proof systems and a security committee (or another gadget with trust assumptions, such as a TEE). When the proof systems agree, the committee has no power; when they disagree, the committee can only choose one of their answers and cannot unilaterally impose its own.
Diagram of multi-provers, combining an optimistic proof system, a validity proof system, and a security committee.
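Below is a hedged sketch of the multi-prover decision rule just described: two independent proof systems, with the security committee acting only as a tie-breaker between their answers. Names and types are illustrative; in practice this logic would live in an on-chain contract, for example a multisig whose signers are the proof-system contracts.

```python
from typing import Optional

def resolve_state_root(optimistic_root: str,
                       validity_root: str,
                       council_choice: Optional[str] = None) -> str:
    """If the two proof systems agree, the council has no power.
    If they disagree, the council may only pick one of the two answers;
    it can never impose a third value of its own."""
    if optimistic_root == validity_root:
        return optimistic_root
    if council_choice in (optimistic_root, validity_root):
        return council_choice
    raise RuntimeError("disagreement: waiting for the council to pick one of the two roots")

# Agreement: the council is irrelevant.
assert resolve_state_root("0xabc", "0xabc") == "0xabc"
# Disagreement: the council can only endorse one of the existing answers.
assert resolve_state_root("0xabc", "0xdef", council_choice="0xdef") == "0xdef"
```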
What Are the Links to Existing Research?
- EVM K Semantics (formal verification work from 2017): https://github.com/runtimeverification/evm-semantics
- Talk on the multi-prover idea (2022): https://www.youtube.com/watch?v=6hfVzCWT6YI
- Taiko plans to use multi-proofs: https://docs.taiko.xyz/core-concepts/multi-proofs/
What Still Needs to Be Done? What Are the Trade-offs?
For formal verification, the workload is substantial. We need to create a formally verified version of an entire SNARK prover for the EVM. This is an extremely complex project, although we have already started on it. One trick that greatly simplifies the task: we can create a formally verified SNARK prover for a minimal virtual machine (like RISC-V or Cairo), then implement the EVM inside that minimal virtual machine (and formally prove its equivalence to other EVM specifications).
For multi-provers, there are two main remaining pieces. First, we need enough confidence in at least two different proof systems, both that each is reasonably secure on its own and that if they fail, they fail for different and unrelated reasons (so they do not fail simultaneously). Second, we need a very high level of assurance in the underlying logic that merges the proof systems. This piece of code is much smaller. There are ways to make it very small, such as storing funds in a multisig contract whose signers are contracts representing the individual proof systems, but this increases on-chain Gas costs. We need to find some balance between efficiency and security.
How Does It Interact with Other Parts of the Roadmap?
Moving activity to L2 can reduce MEV pressure on L1.
Cross-L2 Interoperability Improvements
What Problem Are We Solving?
A major challenge facing today's L2 ecosystem is that it is difficult for users to navigate. Moreover, the simplest ways to move around often reintroduce trust assumptions: centralized bridges, RPC clients, and so on. We need to make using the L2 ecosystem feel like using a single, unified Ethereum ecosystem.
What Is It? How Does It Work?
Cross-L2 interoperability improvements fall into many categories. In theory, a Rollup-centric Ethereum is equivalent to an execution-sharded L1. In practice, the current Ethereum L2 ecosystem still falls short of this ideal:
- Chain-specific addresses: Addresses should include chain information (L1, Optimism, Arbitrum, etc.). Once this is in place, cross-L2 sending can be implemented by simply putting the address into the "send" field, with the wallet handling how to route the transfer in the background (including using cross-chain protocols).
- Chain-specific payment requests: There should be a standardized, easy way to express messages of the form "send me X of token Y on chain Z." The two main use cases are: (i) payments between people, or between people and merchant services; (ii) DApps requesting funds. (A sketch of chain-specific addresses and payment requests follows this list.)
- Cross-chain exchanges and Gas payments: There should be a standardized open protocol for expressing cross-chain operations, such as "I will send 1 ETH on Optimism to whoever sends me 0.9999 ETH on Arbitrum" and "I will send 0.0001 ETH on Optimism to whoever includes this transaction on Arbitrum." ERC-7683 is an attempt at the former, and RIP-7755 at the latter, although both are more general than these specific use cases.
- Light clients: Users should be able to actually verify the chains they interact with, rather than just trusting RPC providers. Helios from a16z crypto does this for Ethereum itself, but we need to extend this trustlessness to L2s. ERC-3668 (CCIP-read) is one strategy for doing so.
How a light client updates its view of the Ethereum header chain. Once it has the header chain, it can use Merkle proofs to verify any state object. Once you have the correct L1 state object, you can use Merkle proofs (and signatures if you want to check pre-confirmations) to verify any state object on L2. Helios has already accomplished the former. Scaling to the latter is a standardization challenge.
- Keystore wallets: Nowadays, if you want to update the keys controlling your smart contract wallet, you must update on all N chains where that wallet exists. A keystore wallet is a technology that allows keys to exist in one place (either on L1 or potentially on L2 later), and then any L2 with a copy of the wallet can read the key from it. This means updates only need to be done once. To improve efficiency, keystore wallets require L2 to have a standardized way to read information from L1 at no cost; there are two proposals for this, L1SLOAD and REMOTESTATICCALL.
How Keystore wallets work
A more radical "shared token bridge" concept: Imagine a world where all L2s are validity proof Rollups and each slot submits to Ethereum. Even in such a world, transferring an asset from one L2 to another in native state would still require withdrawals and deposits, incurring significant L1 Gas fees. One way to solve this is to create a shared minimalist Rollup whose sole function is to maintain which L2 owns each type of token and how much balance each has, allowing these balances to be updated in bulk through a series of cross-L2 sending operations initiated by any L2. This would allow cross-L2 transfers without paying L1 gas fees for each transfer, nor would it require liquidity provider-based technologies like ERC-7683.
- Synchronous composability: Allowing synchronous calls between a specific L2 and L1, or between multiple L2s. This could improve the financial efficiency of DeFi protocols. The former can be done without any cross-L2 coordination; the latter requires shared sequencing. Based Rollups are automatically friendly to all of these techniques.
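As referenced above, here is a hedged sketch of chain-specific addresses and chain-specific payment requests. The "shortName:address" shape follows the ERC-3770 idea, but the parsing, the request structure, and the URI format below are illustrative assumptions, not any finalized standard.

```python
from dataclasses import dataclass

@dataclass
class PaymentRequest:
    chain: str        # e.g. a short chain name in the spirit of ERC-3770
    token: str        # token contract address, or "ETH"
    amount_wei: int
    recipient: str    # bare 0x... address on that chain

def parse_chain_specific_address(addr: str) -> tuple[str, str]:
    """Split a 'chain:0xaddress' string into (chain, address); validation here is minimal."""
    chain, _, account = addr.partition(":")
    if not account.startswith("0x") or len(account) != 42:
        raise ValueError("expected a chain-prefixed 20-byte hex address")
    return chain, account

def payment_request_uri(req: PaymentRequest) -> str:
    """A purely illustrative serialization of 'send me X of token Y on chain Z'.
    Real standards (e.g. ERC-7683 for cross-chain intents) define richer formats."""
    return f"pay:{req.chain}:{req.recipient}?token={req.token}&amount={req.amount_wei}"

chain, account = parse_chain_specific_address("oeth:0x1234567890abcdef1234567890abcdef12345678")
print(chain, account)
print(payment_request_uri(PaymentRequest(chain, "ETH", 10**18, account)))
```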
What Are the Links to Existing Research?
- Chain-specific addresses: ERC-3770: https://eips.ethereum.org/EIPS/eip-3770
- ERC-7683: https://eips.ethereum.org/EIPS/eip-7683
- RIP-7755: https://github.com/wilsoncusack/RIPs/blob/cross-l2-call-standard/RIPS/rip-7755.md
- Scroll keystore wallet design patterns: https://hackmd.io/@haichen/keystore
- Helios: https://github.com/a16z/helios
- ERC-3668 (sometimes referred to as CCIP-read): https://eips.ethereum.org/EIPS/eip-3668
- Justin Drake's "based (shared) preconfirmations" proposal: https://ethresear.ch/t/based-preconfirmations/17353
- L1SLOAD (RIP-7728): https://ethereum-magicians.org/t/rip-7728-l1sload-precompile/20388
- REMOTESTATICCALL in Optimism: https://github.com/ethereum-optimism/ecosystem-contributions/issues/76
- AggLayer, which includes the idea of a shared token bridge: https://github.com/AggLayer
What Still Needs to Be Done? What Are the Trade-offs?
Many of the examples above face the dilemma of when to standardize and which layers to standardize. Standardizing too early risks entrenching an inferior solution; standardizing too late risks unnecessary fragmentation. In some cases there is a short-term solution with weaker properties that is easier to implement, alongside a long-term solution that is "ultimately right" but will take years to arrive.
These tasks are not just technical issues; they are also (and may primarily be) social issues that require cooperation between L2, wallets, and L1.
How Does It Interact with Other Parts of the Roadmap?
Most of these proposals are "higher-layer" constructions, so they have little impact on L1 considerations. One exception is shared sequencing, which has significant implications for maximum extractable value (MEV).
Scaling Execution on L1
What Problem Are We Solving?
If L2 becomes very scalable and successful, but L1 can still only handle a very small volume of transactions, Ethereum may face many risks:
- The economic position of ETH as an asset becomes more precarious, which in turn affects the long-term security of the network.
- Many L2s benefit from close ties to the highly developed financial ecosystem on L1; if that ecosystem is significantly weakened, the incentive to become an L2 (rather than an independent L1) diminishes.
- It will take a long time for L2s to achieve the same security guarantees as L1.
- If an L2 fails (for example, through malicious behavior or the disappearance of its operator), users will still need to rely on L1 to recover their assets. L1 therefore needs to be strong enough to at least occasionally handle the highly complex and chaotic wind-down of an L2.
For these reasons, continuing to scale L1 itself and ensuring it can continue to accommodate an increasing number of use cases is very valuable.
What Is It? How Does It Work?
The simplest way to scale is to directly increase the Gas limit. However, this may lead L1 to become centralized, undermining another important feature of Ethereum L1: its credibility as a robust base layer. There is still debate about how sustainable it is to simply increase the Gas limit, and this will vary depending on what other technologies are implemented to make the verification of larger blocks easier (e.g., historical expiry, statelessness, L1 EVM validity proofs). Another important area that needs continuous improvement is the efficiency of Ethereum client software, which is now much more efficient than it was five years ago. An effective strategy for increasing the L1 Gas limit will involve accelerating the development of these verification technologies.
- EOF: A new EVM bytecode format that is more friendly to static analysis and enables faster implementations. Given these efficiency improvements, EOF bytecode can achieve lower gas fees.
- Multidimensional Gas pricing: Setting separate base fees and limits for computation, data, and storage can increase Ethereum L1's average capacity without increasing its maximum capacity (and thus without creating new security risks). A sketch of the idea follows this list.
- Lowering Gas costs for specific opcodes and precompiles - historically, we have often increased the Gas costs for certain underpriced operations to avoid denial-of-service attacks. More can be done to lower the Gas costs for over-priced opcodes. For example, addition is much cheaper than multiplication, but currently, the costs for ADD and MUL opcodes are the same. We can lower the cost of ADD and even make the costs for simpler opcodes like PUSH lower. EOF is overall more optimized in this regard.
- EVM-MAX and SIMD: EVM-MAX is a proposal to allow more efficient native big-number modular arithmetic as a separate module of the EVM. Values computed by EVM-MAX can only be accessed by other EVM-MAX opcodes unless deliberately exported, which leaves more room to optimize the format in which those values are stored. SIMD (single instruction, multiple data) is a proposal to allow the same instruction to be executed efficiently over an array of values. Together, the two can create a powerful coprocessor alongside the EVM for implementing cryptographic operations more efficiently. This is particularly useful for privacy protocols and for L2 proof systems, and thus helps both L1 and L2 scaling.
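As referenced in the multidimensional Gas pricing bullet above, here is a hedged sketch of the idea: computation, calldata, and blob data are priced against separate per-slot targets, each with its own EIP-1559-style base fee update. The resource names, the numbers, and the 1/8 adjustment rate are illustrative; see EIP-7706 for the actual proposal.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    base_fee: float     # current base fee for this resource, in wei per unit
    target: int         # target usage per slot
    limit: int          # hard cap per slot

def update_base_fee(r: Resource, used: int, adjustment: float = 1 / 8) -> float:
    """EIP-1559-style update applied independently per resource:
    the fee rises when usage is above target and falls when it is below."""
    delta = (used - r.target) / r.target
    return r.base_fee * (1 + adjustment * delta)

# Illustrative resources with separate targets and limits (numbers are made up).
resources = [
    Resource("execution_gas", base_fee=10.0, target=15_000_000, limit=30_000_000),
    Resource("calldata_bytes", base_fee=5.0, target=400_000, limit=800_000),
    Resource("blob_bytes", base_fee=1.0, target=3 * 125_000, limit=6 * 125_000),
]
usage = {"execution_gas": 20_000_000, "calldata_bytes": 100_000, "blob_bytes": 6 * 125_000}

for r in resources:
    print(r.name, round(update_base_fee(r, usage[r.name]), 3))
```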
These improvements will be discussed in more detail in future Splurge articles.
Finally, a third strategy is native Rollups (or enshrined rollups): essentially, creating many parallel running EVM copies, thus producing a model equivalent to what Rollup can provide, but more natively integrated into the protocol.
What Are the Links to Existing Research?
- Polynya's Ethereum L1 scaling roadmap: https://polynya.mirror.xyz/epju72rsymfB-JK52uYI7HuhJ-WzM735NdP7alkAQ
- Multidimensional Gas pricing: https://vitalik.eth.limo/general/2024/05/09/multidim.html
- EIP-7706: https://eips.ethereum.org/EIPS/eip-7706
- EOF: https://evmobjectformat.org/
- EVM-MAX: https://ethereum-magicians.org/t/eip-6601-evm-modular-arithmetic-extensions-evmmax/13168
- SIMD: https://eips.ethereum.org/EIPS/eip-616
- Native rollups: https://mirror.xyz/ohotties.eth/P1qSCcwj2FZ9cqo3_6kYI4S2chW5K5tmEgogk6io1GE
- Max Resnick's interview on the value of scaling L1: https://x.com/BanklessHQ/status/1831319419739361321
- Justin Drake discussing scaling with SNARKs and native Rollups: https://www.reddit.com/r/ethereum/comments/1f81ntr/comment/llmfi28/
What Still Needs to Be Done? What Are the Trade-offs?
L1 scaling has three strategies that can be pursued either individually or in parallel:
- Improve technologies (like client code, stateless clients, historical expiry) to make L1 easier to verify, then raise the Gas limit.
- Lower the costs of specific operations to increase average capacity without increasing worst-case risks;
- Native Rollups (i.e., creating N parallel copies of the EVM).
It is worth understanding that these different techniques involve different trade-offs. For example, native Rollups share many of the same composability weaknesses as regular Rollups: you cannot send a single transaction that synchronously performs operations across several of them, as you can with contracts on the same L1 (or L2). Raising the Gas limit undercuts other benefits that come from making L1 easier to verify, such as increasing the share of users running verifying nodes and increasing the number of solo stakers. Depending on how it is implemented, making specific operations cheaper in the EVM may increase the EVM's overall complexity.
A significant question that any L1 scaling roadmap needs to answer is: what is the ultimate vision for L1 and L2? Clearly, putting everything on L1 is absurd: potential use cases could involve hundreds of thousands of transactions per second, which would make L1 completely unmanageable (unless we adopt a native Rollup approach). However, we do need some guiding principles to ensure we do not end up in a situation where the Gas limit is increased tenfold, severely undermining the decentralization of Ethereum L1.
One perspective on the division of labor between L1 and L2
How Does It Interact with Other Parts of the Roadmap?
Bringing more users onto L1 means improving not just scale but other aspects of L1 as well. It means more MEV will stay on L1 (rather than becoming only an L2 problem), making the need to handle MEV explicitly more urgent. It also greatly increases the value of fast slot times on L1. And it depends heavily on L1 verification (the Verge) going smoothly.