In-depth analysis of two ZK vulnerabilities

CertiK
2024-05-29 19:16:41
Collection
The ZK vulnerability will allow hackers to submit forged ZK proofs to demonstrate false transactions and have the ZK proof verifier accept them.

Author: CertiK

In previous articles, we discussed the advanced formal verification of zero-knowledge proofs: how to verify a ZK instruction. By formally verifying each zkWasm instruction, we can fully validate the technical security and correctness of the entire zkWasm circuit. In this article, we will focus on the perspective of discovering vulnerabilities, analyzing specific vulnerabilities found during the auditing and verification process, as well as the experiences and lessons learned from them. For a general discussion on advanced formal verification of zero-knowledge proof (ZKP) blockchains, please refer to the article on advanced formal verification of zero-knowledge proof blockchains.

Before discussing ZK vulnerabilities, let’s first understand how CertiK conducts ZK formal verification. For complex systems like the ZK virtual machine (zkVM), the first step in formal verification (FV) is to clarify what needs to be verified and its nature. This requires a comprehensive review of the design, code implementation, and testing setup of the ZK system. This process overlaps with conventional auditing, but differs in that the verification goals and nature need to be established after the review. At CertiK, we refer to this as verification-oriented auditing. Auditing and verification work is usually a whole. For zkWasm, we conducted both auditing and formal verification simultaneously.

What are ZK vulnerabilities?

The core feature of zero-knowledge proof systems is to allow a brief cryptographic proof of computations executed offline or privately (such as blockchain transactions) to be passed to a zero-knowledge proof verifier, who checks and confirms that the computation indeed occurred as claimed. In this regard, ZK vulnerabilities would allow hackers to submit forged ZK proofs to prove false transactions and have the ZK proof checker accept them.

For the zkVM prover, the ZK proof process involves running a program, generating an execution trace for each step, and converting the execution trace data into a set of numerical tables (a process known as "arithmetization"). These numbers must satisfy a set of constraints (i.e., "circuit"), which include equations relating specific table cells, fixed constants, database lookup constraints between tables, and polynomial equations that must be satisfied between each pair of adjacent table rows (i.e., "gates"). On-chain verification can confirm that there indeed exists a table that satisfies all constraints while ensuring that the specific numbers in the table are not visible.

In-depth Analysis of Two ZK VulnerabilitieszkWasm Arithmetization Table

The accuracy of each constraint is crucial. Any error in a constraint, whether it is too weak or missing, could allow a hacker to submit a misleading proof that appears to represent a valid execution of a smart contract, but is not. The opacity of zkVM transactions amplifies these vulnerabilities compared to traditional VMs. In non-ZK chains, the computational details of transactions are publicly recorded on the blocks; however, zkVM does not store these details on-chain. The lack of transparency makes it difficult to determine the specifics of an attack, and it can even be hard to ascertain whether an attack has occurred.

The ZK circuit that executes zkVM instruction rules is extremely complex. For zkWasm, the implementation of its ZK circuit involves over 6,000 lines of Rust code and hundreds of constraints. This complexity often means that multiple vulnerabilities may be waiting to be discovered.


zkWasm Circuit Architecture

Indeed, we discovered several such vulnerabilities through the auditing and formal verification of zkWasm. Below, we will discuss two representative examples and the differences between them.

Code Vulnerability: Load8 Data Injection Attack

The first vulnerability involves the Load8 instruction of zkWasm. In zkWasm, reading heap memory is accomplished through a set of LoadN instructions, where N is the size of the data to be loaded. For example, Load64 should read 64 bits of data from the zkWasm memory address. Load8 should read 8 bits of data (i.e., one byte) from memory and pad it with zeros to create a 64-bit value. Internally, zkWasm represents memory as an array of 64-bit bytes, so it needs to "select" a portion of the memory array. Four intermediate variables (u16_cells) are used for this purpose, which together form the complete 64-bit loaded value.

The constraints for these LoadN instructions are defined as follows:

In-depth Analysis of Two ZK Vulnerabilities

This constraint is divided into three cases: Load32, Load16, and Load8. Load64 has no constraints because the memory unit is exactly 64 bits. For the Load32 case, the code ensures that the high 4 bytes (32 bits) of the memory unit must be zero.

In-depth Analysis of Two ZK Vulnerabilities

For the Load16 case, the high 6 bytes (48 bits) of the memory unit must be zero.

In-depth Analysis of Two ZK Vulnerabilities

For the Load8 case, it should require that the high 7 bytes (56 bits) of the memory unit be zero. Unfortunately, this is not the case in the code.

In-depth Analysis of Two ZK Vulnerabilities

As you can see, only the high 9 to 16 bits are constrained to be zero. The other high 48 bits can be arbitrary values, yet still masquerade as "read from memory."

By exploiting this vulnerability, a hacker can tamper with the ZK proof of a legitimate execution sequence, causing the Load8 instruction to load these unexpected bytes, leading to data corruption. Moreover, by carefully arranging surrounding code and data, it may trigger false executions and transfers, thereby stealing data and assets. This forged transaction can pass the zkWasm checker’s verification and be mistakenly recognized as a real transaction by the blockchain.

Fixing this vulnerability is actually quite simple.

In-depth Analysis of Two ZK Vulnerabilities

This vulnerability represents a class of ZK vulnerabilities known as "code vulnerabilities," as they stem from the code writing and can be easily fixed with minor localized code modifications. As you might agree, these vulnerabilities are also relatively easier to spot.

Design Vulnerability: Forged Return Attack

Let’s look at another vulnerability, this time concerning the calls and returns in zkWasm. Calls and returns are fundamental VM instructions that allow one running context (such as a function) to call another and resume the execution of the caller's context after the callee completes its execution. Each call is expected to return later. The dynamic data tracked during the lifecycle of calls and returns in zkWasm is referred to as the "call frame." Since zkWasm executes instructions sequentially, all call frames can be sorted based on the time they occurred during execution. Below is an example of call/return code running on zkWasm.

Users can call the buytoken() function to purchase tokens (possibly by paying or transferring something of value). One of its core steps is to increase the token account balance by 1 through calling the addtoken() function. Since the ZK prover itself does not support call frame data structures, it needs to use the execution table (E-Table) and jump table (J-Table) to record and track the complete history of these call frames.

In-depth Analysis of Two ZK Vulnerabilities

The above diagram illustrates the execution process of buytoken() calling addtoken() and the process of returning from addtoken() to buytoken(). It can be seen that the token account balance increased by 1. In the execution table, each execution step occupies a row, which includes the current execution's call frame number, the name of the current context function (for illustration purposes only), the instruction number currently running within that function, and the current instruction stored in the table (for illustration purposes only). In the jump table, each call frame occupies a row, storing its caller frame number, caller function context name (for illustration purposes only), and the next instruction position of the caller frame (so that the frame can return). In both tables, there is a jops column that tracks whether the current instruction is a call/return (in the execution table) and the total number of call/return instructions that occurred in that frame (in the jump table).

As one would expect, each call should have a corresponding return, and each frame should have exactly one call and one return. As shown in the diagram, the jops value of frame 1 in the jump table is 2, corresponding to the jops values of 1 in the first and third rows of the execution table. Everything seems normal at this point.

However, there is a problem: while one call and one return would make the jops count for the frame equal to 2, two calls or two returns would also make the count equal to 2. Having two calls or two returns in a frame may sound absurd, but keep in mind that this is precisely what hackers aim to do by breaking expectations.

You might be getting a bit excited now, but have we really found the problem?

It turns out that two calls are not an issue because the constraints of the execution table and call table prevent two calls from being encoded into the same row of a frame, as each call generates a new frame number, which is the current call frame number plus 1.

However, the situation with two returns is less fortunate: since no new frame is created upon return, a hacker could indeed obtain the execution table and call table of a legitimate execution sequence and inject forged returns (along with the corresponding frames). For example, the earlier execution table and call table of buytoken() calling addtoken() could be tampered with by a hacker to the following situation:

In-depth Analysis of Two ZK Vulnerabilities

The hacker injected two forged returns between the original call and return in the execution table and added a new forged frame row in the call table (the original return and subsequent instruction step numbers in the execution table would need to be incremented by 4). Since the jops count for each row in the call table is 2, it satisfies the constraints, and the zkWasm proof checker will accept the "proof" of this forged execution sequence. As can be seen from the diagram, the token account balance increased 3 times instead of 1 time. Thus, the hacker could obtain 3 tokens for the price of 1 token.

There are several ways to solve this problem. One obvious method is to separately track calls and returns and ensure that each frame has exactly one call and one return.

You may have noticed that so far we have not shown a single line of code for this vulnerability. The main reason is that there is no problematic line of code; the code implementation fully complies with the table and constraint design. The issue lies in the design itself, and so does the fix.

You might think that this vulnerability should be obvious, but in reality, it is not. This is because there is a gap between "two calls or two returns can also lead to a jops count of 2" and "in fact, two returns are possible." The latter requires a detailed and complete analysis of the various constraints related to the execution table and call table, making it difficult to conduct a complete non-formal reasoning.

Comparison of the Two Vulnerabilities

For the "Load8 Data Injection Vulnerability" and the "Forged Return Vulnerability," both could allow hackers to manipulate ZK proofs, create false transactions, deceive proof checkers, and conduct theft or hijacking. However, their nature and the way they were discovered are entirely different.

The "Load8 Data Injection Vulnerability" was discovered during the auditing of zkWasm. This was no easy task, as we had to review over 6,000 lines of Rust code and hundreds of constraints of zkWasm instructions. Nevertheless, this vulnerability was relatively easy to spot and confirm. Since this vulnerability was fixed before formal verification began, it was not encountered during the verification process. If it had not been discovered during the auditing process, we could expect it to be found during the verification of the Load8 instruction.

The "Forged Return Vulnerability" was discovered during formal verification after the audit. One reason we failed to find it during the audit is that there is a mechanism in zkWasm very similar to jops called "mops," which tracks the memory access instructions corresponding to the historical data of each memory unit during zkWasm execution. The counting constraints for mops are indeed correct, as it only tracks one type of memory instruction, namely memory writes; furthermore, the historical data of each memory unit is immutable and will only be written once (mops count is 1). However, even if we had noticed this potential vulnerability during the audit, we would still not have been able to easily confirm or rule out every possible scenario without strict formal reasoning on all relevant constraints, as there is no erroneous line of code.

In summary, these two vulnerabilities belong to "code vulnerabilities" and "design vulnerabilities," respectively. Code vulnerabilities are relatively superficial and easier to discover (erroneous code) and easier to reason and confirm; design vulnerabilities can be very subtle, harder to detect (no "erroneous" code), and more difficult to reason and confirm.

Best Practices for Discovering ZK Vulnerabilities

Based on our experience in auditing and formal verification of zkVM and other ZK and non-ZK chains, here are some recommendations on how to best protect ZK systems:

Check Code and Design

As mentioned earlier, vulnerabilities can exist in both the code and design of ZK. Both types of vulnerabilities can compromise the ZK system, so they must be eliminated before the system goes live. One issue with ZK systems compared to non-ZK systems is that any attack is harder to expose and analyze because its computational details are not publicly available or retained on-chain. Therefore, people may know that a hacking attack occurred but cannot ascertain how it happened at a technical level. This makes the cost of any ZK vulnerability very high. Correspondingly, the value of ensuring the security of ZK systems in advance is also very high.

Conduct Audits and Formal Verification

The two vulnerabilities we introduced here were discovered through auditing and formal verification, respectively. Some may think that using formal verification eliminates the need for auditing since all vulnerabilities will be discovered through formal verification. In fact, our recommendation is to conduct both. As explained at the beginning of this article, high-quality formal verification work begins with a thorough review, inspection, and informal reasoning of the code and design; this work itself overlaps with auditing. Additionally, discovering and eliminating simpler vulnerabilities during the audit will make formal verification simpler and more efficient.

If you want to conduct both auditing and formal verification of a ZK system, the best timing is to perform both tasks simultaneously so that auditors and formal verification engineers can collaborate efficiently (which may uncover more vulnerabilities since the objects and goals of formal verification require high-quality audit input).

If your ZK project has already undergone an audit (great) or multiple audits (very great), our recommendation is to conduct formal verification of the circuit based on the previous audit results. Our experience in auditing and formal verification of zkVM and other ZK and non-ZK projects has repeatedly shown that verification often captures vulnerabilities that are missed and hard to detect during audits. Due to the nature of ZKP, while ZK systems should provide better blockchain security and scalability than non-ZK solutions, the criticality of their own security and correctness is far higher than that of traditional non-ZK systems. Therefore, the value of high-quality formal verification for ZK systems is also much greater than that for non-ZK systems.

Ensure the Security of Circuits and Smart Contracts

ZK applications typically consist of two parts: circuits and smart contracts. For zkVM-based applications, there are general zkVM circuits and smart contract applications. For non-zkVM-based applications, there are application-specific ZK circuits and corresponding smart contracts deployed on L1 chains or on the other side of bridges.

Our auditing and formal verification work on zkWasm only involved the zkWasm circuit. From the perspective of the overall security of ZK applications, it is also very important to audit and formally verify their smart contracts. After all, after investing significant effort to ensure the security of the circuit, it would be very unfortunate if the application ultimately suffers due to laxity in the smart contract area.

There are two types of smart contracts that deserve special attention. The first type is smart contracts that directly handle ZK proofs. Although they may not be large in scale, their risks are very high. The second type is medium to large smart contracts running on zkVM. We know that they can sometimes be very complex, and the most valuable among them should be audited and verified, especially since their execution details cannot be seen on-chain. Fortunately, after years of development, formal verification of smart contracts is now practical and ready for suitable high-value targets.

Let’s summarize the impact of formal verification (FV) on ZK systems and their components through the following illustration.

In-depth Analysis of Two ZK Vulnerabilities

ChainCatcher reminds readers to view blockchain rationally, enhance risk awareness, and be cautious of various virtual token issuances and speculations. All content on this site is solely market information or related party opinions, and does not constitute any form of investment advice. If you find sensitive information in the content, please click "Report", and we will handle it promptly.
ChainCatcher Building the Web3 world with innovators