A Detailed Explanation of Blockchain Data Availability Solutions
Original title: Data Availability in Blockchains
Original author: YQ, founder of AltLayer
Translation: Qianwen, ChainCatcher
Data availability refers to ensuring that all participants in a blockchain network have access to the complete set of transaction data contained in a block. This concept is crucial for maintaining security, especially as blockchain systems scale to handle larger transaction volumes.
New designs such as sharding, roll-ups, and light clients move away from having every node process all transactions; instead, execution is distributed across shards or roll-up chains. This spreads the workload and increases throughput, but it also means that no single node sees all the data. Individual nodes can therefore no longer fully verify every transaction, and if some transaction data is lost or withheld, they cannot generate fraud/validity proofs. Light clients are particularly vulnerable when data availability is not guaranteed.
Therefore, ensuring the accessibility of necessary data has become a key challenge for blockchain scalability. Various technologies are emerging to provide this guarantee without incurring excessive redundancy overhead.
Data Availability Issues
In traditional proof-of-work blockchains like Bitcoin and (pre-PoS) Ethereum, each block contains a metadata header and a list of transactions. Full nodes in these networks download and verify every transaction in every block, independently executing each transaction and checking its validity against the blockchain's protocol rules. This independent execution lets full nodes compute the current state needed to validate and process the next block. Because full nodes perform this execution and verification, they enforce the critical transaction validity rules, preventing miners or block producers from including invalid transactions in blocks.
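As a toy illustration of this independent execution (a hypothetical account-balance model, not any real client's logic), the sketch below shows a full node re-executing every transaction in a block and rejecting the block on the first rule violation:

```python
from dataclasses import dataclass

@dataclass
class Tx:
    sender: str
    recipient: str
    amount: int

def apply_block(state: dict, txs: list) -> dict:
    """Re-execute every transaction; reject the whole block on any rule violation."""
    new_state = dict(state)
    for tx in txs:
        # Validity rules the full node enforces independently of consensus:
        if tx.amount <= 0 or new_state.get(tx.sender, 0) < tx.amount:
            raise ValueError(f"invalid tx: {tx}")
        new_state[tx.sender] -= tx.amount
        new_state[tx.recipient] = new_state.get(tx.recipient, 0) + tx.amount
    return new_state

state = {"alice": 10}
state = apply_block(state, [Tx("alice", "bob", 4)])  # valid block: state advances
# apply_block(state, [Tx("bob", "carol", 100)])      # overspend: block rejected
```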
Light Clients
Light clients, also known as SPV ("Simplified Payment Verification") clients, take a different approach from full nodes to save bandwidth and storage. SPV clients download and verify only block headers, without executing or verifying any transactions. Instead, SPV clients rely on an assumption: the chain preferred by the blockchain's consensus algorithm (the longest chain, in Bitcoin's case) contains only valid blocks that correctly follow the protocol rules. In this way, SPV clients outsource the actual transaction execution and verification to the blockchain's own consensus mechanism.
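As a minimal sketch of the SPV idea, the light client below holds only a block header's Merkle root and verifies a transaction's inclusion from a short proof. The double-SHA256 loosely follows Bitcoin's convention, but the proof format is simplified for illustration:

```python
import hashlib

def h(data: bytes) -> bytes:
    # Double-SHA256, as used for Bitcoin's Merkle trees
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def verify_inclusion(tx_hash: bytes, proof: list, merkle_root: bytes) -> bool:
    """`proof` is a list of (sibling_hash, sibling_is_left) pairs, leaf to root."""
    node = tx_hash
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == merkle_root

# Tiny two-leaf tree: the client never sees tx_b itself, only its hash.
tx_a, tx_b = h(b"tx_a"), h(b"tx_b")
root = h(tx_a + tx_b)  # this root lives in the block header the client holds
assert verify_inclusion(tx_a, [(tx_b, False)], root)
```

Note that the proof establishes only that the transaction is in the block, not that the transaction is valid; validity is exactly what the SPV client outsources to consensus.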
The security model of SPV clients fundamentally depends on an "honest majority" among the participants: for example, miners in a proof-of-work blockchain must correctly apply transaction validity rules and reject invalid blocks proposed by a minority. If a dishonest majority of miners or block producers colludes, they can create blocks with invalid state transitions, minting tokens out of thin air, stealing funds, or engaging in other forms of theft or exploitation. SPV clients cannot detect such malicious behavior themselves because they do not actually verify transactions. Full nodes, in contrast, enforce all protocol rules regardless of the consensus mechanism, so they immediately reject invalid blocks even when those blocks are produced by a dishonest majority.
To strengthen the security assumptions of SPV clients, a mechanism called fraud/validity proofs enables full nodes to generate cryptographic proofs showing light clients that a given block contains invalid state transitions. Upon receiving a valid fraud/validity proof, a light client can reject an invalid block even if the consensus mechanism has erroneously accepted it.
However, fraud/validity proofs fundamentally require that the full nodes generating these proofs have access to the complete set of transaction data referenced in the block to re-execute transactions and identify any invalid state changes. If block producers selectively publish only the block headers without releasing the complete dataset of transactions for a specific block, then full nodes will not have the information needed to construct fraud/validity proofs. This situation where the network cannot access transaction data is referred to as the "data availability problem."
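The dependence on data is easy to see in a toy fraud proof, sketched below under the same hypothetical account model as before: the proof bundles the pre-state, the block's transactions, and the producer's claimed post-state root, and verification re-executes the transactions, which is impossible if the transaction data is withheld:

```python
import hashlib, json

def state_root(state: dict) -> str:
    # Toy state commitment: hash of the canonically serialized state
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()

def re_execute(state: dict, txs: list) -> dict:
    new_state = dict(state)
    for sender, recipient, amount in txs:
        if amount <= 0 or new_state.get(sender, 0) < amount:
            raise ValueError("invalid transaction")
        new_state[sender] -= amount
        new_state[recipient] = new_state.get(recipient, 0) + amount
    return new_state

def fraud_proven(pre_state: dict, txs: list, claimed_root: str) -> bool:
    """True if re-execution shows the producer's claimed root is wrong."""
    try:
        post_state = re_execute(pre_state, txs)
    except ValueError:
        return True  # the block contains an outright invalid transaction
    return state_root(post_state) != claimed_root

# A producer claims a root that credits bob 100 tokens out of thin air:
pre = {"alice": 10, "bob": 0}
bad_root = state_root({"alice": 10, "bob": 100})
assert fraud_proven(pre, [("alice", "bob", 4)], bad_root)  # needs the tx data!
```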
Without guarantees of data availability, light clients have no choice but to trust that the behavior of block producers is honest and correct. This complete reliance on trust undermines the purpose of fraud/validity proofs and compromises the security advantages of the light client model. Therefore, data availability is crucial for maintaining the expected security and validity of fraud/validity proofs in blockchain networks, especially as they scale to larger transaction volumes.
The Demand for Data Availability in Scalability Solutions
Beyond existing networks' demand for data availability, DA becomes even more critical in new scalability solutions like sharding and roll-ups, which aim to increase transaction throughput. Many initiatives and projects, such as proto-danksharding (EIP-4844), Celestia, EigenDA, and Avail, are making significant progress in providing efficient and economical DA for roll-ups.
In sharded blockchain architectures, the single network of validators is divided into smaller groups, or "shards," each of which processes and validates only a portion of all transactions. Since shards do not process or validate transactions from other shards, a node in one shard can only access the transaction data specific to its own shard.
In roll-ups, transaction execution occurs off-chain in an optimized environment, significantly increasing throughput. Roll-up operators periodically publish only compressed, aggregated transaction data to the L1 main chain. This reduces fees and congestion on L1 compared to executing every transaction there directly.
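As a rough illustration of why batching and compression cut L1 data costs (not any real roll-up's encoding, which is far more compact than JSON plus zlib), a batch of toy transfers serialized and compressed before being posted to L1:

```python
import json, zlib

# A batch of 1,000 toy transfers (field names are illustrative).
batch = [{"from": i, "to": i + 1, "amount": 1, "nonce": i} for i in range(1000)]
raw = json.dumps(batch).encode()
compressed = zlib.compress(raw, level=9)
print(f"{len(raw)} bytes raw -> {len(compressed)} bytes posted to L1")
```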
In both sharding and roll-ups, no single node can verify, or even observe, all transactions in the system. The data availability assumptions of traditional monolithic blockchains no longer hold. If a roll-up sequencer withholds the full transaction dataset of a roll-up block, or if a group of colluding malicious validators produces an invalid block within a shard, full nodes on other shards or on L1 cannot access the missing data. Without it, they cannot generate fraud/validity proofs flagging invalid state transitions, because they lack the data needed to identify the problem.
Unless robust new methods ensure data availability, malicious actors may exploit these scalability models to selectively hide invalid transactions while keeping enough of a block visibly valid to avoid detection. Users would have to trust that shard nodes and roll-up operators always act honestly, but trusting a large number of distributed participants to behave honestly at all times is risky, and it is precisely the situation blockchains aim to avoid through incentive mechanisms, decentralization, and cryptographic techniques.
For cross-shard transactions and L2 solutions, preserving the expected security of the light-client model and of fraud/validity proofs requires that the complete set of transaction data always be accessible somewhere in the network. Not every node in every shard needs to download the data, but any participant wishing to verify blocks and generate fraud/validity proofs about potential issues must be able to access it at all times.
Data Availability Solutions
Many methods have been proposed and explored to provide data availability without requiring every node in a sharded or L2 network to download and store the complete transaction dataset:
Data Availability Sampling
Data availability sampling refers to a class of techniques that allow light clients to download random fragments of the entire transaction dataset to probabilistically check whether the transaction data is available. Projects like proto-danksharding, Celestia, EigenDA, and Avail have experimented with various new technologies, such as KZG commitments and ZK proofs, to achieve better sampling.
Typically, data availability sampling schemes rely on erasure coding, which mathematically transforms the complete transaction dataset into a longer coded dataset by adding redundancy. As long as a sufficient subset of the coded fragments is available, the original data can be reconstructed by reversing the transformation.
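To make the erasure-coding idea concrete, here is a toy Reed-Solomon-style sketch over a prime field, treating data chunks as evaluations of a polynomial so that any k of the n coded chunks suffice to reconstruct the data. The field, chunk sizes, and parameters are all simplified for illustration and do not match any production scheme:

```python
P = 2**61 - 1  # prime field modulus (a toy choice)

def lagrange_eval(points, x):
    """Evaluate at x the unique degree-(k-1) polynomial through k `points`."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return total

def encode(chunks, n):
    """Data chunks are evaluations at x = 0..k-1; extend to x = k..n-1."""
    points = list(enumerate(chunks))
    return chunks + [lagrange_eval(points, x) for x in range(len(chunks), n)]

def reconstruct(available, k):
    """`available` is ANY k (index, value) pairs; returns the original chunks."""
    return [lagrange_eval(available, x) for x in range(k)]

data = [7, 13, 42, 99]                  # k = 4 original chunks
coded = encode(data, 8)                 # n = 8: a rate-1/2 extension
subset = [(1, coded[1]), (4, coded[4]), (6, coded[6]), (7, coded[7])]
assert reconstruct(subset, 4) == data   # any 4 of the 8 chunks suffice
```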
Light clients randomly obtain and verify a small number of erasure-coded data fragments. If any sampled fragment is lost or unavailable, it indicates that the entire network may not have access to the complete erasure-coded dataset. The more samples a client can collect from random fragments of the dataset, the greater the likelihood that the client will detect any missing data. The parameters of erasure coding can be adjusted so that light clients only need to randomly sample a very small proportion of fragments (about 1%) to verify the availability of the complete dataset with extremely high statistical confidence.
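As a back-of-the-envelope illustration of why so few samples suffice, consider a rate-1/2 code as in the sketch above (a hypothetical parameter choice): an adversary must withhold at least half of the coded chunks to prevent reconstruction, so each uniformly random sample hits withheld data with probability at least 1/2. A minimal calculation, assuming sampling with replacement for simplicity:

```python
def detection_confidence(samples: int, withheld_fraction: float = 0.5) -> float:
    """P(at least one random sample hits a withheld chunk)."""
    return 1 - (1 - withheld_fraction) ** samples

for s in (10, 20, 30):
    print(s, "samples ->", detection_confidence(s))
```

Thirty samples already give roughly 1 - 2^-30 confidence, independent of the dataset's size, which is what makes sampling so much cheaper than downloading everything.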
This general approach allows light clients to efficiently check the availability of even very large transaction datasets without downloading them in full. Sampled fragments are also shared with full nodes, which can use them to reconstruct and restore any missing portions of the data if necessary.
Data Availability Committees
Data availability solutions based on committees assign the responsibility for attesting to transaction data availability to a relatively small group of trusted nodes known as Data Availability Committees (DACs). Committee nodes store complete copies of block transaction data and attest that the data is fully available by publishing cryptographic signatures on the main chain. Light clients can then verify these signatures cheaply, gaining assurance that the data is available without having to handle or store it themselves.
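As a minimal sketch of the committee check, the toy below uses m-of-n HMAC attestations over a data root. Real DACs use digital signatures (for example BLS); HMACs over per-member shared keys stand in here purely to keep the sketch self-contained, and all keys and names are illustrative:

```python
import hmac, hashlib

MEMBER_KEYS = {f"member{i}": f"secret{i}".encode() for i in range(5)}  # toy keys
THRESHOLD = 3  # m-of-n: at least 3 of 5 members must attest

def attest(member: str, data_root: bytes) -> bytes:
    return hmac.new(MEMBER_KEYS[member], data_root, hashlib.sha256).digest()

def data_is_attested(data_root: bytes, attestations: dict) -> bool:
    """Count valid attestations and check the quorum threshold."""
    valid = sum(
        1 for member, sig in attestations.items()
        if member in MEMBER_KEYS
        and hmac.compare_digest(sig, attest(member, data_root))
    )
    return valid >= THRESHOLD

root = hashlib.sha256(b"block 42 transaction data").digest()
sigs = {m: attest(m, root) for m in ("member0", "member2", "member4")}
assert data_is_attested(root, sigs)  # light client trusts the committee quorum
```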
The fundamental trade-off of data availability committees is that light clients must trust committee nodes to signal data availability correctly. Relying on a centralized, permissioned committee introduces a degree of centralization risk and a potential single point of failure. Building DACs from proof-of-stake validators and imposing severe penalties for misconduct can reduce, but not completely eliminate, the trust light clients must place in the committee.
Data Sharding
In data sharding schemes, transaction data is divided into multiple shards, and light clients perform probabilistic sampling from all shards to verify the data availability of the entire system. However, implementing cross-shard sampling often significantly increases the complexity of data availability protocols and may require complex network topologies to prevent single points of failure.
Succinct Proofs
Emerging succinct cryptographic proofs such as zero-knowledge proofs and zk-SNARKs can prove the validity of a block's state transitions without revealing any of the underlying transaction data. For example, a validity proof can demonstrate that a roll-up block's state transitions are entirely valid without disclosing any private transaction data used by the roll-up itself.
Fundamentally, however, the data must still be available somewhere for full nodes to update their local state accurately. If block producers fail to publish a block's underlying transaction data, full nodes cannot track the latest state balances or verify their integrity. Succinct proofs can guarantee the validity of state changes but cannot ensure the availability of the underlying data driving those changes.
Conclusion
As blockchain transaction volumes grow and networks transition to advanced architectures like sharding and roll-ups, data availability is a critical challenge that must be addressed. Encouragingly, as decentralized blockchain networks develop, multiple viable pathways to data availability are emerging, preventing this issue from becoming a permanent barrier to scalability and censorship resistance.