The Pillars of Web3: An Overview of the Decentralized Storage Ecosystem

Foresight News
2022-06-23 19:00:40
Filecoin, Arweave, Storj, Crust Network, Sia, and Swarm: which is the best decentralized storage solution?

Written by: 0xPhillan, Fundamental Labs

Translated by: Tia, ForesightNews

If we want to advance further in the decentralized internet, we will ultimately need these three pillars: consensus, storage, and computation. If humanity successfully decentralizes these three areas, we will embark on the next journey of the internet: Web3.

Figure 1: Examples of projects for each Web3 pillar

Storage, the second pillar, is maturing rapidly, with various storage solutions already serving real use cases. This article explores the pillar of decentralized storage in more depth.

This article is a summary of the full-length piece, which can be downloaded from the decentralized storage networks Arweave and Crust Network.

The Need for Decentralized Storage

The Blockchain Perspective

From the blockchain perspective, we need decentralized storage because blockchains are not designed to store large volumes of data. The mechanism for reaching consensus on blocks relies on small amounts of data (transactions) being arranged into blocks (collections of transactions) and quickly shared across the network for validation by nodes.

First, storing data in blocks is very expensive. At the time of writing, storing the complete BAYC #3368 on a layer 1 chain costs over $18,000.

Figure 2: Projects with active mainnets. Choosing a 200-year storage term to meet Arweave's definition of permanence. Source: Network documentation, Arweave storage calculator

Second, if we want to store a large amount of arbitrary data in these blocks, network congestion will become severe, leading to gas wars that drive up prices for everyone using the network. This is a consequence of the implicit time value of blocks: users who need their transactions included at a specific time must pay extra gas fees to have them prioritized.

Therefore, it is recommended to store NFT metadata and image data, as well as the front end of dApps, off-chain.

The Centralized Network Perspective

If on-chain data storage is so expensive, why not store data directly off-chain in a centralized network?

Centralized networks are susceptible to censorship, and the data they hold is mutable. Users must trust data providers to keep the data safe. No one can guarantee that the operators of a centralized network will not betray that trust: data may be erased intentionally or accidentally, for example when a data provider changes its policies, suffers a hardware failure, or is attacked by a third party.

NFTs

With the floor prices of some NFT collections exceeding $100,000 and some NFT images worth as much as $70,000 per KB of image data, promises alone are not enough to ensure the data is available at all times. Stronger guarantees are needed to ensure the immutability and permanence of the underlying NFT data.

Figure 3: CryptoPunks floor price based on the last sale (no floor price at the time of writing); CryptoPunks image size based on the byte length of the on-chain byte string of CryptoPunks V2. Data as of May 10, 2022. Source: OpenSea, on-chain data, IPFS metadata

NFTs do not contain any image data; instead, they have a pointer to off-chain stored metadata and image data. But it is precisely this metadata and image data that need protection; if this data disappears, the NFT becomes an empty container.

Figure 4: A simplified illustration of blockchain, blocks, NFTs, and off-chain metadata

It could be argued that the value of NFTs is driven not primarily by the metadata and image data they point to, but by the community of collectors that builds activity around the collectibles. While this may be correct, without the underlying data the NFTs are meaningless, and no meaningful community can form around them.

In addition to profile pictures and art collectibles, NFTs can also represent ownership of real-world assets, such as real estate or financial instruments. Such data has external real-world value, and since it is represented by NFTs, the value of preserving every byte of NFT data cannot be lower than the value of on-chain NFTs.

dApps

If NFTs are goods existing on the blockchain, then dApps can be considered services that exist on the blockchain and facilitate interaction with it. dApps are a combination of off-chain front-end user interfaces and smart contracts that exist on the network and interact with the blockchain. Sometimes they also have a simple back end that can offload certain computations to reduce the gas required, thereby lowering the costs incurred by end users for certain transactions.

Figure 5: A simplified illustration of dApps interacting with the blockchain

Although the value of dApps should be considered based on their context (e.g., DeFi, GameFi, social, metaverse, name services, etc.), the value brought by dApps is astonishing. In the past 30 days at the time of writing, the top 10 dApps ranked on DappRadar collectively facilitated over $150 billion in transfers.

Figure 6: The most popular dApps by dollar amount reported by DappRadar as of May 11, 2022

Although the core mechanisms of dApps are executed by smart contracts, end users reach them through the front end. In a sense, therefore, ensuring the accessibility of the dApp front end means ensuring the availability of the underlying service.

Figure 7: Aave founder Stani Kulechov stated on Twitter that the Aave dApp front end went offline on January 20, 2022, but can still be accessed via an IPFS-hosted website copy

Decentralized storage reduces the risk that server failures, DNS hacks, or centralized entities cut off access to the dApp front end. Even if development of the dApp stops, users can continue to reach the smart contracts through the front end.

The Landscape of Decentralized Storage

Blockchains like Bitcoin and Ethereum exist primarily to facilitate value transfer. Some decentralized storage networks have adopted this approach too: they use native blockchains to record and track storage orders, which represent a transfer of value in exchange for storage services. However, this is just one of many possible approaches. The storage field is vast, and over the years different solutions with varying trade-offs and use cases have emerged.

Figure 8: Overview of some arbitrarily selected decentralized storage protocols (non-exhaustive)

Despite many differences, all of the above projects share one thing: unlike the Bitcoin and Ethereum blockchains, these networks do not replicate all data across all nodes. In decentralized storage networks, the immutability and availability of stored data are not achieved by having the majority of the network store all data and validate sequentially linked data, as is the case with Bitcoin and Ethereum. Instead, as mentioned earlier, many networks use a blockchain only to track storage orders.

It is unsustainable for all nodes on a decentralized storage network to store all data, as the overhead costs of running the network would rapidly increase users' storage costs and ultimately drive the network towards centralization around the few node operators who can afford the hardware.

Therefore, decentralized storage networks need to overcome some unique challenges.

Challenges of Decentralized Storage

Reflecting on the previously mentioned limitations of on-chain data storage, it is clear that decentralized storage networks must store data in a way that does not affect the network's value transfer mechanism while ensuring that the data remains persistent, immutable, and accessible. Essentially, decentralized storage networks must be able to store, retrieve, and maintain data while ensuring that all participants in the network are incentivized for their storage and retrieval work, while also maintaining the trustlessness of the decentralized system.

These challenges can be summarized into the following questions:

  • Data storage format: Store complete files or file fragments?
  • Data replication: Across how many nodes should data (complete files or fragments) be stored?
  • Storage tracking: How does the network know where to retrieve files?
  • Proof of stored data: Do nodes store the data they are required to store?
  • Data availability over time: Is the data still stored over time?
  • Storage price discovery: How is the cost of storage determined?
  • Persistent data redundancy: How does the network ensure data remains available if nodes leave the network?
  • Data transmission: Network bandwidth is costly—how does the network ensure nodes retrieve data when asked?
  • Network token economics: Besides ensuring data is available on the network, how does the network ensure its long-term existence?

Part of this study explores in detail the mechanisms each network uses and the trade-offs these mechanisms make in order to achieve decentralization.

Figure 9: Summary of technical design decisions for the reviewed storage networks

For an in-depth comparison of how the above networks address each challenge, as well as detailed profiles of each network, please refer to the full research article available on Arweave or Crust Network.

Data Storage Format

Figure 10: Data replication and erasure coding

In these networks, there are two main approaches to storing data on the network: storing complete files and using erasure coding. Arweave and Crust Network store complete files, while Filecoin, Sia, Storj, and Swarm use erasure coding. In erasure coding, data is broken down into fixed-size fragments, each of which is expanded and encoded with redundant data. The redundant data saved in each fragment allows only a subset of the fragments to be needed to reconstruct the original file.
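The idea that only a subset of fragments is needed can be illustrated with a deliberately tiny scheme. The sketch below is an assumption made for illustration, not the encoding any of these networks actually uses: it splits a file into two fragments and adds a single XOR parity fragment, so any two of the three fragments are enough to rebuild the original. Production systems use more general codes (Reed-Solomon, for example) with many more fragments, and the function names here are hypothetical.

```python
from typing import List, Optional


def encode(data: bytes) -> List[bytes]:
    """Split data into two halves and append one XOR parity fragment."""
    if len(data) % 2:
        data += b"\x00"  # pad so both halves have equal length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]


def decode(fragments: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the original bytes from any two of the three fragments."""
    a, b, parity = fragments
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))
    return (a + b)[:length]  # drop the padding byte, if any
```

In this toy scheme, losing any one of the three fragments still allows full recovery, at the cost of storing 1.5 times the original size.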

Data Replication

In Filecoin, Sia, Storj, and Swarm, the network determines the number of erasure-coded fragments and the amount of redundant data stored in each fragment. However, Filecoin also allows users to set the replication factor, which decides how many separate physical devices the erasure-coded fragments are replicated on as part of a storage transaction with a single storage miner. Users who want to store files with different storage miners must conduct separate storage transactions. Crust and Arweave let the network decide replication, although on Crust the replication factor can also be set manually. On Arweave, the storage proof mechanism incentivizes nodes to store as much data as possible, so the upper limit for replication on Arweave is the total number of storage nodes on the network.

Figure 11: The method of data storage and replication will affect retrieval and reconstruction

The methods used for storing and replicating data will impact how the network retrieves data.

Storage Tracking

Once data is distributed across the nodes in the network, the network needs to be able to track the stored data. Filecoin, Crust, and Sia use native blockchains to track storage orders, while storage nodes also maintain a local list of network locations. Arweave uses a blockchain-like structure. Unlike on blockchains such as Bitcoin and Ethereum, nodes on Arweave can decide for themselves whether to store the data from a block. As a result, the chains held by different Arweave nodes will not be identical: some blocks may be missing on certain nodes while present on others.

Figure 12: Illustration of three nodes in blockweave

Finally, Storj and Swarm use two completely different approaches. In Storj, a second type of node called a satellite node acts as a coordinator for a group of storage nodes to manage and track the storage locations of data. In Swarm, the address of the data is directly embedded in the data blocks. When retrieving data, the network knows where to look based on the data itself.

Proof of Stored Data

When it comes to proving data storage, each network adopts its own approach. Filecoin uses proof of replication, a proprietary storage proof mechanism that first stores data on storage nodes and then seals the data in a sector. The sealing process makes two replicas of the same data provably distinct from each other, ensuring the correct number of copies are stored on the network (hence "proof of replication").

Crust breaks a piece of data into many small chunks, which are hashed into a Merkle tree. By comparing the hashes of the individual chunks stored on physical storage devices with the expected Merkle tree hash value, Crust can verify whether the file has been stored correctly. This is similar to Sia's approach, with the difference being that Crust stores the entire file on each node, while Sia stores erasure-coded fragments. Crust can store the entire file on a single node and still achieve privacy by using a Trusted Execution Environment (TEE), a sealed hardware component that even the hardware owner cannot access.

Crust refers to this storage proof algorithm as "proof of meaningful work," where "meaningful" indicates that new hash values are only calculated when changes are made to the stored data, thus reducing meaningless operations. Both Crust and Sia store the Merkle tree root hash on the blockchain as the source of truth for verifying data integrity.

Storj checks whether data has been stored correctly through data audits. Data audits are similar to how Crust and Sia use Merkle trees to verify data fragments. On Storj, once enough nodes return their audit results, the network can determine which nodes are faulty based on the majority response, rather than by comparing against a source of truth on a blockchain. This design is intentional: the developers believe that reducing network-wide coordination through a blockchain improves performance in terms of speed (no need to wait for consensus) and bandwidth usage (no need for the entire network to regularly interact with the blockchain).
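The Merkle-tree check described above can be sketched as follows. This is a generic illustration under stated assumptions, not the exact construction used by Crust or Sia: SHA-256, the chunk contents, the rule of duplicating the last hash on odd levels, and all function names are hypothetical choices. The root plays the role of the value stored on-chain, and a node proves it holds a chunk by supplying the sibling hashes on the path to that root.

```python
import hashlib
from typing import List, Tuple


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level: List[bytes]) -> List[bytes]:
    """Hash adjacent pairs to build the parent level."""
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(chunks: List[bytes]) -> bytes:
    """Root hash over a non-empty list of chunks."""
    level = [h(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = _next_level(level)
    return level[0]


def merkle_proof(chunks: List[bytes], index: int) -> List[Tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the flag marks a left-hand sibling."""
    level = [h(c) for c in chunks]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1  # the other member of this node's pair
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(chunk: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path from a chunk to the root and compare."""
    node = h(chunk)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```

A verifier that knows only the root can check a single chunk with a proof whose length grows with the logarithm of the number of chunks, which is why only the root needs to be kept on the blockchain.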

Arweave uses cryptographic proof-of-work puzzles to determine whether a file has been stored. In this mechanism, for nodes to mine the next block, they need to prove they can access the previous block and another random block in the network's block history. Because the data uploaded to Arweave is directly stored in blocks, proving access to the previous block demonstrates that the storage provider has indeed preserved the file correctly.
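A rough sketch of this idea follows. The exact way Arweave derives the random historical block is not given in this article, so the derivation below (hashing the previous block's hash and reducing it modulo the current height) is an assumption made purely for illustration, as are the function names.

```python
import hashlib
from typing import Set


def recall_block_index(prev_block_hash: bytes, height: int) -> int:
    """Derive a pseudo-random historical block index from the previous block.

    Assumed derivation for illustration only, not Arweave's actual rule.
    """
    digest = hashlib.sha256(prev_block_hash).digest()
    return int.from_bytes(digest, "big") % height


def can_mine(local_blocks: Set[int], prev_block_hash: bytes, height: int) -> bool:
    """A node may attempt the next block only if it holds both required blocks."""
    previous = height - 1
    recall = recall_block_index(prev_block_hash, height)
    return previous in local_blocks and recall in local_blocks
```

Because a node cannot predict which historical block will be required, the more blocks it stores, the more often it is eligible to mine.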

Finally, Swarm also uses Merkle trees, but unlike others, the Merkle tree is not used to determine file locations; instead, data blocks are directly stored in the Merkle tree. When storing data on Swarm, the root hash of the tree (which is also the address of the stored data) proves that the file has been correctly chunked and stored.
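The principle of addressing data by its own hash can be shown with a minimal in-memory store. This is a sketch of content addressing in general, not of Swarm's implementation: SHA-256 is used as a stand-in hash function, and the `ChunkStore` class is hypothetical.

```python
import hashlib
from typing import Dict


class ChunkStore:
    """Toy content-addressed store: a chunk's address is the hash of its bytes."""

    def __init__(self) -> None:
        self._chunks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._chunks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._chunks[address]
        # Anyone can re-hash the returned bytes to confirm they match the address.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("chunk does not match its address")
        return data
```

Since the address is derived from the content, a retrieved chunk is self-verifying, and storing the same bytes twice yields the same address.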

Data Availability Over Time

Similarly, each network has its own approach to determining whether data remains stored over a specific period. In Filecoin, to reduce network bandwidth, storage miners must continuously run the proof of replication algorithm for as long as they are storing the data. The resulting hash for each time period proves that the storage space was occupied by the correct data during that period, hence "proof of spacetime."

Crust, Sia, and Storj regularly verify random data fragments and report the results to their coordinating mechanisms: the blockchains of Crust and Sia, and the satellite nodes of Storj. Arweave ensures consistent availability of data through its proof of access mechanism, which requires miners to prove not only that they can access the last block but also that they can access a random historical block. Storing older and rarer blocks is incentivized because access to the specific block is a prerequisite for solving the proof-of-work puzzle, so holding more blocks increases a miner's likelihood of winning it.

On the other hand, Swarm regularly runs lotteries to reward nodes holding less popular data, while also running ownership proof algorithms for nodes that commit to storing data for longer periods.

Filecoin, Sia, and Crust require nodes to deposit collateral to become storage nodes, while Swarm only requires it for long-term storage requests. Storj does not require upfront collateral, but it withholds a portion of the storage income earned by nodes. Finally, all networks regularly pay nodes for the time periods in which they can prove they have stored data.

Storage Price Discovery

To determine storage prices, Filecoin and Sia use storage markets, where storage providers set their asking prices, and storage users set the prices they are willing to pay, along with other settings. The storage market then connects users with storage providers that meet their requirements. Storj adopts a similar approach, with the main difference being that there is no single network-wide market that connects all nodes on the network. Instead, each satellite has its own set of storage nodes it interacts with.
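The matching step of such a market can be sketched as below. The `Ask` fields, the token units, and the rule of picking the cheapest eligible provider are all assumptions made for illustration; the actual markets on Filecoin, Sia, and Storj take further settings into account.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ask:
    provider: str
    price_per_gb: float  # asking price, in arbitrary token units
    free_gb: int         # capacity the provider still has available


def match_order(asks: List[Ask], size_gb: int, max_price_per_gb: float) -> Optional[Ask]:
    """Return the cheapest ask that fits the user's size and price limit."""
    eligible = [
        a for a in asks
        if a.price_per_gb <= max_price_per_gb and a.free_gb >= size_gb
    ]
    return min(eligible, key=lambda a: a.price_per_gb, default=None)
```

In a per-satellite model like the one described for Storj, the same function would simply be run over the smaller list of nodes known to one satellite.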

Finally, Crust, Arweave, and Swarm let the protocol decide storage prices. Crust and Swarm allow certain settings to be adjusted according to users' file storage requirements, while files on Arweave are always stored permanently.

Persistent Data Redundancy

Over time, nodes will leave these open public networks, and when nodes disappear, the data they stored will also vanish. Therefore, the network must actively maintain a certain level of redundancy within the system. Sia and Storj recreate lost fragments by collecting subsets of fragments, reconstructing the underlying data, and then re-encoding the files, thereby supplementing the lost erasure-coded fragments. In Sia, users must regularly log into the Sia client to replenish fragments, as only the client can distinguish which data fragments belong to which data and user. In Storj, the satellite remains online and regularly runs data audits to replenish data fragments.
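The decision of when to replenish fragments can be sketched as a simple health check. The parameter names (`k` fragments needed to reconstruct, `n` fragments in total, and a repair `threshold`) and the rule itself are assumptions for illustration, not the documented policy of Sia or Storj.

```python
def plan_repair(online_fragments: int, k: int, n: int, threshold: int) -> int:
    """Return how many fragments to regenerate, or 0 if no repair is needed.

    k: fragments required to reconstruct the file
    n: fragments the network aims to keep in total
    threshold: repair starts once the online count drops to this level
    """
    if online_fragments < k:
        raise ValueError("file is unrecoverable: fewer than k fragments remain")
    if online_fragments > threshold:
        return 0
    return n - online_fragments
```

Keeping the threshold comfortably above `k` gives the repairer time to act before the file becomes unrecoverable, which is why the Sia client or the Storj satellite needs to run this kind of check regularly.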

Arweave's proof of access algorithm ensures that data is regularly replicated across the network, while in Swarm, data is replicated to nearby nodes. In Filecoin, if data disappears over time and the remaining file fragments fall below a certain threshold, the storage order will be reintroduced to the storage market, allowing another storage miner to take over that storage order. Crust's replenishment mechanism is currently under development.

Incentivizing Data Transmission

Once data has been securely stored, users will eventually want to retrieve it. Since bandwidth is costly, storage nodes must be incentivized to provide data when it is requested. Crust and Swarm use a debt and credit mechanism, where each node tracks the inbound and outbound traffic it exchanges with each of the nodes it interacts with. If a node only accepts inbound traffic but does not provide outbound traffic, it will be deprioritized for future communications, which may affect its ability to accept new storage orders. Crust uses the IPFS Bitswap mechanism, while Swarm uses a proprietary protocol called SWAP. In Swarm's SWAP protocol, nodes that have run up debt (by accepting inbound traffic without providing sufficient outbound traffic) can settle it with stamps, which can be exchanged for the network's utility token.
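A per-peer ledger of this kind can be sketched as follows. It is a simplified illustration of debt and credit accounting in general, not the Bitswap or SWAP protocol; the `PeerLedger` class, its byte-based units, and the single debt limit are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict


class PeerLedger:
    """Tracks, per peer, bytes we sent to it minus bytes it sent to us."""

    def __init__(self, debt_limit: int) -> None:
        self.debt_limit = debt_limit
        # A positive balance means the peer owes us bandwidth.
        self.balance: DefaultDict[str, int] = defaultdict(int)

    def record_sent(self, peer: str, n_bytes: int) -> None:
        self.balance[peer] += n_bytes

    def record_received(self, peer: str, n_bytes: int) -> None:
        self.balance[peer] -= n_bytes

    def should_serve(self, peer: str) -> bool:
        """Keep serving a peer only while its debt stays below the limit."""
        return self.balance[peer] < self.debt_limit
```

A peer that only consumes bandwidth eventually crosses the limit and is deprioritized until it reciprocates or, in a scheme like the one described for Swarm, settles the debt by other means.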

Figure 13: Swarm Accounting Protocol (SWAP), Source: Swarm White Paper

This tracking of node generosity is also how Arweave ensures data is transmitted upon request. In Arweave, this mechanism is called wildfire: nodes prioritize better-ranked peers and ration their bandwidth accordingly. Finally, in Filecoin, Storj, and Sia, users ultimately pay for bandwidth, incentivizing nodes to deliver data upon request.

Token Economics

The design of token economics is meant to keep the network stable and to ensure it exists for the long term, since data can ultimately only be as permanent as the network itself. The table below gives a brief summary of the design decisions regarding token economics, along with the inflationary and deflationary mechanisms embedded in the respective designs.

Figure 14: Token economics design decisions of the reviewed storage networks.

Which Network is the Best?

No one network is objectively better than another. There are countless trade-offs in designing decentralized storage networks. While Arweave is excellent for permanently storing data, it may not be suitable for migrating Web2.0 industry participants to Web3.0, since not all data needs to be permanently preserved. However, an important subset of data does require permanence: NFTs and dApps.

Ultimately, design decisions will be based on the purpose of the network.

Below is a summary overview of various storage networks, comparing them on a set of scales defined below. The scales used reflect the comparative dimensions of these networks, but it should be noted that the methods for overcoming the challenges of decentralized storage do not have a clear right or wrong in many cases; they merely reflect design decisions.

  • Storage parameter flexibility: The extent to which users control file storage parameters
  • Storage permanence: The extent to which files can achieve theoretical permanence through the network (i.e., without intervention)
  • Redundancy permanence: The network's ability to maintain data redundancy through supplementation or repair
  • Data transmission incentives: The extent to which the network ensures nodes generously transmit data
  • Universality of storage tracking: The degree of consensus among nodes regarding the location of stored data
  • Guaranteed data accessibility: The network's ability to ensure that individual participants in the storage process cannot delete access to files on the network

Higher scores indicate stronger capabilities in the above areas.

Filecoin's token economics supports increasing the overall storage capacity of the network for storing large amounts of data in an immutable manner. Additionally, their storage algorithm is better suited for data that is unlikely to change significantly over time (cold storage).

Figure 15: Summary overview of Filecoin

Crust's token economics ensures very high redundancy and fast retrieval, making it suitable for high-traffic dApps and for quickly retrieving data for popular NFTs.

Crust scores lower in storage permanence because, without persistent redundancy, its ability to provide permanent storage is severely impacted. Nevertheless, permanence can still be achieved by manually setting a very high replication factor.

Figure 16: Summary overview of Crust

Sia is about privacy. Users need to restore file health manually because nodes do not know which data fragments they have stored or which data those fragments belong to. Only data owners can reconstruct the original data from the fragments in the network.

Figure 17: Summary overview of Sia

In contrast, Arweave is about permanence. This is reflected in their design, which makes storage costs higher but also makes them an attractive choice for storing NFTs.

Figure 18: Summary overview of Arweave

Storj's business model seems to significantly influence their billing and payment methods: Amazon AWS S3 users are more familiar with monthly billing. By removing the complex payment and incentive systems common in blockchain-based systems, Storj Labs sacrifices some decentralization but significantly lowers the entry barrier for their key target audience of AWS users.

Figure 19: Summary overview of Storj

Swarm's bonding curve model ensures that storage costs remain relatively low as more data is stored on the network, and its proximity to the Ethereum blockchain makes it a strong contender as primary storage for more complex Ethereum-based dApps.

Figure 20: Summary overview of Swarm

There is no single best approach to the various challenges faced by decentralized storage networks. Depending on the purpose of the network and the problems it aims to solve, it must make trade-offs in the technical design and token economics of the network.

Figure 21: Summary of strong use cases for the reviewed storage networks

Ultimately, the purpose of the network and the specific use cases it seeks to optimize will determine various design decisions.

The Next Chapter

Returning to the pillars of Web3 infrastructure (consensus, storage, computation), we see that the decentralized storage space has a small number of strong players that have positioned themselves in the market for specific use cases. This does not exclude the possibility of emerging networks optimizing existing solutions or opening up new markets, but it does raise the question: what comes next?

The answer is: computation. The next frontier for achieving a truly decentralized internet is decentralized computing. Currently, only a few projects are able to bring trustless, decentralized computing to market in a form that can support complex dApps, enabling more complex computations at far lower cost than executing smart contracts on a blockchain.

The Internet Computer (ICP) and Holochain (HOLO) are networks that hold a strong position in the decentralized computing market at the time of writing. Nevertheless, the computing space is not as crowded as the consensus and storage spaces. Therefore, strong competitors will inevitably enter the market and position themselves accordingly. Stratos (STOS) is one such competitor. Stratos offers a unique network design through its decentralized data grid technology.

We will view decentralized computing, particularly the network design of Stratos, as a field for future research.

Conclusion

Thank you for reading this article on decentralized storage research. If you enjoyed the research aimed at uncovering the fundamental building blocks for constructing our shared Web3 future, consider following @FundamentalLabs on Twitter.

Did I miss any valuable concepts or other information? Please reach out to me on Twitter @0xPhillan so we can solidify this research together.

The complete work can be found on Arweave and Crust Network.
