From Reading and Indexing to Analysis: A Brief Overview of the Web3 Data Indexing Track

Trustless Labs
2024-09-13 14:35:51
This article explores the development of blockchain data accessibility, comparing the characteristics of three data service protocols: The Graph, Chainbase, and Space and Time, in terms of architecture and AI technology application. It points out that blockchain data services are evolving towards intelligence and security, and will continue to play an important role as industry infrastructure in the future.

1 Introduction

From the first wave of dApps in 2017, such as Etheroll, ETHLend, and CryptoKitties, to today's flourishing financial, gaming, and social dApps built on different blockchains: when we talk about decentralized on-chain applications, have we ever considered where the various kinds of data these dApps consume in their interactions actually come from?

In 2024, the spotlight is on AI and Web3. In the world of artificial intelligence, data is like the lifeblood of its growth and evolution. Just as plants rely on sunlight and water to thrive, AI systems depend on vast amounts of data to continuously "learn" and "think." Without data, even the most sophisticated AI algorithms are mere castles in the air, unable to exert their intended intelligence and effectiveness.

This article analyzes the evolution of blockchain data indexing from the perspective of data accessibility, comparing the established data indexing protocol The Graph with emerging blockchain data service protocols Chainbase and Space and Time, particularly exploring the similarities and differences in data services and product architecture of these two new protocols that integrate AI technology.

2 The Complexity and Simplicity of Data Indexing: From Blockchain Nodes to Full-Chain Databases

2.1 Data Source: Blockchain Nodes

From the moment we start to understand "what blockchain is," we often see a phrase: blockchain is a decentralized ledger. Blockchain nodes are the foundation of the entire blockchain network, responsible for recording, storing, and disseminating all transaction data on the chain. Each node has a complete copy of the blockchain data, ensuring the decentralized nature of the network is maintained. However, for ordinary users, building and maintaining a blockchain node is not an easy task. It requires not only specialized technical skills but also incurs high hardware and bandwidth costs. Additionally, the querying capabilities of ordinary nodes are limited, making it difficult to retrieve data in the format developers need. Therefore, while theoretically anyone can run their own node, in practice, users often prefer to rely on third-party services.

To address this issue, RPC (Remote Procedure Call) node providers have emerged. These providers manage the costs and maintenance of nodes and provide data through RPC endpoints, allowing users to easily access blockchain data without having to build their own nodes. Public RPC endpoints are free but come with rate limits, which may negatively impact the user experience of dApps. Private RPC endpoints offer better performance by reducing congestion, but even simple data retrieval requires a lot of back-and-forth communication. This makes them resource-intensive and inefficient for complex data queries. Moreover, private RPC endpoints are often difficult to scale and lack compatibility across different networks. However, the standardized API interfaces provided by node providers lower the barrier for users to access on-chain data, laying the groundwork for subsequent data parsing and application.
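To make the RPC interaction concrete, here is a minimal sketch of calling an Ethereum JSON-RPC endpoint. The request envelope and hex-quantity encoding follow the standard Ethereum JSON-RPC conventions; the endpoint URL is whatever a node provider issues you and is not specified here.

```python
import json
import urllib.request


def build_request(method: str, params: list, req_id: int = 1) -> dict:
    """Construct a JSON-RPC 2.0 request body as accepted by Ethereum nodes."""
    return {"jsonrpc": "2.0", "method": method, "params": params, "id": req_id}


def hex_to_int(quantity: str) -> int:
    """RPC responses encode numeric quantities as 0x-prefixed hex strings."""
    return int(quantity, 16)


def fetch_block(rpc_url: str, number: str = "latest") -> dict:
    """POST an eth_getBlockByNumber call to an RPC endpoint (network required)."""
    body = json.dumps(build_request("eth_getBlockByNumber", [number, False])).encode()
    req = urllib.request.Request(
        rpc_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]
```

Note how even this single block lookup is one full round trip; assembling, say, a wallet's transfer history this way takes many such calls, which is exactly the inefficiency described above.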

2.2 Data Parsing: From Raw Data to Usable Data

The data obtained from blockchain nodes is typically raw data that has been serialized and encoded. While this preserves the integrity and security of the blockchain, its complexity also makes parsing harder. For ordinary users or developers, working directly with this raw data demands significant technical knowledge and computational resources.

In this context, the process of data parsing becomes particularly important. By parsing complex raw data and converting it into a more understandable and operable format, users can more intuitively understand and utilize this data. The success or failure of data parsing directly determines the efficiency and effectiveness of blockchain data applications, making it a key step in the entire data indexing process.

2.3 Evolution of Data Indexers

As the volume of blockchain data grows, so has the demand for data indexers. Indexers play a crucial role in organizing on-chain data and loading it into databases for easier querying. An indexer works by indexing blockchain data and exposing it through query interfaces such as GraphQL APIs. By providing a unified interface for querying data, indexers allow developers to quickly and accurately retrieve the information they need with a standardized query language, greatly simplifying the process.

Different types of indexers optimize data retrieval in various ways:

  1. Full Node Indexers: These indexers run complete blockchain nodes and extract data directly from them, ensuring data completeness and accuracy, but requiring significant storage and processing power.
  2. Lightweight Indexers: These indexers rely on full nodes to obtain specific data as needed, reducing storage requirements but potentially increasing query times.
  3. Specialized Indexers: These indexers focus on certain types of data or specific blockchains, optimizing retrieval for specific use cases, such as NFT data or DeFi transactions.
  4. Aggregate Indexers: These indexers extract data from multiple blockchains and sources, including off-chain information, providing a unified query interface, which is particularly useful for multi-chain dApps.

Currently, a Geth archive node occupies about 13.5 TB of storage, while Erigon's archive requirement is about 3 TB. As blockchains continue to grow, the storage requirements of archive nodes will increase as well. Faced with such a massive amount of data, mainstream indexing protocols not only support multi-chain indexing but also offer data parsing frameworks tailored to different applications' data needs; The Graph's "subgraph" framework is a typical example.

The emergence of indexers has significantly improved the efficiency of data indexing and querying. Compared to traditional RPC endpoints, indexers can efficiently index large amounts of data and support high-speed queries. These indexers allow users to perform complex queries, easily filter data, and analyze it after extraction. Additionally, some indexers support aggregating data sources from multiple blockchains, avoiding the need to deploy multiple APIs in multi-chain dApps. By running distributed across multiple nodes, indexers not only provide stronger security and performance but also reduce the risk of interruptions and downtime that centralized RPC providers may bring.

In contrast to raw RPC access, indexers let users retrieve exactly the information they need through predefined query languages, without wrestling with the complexity of the underlying data. This mechanism significantly improves the efficiency and reliability of data retrieval and represents an important innovation in blockchain data access.

2.4 Full-Chain Databases: Aligning with Stream-First

Using indexed nodes to query data often means that APIs become the sole portal for digesting on-chain data. However, when a project enters the scaling phase, it often requires more flexible data sources, which standardized APIs cannot provide. As application demands become more complex, primary data indexers with their standardized indexing formats gradually struggle to meet increasingly diverse query needs, such as searching, cross-chain access, or off-chain data mapping.

In modern data pipeline architectures, the "stream-first" approach has become a solution to the limitations of traditional batch processing, enabling real-time data ingestion, processing, and analysis. This paradigm shift allows organizations to respond immediately to incoming data, deriving insights and making decisions almost instantaneously. Similarly, the development of blockchain data service providers is also moving towards building blockchain data streams, with traditional indexer service providers gradually launching products that obtain real-time blockchain data in a streaming manner, such as The Graph's Substreams, Goldsky's Mirror, and real-time data lakes like Chainbase and SubSquid that generate data streams based on blockchains.

These services aim to meet the demand for real-time parsing of blockchain transactions and provide more comprehensive querying capabilities. Just as the "stream-first" architecture has revolutionized data processing and consumption in traditional data pipelines by reducing latency and enhancing responsiveness, these blockchain data stream service providers also hope to support the development of more applications and assist on-chain data analysis through more advanced and mature data sources.
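The batch-versus-stream contrast above can be sketched with a simple generator: instead of re-scanning the whole chain on a schedule, a streaming consumer sees each block exactly once as it arrives. The in-memory list of blocks below is a stand-in for a real chain, purely for illustration.

```python
from typing import Iterator


def block_stream(chain: list, start: int = 0) -> Iterator[dict]:
    """Yield each block once, in order, the way a streaming consumer sees
    new data -- rather than re-reading the full history in batches."""
    cursor = start
    while cursor < len(chain):
        yield chain[cursor]
        cursor += 1
    # A real implementation would now poll or subscribe for new chain heads
    # instead of returning.


# In-memory stand-in for a chain: three blocks, each carrying some payload.
chain = [{"number": n, "transfers": [n * 10 + i for i in range(2)]} for n in range(3)]

# The consumer processes each block as it "arrives", keeping only a cursor
# of where it left off instead of materializing the whole history.
seen = [block["number"] for block in block_stream(chain)]
```

The cursor-based design is the essential property: latency drops from "whenever the next batch job runs" to "as soon as the block exists", which is what products like Substreams and Mirror are built around.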

By reframing the challenges of on-chain data through the lens of modern data pipelines, we gain a new perspective on the full potential of managing, storing, and serving on-chain data. When we begin to view subgraphs and Ethereum ETL as data streams within a pipeline rather than final outputs, we can envision a world where high-performance datasets are tailored to any business use case.

3 AI + Database? A Deep Comparison of The Graph, Chainbase, and Space and Time

3.1 The Graph

The Graph network achieves multi-chain data indexing and querying through a decentralized network of nodes, making it convenient for developers to index blockchain data and build decentralized applications. Its main product models are a data query execution market and a data indexing cache market, both of which ultimately serve users' query needs. In the query execution market, consumers pay the appropriate indexing nodes for the data they need; in the indexing cache market, indexing nodes allocate resources based on a subgraph's historical indexing popularity, the query fees collected, and on-chain curators' demand for its outputs.

Subgraphs are the fundamental data structure within The Graph network. They define how to extract and transform data from the blockchain into a queryable format (e.g., GraphQL schema). Anyone can create subgraphs, and multiple applications can reuse these subgraphs, enhancing data reusability and efficiency.

The Graph Product Structure (Source: The Graph Whitepaper)

The Graph network consists of four key roles: indexers, curators, delegators, and developers, who collectively provide data support for web3 applications. Here are their respective responsibilities:

  • Indexer: Indexers are node operators within The Graph network. They participate in the network by staking GRT (The Graph's native token) and provide indexing and query processing services.
  • Delegator: Delegators are users who stake GRT tokens to support the operation of indexing nodes. They earn a portion of the rewards through the indexing nodes they delegate to.
  • Curator: Curators are responsible for signaling which subgraphs should be indexed by the network. They help ensure that valuable subgraphs are prioritized.
  • Developer: Unlike the first three roles, developers are the demand side and the primary users of The Graph. They create and submit subgraphs to The Graph network, waiting for the network to fulfill their data needs.

Currently, The Graph has transitioned to a fully decentralized subgraph hosting service, with circulating economic incentives among different participants ensuring the system operates:

  • Indexer Rewards: Indexers earn revenue through consumer query fees and a portion of GRT token block rewards.
  • Delegator Rewards: Delegators receive a portion of the rewards from the indexing nodes they support.
  • Curator Rewards: If curators signal valuable subgraphs, they can earn a portion of the query fees.

In fact, The Graph's products are also rapidly evolving in the AI wave. As one of the core development teams in The Graph ecosystem, Semiotic Labs has been dedicated to optimizing indexing pricing and user query experience using AI technology. Currently, tools developed by Semiotic Labs, such as AutoAgora, Allocation Optimizer, and AgentC, enhance the ecosystem's performance in various aspects.

  • AutoAgora introduces a dynamic pricing mechanism that adjusts prices in real-time based on query volume and resource usage, optimizing pricing strategies to ensure indexers' competitiveness and maximize revenue.
  • Allocation Optimizer addresses the complex issue of resource allocation for subgraphs, helping indexers achieve optimal resource configuration to enhance revenue and performance.
  • AgentC is an experimental tool that allows users to access The Graph's blockchain data through natural language, thereby enhancing user experience.

The application of these tools enables The Graph to further enhance the system's intelligence and user-friendliness through AI assistance.

3.2 Chainbase

Chainbase is a full-chain data network that integrates all blockchain data into one platform, making it easier for developers to build and maintain applications. Its unique features include:

  • Real-Time Data Lake: Chainbase provides a real-time data lake specifically for blockchain data streams, allowing data to be accessed instantly as it is generated.
  • Dual-Chain Architecture: Chainbase builds an execution layer based on Eigenlayer AVS, forming a parallel dual-chain architecture with the CometBFT consensus algorithm. This design enhances the programmability and composability of cross-chain data, supporting high throughput, low latency, and finality, while improving network security through a dual-staking model.
  • Innovative Data Format Standard: Chainbase introduces a new data format standard called "manuscripts," optimizing the structuring and utilization of data in the crypto industry.
  • Cryptographic World Model: Drawing on its vast blockchain data resources, Chainbase combines AI modeling techniques to build models that can understand and predict blockchain transactions and interact with them. The basic version of the model, Theia, has been launched for public use.

These features make Chainbase stand out among blockchain indexing protocols, with a particular focus on real-time data accessibility, innovative data formats, and smarter models that combine on-chain and off-chain data to sharpen insights.

Chainbase's AI model Theia is the key highlight that distinguishes it from other data service protocols. Theia is based on the DORA model developed by NVIDIA; it combines on-chain and off-chain data with temporal and spatial activity to learn and analyze crypto patterns, and responds through causal inference to surface the latent value and regularities in on-chain data, providing users with more intelligent data services.

AI-powered data services make Chainbase not just a blockchain data service platform but a more competitive intelligent data service provider. With powerful data resources and proactive analysis through AI, Chainbase can offer broader data insights and optimize users' data processing.

3.3 Space and Time

Space and Time (SxT) aims to create a verifiable computing layer that extends zero-knowledge proofs over decentralized data warehouses, providing trustworthy data processing for smart contracts, large language models, and enterprises. Currently, Space and Time has secured $20 million in its latest Series A funding round, led by Framework Ventures, Lightspeed Faction, Arrington Capital, and Hivemind Capital.

In the field of data indexing and verification, Space and Time introduces a new technical path: Proof of SQL. This is an innovative zero-knowledge proof (ZKP) technology developed by Space and Time that ensures SQL queries executed on its decentralized data warehouse are tamper-proof and verifiable. When a query runs, Proof of SQL generates a cryptographic proof attesting to the integrity and accuracy of the query results. The proof is attached to the results, allowing any verifier (such as a smart contract) to independently confirm that the data was not tampered with during processing.

Traditional blockchain networks typically rely on consensus mechanisms to verify the authenticity of data, whereas Proof of SQL enables a more efficient verification path: in Space and Time's system, one node is responsible for acquiring the data, while other nodes verify its authenticity using ZK technology. This replaces the resource cost of multiple nodes redundantly indexing the same data just to reach consensus, improving the overall performance of the system. As this technology matures, it lays a foundation for traditional industries that depend on data reliability to build products on blockchain data.
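The prove-once, verify-cheaply pattern behind Proof of SQL can be illustrated with a deliberately simplified toy. Note the heavy caveat: the sketch below substitutes a plain hash commitment for the real ZK machinery, so it only shows the *shape* of the protocol (verifiers check a proof instead of re-running the query); a real Proof of SQL proof additionally demonstrates that the result was computed correctly, which a hash alone cannot.

```python
import hashlib
import json


def commit(obj) -> str:
    """Deterministic commitment to a JSON-serializable object (toy stand-in
    for the cryptographic commitments a real ZK system would use)."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def prover_run_query(table: list, query_tag: str, result) -> dict:
    """The querying node returns the result plus a 'proof' binding it to the
    committed table and the query text. Unlike real Proof of SQL, this toy
    proof does not show the result was computed correctly."""
    proof = commit({"table": commit(table), "query": query_tag, "result": result})
    return {"result": result, "proof": proof}


def verifier_check(table_commitment: str, query_tag: str, response: dict) -> bool:
    """Verifiers check the proof against the table commitment without
    re-executing the query themselves."""
    expected = commit(
        {"table": table_commitment, "query": query_tag, "result": response["result"]}
    )
    return response["proof"] == expected


table = [{"addr": "0xab", "bal": 5}, {"addr": "0xcd", "bal": 7}]
resp = prover_run_query(table, "SUM(bal)", 12)
ok = verifier_check(commit(table), "SUM(bal)", resp)
# A tampered result no longer matches the attached proof:
bad = verifier_check(commit(table), "SUM(bal)", {"result": 13, "proof": resp["proof"]})
```

The point of the toy is the division of labor: only one node does the expensive work, and everyone else performs a cheap check, which is the efficiency gain over redundant re-indexing described above.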

At the same time, SxT has been closely collaborating with Microsoft's AI Joint Innovation Lab to accelerate the development of generative AI tools, making it easier for users to process blockchain data through natural language. Currently, in Space and Time Studio, users can experience inputting natural language queries, and AI will automatically convert them into SQL and execute the query on behalf of the user to present the final results they need.

3.4 Comparative Differences

4 Conclusion and Outlook

In summary, blockchain data indexing technology has evolved from the initial node data sources, through the development of data parsing and indexers, to AI-powered full-chain data services, undergoing a gradual improvement process. The continuous evolution of these technologies not only enhances the efficiency and accuracy of data access but also brings users an unprecedented intelligent experience.

Looking ahead, with the continuous development of new technologies such as AI and zero-knowledge proofs, blockchain data services will further become more intelligent and secure. We have reason to believe that blockchain data services will continue to play an important role as infrastructure in the future, providing strong support for industry progress and innovation.
