In-depth Analysis of the Web3 Data Economy: The Next Billion-Dollar Track After LSD
Author: Yuxing, SevenX Ventures
This article is for communication and learning purposes only and does not constitute any investment advice.
The popularity of ChatGPT and GPT-4 has shown us the power of artificial intelligence. Behind artificial intelligence, besides algorithms, the more important factor is the vast amount of data. We have built a large-scale complex system around data, and the value of this system mainly comes from Business Intelligence (BI) and Artificial Intelligence (AI). Due to the rapid growth of data in the internet era, the work and best practices of data infrastructure are also evolving rapidly. In the past two years, the core systems of the data infrastructure technology stack have become very stable, and supporting tools and applications are growing quickly.
Web2 Data Infrastructure Architecture
Cloud data warehouses (such as Snowflake) are growing rapidly, focused primarily on SQL users and business intelligence scenarios. Adoption of other technologies is also accelerating, data lakes (such as Databricks) are seeing unprecedented customer growth, and heterogeneous technologies within the data stack will continue to coexist.
Other core data systems, such as data extraction and transformation, have also proven durable. This is particularly evident in modern business intelligence, where the combination of Fivetran and dbt (or similar technologies) is almost ubiquitous. To some extent, this is also true of business systems, where the combination of Databricks/Spark, Confluent/Kafka, and Astronomer/Airflow is also emerging as a standard.
Source: a16z
In this architecture (a minimal end-to-end sketch follows the list below):
- Data sources generate relevant business and operational data;
- Data extraction and transformation are responsible for extracting data from business systems (E), transporting it to storage, aligning the formats between the data source and destination (L), and sending the analyzed data back to the business systems as needed;
- Data storage stores data in a queryable and processable format, optimized for low cost, high scalability, and analytical workload;
- Querying and processing translates high-level languages (usually SQL, Python, or Java/Scala) into low-level data processing jobs, executing queries and data models over stored data with distributed computing, covering both historical analysis (describing past events) and predictive analysis (describing expected future events);
- Transformation converts data into structures usable for analysis, managing processes and resources;
- Analysis and output provide analysts and data scientists with an interface for traceable insights and collaboration, showcasing the results of data analysis to internal and external users, embedding data models into user-facing applications.
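To make the division of labor concrete, here is a minimal TypeScript sketch of the extract-load-transform flow the list describes. All interfaces and function names (SourceRecord, loadToWarehouse, and so on) are illustrative placeholders rather than any vendor's API, and the example assumes a runtime with a global fetch (Node 18+ or a browser).

```typescript
// Hypothetical types modeling the stages of the stack described above.
interface SourceRecord { table: string; payload: Record<string, unknown>; }
interface WarehouseRow { table: string; columns: Record<string, unknown>; loadedAt: Date; }

// Extraction (E): pull raw records out of a business system.
async function extract(sourceUrl: string): Promise<SourceRecord[]> {
  const res = await fetch(sourceUrl); // e.g. a REST endpoint of an operational system
  return (await res.json()) as SourceRecord[];
}

// Loading (L): align formats and land the data in queryable storage.
function loadToWarehouse(records: SourceRecord[]): WarehouseRow[] {
  return records.map((r) => ({ table: r.table, columns: r.payload, loadedAt: new Date() }));
}

// Transformation (T): reshape stored rows into analysis-ready structures.
function transform(rows: WarehouseRow[]): Map<string, number> {
  const countsByTable = new Map<string, number>();
  for (const row of rows) countsByTable.set(row.table, (countsByTable.get(row.table) ?? 0) + 1);
  return countsByTable;
}

// Analysis and output: surface the result to users or downstream applications.
async function run(sourceUrl: string): Promise<void> {
  const report = transform(loadToWarehouse(await extract(sourceUrl)));
  console.table([...report.entries()].map(([table, rows]) => ({ table, rows })));
}
```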
With the rapid development of the data ecosystem, the concept of a "data platform" has emerged. From an industry perspective, the defining characteristic of a platform is mutual technical and economic dependence between influential platform providers and a large number of third-party developers. From the platform perspective, the data technology stack divides into a "front end" and a "back end."
The "back end" roughly includes data extraction, storage, processing, and transformation, and has begun to consolidate around a small number of cloud service providers. As a result, customer data is collected in standardized systems, and vendors are investing heavily to make that data easy for other developers to access. This is a fundamental design principle of systems like Databricks, and it is also realized through SQL standards and custom compute APIs (such as Snowflake).
"Front-end" engineers leverage this single-point integration to build a range of new applications. They rely on cleaned and integrated data within the warehouse/lakehouse without worrying about the underlying details of how it was generated. Individual customers can build and purchase many applications on top of a core data system. We are even beginning to see traditional enterprise systems, such as finance or product analysis, being restructured using warehouse-native architectures.
As the data technology stack matures, data applications built on data platforms are surging. Thanks to standardization, adopting new data platforms has become more important than ever, and maintaining position on these platforms has become correspondingly crucial. At scale, platforms can be very valuable. Competition among core data system vendors is fierce, not only for current business but for long-term platform positioning. If you believe that the data extraction and transformation module is a core part of emerging data platforms, the astonishing valuations of data extraction and transformation companies become easier to understand.
However, these technology stacks took shape under a data utilization model dominated by large companies. As society's understanding of data deepens, data is increasingly seen, like land, labor, capital, and technology, as a marketable factor of production. Treating data as one of the five major factors of production is a recognition of its value as an asset.
To achieve market allocation of data factors, the current technology stack falls far short of the demand. In the Web3 domain, closely integrated with blockchain technology, new data infrastructures are evolving and developing. These infrastructures will be embedded in modern data infrastructure architectures to enable the definition of data property rights, circulation and trading, revenue distribution, and factor governance. These four areas are crucial from the perspective of government regulation and thus deserve special attention.
Web3 Hybrid Data Infrastructure Architecture
Inspired by a16z's unified data infrastructure architecture (2.0) and integrating an understanding of Web3 infrastructure architecture, we propose the following Web3 hybrid data infrastructure architecture.
The orange units are unique to the Web3 technology stack. Because decentralized technology is still at an early stage, most applications in the Web3 domain still adopt this hybrid data infrastructure architecture, and the vast majority are not true "superstructures." Superstructures are characterized by being unstoppable, free, valuable, scalable, and permissionless, generating positive externalities, and maintaining credible neutrality. They exist as public goods in the digital world, the public infrastructure of the "metaverse," and they require a fully decentralized underlying architecture to support them.
Traditional data infrastructure architecture has evolved based on enterprise business development. a16z summarizes it into two systems (analytics system and business system) and three scenarios (modern business intelligence, multi-model data processing, and artificial intelligence and machine learning). This is a summary made from the enterprise perspective—data serves the development of enterprises.
However, enterprises are not the only parties that should benefit from the productivity gains of data factors; society and individuals should as well. Countries around the world have successively introduced policies and regulations to govern data usage and promote data circulation. Examples include the Data Banks common in Japan, the recently emerging data exchanges in China, and trading platforms widely used in Europe and the United States, such as BDEX (USA), Streamr (Switzerland), DAWEX (France), and CARUSO, among others.
Once data property rights are defined, data is circulated and traded, revenues are distributed, and governance is in place, the corresponding systems and scenarios no longer merely empower an enterprise's own decision-making and business development. These systems and scenarios either need to leverage blockchain technology or rely heavily on policy and regulation.
Web3 is the natural soil for the data factor market: it technically eliminates the possibility of cheating, which significantly alleviates regulatory pressure and allows data to function as a true factor of production allocated by the market.
In the context of Web3, the new paradigm of data utilization includes market systems that carry flowing data factors and public systems that manage public data factors. They encompass three new data business scenarios: property rights data development and integration, composable initial data layer, and public data mining.
Some of these scenarios are closely integrated with traditional data infrastructure and belong to the Web3 hybrid data infrastructure architecture; others deviate from traditional architecture and are fully supported by new Web3-native technologies.
Web3 and the Data Economy
The data economy market is key to allocating data factors, covering the development and integration of property rights data and the market for a composable initial data layer. In an efficient and compliant data economy market, the following points are crucial:
- Data property rights are the key to protecting rights and ensuring compliant use. They should be allocated and disposed of in a structured way, and data use should rest on a confirmed authorization mechanism so that every participant holds the relevant rights.
- Circulation and trading require a combination of on-exchange and over-the-counter channels, balancing compliance and efficiency. They should rest on four principles: verifiable data sources, definable scope of use, traceable circulation, and preventable security risks.
- The revenue distribution system needs to be efficient and fair. Following the principle that those who invest and contribute are those who benefit, the government can play a guiding and regulating role in the distribution of returns on data factors.
- Factor governance should be secure, controllable, and resilient. This requires innovating government data governance mechanisms, establishing a credit system for the data factor market, and encouraging enterprises to participate actively in building that market, including declaration and commitment mechanisms for data providers and third-party professional service agencies covering data sources, data property rights, data quality, and data usage.
The principles above are the baseline regulators use when thinking about the data economy. Across the three scenarios of property rights data development and integration, the composable initial data layer, and public data mining, we can ask: what infrastructure is needed to support them, and how much value can that infrastructure capture at each stage?
Scenario 1: Property Rights Data Development and Integration
Note: Orange represents the intersection of Web2 and Web3 units
In property rights data development, a classification and grading authorization mechanism needs to be established to determine ownership, usage rights, and operational rights over public data, enterprise data, and personal data, with property rights defined through "data adaptation" according to data sources and generation characteristics. Typical projects include Navigate, Streamr Network, and KYVE. These projects standardize data quality, collection, and interfaces through technical means, secure property rights for off-chain data in some form, and apply classified and graded authorization through smart contracts or internal logic systems (a minimal sketch of such tiered authorization follows the list below).
- The applicable data types in this scenario are non-public data, namely enterprise data and personal data. They should be "jointly used and shared in revenue" in a market-oriented manner, thereby activating the value of data factors.
- Enterprise data includes various types of data collected and processed by market entities in their production and operational activities that do not involve personal information and public interests. Market entities have the right to legally hold, use, and obtain benefits from this data, as well as the right to ensure reasonable returns on their labor and other factor contributions.
- Personal data requires data processors to collect, hold, manage, and use data within the scope authorized by individuals and in accordance with the law. Innovative technical means should be used to anonymize personal information, ensuring information security and personal privacy when such data is used. Mechanisms can be explored in which trustees represent personal interests and supervise how market entities collect, process, and use personal information. For special personal data involving national security, relevant units may be authorized to use it in accordance with the law.
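As referenced above, a minimal TypeScript sketch of classified and graded authorization might look like the following. The data classes, grades, and rules are illustrative assumptions for this article, not the logic of Navigate, Streamr Network, or KYVE; on-chain, a check like this would live in a smart contract.

```typescript
// A minimal sketch of classification-and-grading authorization, assuming three data
// classes and three sensitivity grades; names and rules are illustrative only.
type DataClass = "public" | "enterprise" | "personal";
type Grade = 1 | 2 | 3; // 1 = open, 2 = restricted, 3 = sensitive

interface DataAsset { id: string; dataClass: DataClass; grade: Grade; ownerDid: string; }
interface AccessGrant {
  assetId: string;
  granteeDid: string;
  rights: ("hold" | "use" | "operate")[];
  expires: number; // unix timestamp
}

// Plain logic standing in for the on-chain authorization check.
function mayUse(asset: DataAsset, grant: AccessGrant | undefined, requesterDid: string, now: number): boolean {
  if (asset.dataClass === "public" && asset.grade === 1) return true; // open public data
  if (asset.ownerDid === requesterDid) return true;                   // owners retain usage rights
  if (!grant || grant.assetId !== asset.id || grant.granteeDid !== requesterDid) return false;
  if (grant.expires <= now) return false;                             // authorization has lapsed
  // The most sensitive personal data requires an explicit "use" right from the trustee.
  if (asset.dataClass === "personal" && asset.grade === 3) return grant.rights.includes("use");
  return grant.rights.includes("use") || grant.rights.includes("operate");
}
```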
Scenario 2: Composable Initial Data Layer
Note: Orange represents the intersection of Web2 and Web3 units
The composable initial data layer is an important component of the data economy market. Unlike general property rights data, this part of the data is characterized by the need to define data standards through "data model management." Unlike "data adaptation," which focuses on quality, collection, and interface standardization, this emphasizes the standardization of data models, including standard data formats and standard data models. Ceramic and Lens are pioneers in this field, ensuring standard models for off-chain (decentralized storage) and on-chain data, thus enabling data composability.
Built on these data model management tools is the composable initial data layer, commonly referred to as the "data layer," such as Cyberconnect, KNN3, etc.
The composable initial data layer involves less of the Web2 technology stack, but the hot-data reading tools led by Ceramic break this barrier, which will be a crucial breakthrough. Much of this data neither needs to be nor can practically be stored on a blockchain, yet it should live on decentralized networks: high-frequency, low-value-density data such as user posts, likes, and comments. Ceramic provides a storage paradigm for exactly this kind of data.
Composable initial data is a key scenario for innovation in the new era and an important sign of the end of data hegemony and data monopoly. It can solve the cold start problem for startups in terms of data, combining mature datasets and new datasets, enabling startups to establish data competitive advantages more quickly. At the same time, it allows startups to focus on incremental data value and data freshness, thereby gaining sustained competitiveness for their innovative ideas. In this way, a large amount of data will not become a moat for large companies.
Scenario 3: Public Data Mining
Note: Orange represents the intersection of multiple categories
Public data mining is not a new application scenario, but it has received unprecedented emphasis in the Web3 technology stack.
Traditional public data includes public data generated by party and government agencies and enterprises in the course of performing their duties or providing public services. Regulatory agencies encourage the provision of such data to society in the form of models, verifications, and other products and services, under the premise of protecting personal privacy and ensuring public safety, according to the requirements of "raw data not leaving the domain, data being usable but not visible." They adopt a traditional technology stack (blue and some orange, with orange representing the intersection of multiple types of technology stacks, the same below).
In Web3, transaction data and activity data on the blockchain represent another type of public data, characterized by being "available and visible," thus lacking data privacy, data security, and confirmation of data usage authorization capabilities, making them true "public goods." They adopt a technology stack centered on blockchain and smart contracts (yellow and some orange).
Data on decentralized storage is mostly Web3 application data other than transactions, primarily based on file and object storage, and the corresponding technology stack is still immature (green and some orange). The common issues in the production and mining utilization of this type of public data include hot and cold storage, indexing, state synchronization, permission management, and computation, among others.
This scenario has given rise to many data applications that are better described as data tools than data infrastructure, including Nansen, Dune, NFTScan, 0xScope, etc.
Case: Data Exchange
A data exchange refers to a platform that trades data as a commodity. They can be classified and compared based on trading objects, pricing mechanisms, quality assurance, and other aspects. DataStreamX, Dawex, and Ocean Protocol are several typical data exchanges in the market.
Ocean Protocol (with a market cap of $200 million) is an open-source protocol designed to enable businesses and individuals to exchange and monetize data and data-based services. This protocol is based on the Ethereum blockchain and uses "data tokens" to control access to datasets. Data tokens are a special type of ERC20 token that represents ownership or usage rights of a dataset or a data service. Users can obtain the information they need by purchasing or earning data tokens.
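Conceptually, holding or spending a datatoken is what unlocks a dataset. The sketch below uses ethers.js with a standard ERC20 ABI fragment to check whether a wallet holds at least one datatoken before requesting access; the one-token threshold and the surrounding flow are illustrative assumptions rather than Ocean Protocol's actual order and access logic.

```typescript
import { ethers } from "ethers";

// Standard ERC20 fragment; Ocean datatokens expose the ERC20 interface.
const ERC20_ABI = ["function balanceOf(address owner) view returns (uint256)"];

// Illustrative check: does this wallet hold at least one full datatoken for the dataset?
async function holdsDatatoken(rpcUrl: string, datatokenAddress: string, wallet: string): Promise<boolean> {
  const provider = new ethers.JsonRpcProvider(rpcUrl);          // ethers v6
  const token = new ethers.Contract(datatokenAddress, ERC20_ABI, provider);
  const balance: bigint = await token.balanceOf(wallet);
  return balance >= ethers.parseUnits("1", 18);                 // assumes 18 decimals
}

// In Ocean's real flow, the consumer spends the datatoken through the protocol's order
// functions and a provider service enforces access; this only illustrates the gating idea.
```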
The technical architecture of Ocean Protocol mainly includes the following components:
- Providers: Refers to the suppliers who provide data or data services, and they can issue and sell their own data tokens through Ocean Protocol to generate income.
- Consumers: Refers to the demand side that purchases and uses data or data services, and they can buy or earn the required data tokens through Ocean Protocol to gain access.
- Marketplaces: Refers to an open, transparent, and fair data trading market provided by Ocean Protocol or third parties, connecting providers and consumers globally and offering various types and fields of data tokens. The marketplace can help organizations discover new business opportunities, increase revenue sources, optimize operational efficiency, and create more value.
- Network: Refers to a decentralized network layer provided by Ocean Protocol that supports different types and scales of data exchange while ensuring security, trustworthiness, and transparency during the data trading process. The network layer consists of a set of smart contracts used for registering data, recording ownership information, facilitating secure data exchanges, etc.
- Curator: Refers to a role in the ecosystem responsible for screening, managing, and auditing datasets. They are responsible for reviewing information about the source, content, format, and licensing of datasets to ensure that they meet standards and can be trusted and used by other users.
- Verifier: Refers to a role in the ecosystem responsible for verifying and auditing data transactions and data services. They audit and verify transactions between data service providers and consumers to ensure the quality, availability, and accuracy of data services.
Source: Ocean Protocol
The "data services" created by data providers include data, algorithms, computation, storage, analysis, and curation. These components are bound to the execution agreements of the services (such as service level agreements), secure computation, access control, and licensing. Essentially, this controls access to a "cloud service suite" through smart contracts.
Source: Ocean Protocol
Its advantages include:
- An open-source, flexible, and scalable protocol that helps organizations and individuals create their own data ecosystems.
- A decentralized network layer based on blockchain technology that ensures security, trustworthiness, and transparency during data transactions while protecting the privacy and rights of providers and consumers.
- An open, transparent, and fair data market that connects providers and consumers globally and offers data tokens of various types and fields.
Ocean Protocol is a typical representative of a hybrid architecture. Its data can be stored in different places, including traditional cloud storage services, decentralized storage networks, or the data provider's own servers. The protocol uses data tokens and data non-fungible tokens (data NFTs) to identify and manage data ownership and access rights. Additionally, the protocol provides a compute-to-data functionality, allowing data consumers to analyze and process data without exposing the raw data.
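The idea behind compute-to-data is that the algorithm travels to the data and only aggregate results leave the provider's environment. The sketch below illustrates that pattern generically; the types and the set of supported aggregates are assumptions for illustration, not Ocean's SDK.

```typescript
// Hypothetical job description: the consumer never sees rows, only the aggregate result.
interface ComputeJob {
  datasetId: string;                 // resolved and loaded inside the provider's environment
  aggregate: "count" | "sum" | "mean";
  column: string;
}

type Row = Record<string, number>;

// Runs inside the data provider's environment; raw rows never cross this boundary.
function runComputeToData(rows: Row[], job: ComputeJob): number {
  const values = rows.map((r) => r[job.column]).filter((v) => Number.isFinite(v));
  switch (job.aggregate) {
    case "count": return values.length;
    case "sum": return values.reduce((a, b) => a + b, 0);
    case "mean": return values.length ? values.reduce((a, b) => a + b, 0) / values.length : 0;
  }
}

// The consumer only ever receives the scalar result (plus, in Ocean's case, verifiable metadata
// about how the computation was run).
```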
Source: Ocean Protocol
While Ocean Protocol is currently one of the most complete data trading platforms on the market, it still faces many challenges:
- Establishing an effective trust mechanism to increase trust between data providers and consumers and reduce transaction risks. For example, establishing a credit system for the data factor market to identify data trading dishonesty, incentivize good faith, punish dishonesty, repair credit, and handle disputes, with evidence and verification through blockchain.
- Establishing a reasonable pricing mechanism to reflect the true value of data products, incentivizing data providers to offer high-quality data and attracting more consumers.
- Establishing a unified standard specification to promote interoperability and compatibility between data of different formats, types, sources, and uses.
Case: Data Model Market
Ceramic has described, within its data universe, the goal of creating an open data model marketplace, since data needs interoperability and interoperability greatly enhances productivity. Such a marketplace is reached through emergent consensus on data models, similar to the ERC contract standards on Ethereum: developers pick a model as a functional template, and their application automatically encompasses all data conforming to that model. At this stage, such a marketplace is not yet a trading market.
As a simple example of a data model, in a decentralized social network the data model can be reduced to four definitions (sketched in TypeScript after the list):
- PostList: Index of user posts
- Post: Storage of a single post
- Profile: Storage of user profiles
- FollowList: Storage of user follow lists
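Expressed as TypeScript types, that minimal social data model might look like the sketch below; the field names inside each part (body, createdAt, and so on) are illustrative rather than a published Ceramic schema.

```typescript
// Illustrative shape of the four-part social data model described above.
interface Post {
  body: string;            // post content
  createdAt: string;       // ISO timestamp
}

interface PostList {
  posts: string[];         // stream/document IDs pointing to individual Post records
}

interface Profile {
  name: string;
  avatarUrl?: string;
}

interface FollowList {
  following: string[];     // DIDs of followed accounts
}

// Any application that agrees on this model can read and write the same user data,
// which is what makes the data composable across apps.
```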
So how can data models be created, shared, and reused on Ceramic to achieve cross-application data interoperability?
Ceramic provides a DataModels Registry, an open-source, community-built repository of reusable application data models for Ceramic. Here, developers can publicly register, discover, and reuse existing data models; this is the foundation for building interoperable applications on shared data models. The registry is currently stored on GitHub and will be decentralized onto Ceramic in the future.
All data models added to the registry will be automatically published under the npm package @datamodels. Any developer can install one or more data models using @datamodels/model-name, making these models available for use at runtime with any IDX client to store or retrieve data, including DID DataStore or Self.ID.
Additionally, Ceramic has set up a DataModels forum on GitHub, where each model in the registry has its own discussion thread for community comments. Developers can also use the forum to post ideas for data models and gather community feedback before adding them to the registry. Everything is still at an early stage, and the registry does not yet contain many data models. Models entering the registry are expected to be evaluated by the community and adopted as CIP standards, similar to Ethereum's smart contract standards, providing composability for data.
Case: Decentralized Data Warehouse
Space and Time is the first decentralized data warehouse that connects on-chain and off-chain data to support a new generation of smart contract use cases. Space and Time (SxT) has the industry's most mature blockchain indexing service, and the SxT data warehouse employs a novel cryptographic method called Proof of SQL™ to generate verifiable, tamper-proof results. Developers can join trustless on-chain and off-chain data with plain SQL and load the results directly into smart contracts, supporting sub-second queries and enterprise-grade analytics in a fully tamper-proof, blockchain-anchored manner.
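For intuition, a query that joins indexed on-chain activity with enterprise off-chain records might look like the sketch below. The table and column names and the client call are hypothetical, not Space and Time's actual schema or SDK.

```typescript
// Illustrative only: hypothetical tables joining on-chain transfers with off-chain orders.
const query = `
  SELECT t.wallet_address,
         COUNT(*)           AS onchain_transfers,   -- indexed on-chain data
         AVG(o.order_value) AS avg_offchain_order   -- enterprise off-chain data
  FROM   eth.token_transfers AS t
  JOIN   shop.orders         AS o
    ON   o.wallet_address = t.wallet_address
  GROUP  BY t.wallet_address
`;

// A hypothetical client submits the query and receives both the result set and a
// cryptographic proof that it was computed over untampered tables.
async function runVerifiedQuery(
  submit: (sql: string) => Promise<{ rows: unknown[]; proof: string }>
) {
  const { rows, proof } = await submit(query);
  return { rows, proof }; // the proof can be checked before the result is trusted on-chain
}
```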
Space and Time consists of a two-layer network, comprising a validator layer and a data warehouse. The success of the SxT platform depends on the seamless interaction between validators and the data warehouse to facilitate simple and secure queries of on-chain and off-chain data.
The data warehouse consists of a network of databases and compute clusters that are controlled by Space and Time validators, which route requests to them. Space and Time employs a highly flexible warehousing solution: HTAP (hybrid transactional/analytical processing).
Validators monitor, command, and verify the services provided by these clusters, orchestrating the data flow and queries between end-users and data warehouse clusters. Validators provide a means for data to enter the system (such as blockchain indexing) and exit the system (such as smart contracts).
- Routing: supports transaction and query interactions with the decentralized data warehouse network
- Streaming: acts as a receiver for high-volume customer streaming (event-driven) workloads
- Consensus: provides high-performance Byzantine fault tolerance for data entering and exiting the platform
- Query proof: provides SQL proofs to the platform
- Table anchor: provides storage proofs to the platform by anchoring tables on-chain
- Oracle: supports Web3 interactions, including smart contract event listening and cross-chain messaging/relaying
- Security: prevents unauthorized and unverified access to the platform
As a platform, Space and Time is the world's first decentralized data structure, opening up a powerful yet underserved market: data sharing. Within the Space and Time platform, companies can freely share data and can trade shared data using smart contracts. Additionally, datasets can be monetized in an aggregated manner through SQL proof without requiring consumers to access the raw data. Data consumers can trust that the aggregation is accurate without seeing the data itself, so data providers no longer have to be data consumers. For this reason, the combination of SQL proof and data structure architecture has the potential to democratize data operations, as anyone can contribute to the ingestion, transformation, and servicing of datasets.
Web3 Data Governance and Discovery
Currently, there is a lack of a practical and efficient data governance architecture in the Web3 data infrastructure architecture. However, a practical and efficient data governance infrastructure is crucial for allocating relevant rights of data factors to all participants.
- Data sources need the right to obtain, copy, and transfer data freely, with informed consent.
- Data processors need the right to autonomously control and use data and to derive benefits from it.
- Data derivatives need operational rights.
Today, Web3 data governance capabilities are one-dimensional: assets and data (including on Ceramic) are typically controlled only through private keys, with almost no support for hierarchical classification or fine-grained configuration. Recently, the innovative mechanisms of Tableland, FEVM, and Greenfield have made trustless data governance possible to some extent. Traditional data governance tools like Collibra are generally only usable inside an enterprise and rely on platform-level trust, and their centralized technology cannot prevent individual malfeasance or single points of failure. Data governance tools like Tableland can provide the security technology, standards, and solutions that data flows require.
Case: Tableland
Tableland Network is a decentralized web3 protocol for structured relational data, starting from Ethereum (EVM) and EVM-compatible L2. With Tableland, traditional web2 relational database functionalities can now be achieved by utilizing blockchain layers for access control. However, Tableland is not a new database—it is simply a web3-native relational table.
Tableland provides a new way for dapps to store relational data in a web3-native network, without the usual trade-offs between decentralization and mutable, queryable relational data.
Solution
Using Tableland, metadata can be mutated (with access control if needed), queried (using familiar SQL), and composed (with other tables on Tableland), all in a fully decentralized manner.
Tableland breaks down traditional relational databases into two main components: an on-chain registry with access control logic (ACL) and off-chain (decentralized) tables. Each table in Tableland is initially minted as an ERC721 token on the underlying EVM-compatible layer. Therefore, on-chain table owners can set ACL permissions for the table, while the off-chain Tableland network manages the creation and subsequent changes of the table itself. The links between on-chain and off-chain are handled at the contract level, simply pointing to the Tableland network (using baseURI + tokenURI, similar to many existing ERC721 tokens that use IPFS gateways or hosted servers for metadata).
Only those with appropriate on-chain permissions can write to specific tables. However, table reads do not necessarily have to be on-chain operations and can use the Tableland gateway; thus, read queries are free and can come from simple front-end requests or even from other non-EVM blockchains. Now, to use Tableland, a table must first be created (i.e., minted as an ERC721 on-chain). The deployment address is initially set to the table owner, and this owner can set permissions for any other users attempting to interact with the table. For example, the owner can set rules for who can update/insert/delete values, what data they can change, and even decide whether they are willing to transfer ownership of the table to another party. Additionally, more complex queries can connect data from multiple tables (owned or not) to create a fully dynamic and composable relational data layer.
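The sketch below illustrates this split: a create statement that mints the table as an on-chain ERC721, a write that must pass the owner's policy, and a free read served off-chain. The client interface and method names are hypothetical placeholders standing in for an SDK, not the official Tableland API.

```typescript
// Hypothetical client interface standing in for a Tableland SDK / gateway.
interface TablelandClient {
  create(sql: string): Promise<{ tableName: string; tokenId: string }>; // mints the table NFT on-chain
  write(sql: string): Promise<void>;                                    // subject to the table's on-chain ACL
  read<T>(sql: string): Promise<T[]>;                                   // served off-chain via the gateway, free
}

async function demo(db: TablelandClient) {
  // 1. Creating a table mints an ERC721; the minter becomes the table owner.
  const { tableName } = await db.create(
    "CREATE TABLE app_profiles (id integer primary key, handle text, avatar_cid text)"
  );

  // 2. Writes are relayed as events and executed by validators only if the caller passes the ACL.
  await db.write(`INSERT INTO ${tableName} (handle, avatar_cid) VALUES ('alice', 'bafy...')`);

  // 3. Reads are plain SQL against the materialized off-chain table.
  const rows = await db.read<{ handle: string }>(`SELECT handle FROM ${tableName}`);
  console.log(rows);
}
```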
Consider the following diagram, which summarizes the interaction of new users with tables already deployed to Tableland by certain dapps:
Here is the overall information flow:
1. A new user interacts with the dapp's UI and attempts to update some information stored in a Tableland table.
2. The dapp calls the Tableland registry smart contract to execute this SQL statement, and that contract checks the dapp's own smart contract, which contains the custom ACL defining the permissions for this new user. A few points to note:
   - The custom ACL in the dapp's separate smart contract is entirely optional and an advanced use case; developers do not need to implement a custom ACL and can use the default policy of the Tableland registry contract (only the owner has full permissions).
   - Write queries can also use the gateway instead of calling the Tableland smart contract directly. The dapp can always call the Tableland smart contract directly, but any query can be sent through the gateway, which relays the query to the smart contract in a subsidized manner.
3. The Tableland smart contract takes the user's SQL statement and permissions and merges them into emitted events describing the SQL-based operations to be taken.
4. Tableland validator nodes listen for these events and then take one of the following actions:
   - If the user has the correct permissions to write to the table, the validator runs the SQL command accordingly (e.g., inserting a new row or updating an existing value) and broadcasts the confirmation data to the Tableland network.
   - If the user does not have the correct permissions, the validator takes no action on the table.
   - If the request is a simple read query, the corresponding data is returned; Tableland is a fully open relational data network where anyone can perform read-only queries on any table.
5. The dapp reflects any updates that occur on the Tableland network through the gateway.
Usage Scenarios: What to Avoid
- Personal identity data: Tableland is an open network where anyone can read data from any table, so personal data should not be stored in Tableland.
- High-frequency, sub-second writes: for example, high-frequency trading bots.
- Storing every user interaction in an application: it rarely makes sense to put data such as keystrokes or clicks in web3 tables; the write frequency would drive costs up.
- Very large datasets: these are better handled with file storage solutions such as IPFS, Filecoin, or Arweave. However, pointers to those locations and the related metadata are a good use case for Tableland tables (see the sketch after this list).
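As noted in the last item, the pointer-table pattern keeps large files in IPFS, Filecoin, or Arweave while Tableland holds the queryable pointers and metadata. The schema and values below are an illustrative assumption of what such a table might look like, not a prescribed layout.

```typescript
// Illustrative schema for the pointer-table pattern: large files live on IPFS/Filecoin/Arweave,
// while the table holds queryable pointers and metadata. Names and values are hypothetical.
const createPointerTable = `
  CREATE TABLE dataset_files (
    id          integer primary key,
    cid         text,      -- IPFS/Filecoin content identifier of the large file
    storage     text,      -- 'ipfs' | 'filecoin' | 'arweave'
    media_type  text,      -- e.g. 'image/png'
    size_bytes  integer,
    created_at  integer    -- unix timestamp
  )
`;

const insertPointer = `
  INSERT INTO dataset_files (cid, storage, media_type, size_bytes, created_at)
  VALUES ('bafybeigexample', 'ipfs', 'image/png', 482133, 1690000000)
`;
```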
Thoughts on Value Capture
Different units in the entire data infrastructure architecture have irreplaceable roles, and their value capture is mainly reflected in market capitalization/valuation and estimated revenue, leading to the following conclusions:
- Data sources are the modules with the highest value capture in the entire architecture.
- Data replication, transformation, streaming processing, and data warehouses follow.
- The analytics layer may have good cash flow, but its valuation will have an upper limit.
In simple terms, companies/projects on the left side of the overall structure diagram tend to have greater value capture.
Industry Concentration
Based on a rough statistical analysis, industry concentration can be judged as follows:
- The highest industry concentration is in the data storage and data querying and processing modules.
- The industry concentration is moderate in data extraction and transformation.
- The industry concentration is relatively low in data sources, analytics, and output modules.
The low industry concentration in data sources, analytics, and output is preliminarily judged to be due to different business scenarios, allowing for the emergence of leading players in each vertical scenario, such as Oracle in the database field, Stripe in third-party services, Salesforce in enterprise services, Tableau in dashboard analytics, and Sisense in embedded analytics, etc.
The moderate industry concentration in the data extraction and transformation module is preliminarily judged to be due to the technical orientation of business attributes. The modular middleware form also makes switching costs relatively low.
The highest industry concentration in data storage and data querying and processing modules is preliminarily judged to be due to the singularity of business scenarios, high technical content, high startup costs, and significant costs associated with subsequent switching, giving companies/projects a strong first-mover advantage and network effects.
Business Models and Exit Paths of Data Protocols
From the perspective of establishment time and listing:
- Most of the companies/projects established before 2010 are data source companies/projects, as the mobile internet had not yet risen, and the data volume was not very large. There were also some data storage and analysis output projects, mainly dashboard-related.
- From 2010 to 2014, on the eve of the rise of the mobile internet, data storage and querying projects like Snowflake and Databricks emerged, and data extraction and transformation projects also began to appear, gradually perfecting a mature big data management technology solution, during which a large number of analysis output projects emerged, mainly dashboard-related.
- From 2015 to 2020, querying and processing projects sprang up like mushrooms, and a large number of data extraction and transformation projects continued to appear, allowing people to better leverage the power of big data.
- After 2020, newer real-time analytical databases and data lake solutions emerged, such as Clickhouse and Tabular.
- The improvement of infrastructure is the prerequisite for so-called "mass adoption." During large-scale applications, new opportunities continue to arise, but these opportunities almost exclusively belong to "middleware," while underlying solutions like data warehouses and data sources are almost a winner-takes-all situation, making it difficult to grow unless there are substantial technological breakthroughs.
Analysis output projects have always been opportunities for entrepreneurial projects, regardless of the period. However, they are also constantly iterating and innovating, doing new things based on new scenarios. Tableau, which emerged before 2010, occupies most of the desktop dashboard analysis tool market, while newer scenarios include more professionally oriented DS/ML tools, more comprehensive data workstations, and more SaaS-oriented embedded analytics, etc.
From this perspective, the current data protocols in Web3:
- Data source and storage projects are still unsettled, but leading players are emerging: on-chain state storage is led by Ethereum (with a market cap of $22 billion), and decentralized storage is led by Filecoin (with a market cap of $2.3 billion) and Arweave (with a market cap of $280 million), with room for upstarts like Greenfield. Value capture: highest.
- Data extraction and transformation projects still have room for innovation; the data oracle Chainlink (with a market cap of $3.8 billion) is just the beginning, and event-stream and stream-processing infrastructure such as Ceramic, along with more projects, will emerge, though the space is limited. Value capture: medium.
- Querying and processing projects, such as The Graph (with a market cap of $1.2 billion), can already meet most needs, but the types and number of projects have not yet reached an explosive period. Value capture: medium.
- Data analysis projects, led by Nansen and Dune (with a valuation of $1 billion), need new scenarios to create new opportunities; NFTScan and NFTGo resemble new scenarios, but they are content updates rather than new demands at the level of analytical logic or paradigm. Value capture: moderate, with considerable cash flow.
However, Web3 is not a replica of Web2, nor merely an evolution of it. Web3 has its own native mission and scenarios, which give rise to entirely different business scenarios (the three scenarios above are the abstractions that can currently be made).