From Islands to Collaboration: The Significance of Web3 Native Data Pipelines

Deep Tide TechFlow
2023-08-12 11:48:44
Beyond its decentralized characteristics, building data pipelines in the Web3 market can also serve as a practical starting point for capturing the opportunities the industry promises.

Written by: Jay : : FP

Compiled by: Deep Tide TechFlow

The release of the Bitcoin white paper in 2008 sparked a rethinking of the concept of trust. Blockchain subsequently broadened that concept into trustless systems and evolved rapidly, promising that values such as individual sovereignty, financial democratization, and ownership could be applied to existing systems. Of course, a significant amount of validation and discussion may be required before blockchain can be applied practically, as its characteristics can seem radical compared to existing systems. If we take an optimistic view of these scenarios, however, building data pipelines and analyzing the valuable information held in blockchain storage has the potential to become another significant turning point in the industry's development, because it would give us Web3-native business intelligence that has never existed before.

This article explores the potential of Web3-native data pipelines by projecting commonly used data pipelines in the existing IT market into the Web3 environment. The article discusses the benefits of these pipelines, the challenges that need to be addressed, and their impact on the industry.

1. Singularity Comes from Information Innovation

"Language is one of the most important distinctions between humans and lower animals. It is not merely the ability to produce sounds, but to associate distinct sounds with distinct thoughts and to use these sounds as symbols for the communication of thoughts." --- Darwin

Historically, significant advancements in human civilization have accompanied innovations in information sharing. Our ancestors used language, both oral and written, to communicate with each other and pass knowledge to future generations. This gave them a significant advantage over other species. The invention of writing, paper, and printing made it possible to share information more widely, leading to significant advancements in science, technology, and culture. In particular, the metal movable type printing of the Gutenberg Bible was a watershed moment, as it enabled the mass production of books and other printed materials. This had a profound impact on the onset of the Reformation, democratic revolutions, and scientific progress.

The rapid development of IT technology in the 2000s allowed us to gain deeper insights into human behavior. This led to changes in lifestyle, with most modern individuals making various decisions based on digital information. For this reason, we refer to modern society as the "IT Innovation Era."

Just 20 years after the full commercialization of the internet, artificial intelligence technology has once again amazed the world. Numerous applications capable of replacing human labor have emerged, and many are discussing how AI will change civilization. Some even find themselves in denial, wondering how such a technology could emerge so rapidly as to shake the foundations of our society. Despite "Moore's Law" indicating that semiconductor performance will grow exponentially over time, the changes brought about by the emergence of GPT have been too sudden to confront immediately.

Interestingly, the GPT model itself is not an especially groundbreaking architecture. Rather, the AI industry attributes the model's success mainly to two factors: 1) defining business domains that can serve a large customer base, and 2) tuning the model through data pipelines that run from data collection to final results and the feedback on those results. In short, these applications achieve innovation by refining the purpose of service delivery and upgrading the data and information processing process.

2. Data-Driven Decision Making is Ubiquitous

Most of the innovations we refer to are actually based on the processing of accumulated data rather than on chance or intuition. As the saying goes, "In a capitalist market, it is not the strong that survive, but the survivors that are strong." Today's businesses face fierce competition in a saturated market, so companies collect and analyze all kinds of data to seize even the smallest niches.

We may be overly enamored with Schumpeter's theory of "creative destruction," placing too much emphasis on intuitive decision-making. However, even excellent intuition is ultimately a product of an individual's accumulated data and information. The digital world will penetrate our lives more deeply in the future, with more sensitive information presented in the form of digital data.

The Web3 market has garnered widespread attention for its potential to empower users with control over their data. However, the blockchain field, as the foundational technology of Web3, is currently more focused on solving the trilemma (security, decentralization, and scalability). To make new technologies compelling in the real world, it is crucial to develop applications and intelligence that can be used in various ways. We have seen this happen in the big data field, where significant progress has been made in building methodologies for big data processing and data pipelines since around 2010. In the context of Web3, efforts must be made to drive industry development and establish data flow systems to generate data-driven intelligence.

3. Opportunities Based on On-Chain Data Flows

So, what opportunities can we capture from Web3-native data flow systems, and what challenges need to be addressed to seize these opportunities?

3.1 Advantages

In short, the value of configuring Web3-native data flows lies in the ability to securely and effectively distribute reliable data to multiple entities, thereby extracting valuable insights.

  • Data Redundancy ------ On-chain data is less likely to be lost and is more resilient because protocol networks store data fragments across multiple nodes.
  • Data Security ------ On-chain data is tamper-proof, as it is verified and consensus is reached by a network of decentralized nodes.
  • Data Sovereignty ------ Data sovereignty is the right of users to own and control their own data. Through on-chain data flows, users can see how their data is being used and choose to share it only with those who have a legitimate need to access it.
  • Permissionlessness and Transparency ------ On-chain data can be accessed and verified by anyone without permission, which ensures that the data being processed is a reliable source of information.
  • Stable Operation ------ When data flows are orchestrated by protocols in a distributed environment, the probability of downtime is significantly reduced due to the absence of a single point of failure.

3.2 Application Cases

Trust is the foundation for different entities to interact and make decisions. Therefore, when reliable data can be securely distributed, it means that many interactions and decisions can be facilitated through Web3 services involving various entities. This helps maximize social capital, and we can imagine the following application cases.

3.2.1 Service/Protocol Applications

  • Rule-Based Automated Decision-Making Systems ------ Protocols use key parameters to run services. These parameters are adjusted periodically to stabilize service status and provide the best experience for users. However, protocol teams cannot always monitor service status and dynamically change parameters in a timely manner. This is where on-chain data flows come into play: they can be used for real-time analysis of service status and to suggest the parameter set that best matches service requirements (e.g., applying an automatic floating interest rate mechanism to lending protocols; see the sketch after this list).

  • Growth of Credit Markets ------ Traditionally, credit has been used to measure an individual's repayment ability in financial markets. This helps improve market efficiency. However, in the Web3 market, the definition of credit remains unclear. This is due to the scarcity of personal data and the lack of data governance across industries. Therefore, integrating and collecting information becomes challenging. By building a process that collects and processes fragmented on-chain data, the credit market in the Web3 market can be redefined (e.g., Spectral's MACRO (Multi-Asset Credit Risk Oracle) score).
  • Decentralized Social/NFT Expansion ------ Decentralized societies prioritize user control, privacy protection, resistance to censorship, and community governance. This provides an alternative social paradigm. Therefore, a pipeline can be established to more smoothly control and update various metadata and facilitate migration between platforms.
  • Fraud Detection ------ Web3 services built on smart contracts are vulnerable to malicious attacks that can steal funds, compromise systems, and trigger depegging and liquidity attacks. By creating a system capable of detecting these attacks in advance, Web3 services can formulate rapid response plans and protect users from harm.
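
As referenced in the first item above, here is a minimal sketch of a rule-based floating interest rate mechanism. The kinked utilization curve is modeled on designs popularized by lending protocols such as Compound and Aave, but all names and parameter values here are illustrative assumptions, not any protocol's actual settings.

```python
# Minimal sketch of a rule-based floating interest rate mechanism.
# All parameter values are illustrative assumptions.

BASE_RATE = 0.02            # rate at 0% utilization
SLOPE_LOW = 0.10            # rate increase per unit utilization below the kink
SLOPE_HIGH = 1.00           # steeper slope above the kink to discourage full utilization
OPTIMAL_UTILIZATION = 0.80  # the "kink" point

def utilization(total_borrowed: float, total_supplied: float) -> float:
    """Share of supplied liquidity currently borrowed."""
    if total_supplied == 0:
        return 0.0
    return total_borrowed / total_supplied

def borrow_rate(u: float) -> float:
    """Piecewise-linear rate: gentle below the kink, steep above it."""
    if u <= OPTIMAL_UTILIZATION:
        return BASE_RATE + SLOPE_LOW * u
    excess = u - OPTIMAL_UTILIZATION
    return BASE_RATE + SLOPE_LOW * OPTIMAL_UTILIZATION + SLOPE_HIGH * excess

# Example: a pool with 8.5M borrowed against 10M supplied
u = utilization(8_500_000, 10_000_000)
print(f"utilization={u:.0%}, borrow rate={borrow_rate(u):.2%}")
```

An on-chain data flow would feed the live pool state into such a rule and propose (or automatically apply) the resulting parameters, instead of waiting for manual governance intervention.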

3.2.2 Collaboration and Governance Initiatives

  • Fully On-Chain DAOs ------ Decentralized Autonomous Organizations (DAOs) heavily rely on off-chain tools for effective governance and public fund execution. By building an on-chain data processing workflow, a transparent process for DAO operations can further enhance the value of Web3-native DAOs.
  • Mitigating Governance Fatigue ------ Web3 protocol decisions are often made through community governance. However, many factors can make it difficult for participants to engage in governance, such as geographical barriers, monitoring pressure, lack of expertise required for governance, randomly released governance agendas, and inconvenient user experiences. If a tool can be created to simplify the process for participants from understanding to actually implementing individual governance agenda items, the protocol governance framework can operate more efficiently and effectively.
  • Open Data Platforms for Collaborative Works ------ In the existing academic and industrial sectors, many data and research materials are not publicly disclosed, which can make the overall development of the market very inefficient. On the other hand, on-chain data pools can facilitate more collaborative initiatives than existing markets, as they are transparent and accessible to anyone. The development of many token standards and DeFi solutions is a good example of this. Additionally, we can operate public data pools for various purposes.

3.2.3 Network Diagnostics

  • Index Research ------ Web3 users create various metrics to analyze and compare the status of protocols. Multiple objective metrics can be studied and displayed in real time (e.g., the Nakamoto coefficient displayed by Nakaflow).
  • Protocol Metrics ------ By processing data such as the number of active addresses, transaction volume, asset inflow/outflow, and fees generated by the network, the performance of protocols can be analyzed. This information can be used to assess the impact of specific protocol updates, the status of MEV, and the health of the network.
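
As a rough illustration of the protocol metrics above, the sketch below computes daily active addresses, transaction counts, and transfer volume from raw transaction records. The record fields and sample data are assumptions for illustration; in practice the records would come from an indexer or node RPC service.

```python
# Minimal sketch of computing simple protocol health metrics from raw
# transaction records. Field names and sample data are illustrative.
from collections import defaultdict
from datetime import datetime, timezone

def daily_metrics(transactions):
    days = defaultdict(lambda: {"addresses": set(), "tx_count": 0, "volume": 0.0})
    for tx in transactions:
        day = datetime.fromtimestamp(tx["timestamp"], tz=timezone.utc).date()
        bucket = days[day]
        bucket["addresses"].update([tx["from"], tx["to"]])  # active addresses
        bucket["tx_count"] += 1
        bucket["volume"] += tx["value"]
    return {
        day: {"active_addresses": len(b["addresses"]),
              "tx_count": b["tx_count"],
              "volume": b["volume"]}
        for day, b in sorted(days.items())
    }

sample = [
    {"from": "0xabc", "to": "0xdef", "value": 1.5, "timestamp": 1_691_800_000},
    {"from": "0xdef", "to": "0x123", "value": 0.3, "timestamp": 1_691_810_000},
]
print(daily_metrics(sample))
```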

3.3 Challenges

On-chain data has unique advantages that can increase industry value. However, to fully realize these advantages, many challenges within and outside the industry must be addressed.

  • Lack of Data Governance ------ Data governance is the process of establishing consistent and shared data policies and standards to facilitate the integration of each data element. Currently, each on-chain protocol establishes its own standards and retrieves its own data types. However, the problem lies in the lack of data governance among entities that aggregate this protocol data and provide API services to users. This makes it difficult to integrate services, resulting in users struggling to obtain reliable and comprehensive insights.
  • Low Cost Efficiency ------ Storing cold data in protocols can save users on data security and server costs. However, if frequent access to data for analysis is required or if substantial computational resources are needed, storing it on the blockchain may not be cost-effective.
  • Oracle Problem ------ Smart contracts can only function effectively when they have access to data from the real world. However, this data is not always reliable or consistent. Unlike blockchains that maintain integrity through consensus algorithms, external data is not deterministic. Oracle solutions must continuously evolve to ensure the integrity, quality, and scalability of external data without relying on specific application layers.
  • Protocols are Still in Their Infancy ------ Protocols use their own tokens to incentivize users to keep services running and pay for service fees. However, the parameters required to operate protocols (e.g., precise definitions of service users and incentive schemes) are often managed very immaturely. This means that the economic sustainability of protocols is difficult to verify. If many protocols organically connect and create data pipelines, the uncertainty of whether the pipelines can operate well will be even greater.
  • Slow Data Retrieval Times ------ Protocols typically process transactions through consensus among many nodes, which limits the speed and volume of information processing compared to traditional IT business logic. This bottleneck is difficult to resolve unless the performance of all protocols that make up the pipeline significantly improves.
  • The True Value of Web3 Data ------ Blockchains are isolated systems that have not yet connected with the real world. When collecting Web3 data, we need to consider whether the data being collected can provide meaningful insights that justify the cost of establishing data pipelines.
  • Unfamiliar Syntax ------ Existing IT data infrastructure and blockchain infrastructure operate very differently. Even the programming languages used are different, as blockchain infrastructure often employs low-level languages or new languages specifically designed for blockchain needs. This makes it difficult for new developers and service users to learn how to handle each data primitive, as they need to learn a new programming language or a new way of thinking about processing blockchain data.

4. Pipelining Web3 Data Legos

Currently, Web3 data primitives are not connected to each other; each extracts and processes data independently, which makes it difficult to achieve synergies from experimenting with information processing. To address this, this article introduces the data pipelines commonly used in the IT market and maps existing Web3 data primitives onto them, making the use cases more concrete.

4.1 General Data Pipeline

Building a data pipeline is akin to conceptualizing and automating the repetitive decision-making processes of daily life. By doing so, individuals can access information of a specified quality whenever needed and use it for decision-making. The more unstructured data there is to process, the more frequently information is used, or the greater the need for real-time analysis, the more time and cost can be saved by automating this series of processes, yielding the proactivity required for future decision-making.

The above diagram shows a general architecture used to build data pipelines in the existing IT infrastructure market. Data suitable for analytical purposes is collected from the correct data sources and stored in appropriate storage solutions based on the nature of the data and analytical requirements. For example, data lakes provide raw data storage solutions for scalable and flexible analysis, while data warehouses focus on storing structured data for queries and analyses optimized for specific business logic. The data is then processed in various ways to generate insights or actionable information.

Each solution layer can also be offered as a packaged service. The ETL (Extract, Transform, Load) SaaS product group, which connects the series of processes from data extraction to loading, is also gaining attention (e.g., Fivetran, Panoply, Hivo, Rivery). The sequence is not always unidirectional; depending on an organization's specific needs, the layers can connect in various ways. The most important thing when building a data pipeline is to minimize the risk of data loss when sending and receiving data at each server layer, which can be achieved by properly decoupling servers and using reliable data storage and processing solutions.
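
For concreteness, the sketch below compresses the extract-transform-load pattern described above into a few lines of Python, using an in-memory SQLite database as a stand-in warehouse. In a production pipeline each stage would be a decoupled service; the source records here are illustrative assumptions.

```python
# Minimal sketch of the extract -> transform -> load pattern, with
# SQLite standing in for the warehouse. Real pipelines would swap each
# stage for dedicated services (queues, data lakes, warehouse loaders).
import sqlite3

def extract():
    # In practice: pull from APIs, logs, or event streams.
    return [{"user": "alice", "amount": "120.5"}, {"user": "bob", "amount": "80"}]

def transform(rows):
    # Normalize types and drop bad records so downstream queries stay clean.
    return [(r["user"], float(r["amount"])) for r in rows if r.get("amount")]

def load(records, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS payments (user TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT user, SUM(amount) FROM payments GROUP BY user").fetchall())
```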

4.2 Pipelines with On-Chain Environments

The conceptual diagram of the data pipeline introduced earlier can be applied to on-chain environments, as shown in the above image. Note, however, that fully decentralized pipelines cannot yet be formed, as each fundamental component relies on centralized off-chain solutions to some extent. The image also does not include all Web3 solutions, and the classification boundaries can be ambiguous: KYVE, for example, is not only a streaming platform but also includes data lake functionality, so it can be seen as a data pipeline in itself; likewise, Space and Time is classified as a decentralized database, but it also provides API gateway services such as REST APIs and streaming, as well as ETL services.

4.2.1 Capture/Processing

To enable ordinary users or dApps to efficiently use and operate services, they need to easily identify and access the data generated within protocols, such as transactions, states, and log events. This layer acts as middleware, facilitating processes that include oracles, messaging, authentication, and API management. The main solutions are listed below, followed by a minimal sketch of reading raw data from a node RPC endpoint.

Streaming/Indexing Platforms

Bitquery, Ceramic, KYVE, Lens, Streamr Network, The Graph, various protocol block explorers, etc.

Node-as-a-Service and Other RPC/API Services

Alchemy, All that Node, Infura, Pocket Network, Quicknode, etc.

Oracles

API3, Band Protocol, Chainlink, Nest Protocol, Pyth, Supra Oracles, etc.
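
As mentioned above, here is a minimal sketch of the capture layer: pulling raw transactions from a node RPC endpoint with web3.py. The endpoint URL is a placeholder that would in practice come from a node provider such as those listed above.

```python
# Minimal sketch of the capture layer: reading raw transactions from a
# node RPC endpoint with web3.py. The endpoint URL is a placeholder.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://YOUR-RPC-ENDPOINT"))

# Pull the latest block with full transaction objects.
block = w3.eth.get_block("latest", full_transactions=True)
print(f"block {block.number}: {len(block.transactions)} transactions")

# Emit a normalized record per transaction for downstream storage layers.
for tx in block.transactions:
    record = {
        "hash": tx["hash"].hex(),
        "from": tx["from"],
        "to": tx["to"],
        "value_wei": tx["value"],
    }
    print(record)
```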

4.2.2 Storage

Compared to Web2 storage solutions, Web3 storage solutions have several advantages, such as persistence and decentralization. However, they also have some drawbacks, such as high costs and difficulties in data updates and queries. Therefore, various solutions have emerged to address these drawbacks and achieve efficient processing of structured and dynamic data on Web3—each solution has its own characteristics, such as the types of data processed, whether it is structured, and whether it has embedded query capabilities.

Decentralized Storage Networks

Arweave, Filecoin, KYVE, Sia, Storj, etc.

Decentralized Databases

Arweave-based databases (Glacier, HollowDB, Kwil, WeaveDB), ComposeDB, OrbitDB, Polybase, Space and Time, Tableland, etc.

* Each protocol has different permanent storage mechanisms. For example, Arweave is based on a blockchain model, similar to Ethereum storage, permanently storing data on-chain, while Filecoin, Sia, and Storj are contract-based models that store data off-chain.

4.2.3 Transformation

In the context of Web3, the transformation layer is as important as the storage layer, because blockchains are fundamentally collections of distributed nodes, which lend themselves to scalable backend logic. In the AI industry, there is active exploration of these advantages for research in federated learning, and protocols specifically designed for machine learning and AI operations have emerged.

Data Training/Modeling/Computing

Akash, Bacalhau, Bittensor, Gensyn, Golem, Together, etc.

* Federated learning is a method of training AI models by distributing a shared model to multiple local clients, training it on each client's locally stored data, and then aggregating the learned parameters on a central server.
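
The sketch below illustrates the federated averaging loop described in the note above: each client trains a copy of the model on its private data, and only the learned parameters are aggregated centrally. The linear model, data, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of federated averaging (FedAvg): clients train locally,
# only parameters travel to the coordinator. Model and data are illustrative.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=20):
    """One client's local gradient-descent training on its private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):  # three clients, each holding private local data
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

global_w = np.zeros(2)
for _ in range(5):  # each round: broadcast model, train locally, average
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(local_ws, axis=0)  # equal weights: equal data sizes
print("recovered weights:", global_w)
```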

4.2.4 Analysis/Usage

The dashboard services and end-user insight and analysis solutions listed below are platforms that allow users to observe and discover various insights about specific protocols. Some of these solutions also provide API services for end products. It is important to note, however, that the data in these solutions is not always accurate, as most of them use separate off-chain tools to store and process data; discrepancies can also be observed between solutions.

At the same time, "Web3 Functions" platforms can automatically trigger the execution of smart contracts, much as centralized platforms like Google Cloud trigger specific business logic. Using such platforms, users can implement business logic in a Web3-native way rather than merely gaining insights by processing on-chain data; the sketch below shows the general shape of this pattern.
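
Below is a keeper-style polling loop that captures the idea: check an on-chain condition and trigger a contract call when it holds. This is a generic illustration, not the actual API of Gelato Network or Chainlink Functions; the RPC endpoint, addresses, and contract interface are hypothetical.

```python
# Generic keeper-style sketch of the "Web3 Functions" pattern. The RPC
# URL, addresses, and contract interface below are hypothetical.
import time
from web3 import Web3

# ABI for a hypothetical contract exposing shouldRebalance() and rebalance().
ABI = [
    {"name": "shouldRebalance", "type": "function", "inputs": [],
     "outputs": [{"name": "", "type": "bool"}], "stateMutability": "view"},
    {"name": "rebalance", "type": "function", "inputs": [],
     "outputs": [], "stateMutability": "nonpayable"},
]

def run_keeper(rpc_url, contract_address, keeper_address, poll_seconds=60):
    w3 = Web3(Web3.HTTPProvider(rpc_url))
    contract = w3.eth.contract(address=contract_address, abi=ABI)
    while True:
        # Read-only check: does the contract report that work is needed?
        if contract.functions.shouldRebalance().call():
            tx = contract.functions.rebalance().build_transaction({
                "from": keeper_address,
                "nonce": w3.eth.get_transaction_count(keeper_address),
            })
            # Signing/broadcasting omitted: a real keeper would sign with a
            # managed key and submit via w3.eth.send_raw_transaction.
            print("condition met, built transaction:", tx)
        time.sleep(poll_seconds)

# Example (placeholder values):
# run_keeper("https://YOUR-RPC-ENDPOINT", "0xContract...", "0xKeeper...")
```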

Dashboard Services

Dune Analytics, Flipside Crypto, Footprint, Transpose, etc.

End-User Insights and Analysis

Chainalysis, Glassnode, Messari, Nansen, The Tie, Token Terminal, etc.

Web3 Functions

Chainlink Functions, Gelato Network, etc.

5. Concluding Thoughts

As Kant said, we can only witness the phenomena of things and cannot touch their essence. Nevertheless, we utilize the observational records known as "data" to process information and knowledge, and we see how innovations in information technology drive the development of civilization. Therefore, building a data pipeline in the Web3 market, besides having decentralized characteristics, can play a key role as a practical starting point for capturing these opportunities. I would like to summarize this article with a few thoughts.

5.1 The Role of Storage Solutions Will Become More Important

The most important prerequisite for a data pipeline is establishing data and API governance. In an increasingly diverse ecosystem, each protocol's standards will keep being recreated, and the fragmented transaction records of a multi-chain ecosystem will make it harder for individuals to derive comprehensive insights. In this context, "storage solutions" are the entities capable of providing integrated data in a unified format by collecting fragmented information and keeping up with each protocol's standards. We can observe this in the existing market, where storage solutions such as Snowflake and Databricks are evolving rapidly on the back of large customer bases, leading industry development through vertical integration across the various layers of the pipeline.

5.2 Opportunities in the Data Source Market

As data becomes more accessible and processing processes improve, successful use cases begin to emerge. This creates a positive feedback loop, where data sources and collection tools explode in number—since 2010, the types and quantities of digital data collected have grown exponentially due to significant advancements in the technology for building data pipelines. Applying this context to the Web3 market, many data sources can be recursively generated on-chain in the future. This also means that blockchains will expand into various business domains. At this point, we can expect to advance data collection through data markets like Ocean Protocol or decentralized wireless (DeWi) solutions like Helium and XNET, as well as storage solutions.

5.3 The Importance of Meaningful Data and Analysis

However, the most important thing is to continuously ask what data should be prepared to extract the insights that are truly needed. Nothing is more wasteful than building a data pipeline without a clear hypothesis to validate. The existing market has achieved numerous innovations by building data pipelines, but it has also paid enormous costs through repeated, meaningless failures. Constructive discussions about the development of the technology stack are welcome, but the industry also needs time to think about more fundamental questions, such as what data should be stored in the block space and what purposes that data should serve. The "goal" should be to realize the value of Web3 through actionable intelligence and use cases; developing the fundamental components and completing the pipeline is the "means" to that goal.
