Why AI Cannot Be Separated from Blockchain: How DePIN Supports Artificial Intelligence
The author of this article, Catrina, is a special contributor to Filecoin Insights and a partner at Portal Ventures.
For a long time, startups led technological innovation, breaking free from the shackles of organizational inertia with their speed, flexibility, and entrepreneurial culture. In the era of artificial intelligence, however, all of this has been rewritten: so far, the creators of groundbreaking AI products have been established tech giants and their allies, such as Microsoft-backed OpenAI, Nvidia, Google, and even Meta.
What happened? Why did the giants beat the startups this time? Startups can write excellent code, but compared to the tech giants they face several obstacles:
High computing costs
The "reverse convexity" of AI development: concerns and uncertainty about AI's social impact, combined with a lack of necessary guidelines, hinder innovation
The AI black box problem
The "data moat" established by large tech companies creates barriers to entry
So why is blockchain needed? Where does it intersect with artificial intelligence? While it cannot solve every problem at once, Decentralized Physical Infrastructure Networks (DePIN) in Web3 create the conditions to address the issues above. The following explains how the technology behind DePIN can assist artificial intelligence, along four dimensions:
Reducing infrastructure costs
Verifying creators and identities
Filling the gap in AI democracy and transparency
Setting up a data contribution reward mechanism
In the following text:
"web3" refers to the next generation of the internet, where blockchain technology is an organic component alongside other existing technologies.
"Blockchain" refers to decentralized and distributed ledger technology.
"Crypto" refers to practices that utilize token mechanisms for incentives and decentralization.
1. Reducing Infrastructure Costs (Computing and Storage)
The trigger for every wave of technological innovation is when something expensive becomes cheap enough to waste.
From "Society's Technical Debt and Software's Gutenberg Moment" (https://skventures.substack.com/p/societys-technical-debt-and-softwares), SK Ventures
How important is affordable infrastructure? (For AI, infrastructure means the hardware for computing, transmitting, and storing data.) Carlota Perez's theory of technological revolutions (https://stratechery.com/2021/the-death-and-birth-of-technological-revolutions/) holds that technological breakthroughs unfold in two phases:
The Installation Phase, characterized by heavy venture capital investment, infrastructure build-out, and a "push" go-to-market (GTM) strategy, because customers do not yet understand the value proposition of the new technology.
The Deployment Phase, characterized by a massive increase in infrastructure supply that lowers the barrier to customer acquisition, and a "pull" GTM strategy, indicating strong product-market fit and customer appetite for products that have yet to be built.
Since products like ChatGPT have demonstrated product-market fit and customer demand, one might assume AI has entered the deployment phase. However, AI still lacks a crucial ingredient: an excess of infrastructure supply that price-sensitive startups can use to build and experiment.
Problem
The physical infrastructure market today is dominated by vertically integrated oligopolies, including AWS, GCP, Azure, Nvidia, Cloudflare, and Akamai. Margins are high: AWS is estimated to earn a 61% gross margin on commoditized computing hardware (https://www.cnbc.com/2021/09/05/how-amazon-web-services-makes-money-estimated-margins-by-service.html). New entrants in AI, especially in the LLM sector, therefore face extremely high computing costs.
Training ChatGPT is estimated to have cost $4 million, with hardware inference costs of roughly $700,000 per day.
The second version of BLOOM is expected to require $10 million to train and retrain.
If ChatGPT-style search were integrated into Google Search, Google's revenue is estimated to fall by $36 billion, with massive profits shifting from the software platform (Google) to the hardware provider (Nvidia).
Solution
DePIN networks such as Filecoin (the DePIN pioneer, dating to 2014, focused on internet-scale hardware for distributed data storage), Bacalhau (https://www.bacalhau.org/), Gensyn.ai (http://gensyn.ai/), Render Network (https://rendertoken.com/), and exaBITS (a coordination layer matching CPU/GPU supply and demand: https://www.exabits.xyz/) can save 75% to 90% or more in infrastructure costs, through the following three mechanisms:
1. Driving the Supply Curve, Stimulating Market Competition
DePIN gives hardware suppliers an equal opportunity to become service providers. It creates a marketplace where anyone can join as a "miner," exchanging CPU/GPU or storage capacity for economic rewards, which introduces competition for incumbent providers.
Companies like AWS undoubtedly enjoy a 17-year head start in user experience, operations, and vertical integration, but DePIN attracts a new user base that centralized providers have priced out. Just as eBay does not compete head-on with Bloomingdale's but offers a more economical alternative for similar needs, distributed storage networks do not replace centralized providers; they aim to serve price-sensitive users.
2. Promoting Market Economic Balance through Crypto Economic Design
The subsidy mechanisms DePIN creates draw hardware providers into the network, which in turn lowers costs for end users. To see why, compare the costs and revenues of storage providers in Web2 (AWS) and Web3 (Filecoin).
Customers benefit from lower prices: the DePIN network creates a competitive marketplace that introduces Bertrand competition (https://en.wikipedia.org/wiki/Bertrand_competition), driving down the fees customers pay. By contrast, AWS EC2 needs roughly a 55% margin on EC2 and a 31% overall margin to sustain its operations.
Token incentives/block rewards provide a new revenue source: in Filecoin, the more real data a storage provider hosts, the more block rewards (tokens) it earns, so providers are incentivized to attract more customers and grow revenue. The token structures of several emerging compute DePIN networks have not yet been disclosed, but they will likely follow a similar model (a simplified sketch of this subsidy dynamic appears after the list below). Similar networks include:
Bacalhau: A coordination layer that brings computation to the data storage location, avoiding the need to move large datasets.
exaBITS: A distributed computing network serving AI and compute-intensive applications.
Gensyn.ai: A computational protocol for deep learning models.
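To make the subsidy dynamic concrete, here is a deliberately simplified toy model in Python. The costs, margins, and subsidy amount are entirely hypothetical numbers, not actual Filecoin or AWS economics; the point is only that a block-reward subsidy lowers the minimum price a provider can viably charge.

```python
# Toy model only: how block-reward subsidies can let DePIN providers
# undercut a centralized cloud's break-even price. All numbers are
# hypothetical, not actual Filecoin or AWS economics.

def min_viable_price(hardware_cost: float, margin: float, subsidy: float = 0.0) -> float:
    """Lowest price per unit a provider can charge while still hitting
    its required margin, given any token subsidy earned per unit served."""
    return max(hardware_cost * (1 + margin) - subsidy, 0.0)

HARDWARE_COST = 1.00  # cost to serve one unit of storage/compute

centralized = min_viable_price(HARDWARE_COST, margin=0.55)                    # EC2-like margin
depin_unsubsidized = min_viable_price(HARDWARE_COST, margin=0.10)             # competitive margin
depin_subsidized = min_viable_price(HARDWARE_COST, margin=0.10, subsidy=0.40) # plus block rewards

print(f"centralized provider: {centralized:.2f}")        # 1.55
print(f"DePIN, no subsidy:    {depin_unsubsidized:.2f}") # 1.10
print(f"DePIN, block rewards: {depin_subsidized:.2f}")   # 0.70
```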
3. Reducing Indirect Costs: The advantages of DePIN networks like Bacalhau, exaBITS, and IPFS/content-addressed storage include:
Unlocking latent data: because transferring large datasets incurs high bandwidth costs, a great deal of data goes unexploited, such as the vast event data generated by sports venues. DePIN projects can process data where it is generated and transmit only the meaningful outputs, unlocking this latent data.
Reducing operational costs: processing data close to its source lowers data ingestion, transmission, and import/export costs.
Minimizing manual work in sensitive data sharing: if Hospitals A and B need to run a combined analysis over their patients' sensitive data, they can use Bacalhau to coordinate GPU power and process the sensitive data in place, without the cumbersome administrative process of exchanging personally identifiable information (PII).
No need to recompute over the underlying dataset: IPFS/content-addressed storage comes with deduplication, provenance, and data verification built in (a minimal sketch of content addressing follows this list). For more on IPFS's functionality and cost-effectiveness, see this article (https://curiouscat178.substack.com/p/the-non-philosophical-business-case).
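For readers unfamiliar with content addressing, the sketch below shows the core idea in a few lines. Real IPFS uses multihash-based CIDs and chunked Merkle DAGs; this simplified version only illustrates why identical data is stored once and why tampering is immediately detectable.

```python
# Minimal sketch of content addressing, the idea behind IPFS CIDs:
# data is keyed by the hash of its bytes, so identical content is
# stored once (deduplication) and any change produces a new address.
import hashlib

store: dict[str, bytes] = {}

def put(data: bytes) -> str:
    """Store data under its content address; identical data is a no-op."""
    address = hashlib.sha256(data).hexdigest()
    store.setdefault(address, data)
    return address

def get(address: str) -> bytes:
    """Fetch data and verify it still matches its address."""
    data = store[address]
    assert hashlib.sha256(data).hexdigest() == address, "content was tampered with"
    return data

a = put(b"training batch 001")
b = put(b"training batch 001")   # duplicate upload: same address
assert a == b and len(store) == 1
```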
AI-generated summary: AI needs the affordable infrastructure that DePIN provides, because the current infrastructure market is dominated by vertically integrated oligopolies. DePIN networks like Filecoin, Bacalhau, Render Network, and exaBITS democratize the opportunity to become a hardware supplier, introduce competition, maintain market balance through crypto-economic design, cut costs by 75%-90%, and reduce indirect costs.
2. Verifying Creators and Identities
Problem
A recent survey shows that 50% of AI researchers believe there is a greater than 10% chance that AI will cause catastrophic harm to humanity.
People need to stay alert: AI has already triggered social disruption, yet it still lacks regulation and technical standards, a situation referred to as "reverse convexity."
For instance, in a video circulated on Twitter, podcast host Joe Rogan appears to debate conservative commentator Ben Shapiro about the movie "Ratatouille." The video is AI-generated.
It is worth noting that the social impact of AI goes far beyond the issues caused by fake blogs, dialogues, and images:
In the 2024 U.S. election cycle, AI-generated deepfake campaign content has reached a level of realism that can deceive voters.
A video of Senator Elizabeth Warren was edited to make her "say" that "Republicans should not be allowed to vote" (since debunked).
A synthesized voice of Biden was used to criticize transgender women.
A group of artists filed a class-action lawsuit against Midjourney and Stability AI, accusing them of using artists' works without authorization to train AI, infringing copyright and threatening artists' livelihoods.
An AI-generated song imitating The Weeknd and Drake, "Heart on My Sleeve," went viral on streaming platforms before being taken down. When a new technology enters the mainstream ahead of regulation, problems follow; copyright infringement is one facet of the "reverse convexity" issue.
So, can we build AI-related safeguards into Web3?
Solution
Using on-chain provenance for identity and creator verification
Blockchain, a distributed ledger holding an immutable on-chain history, can be used to verify the authenticity of digital content through cryptographic proofs.
Digital signatures as creator verification and identity proof
To identify deepfakes, a cryptographic proof can be generated from the original content creator's unique digital signature. The signature is created with a private key known only to the creator and verified with a public key available to everyone. The signature proves that the content was created by the original creator, whether human or AI, and also reveals any authorized or unauthorized changes to the content.
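As a concrete illustration, the sketch below signs a piece of content with Ed25519 using the Python `cryptography` package. This is a minimal sketch, not a full provenance system: the assumption here is that in practice the creator's public key would be anchored on-chain so anyone can look up who signed what.

```python
# Minimal sketch of creator verification via digital signatures
# (pip install cryptography). In a real provenance system the public
# key would be published on-chain.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

creator_key = Ed25519PrivateKey.generate()   # known only to the creator
public_key = creator_key.public_key()        # published for everyone

content = b"original video bytes ..."
signature = creator_key.sign(content)        # distributed with the content

# Anyone can verify authorship; any tampering invalidates the signature.
public_key.verify(signature, content)        # passes silently
try:
    public_key.verify(signature, content + b" tampered")
except InvalidSignature:
    print("content was modified after signing")
```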
Using IPFS and Merkle trees for authenticity proof
IPFS is a distributed protocol that uses content addressing and Merkle trees to reference large datasets. To prove that file content has been received intact or has been modified, a Merkle proof is generated: a sequence of hashes showing where a specific data block sits in the Merkle tree. Each change to the data produces new hashes in the tree, providing evidence of file modification.
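The sketch below builds a small Merkle tree and generates an inclusion proof for one block. It is a simplified illustration of the mechanism, not IPFS's actual node format (real IPFS uses Merkle DAGs): a leaf plus its path of sibling hashes reproduces the root, so changing any block changes the root and is immediately detectable.

```python
# Minimal sketch of a Merkle tree with inclusion proofs.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(blocks: list[bytes]) -> list[list[bytes]]:
    """All tree levels, bottom-up: leaves first, root last."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                      # duplicate last node if odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes for one leaf, each tagged with 'sibling is on the right'."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        index //= 2
    return proof

def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from a leaf and its proof path."""
    node = h(leaf)
    for sibling, sibling_is_right in proof:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root

blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
levels = build_levels(blocks)
root = levels[-1][0]
assert verify(b"block-2", prove(levels, 2), root)       # authentic block
assert not verify(b"tampered", prove(levels, 2), root)  # modified block fails
```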
The pain point of this cryptographic scheme is the incentive mechanism: identifying deepfake creators reduces negative social impact but brings no commensurate economic benefit. That responsibility is likely to fall on mainstream media distribution platforms such as Twitter, Meta, and Google, and indeed it has. So why do we need blockchain?
The answer is that blockchain's cryptographic signatures and proofs of authenticity are more effective, verifiable, and conclusive. Today, deepfake detection relies mainly on machine learning algorithms (such as Meta's "Deepfake Detection Challenge," Google's "Asymmetric Numeral Systems" (ANS), and C2PA: https://c2pa.org/) to identify patterns and anomalies in visual content, but these often lack accuracy and lag behind the pace at which deepfakes evolve. Manual review is usually still needed to determine authenticity, which is inefficient and costly.
If one day every piece of content carries a cryptographic signature, everyone will be able to verifiably prove the origin of a work and flag tampering or forgery. That would be a far better world.
AI-generated summary: AI may pose significant threats to society, especially through deepfakes and the unauthorized use of content. Web3 technologies, such as creator verification with digital signatures and proofs of authenticity with IPFS and Merkle trees, can verify the authenticity of digital content, prevent unauthorized changes, and provide a foundation for governing AI.
3. AI Democratization
Problem
Today's AI is a black box built from proprietary data and proprietary algorithms. The closed nature of big tech's LLMs stifles what I call "AI democracy," in which every developer, and even every user, can contribute algorithms and data to an LLM and receive a share of the profits when the model earns money (related article: https://curiouscat178.substack.com/p/four-foundational-pillars-to-usher).
AI democracy = visibility (the ability to see the data and algorithms input into the model) + contribution (the ability to contribute data or algorithms to the model).
Solution
The goal of AI democracy is to make generative AI models open to the public, relevant to the public, and owned by the public. The comparison below contrasts the current state of AI with the future achievable through Web3 blockchain technology.
Currently:
For customers:
Unidirectional reception of LLM outputs
Unable to control how personal data is used
For developers:
Low composability
ETL data processing is not traceable, making it difficult to reproduce
Data contribution sources are limited to data-owning institutions
Closed-source models can only be accessed through paid APIs
Shared data outputs lack verifiability; data scientists spend 80% of their time on low-level data cleaning
After combining with blockchain:
For customers:
Users can provide feedback (such as bias, content review, and granular feedback on outputs) as a basis for fine-tuning
Users can choose to contribute data in exchange for profits after the model monetizes
For developers:
Distributed data management layer: Crowdsourcing repetitive and time-consuming data labeling and other data preparation tasks
Visibility & composability & fine-tuning algorithm capabilities, aided by verifiable sources (with a tamper-proof history of all changes)
Data sovereignty (achieved through content addressing/IPFS) and algorithm sovereignty (for example, Urbit achieves peer-to-peer composition and portability of data and algorithms)
Accelerated LLM innovation, as variants proliferate from foundational open-source models.
Reproducible training data outputs, made possible by blockchain's immutable record of past ETL operations and queries (e.g., Kamu); a minimal sketch of such a verifiable ETL log follows this list.
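To illustrate the reproducibility point, here is a minimal sketch of a tamper-evident ETL log: each operation is chained to the hash of the previous entry, so the full pipeline history can be verified and replayed. This is an illustration of the general idea, not Kamu's actual format; the operations and parameters shown are hypothetical.

```python
# Minimal sketch of a hash-chained ETL log: editing any past entry
# breaks the chain, so the recorded pipeline history is tamper-evident.
import hashlib
import json

def entry_hash(entry: dict) -> str:
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

log: list[dict] = []

def record(operation: str, params: dict) -> None:
    """Append an ETL step, chained to the previous entry's hash."""
    prev = entry_hash(log[-1]) if log else "genesis"
    log.append({"op": operation, "params": params, "prev": prev})

def verify_log(entries: list[dict]) -> bool:
    """Recompute the hash chain; an edited entry breaks the next link."""
    prev = "genesis"
    for e in entries:
        if e["prev"] != prev:
            return False
        prev = entry_hash(e)
    return True

record("extract", {"source": "s3://raw/events", "rows": 120000})   # hypothetical
record("transform", {"filter": "lang == 'en'", "dedupe": True})
record("load", {"target": "ipfs"})
assert verify_log(log)

log[1]["params"]["dedupe"] = False   # tamper with recorded history
assert not verify_log(log)
```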
Some argue that Web2 open-source platforms already provide a compromise solution, but the results have not been ideal. Relevant discussion can be found in exaBITS's blog post.
AI-generated summary: The closed nature of big tech's LLMs stifles "AI democracy," in which every developer or user can contribute algorithms and data to an LLM and receive a share of the profits when the model earns money. AI should be open to the public, relevant to the public, and owned by the public. With blockchain networks, users can provide feedback and contribute data in exchange for a share of profits once the model monetizes, while developers gain visibility and verifiable provenance to compose and fine-tune algorithms. Web3 innovations such as content addressing/IPFS and Urbit will enable data and algorithm sovereignty. Reproducible training data outputs also become possible through blockchain's immutable records of past ETL operations and queries.
4. Setting Up Data Contribution Reward Mechanisms
Problem
Today, the most valuable consumer data is the proprietary asset of big tech companies and forms their core competitive moat. The giants have no incentive to share it with outsiders.
So why can't we obtain data directly from its creators and users? Why can't data become a public resource, with contributions open-sourced for data scientists to use?
Simply put, because the incentive and coordination mechanisms are missing. Maintaining data and running ETL (Extract, Transform, Load) pipelines carries significant indirect costs; data storage alone is expected to become a $777 billion industry by 2030, and that figure excludes computing costs. No one will take on the work and cost of data processing for free.
Take OpenAI: it was founded as an open-source nonprofit, but monetization proved too difficult to cover its costs. In 2019, OpenAI had to accept investment from Microsoft, and its algorithms are no longer open to the public. OpenAI's revenue is expected to reach $1 billion in 2024.
Solution
Web3 introduces a new mechanism, the "dataDAO," which redistributes income between AI model owners and data contributors and creates an incentive layer for crowdsourced data contribution. Space does not permit a full treatment here; for more, read the following two articles (a toy sketch of dataDAO-style revenue sharing appears after them):
How DataDAO works, by HQ Han of Protocol Labs
How data contribution and monetization works in Web3, in which I discuss the mechanisms, shortcomings, and opportunities of dataDAOs in depth
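For intuition, here is a toy sketch of dataDAO-style revenue sharing: model revenue is split between the model owner and data contributors pro rata to contribution weight. Real dataDAO designs are richer (tokens, governance, data-quality scoring), and the owner share, contributor names, and weights below are entirely hypothetical.

```python
# Toy sketch of dataDAO-style revenue sharing. The 50% owner share and
# the contribution weights are hypothetical, not any real protocol's rules.

def distribute(revenue: float, owner_share: float,
               contributions: dict[str, float]) -> dict[str, float]:
    """Pay owner_share to the model owner; split the rest pro rata."""
    payouts = {"model_owner": revenue * owner_share}
    pool = revenue - payouts["model_owner"]
    total_weight = sum(contributions.values())
    for contributor, weight in contributions.items():
        payouts[contributor] = pool * weight / total_weight
    return payouts

# Weights might reflect the volume and quality of data each party supplied.
weights = {"alice": 60.0, "bob": 30.0, "hospital_a": 10.0}
print(distribute(10_000.0, owner_share=0.5, contributions=weights))
# {'model_owner': 5000.0, 'alice': 3000.0, 'bob': 1500.0, 'hospital_a': 500.0}
```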
In summary, DePIN takes a different approach, supplying new hardware capacity to power Web3 and AI innovation. Although tech giants dominate the AI industry, emerging players can compete by leveraging blockchain: DePIN networks lower barriers to entry by reducing computing costs; blockchain's verifiable, distributed nature makes truly open AI possible; innovations like the dataDAO incentivize data contribution; and blockchain's immutable, tamper-proof properties enable creator verification, easing concerns about AI's negative social impact.