From Chips to Data: The Next Battle in AI
Author: Dr. Max Li, Founder of OORT and Professor at Columbia University
As the world focuses on the war over AI chips (tariffs, intellectual property restrictions, supply chain sanctions, and geopolitical disputes), an issue that bears directly on AI's future has been largely overlooked: data scarcity.
Earlier this year, Elon Musk astutely pointed out that AI companies have exhausted the data needed to train their models, having effectively "run out" of the sum total of human knowledge.
This article explores the shrinking data pool and the key role decentralized AI (DeAI) can play in addressing the challenge.
The Arrival of the Data War
First and foremost, it is essential to clarify: data is not an inexhaustible resource.
The data war has long been foreshadowed. In 2023, a group of visual artists filed a landmark lawsuit against Stability AI, Midjourney, and DeviantArt, accusing the companies of using their works without permission to train generative AI models such as Stable Diffusion. Meanwhile, Musk accused companies like OpenAI of scraping data from Twitter (now X) without authorization, prompting X to tighten API pricing and access restrictions.
Around the same time, Reddit sharply raised its API pricing, disrupting companies like OpenAI and Anthropic that rely on Reddit's user-generated content for model training. Reddit framed the decision as a way to monetize its data, but it also sparked debate over the tension between platforms that host user-generated content and the AI companies seeking to train on it.
These events highlight an increasingly evident reality: we are depleting the supply of legally and ethically usable data.
Multiple Frontlines of Data
The chip war is about producing the most powerful hardware; the data war is about acquiring suitable datasets to train AI. The growing scarcity of ethical, high-quality data has become a bottleneck for many companies developing AI.
For large companies, the most feasible approach is to buy data from centralized giants, albeit at a high cost. Small businesses, by contrast, face limited and often unaffordable options. Without viable methods or channels for collecting data, they will fall significantly behind in the race to develop AI and innovate.
So how do we ethically and effectively collect the data needed to advance AI development?
The data war will unfold across multiple frontlines, each presenting unique challenges and opportunities.
Data Collection
Who controls the pipelines for data collection? How can it be done ethically and legally?
As lawsuits over the illegal scraping or use of data pile up against tech giants, new initiatives are emerging. Harvard University, for example, has taken the lead in promoting user-consented data contributions, releasing openly accessible datasets to the public. Valuable as such projects are, they fall far short of meeting the demands of commercial AI applications.
Synthetic data is also emerging as a potential solution. Companies like Meta and Microsoft have begun using AI-generated data to fine-tune models such as Llama and Phi-4, and Google and OpenAI employ synthetic data in their work as well. But synthetic data carries its own risks: model "hallucinations" can propagate into the generated data, undermining its accuracy and reliability.
Decentralized data collection offers another promising option. By leveraging blockchain technology and using cryptocurrency to incentivize individuals to share data securely, decentralized models can address issues of privacy, ownership, and quality. These solutions also democratize data access, enabling small businesses to compete in the AI ecosystem.
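To make this concrete, below is a minimal, illustrative Python sketch of such an incentive flow: a contributor submits records, the ledger stores only a verifiable fingerprint of the data, and tokens are credited in return. The DataLedger class, reward rate, and wallet identifier are hypothetical stand-ins, not any particular network's protocol; a real system would replace the in-memory ledger with an actual blockchain and a privacy-preserving storage layer.

```python
import hashlib
import json
from dataclasses import dataclass, field

@dataclass
class Contribution:
    contributor: str  # contributor's wallet identifier (hypothetical format)
    data_hash: str    # SHA-256 fingerprint of the submitted records
    reward: float     # tokens credited for the contribution

@dataclass
class DataLedger:
    """Toy append-only ledger that credits contributors for shared data."""
    reward_per_record: float = 0.01
    entries: list = field(default_factory=list)
    balances: dict = field(default_factory=dict)

    def submit(self, contributor: str, records: list) -> Contribution:
        # Record only a hash: the ledger can later prove what was submitted
        # without the raw data ever leaving the contributor's custody.
        payload = json.dumps(records, sort_keys=True).encode()
        data_hash = hashlib.sha256(payload).hexdigest()
        reward = self.reward_per_record * len(records)
        entry = Contribution(contributor, data_hash, reward)
        self.entries.append(entry)
        self.balances[contributor] = self.balances.get(contributor, 0.0) + reward
        return entry

ledger = DataLedger()
receipt = ledger.submit("wallet-001", [{"text": "a labeled example", "label": "positive"}])
print(receipt.data_hash[:16], ledger.balances["wallet-001"])
```

Because only the hash goes on the ledger, anyone can later verify that a delivered dataset matches what a contributor was paid for, which is one way such systems can tie rewards to ownership and quality.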
Data Quality
Low-quality data can lead to model bias, inaccurate predictions, and ultimately foster distrust in AI systems. How can we ensure that the data used for AI training is accurate and representative?
Common industry practices include:
- Rigorous Data Validation: Companies employ advanced validation techniques to filter out errors, inconsistencies, and noise in datasets. This often involves human oversight, automated processes, or a combination of both to verify data integrity (a minimal sketch pairing validation with a bias check follows this list).
- Bias Mitigation Strategies: To ensure data representativeness, businesses implement bias detection tools and diverse sampling techniques. For instance, in the healthcare sector, datasets must include different demographic groups to avoid biases that could affect diagnostic models.
- Adhering to Standards: Industry frameworks for data security such as ISO/IEC 27001, along with emerging ethical AI guidelines, are becoming essential for ensuring data quality and compliance with global standards.
- Crowdsourced Quality Checks: Platforms like Amazon Mechanical Turk are used for tasks such as labeling and validating data. Although cost-effective, these methods require oversight to ensure consistency and accuracy.
- Decentralized Verification: Blockchain and decentralized systems are gradually becoming tools for certifying data sources, ensuring data authenticity, and preventing tampering.
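As a concrete illustration of the first two practices above, here is a minimal Python sketch that pairs a validation pass (dropping malformed records) with a simple representativeness check (flagging under-represented groups). The field names, group labels, and 10% threshold are hypothetical choices for illustration; production pipelines would use richer schemas and proper statistical bias tests.

```python
from collections import Counter

def validate_records(records, required_fields=("text", "label"), min_group_share=0.10):
    """Drop malformed records, then flag groups whose share of the cleaned
    data falls below a minimum threshold."""
    # Validation pass: keep only records whose required fields are present and non-empty.
    clean = [
        r for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    ]
    # Representativeness pass: compute each group's share and flag shortfalls.
    counts = Counter(r.get("group", "unknown") for r in clean)
    total = len(clean) or 1
    flagged = {g: n / total for g, n in counts.items() if n / total < min_group_share}
    return clean, flagged

records = [
    {"text": "scan shows no anomaly", "label": "negative", "group": "over-65"},
    {"text": "", "label": "positive", "group": "under-30"},  # dropped: empty text
    {"text": "elevated markers detected", "label": "positive", "group": "under-30"},
]
clean, flagged = validate_records(records)
print(len(clean), flagged)  # 2 records survive; no group falls below the threshold
```

In practice, flagged groups would trigger targeted data collection or re-sampling rather than a simple printout.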
Moreover, regulatory bodies in various countries face the urgent challenge of establishing comprehensive data privacy and security rules that balance individual rights with technological innovation, while also addressing key national security concerns such as protecting sensitive data from cyber threats, foreign exploitation, and abuse by hostile actors.
A Rocky Road Ahead
The implications of the data war are profound. For example, in the healthcare industry, obtaining high-quality patient data can revolutionize diagnostic and treatment plans, but stringent privacy regulations pose obstacles. Similarly, in the music industry, AI models trained on ethical datasets can transform everything from composition to copyright enforcement, provided they respect intellectual property.
These challenges underscore the importance of decentralized solutions that prioritize data transparency, quality, and accessibility. By leveraging decentralized systems, we can create a fairer data ecosystem where individuals retain control over their data, businesses can access ethical and high-quality datasets, and innovation can proceed without compromising privacy or security.
The shift from the chip war to the data war will reshape the AI ecosystem as it evolves, opening significant opportunities for decentralized data solutions. By prioritizing ethical data collection and accessibility, decentralized AI has the potential to bridge these gaps and lead the way to a fairer, more innovative AI future.
The battle for the best data has already begun. Are we ready to face it?