NGC Ventures Latest Research: It's September, a Sober Reflection and Some Suggestions on AI Entrepreneurship
Author: CherryIn, Investment Manager, NGC Ventures
The first draft of this article was written during a break at the end of August. After a hasty release, it received a good deal of feedback, and the author has since revised and edited some of the content to avoid embarrassment.
The content of this article mainly evaluates the current state of the AI industry from an investment perspective, reflects on and speculates about the technology/product paths of different companies, and abstracts the strategies of AI companies. Therefore, there may be omissions in the parts involving specific technologies, and I hope for your understanding.
That said, at the end of the day, the few big companies capable of publishing serious papers are still at odds with one another, so it seems no one is really in a position to evaluate the correctness of this article's content. It is a bit like having GPT-4 grade GPT-3.5: it looks reasonable, but on closer inspection it is somewhat hand-wavy.
Therefore, I suggest that this article be viewed as a "judgment" formed from the collection of information in an uncertain industry. Since it is a judgment, the stance must be clear and substantial. As for whether the judgment is correct, let time be the judge.
I have always believed that in a noisy new industry, thinking critically and daring to make judgments is never wrong. For judgment questions, the probability of a blind guess being correct is 50%, while the probability of guessing wrong three times in a row is 12.5%. Even making judgments at the level of flipping a coin is meaningful. Making judgments is not scary; what is truly frightening is having a judgment accuracy lower than that of flipping a coin.
Before formally starting this article, I would like to thank the following works, which provided valuable inspiration and data sources. Of course, since many of the inferences in this article are built on these works, any error or misunderstanding on my part would make those inferences unreliable, so I ask readers to exercise their own judgment. This article does not constitute investment advice, and it would be hard for it to constitute investment advice anyway.
- Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance (https://arxiv.org/abs/2305.17306)
- LIMA: Less Is More for Alignment (https://arxiv.org/abs/2305.11206)
- June 2023, A Stage Review of Instruction Tuning (https://yaofu.notion.site/June-2023-A-Stage-Review-of-Instruction-Tuning-f59dbfc36e2d4e12a33443bd6b2012c2)
- GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE (https://www.semianalysis.com/p/gpt-4-architecture-infrastructure)
Now, let's officially begin this article.
Large Models: Launching Cyber Rockets
The first question in any discussion of AI in 2023 is whether large-model startups are still viable.
The pre-training of large models has now transformed into a rocket launch problem; as long as you can afford it and the direction is right, anyone can do it. It can be said that training large models is akin to launching cyber rockets.
A counterintuitive observation is that investors underestimate the difficulty of training large models while overestimating the difficulty of launching real rockets. With the same $60 million at stake, investors assume a failed rocket launch deserves a second attempt, because launches are expected to be hard, whereas a failed large-model training run is written off as wasted money, because training is assumed to be routine.
GPT-4, even at OpenAI's GPU utilization (rumored to be around 30%), still consumed about $60 million. This is a {performance = efficiency × cost} problem, and performance is the wall: if another startup cannot buy an effective result greater than 30% × $60 million = $18 million worth of training, its users might as well use GPT-4 directly.
Currently, many companies claiming to train large models are raising funding rounds in the range of $1 million to $5 million. This means that even the company with the largest funding amount only has enough ammunition to support one launch. Even if this launch achieves 100% GPU utilization, it is still difficult to exceed GPT-4.
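To make the {performance = efficiency × cost} reasoning concrete, here is a minimal back-of-the-envelope sketch; the 30% utilization, $60 million, and $5 million figures are the rumored and illustrative numbers quoted above, not verified data.

```python
# Back-of-the-envelope check of the {performance = efficiency x cost} argument.
# All figures are the rumored/illustrative numbers quoted in the text, not verified data.

def effective_budget(gpu_utilization: float, spend_usd: float) -> float:
    """Effective training budget = GPU utilization x dollars spent."""
    return gpu_utilization * spend_usd

gpt4_budget = effective_budget(gpu_utilization=0.30, spend_usd=60_000_000)     # ~$18M effective
startup_budget = effective_budget(gpu_utilization=1.00, spend_usd=5_000_000)   # $5M even at perfect utilization

print(f"GPT-4 effective budget:   ${gpt4_budget:,.0f}")
print(f"Startup effective budget: ${startup_budget:,.0f}")
print(f"Gap to the performance wall: {gpt4_budget / startup_budget:.1f}x")
```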
From this perspective, launching rockets is actually easier: most rockets today are launch vehicles carrying satellites into orbit, and because single-vehicle payload capacity is limited, smaller rocket companies can pick up the satellites that bigger players cannot schedule in time.
Large models are different. The marginal cost of scaling out a large model is only the cost of computing power, which can be expanded elastically. For the leading large-model company, every additional order is essentially pure profit with almost no extra cost, so its capacity to absorb demand is effectively unlimited. Newly established, lower-quality large-model companies therefore find it hard to capture any overflow demand.
Unless training costs drop significantly, even if many companies know the entire architecture of GPT-4, it will still be challenging to produce market-ready large models in the short term.
Customization: Facing the "Winner-Takes-All" Problem
In the hardware industry, a common pattern is to earn early profits from customization work and then use those profits to fund technological breakthroughs (or to catch up). In the large-model industry, however, customization is unlikely to be a way out for newcomers.
The explanation for this judgment is simple: the vast majority of fine-tuned models cannot catch up with GPT-4, and even when they do, directly using GPT-4's generalization is still cheaper, requiring fewer people, less luck, and less data. As long as the performance gap between GPT-4 and other models persists, customization cannot be the way out for large-model companies.
A very typical example is Jasper, which uses fine-tuned GPT-3 to serve enterprise clients. However, after OpenAI publicly released ChatGPT (GPT-3.5), its users rapidly dwindled. This is because Jasper's outputs can be obtained by simply inputting prompts into GPT-3.5, without needing to use a "lagging version" with poor generalization capabilities limited to internal enterprise use.
Compared to new companies, Jasper at least had a window of opportunity between GPT-3 and GPT-3.5 to develop. However, new companies now have to face the pressure from both low-cost, high-speed GPT-3.5 and high-performance GPT-4.
Therefore, the hope of accumulating profits through customization in order to fund a technological breakthrough has a very low probability of survival.
Fine-Tuning: Necessary, But Don't Be Superstitious
Currently, the AI industry holds unrealistic expectations of fine-tuning; it is overestimated both at the level of concrete technical implementation and at the level of the macro technology roadmap.
The fine-tuning discussed in the industry mostly refers to "generating responses that align with human intentions based on pre-trained models." This type of fine-tuning can be called "alignment," which means making responses align with human intentions rather than enhancing the intelligence of large models.
According to the findings of several papers, the knowledge of large models should primarily come from pre-training, while fine-tuning is more about alignment.
In simple terms, pre-training determines brain capacity, while fine-tuning determines native language. Fine-tuning a pre-trained model is a process of "eliminating illiteracy."
However, the industry often views fine-tuning as a method to "enhance intelligence" in models, meaning that fine-tuning improves model performance and increases model knowledge, believing that this can lead to the "holy grail of artificial intelligence." This line of thinking is somewhat biased.
First, the model's performance itself does not improve; it can only better align with human intentions. If the complexity of the task exceeds the model's performance, fine-tuning cannot yield the expected results. It's like asking the human brain to perform quantum calculations; the inability to do so is not an issue of education.
Second, supplementing "knowledge" in the "intent alignment" part is more akin to "parroting." That is, the model merely mimics experts without understanding the underlying meaning. Although many industries can achieve good solutions through "parroting" (after all, most industries are not complex…), this is clearly not the result we should pursue in the long term.
Finally, training to "supplement additional datasets, improve model performance, and increase model knowledge" should be viewed as the model's ability for "incremental learning/continuous learning," meaning that the model's full parameters can be optimized through incremental datasets. This is not the same category as so-called "instruction fine-tuning."
In summary, fine-tuning is very important, but a "superstitious" attitude toward today's fine-tuning is mistaken, especially the eagerness to crown it as the holy grail, which has the flavor of "the edifice of physics is complete, with only two small clouds left drifting above it."
To take a step back: even if the demand for "enhancing intelligence" really could be met through instruction fine-tuning, a simple vector search, stuffing the retrieved knowledge directly into the context, and writing a few prompt templates would very likely achieve similar or even better results.
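As a concrete illustration of that alternative, here is a minimal retrieval-plus-prompting sketch, assuming the 2023-era openai-python (pre-1.0) interface and an API key in the environment; the documents and the question are hypothetical placeholders.

```python
# Minimal "vector search + stuff knowledge into the context" sketch, as an
# alternative to instruction fine-tuning for knowledge injection.
# Assumes openai-python < 1.0 and OPENAI_API_KEY set in the environment.
import numpy as np
import openai

documents = [
    "Refund policy: customers may return products within 30 days.",   # placeholder corpus
    "Shipping: standard delivery takes 5-7 business days.",
]

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

doc_vectors = embed(documents)

def answer(question: str, top_k: int = 2) -> str:
    # Retrieve the most relevant documents by cosine similarity.
    q = embed([question])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = "\n".join(documents[i] for i in np.argsort(scores)[::-1][:top_k])

    # Stuff the retrieved knowledge directly into a prompt template.
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message["content"]

print(answer("How long do I have to return a product?"))
```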
Everyone loves fine-tuning, which may also be a revival of alchemy skills in modern times…
Outlook for Large Models: Some Simple Arithmetic
(Note: This section is entirely based on data leaked by Dylan Patel, and its reliability cannot yet be verified.)
GPT-4 was trained on Nvidia's A-series cards at roughly 30% training efficiency, took about 2 months, and cost approximately $60 million, with a total parameter count of {1.7 trillion ≈ 110 billion × 16 expert models} and roughly 280 billion parameters active for any single question.
In other words, there are several key parameters that could change the landscape of large model training.
- Training efficiency: Increasing from 30% to 60% can directly halve the time.
- Increased computing density: Switching from the A series to the H series and then to AI-specific cards can solve many efficiency-related issues in architecture.
- Decreased computing costs: Discounts on graphics cards from Jensen Huang (Nvidia founder) lead to significant cost reductions.
- Improved parameter efficiency: There is room for improvement in the parameter efficiency of models; historically, new models have often achieved multiple times the parameter efficiency of older models, possibly using only 30% of GPT-4's parameters to achieve similar results.
In summary, the cost of training a model with GPT-4 level performance from scratch may have an optimization space of 10 to 20 times, compressing it to $3 million to $6 million, which is more acceptable for both startups and large companies in terms of cost control.
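For illustration only, the sketch below multiplies out one plausible combination of the levers listed above; each factor is an assumption chosen to be consistent with the text, not a measured value.

```python
# Illustrative decomposition of the claimed 10-20x optimization space.
# Every factor below is an assumption consistent with the levers listed above.
baseline_cost_usd = 60_000_000   # rumored GPT-4 training cost

efficiency_gain = 2.0            # training efficiency 30% -> 60%
hardware_gain   = 2.0            # A-series -> H-series / AI-specific cards (assumed)
price_discount  = 1.5            # cheaper cards (assumed)
param_gain_lo, param_gain_hi = 2.0, 3.3   # parameter efficiency; 3.3 ~= "only 30% of GPT-4's parameters"

factor_lo = efficiency_gain * hardware_gain * price_discount * param_gain_lo   # ~12x
factor_hi = efficiency_gain * hardware_gain * price_discount * param_gain_hi   # ~20x

print(f"Optimization factor range: ~{factor_lo:.0f}x to ~{factor_hi:.0f}x")
print(f"Implied cost range: ~${baseline_cost_usd / factor_hi:,.0f} to ~${baseline_cost_usd / factor_lo:,.0f}")
```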
This change may take about 2 years to complete.
Currently, mainstream large model technology is still based on transformers, and the underlying architecture has not changed. The approach of "alchemy and parameter enhancement" has not yet been exhausted. The training of GPT-4 was conducted under significant computational constraints and with insufficient training time.
If parameters grow linearly with training time, the parameter ceiling for models with an architecture similar to GPT-4's may be around 10 trillion: doubling the training time (×2), doubling the parallel GPUs (×2), raising training efficiency by half (×1.5), and raising parameter efficiency by half (×1.5) would together yield roughly ten times the results. Given Silicon Valley's appetite for risk, this parameter count will likely be reached within a year, regardless of whether performance actually improves.
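The sketch below is one way to read that arithmetic; the mapping from the multipliers to a parameter ceiling is my interpretation of the "parameters grow linearly with training time" assumption, not leaked data.

```python
# One reading of the ~10 trillion parameter estimate (interpretation, not leaked data).
current_params = 1.7e12        # rumored GPT-4 total parameter count

time_x       = 2.0             # double the training time
gpus_x       = 2.0             # double the parallel GPUs
efficiency_x = 1.5             # training efficiency up by half
param_eff_x  = 1.5             # parameter efficiency up by half (improves results, not size)

# Under "parameters grow linearly with training time", parameter count scales with
# effective compute (time x GPUs x efficiency); parameter efficiency multiplies results.
param_ceiling      = current_params * time_x * gpus_x * efficiency_x
results_multiplier = time_x * gpus_x * efficiency_x * param_eff_x

print(f"Implied parameter ceiling:    ~{param_ceiling / 1e12:.0f} trillion")   # ~10 trillion
print(f"Implied 'results' multiplier: ~{results_multiplier:.0f}x")             # ~9x, roughly ten times
```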
However, once the parameter count reaches 10 trillion, whether LLMs can still rely on increasing parameters for significant breakthroughs is entirely unknown.
If the increase in parameter count leads to diminishing returns in model performance, then 10 trillion may be a threshold. However, there is also a hypothesis that the increase in parameter count leads to increasing returns in model performance, similar to "if a person is smart enough, they learn quickly regardless of the subject." The former is manageable, but if the latter proves true, model performance could improve exponentially, making future developments completely unpredictable.
Predicting alchemy is difficult, but predicting corporate strategic rhythm is relatively easy. A total parameter count of around 10 trillion is a milestone for most companies, whether giants like Google/MS/Apple or a smaller player like OpenAI: a point where they can pause and do some technological exploration.
Corporate and capital risk appetite can be translated into a "tolerance for time." If that entire tolerance window is spent burning money at full speed, it is hard to stretch beyond 6 months, while the craft itself does not improve quickly, usually on a cycle of 5 years or more. Therefore, within the next 5 years, the ceiling on model parameter count can be estimated at roughly 20 to 50 trillion. Unless there is another major breakthrough in technology or architecture, the probability of exceeding that order of magnitude is low.
Multimodal: The Elephant in the Room
Multimodal is the elephant in the room, which could profoundly impact the landscape of the field.
A simple definition of multimodal is: supporting the input and output of various types of information. This definition is quite loose; for example, some products on the market that claim to support multimodal input are essentially just chatbots wrapped in OCR. There are also models that fully meet the definition of multimodal, but their performance leaves much to be desired. Even GPT-4's image multimodal input capability has not yet been widely opened, indicating that this feature is not very stable.
However, the release of multimodal capabilities is not a distant prospect. GPT-5 is likely to natively support multimodal capabilities, meaning it will require a redesign of the architecture and retraining. According to previous reasoning, large models still have a growth space of 10 to 50 times in parameters, which should be sufficient to incorporate multimodal capabilities. Therefore, it can be expected that highly usable and high-performance multimodal models will emerge within 2 years; optimistically, it could be around 1 year.
Multimodal is the elephant in the room; everyone knows that such a thing will eventually exist, but many products/research/strategies ignore its presence, leading to misjudgments in critical areas.
For example, image-only models will in theory face severe pressure from multimodal models, but most current research and investment overlooks this, resulting in overly high valuations for companies focused on image models. These companies are likely to lose their technological moat and turn into service providers, so their valuations should be benchmarked against service providers rather than technology companies.
As for the story that "we invest in people, and the same team can always pivot the business," let's pretend I never said anything. Legends do exist, but you should not bet your research on legends.
Who Can Train GPT-4: Anyone Could, But There May Be No Need
Alchemy doesn't take that long, and large companies are all buying graphics cards. An obvious fact is that a year from now, large companies will all have the capability to train models at the level of GPT-4. However, whether they should train is another question.
In the gaming field, there is a classic proposition called "play Genshin Impact if you can," meaning that when players can choose between playing Genshin Impact or a competing product, if the competing product is not as good as Genshin Impact, they will choose Genshin Impact.
This "winner-takes-all" mindset also applies to the large model industry. If a company closely follows OpenAI and, after six months of R&D, launches its own large model with 90% of GPT-4's performance, hoping to push it to market, the company will face the following issues:
- OpenAI has a scale advantage in cloud resources, leading to lower costs.
- OpenAI's API is already widely used in product code, making it difficult to switch.
- The company's product performance still does not exceed GPT-4.
- OpenAI's next-generation product (possibly GPT-5) is about to be released.
It is clear that the pressure on this company is quite significant. Rather than training GPT-4, it might be better to bet directly on the next-generation model (targeting GPT-5). Then the issue will shift from "similar competing products" to "technological innovation," which is a heavy burden for small companies to bear.
Therefore, discussing "who can train GPT-4" is a strategic dead end. Instead of pondering this question, it is better to seek directions with more certainty and opportunity.
Advice for AI Startups: Performance First, Avoid Stagnation
I have previously written several articles criticizing langchain, mainly because langchain leaves developers no room to improve performance. Billed as a "universal framework," it sacrifices much of the performance-improvement space of large models, such as multi-turn dialogue and format control achieved through fine-tuning, in order to stay universal. Similar issues apply to guidance/Auto-GPT/BabyAGI, all of which aim to be a "framework you can use for a lifetime."
An objective fact is that OpenAI released Function Calling in June, providing a better solution for many previously awkward parts of application code, but taking advantage of it requires refactoring key parts of the product. In August, OpenAI released fine-tuning for GPT-3.5, offering a new potential solution for many scenarios that require precise control over output.
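As a sketch of the kind of refactor this implies, the example below replaces free-text parsing with Function Calling to obtain structured output, assuming the 2023-era openai-python (pre-1.0) interface; the `record_meeting` function name and its schema are hypothetical.

```python
# Minimal sketch of refactoring from "parse the model's free text" to Function Calling.
# Assumes openai-python < 1.0; the function name and schema are hypothetical.
import json
import openai

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Extract the meeting time from: 'Let's sync Friday at 3pm.'"}],
    functions=[{
        "name": "record_meeting",                  # hypothetical function name
        "description": "Record an extracted meeting time.",
        "parameters": {
            "type": "object",
            "properties": {
                "day":  {"type": "string"},
                "time": {"type": "string"},
            },
            "required": ["day", "time"],
        },
    }],
    function_call={"name": "record_meeting"},      # force structured output instead of free text
)

args = json.loads(resp.choices[0].message["function_call"]["arguments"])
print(args)  # e.g. {"day": "Friday", "time": "3pm"}
```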
Therefore, startups must face a critical choice: should they choose ① to improve performance and continuously refactor products, or ② to reduce the use of new features and always develop using old features?
For startups applying new technologies, "development" is not just the process of writing code; it also sets the ceiling of product functionality and strategy. The more performance they can harness, the more the product can theoretically do, and the greater the strategic flexibility.
The development of technology is unpredictable, and small technological innovations may lead to highly sensitive changes in competitive landscapes. Startups should possess the ability to be anti-fragile regarding technological developments.
------ In plain language: prioritize performance and avoid stagnation. On the development side, make extensive use of new features; on the product side, consider what functions new features can enable; on the strategic side, think about the impact of new features on strategy.
In "On the Fall of the Qin Dynasty," it was mentioned that after the establishment of the Qin Dynasty, they confiscated all metal weapons in the world and cast them into twelve bronze figures to eliminate the possibility of popular uprisings. However, the Qin Dynasty is notoriously known for being short-lived. It is more beneficial to pay attention to changes rather than ignore them.
Advice for AI Startups: Feel Free to Build Applications
A very common hidden danger for startups building applications is the entry of large companies. The large companies here include not only application giants like Meta/ByteDance/Tencent but also upstream players in the AI industry, such as OpenAI.
The reasons for large companies entering the market usually boil down to two: to position themselves for product opportunities and to move from upstream to downstream.
"Positioning for product opportunities" is self-explanatory; large companies feel that this direction is worth pursuing, so they do it.
"Moving from upstream to downstream" is often a reluctant move, possibly because they have developed a large model comparable to OpenAI's, but due to the winner-takes-all problem, they have no users, leading to burning costs, no revenue, and no data, which in turn causes their performance to lag. At this point, entering the downstream market, developing specific applications, and using their technology becomes the only option.
Based on historical experience, due to organizational structure issues, the closer a company is to the downstream, the more likely its technology will lag, and the more it lags, the more it has to move downstream. These so-called technology companies will ultimately compete for the same ecological niche as application-layer companies.
However, in the battlefield of the application layer, since AI technology has emerged only recently, there are no effective, reusable scale advantages, and large companies and startups start from a similar baseline. Compared to large companies, startups are more efficient and insightful, making it easier for them to gain an advantage.
One situation worth noting is that MS Azure's current promotional materials are almost entirely built around OpenAI; a company of that size is effectively using OpenAI as its storefront, which indirectly demonstrates that startups have a natural advantage in the AI field.
Of course, some cloud vendors may refuse to be fronted by startups and will want to take the whole market themselves, but given their high costs and slow pace, they are not an immediate threat.
The fact is that while some AI application tracks are indeed very short-lived, many long-lived tracks have yet to be discovered, and AI applications are not necessarily winner-takes-all. Extending from applications to platforms or technologies is also a more feasible path.
Therefore, we should rationally assess the ability of large companies to invade the application layer. Our advice is that AI startups can feel free to build applications.
Advice for AI Startups: Pay Attention to Product Lifespan
As mentioned earlier, AI startups can feel free to build applications, but they must account for the performance trajectory of the underlying models to avoid stagnation. In practice, this shows up as AI products losing the demand they were built on and gradually withering within a few months, and it can happen frequently.
AI applications require the services of large models, and the performance of large models is continuously improving. This improvement is not merely a single-dimensional enhancement like "speed," but rather a comprehensive change in output quality, output length, and output controllability. Each significant technological upgrade will lead to existing application-layer products becoming technologically outdated and create new opportunities and competitors.
We refer to the time during which AI applications maintain advantages and necessity in strategy/product/technology as the "lifespan."
Here are some examples of products with relatively short lifespans:
- When ChatGPT/Claude supports file uploads, ChatPDF loses its necessity.
- When Office365 supports Copilot, products for creating PPTs using AI lose their advantages.
- When GPT-3.5 appears, Jasper loses its necessity.
Considering the rapid development of the AI industry, having a limited lifespan is the norm. Therefore, accepting the fact that lifespans are limited and choosing directions with longer lifespans for development is beneficial for maintaining long-term advantages and product necessity.
Generally, lifespans can be simply categorized into 3/6/12-month levels.
- 3 months: Features that large companies cannot implement in time (such as features that office/ChatGPT have not yet had time to develop).
- 6 months: Features that are somewhat difficult to implement, cannot be integrated into existing solutions, but whose advantages/necessity will disappear as AI performance improves (such as general AI frameworks).
- 12 months: Advantages/necessity that can exist long-term and are not easily affected by large companies/technological developments (such as Hugging Face).
* Platform products do not necessarily have long lifespans; after all, a prompt store is also a platform.
Startups can proceed with a direction as long as they have a 6-month level lifespan; a 12-month level lifespan is more of a rare opportunity.
When a product's lifespan comes to an end, there are generally two scenarios. The first scenario is that the advantage disappears, necessitating a product overhaul and technological upgrade; please refer to the earlier section "Performance First." The second scenario is that the necessity disappears, and the product will gradually be replaced. At this point, the product still has several months of "operational lifespan," which is sufficient for startups to choose their next direction.
Advice for AI Startups: Web3 + AI is Feasible
Currently, there are many projects focused on the theme of Web3 + AI, and considering the uncertainty of technological development and the early stage of the market, there are many variables regarding the topic of Web3 + AI in the future.
This article aims to find high-probability certainties amid uncertainties, so I still hope to throw out some ideas and directions that may present opportunities for startups and interested researchers to consider.
- Decentralization
Currently, the leaders in the AI industry only provide closed-source models, and the stability, transparency, and neutrality of their continued service are uncontrollable. Decentralization may become an important theme in the AI industry, that is, providing stable, transparent, and neutral AI services based on a decentralized architecture.
Decentralization is both an "alternative solution" and a form of "deterrence": it significantly raises the cost of misbehavior for centralized or sovereign AI companies, discouraging them from using AI models for military, cult, or political purposes.
In extreme situations, if centralized/sovereign AI services become unavailable or untrustworthy for some reason, decentralized AI can continue to provide highly available services, preventing individual countries/regions or even humanity from losing AI services and falling into paralysis.
- Practical Computing Power
ETH's transition from PoW to PoS was driven in part by the much-criticized dilemma that "mining generates no value." Combining Web3 with AI provides practical uses for computing power, absorbing the existing supply and promoting overall growth in computing power.
- Virtual Assetization
AI is inherently an asset built on computing power and storage. Combining Web3 with AI provides a channel for turning AI into virtual assets, giving concrete form to the value created by the AI industry while creating virtual assets that are truly native to Web3.
- Variables in Web3 Applications
The combination of Web3 and AI may bring new functional points and growth opportunities to Web3 applications, allowing existing Web3 applications to be completely redone.
In Conclusion: Is AI Still Worth Starting a Business in September?
To conclude: yes, and this conclusion is likely applicable until the New Year.
Human perception of situations often has biases, and I am no exception. Some are overly optimistic, while others are overly pessimistic. I once communicated with two teams; one team believed they could create an AI agent by Q1 next year, while the other thought AI was only suitable for knowledge management tasks. Clearly, the former is overly optimistic, while the latter is overly pessimistic.
When making long-term plans, both excessive optimism and excessive pessimism can lead to pitfalls, and widely circulated statements often reflect these significant biases, making independent thinking invaluable. Therefore, regardless of whether readers can accept the views in this article, as long as they engage in independent thinking and judgment during the reading process, I will be immensely gratified.