OpenAI, which was far ahead, has slowed down

Huxiu Technology Group
2024-05-14 10:11:14
When we founded OpenAI, our initial idea was that we would create artificial intelligence and use it to generate various benefits for the world. Instead, it now seems that we will create artificial intelligence, and then others will use it to create all sorts of amazing things that benefit all of us.

Produced by: Huxiu Technology Group

Author: Qi Jian

Editor: Wang Yipeng

"If what is released is GPT-5, then OpenAI is still far ahead. If it's AI Search or a voice assistant, then it indicates that OpenAI has declined."

An AI large model practitioner told Huxiu that the expectations for OpenAI in the industry are too high; unless it is a disruptive innovation like GPT-5, it is difficult to satisfy the audience's "appetite."

Although Sam Altman had already announced before the OpenAI live stream that GPT-5 (or GPT-4.5) would not be released, the expectations from the outside world for OpenAI had already reached an unmanageable level.

On May 14, 2024, at midnight Beijing time, OpenAI announced its latest model, GPT-4o, where the "o" stands for "omni." The roughly 20-minute live demonstration showcased an AI voice interaction experience far beyond any current voice assistant, closely matching earlier reports in foreign media.

Although the demonstration effect of GPT-4o can still be described as "explosive," industry insiders generally believe it is difficult to match the "magic" described in Altman's preview. Many believe that these functional products are "deviating from OpenAI's mission."

OpenAI's PR team seems to have anticipated this trend in public opinion. At the press conference and in a blog post released afterward, Altman explained:

"A key part of our mission is to provide very powerful AI tools to people for free (or at a discounted price). I am very proud that we offer the world's best model in ChatGPT for free, without ads or similar things.

When we founded OpenAI, our initial idea was to create AI and use it to generate various benefits for the world. Instead, it now seems that we will create AI, and then others will use it to create various amazing things that benefit us all."

GPT-4o: Far Ahead

"If we have to wait 5 seconds to get 'every' reply, the user experience will plummet. Even if the synthesized audio sounds very realistic, it will break the immersion and feel lifeless."

On the eve of the OpenAI press conference, Jim Fan, head of Embodied AI at NVIDIA, predicted the voice assistant OpenAI was about to release, noting that almost all voice AI goes through three stages:

  1. Speech recognition or "ASR": audio -> text1, such as Whisper;

  2. Planning what to say next with LLM: text1 -> text2;

  3. Speech synthesis or "TTS": text2 -> audio, such as ElevenLabs or VALL-E.

Going through these 3 stages leads to significant delays.
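The cascaded pipeline Jim Fan describes can be sketched as three sequential stages whose latencies simply add up. The sketch below is illustrative only: the stage functions (`asr`, `llm`, `tts`) and their latency figures are hypothetical placeholders standing in for real systems like Whisper or ElevenLabs, not actual APIs.

```python
import time

# Hypothetical per-stage latencies (seconds) for a cascaded voice pipeline.
STAGE_LATENCY = {"asr": 0.4, "llm": 1.5, "tts": 0.9}

def asr(audio: bytes) -> str:
    """Stage 1, speech recognition (e.g. Whisper): audio -> text1."""
    time.sleep(STAGE_LATENCY["asr"])
    return "transcribed user speech"

def llm(text1: str) -> str:
    """Stage 2, planning the reply with an LLM: text1 -> text2."""
    time.sleep(STAGE_LATENCY["llm"])
    return "planned assistant reply"

def tts(text2: str) -> bytes:
    """Stage 3, speech synthesis (e.g. ElevenLabs, VALL-E): text2 -> audio."""
    time.sleep(STAGE_LATENCY["tts"])
    return b"synthesized audio"

def cascaded_reply(audio: bytes) -> tuple[bytes, float]:
    """Run the three stages back to back; total delay is the sum of the parts."""
    start = time.perf_counter()
    out = tts(llm(asr(audio)))
    return out, time.perf_counter() - start

audio_out, delay = cascaded_reply(b"user audio")
print(f"total delay: {delay:.1f}s")
```

With these placeholder numbers the end-to-end delay lands near 2.8 seconds, which is exactly the kind of multi-second gap that a single end-to-end audio model like GPT-4o removes by not serializing three separate models.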

GPT-4o has almost resolved the delay issue in response speed. The shortest response time for audio input with GPT-4o is 232 milliseconds, with an average response time of 320 milliseconds, nearly comparable to humans. The average delay for voice conversation features in ChatGPT without using GPT-4o is 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4).

GPT-4o not only greatly enhances the experience by shortening delays but also includes many upgrades based on GPT-4, such as:

  • Excellent multimodal interaction capabilities, including voice, video, and screen sharing.

  • Real-time recognition and understanding of human expressions, text, and mathematical formulas.

  • Rich emotional expression in interactive voice, capable of changing tone, style, and even "improvising" songs.

  • Ultra-low latency, allowing users to interrupt the AI in real-time during conversations to add information or start new topics.

  • All ChatGPT users can use it for free (with usage limits).

  • Speed is twice that of GPT-4 Turbo, with API costs reduced by 50% and rate limits increased by 5 times.

"Declining" OpenAI

"Breaking through these limitations is what would count as innovation."

Some industry experts believe that the multimodal capabilities of GPT-4o only "look good," and in reality, OpenAI has not demonstrated features that truly represent a "breakthrough" for visual multimodality.

Here, following the large model industry's convention of benchmarking against peers, we compare it with Claude 3 from rival Anthropic.

The technical documentation for Claude 3 mentions, "While Claude's image understanding capabilities are cutting-edge, some limitations should be noted."

These include:

  • Character recognition: Claude cannot be used to identify (i.e., name) people in images and will refuse to do so.

  • Accuracy: Claude may hallucinate or make mistakes when interpreting low-quality, rotated, or very small images below 200 pixels.

  • Spatial reasoning: Claude's spatial reasoning abilities are limited. It may struggle with tasks requiring precise location or layout, such as reading a simulated clock face or describing the exact position of chess pieces.

  • Counting: Claude can provide an approximate count of objects in an image but may not always be accurate, especially for large numbers of small objects.

  • AI-generated images: Claude does not know whether an image is AI-generated and may be incorrect if asked. Do not rely on it to detect fake or synthesized images.

  • Inappropriate content: Claude will not handle inappropriate or explicit images that violate our acceptable use policy.

  • Healthcare applications: While Claude can analyze general medical images, it is not designed for interpreting complex diagnostic scans like CT or MRI. Claude's output should not be considered a substitute for professional medical advice or diagnosis.

In the case studies released on the GPT-4o website, there are some capabilities related to "spatial reasoning," but they still hardly qualify as breakthroughs.

Furthermore, it is easy to see from the content output during the press conference demonstration that its model capabilities are not significantly different from GPT-4.

(Figure: GPT-4o benchmark scores)

Although the model can add tone to conversations and even improvise songs, the dialogue content shows no more detail or creativity than GPT-4's.

Additionally, after the press conference, OpenAI's official website released a series of application case explorations for GPT-4o. These include: photo-to-cartoon style; meeting notes; image synthesis; 3D content generation based on images; handwriting and draft generation; stylized posters and comic generation; artistic font generation, etc.

Among these capabilities, photo-to-cartoon conversion and meeting notes are fairly ordinary text-to-image and large model features.

Can It Challenge Existing Business Models?

"If I register 5 free ChatGPT accounts, do I not need to spend $20 a month to subscribe to ChatGPT Plus?"

OpenAI's usage policy for GPT-4o gives ChatGPT Plus users a usage cap 5 times higher than that of free users.

GPT-4o is free for everyone, which seems to challenge OpenAI's own business model.

Data from third-party market analysis platform Sensor Tower shows that in the past month, ChatGPT had 7 million downloads in the global App Store, generating $12 million in subscription revenue; in the global Google Play market, it had 90 million downloads, generating $3 million in subscription revenue.

Currently, the subscription price for ChatGPT Plus in both app stores is $19.99. Based on subscription data, it is estimated that ChatGPT Plus had about 750,000 paying subscribers through app stores in the past month. Although ChatGPT Plus has a large number of direct paying users, from the mobile revenue perspective, its annual income is less than $200 million, making it difficult to support OpenAI's nearly $100 billion valuation even if it multiplies several times.
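The subscriber estimate above is straightforward arithmetic; a quick sanity check using the article's own figures ($19.99 monthly price, $12M App Store plus $3M Google Play subscription revenue for the month):

```python
# Monthly subscription revenue reported for the past month (USD).
app_store_revenue = 12_000_000
google_play_revenue = 3_000_000
price = 19.99  # ChatGPT Plus monthly price in both app stores

monthly_revenue = app_store_revenue + google_play_revenue
subscribers = monthly_revenue / price   # implied paying subscribers via app stores
annual_revenue = monthly_revenue * 12   # naive annualization of one month's revenue

print(round(subscribers))   # about 750,000 subscribers
print(annual_revenue)       # $180M a year, i.e. under $200M
```

This is only mobile app-store revenue; it excludes web subscriptions and API income, but it illustrates why the article concludes that consumer subscriptions alone cannot support a valuation near $100 billion.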

From this perspective, OpenAI does not need to worry much about subscription revenue from individual users.

Moreover, GPT-4o emphasizes a smooth experience: if your conversation with the AI were cut off and you had to switch accounts to continue, wouldn't you rather pay for a subscription?

"The original ChatGPT hinted at the possibilities of a language interface; this new thing feels fundamentally different. It is fast, intelligent, fun, natural, and helpful."

Sam Altman's latest blog mentioned the "possibilities of a language interface," which is precisely what GPT-4o may aim to do next: challenge all GUIs (graphical user interfaces) and those who want to make strides in LUIs (language user interfaces).

Combined with recent reports of OpenAI's collaboration with Apple, it can be speculated that GPT-4o may soon extend an "olive branch" or "flip the table" to all AI PC and AI phone manufacturers.

Regardless of the type of voice assistant or AI large model, the core value for AI PCs and AI phones is an optimized experience, and GPT-4o has pushed that experience to the extreme.

GPT-4o is likely to disrupt all known apps and even the SaaS industry. Over the past year, all AI agents that have been developed or are in development will face threats.

A product manager from a resource aggregation app once told Huxiu, "My operational process is the core of the product; if your ChatGPT optimizes that process, it means my app has no value."

Imagine if the UI of a food delivery app changed to a single phrase, "Order me food"; then whether users open Meituan or Ele.me, it would be the same for them.

The next step for manufacturers can only be to compress the profit margins of the supply chain and ecosystem, or even engage in a vicious price war.

From the current situation, it seems that other manufacturers will need some time to surpass OpenAI in model capabilities.

To compete with OpenAI, products may only be able to do so by creating "cheaper" models.

Impact on the Domestic Industry

"I've been so busy lately that I haven't had time to pay attention to them."

A founder of an industrial AI large model told Huxiu that they have been busy communicating strategic partnerships, product launches, customer interactions, and capital exchanges, leaving no time to focus on releases from OpenAI.

Before OpenAI's announcement, Huxiu also inquired with several domestic AI practitioners from various industries about their predictions and views on OpenAI's latest release, and they were quite consistent: very much looking forward to it, but it has nothing to do with me.

One practitioner stated that given the current progress in China, it is unrealistic to catch up with OpenAI in the short term. So, keeping an eye on what OpenAI releases is mostly about observing the latest technological directions.

Currently, domestic companies in AI large model research and development are generally more focused on engineering and vertical models, which are more pragmatic and easier to monetize.

On the engineering side, the recently popular DeepSeek is stirring up a token price war in the domestic large model industry. On the vertical model side, several industry insiders told Huxiu that short-term research and development of small and vertical models will not be significantly influenced by OpenAI.

"Sometimes OpenAI's technological direction is not worth emulating." A model expert told Huxiu that Sora is a good example: in February 2024, OpenAI released the video model Sora, achieving stable output of 60-second videos. Although the results look impressive, there has been almost no follow-up practice, and real-world deployment has been very slow.

Before Sora, many companies and institutions focusing on text-to-video had already achieved stable video generation of 15 seconds, and after Sora's release, the R&D, financing, and product rhythms of some companies were disrupted, even turning the entire text-to-video industry's development into a "technological leap forward."

Fortunately, this time GPT-4o is very different from Sora. OpenAI CTO Mira Murati said that in the coming weeks the company will continue its iterative rollout to bring all the features to users.

Shortly after the press conference ended, GPT-4o was already available for trial use.
