OpenAI's all-purpose model GPT-4o delivers stunning real-time interaction, ushering in a science-fiction era

2024-05-14 09:58:22
Just 17 months after the release of ChatGPT, OpenAI has introduced a super AI straight out of a sci-fi movie, and it's completely free for everyone to use.

Original Title: OpenAI Disrupts the World: GPT-4o Completely Free, Real-Time Voice and Video Interaction Shocks Everyone, Directly Entering the Sci-Fi Era

Author: Machine Heart

Source: TechFlow Deep Tide

So shocking!

While other tech companies are still chasing large-model multimodal capabilities, integrating features like text summarization and image editing into phones, OpenAI, already far ahead, has unleashed a major move, releasing a product that even its own CEO, Sam Altman, marveled at: just like in the movies.

In the early hours of May 14, OpenAI unveiled its next-generation flagship generative model GPT-4o and a desktop app at its first "Spring Product Launch," showcasing a series of new capabilities. This time the technology has reshaped the product itself, and OpenAI has, through its actions, taught tech companies around the world a lesson.

The event was hosted by OpenAI Chief Technology Officer Mira Murati, who said the launch would focus on three things:

  • First, OpenAI will prioritize free access to products in the future, aiming to allow more people to use them.

  • Second, OpenAI has released a desktop version of the program and an updated UI, making it simpler and more natural to use.

  • Third, after GPT-4 comes a new version of the large model, named GPT-4o. What makes GPT-4o special is that it brings GPT-4-level intelligence to everyone, including free users, through extremely natural interaction.

After this ChatGPT update, the large model can accept any combination of text, audio, and images as input and generate any combination of text, audio, and images as output, in real time. This is the future of interaction.

ChatGPT recently became usable without registration, and today a desktop program has been added. OpenAI's goal is to let people use it seamlessly anytime and anywhere, integrating ChatGPT into their workflows. This AI is now a productivity tool.

GPT-4o is a new large model built for the future paradigm of human-computer interaction: it understands text, voice, and images, reacts quickly and with emotion, and feels very human.

At the event, OpenAI engineers demonstrated several key capabilities of the new model using an iPhone. The most important is real-time voice conversation. OpenAI's Mark Chen said, "It's my first time at a live launch event, and I'm a bit nervous." ChatGPT responded, "Why don't you take a deep breath?"

"Okay, I'm taking a deep breath."

ChatGPT immediately replied, "That's not good, you're breathing too loudly."

If you have used voice assistants like Siri, the differences here are obvious. First, you can interrupt the AI at any time; there is no need to wait for it to finish speaking before the next turn. Second, there is no lag: the model reacts extremely quickly, even faster than a human would. Third, the model fully understands human emotion and can express a range of feelings itself.

Next came the visual capability. Another engineer wrote an equation on paper and asked ChatGPT not to give the answer outright but to explain, step by step, how to solve it. It shows real potential for teaching problem-solving.

ChatGPT said, "Whenever you're struggling with math, I'm right here with you."

Next, they tried GPT-4o's coding capabilities. With some code on screen, the presenters opened the desktop version of ChatGPT and interacted with it by voice, asking it to explain what the code is for and what a particular function does; ChatGPT answered fluently.

The code's output was a temperature curve chart, and ChatGPT responded in one sentence to every question about it.

It could answer questions like which month is the hottest, and whether the Y-axis is in Celsius or Fahrenheit.
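For developers, this kind of chart Q&A can be reproduced through the API, since GPT-4o accepts image inputs in Chat Completions. Below is a minimal sketch; the file name `temperature_curve.png` and the question are illustrative stand-ins for the live demo:

```python
import base64

from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local screenshot of the chart (hypothetical file name).
with open("temperature_curve.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Which month is the hottest, and is the y-axis "
                            "in Celsius or Fahrenheit?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```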

OpenAI also answered some real-time questions from users on X/Twitter. For example, real-time voice translation: the phone can be used as a translator, converting between Spanish and English.
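The demo used real-time voice, which is not yet exposed through the public API; as a text-only stand-in, the translator behavior can be sketched with a system prompt (the prompt wording here is illustrative, not OpenAI's):

```python
from openai import OpenAI

client = OpenAI()

# Turn-by-turn translator prompt, mimicking the live voice demo in text.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a translator. When you receive English, "
                       "repeat it in Spanish; when you receive Spanish, "
                       "repeat it in English.",
        },
        {"role": "user", "content": "Hola, ¿cómo estás?"},
    ],
)
print(response.choices[0].message.content)  # e.g. "Hello, how are you?"
```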

Someone also asked, "Can ChatGPT recognize your expressions?"

It seems that GPT-4o is already capable of real-time video understanding.

Next, let's take a closer look at the bombshell that OpenAI released today.

All-Purpose Model GPT-4o

First, let's introduce GPT-4o: the "o" stands for "omni," as in omnimodel.

For the first time, OpenAI has integrated all modalities into a single model, significantly enhancing the practicality of large models.

OpenAI CTO Mira Murati stated that GPT-4o provides "GPT-4-level" intelligence but improves on GPT-4's capabilities in text, vision, and audio, and will be rolled out "iteratively" across the company's products over the coming weeks.

"The rationale for GPT-4o spans voice, text, and visual," said Mira Murati. "We know these models are becoming increasingly complex, but we want the interaction experience to be more natural and simpler, allowing you to focus entirely on collaborating with GPT without worrying about the user interface."

GPT-4o matches GPT-4 Turbo's performance on English text and code, shows significant improvement on non-English text, and is also faster and 50% cheaper in the API. Compared with existing models, GPT-4o stands out particularly in visual and audio understanding.
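For developers, GPT-4o is exposed through the same Chat Completions API as GPT-4 Turbo, just under a new model id. A minimal sketch, assuming the official `openai` Python package and an `OPENAI_API_KEY` environment variable:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Same Chat Completions interface as earlier GPT-4 models; only the id changes.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize GPT-4o's launch in one sentence."}],
)
print(response.choices[0].message.content)
```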

It can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response times in conversation. Before GPT-4o, users of ChatGPT's voice conversation feature experienced average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4).

That voice mode was a pipeline of three separate models: a simple model transcribed audio to text, GPT-3.5 or GPT-4 took text in and produced text out, and a third simple model converted that text back to audio. OpenAI found that this approach cost GPT-4 a great deal of information: the model could not directly observe tone, multiple speakers, or background noise, nor could it output laughter, singing, or emotional expression.
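OpenAI has not published that pipeline's internals, but its shape can be approximated with the company's public transcription, chat, and text-to-speech endpoints. A rough sketch with hypothetical file names, showing where the information is lost:

```python
from openai import OpenAI

client = OpenAI()

# Stage 1: speech-to-text. Tone, multiple speakers, and background
# noise are all discarded at this step.
with open("user_speech.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f).text

# Stage 2: a text-only model reasons over the bare transcript.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript}],
).choices[0].message.content

# Stage 3: text-to-speech. The synthesized voice cannot laugh, sing,
# or carry the emotion the text-only model never saw.
client.audio.speech.create(model="tts-1", voice="alloy", input=reply).write_to_file(
    "assistant_reply.mp3"
)
```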

With GPT-4o, OpenAI has trained a new model end-to-end across text, visual, and audio, meaning all inputs and outputs are processed by the same neural network.

"From a technical perspective, OpenAI has found a way to directly map audio to audio as a primary modality and to transmit video in real time to the transformer. These require some new research on tokenization and architecture, but overall it's a data and system optimization problem (most things are).” commented Jim Fan, a scientist at Nvidia.

GPT-4o can perform real-time reasoning across text, audio, and video, marking an important step towards more natural human-computer interaction (and even human-machine-machine interaction).

OpenAI President Greg Brockman also had some fun online, not only getting two GPT-4o models to converse in real time but also having them improvise a song. Although the melody was a bit "touching," the lyrics covered the room's decor, the speakers' clothing, and small anecdotes from along the way.

In addition, GPT-4o's ability to understand and generate images is far better than any existing model's, making many previously impossible tasks "easy as pie."

For example, you can ask it to help print OpenAI's logo on a coaster:

With this round of technical iteration, OpenAI appears to have finally solved ChatGPT's long-standing problem of rendering text in generated images.

At the same time, GPT-4o can also generate 3D visual content, performing 3D reconstruction from six generated images:

This is a poem, and GPT-4o can format it in a handwritten style:

More complex formatting styles can also be handled:

With GPT-4o, you only need to input a few paragraphs of text to get a series of continuous comic storyboards:

And the following features should surprise many designers:

This is a stylized poster evolved from two candid photos:

There are also some niche capabilities, like "text to artistic font":

GPT-4o Performance Evaluation Results

Members of OpenAI's technical team stated on X that the mysterious model "im-also-a-good-gpt2-chatbot," which previously sparked widespread discussion on LMSYS Chatbot Arena, is a version of GPT-4o.

On comparatively difficult prompt sets, especially for coding, GPT-4o shows particularly significant performance improvements over OpenAI's previous best models.

Specifically, in multiple benchmark tests, GPT-4o achieved GPT-4 Turbo level performance in text, reasoning, and coding intelligence, while achieving new highs in multilingual, audio, and visual capabilities.

Reasoning improvement: GPT-4o set a new high score of 87.2% on 5-shot MMLU (general knowledge questions). (Note: Llama 3 400B is still in training.)

Audio ASR performance: GPT-4o significantly improved speech recognition performance across all languages compared to Whisper-v3, especially for resource-scarce languages.

GPT-4o achieved a new state of the art in speech translation, outperforming Whisper-v3 on the MLS benchmark.

The M3Exam benchmark is both a multilingual evaluation benchmark and a visual evaluation benchmark, consisting of standardized multiple-choice questions from various countries/regions, including graphics and charts. In all language benchmark tests, GPT-4o outperformed GPT-4.

In the future, improvements in model capabilities will enable more natural, real-time voice conversations and allow users to converse with ChatGPT through real-time video. For example, users could show ChatGPT a live sports event and ask it to explain the rules.

ChatGPT Users Will Get More Premium Features for Free

Over 100 million people use ChatGPT every week. OpenAI announced that the text and image capabilities of GPT-4o will be available for free in ChatGPT starting today, and Plus users will receive up to five times the message limit.

Now, when opening ChatGPT, we find that GPT-4o is already available.

With GPT-4o, free ChatGPT users can now access the following features:

  • Experience GPT-4-level intelligence;

  • Get responses from both the model and the web;

  • Analyze data and create charts;

  • Chat about photos they take;

  • Upload files for help with summarizing, writing, or analysis;

  • Discover and use GPTs and the GPT Store;

  • Use the Memory feature to build a more helpful experience.

However, the number of messages free users can send with GPT-4o will be limited based on usage and demand. When the limit is reached, ChatGPT automatically switches to GPT-3.5 so the conversation can continue.

Additionally, OpenAI will roll out a new alpha version of Voice Mode built on GPT-4o to ChatGPT Plus in the coming weeks and will make more of GPT-4o's new audio and video capabilities available through the API to a small group of trusted partners.

Of course, extensive testing and iteration have shown that GPT-4o still has limitations in every modality, and OpenAI says it is working hard to improve on these imperfections.

Opening up GPT-4o's audio modality will inevitably introduce new kinds of risk. On safety, GPT-4o builds safeguards into its cross-modal design through techniques such as filtering training data and refining the model's behavior after training. OpenAI has also created new safety systems to protect voice outputs.

New Desktop App Simplifies User Workflow

For both free and paid users, OpenAI has also launched a new ChatGPT desktop application for macOS. With simple keyboard shortcuts (Option + Space), users can immediately ask ChatGPT questions, and they can also take screenshots directly within the application for discussion.

Users can now also have voice conversations with ChatGPT directly from their computers. GPT-4o's audio and video features will come to the app later; a voice conversation starts with a click on the headphone icon in the bottom-right corner of the desktop application.

Starting today, OpenAI will roll out the macOS application to Plus users and will make it more widely available in the coming weeks. Additionally, a Windows version will be launched later this year.

Altman: You Open Source, We Offer Free Access

After the launch, OpenAI CEO Sam Altman published a blog post for the first time in a while, sharing his thoughts on the journey of pushing GPT-4o:

In today's release, I want to emphasize two things.

First, a key part of our mission is to provide powerful AI tools for free (or at a low cost) to people. I am very proud to announce that we are offering the world's best model for free in ChatGPT, without ads or anything similar.

When we founded OpenAI, our original vision was that we would create AI and use it to create all kinds of benefits for the world. Things have changed: now it looks like we will create AI, and then other people will use it to create all kinds of amazing things that benefit us all.

Of course, we are a business and will find plenty of things to charge for, and that will help us provide free, excellent AI services to (hopefully) billions of people.

Second, the new voice and video modes are the best computational interaction interfaces I have ever used. It feels like the AI in the movies, and I am still a bit surprised that it is real. It turns out that achieving human-level response times and expressive capabilities is a huge leap.

The original ChatGPT hinted at what a language interface could be; this new thing (the GPT-4o version) feels fundamentally different: it is fast, smart, fun, natural, and helpful.

Interacting with computers has never felt very natural to me; this does. And as we add (optional) personalization, access to personal information, and the ability for AI to act on people's behalf, I can genuinely see an exciting future in which we can use computers to do far more than ever before.

Finally, a huge thank you to the team for their tremendous efforts to achieve this goal!

It is worth mentioning that last week Altman stated in an interview that while universal basic income is difficult to achieve, we can achieve "universal basic compute." In the future, everyone will have free access to GPT's computing power, which can be used, resold, or donated.

"The idea is that as AI becomes more advanced and embedded in every aspect of our lives, having a large language model unit like GPT-7 may be more valuable than money; you gain part of the productivity," Altman explained.

The release of GPT-4o may be the beginning of OpenAI's efforts in this direction.

Yes, this is just the beginning.

Finally, it is worth mentioning that the "Guessing May 13th's announcement" video OpenAI featured on its blog today overlaps almost entirely with a teaser video for Google's I/O conference tomorrow, which is undoubtedly a direct shot at Google. One wonders whether Google felt enormous pressure watching today's OpenAI launch.
