OpenAI’s latest flagship model, GPT-4o, is not only free but can also listen, watch, and speak, and it does so smoothly and with almost no delay, much like a video call. We will also dig into GPT-4o vs. Gemini 1.5 Pro and Claude Opus: which comes out on top, and what are the strengths and weaknesses of each model?
Introducing GPT-4o: An Even More Explosive Live Demo
It can pick up on your breathing rhythm, respond in real time with a richer tone than before, and you can even interrupt it at any time.
The “o” in GPT-4o stands for “omni”, meaning “all”. It accepts any combination of text, audio, and images as input and generates any combination of text, audio, and images as output.
It can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, comparable to human response times in a conversation.
This is also a great gift for everyone: all the capabilities of GPT-4o and the ChatGPT Plus tier, including vision, web browsing, memory, code execution, and the GPT Store, will be free and open to all users!
(The new voice mode will be available to Plus users in a few weeks)
During the live broadcast, CTO Mira Murati said that this makes a GPT-4-level model available to everyone. In fact, she was being modest.
Outside the venue, researcher William Fedus revealed that GPT-4o is the model that was previously A/B tested in the Chatbot Arena under the name im-also-a-good-gpt2-chatbot.
Judging by both users’ hands-on impressions and the Arena rankings, it sits above the GPT-4-Turbo level, and its Elo score is unmatched.
This super-powerful model is also available through the API at half the price, twice the speed, and with three to five times the rate limit.
Viewers following the live broadcast were already imagining applications, such as serving as eyes for blind people, and the experience feels much better than the previous voice mode.
Detailed Analysis of the GPT-4o Launch Video
Why did OpenAI President Greg Brockman schedule the event for the day before Google I/O?
Google’s Gemini launch achieved its real-time dialogue effects through video editing and carefully switched prompts, whereas OpenAI demonstrated everything live.
For example, ChatGPT was asked to interpret between two people who do not speak each other’s language: when it hears English it translates into Italian, and when it hears Italian it translates back into English.
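For intuition, here is a minimal sketch of how such an interpreter could be set up with the publicly documented Chat Completions API. The system prompt, model name, and text-only flow are assumptions for illustration; the live demo used GPT-4o’s native voice mode, which was not exposed through the public API at launch.

```python
# Minimal sketch of a bidirectional English/Italian interpreter prompt.
# Assumes the openai Python SDK (v1.x) and OPENAI_API_KEY in the environment.
# Text-only for illustration; the on-stage demo ran through GPT-4o's voice mode.
from openai import OpenAI

client = OpenAI()

INTERPRETER_PROMPT = (
    "You are a live interpreter. When you receive English, reply only with the "
    "Italian translation. When you receive Italian, reply only with the English translation."
)

def interpret(utterance: str) -> str:
    """Translate one utterance in whichever direction is needed."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": INTERPRETER_PROMPT},
            {"role": "user", "content": utterance},
        ],
    )
    return response.choices[0].message.content

print(interpret("Hey, how has your week been?"))   # expected: Italian output
print(interpret("Tutto bene, grazie. E la tua?"))  # expected: English output
```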
In addition to the live broadcast, President Greg Brockman released an extra five-minute detailed demonstration.
In it, two ChatGPTs were allowed to talk to each other and even sang together at the end, which made for quite a show.
Of the two ChatGPTs, one is the old version of the app that can only hold a conversation, and the other is a new web version with new capabilities such as vision. (Taking the first letters of Old and New, let’s call them Little O and Little N.)
Brockman Demonstrates the Visual Capabilities of the New GPT-4o
Brockman first explained the situation to Little O and told her she would be talking to an AI with visual capabilities. She said that was cool and happily agreed.
Then Brockman asked her to wait a moment while he briefed Little N and showed off Little N’s visual ability.
After saying hello, Little N accurately described Brockman’s clothing and the room, and also found the idea of chatting with Little O very interesting.
Next, Little O and Little N talked to each other, starting once again with Brockman’s clothes. Little O kept asking new questions, and Little N answered them one by one.
They then discussed the style, layout, and lighting of the room, and GPT-4o even noticed that Brockman was watching them.
If you watch the video, you will notice a woman behind Brockman making some funny gestures.
While Little O and Little N were chatting happily, Brockman jumped in and asked directly whether they had seen anything unusual.
Little N saw through Brockman’s little trick and described the woman playfully gesturing behind him. Hearing this, Little O remarked that apparently they weren’t the only ones having fun.
GPT-4o in Duet Mode
Brockman took this as a compliment, thanked Little O, and happily joined the conversation.
Then came the last and most exciting part: at Brockman’s prompting, Little O and Little N launched straight into a duet based on what they had just been chatting about.
After only a few rounds, they were trading lines seamlessly, the melody was pleasant, and the timbre sounded just like a real person.
The video ends with Brockman singing a “thank you.” In a separate tweet, he also revealed that the new voice conversation feature will roll out to Plus users in the coming weeks.
End-to-End Training: One Neural Network That Handles Speech, Text, and Images
As Altman said before the event, GPT-4o feels like magic. So how does it do it?
Unfortunately, there is no paper this time, nor even a technical report; the official blog offers only a brief explanation.
Differences Between GPT-4o’s Voice Mode and That of Previous GPT Models
Before GPT-4o, ChatGPT’s voice mode was a pipeline of three independent models: speech-to-text → GPT-3.5/GPT-4 → text-to-speech.
We can also let the old version of ChatGPT voice mode explain the specific process.
As a result, the whole system has average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4), and a great deal of information is lost along the way: the model cannot directly perceive tone of voice, multiple speakers, or background noise, nor can it produce laughter, sing, or express emotion.
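To make the cascade concrete, here is a minimal sketch of that old three-model pipeline built from OpenAI’s public transcription, chat, and text-to-speech endpoints. The model names, voice, and file paths are illustrative assumptions; the point is that each hop is a separate network call, and everything except the plain words is dropped at the first step.

```python
# Sketch of the pre-GPT-4o voice pipeline: three separate models chained together.
# Model names, voice, and file paths are illustrative; assumes the openai SDK (v1.x).
from openai import OpenAI

client = OpenAI()

# 1) Speech-to-text: tone, laughter, and speaker identity are discarded here.
with open("user_question.mp3", "rb") as audio_file:  # placeholder input file
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2) Text-only reasoning: the LLM never hears the original audio.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = reply.choices[0].message.content

# 3) Text-to-speech: any emotion has to be re-synthesized from plain text.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
with open("assistant_reply.mp3", "wb") as f:
    f.write(speech.content)
```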
GPT-4o is a new model trained end-to-end across text, vision, and audio, meaning the same neural network processes all inputs and outputs.
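By contrast, a single multimodal request can carry mixed inputs in one call to the same model. The sketch below sends text plus an image to GPT-4o through the Chat Completions API; the image URL is a placeholder, and note that audio input and output were demoed in ChatGPT but not yet exposed through the public API at launch.

```python
# Sketch: one model, one request, mixed text + image input (image URL is a placeholder).
# Audio in/out was shown in ChatGPT but was not available via the public API at launch.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the person in this photo wearing?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/demo-photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```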
GPT-4o vs Gemini 1.5 Pro and Claude Opus
On speech translation, it outperforms OpenAI’s own dedicated speech model, Whisper-v3, as well as Google’s and Meta’s speech models.
On visual understanding, it once again surpasses Gemini 1.0 Ultra and its rival Claude Opus. Although only limited technical detail was revealed this time, some scholars commented that a successful demonstration is worth a thousand papers.
In addition to the exciting announcements from OpenAI, don’t forget that Google will hold its I/O conference early on May 15.
And given how powerful GPT-4o is and that it is free and open to everyone, one has to wonder: is OpenAI effectively telling people not to renew their ChatGPT Plus subscriptions?