[LINK] Introducing GPT-4o
Stephen Loosley
stephenloosley at outlook.com
Thu May 16 19:45:58 AEST 2024
ChatGPT gets a voice – and feelings
The "astonishing" GPT-4o is just weeks away from release.
By David Braue on May 16 2024
https://ia.acs.org.au/article/2024/chatgpt-gets-a-voice---and-feelings.html
Introducing GPT-4o: https://www.youtube.com/watch?v=DQacCB9tDaw
OpenAI has demonstrated GPT-4 omni (GPT-4o), an “astonishing” update
that gives ChatGPT a human-sounding voice, translation skills, computer
vision, an emotional range – and a singing voice.
Introduced in a live demonstration led by OpenAI chief technology
officer Mira Murati, the user interface changes built into the GPT-4o
large language model (LLM) – which she said “will bring GPT-4 level
intelligence to everyone, including our free users” as it is rolled out
in coming weeks – have been designed to make interaction with the model
“much more natural and far, far easier.”
“For the past couple of years, we’ve been very focused on improving the
intelligence of [GPT] models but this is the first time that we are
really making a huge step forward when it comes to ease of use,” Murati
said.
“This is incredibly important because we’re looking at the future of
interaction between ourselves and the machines.”
Through a series of demonstrations, Murati – along with head of
frontiers research Mark Chen and post-training team lead Barret Zoph –
showed how GPT-4o – which is also set to debut in a desktop app –
provides a natural language interface that supports dozens of languages
and a wide range of queries while delivering near-instantaneous responses.
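For readers who want to try the model programmatically rather than through the app, here is a minimal sketch using OpenAI's public Python SDK. It assumes the openai package is installed and an OPENAI_API_KEY environment variable is set; the prompt and output handling are illustrative only and are not taken from the demonstration.

    # Minimal sketch: querying GPT-4o through OpenAI's Python SDK.
    # Assumes `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user",
             "content": "Translate 'Where is the nearest train station?' "
                        "into Italian and then Japanese."},
        ],
    )

    # Print the model's reply (text only; voice is handled by the app layer).
    print(response.choices[0].message.content)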
The new model’s faster speed meant the demonstrators could interrupt the
GPT-4o voice in mid-sentence, giving it new instructions just like one
person might interrupt another during the natural flow of conversation.
Asked to read a bedtime story, GPT-4o changed its tone of voice when
requested, speaking in a more intense way each time it was asked to add
more “drama” – then switching to a dramatic robotic voice, and singing
the end of the story as well.
The multi-modal model also integrates computer vision – allowing it to,
for example, interpret a handwritten linear equation and talk Zoph
through the process of solving it.
GPT-4o’s computer vision capabilities also enabled it to analyse a
selfie of Zoph and infer his emotional state – “pretty happy and
cheerful,” the model surmised, “with a big smile and maybe a touch of
excitement”.
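The same chat interface accepts images alongside text, which is roughly how a selfie-style prompt like Zoph's could be reproduced against the public API. The sketch below is an assumption-laden illustration, not the demo's actual code; the file name and prompt are invented.

    # Sketch: sending an image to GPT-4o and asking about the subject's mood.
    # The file name "selfie.jpg" and the wording of the prompt are hypothetical.
    import base64
    from openai import OpenAI

    client = OpenAI()

    with open("selfie.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "How does the person in this photo seem to be feeling?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )

    print(response.choices[0].message.content)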
Once more, with feeling
The voice capabilities of GPT-4o immediately drew comparisons online to
‘Samantha’, the Scarlett Johansson-voiced AI companion from the 2013
movie ‘Her’ – which mainstreamed the idea of an emotional,
human-sounding AI capable of convincing willing users that it was human.
The emotive range of the new AI is “quite astonishing,” Alex Jenkins,
director of Curtin University’s WA Data Science Innovation Hub, told
Information Age.
He likened the original ChatGPT to “a deaf person who read every book in
the world, every journal article, and every piece of paper they could
get their hands on – but they didn’t know how the world sounds.”
“They didn’t know what human speech was like,” he said, “and that
obviously has an impact in terms of communicating in a human-like way,
because we use expression in our voice all the time as a key component
of communication.”
Although computers have been “talking” for many years, Jenkins added,
“dumb” previous text-to-speech engines “didn’t understand the intent and
context of the conversation. They were reading the words out and not
applying inflection in any kind of meaningful way.”
“This new model understands how the world sounds and how people sound,
and it’s able to express its voice in a similar fashion to what humans
can do.”
The announcement quickly drew a counter salvo from Google, which
announced the availability of its Gemini 1.5 Pro LLM – which adds
features such as analysis of audio files and uploaded documents up to
1,500 pages long.
Availability of GPT-4o as a desktop app will also threaten Apple’s Siri
– reportedly due for an AI overhaul at next month’s Worldwide Developers
Conference – and Microsoft’s Cortana voice assistants, with Zoph
demonstrating how he could feed an application’s source code to the
desktop app and ask it questions about that code – such as what it does
or what its output means.
Progressing the technology to this point “is quite complex,” Murati
said, “because when we interact with each other there is a lot of stuff
that we take for granted.”
Where previous GPT models orchestrated three separate elements to
produce speech – transcription, intelligence, and text-to-speech – she
explained that GPT-4o integrates these capabilities natively across
voice, text, and visual prompts.
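For contrast, that older three-stage voice pipeline can be sketched with OpenAI's existing speech and text endpoints. This is only an illustration of the architecture Murati described, not OpenAI's internal implementation; the model names and file names are placeholders.

    # Sketch of the earlier three-stage voice pipeline (illustrative only).
    from openai import OpenAI

    client = OpenAI()

    # 1. Transcription: convert the user's recorded speech to text.
    with open("question.wav", "rb") as audio_in:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_in)

    # 2. Intelligence: a text-only LLM reasons over the transcript,
    #    losing tone, emotion and background sound along the way.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )
    answer = reply.choices[0].message.content

    # 3. Text-to-speech: read the reply back to the user.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    speech.write_to_file("answer.mp3")

In GPT-4o these stages are handled natively by a single model, which is what cuts latency enough to allow the mid-sentence interruptions and tone changes shown in the demonstration.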
The efficiency of GPT-4o is also significant because it marks the first
time that OpenAI’s GPT-4 LLM – a far more powerful engine than the
widely used GPT-3.5, and one that has previously only been offered to
paying customers – is available to any user, for free.
As the benchmark against which other LLMs are measured in terms of
capability, speed and security, GPT-4’s general availability will
significantly boost the AI capabilities available to the mass market –
with GPT-4o’s voice-driven user interface enabling a broad range of new
use cases.
As well as being useful in applications such as helping autistic people
learn to communicate verbally, the new model will likely be able to write
poetry “that sounds like it flows, and sounds lyrical,” Jenkins said.
“We’re a long way from the sort of doomsday Skynet scenario,” he laughed.
“I think the biggest immediate risk is that we'll be inundated with a
lot of mediocre poetry.”