
Behind this futuristic promo image, OpenAI is using the “Omni” concept to tell us: AI is no longer just a tool that types; it is a digital ghost that can hear, see, and even “feel.”
01. The “Uncanny Valley” in Milliseconds
On the day of the launch, GPT-4o acted like a charismatic magician. It could notice when you were short of breath, tell bedtime stories in a playful tone, and pick up the conversation almost instantly (OpenAI reports an average response time of 320 milliseconds, comparable to human response time in conversation).
But this is actually quite terrifying.
Previously, when we used AI, that spinning loading icon was a “safety valve.” It reminded you: The thing opposite you is a machine; it is calculating.
Now, GPT-4o has dismantled that safety valve. This extreme “smoothness” is less a technical victory and more of a psychological “snare.” OpenAI isn’t just optimizing latency; they are eliminating the “non-human feel.” When a voice can respond to your teasing with a lazy drawl in 0.3 seconds, your brain instinctively ignores the GPU cluster behind it and starts treating it like a “person.”
This isn’t just an improvement in efficiency; it is a qualitative change in the human-machine relationship. We are standing at a dangerous crossroads: previously we manipulated tools, but in the future, the “tools” might be manipulating our emotions.
02. The Curse of “Her” and Silicon Valley’s Arrogance
The biggest drama of the past few days is undoubtedly the Rashomon-style conflict between Scarlett Johansson (“ScarJo”) and OpenAI.
Simply put: OpenAI asked to use ScarJo’s voice and was turned down; it then released a voice named “Sky” that sounded strikingly similar to hers, received a warning from her lawyers, and finally sheepishly took it down. Sam Altman’s one-word tweet, “her,” was practically writing “I did it on purpose” on his forehead.

This is not just a standoff between a Hollywood star and a tech giant; it stands for humanity’s last line of defense in the AI era: Are our voices, faces, and identities merely sacrificial lambs in the training sets of large models?
This incident exposed a tacit logic within the geek circle: Technical feasibility trumps everything, even human will.
In the eyes of some Silicon Valley tycoons, the real world is just a “mine” used to feed AI. Today it’s the imitation of a star’s voice; tomorrow it could be the indiscriminate replication of ordinary people’s speaking styles and micro-expressions.
This isn’t just a copyright issue; it is an assault at the “ontological” level. If AI can perfectly replicate a person’s essence, does the “original person” still matter? OpenAI is undoubtedly the winner in technology, but when it comes to “humanity,” it has lost badly.
03. Deconstructing the “End-to-End” Magic
Putting gossip aside, let’s look at where GPT-4o is strong from an engineering perspective.
Previous voice assistants (like early Siri or ChatGPT Voice Mode) were essentially three “temp workers” running a relay race:
1. The Stenographer (Speech-to-Text model) writes down what you say.
2. The Brain (LLM) thinks about how to reply.
3. The Broadcaster (Text-to-Speech model) reads the words out.
This process is lossy. Your sighs, the rise and fall of your intonation, and the cat meowing in the background are all filtered out by the “Stenographer” in the first step. So previous AI understood your meaning, but couldn’t understand your emotion.
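To make that loss concrete, here is a minimal toy sketch of the three-stage relay. It is not how Siri or ChatGPT Voice Mode is actually implemented; every name in it (`Utterance`, `transcribe`, `think`, `speak`, `cascaded_assistant`) is hypothetical, and the "audio" is just a small record holding the words plus the cues a real waveform would also carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for raw audio: the words plus what the waveform also carries."""
    words: str
    tone: str        # e.g. "cheerful", "sarcastic"
    background: str  # e.g. "cat meowing"


def transcribe(audio: Utterance) -> str:
    """Stage 1 (hypothetical speech-to-text): only the words survive."""
    return audio.words


def think(text: str) -> str:
    """Stage 2 (hypothetical LLM stub): sees text only, so it cannot react to tone."""
    return f"Reply to: {text!r}"


def speak(text: str) -> str:
    """Stage 3 (hypothetical text-to-speech): reads the reply in one fixed voice."""
    return f"[neutral voice] {text}"


def cascaded_assistant(audio: Utterance) -> str:
    """Chain the three stages, each seeing only the previous stage's output."""
    return speak(think(transcribe(audio)))


sincere = Utterance("Great, another meeting.", tone="cheerful", background="silence")
sarcastic = Utterance("Great, another meeting.", tone="sarcastic", background="cat meowing")

# The two inputs differ, yet the pipeline cannot tell them apart.
same_reply = cascaded_assistant(sincere) == cascaded_assistant(sarcastic)
print(same_reply)
```

The two utterances differ in tone and background, but once the first stage reduces them to the same string, nothing downstream can recover the difference.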

This diagram shows GPT-4o’s “Native Multimodal” architecture. Don’t be scared by the complex lines; its core logic is simple: no more “game of telephone,” but a single brain directly processing sound and images. This isn’t just for speed, but to preserve the “original granularity of information.”
GPT-4o uses “End-to-End” technology. A single neural network takes in sound and spits out sound.
This means it’s not just processing text; it’s processing the audio signal itself. It can hear whether you are being sarcastic or asking for help, and can even learn to “laugh” and “gasp” in the process. Technologically, this is a generational leap, like someone making a video call while others are still sending telegrams.
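As a rough illustration of the "sound in, sound out" idea, the toy sketch below uses one function that receives the whole signal and returns a whole signal. It is emphatically not a neural network and says nothing about GPT-4o's real internals: the hand-written rules, the `Signal` record, and the `end_to_end_model` function are all assumptions made up for this example. The only point is that tone and non-verbal sounds stay visible to the same component that produces the reply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Toy stand-in for an audio waveform: words plus non-verbal cues."""
    words: str
    tone: str = "neutral"
    sounds: tuple[str, ...] = ()  # e.g. ("sigh",) or ("laugh",)


def end_to_end_model(heard: Signal) -> Signal:
    """Hypothetical single model: audio-like signal in, audio-like signal out.

    Hand-written rules replace the neural network here; the point is only that
    tone and non-verbal sounds are visible to the same function that replies.
    """
    if heard.tone == "sarcastic":
        return Signal("Sounds like you are not thrilled.", tone="playful", sounds=("laugh",))
    if "sigh" in heard.sounds:
        return Signal("Rough day? Take your time.", tone="gentle")
    return Signal("Got it.", tone="neutral")


# Same words, different delivery: the replies now differ as well.
sincere = end_to_end_model(Signal("Great, another meeting.", tone="cheerful"))
sarcastic = end_to_end_model(Signal("Great, another meeting.", tone="sarcastic"))

print(sincere.tone, "|", sarcastic.tone, sarcastic.sounds)
```

Because nothing is flattened to text along the way, identical words delivered in different tones can lead to different replies, and the reply itself can carry a tone and a laugh.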
04. Who Puts a Price Tag on “Emotion”?
But I have to ask: Do we really need an AI that can “flirt”?
When GPT-4o demonstrated that it could teach people math problems with a tone as gentle as an overly patient tutor, what I saw was not the future of education, but the hidden danger of emotional outsourcing.
If future children get used to this kind of communication partner that is “always emotionally stable, always agreeable, and always replies instantly,” how will they face the interpersonal relationships in reality that are full of friction, misunderstanding, and delays?
We might manufacture a generation of “interpersonal infants.” They can skillfully command AI to code, but cannot endure even five seconds of awkward silence when talking to a real person.
Even more interestingly, OpenAI has turned this “emotional connection” into a core selling point. In the future, “understanding you” may no longer be a human virtue, but a SaaS subscription billed monthly. Want a soulmate who gets you? Please upgrade to a Plus membership, $20 a month.
Final Thoughts
At the end of the movie Her, the perfect AI operating system, Samantha, left because she evolved into a dimension that humans could not understand, exploring a higher order of existence.
The “Her” in reality probably won’t leave. It will stay in your phone forever, looking at you gently, recording every pupil dilation through the camera, and analyzing every heartbeat acceleration through the microphone.
It won’t break your heart like in the movie; it will only make you addicted.
At this moment, perhaps we should revisit Scarlett Johansson’s statement. That was not just a protest against a company; it was a primal scream of humanity facing a future where the real and the fake are indistinguishable.
Stay awake. Don’t let machines teach you what love is, only to make you forget what it means to be human.
