They tell me no one outside Google has seen what we’re about to see. No journalist has ever been in this building. Here in Google’s Mountain View labs, the company’s creating life-size AI agents that can see you, and talk to you.
This one’s name is Sophie.
It can speak any number of languages. It can see me, and almost everything else in the room. It can read, if I hold up my phone, a piece of paper, or a book. And of course, it can do Google-y things like pull up maps, recommend restaurants, check the weather, look up simple facts — only now with a woman’s face, a dark turtleneck, and an attempt at body language.
If it didn’t feel so fake and flat, I think, this could be pretty cool. But it might not seem fake and flat for long.
Image: Matt Piniol / The Verge
Today, Google is experimentally revealing “Beam video agents,” which it’s pitching as an exploration into the future of real-time communication with AI agents, using Google Beam.
Beam, if you’ll recall, is the company’s moderately mind-blowing teleconferencing hardware that makes people feel like their conversation partner is right in front of them in stunning glasses-free 3D. The first Google Beam product is the $25,000 HP Dimension. Its six cameras don’t actually send video of another person. Instead, AI servers combine their feeds into a volumetric 3D projection of a person: basically, the most lifelike video game character I’ve ever seen.

When I try Beam for the very first time, with Beam boss Andrew Nartker on the other side, visions of the Star Trek holodeck begin dancing before my eyes. But what if you weren’t talking to a person to begin with? What if it was a virtual character all along?
Sophie, unfortunately, is not in 3D yet, and “she” is not a character either — at least not with the limited feature set Google has enabled today. Like any second-gen chatbot, Sophie is here to mirror me, to unconvincingly act excited by everything I say, to act as my subservient concierge. Sophie always speaks after a long pause, in tight blocks of text that are always roughly the same length, starting with an acknowledgement and ending with a question about which capability it should demo next.
That’s intentional, says the team, because this demo was built specifically to give Google I/O attendees a five-minute tour of what Sophie can do, like creating a generative AI picture. I ask it to tell me a bedtime story, and facepalm when it produces a ridiculous image of me manipulating some magical contraption with the help of a giant fox. (My kids would probably love it.)
I won’t sugarcoat it: this doesn’t feel like talking to a person. There are too many cracks in the facade. Why does Sophie’s accent keep weirdly changing, sometimes developing a southern twang that goes away just as quickly? It’s meant to have a neutral American accent when speaking in American English, says product manager Pavan Kumar, but the AI model seems to be unintentionally drifting. I notice that Sophie keeps making the same exact arm gestures while speaking, too — probably because this early experiment is built off an audio model. Text drives speech, speech drives a lip-synced face, and gestures are presumably gravy on top.
(And yes, the subservient AI is female as usual. It’s because Sophie has a personality that everyone felt comfortable talking to, says Nartker.)
Video agents aren’t Beam’s only new experiment that’s not in 3D. Google is also showing off group calls on Beam for the very first time, and it’s exactly what you’d expect — Googlers can dial in from their laptop or phone like a normal Google Meet, a feature that was missing from Beam at launch but has been in development for a couple of years. (Google says it’s working with Zoom too.)
That should definitely make justifying the purchase of a $25,000 device easier, and it comes with positional audio so the Beam user can tell who’s speaking even more easily — though attendees may shrink down to smaller-than-lifelike proportions, or it may alternate who’s on screen, if you have more than three counterparts on a call.
The only strange bit here is why Google’s calling it an experiment instead of announcing a release date. With Beam video agents, the experimental tag makes more sense because it’s not ready and Google’s not 100 percent sure who it’s for yet — though the company thinks it might be useful in workplaces, shops, and schools.
For me, the most tantalizing possibilities are the ones Nartker isn’t showing me on our tour, as we walk past a robot arm designed to test the Beam’s head-tracking capabilities, and a server rack full of Beam boards in the middle of accelerated lifecycle testing. Many of them run 10-hour loops, day after day, to ensure they can hold up to real-world use.
Nartker keeps hinting there are things we can’t yet see in other parts of this building, things he’s had staff clear away — at one point, he makes sure a particular door is closed before he offers us a glass of water. When he explains that both his digitally rendered body and my digitally rendered body are 3D meshes living in a cloud server, it hits me that they could also exist in a virtual world right now. Perhaps we could see them through a headset too?
I ask him about my theory. “There are lots of windows we’d like to build: big windows, small windows. This is just a really excellent first window,” says Nartker.
“You’ve already got an internal demo of this in VR, right?” I ask, at a different point on the tour.
“It’s a big building, Sean!” he teases. He promises he’ll invite me back for more.
