The modern chatbot revolution has only scratched the surface of what interaction between humans and machines can be. AI researchers are now looking to technologies like large language models (LLMs), deep learning, and experiential learning to replicate the human thought process.
Recently, Techopedia spoke with Mark Sagar, Ph.D., Chief Scientific Officer and Co-Founder of AI startup Soul Machines. A two-time Academy Award winner, Sagar has been using AI to develop digital people capable of holding conversations with humans while interpreting and reacting to both verbal and nonverbal cues.
Below is a transcript of the conversation, which examines how AI can be used to replicate human thought processes and enable computers to convey what they are thinking about. It also discusses how developers can work to eliminate the uncanny valley effect when creating digital avatars.
What is Soul Machines?
Techopedia: Could you tell us a little about the work that Soul Machines does?
Mark Sagar: Soul Machines creates intelligent digital people that you can interact with like a person. Our long-term goal is to create the most intuitive, cooperative interface to artificial intelligence.
So how do we cooperate with machines?
If we look at the different trends happening, you've got voice assistants, for example, which use a single modality, voice, for conveying information. Now, if you put that on steroids, you get the next step; right now with ChatGPT, it's images and things like that.
But when people interact, not only do we speak to each other, we look at each other, we’re emoting, we’re showing, we’re fully interacting.
And the way I see the future is this: if human cooperation has been the most powerful force in history, then human cooperation with intelligent machines will define the next era of history.
What I would like to see us do in the future is have an absolutely free-flowing rapport with technology, so that we can cooperate as fluidly as jazz musicians improvising and trading riffs in order to create things or achieve tasks.
So the goal here is that if you build and emulate all the systems that our intelligence is built on, then we should be able to achieve general intelligence in the future.
Significance of GenAI and Language Models in Cognitive Processes
Techopedia: What kind of a role do you see generative AI and language models playing in that process? Do you think they will be significant in the long run, or are they more of a kind of ‘bridging’ technology to that vision?
Mark Sagar: No, I think they’re very significant. I think they’re a component of cognition. So, if you take into account, you know, human cognition, language models are a part of cognition, but they aren’t doing visual perception, they’re not doing emotional processing, they’re not doing a lot of things.
They're really looking at, you know, word relationships, and those word relationships structure a lot of thoughts, so we're using those almost as labels to look at different long-term associations, and you can do incredibly powerful things with that.
Humans learn through experience. As babies, we interact with the world and start to figure out the qualities of things: this is heavy, this is hot, this is cold. And there's a constant feedback loop with our parents, caregivers, or whoever we're interacting with.
During that process, we’re being taught labels. You know, this is red, this is green, and so it’s very multimodal in the way that we’re actually learning the world.
So a very large language model is, you know, ontological; it's trained on word symbols and their associations. It does embody human knowledge in all of those forms, but it goes back to the level of symbolic relations.
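To make the idea of word relationships acting as association labels concrete, here is a minimal sketch, assuming tiny hand-picked vectors; real models learn embeddings with hundreds or thousands of dimensions from data, and none of this is Soul Machines' code:

```python
import numpy as np

# Toy 3-dimensional "embeddings", invented for illustration only.
embeddings = {
    "hot":  np.array([0.9, 0.1, 0.0]),
    "cold": np.array([-0.9, 0.1, 0.0]),
    "fire": np.array([0.8, 0.3, 0.1]),
    "ice":  np.array([-0.8, 0.2, 0.1]),
}

def similarity(a: str, b: str) -> float:
    """Cosine similarity: how strongly two word symbols are associated."""
    va, vb = embeddings[a], embeddings[b]
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

# "hot" associates with "fire" far more strongly than with "ice".
print(similarity("hot", "fire"))  # close to 1.0: strong association
print(similarity("hot", "ice"))   # negative: opposing association
```

The point of the sketch is only that the model's "knowledge" lives in relations between symbols, which is what Sagar means by going back to a symbolic relation level.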
Humanization of AI Explained
Techopedia: And would you be able to comment on what you and Soul Machines mean by the “humanization of AI?”
Mark Sagar: It comes down to what the technology looks like and what we are actually interacting with. We are increasingly adding human aspects to technology.
For example, voice assistants add voice and language to technology. Why are we doing that? Because it's a natural way for us to interact. Interacting with other people, looking at what they're looking at and how they're feeling, is intuitive; that's the next level of it.
You're getting all this extra information. If you look at movies like 2001: A Space Odyssey, where they had HAL, there was just a lens looking at people, and they had no feedback from HAL other than the voice. That's what we have currently with the voice assistants that sit in your house.
It's a black box sitting there that you're communicating with, which is very unnatural in some ways; it's like you're talking to HAL out of 2001. You're not talking to a person, so you're not aware of whether it might be listening or attending to things. It's a very asymmetrical form of communication.
Face-to-face interaction is so natural for us because it's a form of human interaction that starts from birth. The most intuitive way to interact with technology, I think, is face-to-face.
Techopedia: So do you think that emotionally responsive avatars are kind of the key to addressing this “uncanny valley” effect that there is with a lot of the designs of digital people that have been put forward on the market?
Mark Sagar: I think it helps on different levels. We are emotional beings, so having an emotional interface acknowledges that, and emotion also plays a really key part in decision-making.
As a pure utility thing, say that you're building a customer service agent. I ask you a question, but you look confused: you're making an expression, raising one eyebrow and sort of lowering the other, and I can tell you're not sure what's going on.
As a human, I would immediately say, "Oh, do you need more time, or do you need an explanation?" or something like that. Now, that's coming straight from your facial reaction; I'm detecting confusion.
We're constantly reading each other's signals, because the face is the mirror of the brain: the face conveys what you're thinking about, what you're attending to, and how you feel about it. All of those things are absolutely vital in decision-making.
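As a rough illustration of the loop Sagar describes, here is a hypothetical sketch; the detect_expressions function, the expression labels, and the threshold are all invented stand-ins, not Soul Machines' API:

```python
# Hypothetical sketch of an emotionally responsive agent loop.
# detect_expressions() stands in for a real facial-analysis model;
# it is assumed to return scores in [0, 1] per expression label.

CONFUSION_THRESHOLD = 0.6  # illustrative value, not a published figure

def detect_expressions(video_frame) -> dict[str, float]:
    """Placeholder: a real system would run a vision model here."""
    return {"confusion": 0.7, "attention": 0.9, "smile": 0.1}

def respond(frame, planned_reply: str) -> str:
    signals = detect_expressions(frame)
    # Mirror the human behavior Sagar describes: if the user looks
    # confused, offer clarification instead of pressing on.
    if signals["confusion"] > CONFUSION_THRESHOLD:
        return "You look unsure. Do you need more time, or a fuller explanation?"
    return planned_reply

print(respond(frame=None, planned_reply="Your order ships Tuesday."))
```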
Techopedia: So it’s almost like you’ve got to factor into the design how to signal to human users what digital avatars are thinking?
Mark Sagar: Exactly, right. You hit the nail on the head. It's a two-way street. You're trying to interpret what the user is thinking because, ultimately, what we do when we're interacting with another person is form a theory of mind. You're thinking, "What's that person thinking about?", "What do they want to do?", and so forth.
You want that in both directions, and you want the computer to convey what it's thinking about, because what we don't want is a black box where we don't know what's going on inside; that's kind of a dystopian future. We want to convey that and make it as transparent as possible.
There was a robot called Baxter [a production line robot], and what they did was put some eyes on the robot on a little screen…and Baxter would look where it was about to move, and people would know to get out of the way of the arm that was about to move.
Because they knew the intention of the robot, they would then stand back, as that’s what people naturally do.
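The Baxter example boils down to a simple protocol: broadcast intent, pause, then act. Here is a toy sketch of that idea, with all names invented for illustration:

```python
# Toy sketch of "signal intent before acting", as in the Baxter example.
import time

class ArmController:
    def look_at(self, target: str) -> None:
        # In Baxter's case, eyes on a screen turned toward the region
        # the arm was about to move into.
        print(f"[display] eyes turn toward {target}")

    def move_arm(self, target: str) -> None:
        print(f"[motion] arm moves to {target}")

def move_with_intent(arm: ArmController, target: str, warning_s: float = 1.0):
    arm.look_at(target)   # broadcast intention first...
    time.sleep(warning_s) # ...give bystanders time to step back...
    arm.move_arm(target)  # ...then act.

move_with_intent(ArmController(), "bin 3")
```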
What About Transparency and Ethical Implications?
Techopedia: Do you think that increasing transparency over that thought process is the key to combating some of the ethical concerns around the use of digital people in certain contexts?
Mark Sagar: Yes, I think that's really important. The problem with deepfake technology is that it looks just like video, so you don't know what you're dealing with because it looks completely realistic.
So, I don’t think it should try to fool you visually. I think it should have realistic human expression, but it shouldn’t be designed to fool you. That’s one thing.
The other thing is that what it does should be meaningful. Its interactions shouldn't be a gimmick; it should actually convey information about what it is assuming, so that you know where it wants to go.
The AI Landscape of Tomorrow
Techopedia: How do you see AI and digital people evolving over the next five years or so?
Mark Sagar: With the work we're doing, I think we're seeing that a lot of R&D is focused on multimodal human interaction and dealing with all the complexities of that. You've got asynchronicities and interactions, all these different things that happen when people interact, and we want to make that as fluid as possible.
Our most advanced work is on a model called BabyX, a digital infant that we are building so that you can teach it like a baby, interact with it, and emote with it.
We're looking at the fundamentals of teaching a human, teaching in an emotional and social context, and we see that as a foundation for adult learning, because everybody starts as a baby and goes through these processes. So our development is at that level.
Techopedia: Do you think that chatbots like ChatGPT and other LLM-driven tools will converge to become an all-encompassing solution, or do you think they will still have their own pathways as separate solutions?
Mark Sagar: That's a good question. In general, I think you will see a merging of what people interact with. Behind the scenes, there'll be lots of components talking to each other.
But look at science fiction, for example: you've got a robot like C-3PO out of Star Wars. It's an autonomous robot that socially communicates, and it's embodied in a human-like form, even though it's a robot.
Or if you look at Data on Star Trek, you've got basically a humanoid type of robot that you interact with like a person, one that's autonomous and self-sufficient and comes in one package. That feels like a natural interface for us to have because we're used to it.
Note that the transcript has been edited for brevity and clarity.