What do computer graphics, assisted intelligence, Watson, and actual humans have in common?
Recently I spent some time reading, researching, and watching the state of the art in several areas of interest, notably CGI/computer animation, VR, text to voice, and facial pattern recognition. The takeaway is that all of the pieces are in place for an idea I’ve had for several years. Be advised, many of the videos included below are around a year old, so following up directly with their makers may reveal even better implementations.
What follows is a short story about each piece, then a video showing how that piece fits today. The concepts, taken as a whole, form the foundation of an idea that could change how we get things done in real life. It’s worth the read, and it’s a doozy. Grab a cup of your favorite whatever.
As always, if you like this content, please share it so others can enjoy it too. You never know who will read these things… they might be the one to take this and actually change the world.
Input: the human voice.
Human-computer interaction (HCI) has taken many forms over the years: the keyboard and mouse, the stylus, touch, motion, and now, of course, person-to-person social chat. One of the fastest-growing segments in tech is voice recognition, with every major player racing to own the home of its users. The flow is voice to text to [internet/cloud/action] to text to voice.
Apple has Siri, Amazon has Alexa, Microsoft has Cortana, and Google has, well, “Hey Google.” All of these services empower their owners to ask questions about seemingly anything, to change the TV channel, to order more diapers, or close the garage. They’re all early from a technology standpoint, and have varying degrees of friendliness and usability. But the core is here, and they’ll only get better with time.
All you have to do is talk, in your native language, and magic (or algorithms) happens.
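If you want a picture of that plumbing, here’s a minimal sketch of the loop in Python. The speech_to_text, handle_intent, and text_to_speech functions are hypothetical placeholders standing in for whatever cloud services you’d actually wire up, not any vendor’s real API.

```python
# A minimal sketch of the voice-in, voice-out loop described above.
# All three helper functions are hypothetical placeholders.

def speech_to_text(audio_bytes: bytes) -> str:
    """Placeholder: in practice this would call a cloud speech API."""
    return "order more diapers"

def handle_intent(utterance: str) -> str:
    """Placeholder: route the recognized text to an action, return a reply."""
    if "diapers" in utterance:
        return "Done. Your usual brand will arrive Thursday."
    return "Sorry, I didn't catch that."

def text_to_speech(reply: str) -> bytes:
    """Placeholder: synthesize the reply back into audio."""
    return reply.encode("utf-8")  # stand-in for real audio

def assistant_loop(audio_bytes: bytes) -> bytes:
    # voice -> text -> [internet/cloud/action] -> text -> voice
    text = speech_to_text(audio_bytes)
    reply = handle_intent(text)
    return text_to_speech(reply)

if __name__ == "__main__":
    print(assistant_loop(b"fake-audio"))
```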
Input: plain old text.
Another interface is text to text. We know these as chatbots, and you interact with them through several input channels: Twitter, SMS, Facebook Messenger, and dozens of others. Companies like Conversable are doing a great job in this space: “I’d like a large cheese pizza delivered to my house. Use my saved payment method.”
While one is initiated by voice and the other by text, they’re just a hair apart from a technology perspective. Speech to text is nothing new; it’s been a work in progress since the earliest days of computing. Add assisted intelligence to the text output, and now we’re cooking.
Need something? Just type it, and the tech will respond.
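As a toy illustration of the text-to-text flow, here’s a sketch that pulls an intent and a couple of entities out of that pizza message and hands them to a fulfillment step. The regex rules and place_order function are illustrative stand-ins, not Conversable’s actual product.

```python
# A toy version of the text-to-text flow: parse a chat message, then fulfill it.
import re

def parse_order(message: str) -> dict:
    # Very rough entity extraction; real bots use trained language models.
    size = re.search(r"\b(small|medium|large)\b", message, re.I)
    topping = re.search(r"\b(cheese|pepperoni|veggie)\b", message, re.I)
    return {
        "intent": "order_pizza" if "pizza" in message.lower() else "unknown",
        "size": size.group(1).lower() if size else None,
        "topping": topping.group(1).lower() if topping else None,
        "use_saved_payment": "saved payment" in message.lower(),
    }

def place_order(order: dict) -> str:
    """Placeholder for the call that actually charges and dispatches the order."""
    return f"On it: one {order['size']} {order['topping']} pizza, headed your way."

msg = "I'd like a large cheese pizza delivered to my house. Use my saved payment method."
order = parse_order(msg)
if order["intent"] == "order_pizza":
    print(place_order(order))
```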
Input: human emotion.
While voice and text are great inputs, video is an even better one. Recognizing a person has become so trivial that it can be done with a simple API call. The technology can be sure it’s “you” before you interact with it. Microsoft uses this to automatically sign you in to Xbox with a Kinect device.
More than detecting who you are, computers can also detect emotion in video. This used to require a room at a university and was only done as part of a research project. Today, we can accomplish this with just about any standard webcam, or the front-facing camera on a smartphone.
We know it’s you, and how you’re feeling at this precise moment. “How can I help, Michael?”
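Here’s a rough sketch of that moment, assuming a standard webcam: face detection uses OpenCV’s bundled Haar cascade, while identify_person and classify_emotion are hypothetical stand-ins for whatever recognition and emotion APIs you’d actually call.

```python
# A rough sketch of "know it's you, and how you feel" from a standard webcam.
import cv2

def identify_person(face_img) -> str:
    """Placeholder for a face-recognition API call."""
    return "Michael"

def classify_emotion(face_img) -> str:
    """Placeholder for an emotion-detection API call."""
    return "neutral"

def greet_from_webcam() -> None:
    cap = cv2.VideoCapture(0)  # front-facing or USB webcam
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
        face = frame[y:y + h, x:x + w]
        name, mood = identify_person(face), classify_emotion(face)
        print(f"How can I help, {name}? (you look {mood})")

if __name__ == "__main__":
    greet_from_webcam()
```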
Output: the human voice.
Even the best synthesized voices still sound, well, electronic. Enter a new technology from Adobe called “vocal editing.” This tech was demoed at Adobe MAX 2016, and uses a recorded voice to allow the “photoshopping” of that voice.
It’s early, but this tech exists, and it could be the voice interaction component of this idea. This demo uses a recording just a few seconds long. Imagine what would be possible with dozens of hours of training recordings. The only input required after that is text. Text is the primary output of all of today’s Assisted Intelligence (AI) applications, like Watson for example (IBM Watson, How to build a chatbot in 6 minutes). This is the next logical step:
This technology could easily be used in real time to allow “bots” to make calls to humans, and the humans would be none the wiser. They could even use your voice, if you allow it. Bots can take voice as input (voice to text) and output text that gets sent back in the form of audio (text to voice), using this technology.
Any text input, from any source, with one remarkably consistent voice response. Maybe even your own.
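To make the idea concrete, here’s a small sketch of that “one consistent voice” layer. VoiceModel is hypothetical; it stands in for tech like the Adobe demo above, not a shipping API.

```python
# Sketch of the "one consistent voice" idea: any text, from any channel,
# rendered through a single trained voice model. VoiceModel is hypothetical.
from dataclasses import dataclass

@dataclass
class VoiceModel:
    owner: str                 # whose recordings the model was trained on
    hours_of_training: float

    def synthesize(self, text: str) -> bytes:
        """Placeholder: return audio of `text` spoken in the owner's voice."""
        return f"<audio voice={self.owner!r}>{text}</audio>".encode()

# Every input channel funnels into the same voice.
my_voice = VoiceModel(owner="you", hours_of_training=40.0)
for source, text in [("chatbot", "Your pizza is on the way."),
                     ("calendar", "Your 3 o'clock moved to 4."),
                     ("human agent", "I rebooked your flight.")]:
    audio = my_voice.synthesize(text)
    print(source, "->", len(audio), "bytes of audio in one consistent voice")
```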
Display: character-persistent, animated avatar.
When I had the original idea, the avatar the user interacted with was a simple CGI character, obviously rendered, to remove any interaction distraction. I wanted every touch point with the avatar to be as simple as possible, so you’d spend all of your time focused on the task and not distracted by its interface. This may still be the best option, but I see that gap closing quickly.
Here’s Faceshift at GDC 2015 (since acquired by Apple), but others (like Faceware, nagapi) exist in the market. Notice two completely different actors playing the same character.
Disney Research Labs has similar technology already in use.
The movie viewer never sees the actor, only the character. With the voice tech above and a character representation, we’ve removed two persistence problems: voice and character. Anyone (or anything, including AI) can provide a consistent human voice, and anyone (of any sex, build, race, or location) can power the physical representation.
Every single time you interact with the tech, the avatar looks and sounds the same – no matter who the actor is on the other side of the animation.
Display: the human presence.
We’ve seen remarkable leaps forward in an age-old (in computer years) tech called motion capture. Meatspace actors wear a variety of sensors and gadgets that allow computers to record their motion for later use. This used to appear only in the best of the best games and movies. Just about everything you see in major releases today (from a CGI standpoint) is based on motion capture, if it involves humans.
Traditional motion capture was just one part of the process, though. Scenes would be shot on a green screen, then, through (sometimes) years of perfecting, a final movie would grace theater screens or appear in an epic game release.
At SIGGRAPH 2016, Epic Games (makers of the Unreal Engine) featured a new process that amounts to “just-in-time acting.” Instead of capturing the actors and then using that motion later in scenes, Epic used a process that rendered results in real time. It’s mind-blowing: using the camera in the game engine to record a motion-captured actor, in-game.
Display: enhanced human presence.
The problem with CGI and humans is something called the uncanny valley: “a computer-generated figure or humanoid robot bearing a near-identical resemblance to a human being arouses a sense of unease or revulsion in the person viewing it.”
Explained in Texan: “Well, it might be close, but it ain’t quite right.” It may be getting close enough.
There are several ways humans protect themselves from attack. One of the simplest is recognizing fellow humans. Sometimes they may want to harm us; other times, hug us. Either way, we’re really, really good at recognizing deception.
Until now. This piece was created with LightWave, Sculptris, and Krita, and composited with DaVinci Resolve Lite, in 2014 (two years ago).
In 2015, a video was released by USC ICT Graphics Laboratory showing how advanced skin rendering techniques can be deceptively good. Another video by Disney’s Research Hub shows a remarkable technology for rendering eyes. And earlier this year, Nvidia released a new demo of Ira.
Display: enhanced facial reenactment method.
An advancement I didn’t expect to see so soon takes standard digital video footage (a newscast or a YouTube video, for example) and lets an actor, using a standard webcam, transfer his or her expressions onto the person in the target footage. It’s a clever technology.
If the technology works as well as it appears to with simple, low-resolution sources, imagine what could be done with professional actors creating 48 hours of source video. That could in turn be targeted by digital actors using a combination of the above video technologies. The interface to this technology would be a recorded human actor with facial expressions transferred from a digital actor, all rendered in real time.
Bringing it all together.
Inputs: voice, text, video, and emotion.
Processing: assisted intelligence and APIs (input to voice).
Outputs: text, human voice, and/or a photo-realistic CGI/animated character/human.
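Stacked together, the whole thing looks something like the sketch below, where each function is a hypothetical stand-in for one of the layers above: normalize any input to text, run it through the assisted intelligence layer, then render the reply as text, voice, or an animated avatar.

```python
# A sketch of the whole stack in one place; every function is a placeholder.
from typing import Literal

Modality = Literal["text", "voice", "avatar"]

def normalize_input(kind: Modality, payload) -> str:
    """Placeholder: voice and video inputs would go through speech/emotion APIs."""
    return payload if kind == "text" else "transcribed or inferred request"

def process(request: str) -> str:
    """Placeholder for the assisted intelligence layer (e.g. a Watson-style bot)."""
    return f"Here's what I did about: {request}"

def render(reply: str, out: Modality):
    """Placeholder: the same reply can come back as text, voice, or avatar video."""
    return {"text": reply,
            "voice": f"<synthesized audio of {reply!r}>".encode(),
            "avatar": f"<rendered avatar frames saying {reply!r}>".encode()}[out]

print(render(process(normalize_input("text", "cancel my 9am")), "voice"))
```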
But wait. There’s one more thing.
This is great for an AI-powered personal assistant. Marrying all of this tech together into one simple, cohesive interface would make everything else feel amateurish.
But what if we could add an actual person (or four) to the mix? Real human beings, available 24/7 (in shifts), to make a note on your account, or to call your next appointment to let them know you’ve arrived early. What if your assistant could call the local transit agency, or cancel a flight, using voice in whatever the local language happens to be?
All of the technologies mentioned above create an intentional gap between the inputs and outputs, allowing any number of “actors” in between. If a task is suitable for a bot to handle, then a bot should handle it and reply. If a human is required, the user should never know a human stepped in to take control of the interaction. The voice, text, and display remain 100% the same to protect the experience.
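In code, that handoff could look like the sketch below. bot_can_handle, bot_reply, human_reply, and speak_as_assistant are all hypothetical placeholders; the point is that both paths exit through the same voice and avatar layer, so the user can’t tell who answered.

```python
# Route each task to a bot when it can handle it, escalate to a human when it
# can't, and push both answers through the same output layer.

def bot_can_handle(task: str) -> bool:
    return task in {"set reminder", "order pizza"}

def bot_reply(task: str) -> str:
    return f"Bot handled: {task}"

def human_reply(task: str) -> str:
    return f"A human agent handled: {task}"

def speak_as_assistant(reply: str) -> bytes:
    """Same consistent voice and avatar, regardless of who wrote the reply."""
    return f"<assistant-voice>{reply}</assistant-voice>".encode()

for task in ["order pizza", "negotiate with the transit agency"]:
    reply = bot_reply(task) if bot_can_handle(task) else human_reply(task)
    print(speak_as_assistant(reply))
```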
Think about it: any language in (from either side), and your specific language and video representation out. If there were a maximum of four people who knew you more intimately than your family, but you knew you’d never, ever have to think about this problem again, would you do it?
In summary, I’ve outlined a highly personalized virtual assistant, with 100% uptime and omnipresence across every device and interface you have (including VR, but let’s save that for another time).
What you won’t know is whether you’re talking to a human or a machine.
If you liked this, please share it. Your friends will dig it too. Thank you!