What’s in an Avatar? A Fork in the Road for the Future of VR & the Metaverse
Gil Elbaz, CTO, Co-Founder, Datagen

Thirty years after Neal Stephenson coined the term in his sci-fi classic, Snow Crash, the “Metaverse” is now a reality…sort of. Though certain entities have staked claims and begun making forays into the technology, we’re still a long way from the hyper-realistic, full-sensory simulation conjured up by Stephenson. But the metaverse has taken the most important step of all in technological innovation – crossing over from a fictional concept to an engineering challenge being actively worked on. No matter what your opinion of the technology may be, it’s coming. And we can be certain that virtual reality, augmented reality, and mixed reality (XR, collectively) will serve as the central pillar around which this new realm will be built and experienced – in the same way the personal computer (and, later, the smartphone) has reshaped the way we work, communicate, and innovate.

And now that it’s squarely within the realm of reality, corporations from a wide variety of industries are beginning to make their first forays into this parallel world. From automotive manufacturers to fashion brands, businesses are treading into the murky waters of the metaverse in hopes of securing some portion of its vast untapped potential. Reports indicate that Toyota is developing virtual office environments in the metaverse to better connect its 300,000-plus global workforce. Nissan, meanwhile, is developing virtual showrooms in the metaverse so that customers can get a more immersive look at its vehicles before purchasing, from their respective corners of the physical world. On the other end of the spectrum, Dolce and Gabbana rolled out a 20-piece collection of metaverse wearables, in hopes of bringing high fashion to the new frontier.

While these early adopters give us some sense of the types of use cases we can expect to see in the metaverse, for most other questions, the road ahead is murky. What else will we do there? How will it change our day-to-day lives? How will it change the way we access and experience the Internet? And the most burning question of them all – will avatars ever get legs, or will our metaversal selves forever remain cartoon torsos floating in space?

So Many Questions, So Many More Possibilities

OK – that may not be the first question to come to most people’s minds, but it could be the most important. Not because of any inherent value or significance to 3D-rendered lower extremities, but because the answer to that question may indirectly shed light on many of the other questions listed above. To understand why that is, we first need to explore why today’s “quasi-verse” has relegated its inhabitants to representation by legless avatars.

The short answer is that AR and VR headsets don’t have cameras or sensors that can capture the full length of the user’s body in a single frame or input. The long answer, on the other hand, is much more interesting, and draws on aspects of neurobiology, human development, and computer vision (CV) to in turn reveal the true significance of our avatars’ limbless limbo.

The Object Permanence Problem

Whether you’re comfortably seated reading this article or on your feet in a crowded train car, you probably can’t see your legs right now. And until mentioning them just now, you probably weren’t thinking about them either. They were out of sight and out of mind, and yet they continued to function without issue. You could shift, walk, tap, and even jump if you wanted to without laying eyes on them. And even if you didn’t utilize them at all, you (hopefully) never had any doubts as to their continued existence below your waist.

Frivolous as they may seem, these bold feats of object permanence (i.e., the understanding that things continue to exist even when out of view) are in fact exceedingly difficult for even the most advanced AI and computer vision systems. The problem of object permanence and inferring information from heavily occluded objects (such as legs) has plagued AI and computer vision engineers for quite some time. Absent placing additional cameras or sensors around our legs, there isn’t a readily apparent solution to legless avatars without first cracking the puzzle of object permanence. And the significance of that breakthrough goes well beyond whether or not your avatar can do the electric slide.

Much of the future functionality of the Metaverse, AI, and computer vision hinges on this puzzle, including the question of fully autonomous vehicles. In fact, Elon Musk spelled out the problem of object permanence in a recent interview at TED 2022, in reference to a pedestrian becoming briefly occluded by a parked vehicle before crossing the street.

The Immersion Equation

Having full-body, photorealistic digital versions of ourselves also has major implications for verisimilitude – or, how closely the Metaverse experience resembles our experience of the real world. While hovering cartoon approximations of ourselves may be fine for hanging out in a group chat or playing video games with friends, for the Metaverse to reach its full potential, the Rayman look simply won’t cut it. In terms of both adoption and use case variety, lifelike avatars will be a prerequisite for the future viability of the metaverse. Do you think employers will be eager to hold job interviews or high-stakes business meetings in the Metaverse if face-to-face with a Sonic the Hedgehog avatar?

Beyond questions of bad taste, consider some of the high-value use cases that are either significantly improved by, or wholly dependent upon, realistic representation of the real world, such as medical use cases. The Metaverse could hold vast potential for breaking down the many barriers to accessing low-cost, 1-on-1 medical care. Whether it’s a check-up with your dermatologist or an annual physical, we will need our avatars to be both highly realistic and reflective of real-time changes and alterations to our physical bodies and selves. Even something as simple as online shopping will necessitate accurate renderings of our human forms, so that we can try on clothes in the metaverse and have them actually fit when they arrive at our front door.

As a simple rule of thumb, if our metaverse avatars are only capable of looking like video game characters five years from now, then we shouldn’t expect the Metaverse to be used for much more than gaming.

The Great Data Dilemma

So, what we can expect from the metaverse and VR in the coming years comes down to how well we can recreate the world accurately (and fully) in a virtual, AI-enabled space. And right now, the billion-dollar blocker for virtually every AI and computer vision problem is data.

Under the prevailing models of Artificial Intelligence development – neural networks and deep learning – savvy data scientists create complex algorithms, stack one atop the other, and feed the amalgamation (aka the model) mountains of data that have been specially curated and annotated in such a way that the AI/computer vision model is able to interpret and learn from them.

What form that training data takes is dictated by the type of AI application being developed. Driverless cars, for example, rely heavily on a branch of machine learning called computer vision (which is exactly what it sounds like), so their training data comes in the form of images or videos (i.e., visual data). Unfortunately, you can’t simply show a neural net reruns of Knight Rider and hope to get a fully autonomous car. Nor can you show it reruns of Soul Train and expect it to learn how to do the moonwalk.

There’s a laundry list of requirements and processes needed to ensure the image data is effective at training your model. First, it must be relevant to the application in question. Second, it must be representative of the environment in which your model will be deployed. And then it must be rigorously cleaned, annotated, and curated in such a way as to be interpretable by your model. As you might imagine, this is an incredibly tedious, laborious process fraught with endless opportunities for human inconsistency and error. In the same interview mentioned earlier, Musk tells the interviewer that under traditional processes, it would take somewhere near 10 hours for a team of humans to label just 10 seconds of video data. Try scaling that up.
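To put a rough number on "scaling that up," here is a back-of-the-envelope sketch in Python. It assumes the quoted figure means about 10 labor-hours per 10 seconds of video and that the cost grows linearly with footage; both are simplifying assumptions, and the 2,000-hour work year and the function names are illustrative rather than anything from the interview.

```python
# Assumption: roughly 10 labor-hours to label 10 seconds of video,
# i.e. about one labor-hour per second of footage, scaling linearly.
LABELING_HOURS_PER_VIDEO_SECOND = 10 / 10


def labeling_hours(video_seconds):
    """Estimated labor-hours needed to hand-label the given amount of video."""
    return video_seconds * LABELING_HOURS_PER_VIDEO_SECOND


def labeler_years(video_seconds, hours_per_work_year=2000):
    """The same estimate expressed in full-time work years (assumed 2,000 h/year)."""
    return labeling_hours(video_seconds) / hours_per_work_year


if __name__ == "__main__":
    one_hour_of_video = 60 * 60  # seconds
    print(labeling_hours(one_hour_of_video), "labor-hours")
    print(labeler_years(one_hour_of_video), "work years")
```

Under these assumptions, a single hour of footage already costs 3,600 labor-hours, or close to two full-time work years, and a real training set would contain far more than an hour.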

Synthetic Data & the Data-as-Code Revolution

Under the prevailing approach to computer vision, researchers create models using code, then take real-world images and painstakingly “translate” them into something programmable by code. It’s a process akin to translating a text written in Mandarin into Greek so that a Greek-speaking teacher can annotate it before translating it back into Mandarin for a classroom of Chinese students.

If that sounds ridiculous or confusing, it’s because it is. If you’ve seen or played a video game in the last 15 years, you must know that modern GPUs are capable of generating dazzlingly lifelike 3D imagery – imagery that isn’t made of light and matter, but code…the “native language” of computer vision models. If that isn’t an attractive shortcut, I don’t know what is.

Of course, not all computer code is a single language, but if we generate synthetic visual data (i.e., 3D images) with our machine learning models in mind, we erase the long, laborious, misguided task of hand-labeling training data. With synthetic data, our training data is just an artifact of running code; we gain full control over the content of our data, and managing it becomes just as easy as managing code. Not only is this easier and less time-consuming, but the data generated is more effective at training models and less prone to error. By generating visual data with associated ground truth, we collapse the entire training data lifecycle into a seamless exchange between training data and model.
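To make "training data as an artifact of running code" concrete, here is a deliberately tiny Python sketch. It is not Datagen's pipeline or any real rendering engine: the "scene" is just a bright rectangle on a dark, noisy background, and every name in it (generate_sample, generate_dataset, the label fields) is hypothetical. The point it illustrates is that when code places the object, the ground-truth label comes for free, and the same seed always reproduces the same image and label.

```python
import random


def generate_sample(seed, width=32, height=32):
    """Render one synthetic grayscale image plus its ground-truth bounding box.

    The code decides where the rectangle goes, so the label is known exactly
    and no human annotation step is needed.
    """
    rng = random.Random(seed)  # seeded, so the sample is reproducible

    # Choose the object's size and position.
    box_w = rng.randint(4, width // 2)
    box_h = rng.randint(4, height // 2)
    x0 = rng.randint(0, width - box_w)
    y0 = rng.randint(0, height - box_h)

    # Dark, noisy background (pixel values 0..40).
    image = [[rng.randint(0, 40) for _ in range(width)] for _ in range(height)]

    # Bright object (pixel values 200..255) drawn on top of the background.
    for y in range(y0, y0 + box_h):
        for x in range(x0, x0 + box_w):
            image[y][x] = rng.randint(200, 255)

    label = {"x": x0, "y": y0, "w": box_w, "h": box_h}
    return image, label


def generate_dataset(n, base_seed=0):
    """A dataset is fully described by (n, base_seed): data managed like code."""
    return [generate_sample(base_seed + i) for i in range(n)]


if __name__ == "__main__":
    for _, label in generate_dataset(3):
        print(label)
```

Changing the dataset here means changing a parameter and rerunning the script rather than collecting and relabeling images; a production system would swap the toy rectangle for a full 3D scene, but the relationship between code, image, and label stays the same.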

Looking Ahead

What the Metaverse looks like five years from now will be dictated largely by whether computer vision teams continue the Sisyphean task of hand-labeling masses of chaotic images or adopt a streamlined, data-as-code approach.

The good news is, the framework is already in place. The convergence of synthetic training data and a more streamlined, data-as-code approach to training is exactly what we’ve developed at Datagen, and it’s already paying dividends for a host of computer vision applications – including AR/VR and the metaverse. Though we can’t name names, based on the growing rate of adoption of data-as-code solutions, we’re comfortable making a prediction of our own: The future of the metaverse looks promising – bipedal and bright.