Shaping Extended Reality Through AI

February 21, 2023 Artificial intelligence Partners

About the episode

The idea behind extended reality, or XR, is immersion. That can be a hard standard to meet when dealing with a visual interface. As an intern at NVIDIA, Hayden Wolff stepped up to tackle a thorny challenge, and with some assistance from natural language processing (NLP), the company’s Project Mellon is changing the way we look at the design process.

Watch this video to see how Red Hat and NVIDIA are using artificial intelligence to unlock every organization’s potential.

Transcript

Even the most knowledgeable person can't create new solutions alone, especially when it comes to blending different fields of technology and different techniques to come up with new innovative solutions. Sharing knowledge and experience is part of our culture here at Red Hat. We default to open, and having an open culture, well, that creates an environment of psychological safety. And this is important because there are no dumb questions. An individual or a team can better go out and introduce themselves to others where they can basically say, "Hey, I need a better understanding of this." And that leads to better solutioning and a better way to solve real world problems. I'm Burr Sutter and this is Code Comments, an original podcast from Red Hat. In today's episode, I talked to Hayden Wolff, an engineer from NVIDIA. Without any prior experience in visualization, Hayden started as an intern and ended up working with the XR team. XR, that means extended reality. And as just an intern, Hayden took on a major challenge that came to be known as Project Mellon, and the goal was to incorporate NLP, Natural Language Processing, into the XR project for the specific purpose of automotive design reviews. Throughout our conversation, I learned how Hayden worked across different teams at NVIDIA and tapped into the knowledge and experience of his colleagues to bring Project Mellon to life.

Well, Hayden, it's a pleasure to be talking to you today. Welcome to the podcast.

Thank you. Yeah, I'm excited.

I'm very interested in the project you're working on. I know it's called Project Mellon and it's specifically related to XR. Could you tell us about the genesis of that, how it came together and how you got involved with it?

Yeah, basically it started when I was an intern at NVIDIA and I was lucky enough to extend my internship into my senior year of university and I said to my team, "Hey, what am I going to work on for this entire year?" They were like, "Yeah, we didn't really think about that yet." Basically, I think I emailed the entire team and I was like, "If anyone has any projects that they have been dying to see fulfilled, anything cool, interesting, and they just need programming hands on it, pitch it to me, maybe we can make it happen." And a guy on my team, I was on the professional visualization tech marketing team, was like, "I have this vision of bringing voice commands to XR." And I was like, "Okay, well why is this thing useful? Why do you want voice commands in XR?" The reason why I asked that question is, I think it's important oftentimes to be skeptical about technology. How often do you use Siri or whatever other voice assistant you have on your phone?

How often do I use Siri or other voice command systems? I can tell you that I never use them, but I can tell you my wife uses them all the time. She uses her voice recognition for sending text messages, for searching things through, "Hey Google," from that perspective. There's of course, the Alexa that we have in the house. She has migrated fully to voice commands and I'm still in thumbs mode. I'm not sure what's wrong with me there. What do you think?

I don't think anything's wrong with you. I think it's really point of convenience, right? I only use voice commands on my phone for a few specific tasks. I use it sometimes when I'm driving, for example, if I need to send an urgent message to someone, if I need to call someone, I say, "Hey, call up my friend so and so." And the place that I most often use it is when I need to set reminders because it's a long enough task that it's faster for me to say, "Hey, set a reminder tomorrow at 12:00 in the afternoon to move my car so that I don't get a parking ticket." It's faster for me to tell my phone to set that reminder than open up the reminders app, go in, add a reminder for moving my car, set the time to be 12:00 in the afternoon, make sure that it's on PM and not AM. It's faster for me to just say it to my phone. The reason why I asked my teammate, Sean, "Why do you actually want to build this?" Is because I think a lot of times we can end up building technology that's sort of a gimmick and it's not actually useful to anyone. And I did not want to build something that isn't useful to someone. It's fun to build stuff, but I wanted to spend a year of my time building something that's actually going to be used and that people look at and are like, "Hey, I need this thing." He explained to me as an expert in XR that one of the fields that uses VR a lot is the automotive industry, and they use it for what are called design reviews. And he was saying that a lot of times there are two issues in design reviews in VR. First it's that you are looking at a really high fidelity model. And if you basically have this beautiful car, it's maybe full scale, you're walking around it, you're trying to look at every single possible thing that doesn't look right with it, you're trying to make sure that you have the lines look right, you really want to make sure that you can get the full model. Then suddenly you decide, "Okay, I want to change the paint color." 
Well, you have to click and open up a menu. That menu then covers up a portion or all of the vehicle, so you can no longer see the main thing that you're looking at while you're trying to do this design review. You want to see a different color because you want to see how that paint looks in a certain kind of lighting. Then suddenly you have covered up that entire car with a menu and your view is blocked. And that causes two problems. One, you can't see what you're looking at very well. And two, it really gets rid of this feeling of being in this virtual world where you're really immersed in the reality of it.

I can see a situation too. I think it'd be interesting to paint a picture for our audience. I can imagine an executive at a large automotive manufacturer who's come down probably in their polo shirt and nice slacks and they've been meeting with customers, but now they want to do the design review, and you've probably asked them to put on a headset, the VR goggles, so they can actually see this model in real time, a full three-dimensional model. And I assume they can walk around it and now you want them to interact with it. Is that what you're describing?

Yes. You basically just stated the second problem, which is that also people don't really know how to interact with graphical interfaces in virtual reality. I don't know if you've ever taken a VR headset and put it on someone else to show it off to them and be like, "Hey, this is super cool." Then they put it on and they're holding the controllers and they're just kind of standing there frozen because they don't actually know what to do, because it is a foreign thing to interact with. We're used to sitting in front of a 2D screen on our computer. We're used to using our fingers to scroll through our phone. It's really unnatural. Speech is a natural thing for us to do. We don't have to suddenly have this learning curve of using this strange foreign GUI. The second goal of the project was to create a natural interface for people who maybe don't spend that much time in VR so that they can just jump in. And if they want the car to have a different size wheel, they can just say, "Hey, swap out the wheels to a larger size," rather than, "Oh, you have to hit this tiny button on the controller." "Okay. Yeah." "You see that GUI that just came up? Is there a hamburger menu?" "What's a hamburger menu?" "It's the menu with three lines." "Yeah, I think I see something with three lines." "Click on the menu with three lines," then it opens up this other entire panel and you're like, "Wait, how do I change the wheels?" "I think if I remember correctly, it's this, you should see a little image of a tire, but it might actually be in this other menu." That's so hard to steer someone else through.

I know exactly what you mean because I've had that personal experience where I've actually filmed for a virtual reality segment before, and of course they gave you the two controllers and you have the headset on, and invariably someone said, "No, click the left hand index finger. No, that's the right hand thumb" to get to this menu or to scroll over to this other thing, and you end up not being able to navigate in that unusual virtual world. I do see how natural language processing and the concept of voice could be incredibly valuable. I'm particularly curious about how you think of how the voice command system should be kind of, let's say, articulated and architected so that it does become more natural for the user. They can more easily walk into this environment and use it without having a 30 minute training session where they simply want to come in there and do their specific task. In this case, an automotive design review.

Yeah. Actually one of the really tough technical problems in this project and more generally in speech based systems is knowing what you can and can't ask, right? Because a lot of times you'll ask a speech based system like an Alexa or your phone to do something, and then it's like, "Sorry, I cannot execute that request." There are actually two tough problems in this project that kind of relate to that, which is one, discoverability, what can I ask the system to do? Honestly, we haven't solved that very well because a lot of times to help with discoverability, you might throw up a GUI, but then that defeats our attempt to get rid of GUIs in VR. You have the discoverability problem, and then you also have the problem of, I guess it ties into discoverability. Okay, let's say it's one problem, which I'm just going to call discoverability. What can I ask here? Basically a portion of our system is answering questions. You can ask it, "Hey, what can I do here? What options are available?" Et cetera, et cetera. And it'll come back and answer a lot of your questions. And we've designed the entire dialogue system around a command based system. We've cut out for the most part and reduced the scope of the problem so that you're not spending engineering time trying to build a generalized dialogue solution where you can chit chat with it and you can ask it any random question about anything you want in the world. Instead, we've focused it on a more narrow scope. We've basically said this is a system that is a command and control based dialogue system to control a visual application. And by reducing the scope, we've been able to basically apply language techniques that both improve the system for the end user and then also improve the system for developers. Because if you try to do it all, at least in dialogue systems, it just still isn't there. The technology isn't there yet.
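The narrow command-and-control scope described here can be sketched in a few lines of Python. Everything below (the `CommandRegistry` class, the command names, the phrases treated as help requests) is hypothetical and is not Project Mellon's code. It only illustrates how a fixed command set lets a system answer "What can I do here?" from the same table it uses to execute commands.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CommandRegistry:
    """A fixed set of commands; hypothetical, for illustration only."""

    commands: Dict[str, Callable[[], str]] = field(default_factory=dict)

    def register(self, name: str, handler: Callable[[], str]) -> None:
        self.commands[name] = handler

    def available(self) -> List[str]:
        # Discoverability: the list comes from the same table used to execute.
        return sorted(self.commands)

    def handle(self, utterance: str) -> str:
        # Normalize case and trailing punctuation before exact lookup.
        text = utterance.lower().strip(" ?!.")
        if text in {"what can i do here", "what options are available"}:
            return "You can: " + ", ".join(self.available())
        handler = self.commands.get(text)
        if handler is None:
            return "Sorry, I cannot execute that request."
        return handler()


registry = CommandRegistry()
registry.register("open the hatch", lambda: "hatch opened")
registry.register("close the hatch", lambda: "hatch closed")
```

A real system would match utterances far more loosely than this exact-string lookup. The point of the sketch is only that a closed command set makes the help question answerable without a separate menu.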

I can tell you, I really appreciate what you're talking about right now. And remember earlier when I said, "Well, I primarily still use my thumbs. I don't use all the voice activation capabilities on the phone or in the car or even at the home"? I can tell you exactly what my problem is. As a person who's been programming computers for 30 plus years, I like having the right commands to issue to the Alexa, Siri, Hey Google, whatever it might be. And I know those commands are unique and not well known to me. That actually freezes me as a user. I won't go forward with the command because I don't quite know what the command ought to be. And my brain is so pre-wired to knowing exactly what the syntax should be. Because I'm so used to a Bash shell script, right? Putting together an Ansible playbook, putting together a Java application or JavaScript application. That's just how my brain is wired.

Yeah, I was going to say that's one problem of this project that we still haven't really solved because we're both trying to reduce the number of GUIs so that it's a more fluid experience, but also we're trying to help the user figure out what they can ask. And I would argue that just having a response that is purely voice-based every time you ask a question probably isn't ideal. It's pretty slow. It cannot get all of the detail across. That's actually a whole area of the project that is still being worked on and that has a lot of work to be done is sort of finding this in between with discoverability and also we don't want to ruin the visual experience. It's trade offs, right?

Absolutely. And I love how you laid out the challenges of this particular project, and now I'm particularly curious about how you made this a reality. How did NVIDIA's culture impact making this work for you? What was your process? Tell us about how that really came to be.

Yeah, so when the project started, I immediately was like, "Okay, I need to train a model of some sort so that we can do this thing that's apparently called AI that is big or whatever right now and that we're supposed to be really good at." And honestly, at first it was really hard. My first step is collecting data. I need to sit there and fill in spreadsheets or I need to use some manual things such as MTurk to gather data from people. And I remember sitting there filling in these spreadsheets full of different ways to request changing the color of the car with these different colors, different ways to swap out the rims of the vehicle. And I think while doing that, I was asking people a lot of questions and because I had hit a blocker somewhere, I'd be like, "This thing in the Jupyter Notebook isn't running correctly" or, "I'm getting this weird score." And I think at some point someone asked me, "Hey, what are you working on? Why do you keep asking all of these questions?" I explained what I was trying to do. The guy that asked me that question happened to be a researcher in NLP, Natural Language Processing. And he took a big interest in this project and basically decided to almost ... I mean, now he has basically built up an entire NLP solution unique to the needs of this project. And that's when we were really able to make progress because he was working on the cutting edge of NLP. Then I was able to take my programming skills and take what he was producing in terms of NLP output of what the system should do. Then I was passing that to the visual environment. The very first visual environment that we worked in was Autodesk VRED, which is a high fidelity 3D design application from Autodesk for basically doing automotive reviews for both designers and engineers. 
The very first POC that we built was with some of his NLP solutions, which I can talk about a little bit more in depth later if you'd like, some of his NLP solutions and then literally passing commands to Autodesk VRED via their web API, which was a dirty solution, but it was the fastest way to get to a working POC. I remember taking information from the 3D application that was spat out to a web console. It was displayed in HTML, I was parsing the HTML. I was literally going through the HTML tags to get back information, which was a hundred percent not the right way to do it. You should not be parsing HTML to try to get information about the visual environment, but it allowed us to show capabilities of the application that we wanted to be core to its functionality. For example, you should be able to ask, "What am I looking at right now?" without the dialogue system knowing what you have previously asked, right? Because that should be knowledge that the environment has, that the 3D environment has. It might be embedded in the scene, but maybe the user doesn't know. I was literally going through and parsing HTML because that pipeline already existed. It was about getting a quick and dirty version. As soon as we recorded a demo video, we sent it to Autodesk and we were like, "What do you think about this?" They got back to us and their response was, "Oh my God, this would be so useful. Let's work more on this." Even though the implementation was embarrassing, at least on the execution side, the side that I worked on, it was embarrassing. It was gross. It was maybe probably not the right way to do it. It didn't stop us from collecting early feedback, from knowing, are we on the right path? Is this something that customers really do need? The answer was yes.
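For readers curious what that kind of quick-and-dirty scraping looks like, here is a minimal sketch using only Python's standard `html.parser` module. The HTML fragment, the key names, and the `scrape_value` helper are all made up for illustration. They are not VRED's actual console markup or API, and as noted above, this is a prototype shortcut rather than a recommended integration.

```python
from html.parser import HTMLParser
from typing import List, Optional


class CellTextParser(HTMLParser):
    """Collects the text inside every <td> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_cell = False
        self.cells: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def scrape_value(html: str, key: str) -> Optional[str]:
    """Read cells as alternating key/value pairs and look up one key."""
    parser = CellTextParser()
    parser.feed(html)
    cells = [c.strip() for c in parser.cells]
    pairs = dict(zip(cells[0::2], cells[1::2]))
    return pairs.get(key)


# Hypothetical console output; an assumption, not real application markup.
SAMPLE = (
    "<table>"
    "<tr><td>selected</td><td>front_left_wheel</td></tr>"
    "<tr><td>paint</td><td>blue</td></tr>"
    "</table>"
)
```

With this sample, `scrape_value(SAMPLE, "selected")` returns the name of the selected object, which is the kind of scene knowledge a question like "What am I looking at right now?" needs. The approach breaks as soon as the page layout changes, which is why it only suits a proof of concept.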

I look at it as the lean startup approach with that validated learning. Do the, like you said, MVP, right? Minimal viable product, or in this case, minimal viable prototype. And you were able to hack something together, put it in front of real users, put it in front of customers, and get real feedback on it. That is absolutely fantastic.

Yeah, because honestly, if the response was, "Nah, we don't need this," we would've stopped working on it.

Now I got to tell you, as a geek here, I got to drill down on some of this because I'm very interested in what you said about screen scraping the HTML. Let's kind of walk the user through or the audience here through what you might have created. It sounds like you were working on a Jupyter Notebook, and I'm assuming you're interacting with a model created by your AI researcher there. And I'm guessing that as you provide voice commands into it, it actually will convert some of those things into text that you can then pass through a web API to Autodesk. Can you just walk us through how those link points came together and then how the HTTP commands went back and forth? I'm curious about that.

Yeah, so let's think of it from an end-to-end workflow perspective because I think that's the easiest way to sort of visualize it. Let's say for example, I start with a command and I say, "Paint the car in blue." Me as an end user, I have told the system to execute this request. I'm going to actually walk us through all the components. We start with a voice-based command, and luckily we already had some pieces that are part of NVIDIA that we could leverage. We first used NVIDIA Riva, which is basically a highly customizable SDK to do ASR, automatic speech recognition, TTS, text to speech, and then you can also do some NLP. We used Riva to basically take in an audio stream and then convert that audio stream to text. That's known as automatic speech recognition. We have now moved from having our command paint the car in blue, I think is what I said, and we now have it as text. Okay, so we now have that as text input rather than audio input. Our next step is we need to take that text input and we need to figure out what is this text input actually saying? This is where the NLP side of things comes in, and this is where a lot of the real innovation in this project was. We use what are called zero shot models, which is a really cool way of basically making it so that you don't have to train a model on any specific data to your use case. Let me explain how this works. To recognize an intent, which we can think of as an action that should occur. In this case, in our sentence, paint the car in blue, the action would be painting the car. We need to distill that the action that we want to occur is painting the car. The way that we do this is that we use a model that has been generically trained. This is a very large language model, generically trained on a bunch of examples of, "Does X imply Y?" Let me distill what this means more because I know I'm walking through a lot here. 
Imagine that there is a super large data set and the model has been trained on an example such as, "At the other end of Pennsylvania Avenue, people began to line up for a White House tour." That is our X, in "Does X imply Y?" Our Y is, "People formed a line at the end of Pennsylvania Avenue." It is reasonable that basically these two sentences are connected to one another. Basically the first part of what I said implies Y. If I had another example, where "flip the hamburger" was our X, and then our Y was "switch the lights on," those two are not related at all, right? The correlation between those would be very low, like X does in no way imply ... Flipping a hamburger does not at all imply turning the lights on. Basically we have this giant model which is trained on these generic "Does X imply Y?" examples, and we use that to actually extract an intent. Then once we feed the model, for example, "Paint the car in blue," if the options that the system has to choose from are, okay, the actions you can do are you could paint the car, you can change the rims, you can open the hatch, you can close the hatch, you can turn the interior lights on, you can change the trim of the leather, the closest match across all those possibilities to painting the car in blue is going to be the action, paint the car. Did I include that in my examples of possibilities it can choose from? I hope I did. Anyways, it should also be able to choose from painting the car. Our closest match is painting the car. And because this large model has been generally trained on a bunch of examples, we don't have to write a bunch of training data that is different ways to say painting the car. We don't have to say, "Please paint the car for me. I want you to paint the car for me." Then so many variations of that. Instead then if I said, "I want to see a paint that's blue," the closest match across all those examples that I gave is still paint the car, the action, paint the car, right? 
We no longer have to spend all this time collecting all of this data. Instead, we can just basically describe what are our desired actions that we want to occur, and then the system chooses the closest match. I know I went very deep in there, but I know that there are nerds that listen to this, so.
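The matching step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: `match_intent` and the `ACTIONS` list are hypothetical names, and `word_overlap_score` is a crude stand-in for the large "Does X imply Y?" model. A real system would plug an entailment model's score into the same slot.

```python
from typing import Callable, List, Tuple

# An entailment scorer takes (premise, hypothesis) and returns a score in [0, 1].
Scorer = Callable[[str, str], float]


def word_overlap_score(premise: str, hypothesis: str) -> float:
    """STAND-IN for a real entailment model: Jaccard overlap of the words."""
    a = set(premise.lower().split())
    b = set(hypothesis.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_intent(utterance: str, actions: List[str], score: Scorer) -> Tuple[str, float]:
    """Return the action description that the utterance most strongly implies."""
    scored = [(action, score(utterance, action)) for action in actions]
    return max(scored, key=lambda pair: pair[1])


# The candidate actions listed in the conversation, written as plain descriptions.
ACTIONS = [
    "paint the car",
    "change the rims",
    "open the hatch",
    "close the hatch",
    "turn the interior lights on",
    "change the trim of the leather",
]
```

Calling `match_intent("Paint the car in blue", ACTIONS, word_overlap_score)` picks "paint the car" because it shares the most words with the utterance. The stand-in scorer only sees shared words, so it would miss paraphrases that a trained entailment model is meant to handle. What carries over is the design: the developer writes action descriptions, not training examples.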

Well, and it's fascinating. I can definitely see the decision tree that's been made and the pattern matching logic that's trying to occur to basically quickly get to the action the computer should take based on what it heard via voice and what was converted to text. That is a very powerful environment, but it seems like you guys figured it out very rapidly.

I think it was because I was working with a variety of NLP researchers, and even though we could out of the box do what's called an intent and slot recognition model where you have to fill in all of that training data, they were up to date on what the newest research was, and they were like, "This is the perfect project to try out this new research of using these zero shot models. Let's do it." And it actually proved to be a really good solution for our use case because then someone else can build the system for their own application without needing to collect all of this data.

That's a very important point. Normally in an AI world, we talk about the dual step process, at least that's how I think of it. There's all this collection of data, cleaning of data, organizing of data, and maybe that's a data engineer that's aggregating the data from numerous backend systems. Then there's a data scientist who's basically crafting the model based on that dataset, iterating, iterating, iterating, and of course that model, that modeling effort then has to get kicked over to the inference side where an application programmer might wrap it with a Python application, Java application or a Node.js or something like that application, and then that's what would go on production. In this case, it sounds like the modeling effort was much easier based on the process you went through.

Yeah. I mean, luckily we had access to some of these very large models that are generalized in the way that I describe, and because we were able to leverage and use those, it made it so that we really ... I mean, you don't have to give it training data because the "training data" is generalized to this concept of, "Does X imply Y?" And that is something that we can then use to basically do intent matching. It was an awesome solution because we didn't have to spend time then collecting training data.

And I love the fact that you approached this without the calcified habits that someone like myself has. There are certain things I always do a certain way, and that of course keeps me from using some of these new things, like, let's say, voice commands as an example. But in your case, you just basically went right into it as if there were no barriers and you were able to basically tackle these huge problems, a massive problem with ease is what it sounds like to me, right? It was easy.

Not with ease, not with ease.

It was certainly nice that the NVIDIA world that you live in, the NVIDIA culture was able to also, you mentioned multiple parties got involved with this, multiple people who had conversational AI expertise, people with great expertise in Natural Language Processing, the text to speech element, those folks really rallied around you and the project that you had.

Yeah, I think the help was what made this project easy, if anything, because otherwise I probably would've just been stuck in Jupyter Notebooks and also I would've been spending my time recollecting data when realistically there was a better way to do it, right? There was a way to also get this out to customers who have different data and different commands that they want to run, and within a day they could ... I mean, within an hour, honestly, they could fill out a little spreadsheet of what commands they wanted to execute, and then suddenly if they had a way to hook it up to their visual application and then send that output from the NLP system over, then they were all good to go. They could suddenly execute voice commands in their visual environment.
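As a rough sketch of that "fill out a little spreadsheet, then hook it up" workflow, the snippet below loads command descriptions from CSV text and routes a matched description to a handler. The column names, intent names, and handlers are assumptions made up for this example. The real spreadsheet format and the hook into a visual application are not described in the conversation.

```python
import csv
import io
from typing import Callable, Dict, List

# Hypothetical spreadsheet contents: one row per command.
SPREADSHEET = """intent,description
paint_car,paint the car
swap_rims,change the rims
open_hatch,open the hatch
"""


def load_commands(csv_text: str) -> Dict[str, str]:
    """Map each natural-language description to its intent name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["description"]: row["intent"] for row in reader}


def dispatch(
    description: str,
    commands: Dict[str, str],
    handlers: Dict[str, Callable[[], None]],
) -> bool:
    """Run the handler for a matched description; report whether one ran."""
    intent = commands.get(description)
    handler = handlers.get(intent) if intent else None
    if handler is None:
        return False
    handler()
    return True


# Stand-in for a visual application: each handler just records its intent.
log: List[str] = []
COMMANDS = load_commands(SPREADSHEET)
HANDLERS = {intent: (lambda name=intent: log.append(name)) for intent in COMMANDS.values()}
```

Here `dispatch("change the rims", COMMANDS, HANDLERS)` appends `swap_rims` to `log`. In a real integration the handlers would call into the 3D application, and the description would come from the NLP system's closest match rather than an exact string.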

I do love this project, it's super cool, and I'm very curious about how you might use Project Mellon going forward. What do you think some of the future use cases are? How can you take what you learned with the automotive design review and how have you expanded or where else do you think this project might go?

One big area that we've been working on now is bringing voice to more domains, right? Because we talked about the automotive domain. At NVIDIA, we've also had a lot of interest from the AEC team, which is architecture, engineering, and construction, for their design reviews in NVIDIA Omniverse, which is basically a real time collaborative 3D platform for building digital twins. And basically if you could then in a design review using Omniverse View, which is basically this full fidelity design review application in Omniverse, if you could then say, "Hey, pull out the measure tool, set the sun to be at 5:00 PM," you can view what this house looks like with the sun at certain angles at certain times. "Set the season to be spring." Again, no complex GUIs. I think this area of design review is probably an area that we are going to continue to focus in on for two reasons. One, because you really want to be able to have that seamless experience viewing something with all these different properties without having to deal with all these GUIs. And two, there's also the component of our system being inherently good for command based, command and control based use cases. We don't want to necessarily try to tackle a problem of making a chit chat application, like an avatar bot, right? Because you don't really want to spend that much time telling this little bot, "Oh, turn your head a little bit." It's just not that useful, right? We want to continue to focus on applications that really need that command and control structure that our dialogue system provides.

Well, and I do love those enterprise use cases that you're describing, the concept of walking through a virtual building or seeing a virtual house. There's so much applicability in our world where you might want to say, "Hey, customer, come look at the house we're building for you" or, "Hey group of customers, look at the building we're designing for you" or, "Look at the roadways that we have to lay in this particular city," if we're talking to a city planner. "This is what we're thinking of for the overpass," as an example. The applications are all over the place, and I think they're incredibly powerful, so I really do appreciate that, Hayden, I love what you shared with us today. Again, I could probably spend the next three hours talking to you about some of these things because it is so fascinating what you described there.

Yeah, I feel like we just touched the surface.

That is always the case in these conversations. Where would someone who's listening to our show right now kind of dive deeper into this? I'm interested in learning more about that command and control, that concept of voice activation, that concept of integrating with different technologies.

I think a good place to start would be watching our GTC session on Project Mellon. If you go to NVIDIA GTC, this was a session from this past fall, and so you can now watch that session on-demand. You can see basically how the application works. You can see some cool live demos, you can hear more, and you could get in touch with us if this is something that you want to integrate as a developer or as a designer. Also, there are ways that you can get your hands dirty with Riva using NVIDIA LaunchPad if that's something that you're interested in. They're also available as containers on NGC, so those could be some good places to get started. Also, like playing around with Omniverse and building stuff with the SDK, which we had to do for this project to integrate it into Omniverse. There are a lot of moving pieces here. CloudXR is another SDK from NVIDIA that we used, and so these things are all available for people to try out.

Well, Hayden, again, thank you so much for your time today. It was a great story. I'm absolutely fascinated by it. I look forward to having my own opportunity to dive in and try some of those SDKs myself at some point.

Yeah. Awesome.

You can read more about Red Hat's partnership with NVIDIA at RedHat.com/codecommentspodcast, or visit RedHat.com/NVIDIA to find out more about our alliance and AI solutions. Many thanks to Hayden Wolff for being our guest, and thank you all for joining us today. Our audio team includes Leigh Day, Stephanie Wonderlick, Mike Esser, Johan Philippine, Kim Huang, Nick Burns, Aaron Williamson, Karen King, Jared Oates, Rachel Ertel, Devin Pope, Matias Faundez, Mike Compton, Ocean Matthews, Alex Traboulsi, and Victoria Lawton. I'm Burr Sutter, and this has been Code Comments, an original podcast from Red Hat.

About the show

Code Comments

On Code Comments, we speak with experienced professionals on the challenges along the way from whiteboard to deployment.
