Issue #26: Jensen’s GTC Keynote Part 1 - Raw and Unfiltered
Straight from the source, inside GTC

I'm at GTC this week for Panchaea.
Drowning in jetlag, running on caffeine, and jumping between back-to-back meetings. But you know I wouldn't leave you hanging.
Straight from your man on the inside at GTC, I've got the full, unedited transcript of Jensen's keynote. No edits, no fluff, just the raw words.
Blackwell, Rubin, AI factories, trillion-parameter models. Jensen’s laying the groundwork for the next phase of AI infrastructure.
And I’ve got it all for you, straight from the source.
Here’s part 1.
Enjoy.
Introduction
This is how intelligence is made. A new kind of factory. Generator of tokens. The building blocks of AI. Tokens have opened a new frontier. The first step into an extraordinary world. Where endless possibilities are born.
Tokens transform images into scientific data, charting alien atmospheres and guiding the explorers of tomorrow. They turn raw data into foresight. So next time, we'll be ready. Tokens decode the laws of physics. To get us there faster. And take us further.
Tokens see disease before it takes hold. They help summarise the language of life. And learn what makes us tick.
Tokens connect the dots. So we can protect our most noble creatures. Turn potential into plenty. And help us harvest our bounty. Tokens don't just teach robots how to move, but to bring joy. And put life within reach. Together, we take the next great leap to bravely go where no one has gone before.
And here is where it all begins.
Welcome to the stage, Nvidia founder and CEO, Jensen Huang.
What an amazing year. We wanted to do this at NVIDIA. So through the magic of artificial intelligence, we're going to bring you to NVIDIA's headquarters. What do you think? This is where we work. What an amazing year it was, and we have a lot of incredible things to talk about, and I just want you to know that I'm up here without a net. There are no scripts, there's no teleprompter, and I've got a lot of things to cover, so let's get started.
First of all, I want to thank all of the sponsors, all the amazing people who are a part of this conference. Just about every single industry is represented. Healthcare is here. Transportation. Retail. Gosh, the computer industry. Everybody in the computer industry is here. And so it's really, really terrific to see all of you, and thank you for sponsoring.
GeForce
GTC started with GeForce. It all started with GeForce. And today, I have here a GeForce 5090. And 5090, unbelievably, 25 years later, 25 years after we started working on GeForce, GeForce is sold out all over the world. This is the 5090, the Blackwell generation, and comparing it to the 4090, look how it's 30% smaller in volume. It's 30% better at dissipating energy, and incredible performance. Hard to even compare, and the reason for that is because of artificial intelligence.
GeForce brought CUDA to the world. CUDA enabled AI, and AI has now come back to revolutionize computer graphics. What you're looking at is real-time computer graphics, 100% path traced. For every pixel that's rendered, artificial intelligence predicts the other 15. Think about this for a second. For every pixel that we mathematically render, artificial intelligence inferred the other 15. And it has to do so with so much precision that the image looks right, and it's temporally accurate. Meaning that from frame to frame to frame, going forward or backwards, because it's computer graphics, it has to stay temporally stable. Incredible.
Artificial intelligence has made extraordinary progress. It has only been 10 years. Now, we've been talking about AI for a little longer than that. But AI really came into the world's consciousness about a decade ago. It started with perception AI: computer vision, speech recognition. Then, generative AI. For the last five years, we've largely focused on generative AI, teaching an AI how to translate from one modality to another: text to image, image to text, text to video, amino acids to proteins, properties to chemicals, all kinds of different ways that we can use AI to generate content.
GenAI
Generative AI fundamentally changed how computing is done. From a retrieval computing model, we now have a generative computing model. Whereas almost everything that we did in the past was about creating content in advance, storing multiple versions of it, and fetching whatever version we think is appropriate at the moment of use, now AI understands the context, understands what we're asking, understands the meaning of our request, and generates what it knows. If it needs to, it'll retrieve information, augment its understanding, and generate an answer for us. Rather than retrieving data, it now generates answers. It fundamentally changed how computing is done. Every single layer of computing has been transformed.
The last several years, the last couple, two, three years, major breakthrough happened. Fundamental advance of artificial intelligence. We call it agentic AI. Agentic AI basically means that you have an AI that has agency. They can perceive, understand the context and the circumstance. It can reason. Very importantly, it can reason about how to answer or how to solve a problem. And it can plan an action. It can plan to take action. It can use tools, because it now understands multimodality information. It can go to a website and look at the format of the website, words and videos, maybe even play a video, learns from what it learns from that website, understands it, and come back, and use that information, use that newfound knowledge to do its job. Agentic AI.
Reasoning and Agentic AI
At the foundation of agentic AI, of course, something that's very new, reasoning. And then, of course, the next wave is already happening. We're going to talk a lot about that today. Robotics, which has been enabled by physical AI. AI that understands the physical world. It understands things like friction, inertia, cause and effect, object permanence. When something goes around the corner, it doesn't mean it's disappeared from this universe. It's still there, just not seeable. And so that ability to understand a physical world, a three-dimensional world, is what's going to enable a new era of AI that we call physical AI, and it's going to enable robots.
Each one of these phases, each one of these waves, opens up new market opportunities for all of us. It brings more and new partners to GTC. As a result, GTC is now jam-packed. The only way to hold more people at GTC is we're going to have to grow San Jose. And we're working on it. We've got a lot of land to work with. We've got to grow San Jose. So that we can make GTC live. As I'm standing here, I wish all of you could see what I see. And we're in the middle of a stadium. And last year was the first year back that we did this live. And it was like a rock concert. And it was described, GTC was described as the Woodstock of AI. And this year, it's described as the Super Bowl of AI. The only difference is everybody wins at this Super Bowl. Everybody's a winner.
And so every single year, more people come because AI is able to solve more interesting problems for more industries and more companies. And this year, we're going to talk a lot about agentic AI and physical AI. At its core, what enables each wave and each phase of AI? Three fundamental questions are involved. The first is: how do you solve the data problem? And the reason why that's important is because AI is a data-driven computer science approach. It needs data to learn from. It needs digital experience to learn from, to learn knowledge and to gain digital experience. How do you solve the data problem?
The second is how you solve the training problem without a human in the loop. The reason why human in the loop is fundamentally challenging is because we only have so much time, and we would like an AI to be able to learn at superhuman rates, at super real-time rates, and to be able to learn at a scale that no humans can match. And so the second question is, how do you train the model? And the third is, how do you scale? How do you create, how do you find an algorithm whereby the more resource you provide, whatever the resource is, the smarter the AI becomes? The scaling law.
Well, this last year, this is where almost the entire world got it wrong. The computation requirement, the scaling law of AI, is more resilient and, in fact, hyper-accelerated. The amount of computation we need at this point as a result of agentic AI, as a result of reasoning, is easily 100 times more than we thought we needed this time last year. And let's reason about why that's true.
The first part is, let's just go from what the AI can do. Let me work backwards. Agentic AI, as I mentioned, at its foundation is reasoning. We now have AIs that can reason, which is fundamentally about breaking a problem down step by step. Maybe it approaches a problem in a few different ways and selects the best answer. Maybe it solves the same problem in a variety of ways and, sure enough, gets the same answer: a consistency check. Or maybe, after it's done deriving the answer, it plugs it back into the equation to confirm that, in fact, it's the right answer, instead of just blurting it out in one shot.
Remember, two years ago, when we started working with ChatGPT, a miracle as it was, many complicated questions, and many simple questions, it simply couldn't get right, and understandably so. It took a one-shot: whatever it learned by studying pre-trained data, whatever it saw from other experiences, it does a one-shot and blurts out an answer. Now, we have AIs that can reason step-by-step-by-step using technologies called chain of thought, best of N, consistency checking, a variety of different path planning, a variety of different techniques. We now have AIs that can reason, break a problem down and reason, step-by-step-by-step.
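If you want to picture what best-of-N with a consistency check looks like in code, here's a minimal sketch. The `generate(prompt, temperature)` call is a hypothetical stand-in for any model API that returns a chain-of-thought completion ending in a final answer; it is not a real library function.

```python
# Minimal sketch of best-of-N sampling with a self-consistency vote.
# `generate` is a hypothetical model call, not a real API.
from collections import Counter

def extract_answer(completion: str) -> str:
    # Assumes the model was prompted to end with a line like "Answer: <value>".
    for line in reversed(completion.splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return ""

def best_of_n(prompt: str, generate, n: int = 8) -> str:
    # Sample N independent reasoning paths at non-zero temperature.
    answers = [extract_answer(generate(prompt, temperature=0.8)) for _ in range(n)]
    answers = [a for a in answers if a]
    # Consistency check: the answer reached by the most reasoning paths wins.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```

The point of the sketch is the cost: one question now triggers N full chains of thought instead of a single short reply, which is where the token explosion Jensen describes next comes from.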
More Compute Needed
Well, you can imagine, as a result, the number of tokens we generate, and the fundamental technology of AI is still the same: generate the next token, predict the next token. It's just that the next token now makes up step one. Then the next token after that, after it generates step one, that step one is fed back into the input of the AI as it generates step two, and step three, and step four. So instead of just generating one token or one word after the next, it generates a sequence of words that represents a step of reasoning. The amount of tokens that's generated as a result is substantially higher, and I'll show you in a second. Easily 100 times higher.
Now, 100 times more, what does that mean? Well, it can generate 10 times more tokens, and you can see that happening, as I explained previously. The model is more complex. It generates ten times more tokens, and in order for us to keep the model responsive, interactive, so that we don't lose our patience waiting for it to think, we now have to compute 10 times faster. And so 10 times the tokens, 10 times faster, the amount of computation we have to do is easily 100 times more. And so you're going to see this in the rest of the presentation: the amount of computation we have to do for inference is dramatically higher than it used to be.
Well, the question then becomes, how do we teach an AI how to do what I just described? How to execute this chain of thought? Well, one method: you have to teach the AI how to reason. As I mentioned earlier, in training, there are two fundamental problems we have to solve. Where does the data come from?
Where does the data come from? And how do we not have it be limited by human in the loop? There's only so much data and so much human demonstration we can perform. And so this is the big breakthrough in the last couple of years. Reinforcement learning, verifiable results. Basically, reinforcement learning of an AI as it attacks or tries to engage, solving a problem step by step by step.
Well, we have many problems that have been solved in the history of humanity where we know the answer. We know the quadratic equation and how to solve it. We know the Pythagorean theorem, the rules of a right triangle. We know many, many rules of math and geometry and logic and science. We have puzzle games that we can give it, constrained problems. Those kinds of problems, on and on and on, we have hundreds of these problem spaces where we can generate millions of different examples and give the AI hundreds of chances to solve it step by step by step.
To reward it as it does a better and better job. So as a result, you take hundreds of different topics, millions of different examples, hundreds of different tries, each one of the tries generating tens of thousands of tokens; you put that all together, and we're talking about trillions and trillions of tokens in order to train that model. And now with reinforcement learning, we have the ability to generate an enormous amount of tokens. Synthetic data generation, basically using a robotic approach to teach an AI. The combination of these two things has put an enormous, enormous challenge of computing in front of the industry.
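To make the "verifiable results" idea concrete, here's a minimal sketch of a reinforcement-learning loop where the reward comes from plugging the model's answer back into a known equation, so no human labeler is needed. The `policy` object, with its `generate`, `parse_roots`, and `update` methods, is hypothetical; only the shape of the loop is the point.

```python
# Minimal sketch of reinforcement learning with verifiable rewards.
# `policy.generate`, `policy.parse_roots`, and `policy.update` are hypothetical.
import random

def make_quadratic_problem():
    # Construct x^2 + bx + c = 0 with known integer roots, so answers are checkable.
    r1, r2 = random.randint(-9, 9), random.randint(-9, 9)
    return {"prompt": f"Solve x^2 + ({-(r1 + r2)})x + ({r1 * r2}) = 0",
            "b": -(r1 + r2), "c": r1 * r2}

def reward(problem, roots):
    # Verifiable result: plug each proposed root back into the equation.
    ok = bool(roots) and all(abs(x * x + problem["b"] * x + problem["c"]) < 1e-6
                             for x in roots)
    return 1.0 if ok else 0.0

def train(policy, steps=1000, tries_per_problem=8):
    for _ in range(steps):
        problem = make_quadratic_problem()
        # Let the model attempt the same problem many times, step by step.
        attempts = [policy.generate(problem["prompt"]) for _ in range(tries_per_problem)]
        rewards = [reward(problem, policy.parse_roots(a)) for a in attempts]
        # Reinforce the attempts whose answers verified (e.g., a policy-gradient step).
        policy.update(attempts, rewards)
```

Hundreds of problem spaces times millions of examples times many tries, each try tens of thousands of tokens: that multiplication is where the trillions of training tokens come from.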
And you can see that the industry is responding. What I'm about to show you is Hopper shipments of the top four CSPs, the ones with public clouds: Amazon, Azure, GCP, and OCI. The top four CSPs. Not the AI companies, that's not included. Not all the startups, not included. Not enterprise, not included. A whole bunch of things not included, just those four.
Hopper vs. Blackwell
Just to give you a sense of it, compare the peak year of Hopper and the first year of Blackwell. So you can kind of see that, in fact, AI is going through an inflection point. It has become more useful, because it's smart. It can reason. It is more used. You can tell it's more used, because whenever you go to ChatGPT these days, it seems like you have to wait longer and longer and longer, which is a good thing, because there are a lot of opportunities for it to grow greater and better. And the amount of computation necessary to train those models, and to inference those models, has grown tremendously.
So in just one year, and Blackwell just started shipping, in just one year you can see the incredible growth in AI and artificial intelligence. Well, that's been reflected in computing across the board. We're now seeing, and this is, the purple is the forecast of analysts of the increase of capital expense of the world's data centers, including CSPs and enterprise and so on. The world's data centers through the end of the decade, to 2030.
I've said before that I expect data center build-out to reach a trillion dollars, and I am fairly certain we're going to reach that very soon. Two dynamics are happening at the same time. The first dynamic is that a vast majority of that growth is likely to be accelerated, meaning we've known for some time that general purpose computing is going to run its course, and we need a new computing approach. And the world is going through a platform shift.
From hand-coded software running on general-purpose computers to machine learning software running on accelerators and GPUs. This way of doing computation is, at this point, past this tipping point. And we are now seeing the inflection point happening, the inflection happening in the world's data center build-outs. So the first thing is a transition in the way we do computing.
Second is an increase in recognition that the future of software requires capital investment. Now this is a very big idea. Whereas in the past we wrote the software and we ran it on computers, in the future the computer is going to generate the tokens for the software. And so the computer has become a generator of tokens, not a retrieval of files. From retrieval-based computing, to generative-based computing, from the old way of doing things to a new way of building these infrastructures of AI factories.
We call it an AI factory because it has one job, and one job only: generating the information that we then reconstitute into music, into words, into videos, into research, into chemicals, into proteins. We reconstitute it into all kinds of information that we can use. So the world is going through a transition in not just the amount of data centers that will be built, but also how they get built.
CUDA Libraries
Well, everything in the data center won't be accelerated. Not all of it's AI. And I want to say a few words about this. You know, this slide, this slide is genuinely my favorite. And the reason for that is because for all of you who have been coming to GTC all these years, you've been listening to me talk about these libraries. This is, in fact, what GTC is all about. In fact, a long time ago, this is all we had: one library after another library after another library. You can't just accelerate software. Just as we need an AI framework in order to create AI, and we accelerate the AI frameworks, you need frameworks for physics, and biology, and multi-physics, and all kinds of different quantum physics. You need all kinds of libraries and frameworks.
We call them CUDA-X libraries, acceleration frameworks, for each one of these fields of science. And so this first one is incredible. NumPy is the number one most downloaded Python library, the most used Python library in the world, downloaded 400 million times this last year. And cuPyNumeric is a zero-change, drop-in acceleration for NumPy. So if any of you are using NumPy out there, give cuPyNumeric a try. You're going to love it.
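For anyone curious what "zero-change, drop-in" looks like in practice, a minimal sketch: swap the import and keep the rest of the NumPy code as-is. The module name follows NVIDIA's cuPyNumeric documentation, but API coverage varies, so check your installed version.

```python
# import numpy as np            # the usual CPU path
import cupynumeric as np        # drop-in, NumPy-compatible, GPU-accelerated

x = np.random.rand(4096, 4096)
print(np.linalg.norm(x @ x.T))  # same NumPy calls, now running on the GPU
```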
cuLitho is our computational lithography library. Over the course of four years, we've now taken the entire process of computational lithography and accelerated it. Computational lithography is the second factory in a fab. There's the factory that manufactures the wafers, and then there's the factory that manufactures the information to manufacture the wafers. Every industry, every company that has factories will have two factories in the future: the factory for what they build, and the factory for the mathematics, the factory for the AI. Factory for cars, factory for the AIs for the cars.
Factory for smart speakers, factory for the AI for the smart speakers. And so cuLitho is our computational lithography library. TSMC, Samsung, ASML, our partners Synopsys and Mentor: incredible support all over. I think that this is now at its tipping point. In another five years' time, every mask, every single lithography will be processed on NVIDIA CUDA. Aerial is our library for 5G, turning a GPU into a 5G radio. Why not? Signal processing is something we do incredibly well. Once we do that, we can layer on top of it AI, AI for RAN, what we call AI-RAN. The next generation of radio networks will have AI deeply inserted into it. Why is it that we're limited by the limits of information theory? Because there's only so much information spectrum we can get. Not if we had AI doing it.
cuOpt, numerical or mathematical optimization. Almost every single industry uses this: when you plan seats and flights, inventory and customers, workers and plants, drivers and riders, so on and so forth. We have multiple constraints, a whole bunch of variables, and you're optimizing for time, profit, quality of service, usage of resources, whatever it happens to be. We use it for our supply chain management. cuOpt is an incredible library. It takes what would take hours and hours and turns it into seconds. The reason why that's a big deal is that we can now explore much larger spaces. We announced that we are going to open source cuOpt. Almost everybody is using either Gurobi or IBM CPLEX
or FICO. We're working with all three of them. The industry is so excited. We're about to accelerate the living daylights out of the industry. Parabricks for gene sequencing and gene analysis. MONAI is the world's leading medical imaging library. Earth-2, multi-physics for predicting very high-resolution local weather. cuQuantum and CUDA-Q: we're going to have our first Quantum Day here at GTC. We're working with just about everybody in the ecosystem, either helping them research quantum architectures, quantum algorithms, or building a classical accelerated quantum heterogeneous architecture.
So really exciting work there. cuEquivariance, cuTensor for tensor contraction, quantum chemistry. Of course, this stack is world famous. People think that there's one piece of software called CUDA, but in fact, on top of CUDA is a whole bunch of libraries that are integrated into all different parts of the ecosystem and software and infrastructure in order to make AI possible.
I've got a new one here to announce today: cuDSS, our sparse solvers, really important for CAE. This is one of the biggest things that has happened in the last year. Working with Cadence, and Synopsys, and Ansys, and all of the systems companies, we've now made it possible for just about every important EDA and CAE library to be accelerated. What's amazing is that until recently,
NVIDIA has been using general purpose computers, running software super slowly, to design accelerated computers for everybody else. And the reason for that is because we never had that software, that body of software, optimized for CUDA until recently. And so now, our entire industry is going to get supercharged as we move to accelerated computing.
cuDF, a data frame library for structured data: we now have a drop-in acceleration for Spark and a drop-in acceleration for pandas. Incredible. And then we have Warp, a CUDA library for physics that runs in Python. We have a big announcement there; I'll save it for just a second. This is just a sample of the libraries that make accelerated computing possible. It's not just CUDA. We're so proud of CUDA. But if not for CUDA and the fact that we have such a large install base, none of these libraries would be useful for the developers that use them.
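The pandas acceleration works the same drop-in way. A minimal sketch, assuming RAPIDS cuDF is installed (the exact invocation, e.g. `python -m cudf.pandas script.py`, may vary by version):

```python
import cudf.pandas
cudf.pandas.install()       # enable the accelerator before importing pandas

import pandas as pd         # unchanged pandas code, GPU-backed where supported
df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
print(df.groupby("key")["value"].sum())
```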
You use it because, one, it's going to give you incredible speed-up. It's going to give you incredible scale-up. And two, because the install base of CUDA is everywhere. It's in every cloud. It's in every data center. It's available from every computer company in the world. It's literally everywhere. And therefore, by using one of these libraries, your amazing software can reach everybody. And so we've now reached the tipping point of accelerated computing. CUDA has made it possible. And all of you, this is what GTC is about, the ecosystem, all of you made this possible. And so we made a little short video here. Thank you very much.
To the creators, the pioneers, the builders of the future, CUDA was made for you. Since 2006, 6 million developers in over 200 countries have used CUDA and transformed computing. With over 900 CUDA-X libraries and AI models, you're accelerating science, reshaping industries, and giving machines the power to see, learn, and reason.
Now, NVIDIA Blackwell is 50,000 times faster than the first CUDA GPU. These orders of magnitude gains in speed and scale are closing the gap between simulation and real-time digital twins. And for you, this is still just the beginning. We can't wait to see what you do next.
I love what we do. I love even more what you do with it. And one of the things that most touched me, in my 33 years of doing this: one scientist said to me, "Jensen, because of your work, I can do my life's work in my lifetime." And boy, if that doesn't touch you, you've got to be a corpse.
So this is all about you guys. Thank you.
AI
All right, so we're going to talk about AI. But you know, AI started in the cloud. It started in the cloud for a good reason, because it turns out that AI needs infrastructure. It's machine learning. If the science is machine learning, then you need a machine to do the science. And so machine learning requires infrastructure, and the cloud data centers have infrastructure. They also have extraordinary computer science, extraordinary research. The perfect circumstance for AI to take off, in the cloud, with the CSPs.
But that's not where AI is limited to. AI will go everywhere. And we're going to talk about AI a lot in different ways. The cloud service providers, of course, they like our leading-edge technology. They like the fact that we have full stack, because accelerated computing, as you know, as I was explaining earlier, is not about the chip. It's not even just the chip and the libraries and the programming model; it's the chip, the programming model, and a whole bunch of software that goes on top of it. That entire stack is incredibly complex. Each one of those layers, each one of those libraries, is essentially like SQL. SQL, as you know, is called in-storage computing. It was the big revolution of computation by IBM. SQL is just one library. Just imagine it: I just showed you a whole bunch of them. And in the case of AI, there's a whole bunch more. So the stack is complicated. The CSPs also love that NVIDIA CUDA developers are CSP customers.
Because in the final analysis, they're building infrastructure for the world to use. And so the reach of all of our ecosystem is really valuable and really, really deeply appreciated. Well, now that we're going to take AI out to the rest of the world, the rest of the world has different system configurations, operating environment differences, domain-specific library differences, usage differences.
And so AI, as it translates to enterprise IT, as it translates to manufacturing, as it translates to robotics or self-driving cars, or even companies that are starting GPU clouds. There's a whole bunch of companies, maybe 20 of them, who started during this time. And what they do is just one thing: they host GPUs. They call themselves GPU clouds. And one of our great partners, CoreWeave, is in the process of going public, and we're super proud of them. And so GPU clouds, they have their own requirements. But one of the areas that I'm super excited about is the edge.
6G/AI-RAN Partnership
And today, we announced that Cisco, NVIDIA, T-Mobile, the largest telecommunications company in the world, and Cerberus ODC are going to build a full stack for radio networks here in the United States. And so the stack we're announcing today will put AI into the edge. Remember, $100 billion of the world's capital investment each year is in the radio networks and all of the data centers provisioned for communications. In the future, there is no question in my mind that's going to be accelerated computing infused with AI. AI will do a far, far better job adapting the radio signals, the massive MIMOs, to the changing environments and the traffic conditions. Of course it will. Of course we'll use reinforcement learning to do that. Of course MIMO is essentially one giant radio robot. Of course it is. And so we will, of course, provide for those capabilities. Of course AI can revolutionize communications.
You know, when I call home, I only have to say a few words, because my wife knows where I work and what the conditions are like. The conversation carries on from yesterday. She kind of remembers what I like and don't like. And oftentimes, with just a few words, you're communicating a whole bunch. The reason for that is because of context and human priors, prior knowledge. Well, combining those capabilities can revolutionize communications. Look what it's doing for video processing. Look what I just described earlier in 3D graphics. And so, of course, we're going to do the same for the edge.
So I'm super excited about the announcement that we made today: T-Mobile, Cisco, NVIDIA, and Cerberus ODC are going to build a full stack. Well, AI is going to go into every industry. That's just one.
Autonomous Vehicles & GM Partnership
One of the earliest industries that AI went into was autonomous vehicles. The moment I saw AlexNet, and we had been working on computer vision for a long time, the moment I saw AlexNet was such an inspiring moment, such an exciting moment, it caused us to decide to go all in on building self-driving cars. So we've been working on self-driving cars now for over a decade. We build technology that almost every single self-driving car company uses. It could be in the data center. For example, Tesla uses a lot of NVIDIA GPUs in the data center. It could be in the data center and the car. Waymo and Wayve use NVIDIA computers in data centers as well as in the car. It could be just in the car. Very rare, but sometimes it's just in the car. Or they use all of our software in addition.
We work with the car industry however the car industry would like us to work with them. We build all three computers: the training computer, the simulation computer, and the robotics computer, the self-driving car computer, and all the software stack that sits on top of it, models and algorithms, just as we do with all of the other industries that I've demonstrated. And so today, I'm super excited to announce that GM has selected NVIDIA to partner with them to build their future self-driving car fleet. The time for autonomous vehicles has arrived, and we're looking forward to building with GM AI in all three areas: AI for manufacturing, so they can revolutionize the way they manufacture; AI for enterprise, so they can revolutionize the way they work, design cars, and simulate cars; and then also AI for in the car.
So, AI infrastructure for GM, partnering with GM, and building with GM their AI. I'm super excited about that. One of the areas that I'm deeply proud of, but rarely gets any attention, is safety. Automotive safety. It's called HALOS. Our safety system is called HALOS. Safety requires technology from silicon to systems to system software, the algorithms, the methodologies, everything from ensuring diversity to monitoring, transparency, and explainability. All of these different philosophies have to be deeply ingrained into every single part of how you build the system and the software.
We're the first company in the world, I believe, to have every line of code safety assessed. Seven million lines of code safety assessed. Our chip, our system, our system software, and our algorithms are safety assessed by third parties that crawl through every line of code to ensure that it is designed to ensure diversity, transparency, and explainability. We have also filed over 1,000 patents. And during this GTC, and I really encourage you to do so, go spend time in the HALOS workshop so that you can see all of the different things that come together to ensure that cars of the future are going to be safe as well as autonomous. And so this is something I'm very proud of. It barely gets any attention, and so I thought I would spend the extra time this time to talk about it.
All of you have seen cars drive by themselves. The Waymo robotaxis are incredible. But we made a video to share with you some of the technology we use to solve the problems of data and training and diversity, so that we can use the magic of AI to go create AI. Let's take a look.
NVIDIA is accelerating AI development for AVs with Omniverse and Cosmos. Cosmos' prediction and reasoning capabilities support AI-first AV systems that are end-to-end trainable with new methods of development. Model distillation, closed-loop training, and synthetic data generation.
First, model distillation. Adapted as a policy model, Cosmos' driving knowledge transfers from a slower, intelligent teacher to a smaller, faster student inferenced in the car. The teacher's policy model demonstrates the optimal trajectory followed by the student model learning through iterations until it performs at nearly the same level as the teacher.
The distillation process bootstraps the policy model, but complex scenarios require further tuning. Closed-loop training enables fine-tuning of policy models. Log data is turned into 3D scenes for driving closed-loop and physics-based simulation using omniverse neural reconstruction. Variations of these scenes are created to test the model's trajectory generation capabilities. Cosmos Behavior Evaluator can then score the generated driving behavior to measure model performance.
Newly generated scenarios and their evaluation create a large dataset for close loop training, helping AVs navigate complex scenarios more robustly. Last, 3D synthetic data generation enhances AVs' adaptability to diverse environments. From log data, Omniverse builds detailed 4D driving environments by fusing maps and images, and generates a digital twin of the real world, including segmentation, to guide Cosmos by classifying each pixel. Cosmos then scales the training data by generating accurate and diverse scenarios, closing the sim-to-real gap.
Omniverse and Cosmos enable AVs to learn, adapt, and drive intelligently, advancing safer mobility.
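For a concrete picture of the teacher-student distillation step described in the video, here's a minimal PyTorch sketch. The teacher, student, and driving-log batch are placeholders rather than NVIDIA's actual models; the only point is the softened-KL loss that pulls the small in-car policy toward the large teacher's trajectory distribution.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, batch, optimizer, T=2.0):
    with torch.no_grad():
        teacher_logits = teacher(batch)      # big, slow policy model (data center)
    student_logits = student(batch)          # small, fast policy model (in the car)
    # KL divergence between temperature-softened distributions over trajectories.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```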
NVIDIA is the perfect company to do that. Gosh. That's our destiny: use AI to recreate AI. The technology that we showed you there is very similar to the technology that you're enjoying right now, taking you into the digital twin we call NVIDIA.
All right, let's talk about data centers. That's not bad, huh? Gaussian splats. Just in case you're wondering: Gaussian splats. Well, let's talk about data centers. Blackwell is in full production, and this is what it looks like. It's an incredible, incredible effort for people. For us, this is a sight of beauty. Would you agree?
Blackwell
How is this not beautiful? How is this not beautiful? Well, this is a big deal because we made a fundamental transition in computer architecture. I just want you to know that, in fact, I showed you a version of this about three years ago. It was called Grace Hopper, and the system was called Ranger. The Ranger system is maybe about half of the width of this screen. And it was the world's first NVLink 32. Three years ago, we showed Ranger working, and it was way too large. But it was exactly the right idea. We were trying to solve scale-up.
Distributed computing is about using a whole lot of different computers working together to solve a very large problem. But there's no replacement for scaling up before you scale out. Both are important, but you want to scale up first before you scale out. Well, scaling up is incredibly hard. There is no simple answer for it. You're not going to scale it up, you're not going to scale it out, the way Hadoop did: take a whole bunch of commodity computers, hook them up into a large network, and do in-storage computing. Hadoop was a revolutionary idea, as we know. It enabled hyperscale data centers to solve problems of gigantic size using off-the-shelf computers. However, the problem we're trying to solve is so complex that scaling in that way would have simply cost way too much power, way too much energy. Deep learning would have never happened. And so the thing that we had to do was scale up first. Well, this is the way we scaled up. I'm not going to lift this. This is 70 pounds.
The last-generation system architecture is called HGX. This revolutionized computing as we know it. This revolutionized artificial intelligence. This is eight GPUs. Eight GPUs. Each one of them is kind of like this. This is two GPUs, two Blackwell GPUs in one Blackwell package. And there are eight of these underneath this. And this connects into what we call NVLink 8.
This then connects to a CPU shelf like that. So there's dual CPUs, and that sits on top. And we connect it over PCI Express. And then many of these get connected with InfiniBand, which turns into what is an AI supercomputer. This is the way it was in the past. This is how we started. Well, this is as far as we scaled up before we scaled out. But we wanted to scale up even further. And I told you that Ranger took this system and scaled it up by another factor of four. And so we had NVLink 32, but the system was way too large.
And so we had to do something quite remarkable: re-engineer how NVLink worked and how scale-up worked. And so the first thing that we did was we said, listen, the NVLink switches are in the system, embedded on the motherboard. We need to disaggregate the NVLink system and take it out. So this is the NVLink system. This is an NVLink switch. This is the highest-performance switch the world's ever made. And this makes it possible for every GPU to talk to every GPU at exactly the same time at full bandwidth.
So this is the NVLink switch. We disaggregated it, we took it out, and we put it in the center of the chassis. So there are 18 of these switches in nine different switch trays, we call them. And then the switches are disaggregated. The compute is now sitting here. This is equivalent to these two things in compute. What's amazing is this is completely liquid-cooled. And by liquid cooling it, we can compress all of these compute nodes into one rack. This is the big change of the entire industry.
GB200/NVL72
All of you in the audience, I know how many of you are here, I want to thank you for making this fundamental shift from integrated NVLink to disaggregated NVLink, from air-cooled to liquid-cooled, from 60,000 components per computer or so to 600,000 components per rack, 120 kilowatts, fully liquid-cooled, and as a result, we have a one-exaFLOPS computer in one rack. Isn't it incredible?
So this is the compute node. This is the compute node. And that now fits in one of these. This now weighs 3,000 pounds, with 5,000 cables, about two miles' worth. It's just incredible electronics. 600,000 parts. I think that's like 20 cars. 20 cars' worth of parts. And it integrates into one supercomputer. Well, our goal is to do this. Our goal is to do scale-up. And this is what it now looks like. We essentially want to build this chip.
It's just that no reticle limit can do this. No process technology can do this. It's 130 trillion transistors, 20 trillion of them used for computing. So you can't reasonably build this any time soon. And so the way to solve this problem is to disaggregate it, as I've described, into this rack. And as a result, we have done the ultimate scale-up. This is the most extreme scale-up the world has ever done. The amount of computation that's possible here, the memory bandwidth.
570 terabytes per second. Everything in this machine is now in T's. Everything is a trillion. And you have an exaFLOPS, which is a million trillion floating point operations per second. Well, the reason why we wanted to do this is to solve an extreme problem. And that extreme problem, a lot of people misunderstand to be easy. In fact, it is the ultimate extreme computing problem, and it's called inference. And the reason for that is very simple: inference is token generation by a factory, and a factory is revenue and profit generation, or the lack thereof.
And so this factory has to be built with extreme efficiency, with extreme performance, because everything about this factory directly affects your quality of service, your revenues, and your profitability. Let me show you how to read this chart, because I want to come back to it a few more times. Basically, you have two axes. On the x-axis is the tokens per second. Whenever you chat, when you put a prompt into the chat, you can see what comes out as tokens. Those tokens are reformulated into words.
Inference and Tokens
You know, it's more than a token per word, okay? They'll tokenize things: T-H-E can be used for "the", it can be used for "them", it can be used for "theory", it can be used for "theatrics", it can be used for all kinds of things, okay? And so T-H-E is an example of a token. They reformulate these tokens to turn them into words. Well, we've already established that if you want your AI to be smarter, you want to generate a whole bunch of tokens. Those tokens are reasoning tokens, consistency-checking tokens. It's coming up with a whole bunch of ideas so it can select the best of those ideas. And so with those tokens, it might be second-guessing itself. It might be asking, is this the best work you can do? And so it talks to itself, just like we talk to ourselves. And so the more tokens you generate, the smarter your AI.
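A quick way to see the word/token distinction is to run a few words through an open tokenizer. A sketch using the `tiktoken` library as an illustration; the exact splits and IDs differ by model and tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["the", "them", "theory", "theatrics"]:
    ids = enc.encode(word)
    print(word, "->", ids, "->", [enc.decode([i]) for i in ids])
# Common short words are typically a single token; rarer words split into pieces.
```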
If you take too long to answer a question, the customer is not going to come back. This is no different than web search. There is a real limit to how long it can take before it comes back with a smart answer. And so you have these two dimensions that you're fighting against. You're trying to generate a whole bunch of tokens, but you're trying to do it as quickly as possible. Therefore, your token rate matters.
So you want your tokens per second for that one user to be as fast as possible. However, in computer sciences and factories, there's a fundamental tension between latency response time and throughput. And the reason is very simple. If you're in a large, high-volume business, you batch up. It's called batching. You batch up a lot of customer demand, and you manufacture a certain version of it for everybody to consume later. However, from the moment that they batched up and manufactured whatever they did, to the time that you consumed it, could take a long time.
This is no different for computer science, no different for AI factories that are generating tokens. And so you have these two fundamental tensions. On the one hand, you would like the customer's quality of service to be as good as possible. Smart AIs are super fast. On the other hand, you're trying to get your data center to produce tokens for as many people as possible so you can maximize your revenues. The perfect answer is to the upper right. Ideally, the shape of that curve is a square that you could generate very fast tokens per person up until the limits of the factory. No factory can do that. And so it's probably some curve. And your goal is to maximize the area under the curve
OK, the product of x and y. And the further you push that curve out, the better the factory you're building. Well, it turns out that in tokens per second for the whole factory, and tokens per second of response time, one of them requires an enormous amount of computation, FLOPS, and the other dimension requires an enormous amount of bandwidth and FLOPS.
And so this is a very difficult problem to solve. The good answer is that you should have lots of FLOPS and lots of memory and lots of everything. That's the best answer to start with, which is the reason why this is such a great computer. You start with the most FLOPS you can, the most memory you can, the most bandwidth you can, of course the best architecture you can, the most energy efficiency you can, and you have to have a programming model that allows you to run software, which is insanely hard, across all of this, so that you can do it.
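Here's a toy model of that tension between per-user speed and factory throughput. The numbers are illustrative placeholders, not NVL72 specs: the assumption is that decoding is bandwidth-bound at small batch sizes (the weights must be streamed for every step) and compute-bound at large ones.

```python
WEIGHT_BYTES    = 600e9    # ~600B parameters at 1 byte each (illustrative)
BANDWIDTH       = 500e12   # aggregate HBM bandwidth, bytes/s (illustrative)
FLOPS_PER_TOKEN = 1.2e12   # ~2 FLOPs per parameter per generated token
PEAK_FLOPS      = 700e15   # sustained FLOP/s of the whole rack (illustrative)

for batch in [1, 8, 64, 256, 1024]:
    step = max(WEIGHT_BYTES / BANDWIDTH,               # stream the weights once
               batch * FLOPS_PER_TOKEN / PEAK_FLOPS)   # do the math for the batch
    per_user = 1 / step        # tokens/s one user experiences
    factory  = batch / step    # tokens/s the whole factory produces
    print(f"batch={batch:5d}  per-user={per_user:7.0f} tok/s  factory={factory:11.0f} tok/s")
```

Pushing the batch size up raises factory throughput, but past the crossover it starts eating into per-user speed, which is exactly the curve being described.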
Now let's just take a look at this one demo to give you a tactile feeling of what I'm talking about. Please play.
Traditional LLMs capture foundational knowledge, while reasoning models help solve complex problems with thinking tokens. Here, a prompt asks to seat people around a wedding table while adhering to constraints like traditions, photogenic angles, and feuding family members. Traditional LLM answers quickly with under 500 tokens. It makes mistakes in seating the guests, while the reasoning model thinks with over 8,000 tokens to come up with a correct answer. It takes a pastor to keep the peace.
As all of you know, if you have a wedding party of 300 and you're trying to find the optimal seating for everyone, that's a problem that only AI can solve, or a mother-in-law can solve. And so that's one of those problems that cuOpt cannot solve. OK, so what you see here is that we gave it a problem that requires reasoning. And you saw R1 go off and reason about it. It tries all these different scenarios. And it comes back and it tests its own answer. It asks itself whether it did it right. The last-generation language model does a one-shot. So the one-shot is 439 tokens. It was fast, it was effective, but it was wrong.
Reasoning Models
So it's 439 wasted tokens. On the other hand, in order for you to reason about this problem, and that was actually a very simple problem. You just give it a few more difficult variables, and it becomes very difficult to reason through. And it took 8,000, almost 9,000 tokens. And it took a lot more computation, because the model's more complex.
OK, so that's one dimension. Before I show you some results, let me explain something else. So the answer: if you look at the Blackwell system, scaled up with NVLink 72, the first thing that we have to do is take this model. And this model is not small. In the case of R1, people think R1 is small, but it's 608 billion parameters. Next-generation models could be trillions of parameters.
And the way that you solve that problem is you take these trillions and trillions of parameters in this model, and you distribute the workload across the whole system of GPUs. You can use tensor parallel. You can take one layer of the model, and run them across multiple GPUs. You could take a slice of the pipeline and call that pipeline parallel and put that on multiple GPUs. You could take different experts and put them across different GPUs and call that expert parallel.
The combination of pipeline parallelism and tensor parallelism and expert parallelism, the number of combinations is insane. And depending on the model, depending on the workload, depending on the circumstance, how you configure that computer has to change, so that you can get the maximum throughput out of it. You also sometimes optimize for very low latency. Sometimes you try to optimize for throughput. And so you have to do some in-flight batching, a lot of different techniques for batching and aggregating work. And so the software, the operating system, for these AI factories is insanely complicated.
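To put names to those terms, here's a minimal NumPy sketch of tensor parallelism (splitting one layer's weights across GPUs), with pipeline and expert parallelism described in comments. Real systems move the shards over NVLink with collective operations rather than slicing arrays in a single process.

```python
import numpy as np

def tensor_parallel_matmul(x, W, n_gpus=4):
    # Tensor parallel: split one layer's weight matrix column-wise across GPUs;
    # each shard computes its slice, then the results are gathered back together.
    shards = np.split(W, n_gpus, axis=1)
    return np.concatenate([x @ shard for shard in shards], axis=1)

x = np.random.rand(2, 512)
W = np.random.rand(512, 2048)
assert np.allclose(tensor_parallel_matmul(x, W), x @ W)

# Pipeline parallel: GPU 0 holds layers 0-9, GPU 1 holds layers 10-19, and so on,
# with activations handed from one stage to the next.
# Expert parallel: different mixture-of-experts experts live on different GPUs,
# and each token is routed to the experts it needs.
```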
Well, one of the observations, and this is a really terrific thing about having a homogeneous architecture like NVL72, is that every single GPU can do all the things that I just described. And we observe that these reasoning models are doing a couple of phases of computing. One of the phases of computing is thinking. When you're thinking, you're not producing a lot of tokens. You're producing tokens that you're maybe consuming yourself. You're thinking. Maybe you're reading. You're digesting information. That information could be a PDF. That information could be a website. You could literally be watching a video, ingesting all of that at superhuman rates.
And you take all of that information, and you then formulate the answer, formulate a planned answer. And so that digestion of information, context processing, is very FLOPS-intensive. On the other hand, during the next phase, it's called decode. So the first part we call prefill. The next phase, decode, requires floating point operations, but it requires an enormous amount of bandwidth. And it's fairly easy to calculate. If you have a model, and it's a few trillion parameters,
well, it takes a few terabytes per second. Notice I was mentioning 576 terabytes per second. It takes terabytes per second just to pull the model in from HBM memory and to generate literally one token. And the reason it generates one token is because, remember, these large language models are predicting the next token. That's why we say the next token. It's not predicting every single token. It's predicting the next token.
Now, we have all kinds of new techniques, speculative decoding, and all kinds of new techniques for doing that faster. But in the final analysis, you're predicting the next token. And so you ingest, pull in the entire model and the context, we call it a KV cache, and then we produce one token. And then we take that one token, we put it back into our brain, and we produce the next token. Every single one, every single time we do that, we take trillions of parameters in, we produce one token. Trillions of parameters in, produce another token. Trillions of parameters in, produce another token. And notice in that demo, we produced 8,000, almost 9,000 tokens. So trillions of bytes of information have been taken into our GPUs to produce one token at a time. Which is fundamentally the reason why you want NVLink. NVLink gives us the ability to take all of those GPUs and turn them into one massive GPU. The ultimate scale-up.
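Here's the back-of-the-envelope version of that decode math, using the bandwidth figure from the talk and an R1-scale model as a rough example; the bytes-per-parameter and parameter count are assumptions, not quoted specs.

```python
params          = 600e9          # roughly R1-scale (illustrative)
bytes_per_param = 1              # FP8 weights (assumption)
model_bytes     = params * bytes_per_param     # ~600 GB of weights
bandwidth       = 576e12         # aggregate memory bandwidth from the talk, bytes/s

seconds_per_token = model_bytes / bandwidth    # weights streamed once per token
print(f"{seconds_per_token * 1e3:.2f} ms per token, "
      f"~{1 / seconds_per_token:,.0f} tokens/s ceiling for a single sequence")
# Batching many users amortizes those weight reads, which is exactly the
# throughput-versus-responsiveness tension from the factory chart.
```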
And the second thing is that now that everything is on NVLink, I can disaggregate the prefill from the decode. And I can decide I want to use more GPUs for prefill, less for decode, because I'm thinking a lot. It's agentic. I'm reading a lot of information. I'm doing deep research. Have you noticed Deep Research? Earlier I was listening to Michael, and Michael was talking about him doing research, and I do the same thing. And we go off and we write these really long research projects for our AI. And I love doing that, because I already paid for it.
And I just love making our GPUs work. And nothing gives me more joy. So I write them. And then it goes off and it does all this research. It went off to like 94 different websites, and it read all this stuff, and I'm reading all this information, and it formulates an answer and writes the report. It's incredible. OK? During that entire time, prefill is super busy.
And it's not really generating that many tokens. On the other hand, when you're chatting with the chatbot, and millions of us are doing the same thing, it is very token-generation heavy. It's very decode heavy. And so, depending on the workload, we might decide to put more GPUs into decode, or, depending on the workload, put more GPUs into prefill.