The next big battle is here: AI processors for data centers 

24.04.2024

Artificial intelligence (AI) is the big topic in the IT industry right now. If you are sick of hearing about AI, we have some bad news: it’s only just getting started. The world will be talking about AI for the foreseeable future.  

AI is already having a major impact on the IT industry, especially when it comes to data centers. There’s a massive hunger for hardware capable of handling intense AI workloads, and there are nowhere near enough processors to go around.

NVIDIA is the big winner in this hardware race, at least for now. A couple of years ago the company boldly bet a large part of its future on the idea that AI would be the next big thing and poured massive resources into developing and producing AI-focused processors. That bet paid off handsomely, bringing NVIDIA billions upon billions of dollars in revenue. The chips are incredibly expensive, but that hasn’t stopped companies from ordering thousands of them and waiting patiently for more stock.

Naturally, this didn’t go unnoticed by the rest of the big names. All of a sudden, we have a huge chip race unfolding right in front of us. NVIDIA, Intel, Google and Meta have all introduced their own AI chips, and most of them are aimed at exactly the same target – the data center. This new race is shaping up to be remarkably interesting, as it is already bringing big leaps in innovation and performance. Let’s take a quick look at what each of these companies is doing in the area.

NVIDIA’s goal to power a new era of computing 

At the end of March 2024, NVIDIA announced its latest hardware platform. It’s called Blackwell and comes with the “modest” claim that it will power a new era of computing. Blackwell features six technologies for accelerated computing. “For three decades we’ve pursued accelerated computing, with the goal of enabling transformative breakthroughs like deep learning and AI,” said Jensen Huang, founder and CEO of NVIDIA. “Generative AI is the defining technology of our time. Blackwell is the engine to power this new industrial revolution. Working with the most dynamic companies in the world, we will realize the promise of AI for every industry.” 

All of the top names are going to use Blackwell hardware: Amazon Web Services, Dell, Google, Meta, Microsoft, Oracle, Tesla and xAI. 

NVIDIA claims Blackwell features the world’s most powerful chip: it has 208 billion transistors and is manufactured on a custom-built TSMC 4NP process. The platform also features a second-generation Transformer Engine with micro-tensor scaling support and NVIDIA’s advanced dynamic range management algorithms, integrated into the NVIDIA TensorRT™-LLM and NeMo Megatron frameworks. As a result, Blackwell will support double the compute and model sizes with new 4-bit floating point AI inference capabilities.
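
To make that 4-bit claim a bit more tangible, here is a minimal, purely illustrative Python sketch of what quantizing weights to a 4-bit floating point grid looks like. It assumes the common E2M1 layout (1 sign, 2 exponent, 1 mantissa bit) and a naive per-tensor scale; NVIDIA’s actual Blackwell numerics, scaling granularity and TensorRT-LLM integration are more sophisticated and are not reproduced here.

```python
import numpy as np

# Values representable in an E2M1-style 4-bit float (an assumption for this sketch).
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_POS[::-1], FP4_POS])

def quantize_fp4(weights):
    """Snap weights to the nearest FP4 value after a naive per-tensor scaling."""
    scale = np.abs(weights).max() / 6.0                         # largest magnitude -> +/-6
    idx = np.abs(weights[..., None] / scale - FP4_GRID).argmin(axis=-1)
    return FP4_GRID[idx], scale

weights = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_fp4(weights)
print("max abs rounding error:", np.abs(weights - q * scale).max())
```

Halving the bits per weight is what lets the same memory and bandwidth hold roughly twice the model, which is the practical upside behind the “double the compute and model sizes” claim.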

The platform also features the fifth generation of NVLink, which delivers a groundbreaking 1.8TB/s of bidirectional throughput per GPU, ensuring seamless high-speed communication among up to 576 GPUs for the most complex LLMs.
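
A quick back-of-the-envelope calculation shows why that bandwidth figure matters for large models. The model size below is an assumption for illustration, not something from NVIDIA’s announcement, and protocol overhead and topology are ignored:

```python
# Rough illustration only: time to move the weights of a hypothetical
# 70B-parameter model stored in 16-bit precision over one NVLink port.
params = 70e9
bytes_per_param = 2                    # FP16/BF16
link_bw = 1.8e12                       # 1.8 TB/s, quoted as a bidirectional total per GPU
one_way_bw = link_bw / 2               # assume roughly half in each direction
print(params * bytes_per_param / one_way_bw)   # ~0.16 s for ~140 GB of weights
```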

Blackwell-powered GPUs include a dedicated engine for reliability, availability, and serviceability, called RAS Engine. “Additionally, the Blackwell architecture adds capabilities at the chip level to utilize AI-based preventative maintenance to run diagnostics and forecast reliability issues. This maximizes system uptime and improves resiliency for massive-scale AI deployments to run uninterrupted for weeks or even months at a time and to reduce operating costs,” says the company.  

It also promises Secure AI computing with support for new native interface encryption protocols, which are critical for privacy-sensitive industries like healthcare and financial services. Finally, there’s a dedicated decompression engine that supports the latest formats, accelerating database queries to deliver the highest performance in data analytics and data science. In the coming years, data processing, on which companies spend tens of billions of dollars annually, will be increasingly GPU-accelerated. 

The highlight of NVIDIA Blackwell is the GB200 Grace Blackwell Superchip. It brings together two B200 Tensor Core GPUs along with a Grace CPU. For the highest AI performance, GB200-powered systems can be connected with the NVIDIA Quantum-X800 InfiniBand and Spectrum™-X800 Ethernet platforms which deliver advanced networking at speeds up to 800Gb/s, says NVIDIA. 

“The GB200 is a key component of the NVIDIA GB200 NVL72, a multi-node, liquid-cooled, rack-scale system for the most compute-intensive workloads. It combines 36 Grace Blackwell Superchips, which include 72 Blackwell GPUs and 36 Grace CPUs interconnected by fifth-generation NVLink. Additionally, GB200 NVL72 includes NVIDIA BlueField®-3 data processing units to enable cloud network acceleration, composable storage, zero-trust security, and GPU compute elasticity in hyperscale AI clouds. The GB200 NVL72 provides up to a 30x performance increase compared to the same number of NVIDIA H100 Tensor Core GPUs for LLM inference workloads, and reduces cost and energy consumption by up to 25x. The platform acts as a single GPU with 1.4 exaflops of AI performance and 30TB of fast memory, and is a building block for the newest DGX SuperPOD,” says NVIDIA. Separately, up to eight B200 GPUs can be linked together to support x86-based generative AI platforms. Those are some massive numbers.
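
Those totals are easier to digest broken down per device. The divisions below simply redistribute NVIDIA’s quoted rack-level numbers; they are naive averages and ignore how the fast memory is actually split between the Grace CPUs and the GPUs:

```python
# Naive per-device breakdown of NVIDIA's quoted GB200 NVL72 figures.
superchips = 36
gpus = superchips * 2            # 72 Blackwell GPUs, as stated
rack_ai_flops = 1.4e18           # "1.4 exaflops of AI performance" (low-precision AI math)
rack_fast_memory_tb = 30         # "30TB of fast memory"

print(gpus)                                   # 72
print(rack_ai_flops / gpus / 1e15)            # ~19.4 PFLOPS of AI compute per GPU
print(rack_fast_memory_tb / superchips)       # ~0.83 TB of fast memory per superchip
```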

Intel’s answer is named Gaudi 3 

Not to be outdone, Intel answered early in April with its own AI chip – Gaudi 3. The company boldly claims it is “delivering 50% on average better inference and 40% on average better power efficiency than NVIDIA H100 – at a fraction of the cost”. It’s worth noting that the H100 is NVIDIA’s previous-generation chip, not the new B200.

“Innovation is advancing at an unprecedented pace, all enabled by silicon – and every company is quickly becoming an AI company,” said Intel CEO Pat Gelsinger. “Intel is bringing AI everywhere across the enterprise, from the PC to the data center to the edge. Our latest Gaudi, Xeon and Core Ultra platforms are delivering a cohesive set of flexible solutions tailored to meet the changing needs of our customers and partners and capitalize on the immense opportunities ahead.” 

Intel Gaudi 3 promises 4x more AI compute for BF16 and a 1.5x increase in memory bandwidth over its predecessor. “In comparison to NVIDIA H100, Intel Gaudi 3 is projected to deliver 50% faster time-to-train on average across Llama2 models with 7B and 13B parameters, and GPT-3 175B parameter model. Additionally, Intel Gaudi 3 accelerator inference throughput is projected to outperform the H100 by 50% on average and 40% for inference power-efficiency averaged across Llama 7B and 70B parameters, and Falcon 180B parameter models,” the company says. 
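
For context on the BF16 figure: bfloat16 keeps FP32’s 8-bit exponent (so the same dynamic range) but truncates the mantissa, which is why it has become the default precision on most AI training hardware. The short snippet below illustrates the trade-off; it is generic PyTorch, not code for the Intel Gaudi software stack:

```python
import torch

pi = torch.tensor(3.141592653589793, dtype=torch.float32)
print(pi.to(torch.bfloat16).item())       # 3.140625 – only ~3 decimal digits survive
print(torch.finfo(torch.bfloat16).max)    # ~3.39e38 – same dynamic range as FP32
print(torch.finfo(torch.float16).max)     # 65504.0 – FP16 overflows far sooner
```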

“We do expect it to be highly competitive” with NVIDIA’s latest chips, said Das Kamhout, vice president of Xeon software at Intel, on a call with reporters. “From our competitive pricing, our distinctive open integrated network on chip, we’re using industry-standard Ethernet. We believe it’s a strong offering.” The chips will be available to Intel’s OEM partners from the second quarter of 2024, among them Dell Technologies, Hewlett Packard Enterprise, Lenovo and Supermicro.

Google transforms itself into a hardware company 

Intel couldn’t keep the spotlight for long. Mere hours after Gaudi 3 was announced, Google showed off Axion, its first custom Arm-based CPU for the data center. It signals a major shift in the company’s stance. For decades Google has been software-first and relied on hardware from partners, the notable exceptions being the TPU accelerators it designs for its own data centers and the Tensor processors in Pixel devices, which are aimed at end users.

Axion is Google’s big leap into the deep end of the data center chip pool. The chip is based on the Arm architecture, and Google says it will deliver 30% better performance than the fastest general-purpose Arm-based virtual machines and 50% better performance than comparable x86-based virtual machines, along with 60% better energy efficiency than x86-based instances.

“Google’s announcement of the new Axion CPU marks a significant milestone in delivering custom silicon that is optimized for Google’s infrastructure, and built on our high-performance Arm Neoverse V2 platform. Decades of ecosystem investment, combined with Google’s ongoing innovation and open-source software contributions ensure the best experience for the workloads that matter most to customers running on Arm everywhere,” says Rene Haas, CEO of Arm.

Google is so confident in Axion’s capabilities that it will use the chip for its own services first. In fact, it has already begun moving services like BigTable, Spanner, BigQuery, Blobstore, Pub/Sub, Google Earth Engine, and the YouTube Ads platform onto Arm-based servers, with more in the pipeline. Unlike the other two companies, Google will not sell Axion to anyone. If you want to take advantage of the chip’s capabilities, you will have to be a Google Cloud customer – that’s the only way to use it when it becomes available later this year.

Meta enters the AI chip chat 

And just when we thought we were done with new AI data center chips for a while, Meta joined the party. The company announced a new generation of the MTIA (Meta Training and Inference Accelerator). “This new version of MTIA more than doubles the compute and memory bandwidth of our previous solution while maintaining our close tie-in to our workloads. It is designed to efficiently serve the ranking and recommendation models that provide high-quality recommendations to users. This chip’s architecture is fundamentally focused on providing the right balance of compute, memory bandwidth and memory capacity for serving ranking and recommendation models,” says Meta. 

The company adds that it has been using the MTIA chip in its own data centers. “We are already seeing the positive results of this program as it’s allowing us to dedicate and invest in more compute power for our more intensive AI workloads. The results so far show that this MTIA chip can handle both low complexity and high complexity ranking and recommendation models which are key components of Meta’s products. Because we control the whole stack, we can achieve greater efficiency compared to commercially available GPUs,” adds Meta. 

Want to use MTIA? You can’t. At least for the time being, the company is using the chip only for its own needs at its own data centers.  

Other IT giants like Amazon and Microsoft also have their own AI chips, which they announced late last year, with the same idea – using them in their own cloud infrastructure. AMD, for its part, showed off its Instinct MI300X GPU, aimed at AI servers, last year as well.

All of this is more than great news for TSMC. The company might not develop AI chips itself, but it manufactures them for many of the big names, among them Apple and NVIDIA. As a result, TSMC’s revenue for March jumped 34.3% year-on-year, its fastest pace of growth since November 2022. The AI boom has also propelled TSMC to become the world’s largest semiconductor manufacturer.
