Nvidia’s Blackwell Offers FP4, Second-Gen Transformer Engine


SAN JOSE, CALIF.—Market leader Nvidia unveiled its new generation of GPU technology, designed to accelerate training and inference of generative AI. The new technology platform is called Blackwell, after game theorist David Harold Blackwell, and will replace the previous generation, Hopper.

Ian Buck (Source: Nvidia)

“Clearly, AI has hit the point where every application in the industry can benefit by applying generative AI to enhance how we make PowerPoints, write documents, understand our data and ask questions of it,” Ian Buck, VP and general manager of Nvidia’s hyperscale and HPC computing business, told EE Times. “It’s such an incredibly valuable tool that the world can’t build up infrastructure fast enough to fulfill the promise, and make it accessible, affordable and ubiquitous.”

The B200, two reticle-sized GPU die on a new custom TSMC 4NP process node with 192 GB of HBM3e memory, will supersede the H100 as the state of the art in AI acceleration in the data center. The GB200, or “Grace Blackwell,” is the new Grace Hopper: the same Grace Arm-based CPU combined with two B200s. There is also a B100, a version of Blackwell that will mainly be used to replace Hopper systems where the same form factor is required.

B200’s two CoWoS-mounted die are connected by a 10-TB/s NV-HBI (high-bandwidth interconnect) link.


“That fabric is not just a network, the fabric of the GPU extends from every core and every memory, across the two die, into every core, which means software sees one fully coherent GPU,” Buck said. “There’s no locality, no programming differences – there is just one big GPU.”

Nvidia’s B200, the company’s first GPU superchip. (Source: Nvidia)

B200 will offer 2.5× the FLOPS of H100 at the same precision, but it also supports lower-precision formats, including FP6 and FP4. A second-gen version of the transformer engine reduces precision as far as possible during inference and training to maximize throughput.

Buck described how hardware support for dynamic scaling meant the first-gen transformer engine could dynamically adjust scale and bias while maintaining accuracy as far as possible, on a layer-by-layer basis. The transformer engine effectively “does the bookkeeping,” he said.

“For the next step [in the calculation], where do you need to move the tensor in that dynamic range to keep everything in range? If you fall out, you’re out of range,” he said. “We have to predict it…the transformer engine looks all the way back, a thousand [operations] back in history, to project where it needs to dynamically move the tensor so that the forward calculation stays within range.”
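A rough way to picture that bookkeeping is a rolling record of each tensor’s maximum absolute value (amax), used to pick a scale factor that should keep the next step’s values inside the representable range. The Python below is a simplified sketch of that idea, not Nvidia’s implementation; the FP8 E4M3 maximum of 448 and the history length of 1,024 steps are assumptions chosen for the example.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest magnitude representable in FP8 E4M3 (assumed format)

class DelayedScaler:
    """Toy per-tensor scaling: track an amax history, project the next scale."""

    def __init__(self, history_len=1024):
        self.history_len = history_len
        self.amax_history = []                    # recent max-abs values for this tensor

    def update(self, tensor):
        self.amax_history.append(float(np.max(np.abs(tensor))))
        self.amax_history = self.amax_history[-self.history_len:]

    def scale(self):
        # Project the expected range from history; guard against an empty history.
        amax = max(self.amax_history) if self.amax_history else 1.0
        return FP8_E4M3_MAX / amax                # multiply by this before casting down

    def quantize(self, tensor):
        s = self.scale()
        scaled = np.clip(tensor * s, -FP8_E4M3_MAX, FP8_E4M3_MAX)
        return scaled, 1.0 / s                    # stored values, plus the factor to undo them
```

In this toy version the whole tensor shares one scale factor, which corresponds to the layer-by-layer granularity Buck describes for the first-gen engine.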

For the Blackwell generation, the transformer engine has been upgraded to enable micro-scaling not just at the tensor level, but for elements within the tensor. Groups of “tens of elements” can now have different scaling factors, with that level of granularity supported in hardware down to FP4.

“With Blackwell, I can have a separate range for every group of elements within the tensor, and that’s how I can go below FP8 down to 4-bit representation,” Buck said. “Blackwell has hardware to do that micro-scaling…so now the transformer engine is tracking every tensor in every layer, but also every group of elements within the tensor.”
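One plausible way to picture that per-group micro-scaling, loosely modeled on block-scaled 4-bit schemes such as the OCP MX formats rather than on Blackwell’s actual hardware, is to give every small block of elements its own scale and snap the scaled values onto the FP4 (E2M1) grid. The block size of 32 below is an assumption.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1 (the sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_microscaled(x, block=32):
    """Toy micro-scaling: one scale factor per block of elements, values snapped to FP4."""
    x = np.asarray(x, dtype=np.float32).ravel()
    x = np.pad(x, (0, (-len(x)) % block))          # pad so the data splits into whole blocks
    blocks = x.reshape(-1, block)

    # Per-block scale so each block's largest magnitude lands at the top of the FP4 range.
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)

    # Snap each scaled magnitude to the nearest FP4 grid point, keeping the sign.
    scaled = np.abs(blocks) / scale
    idx = np.argmin(np.abs(scaled[..., None] - FP4_GRID), axis=-1)
    return np.sign(blocks) * FP4_GRID[idx] * scale   # dequantized values, for comparison
```

Because each block carries its own scale factor, a single outlier only coarsens the few dozen elements that share its block, which is what makes a 4-bit representation of the rest tolerable.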

Nvidia’s Grace Blackwell is a Grace CPU combined with two B200 GPUs. (Source: Nvidia)

Communication

With B200 boosting performance 2.5× over H100, where do Nvidia’s 25×-30× performance claims come from? The key is communication for large generative AI models, Buck said.

While earlier generative AI models were a single monolithic transformer, today’s largest generative AI models use a technique called mixture of experts (MoE). With MoE, layers are composed of multiple mini-layers that are more focused on particular tasks. A router model decides which of these experts to use for any given MoE layer. Models like Gemini, Mixtral and Grok are built this way.
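As a rough sketch of that structure, the Python below builds a mixture-of-experts layer in which a router scores the experts for each token and only the top-k expert mini-layers run; the sizes, expert count and top-k value are illustrative choices, not details of any of the named models.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class ToyMoELayer:
    """Minimal mixture-of-experts layer: a router picks top-k expert MLPs per token."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        self.top_k = top_k
        self.router = rng.normal(size=(d_model, n_experts)) * 0.02
        self.experts = [rng.normal(size=(d_model, d_model)) * 0.02
                        for _ in range(n_experts)]

    def __call__(self, tokens):                      # tokens: (n_tokens, d_model)
        scores = softmax(tokens @ self.router)       # (n_tokens, n_experts)
        out = np.zeros_like(tokens)
        for t, (tok, score) in enumerate(zip(tokens, scores)):
            for e in np.argsort(score)[-self.top_k:]:
                # In a real deployment each expert may sit on a different GPU,
                # so this dispatch step is where all-to-all communication happens.
                out[t] += score[e] * (tok @ self.experts[e])
        return out

print(ToyMoELayer()(rng.normal(size=(4, 64))).shape)  # (4, 64)
```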

The trouble is that these models are so large that somewhere between four and 32 individual experts are likely being run on separate GPUs. Communication between them becomes the bottleneck; all-to-all and all-reduce operations are required to combine results from different experts. While large attention and feedforward layers in monolithic transformers are often split across multiple GPUs, the problem is particularly acute for MoE models.

Hopper has eight GPUs per NVLink (short-range chip-to-chip communication) domain at 900 GB/s, but when moving from, say, eight to 16 experts, half the communication has to go over InfiniBand (used for server-to-server communication) at only 100 GB/s.

“So if your data center has Hoppers, the best you can do is half of your time is going to be spent on experts communicating, and when that’s happening, the GPUs are sitting idle; you’ve built a billion-dollar data center and at best, it’s only 50% utilized,” Buck said. “This is a problem for modern generative AI. It’s do-able, people do it, but it’s something we wanted to solve in the Blackwell generation.”
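A toy model built only on the article’s figures makes the imbalance concrete: if half of a GPU’s expert traffic stays on NVLink at 900 GB/s and half crosses InfiniBand at 100 GB/s, the slow half dominates the transfer time. The even 50/50 split and the 100-GB payload below are assumptions for illustration, not Nvidia’s utilization model.

```python
def all_to_all_time(gb_per_gpu, frac_local=0.5, nvlink_gbs=900.0, infiniband_gbs=100.0):
    """Toy estimate of per-GPU expert-exchange time when traffic straddles two links."""
    local = frac_local * gb_per_gpu / nvlink_gbs                # data staying on NVLink
    remote = (1.0 - frac_local) * gb_per_gpu / infiniband_gbs   # data crossing InfiniBand
    return local, remote

local, remote = all_to_all_time(gb_per_gpu=100.0)    # say 100 GB exchanged per GPU
print(f"NVLink portion:     {local:.2f} s")          # ~0.06 s
print(f"InfiniBand portion: {remote:.2f} s")         # ~0.50 s, roughly 9x the fast half
```

On these assumptions the InfiniBand leg sets the pace, which is the idle time Buck is describing.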

Nvidia’s NVL72 rack-scale system links 36 Grace Blackwells, for a total of 72 B200 GPUs. It’s designed for generative AI training and inference. (Source: Nvidia)

For Blackwell, Nvidia doubled NVLink speeds to 1,800 GB/s per GPU, and extended NVLink domains to 72 GPUs in the same rack. Nvidia’s NVL72 rack-scale system, also announced at GTC, has 36 Grace Blackwells, for a total of 72 B200 GPUs.

Nvidia also built a new switch chip, NVLink Switch, with 144 NVLink ports and a non-blocking switching capacity of 14.4 TB/s. There are 18 of these switches in the NVL72 rack, with an all-to-all network topology, meaning every GPU in the rack can talk to every other GPU in the rack at the full bidirectional bandwidth of 1,800 GB/s, 18× what it would have been with InfiniBand.
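Those figures hang together under one simple reading, sketched below: each GPU’s 1,800 GB/s fans out as one link to each of the 18 switches, and the resulting GPU-facing load per switch sits within the 14.4-TB/s non-blocking capacity. The even split into 18 equal links is an assumption made only for the arithmetic.

```python
gpus = 72
switches = 18
per_gpu_gbs = 1800.0            # bidirectional NVLink bandwidth per Blackwell GPU
switch_capacity_gbs = 14400.0   # non-blocking capacity per NVLink Switch chip
infiniband_gbs = 100.0          # the InfiniBand figure quoted earlier in the article

per_link_gbs = per_gpu_gbs / switches                  # assume one link per GPU per switch
per_switch_load_gbs = gpus * per_link_gbs              # GPU-facing traffic each switch carries
print(per_link_gbs)                                    # 100.0 GB/s per link
print(per_switch_load_gbs, "<=", switch_capacity_gbs)  # 7200.0 <= 14400.0
print(per_gpu_gbs / infiniband_gbs)                    # 18.0, the quoted 18x over InfiniBand
```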

“We crushed it,” Buck said.

The new switches can also do math. They support Nvidia’s scalable hierarchical aggregation and reduction protocol (SHARP) technology, which can perform certain kinds of general math in the switch. This means the same data does not have to be sent to different endpoints multiple times, and it reduces the time spent communicating.

“If we need to add tensors or something like that, we don’t even have to bother the GPUs anymore, we can do that in the network, giving it an effective bandwidth for all-reduce operations of 3,600 GB/s,” Buck said. “That’s how we get to 30 times faster.”
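One way to read the 3,600-GB/s figure, offered here as an assumption rather than Nvidia’s own accounting, is that in-switch reduction lets each GPU push its tensor out once and pull one finished result back, instead of also relaying partial sums on behalf of its peers, roughly doubling the effective all-reduce bandwidth over the 1,800-GB/s link rate.

```python
def effective_allreduce_bw(link_gbs=1800.0, in_network_reduction=True):
    """Toy model of effective all-reduce bandwidth with and without in-switch math."""
    if in_network_reduction:
        # The switch sums tensors in flight: each GPU sends once and receives once,
        # so the wire carries roughly half the bytes of a GPU-only exchange.
        return 2.0 * link_gbs
    # Without switch math, GPUs must also forward partial sums between themselves.
    return link_gbs

print(effective_allreduce_bw())        # 3600.0
print(effective_allreduce_bw(False))   # 1800.0
```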

B200 GPUs can run in a 1,000-W power envelope with air cooling, but with liquid cooling, they can run at 1,200 W. The jump to liquid cooling was not necessarily about wanting to boost the power supply to each GPU, Buck said.

“The reason for liquid cooling is that for the NVL72, we wanted to build a bigger NVLink domain,” he said. “We couldn’t build a bigger PCB, so we built it rack scale. We could do that with multiple racks, but to do fast signaling, we’d have to go to optics…that would be a lot of transceivers. It would need another 20 kW of power, and it would be six times more expensive to do that versus copper, which is a direct connection to the GPU SerDes.”

Copper’s reach is shorter than optics’, limited to around a meter, so the GPUs have to be close together in the same rack.

“In the rack, the two compute trays are sandwiched between the switches; it wouldn’t work if you did a top-of-rack NVLink switch, because the distance from the bottom to the top of the rack wouldn’t be able to run at 1,800 GB/s or 200 Gb/s SerDes; it’s too far,” Buck said. “We move the NVSwitch to the middle, we can do everything in 200 Gb/s SerDes, all in copper, six times lower cost for 72 GPUs. That’s why liquid cooling is so important: we have to do everything within a meter.”

Trillion-parameter models can now be deployed on a single rack, reducing overall cost. Buck said that versus the same performance with Hopper GPUs, Grace Blackwell can do it with 25× less power and 25× less cost.

“What that means is that trillion-parameter generative AI can be everywhere; it’s going to democratize AI,” he said. “Every company will have access to that level of AI interactivity, capability, creativity…I’m super excited.”
