As with all computing, you’ve got to get your math right to do AI well. Because deep learning is a young field, there’s still a lively debate about which types of math are needed, for both training and inferencing.

In November, we explained the differences among popular formats such as single-, double-, half-, multi- and mixed-precision math used in AI and high performance computing.

Today, the NVIDIA Ampere architecture introduces a new approach for improving training performance on the single-precision models widely used for AI. TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math, also called tensor operations, at the heart of AI and certain HPC applications.

TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. Combining TF32 with structured sparsity on the A100 enables performance gains over Volta of up to 20x.

It helps to step back for a second to see how TF32 works and where it fits. A math format is like a ruler: the number of bits in its exponent determines its range, how large an object it can measure, while its precision, how fine the lines are on the ruler, comes from the number of bits used for its mantissa, the part of a floating point number after the radix or decimal point.

A good format strikes a balance. It should use enough bits to deliver precision without using so many that it slows processing and bloats memory.

TF32 is a hybrid that strikes this balance for tensor operations, delivering performance along with range and accuracy. It uses the same 10-bit mantissa as half-precision (FP16) math, shown to have more than sufficient margin for the precision requirements of AI workloads. And it adopts the same 8-bit exponent as FP32, so it can support the same numeric range.

The combination makes TF32 a great alternative to FP32 for crunching through single-precision math, specifically the massive multiply-accumulate functions at the heart of deep learning and many HPC apps.
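Because the format is simply a narrower mantissa inside an FP32-range number, its effect is easy to emulate in software. Here is a minimal, illustrative Python sketch; the helper name truncate_to_tf32 is ours, and it truncates where real Tensor Cores round their inputs, so it approximates only the storage format, not the exact hardware arithmetic:

```python
import struct

def truncate_to_tf32(x: float) -> float:
    """Keep an FP32 value's sign, 8-bit exponent, and top 10 mantissa bits,
    zeroing the remaining 13 mantissa bits -- roughly TF32's precision."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    bits &= ~((1 << 13) - 1)                             # clear low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(truncate_to_tf32(3.141592653589793))  # 3.140625: ~3 decimal digits survive
print(truncate_to_tf32(1e38))               # ~1e38: FP32's exponent keeps the range
```

A value like 1e38 would overflow FP16 entirely, which is exactly the range problem TF32’s 8-bit exponent avoids.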
Applications using NVIDIA libraries enable users to harness the benefits of TF32 with no code change required. TF32 Tensor Cores operate on FP32 inputs and produce results in FP32; non-matrix operations continue to use FP32.

For maximum performance, the A100 also has enhanced 16-bit math capabilities. It supports both FP16 and Bfloat16 (BF16) at double the rate of TF32. Employing Automatic Mixed Precision, users can get a further 2x higher performance with just a few lines of code.

TF32 Is Demonstrating Great Results Today

Compared to FP32, TF32 shows a 6x speedup training BERT, one of the most demanding conversational AI models. Application-level results on other AI training and HPC apps that rely on matrix math will vary by workload.

To validate the accuracy of TF32, we used it to train a broad set of AI networks across a wide variety of applications, from computer vision to natural language processing to recommender systems. All of them have the same convergence-to-accuracy behavior as FP32.

That’s why NVIDIA is making TF32 the default in its cuDNN library, which accelerates key math operations for neural networks. At the same time, NVIDIA is working with the open-source communities that develop AI frameworks to enable TF32 as their default training mode on A100 GPUs, too.

In June, developers will be able to access a version of the TensorFlow framework and a version of the PyTorch framework with support for TF32 on NGC, NVIDIA’s catalog of GPU-accelerated software.

“TensorFloat-32 provides a huge out-of-the-box performance increase for AI applications for training and inference while preserving FP32 levels of accuracy,” said Kemal El Moujahid, director of Product Management for TensorFlow.

“We plan to make TensorFloat-32 supported natively in TensorFlow to enable data scientists to benefit from dramatically higher speedups in NVIDIA A100 Tensor Core GPUs without any code changes,” he added.
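To make the “default mode, no code changes” idea concrete: in the PyTorch builds that later shipped this support (flag names from PyTorch 1.7 and later, which postdate this article), TF32 appears as a pair of backend switches rather than a new tensor dtype:

```python
import torch

# On an Ampere GPU, matmuls and cuDNN convolutions can route to TF32
# Tensor Cores. Tensors stay torch.float32 end to end; only the
# internal math mode changes.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # may execute with TF32 math; c.dtype is still torch.float32
```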
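Likewise, the “few lines of code” that Automatic Mixed Precision asks for, mentioned above in connection with the further 2x from 16-bit math, look roughly like this with PyTorch’s torch.cuda.amp API; the model, data, and hyperparameters here are placeholders:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # loss scaling keeps FP16 gradients finite

data = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # eligible ops run in 16-bit precision
        loss = torch.nn.functional.mse_loss(model(data), target)
    scaler.scale(loss).backward()      # backward pass on the scaled loss
    scaler.step(optimizer)             # unscale gradients, then optimizer step
    scaler.update()                    # adapt the scale factor for next iteration
```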