The future of AI chips may not be GPUs

In today's artificial intelligence (AI) computing architectures, a collaborative model in which the Central Processing Unit (CPU) works alongside acceleration chips has become the typical deployment scheme. The CPU provides general-purpose computing power, while the acceleration chip boosts computational performance and helps algorithms run efficiently. By technical path, common AI acceleration chips fall into three major categories: Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), and Application-Specific Integrated Circuits (ASICs).

In this competition, GPUs have become the mainstream AI chip thanks to their distinctive advantages. So how did the GPU stand out among so many options? And looking ahead, will the GPU remain the only answer for AI?

01

How does the GPU win in the present?

There is a close relationship between AI and GPUs.

Strong Parallel Computing Capability

Large AI models are large-scale deep learning models that must process massive amounts of data and perform complex computations. The core advantage of GPUs lies in their strong parallel computing capability. Compared with traditional CPUs, GPUs can execute many operations simultaneously, making them particularly well suited to large-scale datasets and complex computational tasks. In fields that demand heavy parallel computation, such as deep learning, GPUs offer an unmatched advantage.
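To make the parallelism argument concrete, here is a minimal sketch, assuming PyTorch and a CUDA-capable GPU are available, that times the same large matrix multiplication on a CPU and on a GPU. The matrix size and timing approach are illustrative choices only.

```python
# Minimal sketch: the same matrix multiplication on CPU vs. GPU.
# Assumes PyTorch is installed and a CUDA-capable GPU is present.
import time
import torch

def time_matmul(device: str, n: int = 4096) -> float:
    """Time one n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()      # finish setup before timing
    start = time.perf_counter()
    _ = a @ b                         # thousands of independent dot products
    if device == "cuda":
        torch.cuda.synchronize()      # wait for the asynchronous GPU kernel
    return time.perf_counter() - start

print(f"CPU: {time_matmul('cpu'):.3f} s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.3f} s")
```

Because each output element of the product can be computed independently, the GPU's thousands of cores can work on them at once, which is exactly the workload profile of deep learning.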


Comprehensive Ecosystem

To help developers fully exploit GPU computing power, major manufacturers provide a wealth of software libraries, frameworks, and tools. NVIDIA's CUDA platform, for instance, gives developers a rich set of tools and libraries that make developing and deploying AI applications relatively easy. This makes GPUs especially competitive in scenarios that require rapid iteration and adaptation to new algorithms.
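As one illustration of that ecosystem, the sketch below uses Numba's CUDA support, one of many tools layered on top of the CUDA platform, to write and launch a custom GPU kernel from Python. The kernel name, array sizes, and launch configuration are arbitrary choices for the example; a CUDA-capable GPU plus the numba and numpy packages are assumed.

```python
# Minimal sketch: a custom CUDA kernel written in Python via Numba,
# one example of the tooling built on top of NVIDIA's CUDA platform.
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    """out[i] = a * x[i] + y[i], one GPU thread per element."""
    i = cuda.grid(1)            # global thread index
    if i < x.shape[0]:          # guard threads past the end of the array
        out[i] = a * x[i] + y[i]

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)

d_x = cuda.to_device(x)                 # copy inputs to GPU memory
d_y = cuda.to_device(y)
d_out = cuda.device_array_like(d_x)     # allocate output on the GPU

threads = 256
blocks = (n + threads - 1) // threads   # enough blocks to cover every element
saxpy[blocks, threads](np.float32(2.0), d_x, d_y, d_out)

out = d_out.copy_to_host()
print(out[:3], 2.0 * x[:3] + y[:3])     # the two rows should match
```

The point is not this particular kernel but how little scaffolding the CUDA software stack requires to get custom code running on the GPU.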

Strong Versatility

GPUs were originally designed for graphics rendering, but their application fields have steadily expanded. Today, GPUs not only remain central to graphics processing but are also widely used in deep learning, big-data analytics, and other fields. This versatility lets GPUs serve a wide variety of application needs, whereas specialized chips such as ASICs and FPGAs are confined to specific scenarios.

Some people compare the GPU to a versatile kitchen utensil suited to all kinds of cooking, which is why GPUs are considered the best choice for most AI applications. The flip side of that breadth, however, is a lack of "refinement" in specific areas.

Next, let's take a look at the constraints that GPUs face compared to other types of accelerator chips.

02

GPUs also have their constraints

As mentioned at the beginning of this article, common AI accelerator chips fall into three major categories by technical path: GPUs, FPGAs, and ASICs.

An FPGA (Field-Programmable Gate Array) is a semi-custom chip that users can reprogram for their specific needs. FPGAs address the inflexibility of fully custom circuits and overcome the gate-count limitations of earlier programmable devices. They allow flexible configuration at the hardware level and consume less power than CPUs and GPUs. Their downsides are the difficulty of hardware description languages, a higher development threshold, and higher chip costs and prices. Because their structure can be tailored to the task, FPGAs can outperform GPUs and CPUs on certain workloads.

An ASIC (Application-Specific Integrated Circuit) is designed and manufactured for a specific product's needs and is more highly customized than GPUs or FPGAs. In general, ASICs deliver higher computational performance than GPUs and FPGAs. However, they require a large upfront investment and are highly specialized, which limits their versatility: if the algorithm changes, performance drops sharply and the chip must be redesigned.

Let's examine the disadvantages of GPUs compared to these two types of chips.

Firstly, the theoretical performance per unit cost of GPUs is lower than that of FPGAs and ASICs. From a cost perspective, moving from GPUs to FPGAs to ASICs, generality decreases while specialization and customizability increase; design and development costs rise accordingly, but so does theoretical performance per unit cost. For example, classic algorithms or deep learning algorithms still at the laboratory stage are best explored in software on GPUs; techniques that are becoming standardized are good candidates for hardware acceleration on FPGAs; and computing tasks that have become standards are best served by dedicated chips such as ASICs.

From a company's perspective, for the same large-scale data computing task, the deployment costs of mature GPUs and FPGAs with comparable memory capacity and computational power are similar. If the company's business logic changes frequently, say every one to two years, GPUs offer lower development cost and faster deployment. If the business changes only every five years or so, then even though FPGA development costs are high, the chips themselves are much cheaper than GPUs.
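The cost argument can be made concrete with a toy total-cost model. Every dollar figure below is a made-up placeholder chosen only so the two scenarios in the text (fast-changing vs. stable business logic) come out as described; none of it is real pricing data.

```python
# Hypothetical total-cost sketch: development (NRE) cost vs. hardware cost.
# All figures are invented placeholders for illustration, not real prices.
GPU = {"dev_cost": 0.2e6, "unit_cost": 20_000}   # low NRE, pricier chips (assumed)
FPGA = {"dev_cost": 2.0e6, "unit_cost": 5_000}   # high NRE, cheaper chips (assumed)

def total_cost(chip: dict, units: int, redesigns: int) -> float:
    """Total spend = development cost paid once per redesign + hardware cost."""
    return chip["dev_cost"] * redesigns + chip["unit_cost"] * units

UNITS = 200  # accelerators deployed (assumed)

# Business logic changes every 1-2 years over five years -> roughly 3 redesigns.
print("Fast-changing workload:",
      f"GPU ${total_cost(GPU, UNITS, 3):,.0f} vs FPGA ${total_cost(FPGA, UNITS, 3):,.0f}")

# Business logic stable for ~5 years -> a single design.
print("Stable workload:      ",
      f"GPU ${total_cost(GPU, UNITS, 1):,.0f} vs FPGA ${total_cost(FPGA, UNITS, 1):,.0f}")
```

With these placeholder numbers, frequent redesigns make the GPU the cheaper path, while a stable five-year design lets the FPGA's lower chip cost win out, which is the trade-off the paragraph above describes.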

Secondly, the computational speed of GPUs can fall short of FPGAs and ASICs. All three have large numbers of computing units, so their raw computational capability is strong, and all of them run neural network computations far faster than CPUs. However, a GPU's architecture is fixed, so the instructions the hardware natively supports are fixed as well, whereas FPGA circuits can be reconfigured and ASIC circuits are designed around the target algorithm. This ability to tailor the hardware is key: it lets software and device companies offer solutions that differ from their competitors' and adapt the circuit flexibly to the algorithms they use.

Therefore, in many application scenarios, the computational speed of FPGAs and ASICs is significantly better than that of GPUs.

In terms of specific application scenarios, GPUs have strong floating-point capability, making them suitable for high-precision neural network computation; FPGAs are weaker at floating-point arithmetic but excel at pipelined processing of network packets and video streams; and an ASIC's computational capability is limited mainly by cost and by the skill of its hardware designers.

Thirdly, the power consumption of GPUs is significantly higher than that of FPGAs and ASICs.

GPU power consumption is notoriously high: a single chip can draw up to 250 W, or even 450 W in the case of the RTX 4090. FPGAs, by contrast, typically consume only 30-50 W. The gap comes largely from memory access. GPU memory interfaces (GDDR5, HBM, HBM2) offer extremely high bandwidth, roughly 4-5 times that of the traditional DDR interfaces used with FPGAs, but at the chip level, reading DRAM costs more than 100 times the energy of reading on-chip SRAM, so the GPU's constant DRAM traffic translates into very high power draw. In addition, FPGAs run at lower clock frequencies (below 500 MHz) than CPUs and GPUs (1-3 GHz), which further reduces their power consumption.
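A back-of-envelope sketch shows how the two ratios quoted above combine. Only the 100x DRAM-vs-SRAM energy ratio and the roughly 4-5x bandwidth gap come from the text; the absolute per-bit energy and bandwidth figures are illustrative assumptions, not measurements.

```python
# Back-of-envelope: why heavy DRAM traffic drives GPU memory power.
SRAM_PJ_PER_BIT = 0.1                      # hypothetical on-chip SRAM read energy (pJ/bit)
DRAM_PJ_PER_BIT = SRAM_PJ_PER_BIT * 100    # ">100x" ratio quoted in the text

FPGA_DDR_GBPS = 100                        # hypothetical DDR bandwidth for an FPGA card (GB/s)
GPU_HBM_GBPS = FPGA_DDR_GBPS * 4.5         # "~4-5x" bandwidth ratio quoted in the text

def memory_watts(bandwidth_gb_s: float, pj_per_bit: float) -> float:
    """Power spent on memory reads at a given sustained bandwidth."""
    bits_per_second = bandwidth_gb_s * 1e9 * 8
    return bits_per_second * pj_per_bit * 1e-12   # pJ/s -> W

print(f"FPGA DDR reads: {memory_watts(FPGA_DDR_GBPS, DRAM_PJ_PER_BIT):5.1f} W")
print(f"GPU  HBM reads: {memory_watts(GPU_HBM_GBPS, DRAM_PJ_PER_BIT):5.1f} W")
```

Even with identical per-bit DRAM energy, sustaining several times the bandwidth multiplies memory power accordingly, while keeping more data in on-chip SRAM cuts the per-bit cost by the quoted factor of 100.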

Looking at ASICs, the performance and power optimization of ASICs are tailored to specific applications, thus offering higher performance and lower power consumption for specific tasks. Since the design is targeted at specific functions, ASICs generally outperform FPGAs in terms of execution efficiency and energy efficiency.

For example, in the field of intelligent driving, deep learning applications such as environmental perception and object recognition require faster computational response while keeping power consumption low, otherwise, it would significantly impact the driving range of smart cars.

Fourthly, GPU latency is higher than that of FPGAs and ASICs. To maximize parallelism, GPUs typically group incoming samples into fixed-size batches and must wait until a batch has been filled before processing it.

The FPGA architecture, by contrast, is batch-free: as soon as a data packet has been processed it can be output, giving FPGAs an advantage in latency. ASICs can reach even lower latency; once optimized for a specific task, they can usually beat FPGAs because they eliminate the programming and configuration overhead that FPGAs may carry.
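A toy latency model illustrates the batching trade-off just described. Every timing value below is a hypothetical assumption for illustration; real numbers depend entirely on the hardware and the model being served.

```python
# Toy model of the batching/latency trade-off described above.
# Every number here is a hypothetical assumption, not a measurement.
BATCH_SIZE = 32
ARRIVAL_MS = 2.0         # one new request arrives every 2 ms (assumed)
BATCH_COMPUTE_MS = 5.0   # time to process a full batch at once (assumed, GPU-style)
SINGLE_COMPUTE_MS = 1.0  # time to process one request immediately (assumed, FPGA/ASIC-style)

# Batched device: the first request must wait for the rest of the batch to arrive.
wait_to_fill = (BATCH_SIZE - 1) * ARRIVAL_MS
batched_worst_case = wait_to_fill + BATCH_COMPUTE_MS

# Streaming device: each request is handled as soon as it arrives.
streaming_latency = SINGLE_COMPUTE_MS

batched_throughput = BATCH_SIZE / (BATCH_COMPUTE_MS / 1000)    # peak requests per second
streaming_throughput = 1 / (SINGLE_COMPUTE_MS / 1000)

print(f"Batched  : worst-case latency {batched_worst_case:.1f} ms, "
      f"throughput {batched_throughput:.0f} req/s")
print(f"Streaming: latency {streaming_latency:.1f} ms, "
      f"throughput {streaming_throughput:.0f} req/s")
```

Under these assumed numbers the batched device wins on throughput while the streaming device wins on latency, which is why FPGAs and ASICs are attractive when per-request response time matters most.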

Given this, why have GPUs become the hot choice for AI computing today?

In the current market environment, major buyers have not yet imposed stringent requirements on cost and power consumption, and NVIDIA's long-term investment and accumulation in GPUs have made them the most practical hardware for large-model applications. Although FPGAs and ASICs have theoretical advantages, their development processes are relatively complex and they still face many challenges in practice, so they have been hard to popularize widely. Many manufacturers have therefore settled on GPUs, which leads to a fifth potential problem.

Fifthly, the production capacity issue of high-end GPUs is also a cause for concern.

Ilya Sutskever, Chief Scientist at OpenAI, has called GPUs the Bitcoin of the new era. Against the backdrop of surging demand for computing power, NVIDIA's B-series and H-series GPUs have become "hard currency." Yet despite strong demand for these parts, the tight supply of HBM and CoWoS packaging, together with the shortage of TSMC's advanced production capacity, means GPU output simply cannot keep up with demand.

As the saying goes, "even a clever housewife cannot cook a meal without rice." In this situation, technology giants need to respond to market changes more flexibly, either stockpiling GPU products or finding alternative solutions.

Today, many manufacturers have begun to explore and develop more specialized, refined computing devices and solutions beyond the GPU. So how will AI acceleration chips evolve from here?

03

Technology giants take a different path

In today's era of rapid technological change, with big-data algorithms updated almost monthly, GPUs are indeed the right fit for most users. But once future business needs are fixed, FPGAs and even ASICs will become the better underlying computing devices.

Leading chip and technology companies have also started researching and producing special-purpose computing chips for deep learning and deep neural networks, or semi-custom chips based on the FPGA architecture, such as Google's Tensor Processing Unit (TPU) for tensor computation and Intel's Altera Stratix V FPGA.

Google bets on customized ASIC chips: TPU

Google has been quietly developing chips focused on AI machine-learning workloads since 2013 and has deployed them in its cloud data centers in place of NVIDIA GPUs. This self-developed TPU was unveiled in 2016 and is designed to perform the large-scale matrix operations behind deep learning models for natural language processing, computer vision, and recommendation systems. In fact, Google had already installed the TPU v4 in its data centers by 2020, but its details were first disclosed only in April 2023.

It is worth noting that the TPU is a customized ASIC chip, designed from scratch by Google and specifically for machine learning workloads.

On December 6, 2023, Google officially announced Gemini, a new multimodal large model that comes in three versions. According to Google's benchmarks, the Gemini Ultra version demonstrated "state-of-the-art performance" across many tests, even outperforming OpenAI's GPT-4 in most of them.

While Gemini was in the spotlight, Google dropped another bombshell: the brand-new self-developed TPU v5p, its most powerful TPU to date. According to official figures, each TPU v5p pod links 8,960 chips through the highest-bandwidth inter-chip interconnect (ICI) yet, at 4,800 Gbps per chip, arranged in a three-dimensional torus topology; compared with the TPU v4, the TPU v5p doubles the FLOPS and offers substantially more high-bandwidth memory (HBM).

Then, in May of this year, Google announced Trillium, its sixth-generation data-center AI chip (Tensor Processing Unit), saying it would be delivered later this year. According to Google, Trillium offers 4.7 times the computing performance of the TPU v5e and is 67% more energy-efficient than the v5e. The chip is designed to power the technologies that generate text and other content from large models, and Google says it will be available to its cloud customers by the end of the year.

Nvidia is reported to hold roughly 80% of the AI chip market, with the vast majority of the remaining 20% held by various generations of Google TPUs. Google does not sell the chips themselves; it rents access to them through its cloud computing platform.

Microsoft: Cobalt, a general-purpose Arm-based chip, and Maia 100, an ASIC

In November 2023, at its Ignite conference, Microsoft released its first self-developed AI chip, Azure Maia 100, along with Azure Cobalt, a chip for cloud software services. Both will be manufactured by TSMC on a 5 nm process.

Nvidia's high-end products reportedly sell for $30,000 to $40,000 apiece, and ChatGPT is estimated to require around 10,000 such chips, a huge cost for AI companies. Tech giants with large AI-chip needs are actively seeking alternative sources of supply, and Microsoft's decision to build its own chips aims to boost the performance of generative AI products such as ChatGPT while cutting costs.

Cobalt is a general-purpose chip based on the Arm architecture with 128 cores. Maia 100 is an ASIC designed specifically for Azure cloud services and AI workloads, used for cloud training and inference, with 105 billion transistors. Both chips will be introduced into Microsoft's Azure data centers to support services such as OpenAI and Copilot.

Rani Borkar, the vice president in charge of Azure's chip division, said Microsoft has begun testing the Maia 100 with Bing and Office AI products, and that OpenAI, Microsoft's main AI partner and the developer of ChatGPT, is also testing it. Market commentary suggests Microsoft's AI chip push is well timed, coinciding with the takeoff of the large language models being cultivated by Microsoft, OpenAI, and others.

However, Microsoft does not expect its AI chips to broadly replace NVIDIA's products. Some analysts believe that if the effort succeeds, it could also give Microsoft more leverage in future negotiations with NVIDIA.

Microsoft is reportedly expected to announce a series of new cloud software and hardware developments at its upcoming Build conference. Particularly noteworthy: Microsoft will open access to its self-developed Cobalt 100 chip to Azure customers.

Intel Bets on FPGA Chips

Intel notes that early artificial-intelligence workloads, such as image recognition, rely heavily on parallel performance. Because GPUs were designed for video and graphics processing, applying them to machine learning and deep learning became common practice. GPUs excel at parallel processing, performing vast numbers of operations in parallel; when the same workload must be executed many times in quick succession, they deliver remarkable speedups.

However, there are limitations to running artificial intelligence on GPUs. GPUs cannot provide performance comparable to ASICs, which are chips specifically built for a given deep learning workload.

FPGAs, on the other hand, can provide hardware customization with integrated artificial intelligence and can be programmed to work similarly to GPUs or ASICs. The reprogrammable and reconfigurable nature of FPGAs makes them particularly suitable for the rapidly evolving field of artificial intelligence, allowing designers to quickly test algorithms and accelerate the launch of products to the market.

The Intel FPGA family includes the Intel Cyclone 10 GX, Intel Arria 10 GX, and Intel Stratix 10 GX FPGAs, among others. These products offer flexible I/O, low power consumption (low energy per inference), and low latency, which already give them advantages in AI inference. Those advantages are extended in three new Intel FPGA and system-on-chip (SoC) families that deliver a significant jump in AI inference performance: the Intel Stratix 10 NX FPGA and two new members of the Intel Agilex FPGA family, namely the Intel Agilex D-series FPGA and a new Intel Agilex device family code-named "Sundance Mesa." These FPGA and SoC families include dedicated DSP blocks optimized for tensor math, laying the foundation for accelerated AI computation.

In March of this year, chip giant Intel announced the establishment of Altera as an independent FPGA company. Intel had acquired Altera, then the world's second-largest FPGA maker, for $16.7 billion in June 2015. Nine years later, Intel decided to run the FPGA business independently and chose to revive the Altera name.

The NPU (Neural Processing Unit) is also an ASIC, one modeled loosely on human neural synapses. As deep learning neural networks took off, CPUs and GPUs increasingly struggled to keep up, and the NPU, a processor dedicated to deep learning, was born. NPUs adopt a "data-driven parallel computing" architecture and are particularly good at processing massive volumes of multimedia data such as video and images. Unlike the von Neumann architecture followed by CPUs and GPUs, the NPU draws on the structure of neural synapses to integrate storage and computation.

Arm recently announced the Ethos-U85 NPU. The third generation of Arm's edge-AI NPUs, the Ethos-U85 targets scenarios such as industrial automation and video surveillance, delivering four times the performance and a 20% improvement in energy efficiency over the previous generation, and it can reach 85% utilization on commonly used neural networks. It is designed for systems built around Arm Cortex-M or Cortex-A processor cores and can tolerate higher memory latency.

Cambricon, for its part, offers a series of intelligent chip products and platform-based basic system software featuring collaborative deployment, integrated training and inference, and a unified ecosystem. Its products are widely used by server manufacturers and industrial companies, serving the internet, finance, transportation, energy, power, and manufacturing industries.

In addition, OpenAI is also exploring the development of its own AI chips and has begun to evaluate potential acquisition targets. AWS's self-developed AI chip lineup includes the inference chip Inferentia and the training chip Trainium. Electric vehicle manufacturer Tesla is also actively involved in the development of AI accelerator chips. Tesla mainly focuses on the needs of autonomous driving and has launched two AI chips to date: the Full Self-Driving (FSD) chip and the Dojo D1 chip.

In May last year, Meta disclosed details of a data-center project to support its AI work, mentioning that it had built a custom chip, referred to as MTIA, to accelerate the training of generative AI models. This was the first custom AI chip Meta had introduced. Meta described MTIA as one member of a "family" of chips for accelerating AI training and inference workloads, noting that MTIA uses the open-source RISC-V chip architecture and consumes only 25 watts, far less than mainstream offerings from chipmakers such as Nvidia. Notably, in April this year Meta announced the latest version of MTIA. Analysts point out that Meta's goal is to reduce its dependence on chip suppliers such as Nvidia.