
An affordable alternative to Nvidia? Domestic GPU ten-thousand-card clusters have arrived

Date: 2024-07-04

In the past two years, large language models have developed rapidly and the demand for computing power has surged. Yet high-end GPUs such as the NVIDIA A100 are hard to obtain. Is this a challenge or an opportunity? Numerous domestic computing power vendors have begun searching for alternatives.



As the only GPU company in China whose products benchmark NVIDIA's in functionality, Moore Threads is attempting to use a cluster-based approach to help domestic GPUs break through the computing power bottleneck.



On July 3, on the eve of the 2024 World Artificial Intelligence Conference, Moore Threads announced a major upgrade of its KUAE intelligent computing cluster solution, scaling up from the current thousand-card level to ten-thousand-card scale, in order to provide sustained, efficient, stable, and broadly applicable general-purpose computing power for training trillion-parameter large models.







On the main AI battlefield, ten-thousand-card clusters are standard equipment







In the era of large AI models, the giants are all engaged in a computing power arms race.


On May 10, 2023, Google launched the A3 supercomputer (A3 Virtual Machines) with 26,000 NVIDIA H100 GPUs, and built an 8,960-chip TPU v5p cluster based on its self-developed chips;


In March 2024, Meta shared details of two new AI training clusters, each containing 24,576 NVIDIA H100 Tensor Core GPUs, up from roughly 16,000 GPUs in the previous generation;


GPT-4, developed by OpenAI, is said to comprise 16 expert models with a total of 1.8 trillion parameters, and a single training run takes roughly 90 to 100 days on approximately 25,000 A100s.
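As a rough sanity check on figures like these, a back-of-envelope estimate of the aggregate compute behind such a run is sketched below. The A100 peak throughput and the utilization factor are assumptions of this sketch, not figures from the article.

```python
# Back-of-envelope estimate of the raw compute behind a "25,000 A100s for ~95 days" run.
# Assumptions (not from the article): A100 peak of ~312 TFLOPS (BF16, dense) and an
# effective model FLOPs utilization (MFU) of roughly 35%, typical for large-scale training.

A100_PEAK_FLOPS = 312e12      # assumed peak BF16 throughput per A100, FLOP/s
NUM_GPUS = 25_000             # from the article
TRAIN_DAYS = 95               # midpoint of the 90-100 day range in the article
MFU = 0.35                    # assumed utilization of peak throughput

seconds = TRAIN_DAYS * 24 * 3600
peak_total = A100_PEAK_FLOPS * NUM_GPUS * seconds
useful = peak_total * MFU

print(f"Aggregate peak compute        : {peak_total:.2e} FLOPs")
print(f"Useful compute at {MFU:.0%} MFU : {useful:.2e} FLOPs")
```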

Practice has shown that on the main battlefield of large AI models, the ten-thousand-card cluster is already standard equipment.






So what kind of computing power does the era of large AI models require? The development trends of large models offer some clues.



Driven by the scaling laws put forward in 2020, large models have followed a path of "brute-force aesthetics". Taking the evolution of OpenAI's ChatGPT as an example, large-model training has pushed parameter counts from billions to trillions, an increase of at least 100x; the volume of training data has grown from the TB level to 10+ TB, at least 10x; and the total computation has increased by at least 1,000x. Models of this scale need sufficiently large computing power to keep up with the pace of technological evolution.
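The roughly 1,000x growth in compute follows from these two factors under the widely used rule of thumb that dense Transformer training compute is about C ≈ 6·N·D (parameters times training tokens). A minimal sketch, with illustrative model sizes that are not taken from the article:

```python
# Minimal sketch of the common C ≈ 6 * N * D rule of thumb for dense Transformer
# training compute (N = parameters, D = training tokens). The model sizes below
# are illustrative assumptions, not figures from the article.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs for a dense Transformer."""
    return 6 * params * tokens

small = training_flops(params=1e10, tokens=1e12)   # ~10B params, ~1T tokens
large = training_flops(params=1e12, tokens=1e13)   # ~1T params, ~10T tokens

# 100x more parameters and 10x more data => roughly 1,000x more compute.
print(f"small run : {small:.1e} FLOPs")
print(f"large run : {large:.1e} FLOPs")
print(f"ratio     : {large / small:.0f}x")
```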



Beyond sheer scale, AI computing power must also be general-purpose. Current large models are mostly built on the Transformer architecture, which, while mainstream, is far from settled: models are still evolving from dense to MoE, from single-modality to multimodal, and from diffusion to autoregressive approaches. Meanwhile, innovative non-Transformer architectures such as Mamba, RWKV, and RetNet keep emerging. The Transformer is therefore not necessarily the final answer.



In addition, the integration of AI, 3D, and HPC across technologies and fields is accelerating: AI plus 3D for spatial intelligence, AI plus simulation for physical intelligence, and AI plus scientific computing for AI for Science (AI4Science). Evolving computing paradigms and the diverse computing power demands of more scenarios have created a need for a universal accelerated computing platform.






As parameter counts grow from billions to trillions, large models urgently need a "super training factory", that is, a "large and universal" accelerated computing platform, to dramatically shorten training time and enable rapid iteration of model capabilities. "Only when the scale is large enough, the computing is universal enough, and the ecosystem compatibility is good can it truly be useful," pointed out Zhang Jianzhong, founder and CEO of Moore Threads.



The super ten-thousand-card cluster has become standard equipment for pre-training large models. For infrastructure vendors, whether or not they have a ten-thousand-card cluster will decide who wins on the main AI battlefield.



However, building a ten-thousand-card cluster is no easy task.



A ten-thousand-card cluster is not a simple stack of ten thousand GPU cards, but an extremely complex systems engineering effort.






First, it involves large-scale networking and interconnection, and the question of how to raise the cluster's effective computing efficiency. Extensive practice has shown that a linear increase in cluster size does not directly translate into a linear increase in effective computing power.
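A toy weak-scaling model illustrates why: if each GPU keeps the same local workload but per-step collective communication grows with cluster size, effective throughput falls below linear. The step times and per-card throughput below are illustrative assumptions, not measured data from any vendor.

```python
# Toy weak-scaling model (illustrative assumptions, not measured data): each GPU keeps
# the same local workload per step, but collective-communication time grows with the
# cluster size, so effective compute grows sub-linearly.

import math

COMPUTE_MS = 900.0          # assumed per-GPU compute time per training step, ms
COMM_BASE_MS = 40.0         # assumed communication cost factor per log2(cluster size), ms

def effective_pflops(num_gpus: int, per_gpu_pflops: float = 1.0) -> float:
    """Effective cluster throughput under a simple compute + communication step model."""
    comm_ms = COMM_BASE_MS * math.log2(num_gpus) if num_gpus > 1 else 0.0
    utilization = COMPUTE_MS / (COMPUTE_MS + comm_ms)
    return num_gpus * per_gpu_pflops * utilization

for n in (1_000, 4_000, 10_000):
    eff = effective_pflops(n)
    print(f"{n:>6} GPUs: ~{eff:,.0f} effective PFLOPS ({eff / n:.0%} of linear)")
```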

In addition, high training stability and availability, together with fast fault localization and diagnostic tools, are crucial. A super ten-thousand-card cluster comprises thousands of GPU servers, thousands of switches, and tens of thousands of optical fibers and optical modules. A training job involves the joint operation of millions of components, and the failure of any one of them may interrupt training.
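A rough reliability estimate shows why fast fault localization and recovery matter so much at this scale: with assumed (not vendor-published) per-component MTBF figures, the failure rates of independent components add up, and the cluster-level mean time between interruptions shrinks to hours unless failures can be masked or recovered quickly.

```python
# Rough reliability sketch (assumed MTBF figures, not vendor data): even highly reliable
# components yield frequent cluster-level interruptions at ten-thousand-card scale,
# because any single failure can stall a synchronous training job.

COMPONENTS = {
    # name: (count, assumed per-unit MTBF in hours)
    "GPU":            (10_000, 200_000),
    "server":         (1_250,  500_000),
    "switch":         (1_000,  1_000_000),
    "optical module": (30_000, 2_000_000),
}

# For independent components, failure rates add: lambda_total = sum(count / MTBF).
total_failure_rate = sum(count / mtbf for count, mtbf in COMPONENTS.values())
cluster_mtbf_hours = 1.0 / total_failure_rate

print(f"Cluster-level mean time between interruptions: "
      f"~{cluster_mtbf_hours:.0f} hours (~{cluster_mtbf_hours / 24:.1f} days)")
```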

Furthermore, large models iterate and innovate constantly. New model types and architectures require the ten-thousand-card cluster to support Day-0 ecosystem migration in order to adapt to ever-changing technical needs. At the same time, the cluster cannot be limited to today's large-model acceleration scenarios; future demand for general-purpose computing must also be considered.

Building a ten-thousand-card cluster is as arduous as mountaineering, but it is the hard and correct path.

Building a Large Model Training Super Factory

After nearly four years of accumulation, and building on the successful validation of its thousand-card clusters, Moore Threads has launched the KUAE ten-thousand-card intelligent computing cluster solution, which meets the core computing power requirements of the large-model era (sufficient scale, universal computing, and ecosystem compatibility) and further upgrades domestic cluster computing capability.

The Moore Threads KUAE ten-thousand-card cluster is built on fully functional GPUs and integrates software and hardware into a complete system-level computing power solution. It comprises the KUAE computing cluster as the core infrastructure, the KUAE Cluster Management Platform, and the KUAE Model Studio, and aims to solve the construction and operation of large-scale GPU computing power through integrated delivery. The solution works out of the box, greatly reducing the time traditionally spent building computing infrastructure, developing applications, and setting up operations platforms, and enabling rapid launch and commercial operation.

The KUAE ten-thousand-card intelligent computing solution has five characteristics (a rough arithmetic sketch of the headline figures follows this list):

A single cluster exceeds 10,000 cards in scale, with total computing power of more than 10,000 PFLOPS;

A target effective cluster computing efficiency of up to 60%;

Excellent stability: an average weekly effective training rate of over 99%, an average fault-free running time of over 15 days, and a maximum stable training duration of over 30 days;

Strong computational versatility: designed for general-purpose computing and able to accelerate large models of all kinds;

Good CUDA compatibility and instant ecosystem adaptation, accelerating Day-0 migration of new models.
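Taken together, the headline figures above imply the following rough arithmetic. The per-card throughput here is inferred from the stated totals, not from a published per-card specification.

```python
# Rough arithmetic implied by the listed figures (per-card throughput is inferred from
# the stated totals, not a published per-card spec).

NUM_CARDS = 10_000
TOTAL_PFLOPS = 10_000          # "total computing power of more than 10,000 PFLOPS"
EFFECTIVE_RATIO = 0.60         # "effective cluster computing efficiency of up to 60%"
WEEKLY_TRAINING_RATE = 0.99    # "average weekly effective training rate of over 99%"

per_card_pflops = TOTAL_PFLOPS / NUM_CARDS
effective_pflops = TOTAL_PFLOPS * EFFECTIVE_RATIO
usable_hours_per_week = 7 * 24 * WEEKLY_TRAINING_RATE

print(f"Implied per-card throughput : ~{per_card_pflops:.1f} PFLOPS")
print(f"Effective cluster throughput: ~{effective_pflops:,.0f} PFLOPS")
print(f"Usable training hours/week  : ~{usable_hours_per_week:.0f} of 168")
```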

"We hope that our product can provide customers with a better and selectable localization tool. When foreign products cannot be used, it can be easily and quickly used on domestic platforms." Zhang Jianzhong said, "For current large model users in China, our biggest advantage is excellent ecological compatibility. Developers can port to our Kua'e cluster with almost no code modification, and the migration cost is close to zero. The migration work can be completed in a few hours."

For this large-model training factory to truly operate, it also needs the support of a circle of ecosystem partners:

Domestic large-model companies such as Zhipu AI, the Zhiyuan Research Institute, Peking University Rabbit Exhibition, Dipu Technology, Shizhe AI, Yuren Technology, Lechuang Energy, Ruilai Intelligence, Real Intelligence, Reportify, Hanhou Group, and Yijing Zhilian have all run successfully on Moore Threads' KUAE cluster. It is worth noting that Moore Threads is the first domestic GPU company to be connected to the Wuwen Xinqiong (Infinigence AI) platform and carry out large-scale model training, and KUAE is also the first cluster in the industry to successfully run domestic large models end to end.

Making domestic GPU computing clusters truly useful

The ten-thousand-card cluster is a mega-project that requires the joint participation of the industry. At the launch event, Moore Threads signed strategic agreements for ten-thousand-card cluster projects with leading central state-owned enterprises such as Qinghai Mobile and Qinghai Unicom. These collaborations will further promote the deployment of Moore Threads' ten-thousand-card clusters across the country.



With advantages such as high compatibility, high stability, high scalability, and high computing power utilization, the Moore Threads KUAE intelligent computing cluster has won recognition from multiple large-model companies and become an important force in large-model training and application in China. "A few years ago, domestic computing power was just a backup option for customers; now it has become their first choice, because they need guaranteed long-term supply and local service," Zhang Jianzhong explained.

Although building a ten-thousand-card cluster is a daunting task, Moore Threads has shown the determination to climb this difficult but correct path. This is not only about meeting the computing power needs of a single company, but about addressing the computing power shortage across the entire industry. Difficult, but necessary.



Epilogue



The release of Moore Threads' full-stack solution for the ten-thousand-card KUAE intelligent computing center marks a significant breakthrough in domestic GPU computing power, and will prioritize the training of complex trillion-parameter large models. Moore Threads now positions itself not merely as a GPU company, but as an accelerated computing platform company focused on AI.

 
