
Moore Threads’ “AI Factory”: Defining Next-Generation AI Infrastructure through System-Level Innovation

Guide: The “AI Factory” proposed by Moore Threads, much like a process upgrade at a chip fab, is a systematic, end-to-end transformation. It demands innovation spanning the underlying chip architecture, the overall cluster architecture, and the software algorithms and resource scheduling systems above them.
Shanghai, July 25 – On the eve of the opening of the World Artificial Intelligence Conference (WAIC 2025), Moore Threads held a technical sharing session themed “Evolution of Computing Power, Revolution of Precision” and innovatively proposed the concept of “AI Factory”. In his keynote speech, Zhang Jianzhong, founder and CEO of Moore Threads, stated that to address the bottleneck in large model training efficiency amid the explosive growth of generative AI, Moore Threads will build next-generation AI training infrastructure through system-level engineering innovation, committed to creating a “super factory” for producing advanced models in the AGI era.
Forging a “Super Factory” for Advanced Models
The competition in cutting-edge AI models is driving rapid gains in AI capability, with global tech giants iterating models at an astonishing pace. From the rapid updates of the GPT series and Gemini to DeepSeek and Qwen, model training iteration cycles have shortened to under three months. This high-frequency iteration is not confined to large language models (LLMs); it extends to frontier fields such as multimodal models, speech models, and world models. These models’ exponential breakthroughs in performance, efficiency, and application scenarios are not only pushing AI from dedicated domains toward general intelligence; their rapid iteration also creates an urgent demand for next-generation high-performance AI computing infrastructure.
The “AI Factory” proposed by Moore Threads, like a process upgrade at a chip fab, is a systematic, end-to-end transformation. It requires innovation spanning the underlying chip architecture, the overall cluster architecture, and the software algorithms and resource scheduling systems above them. This all-round infrastructure transformation will scale AI training from the thousand-card level to tens of thousands or even hundreds of thousands of cards, achieving a leap in productivity and innovation efficiency through system-level engineering.
The intelligent “production capacity” of this “AI Factory” is jointly determined by five core elements, and its efficiency formula can be summarized as: AI Factory Production Efficiency = Generalization of Accelerated Computing × Effective Computing Power of Single Chip × Single Node Efficiency × Cluster Efficiency × Cluster Stability.
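Because the five factors multiply rather than add, a weakness in any single link caps the output of the entire factory. The toy Python sketch below illustrates this multiplicative model; every factor value is an illustrative assumption, not a Moore Threads figure:

```python
# Toy sketch of the multiplicative efficiency model described above.
# Every factor value here is an illustrative assumption, not a vendor figure.
factors = {
    "generalization_of_accelerated_computing": 0.95,
    "effective_single_chip_compute": 0.90,
    "single_node_efficiency": 0.85,
    "cluster_efficiency": 0.80,
    "cluster_stability": 0.99,
}

production_efficiency = 1.0
for name, value in factors.items():
    production_efficiency *= value

# Prints roughly 57.6%: even with every factor fairly high, the product
# drops quickly, which is why each factor gets its own dedicated technology.
print(f"AI Factory production efficiency: {production_efficiency:.1%}")
```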
Building on the general-purpose computing power of its full-featured GPUs, Moore Threads aims to turn the potential of its full-featured GPU accelerated computing platform into engineering-grade training efficiency and reliability guarantees, through deep, coordinated innovation across advanced architecture, chip computing power, single-node efficiency, cluster efficiency, and reliability.
Five Core Technologies: Systematically Improving AI Training Efficiency
Through system-level innovation and deep hardware-software co-design, Moore Threads builds the “AI Factory” on five core technologies, aiming for a qualitative leap in large model training efficiency.
Technology 1: Full-Featured GPU to Achieve Generalization of Accelerated Computing
In the construction of AI infrastructure, completeness of computing functions and precision support is the cornerstone for serving diverse scenarios. Moore Threads, with its self-developed full-featured GPU at the core, has built a general-purpose base with “complete functions” and “complete precision”, covering all scenario needs from AI training and inference to scientific computing.
Innovative breakthrough: Single chip covers multiple scenarios. Based on the breakthrough design of the MUSA architecture, a single Moore Threads GPU integrates AI compute acceleration, graphics rendering, physical simulation, and ultra-high-definition video encoding and decoding, adapting to diverse application scenarios such as AI training and inference, embodied intelligence, and AIGC.
Precision benchmark: 20%~30% performance improvement. In terms of computing precision, Moore Threads supports a complete precision spectrum from FP64 to INT8, and its FP8 mixed-precision technology delivers a 20%~30% performance improvement in training mainstream cutting-edge large models, setting an industry benchmark for the compute efficiency of domestic GPUs.
Forward-looking layout: Promoting the evolution of AI infrastructure. This technical system not only meets the efficient computing needs of the large model era but also provides forward-looking support for the evolution of world models and emerging AI architectures, helping AI infrastructure continue upgrading toward high generalization and complete precision coverage.
Technology 2: Self-Developed MUSA Architecture to Improve Effective Computing Power of Chips
Strong effective computing power of chips is the core driving force for the efficient operation of the “AI Factory”. Based on the self-developed MUSA architecture, Moore Threads significantly improves the computing efficiency of a single GPU through three breakthroughs in computing, memory, and communication.
Innovative architecture breaks traditional limitations: Moore Threads adopts an innovative multi-engine, scalable GPU architecture, and builds a globally shared pool of computing, memory, and communication resources through hardware resource pooling and dynamic resource scheduling technology. This design not only breaks the limitation of single-function traditional GPUs but also significantly improves resource utilization while ensuring generalization. Its parameterized configurable scalable architecture allows for rapid tailoring of optimized chip configurations for target markets, greatly reducing the development cost of new chips.
Significant improvement in computing performance: At the computing level, Moore Threads’ AI acceleration system (TCE/TME) fully supports mixed-precision computing across INT8/FP8/FP16/BF16/TF32. As one of the first domestic GPU manufacturers to bring FP8 compute to mass production, Moore Threads improves Transformer computing performance by about 30% while preserving accuracy, through innovations such as fast format conversion, intelligent dynamic-range adaptation, and high-precision accumulators.
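As a rough illustration of the high-precision accumulator idea, the NumPy sketch below simulates FP8 (E4M3) rounding of matrix inputs while accumulating the product in FP32. The quantizer is a crude approximation written for this article (it clamps to the E4M3 range of ±448 and ignores denormals) and is not Moore Threads’ implementation:

```python
import numpy as np

def quantize_fp8_e4m3(x: np.ndarray) -> np.ndarray:
    """Crude FP8 E4M3 simulation: clamp to +/-448, keep ~3 mantissa bits."""
    x = np.clip(x, -448.0, 448.0)
    exponent = np.floor(np.log2(np.maximum(np.abs(x), 1e-12)))
    step = 2.0 ** (exponent - 3)      # spacing of representable values
    return np.round(x / step) * step

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

# Inputs rounded to simulated FP8, accumulation done in FP32:
c_fp8 = quantize_fp8_e4m3(a) @ quantize_fp8_e4m3(b)
c_ref = a @ b

rel_err = np.linalg.norm(c_fp8 - c_ref) / np.linalg.norm(c_ref)
print(f"relative error from FP8-rounded inputs: {rel_err:.3%}")
```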
Comprehensive optimization of memory and communication efficiency: In the memory system, technologies such as multi-precision near-memory reduction engines, low-latency Scale-Up, and parallel resource isolation for general computing deliver 50% bandwidth savings and 60% latency reduction. In communication and interconnect, the ACE asynchronous communication engine reduces computing resource consumption by 15%, and MTLink2.0 interconnect provides 60% higher bandwidth than the domestic industry average, laying a solid foundation for large-scale cluster deployment.
Technology 3: MUSA Full-Stack System Software to Improve Single-Node Computing Efficiency
As AI computing power competition enters a deeper stage, Moore Threads achieves key technological breakthroughs through MUSA full-stack system software, promoting the AI Factory from single-point innovation to system-level efficiency improvement. Its core innovations include:
Task scheduling optimization: Kernel launch time is reduced by 50%;
Extreme performance operator library: GEMM operator computing power utilization reaches 98%, and Flash Attention operator computing power utilization exceeds 95%;
Communication efficiency leap: MCCL communication library achieves 97% bandwidth utilization of RDMA networks; cluster performance is improved by 10% by optimizing computing-communication parallelism based on asynchronous communication engines;
Innovation in low-precision computing efficiency: FP8 optimization and recomputation technology significantly reduce training overhead;
Improved development ecosystem: Based on Triton-MUSA compiler + MUSA Graph, DeepSeek R1 inference is accelerated by 1.5 times, fully compatible with mainstream frameworks such as Triton.
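The operator-library figures above are utilization ratios: achieved FLOP/s divided by the device’s peak. Below is a minimal host-side sketch of how such a ratio is computed, with an assumed placeholder peak; a real measurement would time the MUSA GEMM kernel against the GPU’s rated peak:

```python
import time
import numpy as np

# A GEMM of shape (M, K) x (K, N) performs about 2 * M * N * K FLOPs.
M = N = K = 4096
a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

start = time.perf_counter()
_ = a @ b
elapsed = time.perf_counter() - start

achieved_flops = 2 * M * N * K / elapsed
PEAK_FLOPS = 1.0e12   # placeholder; substitute the device's rated peak

print(f"achieved: {achieved_flops / 1e9:.1f} GFLOP/s, "
      f"utilization vs. assumed peak: {achieved_flops / PEAK_FLOPS:.1%}")
```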
Technology 4: Self-Developed KUAE Large-Scale Cluster to Optimize Cluster Efficiency
When single-node efficiency reaches a new height, how to achieve efficient collaboration of large-scale clusters becomes a new challenge. Moore Threads’ self-developed KUAE computing cluster realizes efficient collaboration of thousands of nodes through 5D large-scale distributed parallel computing technology, promoting AI infrastructure from single-point optimization to system engineering-level breakthroughs.
Innovative 5D parallel training: Moore Threads integrates data, model, tensor, pipeline, and expert parallel technologies, fully supporting mainstream architectures such as Transformer, significantly improving the training efficiency of large-scale clusters.
Performance simulation and optimization: The self-developed Simumax tool automatically searches for optimal parallel strategies for ultra-large-scale clusters, accurately simulates FP8 mixed-precision training and operator fusion, providing a scientific basis for shortening the training cycle of models such as DeepSeek.
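The kind of strategy search Simumax automates can be sketched as a brute-force scan over parallel degrees whose product matches the cluster size. The cost model below is a deliberate placeholder, and for brevity the sketch covers four of the five dimensions (data, tensor, pipeline, expert), omitting the separate model-parallel axis:

```python
from itertools import product

WORLD_SIZE = 4096                      # assumed number of GPUs in the cluster
DEGREES = [1, 2, 4, 8, 16, 32, 64]     # candidate degrees per dimension

def toy_cost(dp: int, tp: int, pp: int, ep: int) -> float:
    # Placeholder heuristic: penalize pipeline bubbles and tensor-parallel
    # communication more than data parallelism. A real simulator models
    # actual compute, memory, and network behavior.
    return 1.5 * pp + 1.2 * tp + 1.1 * ep + 0.1 * dp

best = None
for dp, tp, pp, ep in product(DEGREES, repeat=4):
    if dp * tp * pp * ep != WORLD_SIZE:
        continue
    cost = toy_cost(dp, tp, pp, ep)
    if best is None or cost < best[0]:
        best = (cost, dp, tp, pp, ep)

print("best (cost, dp, tp, pp, ep):", best)
```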
Second-level backup and recovery: To address large model training stability, the innovative CheckPoint acceleration scheme uses RDMA technology to compress backup and recovery times for checkpoints hundreds of gigabytes in size from several minutes to about one second, improving effective GPU compute utilization.
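A back-of-envelope check makes the second-level claim plausible: shard the checkpoint across nodes and restore it over parallel RDMA links. All figures below are assumptions chosen for illustration, not published KUAE numbers:

```python
# Rough transfer-time estimate for a sharded checkpoint restore over RDMA.
checkpoint_gb = 300        # assumed checkpoint size (hundreds of GB)
link_gbps = 400            # assumed per-node RDMA bandwidth, Gbit/s
parallel_nodes = 16        # assumed nodes restoring shards in parallel

aggregate_gbps = link_gbps * parallel_nodes
seconds = checkpoint_gb * 8 / aggregate_gbps   # GB -> Gbit, then divide

print(f"~{seconds:.2f} s to restore {checkpoint_gb} GB")   # ~0.38 s here
```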
Technology 5: Zero-Interruption Fault Tolerance Technology to Improve Cluster Stability and Reliability
On the basis of building an efficient cluster, a stable and reliable operating environment is the guarantee for the continuous output of the “AI Factory”.
Especially in 10,000-card-level AI clusters, training interruptions caused by hardware failures severely waste computing power. Moore Threads innovatively introduces zero-interruption fault tolerance: when a failure occurs, only the affected node group is isolated while the remaining nodes continue training, and standby machines seamlessly take over, so the process runs without interruption throughout. This solution raises the effective training time ratio of KUAE clusters to over 99% and significantly reduces recovery costs.
At the same time, KUAE provides dynamic monitoring and intelligent diagnosis through a multi-dimensional training insight system, improving exception-handling efficiency by 50%; combined with cluster inspections and pre-launch checks, the training success rate increases by 10%, providing a stable guarantee for large-scale AI training.
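The over-99% effective-training-time figure can be related to failure frequency and recovery cost with a simple availability model; the inputs below are assumptions, not published KUAE measurements:

```python
# Toy availability model: fraction of wall-clock time spent training.
mtbf_hours = 24.0          # assumed mean time between node-group failures
downtime_minutes = 10.0    # assumed isolate-and-resume cost per failure

effective_ratio = mtbf_hours / (mtbf_hours + downtime_minutes / 60.0)
print(f"effective training time ratio: {effective_ratio:.2%}")   # ~99.31%
```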
From Training to Verification: Building a Complete Closed Loop
With the goal of building a “super factory” for advanced models, Moore Threads has assembled its efficient “AI Factory” from five core elements: the general-purpose computing power of full-featured GPUs, the innovative MUSA architecture, the optimized MUSA software stack, the self-developed KUAE cluster, and zero-interruption fault tolerance technology, providing strong and reliable infrastructure support for AI large model training.
A complete “AI Factory” not only needs to efficiently train large models but also needs to have inference and verification capabilities. Based on the self-developed MUSA technology stack, Moore Threads has built a full-process inference solution covering LLM, vision, and generative models, realizing seamless connection of “training-verification-deployment”. Its self-developed MT Transformer inference engine, TensorX inference engine, and vLLM-MUSA inference framework provide extreme performance support for model verification and deployment.
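As a usage illustration: assuming vLLM-MUSA preserves upstream vLLM’s Python interface (the article does not describe the API, so this is an assumption), serving a model for verification might look like the sketch below; the model name is purely an example:

```python
# Minimal sketch using the upstream vLLM Python API; vLLM-MUSA is assumed
# to keep this interface. The model name is an illustrative placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize mixed-precision training in one paragraph."], params)
print(outputs[0].outputs[0].text)
```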
AI Factory Drives Intelligent Upgrades in Thousands of Industries
Relying on the AI Factory, Moore Threads has built an efficient system covering the entire “training-inference-deployment” pipeline. This breakthrough marks that domestic computing infrastructure now has the key capabilities to support large-scale, high-efficiency, and high-reliability model production in the AGI era.
From the cornerstone of graphics rendering to the AI computing power engine, Moore Threads’ full-featured GPUs continue to accelerate computing innovation. With “KUAE + MUSA” as the core of intelligent computing business, Moore Threads will accelerate the empowerment of thousands of industries, promoting the application and deployment of full-featured GPU-driven AI technologies in key fields such as physical simulation, AIGC, scientific computing, embodied intelligence, intelligent agents, medical image analysis, and industrial large models.
At the same time, Moore Threads believes that openness is the source of ecological prosperity. Moore Threads will host the first MUSA Developer Conference in October this year, inviting global developers to explore cutting-edge technologies and jointly build the new, independent MUSA ecosystem.
About Moore Threads
Moore Threads, with full-featured GPUs as the core, is committed to providing global accelerated computing infrastructure and one-stop solutions, offering strong AI computing support for the digital and intelligent transformation of various industries.
Our goal is to become an internationally competitive leading GPU enterprise, building an advanced accelerated computing platform for the digital and intelligent world integrating artificial intelligence and digital twins. Our vision is to accelerate for a better world.
