SenseTime SenseCore x Huawei Ascend 384 Supernode: Adaptation Complete, Domestic AI Infrastructure Accelerates Again

2025-09-04 16:17

Recently, SenseTime's AI infrastructure platform SenseCore became the first to complete full adaptation to the Ascend 384 supernode, meeting its targets in both functional and performance verification. The result is a key step in moving domestic AI computing power from "usable" to "easy to use", and it provides solid support for efficient training and inference of large models.

A supernode (SuperPoD) is an architecture that integrates multiple GPUs/NPUs into a single computing unit through high-speed interconnects, addressing the compute coordination and communication efficiency problems of large-model training.

The Atlas 900 A3 SuperPoD, launched by Huawei, is presented as the industry's largest supernode solution. Its "fully peer-to-peer architecture" extends the high-speed interconnect bus from inside the server to the whole cabinet, and even across cabinets. As a result, CPU, NPU, DPU, storage, and memory resources are all interconnected and pooled to form a single "supercomputer" with higher compute density and interconnect bandwidth.

SenseTime partners with Huawei to achieve multiple innovations in supernode adaptation

The new architecture that Huawei has launched also places higher demands on software stack upgrades and platform scheduling optimization, so that workloads can run both fast and stably.

As an AI cloud-native platform, SenseCore is committed to providing users with agile, flexible, and reliable full-stack AI infrastructure services, and to promoting the efficient deployment and large-scale application of large-model technology at the best possible cost-effectiveness.

Building on the characteristics of SenseCore and the Ascend 384 supernode, the two teams worked jointly on the key problems and proposed a number of innovations in scheduling optimization, system stability, and fault recovery.

Scheduling optimization: Beyond basic capabilities such as single-node and multi-node scheduling within a POD, cross-POD multi-node scheduling, and affinity scheduling, the SenseCore platform works with the model's parallel strategy to divide logical supernodes automatically. This lets communication-heavy strategies such as EP (expert parallelism) and TP (tensor parallelism) make full use of the Lingqu interconnect and improves model training efficiency.
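
The release does not describe how logical supernodes are divided. As a rough illustration only, the sketch below assumes that each communication group (for example one TP or EP group) should be placed entirely inside one physical POD so its traffic stays on the intra-POD interconnect. The function name, the node tuples, and the `group_size` parameter are hypothetical and are not part of SenseCore.

```python
from collections import defaultdict


def partition_logical_supernodes(nodes, group_size):
    """Group nodes so that each communication group stays inside one POD.

    nodes: list of (node_name, pod_id) tuples (hypothetical inventory format).
    group_size: number of nodes one TP/EP communication group needs.
    Returns (groups, leftover): groups is a list of node-name lists, each
    drawn from a single POD; leftover holds nodes that did not fill a group.
    """
    by_pod = defaultdict(list)
    for name, pod in nodes:
        by_pod[pod].append(name)

    groups, leftover = [], []
    for pod in sorted(by_pod):
        members = sorted(by_pod[pod])
        # Only whole groups are formed; a partial group would span PODs.
        usable = len(members) - len(members) % group_size
        for i in range(0, usable, group_size):
            groups.append(members[i:i + group_size])
        leftover.extend(members[usable:])
    return groups, leftover


nodes = [("a0", "pod0"), ("a1", "pod0"), ("a2", "pod0"),
         ("b0", "pod1"), ("b1", "pod1")]
groups, leftover = partition_logical_supernodes(nodes, group_size=2)
```

In this toy input, node `a2` is left unassigned rather than paired with a node from another POD. A real scheduler would weigh that trade-off against utilization.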

Cross-POD training stability: The SenseCore team also submitted multiple merge requests (MRs) to fix out-of-order rank assignment for master/worker tasks in multi-POD scenarios, resolving at the root the intermittent failure of cross-POD training jobs.
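
The actual fixes are not detailed in the release. One common way to avoid this class of problem, shown here purely as an assumption, is to derive ranks from a deterministic sort key instead of from the order in which tasks happen to start. The task fields `role`, `pod`, and `index` are hypothetical.

```python
def assign_ranks(tasks):
    """Assign ranks deterministically: master first, then workers by (pod, index).

    tasks: list of dicts with hypothetical keys 'role' ('master' or 'worker'),
    'pod' (POD identifier), and 'index' (task index within its POD).
    Returns a dict mapping (role, pod, index) to an integer rank.
    """
    ordered = sorted(
        tasks,
        # False sorts before True, so the master task always gets rank 0.
        key=lambda t: (t["role"] != "master", t["pod"], t["index"]),
    )
    return {(t["role"], t["pod"], t["index"]): rank
            for rank, t in enumerate(ordered)}


tasks = [
    {"role": "worker", "pod": "pod1", "index": 0},
    {"role": "master", "pod": "pod0", "index": 0},
    {"role": "worker", "pod": "pod0", "index": 1},
]
ranks = assign_ranks(tasks)
```

Because the result depends only on the task attributes, every participant that runs the same function on the same task list computes the same rank table, regardless of startup order.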

Multi-dimensional fault detection and recovery: Fault detection covers multiple dimensions across hardware and software, from server hardware, the high-speed interconnect bus, and the RoCE network to tasks and processes. Building on this detection, a multi-level recovery mechanism at the Job, Pod, and process levels comprehensively improves the reliability and fault tolerance of the 384 supernode in training scenarios.
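
A multi-level recovery mechanism of this kind can be pictured as an escalation ladder. The sketch below is an illustration under assumptions, not SenseCore's implementation: it assumes recovery starts at the narrowest level that covers the detected fault and escalates from process to Pod to Job only when the cheaper restart fails. The `try_restart` callback is hypothetical.

```python
RECOVERY_LEVELS = ["process", "pod", "job"]  # narrowest to broadest


def recover(fault_scope, try_restart):
    """Escalate recovery until one level succeeds.

    fault_scope: narrowest level that covers the detected fault.
    try_restart: hypothetical callback taking a level name and returning
    True if restarting at that level brought the workload back.
    Returns the level that succeeded, or None if every level failed.
    """
    start = RECOVERY_LEVELS.index(fault_scope)
    for level in RECOVERY_LEVELS[start:]:
        if try_restart(level):
            return level
    return None


# Example: a process restart is not enough, but rescheduling the Pod works.
attempts = []


def fake_restart(level):
    attempts.append(level)
    return level == "pod"


result = recover("process", fake_restart)
```

Starting from the narrowest level keeps the cost of recovery low, since a process restart disturbs far less of a training job than resubmitting the whole Job.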

The successful adaptation of SenseCore to the Ascend 384 supernode makes multi-tenant, large-scale, elastic AI Cloud as a Service possible. SenseCore has also completed delivery to one customer, demonstrating end-to-end delivery capability for the Ascend 384 supernode, from liquid-cooled clusters to the AI platform. Going forward, the two parties will explore more application scenarios, including large-model inference acceleration, agent application deployment, and large-model training and inference optimization for vertical industries, to further accelerate the adoption of SenseCore-based Ascend 384 supernodes across industries.

Xuan Shanming, CTO of SenseTime's SenseCore business group, said: "SenseCore attaches great importance to building the domestic computing power ecosystem and is deeply involved in it. SenseCore has become the first AI cloud platform to complete adaptation to the Ascend 384 supernode. This reflects the openness, mature functionality, and extensive application experience of the SenseCore platform, and it is also an important milestone in the integrated development of domestic AI infrastructure.

SenseCore fully unleashes the potential of Ascend's computing power through deep integration with Ascend, providing the industry with a more agile, intelligent, and reliable computing power foundation. SenseTime will also build AI solutions for various industries on this basis and jointly promote the intelligent upgrade of thousands of industries."

Source: Corporate press release