Nvidia's new GPU! Single-rack AI performance up 650%, 100TB of fast memory, built for long-context inference

Every $100 million invested can bring in $5 billion in token revenue.

Author | ZeR0

Editor | Moying

Xindongxi reported on September 10 that last night NVIDIA made its latest AI computing move, unveiling a new dedicated GPU designed for long-context inference and video generation applications - NVIDIA Rubin CPX.

NVIDIA founder and CEO Jensen Huang said: "Just as RTX revolutionized graphics and physical AI, Rubin CPX is the first CUDA GPU purpose-built for massive-context AI, where models reason across millions of tokens of knowledge all at once."

Rubin CPX is equipped with 128GB of GDDR7 memory and delivers up to 30PFLOPS of AI compute at NVFP4 precision, making it well suited to long-context processing (more than 1 million tokens) and video generation tasks.

The Vera Rubin NVL144 CPX platform integrates 144 Rubin CPX GPUs, 144 Rubin GPUs, and 36 Vera CPUs in a single rack, delivering 8EFLOPS of AI performance (at NVFP4 precision), 100TB of fast memory, and 1.7PB/s of memory bandwidth.

Its AI performance is more than twice that of NVIDIA's Vera Rubin NVL144 platform and 7.5 times that of the GB300 NVL72 system based on Blackwell Ultra. Compared with the GB300 NVL72 system, it also provides 3 times faster attention.

The Rubin CPX GPU is expected to be available at the end of 2026.

01.

New dedicated GPU:

128GB memory, 30PFLOPS computing power

Rubin CPX is built on the NVIDIA Rubin architecture and uses a cost-effective single-chip design. It carries 128GB of GDDR7 memory and is optimized to deliver 30PFLOPS at NVFP4 precision, offering performance and token revenue far beyond existing systems for AI inference tasks, especially long-context processing (more than 1 million tokens) and video generation.

Compared with the NVIDIA GB300 NVL72 system, the dedicated GPU also delivers 3x faster attention, improving an AI model's ability to handle longer context sequences without slowing down.

By comparison, the Rubin GPU announced in March this year has a peak inference capability of 50PFLOPS at FP4 precision. NVIDIA announced NVFP4, its new 4-bit floating-point format, only in June this year. The goal of the format is to preserve model accuracy at ultra-low precision.

NVIDIA's analysis shows that when DeepSeek-R1-0528 is quantized from its original FP8 format to NVFP4 using post-training quantization (PTQ), its accuracy on key language modeling tasks drops by no more than 1%. On AIME 2024, the NVFP4 version was even 2% more accurate.
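To make the idea of 4-bit floating-point quantization concrete, here is a minimal Python sketch of block-scaled quantization. It is a simplified illustration, not NVIDIA's implementation: it assumes an E2M1 value grid and blocks of 16 values, and it keeps each block's scale as an ordinary Python float, whereas the real format stores scales in low precision. All function and variable names are hypothetical.

```python
# Simplified illustration of block-scaled 4-bit floating-point quantization.
# Assumptions (not taken from this article): an E2M1 value grid, blocks of
# 16 values, and a plain Python float as the per-block scale.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, values), where each value is a signed grid point."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    values = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        values.append(nearest if x >= 0 else -nearest)
    return scale, values


def quantize(weights):
    """Split the weights into blocks and quantize each block independently."""
    blocks = [weights[i:i + BLOCK_SIZE] for i in range(0, len(weights), BLOCK_SIZE)]
    return [quantize_block(block) for block in blocks]


def dequantize(quantized):
    """Rebuild approximate weights from (scale, values) pairs."""
    restored = []
    for scale, values in quantized:
        restored.extend(scale * v for v in values)
    return restored


weights = [0.01 * i - 0.15 for i in range(32)]
restored = dequantize(quantize(weights))
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max absolute error after round trip: {max_error:.4f}")
```

Because every block gets its own scale, an outlier only degrades the 16 values that share its block, which is one reason small-block formats can hold accuracy at such low precision.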

Rubin CPX uses GDDR7, which is cheaper than the 288GB of HBM4 high-bandwidth memory that comes with the Rubin GPU.

02.

Single-rack AI performance reaches 8EFLOPS,

Provides 100TB of fast memory and 1.7PB/s memory bandwidth

Within the new NVIDIA Vera Rubin NVL144 CPX platform, Rubin CPX works alongside the Vera CPUs and Rubin GPUs, which handle generation-phase processing, to form a complete high-performance disaggregated serving solution.

The Vera Rubin NVL144 CPX platform integrates 144 Rubin CPX GPUs, 144 Rubin GPUs, and 36 Vera CPUs in a single rack, delivering 8EFLOPS of AI performance (at NVFP4 precision), 100TB of fast memory, and 1.7PB/s of memory bandwidth.

Its AI performance is more than twice that of NVIDIA's Vera Rubin NVL144 platform and 7.5 times that of the GB300 NVL72 rack system based on Blackwell Ultra.
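The rack-level figures can be cross-checked against the per-GPU numbers. The short Python sketch below uses only figures quoted in this article; the variable names are illustrative. How the remaining compute and memory are split between the Rubin GPUs and Vera CPUs is not stated here, so it is left as a remainder.

```python
# Back-of-the-envelope check using only numbers quoted in this article.
CPX_PER_RACK = 144      # Rubin CPX GPUs in one Vera Rubin NVL144 CPX rack
CPX_PFLOPS = 30         # NVFP4 compute per Rubin CPX, in PFLOPS
CPX_MEMORY_GB = 128     # GDDR7 per Rubin CPX, in GB
RACK_EFLOPS = 8         # quoted rack-level NVFP4 compute, in EFLOPS
RACK_MEMORY_TB = 100    # quoted rack-level fast memory, in TB

# Contribution of the Rubin CPX GPUs alone (1 EFLOPS = 1000 PFLOPS, decimal TB).
cpx_eflops = CPX_PER_RACK * CPX_PFLOPS / 1000
cpx_memory_tb = CPX_PER_RACK * CPX_MEMORY_GB / 1000

# Whatever the CPX GPUs do not account for comes from the rest of the rack.
other_eflops = RACK_EFLOPS - cpx_eflops
other_memory_tb = RACK_MEMORY_TB - cpx_memory_tb

# A 7.5x speedup is the same as a 650% increase, the figure in the headline.
gain_percent = (7.5 - 1) * 100

print(f"CPX compute: {cpx_eflops:.2f} EFLOPS, rest of rack: {other_eflops:.2f} EFLOPS")
print(f"CPX memory: {cpx_memory_tb:.1f} TB, rest of rack: {other_memory_tb:.1f} TB")
print(f"7.5x corresponds to a {gain_percent:.0f}% increase")
```

On these numbers, the Rubin CPX GPUs supply a little over half of the rack's quoted compute but less than a fifth of its fast memory, which fits their role as compute-heavy context processors.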

NVIDIA also shared benchmark results for the GB300 NVL72 system on Tuesday, with DeepSeek-R1 inference performance 1.4 times higher than the previous generation. The system also set records on all of the new data center benchmarks added in the MLPerf Inference v5.1 suite, including Llama 3.1 405B Interactive, Llama 3.1 8B, and Whisper.

NVIDIA plans to offer a dedicated Rubin CPX compute tray to customers who want to reuse their existing Vera Rubin NVL144 systems.

Rubin CPX is available in a variety of configurations, including Vera Rubin NVL144 CPX, which can be used in conjunction with the NVIDIA Quantum‑X800 InfiniBand scale-out computing architecture or the Spectrum-X Ethernet network platform equipped with NVIDIA Spectrum-XGS Ethernet technology and ConnectX-9 SuperNIC.

NVIDIA is also expected to launch a dual-rack product that pairs a Vera Rubin NVL144 CPX rack with a Vera Rubin NVL144 rack, raising fast memory capacity to 150TB.

03.

Built for disaggregated inference,

Paired with NVIDIA's flagship GPU

What is the difference between this brand new dedicated GPU and Nvidia's flagship GPU?

According to Shar Narasimhan, a product director in NVIDIA's data center business, Rubin CPX will serve as NVIDIA's dedicated GPU for context and prefill computation, significantly improving the performance of massive-context AI applications. The original Rubin GPU is responsible for generation and decoding.

Inference consists of two phases: the context phase and the generation phase. The two place completely different demands on infrastructure.

The context phase is compute-bound: it requires high-throughput processing to ingest and analyze large amounts of input data before producing the first output token.

The generation phase is memory-bandwidth-bound: it relies on fast memory transfers and high-speed interconnects (such as NVLink) to sustain token-by-token output performance.
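A rough way to see why the two phases stress different resources is to compare how many floating-point operations each performs per byte of model weights read. The Python sketch below uses the common approximation of about 2 FLOPs per parameter per token and ignores attention and key-value cache traffic. The chunk size, batch size, and 4-bit weight width are illustrative assumptions, not NVIDIA figures.

```python
def flops_per_weight_byte(tokens_per_pass, bytes_per_param):
    """Approximate arithmetic intensity of one pass over the model weights.

    Assumes ~2 FLOPs per parameter per token and that every weight is read
    once per pass, so the parameter count cancels out.
    """
    return 2 * tokens_per_pass / bytes_per_param


BYTES_PER_PARAM = 0.5        # 4-bit weights (assumption)
PREFILL_CHUNK_TOKENS = 8192  # prompt tokens processed together (assumption)
DECODE_TOKENS = 1            # one new token per pass for a single request

prefill_intensity = flops_per_weight_byte(PREFILL_CHUNK_TOKENS, BYTES_PER_PARAM)
decode_intensity = flops_per_weight_byte(DECODE_TOKENS, BYTES_PER_PARAM)

print(f"context phase:    {prefill_intensity:.0f} FLOPs per weight byte")
print(f"generation phase: {decode_intensity:.0f} FLOPs per weight byte")
print(f"ratio: {prefill_intensity / decode_intensity:.0f}x")
```

With many prompt tokens sharing each weight read, the context phase keeps the arithmetic units busy, while the generation phase spends most of its time waiting on memory. Batching several requests during generation raises its intensity, but the gap remains large.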

Disaggregated inference lets these phases be processed independently, enabling targeted optimization of compute and memory resources. This architectural shift can increase throughput, reduce latency, and improve overall resource utilization.

But disaggregation brings new complexity: it requires precise coordination of low-latency key-value (KV) cache transfers, LLM-aware routing, and efficient memory management.
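The coordination problem can be pictured with a toy router. The Python sketch below is a hypothetical illustration of the split into a context pool and a generation pool, not how NVIDIA's software works: the `Worker` and `Request` classes, the least-loaded policy, and worker names such as `cpx-0` are all invented for the example, and the key-value cache is reduced to a small dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active: int = 0  # requests currently assigned to this worker


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    trace: list = field(default_factory=list)


def least_loaded(pool):
    """Pick the worker with the fewest active requests (first one on ties)."""
    return min(pool, key=lambda worker: worker.active)


def serve(request, prefill_pool, decode_pool):
    # 1. Context phase: one compute-heavy pass over the whole prompt.
    prefill_worker = least_loaded(prefill_pool)
    prefill_worker.active += 1
    kv_cache = {"request_id": request.request_id, "tokens": request.prompt_tokens}
    request.trace.append(("prefill", prefill_worker.name))
    prefill_worker.active -= 1  # the context phase finishes quickly

    # 2. Hand the key-value cache over to a generation worker.
    decode_worker = least_loaded(decode_pool)
    decode_worker.active += 1  # stays busy while tokens are being generated
    request.trace.append(("kv_transfer", f"{prefill_worker.name}->{decode_worker.name}"))

    # 3. Generation phase: token-by-token output on the generation worker.
    request.trace.append(("decode", decode_worker.name))
    return kv_cache, decode_worker


prefill_pool = [Worker("cpx-0"), Worker("cpx-1")]
decode_pool = [Worker("rubin-0"), Worker("rubin-1")]

first = Request("req-1", prompt_tokens=1_000_000)
second = Request("req-2", prompt_tokens=250_000)
serve(first, prefill_pool, decode_pool)
serve(second, prefill_pool, decode_pool)

for request in (first, second):
    print(request.request_id, request.trace)
```

Even in this toy version, the router has to track load separately per pool and move the cache between workers, which is where the low-latency transfer and memory management requirements come from.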

NVIDIA built the Rubin CPX GPU to provide specialized acceleration for the compute-intensive long-context phase and to integrate seamlessly into a disaggregated infrastructure.

NVIDIA optimizes inference by matching GPU capabilities to context and generation workloads.

The Rubin CPX GPU is optimized for efficient processing of long sequences. It is designed to boost long-context performance and complement existing infrastructure, improving throughput and responsiveness while delivering scalable efficiency and maximizing return on investment (ROI) for large-scale generative AI workloads.

To process video, an AI model may need to handle up to 1 million tokens for 1 hour of content, which pushes the limits of traditional GPU computing. Rubin CPX integrates video decoders and encoders together with long-context inference processing on a single chip, providing unprecedented capabilities for applications such as video search and high-quality video generation.

Rubin CPX will be able to run the latest multimodal models of the NVIDIA Nemotron family, providing state-of-the-art inference capabilities for enterprise-level AI agents. For production-grade AI, the Nemotron model can be delivered through the NVIDIA AI Enterprise software platform.

04.

Conclusion: a 30x to 50x return on investment,

Every $100 million invested can bring in $5 billion in revenue

Vera Rubin NVL144 CPX uses NVIDIA Quantum-X800 InfiniBand or Spectrum-X Ethernet, paired with ConnectX-9 SuperNICs and orchestrated by the Dynamo platform, to support the next wave of million-token-context AI inference workloads and reduce inference costs.

At scale, the platform can deliver a 30x to 50x return on investment, equivalent to as much as $5 billion in revenue for every $100 million of capital expenditure. NVIDIA said this "sets a new benchmark for inference economics."
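The headline revenue figure follows directly from the upper end of that range. A minimal Python check, assuming the stated return multiple applies to revenue per dollar of capital expenditure (the function name is illustrative):

```python
def revenue_for(capex_usd, roi_multiple):
    """Revenue implied by a given capital outlay and return multiple."""
    return capex_usd * roi_multiple


CAPEX_USD = 100_000_000  # $100 million of capital expenditure

low = revenue_for(CAPEX_USD, 30)
high = revenue_for(CAPEX_USD, 50)

print(f"30x: ${low / 1e9:.1f} billion")
print(f"50x: ${high / 1e9:.1f} billion")
```

The $5 billion figure therefore corresponds to the 50x case; at 30x the same outlay implies $3 billion.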

Rubin CPX will help transform AI coding assistants from simple code-generation tools into sophisticated systems that can understand and optimize large software projects.

Cursor, the well-known American AI coding platform, AI video generation startup Runway, and AI coding startup Magic are among the companies exploring Rubin CPX GPUs to accelerate applications such as code generation and complex video generation.