[LINK] The Sunway Pro CPU
Stephen Loosley
StephenLoosley at outlook.com
Mon Nov 27 20:16:35 AEDT 2023
China's secretive Sunway Pro CPU quadruples performance over its predecessor, allowing the supercomputer to hit exaflop speeds
By Anton Shilov 3 days ago
https://www.tomshardware.com/tech-industry/supercomputers/chinas-secretive-sunway-pro-cpu-quadruples-performance-over-its-predecessor-allowing-the-supercomputer-supercomputer-to-hit-exaflop-speeds
China continues to advance supercomputing technologies despite U.S. sanctions.
Earlier this year, the National Supercomputing Center in Wuxi (an entity blacklisted in the U.S.) launched its new supercomputer based on the enhanced China-designed Sunway SW26010 Pro processors with 384 cores.
Sunway's SW26010 Pro CPU not only packs more cores than its non-Pro SW26010 predecessor, but it more than quadrupled FP64 compute throughput due to micro-architectural and system architecture improvements, according to Chips and Cheese.
However, while the manycore CPU is good on paper, it has several performance bottlenecks.
First details of the many-core Sunway SW26010 Pro CPU and the supercomputers that use it emerged back in 2021.
Recently, at SC23, the company showcased actual processors and disclosed more details about their architecture and design, which represent a significant leap in performance.
The new CPU is expected to enable China to build high-performance supercomputers based entirely on domestically developed processors.
Each Sunway SW26010 Pro has a maximum FP64 throughput of 13.8 TFLOPS.
For comparison, AMD's 96-core EPYC 9654 has a peak FP64 performance of around 5.4 TFLOPS, or roughly 2.5 times less.
The SW26010 Pro is an evolution of the original SW26010, so it maintains the foundational architecture of its predecessor but introduces several key enhancements.
The new SW26010 Pro processor is based on an all-new proprietary 64-bit RISC architecture and packs six core groups (CG) and a protocol processing unit (PPU).
Each CG integrates 64 2-wide compute processing elements (CPEs) featuring a 512-bit vector engine as well as 256 KB of fast local store (scratchpad cache) for data and 16 KB for instructions; one management processing element (MPE), which is a superscalar out-of-order core with a vector engine, 32 KB/32 KB L1 instruction/data cache, 256 KB L2 cache; and a 128-bit DDR4-3200 memory interface.
MPEs and CPEs use a directory-based protocol to enable coherent data sharing to reduce data movement between cores and support fine-grained interactions between different cores, which is particularly important for applications with irregular data sharing access.
With six CGs, each SW26010 Pro processor has 384 CPEs and six MPEs, for 390 cores in total plus a PPU.
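The core count follows directly from the per-core-group figures quoted above (six CGs, each with 64 CPEs and one MPE). A minimal sketch of that arithmetic; the constant and function names are illustrative only and do not come from any Sunway documentation:

```python
# Figures as quoted in the article for the SW26010 Pro.
CORE_GROUPS = 6        # core groups (CGs) per processor
CPES_PER_GROUP = 64    # compute processing elements per CG
MPES_PER_GROUP = 1     # management processing elements per CG


def core_counts(groups=CORE_GROUPS, cpes=CPES_PER_GROUP, mpes=MPES_PER_GROUP):
    """Return (total CPEs, total MPEs, total cores) for one processor."""
    total_cpes = groups * cpes
    total_mpes = groups * mpes
    return total_cpes, total_mpes, total_cpes + total_mpes


total_cpes, total_mpes, total_cores = core_counts()
print(f"{total_cpes} CPEs + {total_mpes} MPEs = {total_cores} cores")
```

The PPU is not counted as a core here, matching how the article tallies it.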
Not only does the SW26010 Pro run faster than the predecessor (CPE runs at 2.25 GHz, MPE runs at 2.10 GHz instead of 1.45 GHz for CPE and MPE on the predecessor), but the new 64-bit RISC microarchitecture on the SW26010 Pro CPU has been completely revamped to quadruple the processor's FP64 data processing throughput.
To provide more memory bandwidth to new cores, designers shifted the CPU from DDR3 to DDR4 memory controllers, which significantly increased memory bandwidth and capacity. Each CG is now equipped with 16 GB of DDR4 memory, doubling the 8 GB of DDR3 memory found in each cluster of the SW26010. This enhancement increases the total memory supported by one CPU from 32 GB in the SW26010 to 96 GB in the SW26010 Pro.
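The capacity totals can be checked from the per-group figures. In the sketch below, the four core groups for the original SW26010 are inferred from the article's own numbers (32 GB total at 8 GB per group), and the function name is illustrative only:

```python
def total_memory_gb(core_groups, gb_per_group):
    """Total memory per processor, given one memory pool per core group."""
    return core_groups * gb_per_group


# SW26010: group count inferred from 32 GB total / 8 GB per group.
sw26010_gb = total_memory_gb(core_groups=4, gb_per_group=8)
# SW26010 Pro: six core groups with 16 GB each, as stated in the article.
sw26010_pro_gb = total_memory_gb(core_groups=6, gb_per_group=16)

print(f"SW26010: {sw26010_gb} GB, SW26010 Pro: {sw26010_pro_gb} GB")
print(f"Capacity ratio: {sw26010_pro_gb / sw26010_gb:.1f}x")
```

Per-group capacity doubles, but the total triples because the Pro part also adds two more core groups.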
Despite these advancements, both the SW26010 and SW26010 Pro share a common limitation in their cache and memory subsystem. The SW26010 Pro attempts to address its cache issue by increasing the scratchpad capacity to 256 KB, up from the 64 KB in the SW26010.
But a 256 KB scratchpad per CPE, with no proper L2 cache behind it, is not enough, so both processors still have a major performance bottleneck. Meanwhile, a dual-channel DDR4-3200 (51.2 GB/s) memory subsystem per CG is barely enough for 64 cores, each featuring a 512-bit vector FPU and capable of up to 16 FP64 FLOPS/cycle.
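To put the bandwidth concern in numbers, the sketch below derives per-CG memory bandwidth, peak FP64 throughput, and the resulting bytes available per FLOP from the figures quoted in the article. It is a back-of-the-envelope estimate: it counts CPEs only (ignoring the MPEs), assumes every CPE sustains 16 FP64 FLOPS per cycle at 2.25 GHz, and uses illustrative names that are not from any vendor documentation.

```python
# Figures as quoted in the article.
BUS_WIDTH_BITS = 128      # DDR4 interface width per core group
TRANSFER_RATE_MTS = 3200  # DDR4-3200: mega-transfers per second
CPES_PER_GROUP = 64
FLOPS_PER_CYCLE = 16      # peak FP64 FLOPS per cycle per CPE
CPE_CLOCK_GHZ = 2.25
CORE_GROUPS = 6


def bandwidth_gbs(bus_width_bits, transfer_rate_mts):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mts / 1000


def peak_gflops(cores, flops_per_cycle, clock_ghz):
    """Peak throughput in GFLOPS, assuming all cores run at full rate."""
    return cores * flops_per_cycle * clock_ghz


cg_bandwidth = bandwidth_gbs(BUS_WIDTH_BITS, TRANSFER_RATE_MTS)
cg_gflops = peak_gflops(CPES_PER_GROUP, FLOPS_PER_CYCLE, CPE_CLOCK_GHZ)
chip_tflops = CORE_GROUPS * cg_gflops / 1000
bytes_per_flop = cg_bandwidth / cg_gflops

print(f"Bandwidth per CG:  {cg_bandwidth:.1f} GB/s")
print(f"Peak FP64 per CG:  {cg_gflops:.0f} GFLOPS")
print(f"Peak FP64 per CPU: {chip_tflops:.1f} TFLOPS")
print(f"Bytes per FLOP:    {bytes_per_flop:.3f}")
```

Under these assumptions each core group gets only about 0.02 bytes of DRAM bandwidth per peak FLOP, which is why workloads that cannot keep their working set in the scratchpads are likely to be memory-bound. The per-CPU total also reproduces the 13.8 TFLOPS figure quoted earlier.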
In conclusion, the SW26010 Pro represents a significant step forward from the SW26010, particularly in terms of memory capacity, compute density, and overall performance.
These enhancements demonstrate China's growing prowess in supercomputing.
However, the new processor has two main drawbacks: a weak caching subsystem (which can be mitigated with software optimizations, but these are costly in both time and money) and insufficient memory bandwidth.
As a result, it remains to be seen whether it can be used to build systems that truly offer ExaFLOPS performance levels on complex real-world problems.