x86 processors currently dominate the PC and server markets, while Arm processors dominate the mobile market and hold a large share of the IoT market. In recent years, however, the RISC-V architecture has become very popular in the energy-efficiency-focused IoT space thanks to its open-source ISA, streamlined instruction set, and extensibility.

Driven by RISC-V International and related chip manufacturers, RISC-V has also begun to enter the server market with higher performance requirements.

In early 2023, RISC-V International identified HPC as a strategic priority area for RISC-V growth. Coupled with the recently ratified vector extension and a wave of software efforts to port key HPC libraries and tools, momentum in this area is clearly growing rapidly.

Many projects around the world, such as the European eProcessor project, the Esperanto CPU with thousands of RISC-V cores, and the multi-vendor RISE project aimed at developing key RISC-V software support, could drive adoption of RISC-V in high-end computing, including HPC, and ultimately enable the community to build supercomputers around this technology.

Additionally, early application research supports the benefits that RISC-V can bring to high-performance workloads.

In December 2022, chip start-up Ventana Microsystems unveiled the world's first 192-core RISC-V CPU, the Veyron V1, at the RISC-V Summit.

According to reports, the Veyron V1 is manufactured on an advanced 5 nm process and built around Ventana's in-house high-performance RISC-V core. The core uses an 8-wide pipeline, supports out-of-order execution, and clocks up to 3.6 GHz. Each cluster holds up to 16 cores, and multiple clusters can scale the chip to 192 cores with 48 MB of shared L3 cache. The design also includes advanced side-channel attack mitigations, an IOMMU and the Advanced Interrupt Architecture (AIA), comprehensive RAS features, and top-down software performance tuning to meet a range of data-center needs.

According to data disclosed by Ventana, in the SPECint2017 test at 300 W the 128-core version of the Veyron V1 is significantly ahead of the 64-core AMD EPYC Milan 7763 (280 W), and roughly twice as fast as the 64-core AWS Graviton3 (Neoverse V1 cores) and the 40-core Intel Xeon Ice Lake 8380 (270 W). Of course, this is largely because it has twice as many cores as the competing products.

It should be pointed out that the Veyron V1 has no SIMD or vector execution unit, which puts it at a clear disadvantage against Intel and AMD server processors with AVX-512.

In addition, the Veyron V1 is not yet in mass production; Ventana had promised to provide customer samples in the second or third quarter of this year. The officially announced figures above therefore remain on paper.

In contrast, the SG2042, a 64-core RISC-V server chip launched by a Chinese manufacturer in March this year, has already shipped in small batches.

Recently, researcher Nick Brown benchmarked this chip with the RAJAPerf suite and found that its average per-core performance is 5 to 10 times that of the most recent widely available RISC-V chips. Under multi-threaded workloads, however, high-performance x86 CPUs still deliver 4 to 8 times its average performance.

According to the research report, the 64-core RISC-V processor runs at 2 GHz and is built from clusters of four high-performance C920 cores, each core using a 12-stage out-of-order multi-issue superscalar pipeline.

The C920 implements the RV64GCV instruction set with three-wide decode, four-wide rename/dispatch, eight issue/execution units and two load/store units. It supports the vector standard extension RVV v0.7.1 with a 128-bit vector width and the data types FP16, FP32, INT8, INT16, INT32 and INT64. However, the C920 does not support FP64 vectorization.

Double-precision floating point underpins the vast majority of high-performance workloads, so cores that can vectorize these operations may deliver much higher HPC performance, the study notes. Each C920 core also contains 64 KB of L1 instruction (I) and data (D) cache and shares 1 MB of L2 cache within its four-core cluster, while 64 MB of L3 system cache is shared across all cores. Four DDR4-3200 memory controllers and 32 PCIe Gen4 lanes are also provided.

An important consideration for HPC workloads is vectorization, and since the C920 core only supports RVV v0.7.1, compiler support is a challenge. The current upstream RISC-V GNU compiler does not support any version of the vector extension. Although the GNU repository contains an rvv-next branch intended to support RVV v1.0, it was not being actively maintained when the researchers wrote their study.

There was also an rvv-0.7.1 branch for RVV v0.7.1, but it has since been removed. Because of this lack of mainline GCC support, T-Head provides its own fork of the GNU compiler (Xuantie GCC), optimized for its processors.

T-Head's custom compiler supports both RVV v0.7.1 and T-Head's own custom extensions. Of the several versions available, GCC 8.4 from the 20210618 release offers the best auto-vectorization, so this was the version the researchers chose for their benchmarking experiments.

This version of the compiler generates vector length specific (VLS) RVV code tied to the C920's 128-bit vector width. All kernels were compiled at optimization level three, and all reported results are averaged over five runs.
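What "vector length specific" means can be sketched in plain C. The loop below is a hypothetical illustration, not the compiler's actual output: it strip-mines in fixed groups of four FP32 lanes (128 bits / 32 bits = 4), which is the shape VLS code generation bakes in, whereas vector-length-agnostic code would query the vector length at run time.

```c
#include <stddef.h>

/* Sketch of the VLS idea for a 128-bit vector unit: the main loop
   processes a fixed group of four FP32 elements per iteration
   (one 128-bit vector operation's worth), with a scalar tail for
   the remainder. This is illustrative C, not generated RVV code. */
static void saxpy_vls128(size_t n, float a, const float *x, float *y) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* fixed 4-lane group */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)             /* scalar tail */
        y[i] += a * x[i];
}
```

The fixed group width is why VLS binaries are tied to one hardware vector length: the same code would waste lanes on a 256-bit machine.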

Comparison with other high-performance RISC-V cores

The researchers compared the performance of the SG2042 with the StarFive VisionFive V1 and VisionFive V2 development boards. The V1 contains the StarFive JH7100 SoC, while the V2 contains the StarFive JH7110 SoC.

Both SoCs, the JH7100 and JH7110, are built on the 64-bit RISC-V SiFive U74 core, with the JH7100 containing two cores and the JH7110 four. Both are listed as running at 1.5 GHz, and each U74 core has 32 KB (D) and 32 KB (I) of L1 cache, with 2 MB of L2 cache shared between the cores.

However, the SiFive U74 only implements RV64GC and therefore does not support the RISC-V vector extension.

△Figure 1 shows the single-core performance of the VisionFive V2 and V1 against the SG2042 in double precision (FP64) and single precision (FP32). Bars show the average speedup or slowdown across each category, and the lines mark the range from largest to smallest.

As can be seen in Figure 1, a single C920 core outperforms the U74 cores of V2 and V1 in both double and single precision.

At double precision, the C920 core averages 4.3 to 6.5 times the performance of the U74 in the V2, and at single precision it averages 5.6 to 11.8 times across the benchmarks. That is an impressive gain, and no kernel runs slower on the C920 than on the U74.

Some kernels are especially impressive on the C920; for example, the memset kernel from the algorithm category runs 40 times faster in FP32 and 18 times faster in FP64 than on the U74.

It is important to emphasize that each core is benchmarked in its best possible configuration: vectorization is used on the C920, but the U74 does not support it, so it is unavailable on the V1 and V2.

There is a significant performance difference between FP32 and FP64 on the SG2042, reflecting the fact that the C920's vector unit does not support FP64. By comparison, the gap between double and single precision on the V2 is much smaller.

One aspect of the results in Figure 1 that surprised the researchers was that the VisionFive V1 was significantly slower than the V2. Since the tests ran RAJAPerf on a single core, the dual-core versus quad-core difference should not matter: both chips contain the same U74 core, so performance should be roughly similar.

However, the V1 is three to six times slower than the V2 at double precision, and one to three times slower at single precision. One might suspect the V1 runs at a lower clock frequency than the V2, although both are listed at 1.5 GHz in their datasheets, and there is no documentation or machine output to confirm this.

As can be seen in Figure 1, the performance achieved by a single C920 core is impressive compared to existing, publicly available commodity RISC-V cores. T-Head describes the core as a high-performance RISC-V processor.

Tests also show significant performance improvements across the entire benchmark suite compared to the U74, which was previously considered the best widely available RISC-V CPU on which to experiment with HPC workloads.

In addition to its single-core performance, the SG2042 is also far ahead of the V1's JH7100 and the V2's JH7110 SoCs in core count.

Comparison with x86 server CPU performance

So how does the SG2042 perform on HPC workloads compared with commercial x86 server chips?

To answer this, the researchers compared it with CPUs used in current-generation servers: the 64-core AMD Rome EPYC 7742, the 18-core Intel Broadwell Xeon E5-2695, the 28-core Intel Ice Lake Xeon 6330, and the 4-core Intel Sandy Bridge Xeon E5-2609.

The tests were run only on the physical cores of these x86 CPUs, with SMT disabled throughout.

The AMD EPYC 7742 contains 64 physical cores across four NUMA regions of 16 cores each, and eight memory controllers. Each core has 32 KB (I) and 32 KB (D) of L1 cache and 512 KB of L2 cache, with 16 MB of L3 cache shared between four cores. The EPYC 7742 supports AVX2, with 256-bit vector registers, twice the width of the SG2042's, and supports FP64 vectorization.

The Intel Xeon E5-2695's 18 physical cores sit in a single NUMA region, providing 32 KB (I) and 32 KB (D) of L1 cache, 256 KB of L2 cache, and 45 MB of L3 cache shared across cores. Like the AMD EPYC 7742, the Xeon E5-2695 supports AVX2; it has four memory controllers.

The Intel Xeon 6330 is the newest CPU in the comparison, with all 28 physical cores in one NUMA region, eight memory controllers, 32 KB (I) and 48 KB (D) of L1 cache, 1 MB of L2 cache per core, and 43 MB of shared L3 cache. The Xeon 6330 supports AVX-512 and provides 512-bit vector registers.

The Intel Xeon E5-2609 is the oldest CPU in this test. Released in 2012, it offers only four physical cores. Each core has 64 KB (I) and 64 KB (D) of L1 cache, plus 256 KB of L2 cache and a shared 10 MB L3 cache. The E5-2609 supports only AVX, so its vector register length is the same as the SG2042's at 128 bits, although AVX does support FP64.

In all tests, hyperthreading was disabled on the x86 CPUs. The researchers used GCC version 8.3 on all systems except ARCHER2, and compilation was always performed at optimization level O3. Each system was run with the thread count that delivered its best performance.

△Figure 4 shows the single-core performance of each chip running the benchmark suite in FP64. Bars show the average speedup or slowdown across each category, the lines mark the range from largest to smallest, and the SG2042 is the baseline.

From the test results, all x86 cores outperform the C920 except the venerable Xeon E5-2609 core, whose average performance is slower in the stream and algorithm benchmark categories.

The AMD EPYC 7742 and Intel Xeon 6330 tend to perform better than the Intel Xeon E5-2695, which is understandable since the Xeon E5-2695 is the oldest of the three.

△Figure 5 shows each chip's single-core performance on the benchmark suite in FP32, relative to the baseline.

As Figure 5 shows, the AMD EPYC 7742 is fairly lackluster in single precision compared with double precision, while the Intel processors hold up just as well. In fact, under FP32 even the venerable Xeon E5-2609 core outperforms the C920 on average in every category.

However, the average bar graph in Figure 5 does not provide the complete picture.

The C920 only supports vectorization in FP32, and indeed, as the range lines in Figures 4 and 5 show, many benchmark categories run relatively faster on the C920 in FP32 than in FP64.

Additionally, under FP32 there are more kernels that run slower on an x86 CPU than on the C920. These are kernels where auto-vectorization is applied effectively; indeed, for the lcals benchmark category, at least one kernel on every x86 CPU performs worse than on the C920.

In summary, for single-core performance in FP32, the AMD EPYC 7742 averages 3 times faster than the C920, the Intel Xeon E5-2695 2 times, the Intel Xeon 6330 4 times, and the Xeon E5-2609 2 times; in FP64 the corresponding figures are 4 times, 4 times, 5 times and 20% faster respectively.

△FP64 multi-threaded performance comparison, reported as the speedup or slowdown relative to the baseline

Figure 6 shows the performance comparison for double-precision FP64.

It can be seen that the basic, lcals, polybench and stream categories benefit most from additional cores, and the SG2042's average performance consequently beats that of the venerable Xeon E5-2609.

△FP32 multi-threaded performance comparison, reported as the speedup or slowdown relative to the baseline

Figure 7 shows the multi-threaded FP32 comparison, and these results contain the largest differences. To improve readability, the researchers capped the vertical axis and labeled the actual values that exceed it.

In multi-threaded FP32 the SG2042 tends to hold up slightly better against the x86 CPUs than it does in FP64, although the polybench category is an outlier: it does much better on the three newer x86 CPUs and much worse on the Intel Xeon E5-2609.

To summarize the multi-threaded comparison with x86 CPUs, the SG2042's 64-core average performance beats the 4-core Intel Xeon E5-2609 across all benchmark categories in both FP32 and FP64.

The 64-core AMD EPYC 7742 delivers 8 times and 5 times the SG2042's performance in FP32 and FP64 respectively; the 18-core Intel Xeon E5-2695 averages 6 times and 4 times in single and double precision; and the 28-core Intel Xeon 6330 is 6 times and 8 times faster in FP32 and FP64 respectively.

Conclusion

Although many companies are currently developing high-performance RISC-V hardware prototypes, until now the options for running workloads on commercially available RISC-V hardware have been very limited, the researchers said.

Regardless, while these solutions enable experimentation with RISC-V, they do not architecturally provide the features required for high-performance workloads. So although the HPC community is interested in RISC-V, the technology has not quite been ready for it.

Of course, as the world's first widely available multi-core RISC-V server chip for HPC, the SG2042 may significantly increase interest in and adoption of RISC-V within the HPC community. A key issue, however, is that it still lags far behind the x86 CPUs prevalent in the current generation of supercomputers.

Still, this is a very exciting RISC-V server chip that offers some significant changes compared to currently commercially available RISC-V hardware.

While performance is not yet at the level of x86 server CPUs, it should be emphasized that RISC-V vendors have come a long way in a short period of time. In contrast, x86 CPUs have a long history and benefit from their many years of development.

At present, RISC-V's main competitor in the server CPU market is the Arm server CPU; in theory, a RISC-V CPU can offer lower cost and greater customization and scalability than an Arm CPU.

For the next generation of high-performance RISC-V processors, the researchers believe supporting RVV v1.0 will be very useful, as it will enable mainline GCC and Clang to compile vectorized code.

In addition, FP64 vectorization, wider vector registers, larger L1 caches, and more memory controllers per NUMA region could deliver significant performance gains and help close the gap with high-performance x86 processors.

