Intel recently launched the 5th generation of Intel® Xeon® Scalable processors (Intel Xeon SP), code-named Emerald Rapids, a family of high-end, enterprise-focused processors targeted at a diverse range of workloads. To explore how Intel’s new chips measure up, we’ve worked with Intel and others to run benchmarks with Red Hat Enterprise Linux 8.8 / 9.2 and greater.

Intel’s 5th Gen Xeon Scalable processors are drop-in compatible with existing 4th Gen Xeon Scalable motherboards. They now support up to 64 cores per socket (up from 60), DDR5-5600 memory speeds (up from DDR5-4800 in the prior generation), up to 3x the last-level cache (LLC), and UPI 2.0 speeds of up to 20 GT/s. The Red Hat Performance Engineering team configured a peak prototype system from Intel for both of these models to conduct performance measurements.

SAP Performance

RHEL 8.8 SAP HANA Leadership on 5th Generation Intel Xeon Scalable Processor

Leaning on our long history of collaboration, Red Hat and Intel once again worked together to deliver state-of-the-art performance to enterprise data centers and beyond. Red Hat’s development and performance engineering teams have been working on hardware enablement and validation of these new scalable processors for more than a year, running a variety of benchmarks prior to the GA release of Red Hat Enterprise Linux.

Higher per-core performance, a larger last-level cache, and faster memory and storage, combined with workload-optimized cores, benefit overall system performance. To demonstrate performance and provide additional scalability and sizing information for SAP HANA applications and workloads, SAP introduced the Business Warehouse (BWH) edition of SAP HANA Standard Application Benchmark [1]. Presently on version 3, this benchmark simulates a variety of users with different analytical requirements and measures the key performance indicator (KPI) relevant to each of the three benchmark phases, which are defined below:

  1. Data load phase, testing data latency and load performance (lower is better)
  2. Query throughput phase, testing query throughput with moderately complex queries (higher is better)
  3. Query runtime phase, testing the performance of running very complex queries (lower is better)

Red Hat Enterprise Linux (RHEL) was used in several recent publications of the above benchmark. Specifically, results for two initial record sizes (1.3 and 2.6 billion records) on a Dell PowerEdge R760 server with 5th Gen Intel Xeon Scalable processors demonstrated that running the workload on Red Hat Enterprise Linux could deliver a significant performance boost over the previous generation of Intel servers (see Table 1).

Table 1. Results in scale-up category running SAP BW Edition for SAP HANA Standard Application Benchmark, Version 3 on SAP NetWeaver 7.50 and SAP HANA 2.0

|  | Initial Records (Billions) | Phase 1 (lower is better) | Phase 2 (higher is better) | Phase 3 (lower is better) |
| --- | --- | --- | --- | --- |
| Red Hat Enterprise Linux 8.8 [2] | 2.6 | 7,083 sec | 13,410 | 68 sec |
| SUSE Linux Enterprise Server 15 [3] | 2.6 | 10,404 sec | 9,917 | 76 sec |
| 5th generation Intel Xeon / Red Hat Enterprise Linux advantage |  | 31.9% | 35.2% | 10.5% |

Additionally, using a dataset size of 1.3 billion initial records, a Dell PowerEdge R760 server running Red Hat Enterprise Linux also outscored a previous-generation server on two out of three benchmark KPIs, demonstrating better dataset load time and query throughput, while trailing slightly on complex query runtime (see Table 2).

Table 2. Results in scale-up category running SAP BW Edition for SAP HANA Standard Application Benchmark, Version 3 on SAP NetWeaver 7.50 and SAP HANA 2.0

|  | Initial Records (Billions) | Phase 1 (lower is better) | Phase 2 (higher is better) | Phase 3 (lower is better) |
| --- | --- | --- | --- | --- |
| Red Hat Enterprise Linux 8.8 [4] | 1.3 | 6,069 sec | 17,846 | 65 sec |
| SUSE Linux Enterprise Server 15 [5] | 1.3 | 8,041 sec | 14,288 | 61 sec |
| 5th generation Intel Xeon / Red Hat Enterprise Linux advantage |  | 24.5% | 24.9% | -6.6% |
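The advantage rows in Tables 1 and 2 can be reproduced from the raw phase results. The following is a minimal sketch, assuming the advantage is the relative difference against the SUSE Linux Enterprise Server result, with the sign flipped for phases where lower is better. The function and variable names are illustrative and not part of any SAP tooling.

```python
def advantage_pct(rhel: float, baseline: float, lower_is_better: bool) -> float:
    """Percentage advantage of the RHEL result over the baseline for one phase."""
    if lower_is_better:
        # Time-based phases: share of the baseline time that was saved.
        return (baseline - rhel) / baseline * 100.0
    # Throughput phase: relative increase over the baseline score.
    return (rhel - baseline) / baseline * 100.0


# (RHEL 8.8 value, SLES 15 value, lower_is_better), copied from Tables 1 and 2.
TABLES = {
    "2.6 billion records": [(7083, 10404, True), (13410, 9917, False), (68, 76, True)],
    "1.3 billion records": [(6069, 8041, True), (17846, 14288, False), (65, 61, True)],
}


def table_advantages(rows):
    """Return the advantage for each phase, rounded to one decimal place."""
    return [round(advantage_pct(r, b, low), 1) for r, b, low in rows]


for label, rows in TABLES.items():
    print(label, table_advantages(rows))
```

A negative value, as in Phase 3 of the 1.3 billion record run, means the baseline system finished that phase faster.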

These results demonstrate Red Hat’s commitment to helping OEM partners and ISVs deliver high-performing solutions to our mutual customers and showcase close alignment between Red Hat and Dell that, in collaboration with SAP, led to the creation of certified, single-source solutions for SAP HANA. Available in both single-server and larger, scale-out configurations, Dell’s solution is optimized with Red Hat Enterprise Linux for SAP Solutions.

TPC-H @ SF = 10000

Another industry-standard benchmark is the TPC-H decision support benchmark from the Transaction Processing Performance Council (TPC).

The results show strong performance of HPE ProLiant DL380 class machines on the TPC-H benchmark at SF = 10000, with a 17.9% improvement in queries per hour (QphH) and a 31.4% price-performance gain (Price/kQphH). The audited TPC-H results were run by HPE using Microsoft SQL Server 2022 64-bit on 5th Gen Intel Xeon SP running RHEL 9.3, compared with a 4th Gen Intel Xeon SP result using the same SQL Server 2022 on the Microsoft Windows Server 2022 Standard Edition operating system. The combination of RHEL 9.3 and 5th Gen Intel Xeon SP helps show the value of upgrading both the server and the OS to a solution that achieved the #1 non-clustered 10,000 GB TPC-H performance result [6].

TPC-H w/ HPE DB @ 10 TB, SF = 10000

|  | Sponsor | System | Performance (QphH) | Price/kQphH | System Availability | Date Submitted | DB Software Name | OS Software Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prior 4th Gen Intel Xeon Processor | HPE | HPE ProLiant DL380 Gen11 | 2,028,444 | 821.80 USD | 5/1/2023 | 2/8/2023 | Microsoft SQL Server 2022 Enterprise Edition 64 bit | Microsoft Windows Server 2022 Standard Edition |
| NEW 5th Gen Intel Xeon Processor | HPE | HPE ProLiant DL380 Gen11 | 2,391,511 | 625.77 USD | 6/30/2024 | 1/25/2024 | Microsoft SQL Server 2022 Enterprise Edition 64 bit | Red Hat Enterprise Linux Server Release 9.3 |
| Speedup Gen5/Gen4 |  |  | 17.9% | 31.4% |  |  |  |  |
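As a rough cross-check of the speedup row, the sketch below derives both gains from the published QphH and Price/kQphH figures. It assumes the price-performance gain is the ratio of the old price to the new price minus one, which is our reading and not something the TPC report states. Under that assumption the price ratio comes out at roughly 1.31, in line with the reported gain.

```python
def pct_gain(better: float, worse: float) -> float:
    """Relative gain, in percent, of `better` over `worse`."""
    return (better / worse - 1.0) * 100.0


# Figures copied from the TPC-H table above.
GEN4 = {"qphh": 2_028_444, "usd_per_kqphh": 821.80}
GEN5 = {"qphh": 2_391_511, "usd_per_kqphh": 625.77}

# Performance: higher QphH is better, so divide new by old.
perf_gain = pct_gain(GEN5["qphh"], GEN4["qphh"])

# Price/kQphH: lower is better, so divide old by new (assumed method).
price_perf_gain = pct_gain(GEN4["usd_per_kqphh"], GEN5["usd_per_kqphh"])

print(f"QphH gain: {perf_gain:.1f}%")
print(f"Price-performance gain: {price_perf_gain:.1f}%")
```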

    

RHEL 9.4 (beta) AI/ML and computing performance with Intel® AMX

Here we explore the AI/ML capabilities of the 5th Gen Intel Xeon processor [7] by comparing its performance with that of the previous 4th Gen Intel Xeon processor [8], using some of the Phoronix Test Suite (PTS) benchmarks for PyTorch and TensorFlow, and the Neural Magic DeepSparse and Intel® OpenVINO™ test suites. These four benchmark suites have more than 100 subtests between them. See [9] to reproduce these results.

We also ran general CPU computing benchmarks, such as SPEC CPU Base Rate (estimated) and some two-dimensional FFTW tests, on our lab systems to compare apples to apples on RHEL 9.4 beta systems.

(Our SPEC CPU Base Rate results are not an official run. We used Intel binaries with the ic2024.0.2-lin-sapphirerapids-rate-20231213.cfg config.)

The results reflect out-of-the-box performance gains. None of the benchmarks have any 5th Gen Intel Xeon SP specific tunings or optimizations beyond what the compiler can detect automatically. Our results show that, relative to 4th Gen Intel Xeon SP, the average speedup factors for 5th Gen Intel Xeon SP range from 1.07 to 1.22, and the maximum speedups range from 1.19 to 1.89.
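To make the average and max speedup figures concrete, here is a minimal sketch of how per-subtest results could be reduced to those two numbers for one benchmark suite. The sample values are hypothetical placeholders, not our measurements, and the use of an arithmetic mean is an assumption. The geometric mean is shown alongside because it is a common alternative for averaging ratios.

```python
from statistics import fmean, geometric_mean


def speedup(gen4: float, gen5: float, higher_is_better: bool) -> float:
    """Speedup factor of the 5th Gen result relative to the 4th Gen result."""
    # For throughput-style metrics a larger value is faster; for
    # time-style metrics a smaller value is faster, so invert the ratio.
    return gen5 / gen4 if higher_is_better else gen4 / gen5


def summarize(results):
    """Reduce (gen4, gen5, higher_is_better) tuples to summary statistics."""
    factors = [speedup(g4, g5, hib) for g4, g5, hib in results]
    return {
        "average": fmean(factors),
        "geomean": geometric_mean(factors),
        "max": max(factors),
    }


# Hypothetical subtest results for one suite (placeholders only).
SAMPLE = [
    (100.0, 110.0, True),   # e.g. items/sec, higher is better
    (50.0, 40.0, False),    # e.g. latency in ms, lower is better
    (200.0, 300.0, True),   # e.g. images/sec, higher is better
]

summary = summarize(SAMPLE)
print({name: round(value, 2) for name, value in summary.items()})
```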

Graph comparing Average and Max Speedup

Summary

The Red Hat Performance Engineering team works with Intel to validate the performance of Red Hat Enterprise Linux on new systems before hardware vendors ship them in production. This blog reviewed a number of features of Intel’s 5th Gen Xeon Scalable processors, including higher core counts, faster DDR5 memory, larger L3 caches, and improved interprocessor bandwidth. All of these features are supported in shipping versions of RHEL 8.8 and RHEL 9.2. We shared how OEMs used these features to produce leading results on the SAP [1] and TPC [6] industry-standard benchmarks. We also ran tests on RHEL 9.4 beta showing significant speedups for CPU workloads and AI/ML benchmarks when comparing 5th Gen Intel Xeon SP to 4th Gen Intel Xeon SP.

The collaboration between Intel and Red Hat helps expand our capabilities and we will continue delivering innovative features in future versions of RHEL, where we hope to continue being the trusted OS for customers and partners.

Footnotes

[1] SAP Results as of March 1, 2023, SAP and SAP HANA are the registered trademarks of SAP AG in Germany and in several other countries. See www.sap.com/benchmark for more information

[2] Dell PowerEdge R760 (2 processor / 128 cores / 256 threads, Intel Xeon Platinum 8592+ processor, 1.9 GHz, 80 KB L1 cache and 2048 KB L2 cache per core, 320 MB L3 cache per processor, 1536 GB main memory). Certification number #2023076

[3] Atos BullSequana SH20 (2 processor / 120 cores / 240 threads, Intel Xeon Platinum 8490H processor, 1.9 GHz, 80 KB L1 cache and 2048 KB L2 cache per core, 112.5 MB L3 cache per processor, 1024 GB main memory). Certification number #2023028

[4] Dell PowerEdge R760 (2 processor / 128 cores / 256 threads, Intel Xeon Platinum 8592+ processor, 1.9 GHz, 80 KB L1 cache and 2048 KB L2 cache per core, 320 MB L3 cache per processor, 1536 GB main memory). Certification number #2023075

[5] Atos BullSequana SH20 (2 processor / 120 cores / 240 threads, Intel Xeon Platinum 8490H processor, 1.9 GHz, 80 KB L1 cache and 2048 KB L2 cache per core, 112.5 MB L3 cache per processor, 1024 GB main memory). Certification number #2023026

[6] TPC and TPC-H are trademarks of the Transaction Processing Performance Council. All third-party marks are property of their respective owners: see: https://www.tpc.org/tpch/results. All comparisons and claims as of March 15, 2024. Filtered by  10,000 GB results:  https://www.tpc.org/tpch/results/tpch_perf_results5.asp?resulttype=nonc…

[7] 5th Gen Intel Xeon SP Hardware Configuration

Processor:    2 x Intel Xeon Platinum 8592+ @ 3.90GHz (128 Cores / 256 Threads)
Motherboard:  Intel D50DNP1SBB (SE5C7411.86B.9533.D01.2310110651 BIOS)
Memory:       1008 GB @ 5800 MT/s
Architecture:            x86_64
 CPU op-mode(s):        32-bit, 64-bit
 Address sizes:         52 bits physical, 57 bits virtual
 Byte Order:            Little Endian
CPU(s):                  256
 On-line CPU(s) list:   0-255
Vendor ID:               GenuineIntel
 BIOS Vendor ID:        Intel(R) Corporation
 Model name:            INTEL(R) XEON(R) PLATINUM 8592+
   BIOS Model name:     INTEL(R) XEON(R) PLATINUM 8592+
   CPU family:          6
   Model:               207
   Thread(s) per core:  2
   Core(s) per socket:  64
   Socket(s):           2
   Stepping:            2
   CPU(s) scaling MHz:  100%
   CPU max MHz:         3900.0000
   CPU min MHz:         800.0000
   BogoMIPS:            3800.00
Flags:
   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
   tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
   cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm
   pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch
   cpuid_fault epb cat_l3 cat_l2 cdp_l3 cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid
   ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma
   clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc
   cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window
   hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg
   tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk
   pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features:
 Virtualization:        VT-x
Caches (sum of all):     
 L1d:                   6 MiB (128 instances)
 L1i:                   4 MiB (128 instances)
 L2:                    256 MiB (128 instances)
 L3:                    640 MiB (2 instances)
NUMA:
 NUMA node(s):          4
 NUMA node0 CPU(s):     0-31,128-159
 NUMA node1 CPU(s):     32-63,160-191
 NUMA node2 CPU(s):     64-95,192-223
 NUMA node3 CPU(s):     96-127,224-255
Vulnerabilities:         
 Gather data sampling:  Not affected
 Itlb multihit:         Not affected
 L1tf:                  Not affected
 Mds:                   Not affected
 Meltdown:              Not affected
 Mmio stale data:       Not affected
 Retbleed:              Not affected
 Spec rstack overflow:  Not affected
 Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
 Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
 Spectre v2:            Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
 Srbds:                 Not affected
 Tsx async abort:       Not affected

[8] 4th Gen Intel Xeon SP  Hardware Configuration

Processor:   2 x Intel Xeon Platinum 8480+ @ 3.80GHz (112 Cores / 224 Threads)
Motherboard: Dell 0VRV9X (1.3.2 BIOS)
Memory:      2016 GB @ 4800 MT/s
Architecture:            x86_64
 CPU op-mode(s):        32-bit, 64-bit
 Address sizes:         46 bits physical, 57 bits virtual
 Byte Order:            Little Endian
CPU(s):                  224
 On-line CPU(s) list:   0-223
Vendor ID:               GenuineIntel
 BIOS Vendor ID:        Intel
 Model name:            Intel(R) Xeon(R) Platinum 8480+
   BIOS Model name:     Intel(R) Xeon(R) Platinum 8480+
   CPU family:          6
   Model:               143
   Thread(s) per core:  2
   Core(s) per socket:  56
   Socket(s):           2
   Stepping:            8
   CPU(s) scaling MHz:  98%
   CPU max MHz:         3800.0000
   CPU min MHz:         800.0000
   BogoMIPS:            4000.00
Flags:
   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht 
   tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc 
   cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm 
   pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch 
   cpuid_fault epb cat_l3 cat_l2 cdp_l3 cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid 
   ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma 
   clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc 
   cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window 
   hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg 
   tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk 
   pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features:
 Virtualization:        VT-x
Caches (sum of all):
 L1d:                   5.3 MiB (112 instances)
 L1i:                   3.5 MiB (112 instances)
 L2:                    224 MiB (112 instances)
 L3:                    210 MiB (2 instances)
NUMA:
 NUMA node(s):          2
 NUMA node0 CPU(s):     0,2,4,6,8, . . .
 NUMA node1 CPU(s):     1,3,5,7,9, . . .
Vulnerabilities:
 Gather data sampling:  Not affected
 Itlb multihit:         Not affected
 L1tf:                  Not affected
 Mds:                   Not affected
 Meltdown:              Not affected
 Mmio stale data:       Not affected
 Retbleed:              Not affected
 Spec rstack overflow:  Not affected
 Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
 Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
 Spectre v2:            Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
 Srbds:                 Not affected
 Tsx async abort:       Not affected
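Both flag lists above include amx_tile, amx_bf16, and amx_int8, which is how Linux reports Intel AMX support. Before comparing AI/ML results on your own hardware, a check along the following lines can confirm that the host exposes the same flags. This is a minimal sketch that assumes a Linux host with /proc/cpuinfo. The helper names are our own.

```python
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # /proc/cpuinfo uses "flags"; lscpu prints "Flags", so ignore case.
        if key.strip().lower() == "flags":
            return set(value.split())
    return set()


def missing_amx_flags(flags):
    """List the AMX flags that are absent from the given flag set."""
    return [flag for flag in AMX_FLAGS if flag not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            missing = missing_amx_flags(cpu_flags(handle.read()))
    except OSError:
        print("Could not read /proc/cpuinfo on this system")
    else:
        print("AMX available" if not missing else f"Missing AMX flags: {missing}")
```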

[9] Using Phoronix-Test-Suites in Containers

The PTS framework is an extremely convenient way to run performance tests, and it has a large ecosystem with many recorded results available for comparison. For official information, including official instructions explaining how to run PTS tests, see Phoronix Test Suite and OpenBenchmarking.org.

We ran the AI/ML related tests in CentOS Stream 9 containers (on RHEL 9.4 beta hosts) to avoid any accidental modifications to the host system environment and to enforce a clean slate for each repeated trial.

Steps to reproduce the AI/ML related test results on your system:

  1. podman run -it --rm --net=host --privileged centos:stream9 /bin/bash
  2. sed -i "/\[crb\]/,+9s/enabled=0/enabled=1/" /etc/yum.repos.d/centos.repo
  3. dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
  4. dnf -y install atlas-devel autoconf automake binutils blas blas-devel boost-devel boost-thread bzip2 cmake expat-devel findutils gcc gcc-c++ gcc-gfortran gflags-devel git glog-devel gmock-devel gzip hdf5-devel iputils leveldb-devel libquadmath-devel libusb-devel libusbx-devel lmdb-devel make meson nfs-utils ninja-build openblas-devel opencv opencv-devel openssl-devel patch pciutils php-cli php-json php-xml procps-ng protobuf-compiler protobuf-devel python3 python3-devel python3-pip python3-yaml snappy-devel tar unzip vim-enhanced wget xz zip
  5. At this point you might mount a shared volume with phoronix-test-suite already installed, or you can just download and unpack it in the container with steps like these:
    1. wget https://phoronix-test-suite.com/releases/phoronix-test-suite-10.8.4.tar.gz
    2. tar xvzf phoronix-test-suite-10.8.4.tar.gz
    3. cd phoronix-test-suite
  6. ./phoronix-test-suite  install      deepsparse openvino pytorch tensorflow
  7. ./phoronix-test-suite  benchmark    deepsparse openvino pytorch tensorflow

About the author

Michey is a member of the Red Hat Performance Engineering team, and works on bare metal/virtualization performance and machine learning performance. His areas of expertise include storage performance, Linux kernel performance, and performance tooling.
