Themes - SysMoore & Hardware Renaissance In The Next Decade (Pt.2)

Summary

  • In Part 2 of the SysMoore series we compare the different approaches to chip-level packaging between Intel and AMD.
  • We also discuss PCBs: as scaling at the chip level reaches physical limits, PCBs are becoming essential for the high-speed, high-density connections that advanced AI and data center workloads require.
  • Additionally, we consider what's happening to scale compute at the server/rack level, outlining how Nvidia plans to implement its upcoming GB200 architecture within a server rack.
  • In Part 3, we shall cover power supply, networking, cooling, and memory.

Advanced Packaging to Advanced PCB Motherboards to Advanced Everything

Chiplet and Advanced Packaging

AMD’s success in the server CPU market, particularly against Intel, owes much to its innovative use of chiplet architectures. While this is sometimes called advanced packaging, it isn’t quite the same thing. AMD’s advantage comes from offering higher core counts per processor package, making it especially appealing for cloud providers and virtual private server (VPS) applications. However, this same chiplet strategy also limits how much of Intel’s market share AMD can ultimately capture.

Previously, increasing core counts meant manufacturing a CPU with more cores integrated onto a single, large silicon die — a method with practical limits. Intel’s answer to scaling core counts was Ultra Path Interconnect (UPI), a proprietary protocol allowing multiple CPUs to communicate within a single server by connecting multiple CPU sockets on a motherboard. This multi-socket setup lets users install up to four CPUs per server, connected through UPI via copper lines embedded in the motherboard’s layered PCB (Printed Circuit Board).

However, Intel’s approach has inherent limitations. Bandwidth between sockets reaches only about 80 GB/s (in 4th Gen Xeon Sapphire Rapids), significantly lower than the intra-chip bandwidth of over 10 TB/s. Latency is also an issue: within a single die, core-to-core latency is around 59 ns, while cross-socket latency jumps to an average of 138 ns, affecting performance in data-intensive applications.
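To put those figures in perspective, here is a back-of-the-envelope sketch using the bandwidth and latency numbers above. The 1 MiB payload size and the simple "latency plus serialization time" model are illustrative assumptions, not measured behavior:

```python
# Rough comparison of moving a 1 MiB payload on-die vs. across a
# UPI socket link, using the figures cited in the text. The payload
# size and the first-order model (fixed latency + size/bandwidth)
# are illustrative assumptions only.

PAYLOAD = 1 * 1024 * 1024     # 1 MiB, in bytes (assumed payload)

INTRA_DIE_BW = 10e12          # ~10 TB/s aggregate intra-chip bandwidth
INTRA_DIE_LAT = 59e-9         # ~59 ns core-to-core latency

CROSS_SOCKET_BW = 80e9        # ~80 GB/s UPI bandwidth (Sapphire Rapids)
CROSS_SOCKET_LAT = 138e-9     # ~138 ns average cross-socket latency

def transfer_time(latency_s: float, bandwidth_bps: float, size_bytes: int) -> float:
    """First-order model: fixed latency plus serialization time."""
    return latency_s + size_bytes / bandwidth_bps

on_die = transfer_time(INTRA_DIE_LAT, INTRA_DIE_BW, PAYLOAD)
cross = transfer_time(CROSS_SOCKET_LAT, CROSS_SOCKET_BW, PAYLOAD)

print(f"on-die:       {on_die * 1e6:.2f} us")
print(f"cross-socket: {cross * 1e6:.2f} us")
print(f"slowdown:     {cross / on_die:.0f}x")
```

Even in this crude model, the cross-socket transfer is roughly two orders of magnitude slower than the on-die one, and for large payloads the bandwidth gap, not the latency gap, dominates.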

AMD took a different route by designing small, modular Core Chiplet Dies (CCDs), each containing eight cores, along with a separate I/O Die (IOD) to handle I/O functions like memory, PCIe, and USB. Unlike compute functions, I/O and cache have seen little benefit from transistor scaling in advanced nodes. The transistor density improvements from nodes like 10nm to 3nm primarily apply to logic (compute) components, making it economically unviable to manufacture non-compute components on the latest, costlier nodes. By producing these different functional blocks on the most appropriate process nodes, AMD optimizes both performance and cost.
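The economics behind this split can be illustrated with a standard Poisson die-yield model. Everything numeric below (defect density, die areas, wafer cost, the eight-chiplet configuration) is a hypothetical assumption for the sketch, not real foundry or AMD data:

```python
import math

# Illustrative monolithic-vs-chiplet cost comparison using the
# classic Poisson yield model: yield = exp(-area * defect_density).
# All numbers are hypothetical assumptions, not real foundry data.

DEFECT_DENSITY = 0.1                 # defects per cm^2 (assumed)
WAFER_COST = 17000.0                 # dollars per leading-edge wafer (assumed)
WAFER_AREA_MM2 = math.pi * 150**2    # 300 mm wafer, ignoring edge loss

def cost_per_good_die(die_area_mm2: float) -> float:
    """Cost of one defect-free die under the Poisson yield model."""
    area_cm2 = die_area_mm2 / 100.0
    yield_frac = math.exp(-area_cm2 * DEFECT_DENSITY)
    dies_per_wafer = WAFER_AREA_MM2 / die_area_mm2   # ignores edge loss
    return WAFER_COST / (dies_per_wafer * yield_frac)

# One hypothetical 640 mm^2 monolithic die vs. eight 80 mm^2 chiplets
monolithic = cost_per_good_die(640.0)
chiplets = 8 * cost_per_good_die(80.0)

print(f"monolithic die: ${monolithic:.0f}")
print(f"8 x chiplets:   ${chiplets:.0f}")
```

Because yield falls exponentially with die area, eight small dies come out markedly cheaper than one large die of the same total area. The sketch deliberately omits the IOD, packaging, and assembly costs, which claw back some of the advantage but, in AMD's case, evidently not all of it.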
