AMD Wins Major US Supercomputer Contract, Emerging as a New Force in High-Performance Computing (HPC)

BRIAN SANTO: I’m Brian Santo, EE Times Editor in Chief, and you’re listening to EE Times on Air. This is your Briefing for the week ending March 13th.

In this episode…

Supercomputers. The US Department of Energy just announced what will be the fastest supercomputer in the world by far. Supercomputing is a prestigious market, and a highly competitive one for the companies that make processing chips. With the latest round of new supercomputers, there was a clear – and somewhat unexpected – winner.

The TOP500 is a list of the 500 most powerful supercomputers in the world. It's updated at least twice a year. For the last few years, the top two computers in the world have been operated by the US Department of Energy. They are Summit and Sierra, which reside at Oak Ridge National Lab and Lawrence Livermore National Lab, respectively. Both operate at speeds measured in petaflops – a petaflop being one quadrillion floating-point operations per second.

During the past year, the DoE has been contracting for its next round of supercomputers. There will be three: one for Argonne National Laboratory, called Aurora; one for Oak Ridge, called Frontier; and one for Lawrence Livermore, called El Capitan.

The three usher in what is being called the 'exascale era.' All three will run at speeds measured in exaflops, or quintillions of floating-point operations per second.
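
As a rough editorial illustration of the jump in scale described here (not something from the episode itself), the short Python sketch below relates the two units using the standard SI prefixes. The helper name `petaflops_to_exaflops` is made up for this example.

```python
# Standard SI prefixes behind the units mentioned in the episode.
PETA = 10**15  # one quadrillion
EXA = 10**18   # one quintillion


def petaflops_to_exaflops(petaflops: float) -> float:
    """Convert a rate in petaflops to exaflops (hypothetical helper)."""
    return petaflops * PETA / EXA


# One exaflop equals this many petaflops.
petaflops_per_exaflop = EXA // PETA
print(petaflops_per_exaflop)
```

In other words, an exascale machine is rated in units a thousand times larger than those used for petascale systems such as Summit and Sierra.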

From a system-level perspective, the contracts were divided up between Cray and HPE, but since HPE just bought Cray in 2019, HPE will be building all three. Of equal interest is who will be providing the processors.

Summit and Sierra, the current champs, incorporate a combination of CPUs from IBM and GPUs from Nvidia.

The CPUs and GPUs in Aurora will both be provided by Intel. The CPUs and GPUs in Frontier will both be provided by AMD. Last week, the DoE announced that the CPUs and GPUs in El Capitan will also be provided by AMD. Furthermore, AMD got the El Capitan deal by demonstrating it can make the supercomputer significantly faster than the other two in this same generation: Frontier and Aurora.

Kevin Krewell, an analyst from Tirias Research, is an expert on processor technology. He’s been on the podcast before, and we invited him back to talk about the latest announcement.

I asked him: how big of a deal is this for AMD?

KEVIN KREWELL: Oh, it is a big deal for AMD. It's their second exascale supercomputer win. The first one I think was Frontier. Now with these two wins, it's just AMD and Intel. Intel has the third win, and both companies have both the CPU and GPU part of that win, so it's interesting that the DoE has decided to pair companies that have both CPU and GPU together as an important aspect of these design wins.

In the case of AMD, one of the key design aspects is that the El Capitan will be assembled late 2021 and go operational in 2022. What AMD is going to provide is a coherent connection between the CPU and GPU, so they have a shared memory architecture. And to the team at Lawrence Livermore Lab, that was a very critical aspect of this design win. And it's something unique that AMD, and actually Intel as well, can do that's more difficult for somebody that doesn't control both the CPU and GPU side of the equation.

BRIAN SANTO: Let's start with the people who don't control both sides of the equation. The team that's kind of on the outs here is IBM and Nvidia.

KEVIN KREWELL: Yes. And this is a bit of a surprise. IBM has done well in the past with its Power processor, and Nvidia has, right now, the most performant GPU in high-performance computing. So the fact that both IBM Power and Nvidia GPUs are out of the picture right now was, I'm sure, a bit of a disappointment for both companies. They don't control the total connectivity here, and I think that's one of the key things.

El Capitan is a $600 million project, but not all of that money is going just to hardware. A fair amount of it, tens of millions of dollars at least, is going to developing the software that will run on the El Capitan supercomputer, and the goal there is to take AMD's ROCm, which is their version of CUDA (but it's an open-source version), and bring that up to a high-quality level. So the Lawrence Livermore Lab people are invested heavily in building up the AMD software story, which to this point had pretty much trailed what Nvidia was doing with CUDA. That's been a real important part for AMD. I think it's going to help solidify a weak spot in AMD's overall strategy.

BRIAN SANTO: Well, they came around really strong. When they first announced El Capitan, they were talking about it being a 1.5 exaflops machine, and they said that, in the meantime, as they were evaluating the bids for processors and GPUs, AMD convinced them that they'd be able to get from 1.5 exaflops (which is what they promised for Frontier) to 2. And that's apparently based on the interconnect that you were talking about earlier.

KEVIN KREWELL: Right. AMD did a couple things. One is, they pulled in the roadmap for this specific project. The CPU is the fourth-generation Zen core. It's a processor called Genoa, and it will be manufactured in a 5 nanometer process. Then there's AMD's interconnect between the Genoa core and the third generation of the GPUs (what is now called the CDNA architecture, which is the compute version of AMD's GPUs). The GPU is not on a defined process node. It's an advanced process node, but they wouldn't be nailed down on exactly which it was. But with Infinity Fabric, the third generation of that, AMD will now have a coherent link between the CPUs and GPUs. That plus the performance that AMD promised, and it's four GPUs to one CPU, with the GPUs providing most of the flops there. And that architecture all together won the deal for AMD.
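
As an editorial aside, the figures quoted in this exchange can be put into a quick back-of-the-envelope sketch in Python. Only the 1.5 and 2 exaflops targets and the 4:1 GPU-to-CPU ratio come from the conversation; the function names and the CPU count of 1,000 are hypothetical placeholders, not actual system figures.

```python
GPUS_PER_CPU = 4  # ratio stated in the conversation


def relative_gain(old_exaflops: float, new_exaflops: float) -> float:
    """Fractional increase when moving from the old target to the new one."""
    return (new_exaflops - old_exaflops) / old_exaflops


def gpu_count(cpu_count: int) -> int:
    """GPUs implied by a given CPU count at the stated 4:1 ratio."""
    return cpu_count * GPUS_PER_CPU


gain = relative_gain(1.5, 2.0)
print(f"Target raised by about {gain:.1%}")

# Hypothetical CPU count, used only to illustrate the ratio.
print(gpu_count(1000))
```

Moving from 1.5 to 2 exaflops is roughly a one-third increase over the originally announced target.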

BRIAN SANTO: Wow. Interesting. And we had mentioned earlier, before we got onto this conversation, it's the interconnect that's becoming really important. It's kind of the reason why Nvidia bought Mellanox.

KEVIN KREWELL: Well, the Mellanox interconnect is a rack-based interconnect. It connects racks together. The Infinity Fabric is a scalable solution for AMD. It can be used for on-chip networking; it can be used chip-to-chip, as in their chiplet strategy where the chips are interconnected on a package; and it can be extended to processor-to-processor interconnects. So it's a very scalable fabric, and it goes all the way from inside the chip to connecting multiple chips together and multiple packages together. That's the important factor I think for AMD. That's sort of their secret sauce in making this all happen.

BRIAN SANTO: I think the overall point is that now, in order to extract the greatest amount of performance, you're looking at the system holistically; processors and interconnect have become very important. That's the AMD machine and the Intel machine, too, right?

KEVIN KREWELL: Intel is doing something very similar for their architecture as well. Nvidia has its NVLink, which would be the Nvidia equivalent, and they can coherently connect multiple GPUs together, and NVLink can be connected to a Power processor. But Power processors are the only processors that have an NVLink connection at this point.

BRIAN SANTO: All right. We talked about El Capitan. That was kind of like the first half of AMD's news. They followed that announcement up about a week later with some more stuff.

KEVIN KREWELL: Actually, the day after. Wednesday was the El Capitan news. Thursday was the financial analyst day. And on the financial analyst day, they kind of clarified some of the news from the El Capitan launch. To me, the most significant was a roadmap change on the GPU side.

Typically, AMD has taken the Radeon GPUs that are used for consumer products and then used them as Radeon Instinct processors for compute and machine learning. Going forward, AMD is going to bifurcate the roadmap: there'll be a separate consumer product, which they'll continue to develop (that's the RDNA products, Radeon DNA), and then a specific compute architecture, CDNA. They'll add tensor cores, they'll add more reliability and RAS features for enterprise-class compute, and then this Infinity Fabric scalability; the third generation of Infinity Fabric will support connecting up to eight GPUs together.

Those features will be dedicated to the CDNA parts. Then there are the RDNA parts, which are the consumer graphics products used for PlayStation and the like. They're going to be adding later this year both ray tracing and also scalable rendering. So later this year, they're expecting on the RDNA side, which is the consumer side, to have a more massive solution there, something that's more competitive with higher-end Nvidia solutions. So they're finally going to take that on.

BRIAN SANTO: They're hoping to have this HPC technology percolate down to other applications that aren't HPC. Is that correct?

KEVIN KREWELL: Yeah. One theme from the financial analyst day was AMD's focus on higher-performance computing applications. And that includes the idea of heterogeneous or accelerated compute. And to do that, it's a combination of CPUs plus GPUs together. And that's AMD's solution. They don't have FPGAs, they don't have a dedicated machine learning ASIC. It's not to say that they don't at some point in time add that. On the CDNA parts, they will be adding dedicated tensor cores for tensor processing and AI.

So there are definitely applications where AMD's looking at this. The idea is that future workloads, even on scale-out servers, will include machine learning applications, and the ability to tie CPUs with accelerators will be a critical feature for many future workloads. And therefore, that's something that AMD wants to be on the forefront of. So that's a differentiator.

BRIAN SANTO: Yeah. It's kind of interesting, too: the DoE was talking about that during their conference about El Capitan. I asked them about specific AI accelerators. They said, the de facto accelerator is the GPU. But the guy who's running that program is in charge of both HPC and AI at Lawrence Livermore. And he was saying, We're definitely looking at AI workloads in the HPC context, and if it works out with the supercomputers that they have in place, they're able to scale out El Capitan to add those nodes in.

KEVIN KREWELL: Yeah. They said it's an experimental phase right now. They're trying to figure out how they can use AI processing, machine learning, to do a better job of testing a nuclear stockpile. So they're going to be doing some experiments with it. They haven't really gone down this path before. They've just had a pure simulation focus. So this is an area where they're experimenting. And the nice thing about what they're doing is, if they find it's useful, they can still use the GPUs for both compute and also for AI machine learning applications. Because AMD will have those dedicated tensor cores in there, too.

BRIAN SANTO: Absolutely wild stuff. We'll have to see what happens with the next supercomputers from 2023 or 2024, right?

KEVIN KREWELL: Yes. It'll be another ball game. I mean, right now, this is a huge leap forward for the supercomputer capability in the United States government and the DoE. It's almost unimaginable what's the next big leap after this.

BRIAN SANTO: They were saying that this El Capitan machine will be as powerful as the next 200 supercomputers combined.

KEVIN KREWELL: Yeah. It's amazing.

BRIAN SANTO: It's boggling, isn't it?

KEVIN KREWELL: That is definitely mind-boggling. One interesting thing that came out in the Q&A in that conference was the fact that the big El Capitan is dedicated to classified workloads. But there'll be a mini El Capitan, or they called it a clone, that will be available for applications that aren't classified. They talked about being able to use this architecture for your medical research and all that. It's probably going to be on the clone El Capitan, not on the main one.

BRIAN SANTO: Oh, fascinating.

KEVIN KREWELL: Yeah.

BRIAN SANTO: All right. Kevin, thank you very much.

KEVIN KREWELL: No problem. Glad to be here.

BRIAN SANTO: Aurora and Frontier are both expected to be delivered in 2021. El Capitan is scheduled for delivery in 2023.

Okay! That is your Weekly Briefing for the week ending March 13th. Thanks for joining us, and we hope to see you back next Friday with our next episode.

The Weekly Briefing is available via Spotify, iTunes, and Stitcher, but if you get there via our web site at eetimes.com, you’ll find a transcript, links to the stories we refer to, and other goodies. And if you like what you’ve heard, share the podcast with your co-workers and friends.
