[Repost] A Review of Loongson’s LSX and LASX Vector Extensions

Loongson’s LSX and LASX Vector Extensions

English source material, originally posted with a machine translation.

February 26, 2023, by clamchowder

Loongson used to make CPUs based on the MIPS ISA, but the company recently switched to a homegrown ISA called Loongarch. This “new” ISA retains many of MIPS’s semantics, but uses incompatible encodings. Loongarch has also been extended to better support Loongson’s goal of making a viable domestic Chinese CPU.

Loongarch’s LSX and LASX vector extensions are a prominent example of this. LSX is a bit like SSE on x86, with 128-bit vector registers and corresponding instructions. LASX can be compared to AVX2, as both extensions work with 256-bit vectors. Unlike SSE and AVX2, LSX and LASX are not publicly documented. However, Loongnix provides an LSX/LASX capable toolchain. That means we can discover LSX and LASX instructions and play around with them. I don’t have time to fully document those ISA extensions, so this article will share some interesting details.

Brief Intro

LSX provides 128-bit registers named VR0 through VR31 and LASX provides 256-bit ones named XR0 through XR31. Just like with SSE and AVX, these registers are aliased to each other. They’re aliased to the 64-bit FP registers (F0 through F31) as well. That means F1 refers to the low 64 bits of XR1, and VR1 refers to the low 128 bits of XR1.
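The aliasing described above can be pictured with a small toy model. This is only a sketch, not a hardware simulator: the class and method names are made up for illustration, and it assumes each architectural register is one 256-bit value whose narrower names are views of its low bits.

```python
# Toy model of FP/vector register aliasing as the article describes it.
# Assumption: one 256-bit value per architectural register; F and VR names
# are views of the low 64 and low 128 bits. All names here are hypothetical.

MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1


class VectorRegisterFile:
    def __init__(self):
        self.xr = [0] * 32  # XR0..XR31, 256 bits each

    def write_xr(self, n, value):
        """Write a full 256-bit value to XRn."""
        self.xr[n] = value & MASK256

    def read_vr(self, n):
        """VRn is the low 128 bits of XRn."""
        return self.xr[n] & MASK128

    def read_f(self, n):
        """Fn is the low 64 bits of XRn."""
        return self.xr[n] & MASK64


regs = VectorRegisterFile()
# Fill XR1 with bytes 0x00..0x1f, least significant byte first.
regs.write_xr(1, int.from_bytes(bytes(range(32)), "little"))
f1 = regs.read_f(1)    # bytes 0x00..0x07
vr1 = regs.read_vr(1)  # bytes 0x00..0x0f
print(hex(f1), hex(vr1))
```

Reading through F1 or VR1 only narrows the view. What a write through a narrow name does to the upper bits is a separate question, covered in the partial register access section.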

Both LSX and LASX provide a variety of instructions to work with vectors. Common things like vector addition, multiplication, and logic operations do exactly what you think. Floating point instructions can work on either FP32 or FP64 elements, while integer ones can work with 8-bit, 16-bit, 32-bit, or 64-bit elements. Of course, there are 128-bit and 256-bit load/store instructions as well.

Besides arithmetic instructions, LASX has instructions for permute, min/max, absolute value, and load-and-broadcast (called XVLDREPL for some reason). Some of these don’t have clear AVX2 equivalents. For example, there’s a variant of the max instruction (XVMAXI) that takes an immediate, and returns the max of the immediate and each element in its corresponding position. It’s pretty interesting, though that instruction uses a 5-bit field to encode the immediate as a two’s complement signed value, meaning you can’t encode a value bigger than 15 (or smaller than -16). For data movement, LASX has instructions to move values in specified vector lanes to memory or GPRs.

There are many more instructions that I didn’t bother to test, but first impressions are that it has decent coverage of vector operations. A few specific things seem to be missing, like sum of absolute differences for accelerating video encoding.

Instruction Encoding Examples

Unlike MIPS, which encoded registers in the middle of the instruction, Loongarch moves the register fields to the least significant bits. That applies to LSX and LASX as well. In keeping with MIPS tradition, LSX/LASX instructions are non-destructive, meaning you don’t have to overwrite one of the source registers. That means fused multiply operations are the equivalent of FMA4, and require four register fields. Because Loongarch uses fixed length, 32-bit instructions just like MIPS, the opcode field appears to be variable length to allow encoding more than three register operands.
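Both the XVMAXI immediate range and the opcode budget come down to field-width arithmetic, which the sketch below works through. It assumes 5-bit register fields (enough to name 32 registers) and no other fields in the instruction word. The function names are hypothetical, and `max_with_immediate` only models the behavior described above, not the real instruction.

```python
def signed_range(bits):
    """Inclusive (min, max) of a two's complement field that is `bits` wide."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def opcode_bits(register_operands, insn_bits=32, reg_field_bits=5):
    """Bits left for the opcode once the register fields are taken out.

    Assumption: the instruction holds nothing but register fields and opcode.
    """
    return insn_bits - register_operands * reg_field_bits


def max_with_immediate(lanes, imm, imm_bits=5):
    """Model of an XVMAXI-style operation: per-lane max against an immediate."""
    lo, hi = signed_range(imm_bits)
    if not lo <= imm <= hi:
        raise ValueError(f"immediate {imm} does not fit in {imm_bits} signed bits")
    return [max(x, imm) for x in lanes]


print(signed_range(5))                      # (-16, 15)
print(opcode_bits(3))                       # 17 bits with three register fields
print(opcode_bits(4))                       # 12 bits with four (FMA4-style)
print(max_with_immediate([-20, 3, 40], 7))  # [7, 7, 40]
```

Going from three register operands to four costs five opcode bits, which fits the observation that the opcode field looks variable length.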

Sometimes, LSX and LASX opcodes differ by a single bit, suggesting that bit indicates whether the instruction targets 128-bit or 256-bit vector length. But that doesn’t apply universally. In some cases, a couple of bits immediately below the opcode seem to specify the data type.

Like later versions of MIPS, Loongarch has indexed load instructions for dealing with arrays. Loongson has incorporated versions of these into its LASX and LSX instruction set extensions. Notably though, Loongarch still doesn’t let you specify a base, index, and scale in one instruction. Both x86 and ARM let you do that, letting them perform array accesses with fewer instructions.

From experimentation, LASX has some weird semantics with regard to partial register access. We’re not going to thoroughly analyze the instruction set, but here are some examples of the weirdness.

Partial Register Access

To start, 128-bit LSX math instructions will operate on the entire 256-bit LASX register. Both VFADD.S (add packed FP32 elements in a 128-bit VR register) and VADD.W (packed addition of 32-bit integers in a 128-bit vector register) will also add the upper 128-bit half of a 256-bit register. Basically, that means a 128-bit math instruction (VFADD) will behave like its 256-bit equivalent (XVFADD) even though their opcodes are different. Contrast that with x86’s behavior, where a legacy SSE-encoded 128-bit operation on a 256-bit register will leave the upper half untouched (VEX-encoded 128-bit operations zero the upper half instead).
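To make the contrast concrete, here is a toy lane-level model of the two behaviors. It is a sketch of what the article observed, not a specification: registers are plain Python lists of eight FP32 lanes, and both function names are invented for illustration.

```python
# Toy model: a 256-bit register is a list of eight FP32 lanes,
# with lanes 0..3 forming the low 128 bits.

def vfadd_s_as_observed_on_3a5000(a, b):
    """128-bit VFADD.S as observed in the article: all eight lanes get added,
    so it behaves like the 256-bit XVFADD.S."""
    return [x + y for x, y in zip(a, b)]


def addps_legacy_sse(dst, src):
    """Legacy SSE-style 128-bit add: low four lanes added, upper four kept."""
    low = [x + y for x, y in zip(dst[:4], src[:4])]
    return low + dst[4:]


a = [1.0] * 8
b = [2.0] * 8
loongson_result = vfadd_s_as_observed_on_3a5000(a, b)
x86_result = addps_legacy_sse(a, b)
print(loongson_result)  # upper lanes changed too
print(x86_result)       # upper lanes still hold the old values
```

Code that relies on the upper half surviving a 128-bit operation would therefore see different results on the two architectures.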

Things get even funnier if you load from memory into a partial register alias. Again, x86 leaves the upper half preserved, though a scalar FP load will zero the high bits of a 128-bit register. With Loongson, what happens to the rest of the register appears to be undefined and quite unpredictable. I find undefined behavior interesting.

Loongson’s Loongarch reference manual says the high 32 bits of a 64-bit FP register are undefined after using FLD.S. FLD.S loads an FP32 value from memory and puts it into the first 32 bits of the target register. The rest of the bits are undefined, but the next 32 bits are usually populated by the next 32-bit value from memory. That suggests the memory subsystem natively handles accesses at 64-bit granularity, and isn’t meant to go smaller.

What happens above the first 64 bits appears to be completely random. Sometimes elements are zero, sometimes they’re garbage, and sometimes the FLD.S instruction acts as a full 256-bit load.

If we push an FLD.S load up against the end of a 16 KB page, so that loading anything more would cross a page boundary, bits 32 to 63 become very unpredictable too. The most common results are either zero, or loading from the start of the cache line. Sometimes more than one extra element is loaded as well. Less commonly, a few elements are loaded from completely random locations, including valid memory locations in the next page.

The 3A5000 displays similarly weird behavior if we fill a 256-bit register, then try to separately fill the first half with VLD (128-bit vector load). Often, the VLD instruction behaves like XVLD, and loads 256 bits into the entire vector register. If the upper 128 bits are across a page boundary, results again become more random.
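The page boundary experiments place a narrow load so that only a wider access would spill into the next page. The sketch below shows that address arithmetic. The helper name and the example addresses are assumptions for illustration, with only the 16 KB page size taken from the text above.

```python
PAGE_SIZE = 16 * 1024  # 16 KB pages, as stated above


def crosses_page(addr, width_bytes, page_size=PAGE_SIZE):
    """True if the access [addr, addr + width_bytes) touches two pages."""
    return addr // page_size != (addr + width_bytes - 1) // page_size


# FLD.S placed on the last 4 bytes of a page.
last_word = PAGE_SIZE - 4
print(crosses_page(last_word, 4))   # False: the 32-bit load itself fits
print(crosses_page(last_word, 8))   # True: a 64-bit access would spill over
print(crosses_page(last_word, 32))  # True: so would a 256-bit one

# VLD placed on the last 16 bytes of a page.
last_vec = PAGE_SIZE - 16
print(crosses_page(last_vec, 16))   # False: the 128-bit load fits
print(crosses_page(last_vec, 32))   # True: behaving like XVLD would cross
```

With this placement, any data that shows up beyond the architecturally requested width cannot have come from a simple wider read of the same page.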

The takeaway here is that Loongson’s 3A5000 will remember whether a register is holding a 128-bit or 256-bit value. Once that’s the case, operations on the low bits of the vector register will have unpredictable effects on the upper bits. Loongson likely considers the upper bits of a vector register to be undefined after an operation on a subset of the register. In theory, this could improve performance or simplify the design. Some x86 CPUs can incur penalties related to preserving the high bits of a vector register when operating on the lower half. For example, Sandy Bridge can incur a 70 cycle penalty when transitioning to and from a “saved state” designed to preserve the upper half of AVX YMM registers.

However, Loongson takes a different penalty. If scalar floating point operations are used alongside vector ones, FP/vector renaming capacity decreases by around 32 entries. Even though the registers are aliased to each other (F0-F31 refer to the same architectural registers as XR0-XR31), it looks like the core has to separately store state for both of them. Sandy Bridge has it worse, with FP renaming capacity severely reduced when mixing scalar and vector operations. Newer CPUs like Skylake don’t take any impact to reordering capacity.

3A5000’s Vector Performance

Loongson’s 3A5000 is the only CPU with LSX/LASX support, so we’ll take a look into its FPU and vector implementation here. The 3A5000 has a dual port FPU, with native support for 256-bit execution. Both the execution units and the registers are 256 bits wide. To feed the execution units, the L1D can handle two 256-bit accesses per cycle. Both can be loads, and one can be a store. Unlike on Zen 1, there’s no splitting 256-bit instructions into two 128-bit micro-ops.

Vector integer and logic operations can use both pipes, with simple operations like adds and bitwise operations enjoying single cycle latency. More complicated operations like permutes or integer multiplies take three or four cycles, which is quite decent. For floating point operations, the vector unit is less capable. FP adds and multiplies each get a specialized pipe, creating parallels to Sandy Bridge and older designs. Loongson does have FMA support, but both pipes share a single FMA unit. That setup lets FMA operations dual issue alongside an FP add or FP multiply. However, an even mix of FP add, FP multiply, and FMA instructions doesn’t quite reach 2 IPC, possibly because of sub-optimal pipe assignment and contention for the shared FMA unit.

Floating point execution units tend to be bigger and more power hungry than integer ones. Loongson’s strategy probably focused on getting the benefits of 256-bit vector length rather than going for maximum performance. Floating point throughput matches Zen 1 assuming programs can use 256-bit vectors. But the 3A5000 falls behind Skylake, which can do two 256-bit FMA instructions per cycle. Loongson also struggles with latency. Basic FP operations execute with 5 cycle latency, which is far from ideal especially at a low 2.5 GHz clock speed. For comparison, Zen 1 can execute FP adds and multiplies with 3 cycle latency. FMA operations on Zen 1 have 5 cycle latency.
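The throughput comparison can be checked with back-of-the-envelope arithmetic. The sketch below counts peak FP32 FMA FLOPs per cycle per core. The 3A5000 and Skylake figures come from the text above, while the Zen 1 entry (two 128-bit FMA operations per cycle) is my assumption and is not stated in this article. The function name is hypothetical.

```python
def peak_fma_flops_per_cycle(fma_per_cycle, vector_bits, element_bits=32):
    """Peak FLOPs per cycle from FMA alone (one FMA = 2 FLOPs per lane)."""
    lanes = vector_bits // element_bits
    return fma_per_cycle * lanes * 2


cores = {
    "Loongson 3A5000": (1, 256),  # one shared 256-bit FMA unit (from the article)
    "Zen 1": (2, 128),            # ASSUMPTION: two 128-bit FMA ops per cycle
    "Skylake": (2, 256),          # two 256-bit FMAs per cycle (from the article)
}

flops = {name: peak_fma_flops_per_cycle(n, bits) for name, (n, bits) in cores.items()}
for name, value in flops.items():
    print(f"{name}: {value} FP32 FLOPs/cycle")

# Per-core peak for the 3A5000 at the 2.5 GHz clock mentioned above.
gflops_3a5000 = flops["Loongson 3A5000"] * 2.5
print(f"3A5000 at 2.5 GHz: {gflops_3a5000} GFLOPS per core")
```

Under these assumptions the 3A5000 and Zen 1 tie on peak FMA throughput per cycle, and Skylake doubles both, which is consistent with the comparison above. Peak numbers ignore latency and scheduling limits.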

| Instruction | Description | 3A5000 Throughput / Latency | Pipe |
|---|---|---|---|
| xvadd.d | 256-bit vector add with packed 64-bit integers | 2 per cycle, 1 cycle latency | Both |
| xvmul.d | 256-bit vector multiply with packed 64-bit integers | 2 per cycle, 4 cycle latency | Both |
| xvxor.v | 256-bit bitwise exclusive or | 2 per cycle, 1 cycle latency | Both |
| xvsll.h | 256-bit vector shift | 2 per cycle, 1 cycle latency | Both |
| xvfadd.s | 256-bit vector add with packed FP32 elements | 2 per cycle, 5 cycle latency | FADD pipe |
| xvfmul.s | 256-bit vector multiply with packed FP32 elements | 2 per cycle, 5 cycle latency | FMUL pipe |
| xvfmadd.d | 256-bit vector fused multiply add with packed FP64 elements | 1 per cycle, 5 cycle latency | Both, but only one execution unit |
| xvpermi.d | 256-bit permute, controlled by immediate | 2 per cycle, 3 cycle latency | Both |
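One way to read these figures is to ask how many independent dependency chains a loop needs to keep a unit busy, which is throughput times latency. The sketch below applies that rule of thumb to the figures listed above. The helper name is hypothetical, and the result is an idealized bound that ignores scheduler and register limits.

```python
# (instructions per cycle, latency in cycles), copied from the figures above.
measured = {
    "xvadd.d": (2, 1),
    "xvmul.d": (2, 4),
    "xvxor.v": (2, 1),
    "xvsll.h": (2, 1),
    "xvfadd.s": (2, 5),
    "xvfmul.s": (2, 5),
    "xvfmadd.d": (1, 5),
    "xvpermi.d": (2, 3),
}


def independent_chains_needed(per_cycle, latency):
    """Operations that must be in flight to sustain peak throughput."""
    return per_cycle * latency


chains = {name: independent_chains_needed(*v) for name, v in measured.items()}
for name, count in chains.items():
    print(f"{name}: {count} independent chains")
```

For example, a reduction built on xvfmadd.d would need about five separate accumulators to reach one FMA per cycle, since each result takes five cycles to come back.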

Therefore, Loongson doesn’t seem to be aiming particularly high with its vector execution units. The 3A5000 is not going to push through more vector operations per cycle than Zen 1, even though Zen 1 has 128-bit execution units. Its floating point side isn’t very strong, with high latency and low throughput compared to Intel and AMD’s 2017 era technology. Non-FP execution is better, though Intel and AMD can still bring more ports and more throughput to bear.

To hide execution and memory access latency, the 3A5000 has a unified 32 entry FP scheduler and 96 vector registers available for renaming (though with the caveat from above). Add in 32 non-speculative registers, and we’re probably looking at 128 total vector registers. Those registers are 256 bits wide, giving 4 KB of total vector RF capacity. Zen 1 uses a unified 36 entry FP scheduler, with a 64 entry non-scheduling queue in front of it. AMD therefore can track a lot more operations waiting for execution, even if it has to split 256-bit instructions into two micro-ops. Loongson does have a lead with register file capacity, because AMD only has 128-bit wide registers (and 160 of them total). But that advantage will only show if applications use 256-bit vectors a lot.

In terms of execution units and scheduling resources, the 3A5000’s FPU lands somewhere between high performance and low power implementations. It’s not a match for Zen 1, and definitely not a match for Skylake. Loongson’s 256-bit vector width and unified scheduler should give it a leg up over Ampere Altra, but from our libx264 testing, that wasn’t really the case. The 3A5000 does convincingly beat Intel’s old Goldmont Plus based Celeron J4125 in the same video encoding test. However, Goldmont Plus aims for a much lower power target than the 3A5000, and Goldmont Plus lacks any AVX or FMA instructions.

Final Words

By using incompatible encodings, Loongson can say they have a new ISA and develop it independently from MIPS. Calling it Loongarch rather than MIPS means they don’t have to deal with rights for the ISA, even if Loongarch and MIPS share a lot of semantics to the point where you can use MIPS64 manuals. This approach makes a lot of sense. Keeping the semantics means Loongson can quickly reuse most of the toolchain. Changing the encodings means they have a new ISA and aren’t held back by any licensing restrictions.

Alongside AVX and SVE, Loongson’s LASX is another ISA extension that takes vector length above 128 bits. More importantly, Loongson is part of China’s efforts to build up domestic CPU capabilities. LASX suggests China is aiming for high performance, because 128-bit vector execution would be adequate for low power applications where high performance is not a concern.

Yet the Loongson 3A5000’s LASX implementation is not competitive with AVX2 implementations found in AMD and Intel desktop CPUs, even if we go back a couple of generations. Skylake and Zen 1 both have wider vector execution setups and can keep more operations in flight to absorb latency. The 3A5000’s low clocks put a giant nail in the coffin, ensuring that it’s completely outmatched by any remotely modern desktop CPU. For sure, getting 256-bit vector execution units to run at high clock speeds is a challenging exercise. But AMD and Intel have figured out how to do it. Loongson has not.

