LLMs: Translation and Commentary on "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention"

Overview: On June 20, 2023, a UC Berkeley team open-sourced vLLM, a library for fast LLM inference and serving. It uses a new attention algorithm called PagedAttention to manage attention keys and values effectively. Equipped with PagedAttention, vLLM redefines the state of the art in LLM serving, delivering up to 24x higher throughput than HuggingFace Transformers without any changes to the model architecture.

PagedAttention is the core technology of vLLM and addresses the memory bottleneck in LLM serving. With traditional attention algorithms, autoregressive decoding requires keeping the attention key and value tensors of all input tokens in GPU memory in order to generate the next token; these cached tensors are known as the KV cache. PagedAttention borrows the classic ideas of virtual memory and paging, allowing contiguous keys and values to be stored in non-contiguous memory. By partitioning each sequence's KV cache into blocks, PagedAttention can perform attention computation efficiently. Its memory utilization is near-optimal, wasting less than 4% of memory. PagedAttention also supports efficient memory sharing, which further reduces the memory overhead of complex sampling algorithms and increases throughput.

vLLM has been deployed in UC Berkeley's Chatbot Arena and Vicuna Demo for the past two months, where it serves as the core technology behind their inference backend. vLLM enables smaller research teams with limited compute resources to provide high-performance LLM serving.


Source: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

Date: June 20, 2023

Authors: Woosuk Kwon*, Zhuohan Li*, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Yu, Joey Gonzalez, Hao Zhang, and Ion Stoica (* Equal Contribution), UC Berkeley

Abstract

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.

vLLM has been developed at UC Berkeley and deployed at Chatbot Arena and Vicuna Demo for the past two months. It is the core technology that makes LLM serving affordable even for a small research team like LMSYS with limited compute resources. Try out vLLM now with a single command at our GitHub repository.

Beyond State-of-the-art Performance

We compare the throughput of vLLM with HuggingFace Transformers (HF), the most popular LLM library and HuggingFace Text Generation Inference (TGI), the previous state of the art. We evaluate in two settings: LLaMA-7B on an NVIDIA A10G GPU and LLaMA-13B on an NVIDIA A100 GPU (40GB). We sample the requests’ input/output lengths from the ShareGPT dataset. In our experiments, vLLM achieves up to 24x higher throughput compared to HF and up to 3.5x higher throughput than TGI.
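
For readers who want a rough number on their own hardware, the following is a minimal throughput-measurement sketch, not the authors' benchmark harness. It assumes vLLM is installed, uses a repeated prompt as a stand-in for the ShareGPT-sampled requests, and picks huggyllama/llama-7b as an assumed Hugging Face model id for LLaMA-7B.

import time

from vllm import LLM, SamplingParams

# Assumed model id; a fixed prompt list stands in for ShareGPT-sampled requests.
llm = LLM(model="huggyllama/llama-7b")
prompts = ["Tell me about the history of Berkeley."] * 256
params = SamplingParams(temperature=0.8, max_tokens=128)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

# Count the generated tokens across all requests and report output tokens per second.
generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated_tokens} tokens in {elapsed:.1f}s "
      f"({generated_tokens / elapsed:.1f} output tokens/s)")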

The Secret Sauce: PagedAttention

In vLLM, we identify that the performance of LLM serving is bottlenecked by memory. In the autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate next tokens. These cached key and value tensors are often referred to as the KV cache. The KV cache is

Large: Takes up to 1.7GB for a single sequence in LLaMA-13B (see the sizing sketch below).

Dynamic: Its size depends on the sequence length, which is highly variable and unpredictable.

As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste 60% – 80% of memory due to fragmentation and over-reservation.
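
To see where the 1.7GB figure comes from, here is a back-of-the-envelope sizing sketch in Python. It assumes LLaMA-13B's published shape (40 layers, hidden size 5120) and 2-byte fp16 KV entries; the exact number depends on the model configuration and dtype.

# Per-token KV cache = 2 (keys and values) * num_layers * hidden_size * bytes_per_element.
num_layers = 40         # LLaMA-13B
hidden_size = 5120      # LLaMA-13B: 40 heads * head_dim 128
bytes_per_elem = 2      # fp16

kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem
print(kv_bytes_per_token)                       # 819200 bytes, roughly 0.8 MB per token

max_seq_len = 2048                              # LLaMA's maximum context length
print(kv_bytes_per_token * max_seq_len / 1e9)   # about 1.68 GB for one full-length sequence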

To address this problem, we introduce PagedAttention, an attention algorithm inspired by the classic idea of virtual memory and paging in operating systems. Unlike the traditional attention algorithms, PagedAttention allows storing continuous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens. During the attention computation, the PagedAttention kernel identifies and fetches these blocks efficiently.

Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. The physical blocks are allocated on demand as new tokens are generated.
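
As an illustration of the block-table idea, here is a toy sketch (not vLLM's actual implementation) of how a sequence's contiguous logical blocks can map to scattered physical blocks that are allocated on demand:

# Toy block table: logical blocks of a sequence map to arbitrary physical blocks,
# and a new physical block is grabbed only when the last one fills up.
BLOCK_SIZE = 16  # tokens per block


class SequenceBlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks      # pool of free physical block ids
        self.block_table = []               # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block on demand when the last block is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.free_blocks.pop())
        self.num_tokens += 1


free = list(range(100))         # toy pool of 100 physical blocks in GPU memory
seq = SequenceBlockTable(free)
for _ in range(40):             # generate 40 tokens
    seq.append_token()
print(seq.block_table)          # 3 physical blocks; only the last one is partially filled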

In PagedAttention, memory waste only happens in the last block of a sequence. In practice, this results in near-optimal memory usage, with a mere waste of under 4%. This boost in memory efficiency proves highly beneficial: It allows the system to batch more sequences together, increase GPU utilization, and thereby significantly increase the throughput as shown in the performance result above.

PagedAttention has another key advantage: efficient memory sharing. For example, in parallel sampling, multiple output sequences are generated from the same prompt. In this case, the computation and memory for the prompt can be shared between the output sequences.

PagedAttention naturally enables memory sharing through its block table. Similar to how processes share physical pages, different sequences in PagedAttention can share the blocks by mapping their logical blocks to the same physical block. To ensure safe sharing, PagedAttention keeps track of the reference counts of the physical blocks and implements the Copy-on-Write mechanism.
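
The sharing mechanism can also be illustrated with a toy sketch (again, not vLLM's code): forked sequences in parallel sampling start out pointing at the same physical prompt blocks, each block carries a reference count, and a block is copied only when a sequence writes to a block that others still reference.

# Toy copy-on-write over shared blocks: blocks are shared via reference counts
# and duplicated only when a sequence writes to a block someone else still uses.
from collections import defaultdict


class BlockPool:
    def __init__(self):
        self.next_id = 0
        self.ref_count = defaultdict(int)

    def allocate(self):
        block_id = self.next_id
        self.next_id += 1
        self.ref_count[block_id] = 1
        return block_id

    def fork(self, block_table):
        # A forked sequence reuses the parent's physical blocks; nothing is copied.
        for block_id in block_table:
            self.ref_count[block_id] += 1
        return list(block_table)

    def write(self, block_table, logical_idx):
        block_id = block_table[logical_idx]
        if self.ref_count[block_id] > 1:             # shared block: copy on write
            self.ref_count[block_id] -= 1
            block_table[logical_idx] = self.allocate()


pool = BlockPool()
prompt_blocks = [pool.allocate(), pool.allocate()]   # prompt KV cache spans 2 blocks
sample_a = pool.fork(prompt_blocks)                  # two parallel samples share them
sample_b = pool.fork(prompt_blocks)
pool.write(sample_a, 1)                              # sample A appends into its last block
print(sample_a, sample_b)                            # only the written block diverges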

PagedAttention's memory sharing greatly reduces the memory overhead of complex sampling algorithms, such as parallel sampling and beam search, cutting their memory usage by up to 55%. This can translate into up to 2.2x improvement in throughput. This makes such sampling methods practical in LLM services.

PagedAttention is the core technology behind vLLM, our LLM inference and serving engine that supports a variety of models with high performance and an easy-to-use interface. For more technical details about vLLM and PagedAttention, check out our GitHub repo and stay tuned for our paper.

The Silent Hero Behind LMSYS Vicuna and Chatbot Arena

This April, LMSYS developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in Chatbot Arena for millions of users. Initially, LMSYS FastChat adopted a HF Transformers based serving backend to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM team have worked together and soon developed the FastChat-vLLM integration to use vLLM as the new backend in order to support the growing demands (up to 5x more traffic). In an early internal micro-benchmark by LMSYS, the vLLM serving backend can achieve up to 30x higher throughput than an initial HF backend.

Since mid-April, the most popular models such as Vicuna, Koala, and LLaMA have all been successfully served using the FastChat-vLLM integration. With FastChat as the multi-model chat serving frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPUs to serve Vicuna to millions of users with high throughput and low latency. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION's OpenAssistant, and Stability AI's StableLM. Support for more models is under development and forthcoming.

This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K requests daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.

Get started with vLLM

Install vLLM with the following command (check out our installation guide for more):

$ pip install vllm

vLLM can be used for both offline inference and online serving. To use vLLM for offline inference, import vLLM and use the LLM class in your Python scripts:

from vllm import LLM

prompts = ["Hello, my name is", "The capital of France is"]  # Sample prompts.
llm = LLM(model="lmsys/vicuna-7b-v1.3")  # Create an LLM.
outputs = llm.generate(prompts)  # Generate texts from the prompts.
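
Each element of outputs is a RequestOutput object that carries the original prompt and the generated completions; a typical follow-up, in the spirit of the vLLM quickstart, is to print them:

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")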

To use vLLM for online serving, start an OpenAI API-compatible server with:

$ python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-7b-v1.3

You can query the server in the same format as the OpenAI API:

$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "lmsys/vicuna-7b-v1.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
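
Because the server speaks the OpenAI completions protocol, the same request can also be issued from Python with the openai client. The sketch below assumes the pre-1.0 openai package; only api_base needs to point at the local server, and the api_key is a dummy value since the local server is assumed not to check it.

import openai

openai.api_key = "EMPTY"                       # dummy key; assumed unchecked by the local server
openai.api_base = "http://localhost:8000/v1"   # point the client at the vLLM server

completion = openai.Completion.create(
    model="lmsys/vicuna-7b-v1.3",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)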

To learn about more ways to use vLLM, please check out the quickstart guide.
