Metadata-Version: 2.1
Name: langport
Version: 0.3.4
Summary: A large language model serving platform.
Project-URL: Homepage, https://github.com/vtuber-plan/langport
Project-URL: Bug Tracker, https://github.com/vtuber-plan/langport/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: accelerate>=0.22.0
Requires-Dist: fastapi
Requires-Dist: httpx
Requires-Dist: numpy
Requires-Dist: pydantic
Requires-Dist: requests
Requires-Dist: rich>=10.0.0
Requires-Dist: sentencepiece
Requires-Dist: shortuuid
Requires-Dist: tokenizers>=0.12.1
Requires-Dist: torch
Requires-Dist: transformers>=4.33.1
Requires-Dist: uvicorn
Requires-Dist: wandb
Requires-Dist: tenacity>=8.2.2
Provides-Extra: dev
Requires-Dist: black==23.3.0; extra == "dev"
Requires-Dist: pylint==2.8.2; extra == "dev"
Provides-Extra: ggml
Requires-Dist: ctransformers; extra == "ggml"
Provides-Extra: optimum
Requires-Dist: onnx; extra == "optimum"
Requires-Dist: onnxruntime; extra == "optimum"
Requires-Dist: optimum; extra == "optimum"

<div align="center">

# LangPort

<a href="https://github.com/vtuber-plan/langport">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/vtuber-plan/langport?style=social">
</a>
<a href="https://github.com/vtuber-plan/langport/blob/main/LICENSE">
  <img alt="License" src="https://img.shields.io/github/license/vtuber-plan/langport">
</a>

![architecture](assets/architecture.jpg)

</div>

LangPort is a open-source large language model serving platform.
Our goal is to build a super fast LLM inference service.

This project is inspired by [lmsys/fastchat](https://github.com/lm-sys/FastChat), we hope that the serving platform is lightweight and fast, but fastchat includes other features such as training and evaluation make it complicated.

The core features include:
- Huggingface transformers support.
- ggml (llama.cpp) support.
- A distributed serving system for state-of-the-art models.
- Streaming generation support with various decoding strategies.
- Batch inference for higher throughput.
- Support for encoder-only, decoder-only and encoder-decoder models.
- OpenAI-compatible RESTful APIs.
- FauxPilot-compatible RESTful APIs.
- HuggingFace-compatible RESTful APIs.
- Tabby-compatible RESTful APIs.

## Support Model Architectures
* LLaMa, LLaMa2, GLM, Bloom, OPT, GPT2, GPT Neo, GPT Big Code and so on.

## Tested Models
* NingYu, LLaMa, LLaMa2-chat, Vicuna, ChatGLM, ChatGLM2, Falcon, Starcoder, WizardLM, InternLM, OpenBuddy, FireFly, CodeGen, Phoenix, RWKV, StableLM and so on.


## Benchmark
We use single RTX3090 to run a finetuned 7B LLaMA model (OpenBuddy V0.9) under the bf16 setting.
We create 32 threads to submit chat tasks to the server, and the following figure shows the Queries Per Second (QPS) and Tokens Per Second (TPS) of FastChat and LangPort with different max model concurrency settings.

![benchmark_chat](assets/benchmark_chat.jpg)

## News
- [2023/08/04] Dynamic batch inference.
- [2023/07/16] Support int4 quantization.
- [2023/07/13] Support generation logprobs parameter.
- [2023/06/18] Add ggml (llama.cpp gpt.cpp starcoder.cpp etc.) worker support.
- [2023/06/09] Add LLama.cpp worker support.
- [2023/06/01] Add HuggingFace Bert embedding worker support.
- [2023/06/01] Add HuggingFace text generation API support.
- [2023/06/01] Add tabby API support.
- [2023/05/23] Add chat throughput test script.
- [2023/05/22] New distributed architecture.
- [2023/05/14] Batch inference supported.
- [2023/05/10] Langport project started.


## Install

### Method 1: With pip

```bash
pip install langport
```

or:

```bash
pip install git+https://github.com/vtuber-plan/langport.git 
```

If you need ggml generation worker, use this command:
```bash
pip install langport[ggml]
```

If you wanna use GPU:
```bash
CT_CUBLAS=1 pip install langport[ggml]
```

### Method 2: From source

1. Clone this repository
```bash
git clone https://github.com/vtuber-plan/langport.git
cd langport
```

2. Install the Package
```bash
pip install --upgrade pip
pip install -e .
```

## Start the server

It is simple to start a single node chat API service:
``` bash
python -m langport.service.server.generation_worker --port 21001 --model-path <your model path>
python -m langport.service.gateway.openai_api
```

If you need the embeddings API or other features, you can deploy a distributed inference cluster:
``` bash
python -m langport.service.server.dummy_worker --port 21001
python -m langport.service.server.generation_worker --model-path <your model path> --neighbors http://localhost:21001
python -m langport.service.server.embedding_worker --model-path <your model path> --neighbors http://localhost:21001
python -m langport.service.gateway.openai_api --controller-address http://localhost:21001
```

In practice, the gateway can connect to any node to distribute inference tasks:

``` bash
python -m langport.service.server.dummy_worker --port 21001
python -m langport.service.server.generation_worker --port 21002 --model-path <your model path> --neighbors http://localhost:21001
python -m langport.service.server.generation_worker --port 21003 --model-path <your model path> --neighbors http://localhost:21001 http://localhost:21002
python -m langport.service.server.generation_worker --port 21004 --model-path <your model path> --neighbors http://localhost:21001 http://localhost:21003
python -m langport.service.server.generation_worker --port 21005 --model-path <your model path> --neighbors http://localhost:21001 http://localhost:21004
python -m langport.service.gateway.openai_api --controller-address http://localhost:21003 # 21003 is OK!
python -m langport.service.gateway.openai_api --controller-address http://localhost:21002 # Any worker is also OK!
```

Run text generation with multi GPUs:

``` bash
python -m langport.service.server.generation_worker --port 21001 --model-path <your model path> --gpus 0,1 --num-gpus 2
python -m langport.service.gateway.openai_api
```

Run text generation with ggml worker:

```bash
python -m langport.service.server.ggml_generation_worker --port 21001 --model-path <your model path> --gpu-layers <num layer to gpu (resize this for your VRAM)>
```

Run OpenAI forward server: 
```bash
python -m langport.service.server.chatgpt_generation_worker --port 21001 --api-url <url> --api-key <key>
```


## License

langport is released under the Apache Software License.


## See also

- [langport-docs](https://github.com/vtuber-plan/langport/tree/main/docs)
- [langport-source](https://github.com/vtuber-plan/langport)


## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=vtuber-plan/langport&type=Date)](https://star-history.com/#vtuber-plan/langport&Date)

