Parallel/concurrent requests with vLLM

My question might be a bit basic, but I’m new to all of this and eager to learn.

I have a basic setup where I initialize an LLM using vLLM with LangChain RAG and a Llama model (specifically, llama2-13b-chat-hf). Here’s what I do:

  • I define a system prompt and an instruction f
  • I create an llm_chain
  • I then run the chain with llm_chain.run(text), which works for a single input.

I have built an app with FastAPI. Previously I used asyncio to handle multiple requests to the LLM, but responses became slower with each new request. So I decided to switch to vLLM, but now I have a problem: how do I send parallel or concurrent requests to vLLM when I am dealing with a dozen or more users? Is there a way to call run in parallel for several inputs and receive a valid result for each input?
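To make the question concrete, here is a minimal sketch of one pattern that might fit: collect prompts from concurrent callers, run them as a single batch, and hand each caller its own result. It uses only the standard library. MicroBatcher, fake_generate and demo are hypothetical names made up for illustration, and fake_generate is a stand-in for the real model call. As far as I understand, vLLM's offline LLM.generate accepts a list of prompts and batches them internally, so a thin wrapper around it could plausibly be passed as batch_fn, but that wiring is an assumption and is not shown or tested here.

```python
import asyncio
from typing import Callable, List


def fake_generate(prompts: List[str]) -> List[str]:
    # Hypothetical stand-in for a blocking batch call to the model.
    # It returns one output per prompt, in the same order.
    return [f"echo: {p}" for p in prompts]


class MicroBatcher:
    """Groups prompts from concurrent callers into one batch call."""

    def __init__(self, batch_fn: Callable[[List[str]], List[str]],
                 max_batch: int = 8, max_wait_s: float = 0.01) -> None:
        self.batch_fn = batch_fn
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        # Each caller gets its own future, so results cannot be mixed up.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Wait for the first request, then give others a moment to arrive.
            items = [await self.queue.get()]
            await asyncio.sleep(self.max_wait_s)
            while len(items) < self.max_batch and not self.queue.empty():
                items.append(self.queue.get_nowait())

            prompts = [prompt for prompt, _ in items]
            try:
                # Run the blocking batch call in a worker thread so the
                # event loop stays free to accept new requests.
                results = await loop.run_in_executor(None, self.batch_fn, prompts)
            except Exception as exc:
                for _, fut in items:
                    if not fut.done():
                        fut.set_exception(exc)
                continue

            for (_, fut), result in zip(items, results):
                if not fut.done():
                    fut.set_result(result)


async def demo() -> List[str]:
    # Create the batcher inside the running loop, then start its worker.
    batcher = MicroBatcher(fake_generate)
    worker = asyncio.create_task(batcher.run())
    # Five "users" submit at the same time; gather keeps the input order.
    answers = await asyncio.gather(
        *(batcher.submit(f"question {i}") for i in range(5))
    )
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return list(answers)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a FastAPI app, the worker would presumably be started once at startup and each request handler would await batcher.submit(text). Another option that may be simpler is to run vLLM's OpenAI-compatible HTTP server as a separate process and send concurrent requests to it from the app, since the server handles batching of concurrent requests itself.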

Hi,
Have you found a solution for this?

Hi, have you found a solution for this? I am facing the same issue.

Hello, unfortunately no, so I decided to move on without vLLM.