Using a large open coding model with a serverless GPU
Jan 31, 2026
I use Claude Code a lot now. It’s nice to interact with a codebase by describing the action I want to achieve, like “change this function to use batches”.
As a senior engineer, I get to move much faster.
It’s here to stay.
But I love open-source and self-hosting and being independent, so I wanted to ensure that I could keep having access to this new way of working, even if Anthropic decided to charge $1,000,000 / month.
Modal offers serverless GPU inference, so you can pay by the second for access to GPUs that would otherwise cost tens of thousands of dollars to purchase yourself or thousands of dollars per month to rent.
There are other similar services, like RunPod and Replicate, which I like quite a lot, but Modal was the easiest to spin up for this project.
So I wanted to set up a coding model like Devstral-2-123b and wire it up to OpenCode. It might be a lot worse than Opus 4.5, but at least I would keep this action-driven way of working, and it would show me a bit about how these systems run.
I thought this would take a while to learn and configure, but it took no time at all. I still have more to learn about how vLLM works under the hood, but as far as running it goes, this was too easy.
Download this file, then install the Modal CLI and run modal deploy your-file.py - it will output a URL.
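For reference, here is roughly what that file contains. This is a minimal sketch adapted from Modal’s vLLM tutorial rather than my exact script: the app name, volume name, and model id are placeholders, and you should double-check the decorator arguments against Modal’s current docs.

```python
import subprocess

import modal

MODEL_ID = "your-org/your-model"  # placeholder: the Hugging Face repo id you want to serve
N_GPU = 2
VLLM_PORT = 8000

# Container image with vLLM plus faster Hugging Face downloads.
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm", "huggingface_hub[hf_transfer]")
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

# Persist downloaded weights between runs so cold starts don't re-fetch 100+ GB.
hf_cache = modal.Volume.from_name("hf-cache", create_if_missing=True)

app = modal.App("opencode-vllm")


@app.function(
    image=image,
    gpu=f"H200:{N_GPU}",
    volumes={"/root/.cache/huggingface": hf_cache},
    # If the model is gated, add secrets=[modal.Secret.from_name("your-hf-secret")].
    timeout=60 * 60,
    scaledown_window=15 * 60,  # spin the container down after 15 idle minutes
)
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * 60)
def serve():
    # Launch vLLM's OpenAI-compatible server; Modal proxies the port as a public URL.
    cmd = [
        "vllm", "serve", MODEL_ID,
        "--host", "0.0.0.0",
        "--port", str(VLLM_PORT),
        "--tensor-parallel-size", str(N_GPU),
    ]
    subprocess.Popen(cmd)
```

Once deployed, Modal prints the URL for that serve endpoint, which is what you point OpenCode at below.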
To wire it up in OpenCode, edit your config file ~/.config/opencode/opencode.json and add something like this:
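Here’s a sketch of what that addition can look like. I’m assuming OpenCode’s openai-compatible custom-provider format and vLLM’s /v1 path; the provider key and model entry are placeholders (the model key should match the id your vLLM server actually serves), so check OpenCode’s provider docs if the fields have moved.

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "modal": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Modal vLLM",
      "options": {
        "baseURL": "${YOUR_URL}/v1"
      },
      "models": {
        "your-org/your-model": {
          "name": "Devstral 2 123B on Modal"
        }
      }
    }
  }
}
```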
Replace ${YOUR_URL} with the URL that Modal gave you. Then run opencode in whatever directory you want to work in, type /connect, and enter something like “modal” to choose your new server.
Then let rip with your new coding agent!
It’s not quite as good as Opus or Sonnet, but with an engineer driving, it can make meaningful contributions.
You might notice that this is almost identical to the tutorial that Modal published, because it is. I only had to make a few tiny changes to run a far more powerful model than the very small Qwen3-4B.
That is why this is barely a tutorial - more a record of my thoughts and what I achieved.
Some caveats:
No auth! Shut this down when unused and do not share your URL. At minimum add auth with vLLM’s built-in API key, or realistically put a full reverse proxy in front (see the sketch after these caveats).
Two H200s might be a bit overspecced, but two H100s might be underspecced.
vLLM supports continuous batching across requests, so running all of this for a single user is potentially very inefficient.
While this is specced specifically for Modal, it’s pretty obvious how the parts come together, and there are no secret ingredients: a Docker container, some packages, download some weights, launch a server.
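On the auth caveat above: one low-effort option, and this is my assumption rather than anything from Modal’s tutorial, is vLLM’s built-in API key check, which rejects requests whose bearer token doesn’t match. Inside the serve function that would look roughly like this, where VLLM_API_KEY is just a name I picked for however you inject the secret (e.g. a modal.Secret):

```python
import os
import subprocess

# Sketch: require a matching Authorization header on the vLLM server.
cmd = [
    "vllm", "serve", "your-org/your-model",
    "--host", "0.0.0.0",
    "--port", "8000",
    "--api-key", os.environ["VLLM_API_KEY"],
]
subprocess.Popen(cmd)
```

I believe you can then set a matching apiKey in the provider options in opencode.json, though I’d still treat this as a stopgap compared to a real reverse proxy.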
Now I can keep using coding models no matter what changes in the industry.
I still review and understand every line of code that comes out the other side; it’s my responsibility to make sure it is good. But it is going well, and this is a nice way of working. It challenges me to take on both more tasks and harder ones.