Using a large open coding model with a serverless GPU
Jan 31, 2026
I use Claude Code a lot now. It’s nice to interact with a codebase by describing the action I want to achieve, like “change this function to use batches”.
As a senior engineer, I get to move much faster.
It’s here to stay.
But I love open-source and self-hosting and being independent, so I wanted to ensure that I could keep having access to this new way of working, even if Anthropic decided to charge $1,000,000 / month.
Modal offers serverless GPU inference, so you can pay by the second for access to GPUs that would otherwise cost tens of thousands of dollars to purchase yourself or thousands of dollars per month to rent.
There are other similar services, like RunPod and Replicate, which I like quite a lot, but Modal was the easiest to spin up for this project.
So I wanted to set up a coding model like Devstral-2-123b and wire it up to OpenCode. It might be a lot worse than Opus 4.5, but at least I would keep this action-driven way of working, and it would show me a bit about how these systems run.
I thought this would take a while to learn and configure, but it took no time at all. I still have more to learn about how vLLM works under the hood, but as far as running it goes, this was too easy.
Download this file, then install the Modal CLI and run modal deploy your-file.py - it will output a URL.
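For reference, here is roughly what that file contains. This is a minimal sketch adapted from Modal’s vLLM tutorial rather than my exact script: the app name, volume name, and model id are placeholders, and you should double-check the decorator arguments against Modal’s current docs.

```python
import subprocess

import modal

MODEL_ID = "your-org/your-model"  # placeholder: the Hugging Face repo id you want to serve
N_GPU = 2
VLLM_PORT = 8000

# Container image with vLLM plus faster Hugging Face downloads.
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm", "huggingface_hub[hf_transfer]")
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

# Persist downloaded weights between runs so cold starts don't re-fetch 100+ GB.
hf_cache = modal.Volume.from_name("hf-cache", create_if_missing=True)

app = modal.App("opencode-vllm")


@app.function(
    image=image,
    gpu=f"H200:{N_GPU}",
    volumes={"/root/.cache/huggingface": hf_cache},
    # If the model is gated, add secrets=[modal.Secret.from_name("your-hf-secret")].
    timeout=60 * 60,
    scaledown_window=15 * 60,  # spin the container down after 15 idle minutes
)
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * 60)
def serve():
    # Launch vLLM's OpenAI-compatible server; Modal proxies the port as a public URL.
    cmd = [
        "vllm", "serve", MODEL_ID,
        "--host", "0.0.0.0",
        "--port", str(VLLM_PORT),
        "--tensor-parallel-size", str(N_GPU),
    ]
    subprocess.Popen(cmd)
```

Once deployed, Modal prints the URL for that serve endpoint, which is what you point OpenCode at below.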
To wire it up in OpenCode, edit your config file ~/.config/opencode/opencode.json and add something like this:
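Here’s a sketch of what that addition can look like. I’m assuming OpenCode’s openai-compatible custom-provider format and vLLM’s /v1 path; the provider key and model entry are placeholders (the model key should match the id your vLLM server actually serves), so check OpenCode’s provider docs if the fields have moved.

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "modal": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Modal vLLM",
      "options": {
        "baseURL": "${YOUR_URL}/v1"
      },
      "models": {
        "your-org/your-model": {
          "name": "Devstral 2 123B on Modal"
        }
      }
    }
  }
}
```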
Replace ${YOUR_URL} with the URL that Modal gave you. Then run opencode in whatever directory you want to work in, type /connect, and enter something like “modal” to choose your new server.
Then let rip with your new coding agent!
It’s not quite as good as Opus or Sonnet, but with an engineer driving, it can make meaningful contributions.
You might notice that this is almost identical to the tutorial that Modal published, because it is. I only had to make a few tiny changes to run a far more powerful model than the very small Qwen3-4B.
That is why this is barely a tutorial - more a record of my thoughts and what I achieved.
Some caveats:
No auth! Shut this down when unused and do not share your URL. At minimum add auth with vLLM’s built-in API key, or realistically put a full reverse proxy in front (see the sketch after these caveats).
Two H200s might be a bit overspecced, but two H100s might be underspecced.
vLLM supports continuous batching across requests, so running all of this for a single user is potentially very inefficient.
While this is specced specifically for Modal, it’s pretty obvious how the parts come together, and there are no secret ingredients: a Docker container, some packages, download some weights, launch a server.
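On the auth caveat above: one low-effort option, and this is my assumption rather than anything from Modal’s tutorial, is vLLM’s built-in API key check, which rejects requests whose bearer token doesn’t match. Inside the serve function that would look roughly like this, where VLLM_API_KEY is just a name I picked for however you inject the secret (e.g. a modal.Secret):

```python
import os
import subprocess

# Sketch: require a matching Authorization header on the vLLM server.
cmd = [
    "vllm", "serve", "your-org/your-model",
    "--host", "0.0.0.0",
    "--port", "8000",
    "--api-key", os.environ["VLLM_API_KEY"],
]
subprocess.Popen(cmd)
```

I believe you can then set a matching apiKey in the provider options in opencode.json, though I’d still treat this as a stopgap compared to a real reverse proxy.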
Now I can keep using coding models no matter what changes in the industry.
I still review and understand every line of code that comes out the other side; it’s my responsibility to make sure it is good. But it is going well, and this is a nice way of working. It challenges me to take on both more tasks and harder ones.