Why Margin lets you self-host the model
If you're a sales leader at a bank, your meeting transcripts are some of the most sensitive data your company holds. They contain unannounced acquisitions, customer pricing, internal conflict — the lot. The default for AI notetakers is to send all of that to a third-party LLM API. That's a non-starter for most large enterprises.
Margin's Enterprise tier ships a self-hosted ML service. The LLM runs in your VPC, on your GPUs (or ours, behind a private link). Transcripts never leave your network. The product is exactly the same — same UI, same routing, same Ask Margin.
We chose Ollama as the inference layer because it keeps the model behind one small, stable API. Our `services/ml` directory has a thin client that calls Ollama for extraction, embeddings, RAG, and live coaching. Swap in vLLM, swap in TGI, swap in a Llama 4 model on a single H100: none of it requires a code change.
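To make that concrete, here's a minimal sketch of what a thin client like that can look like. It is not our production code: the function names and the `OLLAMA_BASE_URL` env var are placeholders, and it assumes an Ollama server reachable inside the VPC. The point is the shape: everything above this layer asks for a completion or an embedding and never knows which model answered.

```python
# Illustrative sketch of a thin inference client. Names and env vars are
# placeholders, not Margin's actual services/ml code. Assumes an Ollama
# server reachable at OLLAMA_BASE_URL inside the VPC.
import os
import requests

OLLAMA_URL = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")


def chat(model: str, system: str, user: str, timeout: int = 120) -> str:
    """One non-streaming chat completion against Ollama's /api/chat."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            "stream": False,
        },
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]


def embed(model: str, text: str, timeout: int = 30) -> list[float]:
    """One embedding vector from Ollama's /api/embeddings."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]
```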
The model defaults are qwen2.5:7b for spine extraction, qwen2.5:3b for live whispers, and nomic-embed-text for embeddings. All open weights, all small, and all of them run on a single GPU. For workspaces that want frontier quality, hosted models stay available as a backstop.
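Those defaults live in config, not code. Here's a hedged sketch of what that surface can look like; the env var names are ours for illustration, not Margin's real configuration:

```python
# Illustrative model defaults, overridable per deployment. The env var
# names are assumptions for this sketch, not Margin's real config surface.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Spine extraction model.
    extraction_model: str = os.environ.get("EXTRACTION_MODEL", "qwen2.5:7b")
    # Live coaching whispers: small enough to keep up with a live call.
    live_coach_model: str = os.environ.get("LIVE_COACH_MODEL", "qwen2.5:3b")
    # Embeddings for search and RAG.
    embedding_model: str = os.environ.get("EMBEDDING_MODEL", "nomic-embed-text")
    # Opt-in backstop: route to a hosted frontier model for workspaces
    # that choose quality over strict data locality.
    hosted_fallback: bool = os.environ.get("HOSTED_FALLBACK", "false").lower() == "true"
```

In a setup like this, pointing extraction at a larger model, or flipping the hosted backstop on for one workspace, is an env change, not a deploy.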
This is the part of the product that should not be in the marketing copy. It should just be true. So it is.