Why Margin lets you self-host the model
If you're a sales leader at a bank, your meeting transcripts are some of the most sensitive data your company holds. They contain unannounced acquisitions, customer pricing, internal conflict — the lot. The default for AI notetakers is to send all of that to a third-party LLM API. That's a non-starter for most large enterprises.
Margin's Enterprise tier ships a self-hosted ML service. The LLM runs in your VPC, on your GPUs (or ours, behind a private link). Transcripts never leave your network. The product is exactly the same — same UI, same routing, same Ask Margin.
We chose Ollama as the inference layer because it keeps the model behind one small, stable API. Our `services/ml` directory has a thin client that calls Ollama for extraction, embeddings, RAG, and live coaching. Swap in vLLM, swap in TGI, swap in a Llama 4 model on a single H100: none of it requires a code change.
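To make that concrete, here's a minimal sketch of what a thin client like that can look like. It is not our production code: the function names and the `OLLAMA_BASE_URL` env var are placeholders, and it assumes an Ollama server reachable inside the VPC. The point is the shape: everything above this layer asks for a completion or an embedding and never knows which model answered.

```python
# Illustrative sketch of a thin inference client. Names and env vars are
# placeholders, not Margin's actual services/ml code. Assumes an Ollama
# server reachable at OLLAMA_BASE_URL inside the VPC.
import os
import requests

OLLAMA_URL = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")


def chat(model: str, system: str, user: str, timeout: int = 120) -> str:
    """One non-streaming chat completion against Ollama's /api/chat."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            "stream": False,
        },
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]


def embed(model: str, text: str, timeout: int = 30) -> list[float]:
    """One embedding vector from Ollama's /api/embeddings."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]
```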
The model defaults are qwen2.5:7b for spine extraction, qwen2.5:3b for live whispers, and nomic-embed-text for embeddings. All open weights, all small, and all of them run on a single GPU. For workspaces that want frontier quality, hosted models stay available as a backstop.
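Those defaults live in config, not code. Here's a hedged sketch of what that surface can look like; the env var names are ours for illustration, not Margin's real configuration:

```python
# Illustrative model defaults, overridable per deployment. The env var
# names are assumptions for this sketch, not Margin's real config surface.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Spine extraction model.
    extraction_model: str = os.environ.get("EXTRACTION_MODEL", "qwen2.5:7b")
    # Live coaching whispers: small enough to keep up with a live call.
    live_coach_model: str = os.environ.get("LIVE_COACH_MODEL", "qwen2.5:3b")
    # Embeddings for search and RAG.
    embedding_model: str = os.environ.get("EMBEDDING_MODEL", "nomic-embed-text")
    # Opt-in backstop: route to a hosted frontier model for workspaces
    # that choose quality over strict data locality.
    hosted_fallback: bool = os.environ.get("HOSTED_FALLBACK", "false").lower() == "true"
```

In a setup like this, pointing extraction at a larger model, or flipping the hosted backstop on for one workspace, is an env change, not a deploy.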
This is the part of the product that should not be in the marketing copy. It should just be true. So it is.