When using frontier models like Claude 4 Sonnet to edit your codebase, you're paying premium rates for both the valuable changes and the unchanged sections alike. Instant Apply is about separation of concerns: use heavyweight frontier models only for the new sections of code, and use a lightweight apply model to merge the new into the old. Our Instant Apply model performs these merges precisely while running at 10,000 tok/s on average.
We use this abbreviated edit snippet format because it's one that all LLMs are naturally good at producing. Structured diff formats like uDiff or search and replace (S&R) can be applied deterministically, but their formatting error rates are high. Even the best models fail ~8-10% of the time, and the rate is much worse for more economical models like GPT-4.1 mini and Haiku. Formatting errors also climb in the context of a workflow, where all edits must be represented in one step.
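To make the format concrete, here is a minimal sketch of an abbreviated edit snippet next to the file it targets. The file contents and names are invented for illustration; only the lazy `// ... rest of code ...` marker reflects the actual format.

```typescript
// Hypothetical example: the original file the frontier model is editing.
const initialCode = `
export function getUser(id: string) {
  return db.users.find(id);
}

export function listUsers() {
  return db.users.all();
}
`;

// The frontier model emits only the changed section, eliding unchanged
// code with a lazy marker instead of reprinting the whole file.
const editSnippet = `
export function getUser(id: string) {
  if (!id) throw new Error("id is required");
  return db.users.find(id);
}

// ... rest of code ...
`;

// The apply model's job is to merge editSnippet into initialCode,
// expanding the marker back into the untouched listUsers function.
console.log({ initialCode, editSnippet });
```

Because the snippet is plain code with a natural-language elision marker, there is no brittle line-number or hunk syntax for the frontier model to get wrong.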
Instant Apply is trained on a wide range of abbreviated edit snippets to make it SoTA for merging code. For the above plot, we manually measured errors for different models on a set of 100 examples across 5 different programming languages. The rate was calculated by counting instances of the following (a rough sketch of the automatable checks follows the list):
Syntax Errors: Merged code is not syntactically valid, e.g. missing imports, unclosed brackets, etc.
Hallucinations: Model included code that was not explicitly defined in the edit snippet.
Truncations: Merged code didn’t appropriately fill in a // ... rest of code ... block.
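The first and third categories lend themselves to automated checks; hallucinations still require reading the output against the edit snippet. Below is a rough sketch of what such checks could look like. The use of the TypeScript compiler for the syntax check is our own choice for the example, not a description of the actual evaluation harness.

```typescript
import ts from "typescript";

// Syntax check: transpileModule with reportDiagnostics surfaces parse
// errors such as unclosed brackets without running a full type check.
function hasSyntaxErrors(mergedCode: string): boolean {
  const result = ts.transpileModule(mergedCode, {
    reportDiagnostics: true,
    compilerOptions: { module: ts.ModuleKind.ESNext },
  });
  return (result.diagnostics ?? []).length > 0;
}

// Truncation check: a lazy marker left in the merged output means the
// apply model failed to expand it back into the original code.
function hasTruncation(mergedCode: string): boolean {
  return /\/\/\s*\.\.\..*(rest of|existing) code.*\.\.\./i.test(mergedCode);
}

console.log(hasSyntaxErrors("function broken( {")); // true
console.log(hasTruncation("// ... rest of code ...")); // true
```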
Even compared to a very capable model like Claude 3.5 Sonnet, Relace made fewer than half as many errors.
Our model is deployed with speculative decoding on an optimized inference engine that achieves >10,000 tok/s on average. This is two orders of magnitude faster than any Anthropic or OpenAI model, and four times as fast as models run on specialized silicon (e.g. Cerebras). In practice, because speculative decoding's speedup depends on how many drafted tokens are accepted, you may observe variance in speed depending on the complexity of the edit snippet. Here's the distribution of latency across n=500 requests:
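One way to reproduce this kind of distribution is to time repeated apply calls and compute percentiles. The endpoint URL and request payload below are placeholders for illustration, not the documented Relace API.

```typescript
// Placeholder endpoint and payload shape: substitute the real API you call.
const APPLY_URL = "https://api.example.com/v1/apply";

async function timeApply(initialCode: string, editSnippet: string): Promise<number> {
  const start = performance.now();
  await fetch(APPLY_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ initialCode, editSnippet }),
  });
  return performance.now() - start; // latency in ms
}

async function latencyDistribution(n: number): Promise<void> {
  const samples: number[] = [];
  for (let i = 0; i < n; i++) {
    samples.push(
      await timeApply("const x = 1;", "const x = 2;\n// ... rest of code ...")
    );
  }
  samples.sort((a, b) => a - b);
  const pct = (p: number) => samples[Math.floor((p / 100) * (samples.length - 1))];
  console.log({ p50: pct(50), p95: pct(95), p99: pct(99) });
}

latencyDistribution(500);
```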