When using frontier models like Claude 4 Sonnet to edit your codebase, you're paying premium rates for both the valuable changes and the unchanged sections alike. Instant Apply is about separation of concerns: use heavyweight frontier models only for the new sections of code, and use a lightweight apply model to merge the new into the old. Our Instant Apply model performs these merges precisely while running at 10,000 tok/s on average.
We use this abbreviated edit snippet format because it's one that all LLMs are naturally good at producing. Structured diff formats like uDiff or search and replace (S&R) can be applied deterministically, but their formatting error rates are high. Even the best models fail ~8-10% of the time, and the rate is much worse for more economical models like GPT-4.1 mini and Haiku. Formatting errors also climb in the context of a workflow, where all edits must be represented in one step.
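To make the format concrete, here is a minimal sketch of an abbreviated edit snippet next to the file it targets. The file contents and names are invented for illustration; only the lazy `// ... rest of code ...` marker reflects the actual format.

```typescript
// Hypothetical example: the original file the frontier model is editing.
const initialCode = `
export function getUser(id: string) {
  return db.users.find(id);
}

export function listUsers() {
  return db.users.all();
}
`;

// The frontier model emits only the changed section, eliding unchanged
// code with a lazy marker instead of reprinting the whole file.
const editSnippet = `
export function getUser(id: string) {
  if (!id) throw new Error("id is required");
  return db.users.find(id);
}

// ... rest of code ...
`;

// The apply model's job is to merge editSnippet into initialCode,
// expanding the marker back into the untouched listUsers function.
console.log({ initialCode, editSnippet });
```

Because the snippet is plain code with a natural-language elision marker, there is no brittle line-number or hunk syntax for the frontier model to get wrong.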
Instant Apply is trained on a wide range of abbreviated edit snippets to make it SoTA for merging code. For the above plot, we manually measured errors for different models on a set of 100 examples across 5 different programming languages. The rate was calculated by counting instances of the following (a rough sketch of the automatable checks follows the list):
Syntax Errors: Merged code is not syntactically valid, e.g. missing imports, unclosed brackets, etc.
Hallucinations: Model included code that was not explicitly defined in the edit snippet.
Truncations: Merged code didn’t appropriately fill in a // ... rest of code ... block.
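The first and third categories lend themselves to automated checks; hallucinations still require reading the output against the edit snippet. Below is a rough sketch of what such checks could look like. The use of the TypeScript compiler for the syntax check is our own choice for the example, not a description of the actual evaluation harness.

```typescript
import ts from "typescript";

// Syntax check: transpileModule with reportDiagnostics surfaces parse
// errors such as unclosed brackets without running a full type check.
function hasSyntaxErrors(mergedCode: string): boolean {
  const result = ts.transpileModule(mergedCode, {
    reportDiagnostics: true,
    compilerOptions: { module: ts.ModuleKind.ESNext },
  });
  return (result.diagnostics ?? []).length > 0;
}

// Truncation check: a lazy marker left in the merged output means the
// apply model failed to expand it back into the original code.
function hasTruncation(mergedCode: string): boolean {
  return /\/\/\s*\.\.\..*(rest of|existing) code.*\.\.\./i.test(mergedCode);
}

console.log(hasSyntaxErrors("function broken( {")); // true
console.log(hasTruncation("// ... rest of code ...")); // true
```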
Even compared to a very capable model like Claude 3.5 Sonnet, Relace made fewer than half as many errors.
Our model is deployed with speculative decoding on an optimized inference engine that achieves >10,000 tok/s on average. This is two orders of magnitude faster than any Anthropic or OpenAI model, and four times as fast as models run on specialized silicon (e.g. Cerebras). In practice, because speculative decoding's speedup depends on how many drafted tokens are accepted, you may observe variance in speed depending on the complexity of the edit snippet. Here's the distribution of latency across n=500 requests:
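One way to reproduce this kind of distribution is to time repeated apply calls and compute percentiles. The endpoint URL and request payload below are placeholders for illustration, not the documented Relace API.

```typescript
// Placeholder endpoint and payload shape: substitute the real API you call.
const APPLY_URL = "https://api.example.com/v1/apply";

async function timeApply(initialCode: string, editSnippet: string): Promise<number> {
  const start = performance.now();
  await fetch(APPLY_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ initialCode, editSnippet }),
  });
  return performance.now() - start; // latency in ms
}

async function latencyDistribution(n: number): Promise<void> {
  const samples: number[] = [];
  for (let i = 0; i < n; i++) {
    samples.push(
      await timeApply("const x = 1;", "const x = 2;\n// ... rest of code ...")
    );
  }
  samples.sort((a, b) => a - b);
  const pct = (p: number) => samples[Math.floor((p / 100) * (samples.length - 1))];
  console.log({ p50: pct(50), p95: pct(95), p99: pct(99) });
}

latencyDistribution(500);
```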