Model Overview
Given a user request for how to change a codebase, you want to retrieve only the files relevant to implementing that request.
This is important for two reasons:
- Polluting the context window with irrelevant files makes the generated code worse.
- The fewer files you pass in, the more you save on input tokens.
We trained our reranker on hundreds of thousands of user query and code pairs to make it best in class for AI codegen applications.
Accuracy Benchmark
We evaluated our model on two retrieval benchmarks — an inhouse dataset consisting of query/codebase pairs for prompt-to-app tasks, and a more general dataset consisting of open source GitHub PRs.
Recall@k tells you of the k relevant files for the example, how many of those relevant files were ranked in the top k results.
For codegen, high recall is essential because failure to pass relevant files into context entirely breaks the generation.
Comparison to Gemini 2.0 Flash-Lite
Many people start doing retrieval by just passing the query/codebase pair into a model with a huge context window, like Gemini Flash-Lite, and use a prompt to score the relevance of each file.
We beat the accuracy of Gemini 2.0 Flash-Lite at 2x the speed and 2/3 the cost.
Choosing a Relevance Threshold
Our code reranker returns a list of objects with filename and relevance score, like this:
You’ll need to choose an appropriate threshold based on the regime you are in:
- Codebase is too big for the context window, and you have to filter down no matter what
- Codebase fits in the context, but you want to filter to save on input tokens
For case 1, only the ordering matters — you basically just feed in all the top files up to the model’s token limit into context.
We have a token_limit
parameter to make this easy. Just set it to the context length of your generation model.
For case 2, you set a relevance threshold based on the recall you want to hit.
Strategy | Recall | Threshold | Token Savings |
---|---|---|---|
Conservative | 99%+ | ≤ 0.02 | ~25% |
Balanced | 95%+ | ≤ 0.08 | ~45% |
Aggressive | 80-90% | 0.15-0.25 | up to 70% |
For more code examples and endpoint info, see the API reference.