Mistral CodeStral Tops Leaderboards with 256k Context Window

Author: Ajax

Mistral's CodeStral Achieves Top Ranking

Mistral, often called the 'European OpenAI', has launched an updated version of its code model, CodeStral. The new iteration has rapidly climbed to the top of the Copilot Arena, sharing first place with DeepSeek V2.5 and Claude 3.5 Sonnet. A headline upgrade is the context window, expanded eightfold to 256k tokens.

Enhanced Performance and Speed

The new CodeStral (2501) features a more efficient architecture and tokenizer, doubling generation speed compared to its predecessor. It has also achieved state-of-the-art (SOTA) results across various benchmarks and demonstrates strong fill-in-the-middle (FIM) code completion. According to Mistral's partner Continue.dev, the 2501 version represents a major leap forward in FIM.
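To make the FIM idea concrete: the model receives the code before and after a gap and is asked to generate only the span in between. A minimal sketch of this prompt layout is below; the sentinel strings are placeholders for illustration, not CodeStral's actual control tokens.

```python
# Illustrative fill-in-the-middle (FIM) prompt layout. The <fim_*> sentinels
# here are generic placeholders; CodeStral's real special tokens differ.
prefix = "def add(a, b):\n    "
suffix = "\n    return result"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# The model generates the span between prefix and suffix,
# e.g. "result = a + b", which the editor then splices into the file.
```

This prefix/suffix conditioning is what lets FIM-capable models power in-editor completion, where code usually exists on both sides of the cursor.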

Copilot Arena Victory

CodeStral 2501 has secured the top spot in the Copilot Arena, a competitive platform for code models, tying with DeepSeek V2.5 and Claude 3.5 Sonnet. This represents a 12-point (1.2%) improvement over the previous CodeStral version (2405). While models such as Llama 3.1, Gemini 1.5 Pro, and GPT-4o rank lower, o1 has not yet competed, so the rankings could change with its inclusion.

Copilot Arena Details

The Copilot Arena was established last November through a collaboration between researchers at Carnegie Mellon University and UC Berkeley, along with LMArena. It operates similarly to the LLM Arena, where users submit problems, and the system randomly selects two models to provide anonymous outputs. Users then choose the better output. As a code-specific version of the LLM Arena, Copilot Arena also serves as an open-source programming tool that allows users to compare multiple models simultaneously in VSCode. Currently, 12 code models have competed in over 17,000 battles.
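Leaderboards built from anonymous pairwise battles like these are typically derived with a rating scheme such as Elo. The sketch below shows the standard Elo update under that assumption; Copilot Arena's actual scoring method may differ (arena-style leaderboards often fit a Bradley-Terry model instead).

```python
# Sketch of turning pairwise "battles" into a leaderboard via Elo updates.
# This is an assumption about the general technique, not Copilot Arena's
# exact implementation.

def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """Return updated (winner, loser) ratings after one battle."""
    expected_win = 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)  # bigger reward for upset wins
    return r_winner + delta, r_loser - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
battles = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in battles:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
```

After these three battles, model_a (two wins) ends above model_b; ties at the top of such a leaderboard mean the rating gap is within noise, as with CodeStral, DeepSeek V2.5, and Claude 3.5 Sonnet here.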

SOTA Results Across Multiple Benchmarks

Mistral has also announced that CodeStral 2501 has achieved SOTA results on several metrics in traditional tests such as HumanEval. The models chosen for comparison were those with fewer than 100B parameters, a class generally considered strong at FIM tasks. Furthermore, the context window has grown from 32k in the 2405 version (22B parameters) to 256k in the new version. In tests involving Python and SQL, CodeStral 2501 consistently ranked first or second across multiple metrics.

Language Performance

CodeStral, which reportedly supports over 80 languages, achieved an average HumanEval score of 71.4%, nearly 6 percentage points higher than the second-place model. It has also attained SOTA status in common languages such as Python, C++, and JS, and exceeded 50% in C#. Notably, CodeStral 2501's Java performance has declined compared to its predecessor.

FIM Performance

The Mistral team also released FIM performance data for CodeStral 2501, measured by single-line exact match. The average score and the individual Python, Java, and JS scores all improved over the prior version and surpass other models such as the OpenAI FIM API (3.5 Turbo), with DeepSeek a close competitor. The FIM pass@1 results show similar trends.
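Single-line exact match is a strict metric: a completion scores only if it reproduces the reference line exactly. A minimal sketch, assuming whitespace on either end is normalized (details of Mistral's evaluation harness are not public in this article):

```python
# Minimal sketch of a single-line exact-match FIM metric: each predicted
# line counts only if it equals the reference line exactly after stripping
# surrounding whitespace. Normalization choices are an assumption.

def single_line_exact_match(predictions, references):
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

score = single_line_exact_match(
    ["result = a + b", "return x"],
    ["result = a + b", "return x + 1"],  # second prediction misses
)
# score == 0.5: one of two lines matched exactly
```

By contrast, pass@1 executes the completed code against tests, so the two metrics can diverge; the article notes CodeStral 2501 leads on both.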

Availability

CodeStral 2501 is accessible through Mistral's partner Continue for use in VSCode or JetBrains IDEs. Users can also deploy it via API, priced at 0.3 USD/EUR per million input tokens and 0.9 USD/EUR per million output tokens.
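For API users, a request is a JSON body naming the model plus the prefix and suffix around the gap to fill. The sketch below constructs such a payload; the field names and the `codestral-latest` model identifier follow Mistral's public FIM API as I understand it, but should be verified against the current documentation before use.

```python
import json

# Hedged sketch of a CodeStral FIM request body. Field names and the model
# identifier are assumptions based on Mistral's public API docs; verify
# against the current reference before relying on them.
payload = {
    "model": "codestral-latest",
    "prompt": "def fibonacci(n):\n    ",   # code before the cursor
    "suffix": "\n    return result",       # code after the cursor
    "max_tokens": 64,
}
body = json.dumps(payload)

# In practice this body is sent via HTTP POST to Mistral's FIM endpoint
# with an "Authorization: Bearer <API key>" header; the generated middle
# span comes back in the response.
```

At 0.3/0.9 per million input/output tokens, a typical completion request of a few hundred input tokens costs a small fraction of a cent.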