Deepseek-v3 Leaked: Outperforming Claude 3.5 Sonnet in Programming
Deepseek-v3: A Surprise Contender in the LLM Arena
The world of large language models (LLMs) has been shaken by the unexpected emergence of Deepseek-v3, a model that was not officially announced but has already made significant waves due to its impressive performance. Leaked through various APIs and web pages, this model is quickly gaining recognition for its capabilities, particularly in programming benchmarks.
Core Highlights of Deepseek-v3
- Unexpected Leak: Deepseek-v3 was not officially released but was discovered by Reddit users on various APIs and web pages.
- Benchmark Performance: It has surpassed Claude 3.5 Sonnet on the Aider multilingual programming benchmark.
- Top Open-Source LLM: Currently, Deepseek-v3 is considered the strongest open-source LLM on the LiveBench evaluation platform.
- Advanced Architecture: The model features a 685B parameter Mixture of Experts (MoE) structure, showcasing major advancements over previous iterations.
Background of the Leak
The discovery of Deepseek-v3 was initially reported by Reddit users who found the model accessible through various APIs and web interfaces. This unexpected availability led to rapid testing and analysis, revealing its superior capabilities. The model's performance was assessed using several benchmarks, including Aider and LiveBench, which confirmed its leading position among open-source LLMs. The model's open-source weights are now up on Hugging Face, though a model card has yet to be published.
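Because the weights ship without a model card, loading them means trusting the custom modeling code bundled with the checkpoint. Here is a minimal sketch using the standard `transformers` loading path; the repo id `deepseek-ai/DeepSeek-V3` is an assumption, and at this parameter count the call is illustrative rather than something to run on a single GPU:

```python
# Hedged sketch: assumes the Hugging Face repo id "deepseek-ai/DeepSeek-V3"
# and that the checkpoint ships its own modeling code (hence trust_remote_code).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "deepseek-ai/DeepSeek-V3"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,  # no model card yet; the custom architecture code is required
    torch_dtype="auto",
    device_map="auto",       # at 685B parameters this realistically needs a multi-GPU cluster
)
```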
Technical Deep Dive into Deepseek-v3
Model Architecture
- Parameter Size: 685 billion parameters
- MoE Structure: Mixture of Experts architecture with 256 experts
- Routing: Utilizes a sigmoid function for routing, selecting the top 8 experts per token (Top-k=8); a minimal sketch follows this list
- Context Window: Supports a 64K context window; output length defaults to 4K tokens with a maximum of 8K
- Token Generation Speed: Approximately 60 tokens per second
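Taken together, these numbers imply heavily sparse activation: each token is processed by only 8 of the 256 experts. A minimal sketch of sigmoid-based Top-k routing over a linear gate, where the hidden size and random weights are illustrative rather than DeepSeek's actual values:

```python
# Minimal routing sketch (not DeepSeek's actual code): a sigmoid gate scores
# all 256 experts per token, and only the top 8 are activated.
import torch

n_experts, top_k, hidden = 256, 8, 1024  # hidden size is illustrative

def route(x: torch.Tensor, gate_weight: torch.Tensor):
    """x: (tokens, hidden); gate_weight: (n_experts, hidden)."""
    logits = x @ gate_weight.t()             # (tokens, n_experts)
    scores = torch.sigmoid(logits)           # sigmoid gate, not softmax (see next section)
    weights, idx = scores.topk(top_k, dim=-1)
    weights = weights / weights.sum(-1, keepdim=True)  # renormalize the 8 winners
    return weights, idx

x = torch.randn(4, hidden)
gate_weight = torch.randn(n_experts, hidden)
w, idx = route(x, gate_weight)
print(idx.shape)  # torch.Size([4, 8]) -- 8 experts chosen per token
```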
Key Architectural Changes Compared to V2
The evolution from Deepseek-v2 to v3 involves several crucial architectural enhancements, particularly in expert selection and training efficiency. These changes are pivotal to understanding the performance leap seen in Deepseek-v3.
- Gate Function: Instead of the softmax function used in v2, Deepseek-v3 employs a sigmoid function for expert selection. Because sigmoid scores each expert independently rather than forcing all scores to compete for a fixed probability mass, the model can draw on a broader range of experts, whereas softmax tends to concentrate weight on a favored few.
- Top-k Selection: Deepseek-v3 introduces a novel `noaux_tc` method for Top-k selection, eliminating the need for an auxiliary balancing loss. This simplifies training and boosts efficiency by relying on the main task's loss function directly.
- Expert Score Adjustment: A new parameter, `e_score_correction_bias`, has been added to fine-tune expert scores, improving expert selection and model training (see the sketch after this list).
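One plausible way these two pieces fit together: the correction bias shifts which experts win the Top-k race (standing in for the auxiliary balancing loss), while the mixing weights still come from the raw sigmoid scores. This is a reading of the description above, not DeepSeek's verbatim implementation; `noaux_tc_select` and its argument names are hypothetical:

```python
# Hedged sketch of aux-loss-free ("noaux_tc") selection with a score-correction
# bias. The bias nudges *which* experts get selected (load balancing), while the
# combine weights use the uncorrected sigmoid scores. Exact details may differ.
import torch

def noaux_tc_select(scores: torch.Tensor, bias: torch.Tensor, top_k: int = 8):
    """scores: (tokens, n_experts) sigmoid gate outputs; bias: (n_experts,)."""
    _, idx = (scores + bias).topk(top_k, dim=-1)   # selection uses corrected scores
    weights = scores.gather(-1, idx)               # mixing uses the raw scores of the winners
    weights = weights / weights.sum(-1, keepdim=True)
    return weights, idx

scores = torch.sigmoid(torch.randn(4, 256))  # sigmoid gate scores, as in the routing sketch
bias = torch.zeros(256)                      # stands in for e_score_correction_bias
w, idx = noaux_tc_select(scores, bias)
```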
Comparison With V2 and V2.5
The enhancements in Deepseek-v3 are evident when compared to its previous versions, v2 and v2.5. These comparisons highlight the strides made in model configuration and overall performance.
- v3 vs v2: Deepseek-v3 is essentially a substantially scaled-up and enhanced version of v2, with improvements across virtually every configuration parameter.
- v3 vs v2.5: Deepseek-v3 surpasses v2.5 in several key configuration aspects, including a higher number of experts, larger intermediate layer sizes, and a greater number of experts per token.
User Testing and Observations
Initial user tests and observations have provided valuable insights into Deepseek-v3's capabilities and some unexpected behaviors. These tests have been instrumental in understanding the model's strengths and quirks.
Initial Tests
Simon Willison, a developer, conducted initial tests on Deepseek-v3 and discovered that the model identifies itself as being based on OpenAI's GPT-4 architecture, an unexpected result that prompted further investigation. He also ran his informal drawing test: asked for an SVG of a pelican riding a bicycle, the model successfully produced SVG markup, a code-generation task rather than true image generation.
Unexpected Self-Identification
Multiple users have reported that Deepseek-v3 identifies itself as being based on OpenAI models. This could be attributed to the use of OpenAI model responses during its training phase. This unexpected behavior has raised questions about the model's training data and methodologies.
Community Reaction
The unexpected release and impressive performance of Deepseek-v3 have generated considerable excitement within the community. Many users are impressed by the model's capabilities and its potential impact on the open-source LLM landscape.
Surpassing OpenAI Models?
Some users have expressed the belief that Deepseek-v3's performance surpasses that of OpenAI's models, especially in the open-source domain. This has sparked discussions and further testing to validate these claims. The model's availability and open-source nature make it a compelling alternative for researchers and developers.
Additional Resources
For those interested in further exploring Deepseek-v3, here are some additional resources: