Microsoft's Phi-4: A Powerful Small Model Outperforming GPT-4o and Ready for Commercial Use
Microsoft's Phi-4: A Leap in Small Language Models
Microsoft Research recently unveiled Phi-4, a small language model that has garnered significant attention for its performance. With only 14 billion parameters, Phi-4 outperforms OpenAI's GPT-4o as well as leading open-source models such as Qwen 2.5-14B and Llama-3.3-70B across a range of benchmarks.
On the American Mathematics Competitions (AMC) tests, Phi-4 scored an impressive 91.8, surpassing many well-known open- and closed-source models, including Gemini Pro 1.5 and Claude 3.5 Sonnet. Its overall performance is even comparable to that of Llama-3.1-405B, a model with 405 billion parameters.
The release has generated considerable excitement in the community, especially since unofficial copies of Phi-4's weights had previously surfaced on Hugging Face. Microsoft has now officially open-sourced Phi-4 under the MIT license, making it available for commercial use; the official release is published on Hugging Face under the name phi-4.
Hugging Face also celebrated the open-sourcing of Phi-4, highlighting its substantial impact.
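For readers who want to try the released weights themselves, below is a minimal sketch of loading Phi-4 with the Hugging Face transformers library. The repository id microsoft/phi-4 and the example prompt are assumptions based on the announcement, so check the official model card before relying on them.

```python
# Minimal sketch: load the open-sourced Phi-4 weights with transformers.
# The repo id "microsoft/phi-4" is assumed from the announcement; verify on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 15% of 240? Answer step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```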
The Key Advantages of Phi-4: Synthetic Data and Refined Training
The remarkable performance of Phi-4, despite its small size, can be attributed to the use of high-quality synthetic data. Unlike traditional web-crawled data, synthetic data provides more structured and progressive learning materials, enabling the model to learn language logic and reasoning more efficiently.
- Structured Learning: Synthetic data is presented step by step, such as in worked solutions to math problems, helping the model grasp the structure of problems and their solutions.
- Context Alignment: Synthetic data closely mirrors the output formats needed in real-world applications, so the model is pre-trained to adapt to practical scenarios. For instance, factual information from online forums is rewritten in a style similar to large language model interactions, making it more natural and coherent in generated conversations (a minimal sketch of this rewriting step follows below).
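To make the context-alignment rewriting concrete, here is a hedged sketch. The prompt wording and the `generate` callable stand in for any chat model; this is an illustration, not Microsoft's actual pipeline.

```python
# Hedged illustration of the rewriting idea described above: a factual forum post is
# turned into a question/answer exchange in an assistant-like style. The prompts and
# the `generate` callable (any chat model) are assumptions for illustration.
from typing import Callable

def rewrite_as_dialogue(forum_post: str, generate: Callable[[str], str]) -> dict:
    question = generate(
        "Write the question that this forum post is answering:\n" + forum_post
    )
    answer = generate(
        "Rewrite the post as a clear, self-contained reply to the question.\n"
        f"Question: {question}\n\nPost:\n{forum_post}"
    )
    return {"user": question, "assistant": answer}
```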
Phi-4's synthetic data generation follows several key principles:
- Diversity
- Nuance and Complexity
- Accuracy
- Chain of Reasoning
These principles guide the quality of the synthetic data, which spans more than 50 broad types of datasets. Microsoft generated approximately 400 billion unweighted tokens using multi-stage prompting, seed curation, rewriting and augmentation, and self-revision techniques.
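Microsoft has not published this pipeline in code, but a rough sketch of what a self-revision pass might look like is shown below; the prompts and the `generate` callable are assumptions for illustration only.

```python
# Illustrative self-revision loop (assumed prompts, not Microsoft's pipeline):
# the model critiques its own draft against the seed text and rewrites it.
from typing import Callable

def self_revise(seed_text: str, draft: str, generate: Callable[[str], str], rounds: int = 2) -> str:
    for _ in range(rounds):
        critique = generate(
            "List any claims in the draft that are unsupported by the source, "
            "and any reasoning errors.\n"
            f"Source:\n{seed_text}\n\nDraft:\n{draft}"
        )
        draft = generate(
            "Rewrite the draft so that every claim is supported and the reasoning is correct.\n"
            f"Issues:\n{critique}\n\nDraft:\n{draft}"
        )
    return draft
```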
In addition to synthetic data, Phi-4 also uses carefully selected and filtered organic data. This data is collected from various sources, including web content, licensed books, and code repositories. A two-stage filtering process identifies seed data with high educational value and reasoning depth. This seed data forms the basis for synthetic data generation and is also used directly in pre-training, further enriching the model's knowledge base.
During the filtering process, Microsoft uses a small classifier-based method to select high-quality documents from large-scale web data. Special processing is applied to multilingual data, ensuring the model can handle languages including German, Spanish, French, Portuguese, Italian, Hindi, and Japanese.
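The filter itself is not released, but a toy sketch conveys the idea of a small classifier scoring documents for educational value; the features, training examples, and threshold below are made up for illustration.

```python
# Toy sketch of a small classifier-based quality filter (illustrative only):
# score documents for "educational value" and keep the high-scoring ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labeled seed set: 1 = high educational value, 0 = low.
train_docs = [
    "Proof: if n is even then n = 2k, so n + 2 = 2(k + 1) is also even.",
    "CLICK HERE for the best deals!!! limited time offer",
]
train_labels = [1, 0]

quality_filter = make_pipeline(TfidfVectorizer(), LogisticRegression())
quality_filter.fit(train_docs, train_labels)

web_docs = ["The derivative of x^2 is 2x, by the power rule.", "best deals best deals best deals"]
scores = quality_filter.predict_proba(web_docs)[:, 1]
kept = [doc for doc, score in zip(web_docs, scores) if score > 0.5]
```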
Phi-4's Training Process
Phi-4's pre-training primarily uses synthetic data, supplemented with a small amount of high-quality organic data. This mixed data strategy enables the model to learn both reasoning and problem-solving skills while also absorbing rich knowledge.
In the mid-training phase, Phi-4 extends its context length from 4,096 to 16,384 tokens to improve its ability to process long texts. The mid-training mix includes samples longer than 8K tokens drawn from high-quality non-synthetic datasets, as well as newly created synthetic datasets whose sequences exceed 4K tokens.
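As a rough sketch of the data-selection side of this phase, the snippet below filters documents by tokenized length; the cutoffs mirror the numbers above, and the tokenizer repo id is an assumption.

```python
# Rough sketch of selecting long samples for the 16K mid-training phase
# (cutoffs mirror the description above; the repo id is an assumption).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")

def long_context_subset(docs, min_tokens=8192, max_tokens=16384):
    for doc in docs:
        length = len(tokenizer.encode(doc))
        if min_tokens < length <= max_tokens:
            yield doc
```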
The post-training phase is crucial for Phi-4's optimization, employing techniques like supervised fine-tuning (SFT) and direct preference optimization (DPO).
SFT Phase: The pre-trained model is fine-tuned on approximately 8 billion tokens of high-quality data spanning various domains, using a learning rate of 10⁻⁶ and multilingual data covering 40 languages, all formatted in ChatML.
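For reference, standard ChatML wraps each conversation turn in explicit role markers. The sketch below shows the generic format; Phi-4's exact special tokens may differ, so consult the released tokenizer's chat template for the real layout.

```python
# Generic ChatML-style formatting for SFT examples. Phi-4's exact special tokens
# may differ; consult the released tokenizer's chat template for the real format.
def to_chatml(messages):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    return "\n".join(parts)

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve 2x + 3 = 7."},
    {"role": "assistant", "content": "Subtract 3: 2x = 4, so x = 2."},
]))
```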
DPO Technique: This method fine-tunes the model on preference pairs, steering its outputs toward responses humans prefer. Microsoft also introduces Pivotal Token Search (PTS) to generate DPO pairs. PTS identifies key tokens that significantly impact the correctness of the model's answers and creates preference data specifically for these tokens, thereby improving the model's performance on reasoning tasks.
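A rough sketch of the idea: estimate how likely the model is to reach a correct final answer from a given prefix, flag tokens where that probability shifts sharply, and turn the good and bad continuations at that point into a preference pair. Everything below (the sampling helper and the example pair) is illustrative, not the paper's implementation.

```python
# Illustrative sketch of the pivotal-token idea (not the paper's implementation):
# a token is "pivotal" if appending it sharply changes the chance of reaching a
# correct final answer; the continuations at that point become a DPO pair.
from typing import Callable

def success_prob(prefix: str, generate: Callable[[str], str],
                 is_correct: Callable[[str], bool], n: int = 8) -> float:
    # Estimate p(correct final answer | prefix) by sampling n completions.
    return sum(is_correct(generate(prefix)) for _ in range(n)) / n

# A token-level preference pair built at a pivotal point (made-up example data):
dpo_pair = {
    "prompt": "What is 15% of 240? Step 1: 10% of 240 is 24. Step 2:",
    "chosen": " 5% of 240 is 12, so 15% of 240 is 24 + 12 = 36.",
    "rejected": " 5% of 240 is 10, so 15% of 240 is 24 + 10 = 34.",
}
```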
Phi-4's Performance Evaluation
Microsoft has evaluated Phi-4's performance across multiple benchmarks. In academic benchmarks such as MMLU, GPQA, MATH, and HumanEval, Phi-4 has shown remarkable results.
In the MMLU test, Phi-4 scored a high 84.8, and it surpassed GPT-4o on the GPQA and MATH tests, demonstrating strong reasoning in math-competition-style tasks. Against models of similar and larger scale, Phi-4 beat comparable open-source models such as Qwen-2.5-14B-Instruct on 9 of 12 benchmarks. This evaluation underscores Phi-4's position as a leading small language model.
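To reproduce numbers of this kind on your own hardware, one common route is EleutherAI's open-source lm-evaluation-harness. The call below is a hedged sketch: the task name, few-shot setting, and model id are assumptions, and official results may use different prompts and evaluation settings, so exact scores will not necessarily match.

```python
# Hedged sketch of benchmarking with EleutherAI's lm-evaluation-harness.
# Task name, few-shot setting, and repo id are assumptions; official results
# may use different prompts and evaluation settings.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-4",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```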