MiniMax Open Sources 456B Parameter Model with 4M Context
MiniMax Embraces the Agent Era with Open Source Models
The AI community is anticipating 2025 as the year of the AI Agent, with industry leaders highlighting the potential impact of agentic systems. MiniMax has responded by open-sourcing its foundational language model, MiniMax-Text-01, and its visual-multimodal model, MiniMax-VL-01. Both models feature a novel linear attention mechanism that significantly expands their context windows.
Key Innovations in MiniMax's Open Source Models
A major advancement is the ability of MiniMax's models to process up to 4 million tokens at once, a context window 20 to 32 times longer than that of other leading models. This capability is essential for agent applications, which require long context windows for memory and multi-agent collaboration.
MiniMax-Text-01 Innovations:
- Lightning Attention: A form of linear attention that reduces computational complexity from quadratic to linear in sequence length through a right-product kernel trick (see the sketch after this list).
- Hybrid-lightning: Interleaves Lightning Attention with softmax attention, substituting a softmax attention layer every eight layers, which improves scaling capability while retaining Lightning Attention's efficiency (a schedule sketch also follows the list).
- Mixture of Experts (MoE): MoE models outperform dense models of comparable computational cost. MiniMax also introduced an allgather communication step to prevent routing collapse.
- Computational Optimization: MiniMax optimized the MoE architecture with a token-grouping-based overlap scheme to reduce communication loads and used a data-packing technique for long-context training. It also adopted four optimization strategies for Lightning Attention: batched kernel fusion, separate prefill and decode execution, multi-level padding, and strided batched matrix multiplication expansion.
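To make the right-product trick concrete, here is a minimal PyTorch sketch of (non-causal) linear attention. The ELU+1 feature map and the normalization are common choices from the linear-attention literature and are assumptions here, not Lightning Attention's exact formulation; the point is the associativity: computing K^T V first costs O(n·d²), whereas materializing QK^T costs O(n²·d).

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention via the right-product trick (sketch).

    q, k, v: (batch, n, d). Instead of softmax(Q K^T) V, we apply a
    positive feature map phi and compute phi(Q) (phi(K)^T V), which is
    linear in sequence length n. A causal variant would replace the
    global K^T V sum with a running prefix sum.
    """
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0  # positive feature map (assumed)

    kv = torch.einsum("bnd,bne->bde", k, v)        # K^T V: O(n d^2), no n x n matrix
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps  # per-token normalizer
    return torch.einsum("bnd,bde->bne", q, kv) / z.unsqueeze(-1)
```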
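The hybrid design reduces to a per-layer choice of attention type. The helper below sketches the "every eight layers" cadence described above; the exact placement inside MiniMax-Text-01 may differ.

```python
def hybrid_attention_schedule(num_layers: int, softmax_every: int = 8) -> list[str]:
    """Return the attention type for each layer: Lightning Attention
    everywhere except every softmax_every-th layer, which uses softmax
    attention (illustrative, not MiniMax's exact layout)."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "lightning"
        for i in range(num_layers)
    ]

# hybrid_attention_schedule(16) -> 7 x "lightning", "softmax", 7 x "lightning", "softmax"
```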
These innovations resulted in a 456-billion-parameter LLM with 32 experts, in which each token activates only 45.9 billion parameters; the sketch below illustrates how top-k expert routing makes such sparse activation possible.
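Here is a minimal top-k MoE layer in PyTorch to illustrate sparse activation; the dimensions, the top_k value, and the routing details are placeholders, not MiniMax's implementation (which additionally uses an allgather step to counter routing collapse).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k Mixture-of-Experts layer (not MiniMax's code).

    Each token is routed to its top_k experts, so only a fraction of the
    layer's parameters is active per token -- the mechanism that lets a
    456B-parameter model activate ~45.9B parameters per token.
    """
    def __init__(self, d_model=64, d_ff=256, num_experts=32, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # dispatch each routing slot
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```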
MiniMax-Text-01's Benchmark Performance
MiniMax-Text-01 has demonstrated excellent performance, rivaling, and in some benchmarks surpassing, closed-source models such as GPT-4o and Claude 3.5 Sonnet as well as open-source models such as Qwen2.5 and Llama 3.1.
- It outperforms Qwen2.5-72B-Instruct on HumanEval.
- It scored 54.4 on the GPQA Diamond dataset, surpassing most fine-tuned LLMs and GPT-4o.
- It placed in the top three on MMLU, IFEval, and Arena-Hard, demonstrating its ability to apply knowledge and address user queries effectively.
Superior Contextual Capabilities
MiniMax-Text-01's extended context window is a key differentiator. On the RULER benchmark, its advantage over comparable models grows markedly beyond a 128K context length. The model also excels on LongBench v2's long-context reasoning tasks and demonstrates state-of-the-art long-context learning, as verified by the MTOB benchmark.
Real-World Applications
The capabilities of MiniMax-Text-01 extend beyond benchmarks.
- It can generate creative content, including songs with nuanced language and emotional depth.
- It can perform complex tasks like translating less common languages such as Kalamang.
- It exhibits excellent memory in long conversations.
MiniMax-VL-01: A Visual-Language Model
Building on MiniMax-Text-01, MiniMax developed MiniMax-VL-01, a multimodal variant that integrates an image encoder and an adapter: a Vision Transformer (ViT) performs visual encoding, and a two-layer MLP projector maps image features into the language model's input space (a sketch of this pattern follows). The model was trained on image-language data drawn from a proprietary dataset using a multi-stage training strategy.
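A minimal sketch of the encoder-plus-projector pattern appears below; the dimensions, the GELU activation, and the class name are illustrative assumptions, not MiniMax-VL-01's actual configuration.

```python
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Two-layer MLP projector mapping ViT patch embeddings into the
    language model's embedding space (dimensions are placeholders)."""
    def __init__(self, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings):   # (num_patches, vit_dim)
        # Projected patches are consumed by the LLM alongside text tokens.
        return self.projector(patch_embeddings)
```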
MiniMax-VL-01 demonstrates strong performance on various benchmarks, often matching or exceeding other SOTA models, and is capable of analyzing complex visual data like navigation maps.
The Future of AI Agents
MiniMax is pushing the boundaries of context window capabilities, researching architectures that eliminate softmax attention entirely and could enable effectively unlimited context windows. The company also recognizes the importance of multimodal models for AI agents, aiming to create agents that are natural, accessible, and ubiquitous, with the potential to interact with the physical world.