About MiMo-V2-Omni
MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capabilities (visual grounding, multi-step planning, tool use, and code execution), making it well suited to complex real-world tasks that span modalities. It has a 256K-token context window (262,144 tokens).
Specifications
- Provider: Xiaomi
- Context Length: 262,144 tokens
- Input Types: text, audio, image, video
- Output Types: text
- Category: Other
- Added: 3/18/2026
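
The specifications above can be captured in a small data structure for pre-flight checks before a request is sent. The following is a minimal sketch, not an official client: `ModelSpec`, `MIMO_V2_OMNI`, and `check_request` are hypothetical names introduced here for illustration, the token counts are assumed to come from whatever tokenizer the caller uses, and the sketch assumes that prompt and output tokens share the same 262,144-token window, which the listing does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical container for the fields shown in the listing."""
    name: str
    provider: str
    context_length: int
    input_types: frozenset
    output_types: frozenset


# Values copied from the specification list; 256K = 256 * 1024 = 262,144.
MIMO_V2_OMNI = ModelSpec(
    name="MiMo-V2-Omni",
    provider="Xiaomi",
    context_length=256 * 1024,
    input_types=frozenset({"text", "audio", "image", "video"}),
    output_types=frozenset({"text"}),
)


def check_request(spec, modalities, prompt_tokens, max_output_tokens):
    """Return a list of problems; an empty list means the request fits.

    Assumption: prompt and output tokens count against one shared window.
    """
    problems = []
    unsupported = sorted(set(modalities) - spec.input_types)
    if unsupported:
        problems.append("unsupported input types: " + ", ".join(unsupported))
    total = prompt_tokens + max_output_tokens
    if total > spec.context_length:
        problems.append(
            f"needs {total} tokens, context is {spec.context_length}"
        )
    return problems


if __name__ == "__main__":
    # A text + image request well inside the window.
    print(check_request(MIMO_V2_OMNI, ["text", "image"], 200_000, 8_000))
    # An unsupported input type and an oversized token budget.
    print(check_request(MIMO_V2_OMNI, ["text", "pdf"], 260_000, 4_000))
```

Returning a list of problems rather than raising on the first one lets a caller report every mismatch at once, for example both an unsupported input type and an oversized prompt.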