Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge
Summary
In a recent AI Coding Contest built around the Word Gem Puzzle, the open-weights model Kimi K2.6 outperformed frontier models including GPT-5.5 and Claude Opus 4.7. The results suggest that open-weights models are approaching performance parity with closed-source models on real-time, rule-based programming tasks.
Key Points
- Kimi K2.6 (Moonshot AI) secured first place with 22 match points and a 7-1-0 record.
- MiMo V2-Pro (Xiaomi) placed second with 20 match points and a 6-2-0 record.
- GPT-5.5 (OpenAI) ranked third with 16 match points and a 5-1-2 record.
- The competition utilized a Word Gem Puzzle with grid sizes ranging from 10×10 to 30×30.
- Kimi K2.6 achieved the highest cumulative tournament score of 77, driven by an aggressive greedy-search sliding strategy.
- On the Artificial Analysis Intelligence Index, Kimi K2.6 scored 54, compared to 60 for GPT-5.5 and 57 for Claude.
Technical Details
The Word Gem Puzzle is a sliding-tile game in which models must identify and claim English words along horizontal or vertical lines. The scoring logic heavily penalizes short words to prevent "carpet-bombing" the board: words of seven letters or more score (length − 6) points, while words under seven letters incur a penalty of (6 − length) points (e.g., a three-letter word costs 3 points; a five-letter word costs 1). Each round is subject to a 10-second wall-clock limit, applied across all grid dimensions.
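Note that the bonus and the penalty collapse into a single formula: the score of a claimed word is simply its length minus 6 (negative for short words, zero for six letters). A minimal sketch, assuming penalties are represented as negative scores:

```python
def word_score(word: str) -> int:
    """Score a claimed word under the contest's rules.

    Words of 7+ letters score (length - 6); shorter words incur a
    penalty of (6 - length), i.e. a negative score. Both cases reduce
    to a single expression.
    """
    return len(word) - 6

# A three-letter word costs 3 points; a seven-letter word earns 1.
assert word_score("cat") == -3
assert word_score("plane") == -1
assert word_score("puzzle") == 0
assert word_score("lattice") == 1
```

This shape explains the "carpet-bombing" deterrent: claiming many short words is strictly score-negative, so models are pushed toward assembling seven-plus-letter words.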
Performance was largely determined by each model's approach to tile movement. Kimi K2.6 used a greedy algorithm, executing the moves that unlocked the most new positive-value words; this approach was particularly effective on 30×30 grids, where heavy sliding was required to reconstruct words from scrambled tiles. In contrast, MiMo V2-Pro and Claude Opus 4.7 functioned primarily as static scanners, claiming only the words present in the initial grid, which led to failures on larger, more scrambled boards. GPT-5.5 employed a conservative sliding strategy with a move cap to prevent thrashing, while GLM 5.1 executed over 800,000 total slides but lacked a fallback mechanism for rounds in which no positive move was available.
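The greedy strategy described above can be sketched as follows. This is an illustrative reconstruction, not the actual contest code: `legal_moves` and the `evaluate` helper (which would count the new positive-value words a slide exposes) are hypothetical names, and the random fallback models the mechanism the article says GLM 5.1 lacked.

```python
import random


def greedy_slide(board, legal_moves, evaluate, rng=random.Random(0)):
    """Pick the slide that unlocks the most new positive-value words.

    `evaluate(board, move)` is a hypothetical helper returning how many
    new positive-scoring words the move would expose. When no move has
    positive value, fall back to a random legal move instead of stalling.
    """
    best_move, best_gain = None, 0
    for move in legal_moves:
        gain = evaluate(board, move)
        if gain > best_gain:  # strictly positive gains only
            best_move, best_gain = move, gain
    if best_move is None and legal_moves:
        # Fallback: keep the board moving rather than passing the round.
        best_move = rng.choice(legal_moves)
    return best_move
```

Under a tight wall-clock budget, a one-ply greedy loop like this trades lookahead for move volume, which matches the article's account of why aggressive sliding paid off on large, heavily scrambled grids.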
Impact / Why It Matters
The narrowing performance gap between open-weights models and frontier labs suggests that high-performance, locally deployable models are becoming viable alternatives for complex, real-time decision-making tasks. Developers can now leverage models that approach the capabilities of closed-source APIs for structured, high-stakes automation.