Introducing KAT-Dev-32B, KAT-Coder:
Advancing Code Intelligence through Scalable Agentic RL
Today, we're thrilled to announce two groundbreaking models in our KAT series: KAT-Dev-32B and KAT-Coder — representing accessible excellence and ultimate performance in code intelligence.
We are excited to introduce KAT-Dev-32B, our new open-source 32B-parameter model for software engineering tasks. On SWE-Bench Verified, KAT-Dev-32B achieves competitive performance with 62.4% of issues resolved, ranking 5th among open-source models across all scales.
Meanwhile, our most powerful variant, KAT-Coder, achieves a superior 73.4% resolved on SWE-Bench Verified.
KAT-Dev-32B and KAT-Coder are optimized through several training stages, including a mid-training stage, a supervised fine-tuning (SFT) & reinforcement fine-tuning (RFT) stage, and a large-scale agentic reinforcement learning (RL) stage. In summary, our contributions include:
We observe that extensive training for tool-use capability, multi-turn interaction, and instruction following at the mid-training stage may not yield large immediate gains on current leaderboards (e.g., SWE-bench), but it has a significant impact on the subsequent SFT and RL stages.
We meticulously curated eight task types and eight programming scenarios during the SFT stage to ensure the model's generalization and comprehensive capabilities. Moreover, before RL, we innovatively introduced an RFT stage with "teacher trajectories" annotated by human engineers as guidance during training.
Scaling agentic RL hinges on overcoming three challenges: computing log-probs efficiently over long, heavily overlapping trajectories; selecting which tokens are worth training on under a fixed compute budget; and decoupling RL training from agent internals across heterogeneous clusters. We address these with prefix caching for log-prob computation, entropy-based trajectory pruning, and the SeamlessFlow training architecture, respectively.
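As a rough illustration of the prefix-caching idea, the sketch below scores a trajectory tree with one segment-level forward pass per edge, so tokens shared by many trajectories are processed only once. The `Node` layout and the `score_segment` callable (which would wrap a KV-cache-reusing forward pass) are assumptions for illustration, not the actual KAT training stack.

```python
# Minimal sketch of prefix caching for log-prob computation over a trajectory tree.
# `score_segment` is an assumed callable, not the real KAT training code.
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple


@dataclass
class Node:
    tokens: List[int]                      # token segment on the incoming edge
    children: List["Node"] = field(default_factory=list)


def tree_logprobs(
    node: Node,
    score_segment: Callable[[List[int], object], Tuple[float, object]],
    cache: object = None,
    prefix_logprob: float = 0.0,
) -> Iterator[Tuple[Node, float]]:
    """Yield (leaf, total log-prob) pairs, scoring each shared prefix only once.

    `score_segment(tokens, cache)` returns the segment's log-prob and an updated
    cache (e.g., a KV cache). Children reuse the parent's cache, so tokens shared
    by many trajectories are processed exactly once instead of once per trajectory.
    """
    seg_logprob, cache = score_segment(node.tokens, cache)
    total = prefix_logprob + seg_logprob
    if not node.children:                  # leaf: a complete trajectory
        yield node, total
    for child in node.children:
        yield from tree_logprobs(child, score_segment, cache, total)
```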
KAT-Dev-32B is released to the community for further research and development: Hugging Face.
You can also access our strongest variant, KAT-Coder, by simply requesting an API key on the StreamLake platform and installing Claude Code to start coding; a detailed technical report is coming soon.
We fine-tuned a pretrained model with a two-stage process we call "Mid-Train." In the first stage, we enhanced the model's comprehensive capabilities related to "LLM as an agent," including tool use, multi-turn interaction, and instruction following.
In the second stage, we collected real delivery trajectories labeled by human engineers and synthesized extensive trajectory data to enhance end-to-end requirement delivery capabilities.
At this stage, building on the reinforcement learning (RL) pipeline, we introduce multiple ground truths to guide trajectory exploration, improving rollout efficiency and thereby the efficiency and stability of the subsequent RL phase. By shifting from absolute rewards to measuring discrepancies from the ground truth, we further stabilize training.
Transitioning from directly assigning absolute rewards to evaluating the relative differences between rollout samples and ground truth provides a more stable and accurate reward signal for RL. Meanwhile, we supervise sample correctness in real time during rollouts and promptly terminate generations that clearly deviate from the ground truth, which also yields higher sample efficiency for RL.
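A minimal sketch of this reward scheme is shown below, assuming a placeholder `similarity` measure between a rollout and its ground-truth trajectory; the thresholds and function names are illustrative assumptions, not the production implementation.

```python
# Illustrative sketch: rewards are based on discrepancy from a ground-truth
# trajectory rather than an absolute score, and clearly deviating rollouts are
# terminated early. `similarity` and the thresholds are placeholder assumptions.
from typing import Callable, List


def relative_reward(rollout: List[str], ground_truth: List[str],
                    similarity: Callable[[List[str], List[str]], float]) -> float:
    """Reward = how close the rollout is to the ground truth, in [0, 1]."""
    return similarity(rollout, ground_truth)


def rollout_with_early_stop(policy_step, ground_truth: List[str],
                            similarity, max_turns: int = 50,
                            deviation_threshold: float = 0.2) -> List[str]:
    """Generate a rollout turn by turn, stopping once it clearly goes off track."""
    rollout: List[str] = []
    for _ in range(max_turns):
        action = policy_step(rollout)      # next agent action (tool call, edit, ...)
        rollout.append(action)
        # Compare the partial rollout against the ground-truth prefix of equal length.
        if similarity(rollout, ground_truth[: len(rollout)]) < deviation_threshold:
            break                          # terminate: clearly deviating from ground truth
    return rollout
```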
After these three training stages, we obtain a cold-start model prepared for RL; the introduction of Reinforcement Fine-Tuning (RFT) thus builds a bridge between SFT and RL.
Even with the above-mentioned techniques, training on every token in the full trajectory tree remains prohibitively expensive. We therefore need a mechanism to prioritize the nodes that carry the strongest training signals.
To this end, we compress trajectories into a prefix tree where each node represents a shared prefix and each edge corresponds to a segment of tokens. Under a fixed compute budget, the goal is to retain only the most valuable nodes for training. We estimate the informativeness of nodes from entropy signals aggregated across the tree and their likelihood of being reached, and then prune the tree by expanding nodes in order of importance until the budget is exhausted. Additional heuristics ensure that structurally important regions (e.g., tool or memory events) are preserved and that local context is maintained for stable training. This entropy-based pruning substantially reduces redundant computation while retaining most of the effective training signal, leading to significant throughput gains and lower overall cost.
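The sketch below illustrates the core of this budget-constrained, entropy-guided pruning. The importance score and greedy expansion order are simplified assumptions, and the structural heuristics mentioned above (preserving tool/memory events and local context) are omitted for brevity.

```python
# Minimal sketch of entropy-guided pruning of a trajectory prefix tree under a
# token budget. Scores and field names are illustrative assumptions.
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeNode:
    tokens: int                   # number of tokens on this node's incoming edge
    entropy: float                # aggregated policy entropy over the segment
    reach_prob: float             # estimated probability of reaching this node
    children: List["TreeNode"] = field(default_factory=list)

    def importance(self) -> float:
        # Informativeness estimate: high-entropy segments that are likely to be
        # reached carry the strongest training signal.
        return self.entropy * self.reach_prob


def prune_by_entropy(root: TreeNode, token_budget: int) -> List[TreeNode]:
    """Greedily keep the most informative nodes until the token budget is exhausted."""
    kept, spent = [], 0
    # Max-heap keyed on importance (negated because heapq is a min-heap).
    frontier = [(-root.importance(), id(root), root)]
    while frontier and spent < token_budget:
        _, _, node = heapq.heappop(frontier)
        if spent + node.tokens > token_budget:
            continue              # skip nodes that would exceed the budget
        kept.append(node)
        spent += node.tokens
        for child in node.children:
            heapq.heappush(frontier, (-child.importance(), id(child), child))
    return kept
```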
To scale RL, it is essential to fully decouple RL training from the diverse internal logic of agents, while also maximizing the utilization of heterogeneous computational architectures. Following the design of SeamlessFlow, we implemented an intermediate layer between agents and RL training dedicated to trajectory-tree management, ensuring strict separation between the two. In addition, we adopted the proposed tag-driven scheduling mechanism to orchestrate task allocation across heterogeneous clusters, thereby minimizing pipeline bubbles and sustaining high-throughput training.
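As a rough sketch of tag-driven scheduling, the example below matches tasks to idle workers whose capability tags cover the task's requirements. The tag names and data model are illustrative assumptions rather than SeamlessFlow's actual implementation.

```python
# Illustrative sketch of tag-driven scheduling: tasks declare required tags,
# workers advertise capability tags, and the scheduler dispatches accordingly
# to keep heterogeneous clusters busy. Tag names are assumptions.
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Task:
    name: str
    tags: Set[str]                # e.g. {"gpu-inference"} for rollout generation


@dataclass
class Worker:
    name: str
    tags: Set[str]                # capabilities, e.g. {"gpu-training"} or {"cpu-sandbox"}
    busy: bool = False


def dispatch(task: Task, workers: List[Worker]) -> Optional[Worker]:
    """Assign a task to the first idle worker whose tags cover the task's tags."""
    for worker in workers:
        if not worker.busy and task.tags <= worker.tags:
            worker.busy = True
            return worker
    return None                   # no compatible worker is free; the task waits


# Example: rollouts go to inference nodes, tree training to training nodes,
# and sandboxed test execution to CPU-only machines.
workers = [
    Worker("infer-0", {"gpu-inference"}),
    Worker("train-0", {"gpu-training"}),
    Worker("sandbox-0", {"cpu-sandbox"}),
]
print(dispatch(Task("rollout", {"gpu-inference"}), workers).name)   # infer-0
```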
We unify the deployment and evaluation interfaces across different RL execution environments, enabling any newly added environment to be integrated seamlessly and at low cost. This unified design lays a solid foundation for scaling RL training across heterogeneous data sources and task types. Specifically, for software development scenarios, we focus on three essential components: problem descriptions paired with the corresponding branch code, executable environments, and verifiable test cases. We collect pull requests and their associated issues from open-source repositories and some internal repositories, and filter out low-quality data based on repository stars, PR activity, and issue content. We then systematically construct executable environment images and generate unit test cases for each collected instance. In addition to software engineering data, we also incorporate other verifiable domains such as math and reasoning tasks, further enriching the diversity of RL signals.
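The sketch below shows what such a unified environment interface might look like, with the three components bundled per software-engineering instance. Class and method names are assumptions for illustration, not the actual KAT internals.

```python
# Sketch of a unified RL-environment interface: each environment bundles a
# problem description with branch code, an executable image, and verifiable
# tests, so new domains (SWE, math, reasoning) plug in uniformly. Names are
# illustrative assumptions.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SWEInstance:
    problem_statement: str        # issue / PR description
    repo_branch: str              # code state the agent starts from
    env_image: str                # executable environment (e.g. a container image)
    test_cases: List[str]         # verifiable unit tests


class RLEnvironment(ABC):
    @abstractmethod
    def reset(self) -> str:
        """Start an episode and return the initial observation (task prompt)."""

    @abstractmethod
    def step(self, action: str) -> Tuple[str, bool]:
        """Apply an agent action and return (observation, done)."""

    @abstractmethod
    def verify(self) -> float:
        """Run the verifiable checks (unit tests, answer match) and return a reward."""
```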
More importantly, beyond open-source data, we further collect and leverage anonymized, enterprise-grade codebases derived from real-world industrial systems for RL training. Unlike training solely on public repositories (like those on GitHub), which often contain simpler projects, these large-scale, complex codebases—spanning multiple programming languages and representing genuine business logic—expose models to significantly more challenging development scenarios, providing highly valuable assets for RL. Training agents to solve such real-world industrial problems not only enhances learning robustness but also grounds the resulting models' programming proficiency in realistic, production-level contexts.
Below is how the model's performance on SWE-Bench Verified changed over the course of training:
KAT-Coder is now available for use with Claude Code. Simply request an API key and End Point ID on the StreamLake Wanqing platform and install Claude Code to start coding.
npm install -g @anthropic-ai/claude-code
Obtain the API key and create an inference endpoint according to the documentation.
# replace 'ep-xxx-xxx' with your Wanqing End Point ID
export ANTHROPIC_BASE_URL=https://wanqing.streamlakeapi.com/api/gateway/v1/endpoints/ep-xxx-xxx/claude-code-proxy
# replace 'WQ_API_KEY' with your Wanqing API key
export ANTHROPIC_AUTH_TOKEN=WQ_API_KEY
Then you should be able to use Claude Code with KAT-Coder!
During our agentic RL scaling, we observed notable emergent behaviors:
Substantial Reduction in Multi-turn Interactions: The model completes tasks with fewer interaction turns, showing an average decrease of 32% in interaction turns compared to the model after the SFT stage.
Parallel Tool Calling: After the RL stage, the model exhibits the ability to call multiple tools simultaneously, departing from the conventional sequential calling paradigm.
We hypothesize that these emergent behaviors primarily arise from the latent optimization objective introduced by the trajectory-tree structure:
Formation of Efficiency Preference: Within a trajectory tree, shorter paths (corresponding to fewer interaction turns) are shared across a larger number of trajectories, which creates a latent optimization objective whereby the model learns to propose more efficient solution strategies.
Natural Selection of Parallelization: In the tree structure, parallel tool calling creates additional branching possibilities. These branches are processed independently during training, enabling the model to explore various tool-calling combinations simultaneously. Furthermore, our Long-Term Entropy Pruning mechanism preserves tree nodes that carry more information, and nodes with multi-tool calls typically exhibit higher entropy. This process gradually guides the model towards acquiring a "batch processing" capability, as illustrated in the sketch after this list.
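The toy example below contrasts sequential and parallel tool calling. The message schema and tool names are generic illustrations, not the exact format used by KAT-Coder or any specific API.

```python
# Illustrative comparison of sequential vs. parallel tool calling; the message
# layout below is a generic example, not the exact schema used by KAT-Coder.

# Sequential: one tool call per turn, each waiting for the previous result.
sequential_turns = [
    {"role": "assistant", "tool_calls": [{"name": "read_file", "args": {"path": "src/app.py"}}]},
    {"role": "tool", "content": "...contents of src/app.py..."},
    {"role": "assistant", "tool_calls": [{"name": "read_file", "args": {"path": "tests/test_app.py"}}]},
    {"role": "tool", "content": "...contents of tests/test_app.py..."},
]

# Parallel: independent calls batched into a single turn, reducing the
# interaction count and creating extra branches in the trajectory tree.
parallel_turn = {
    "role": "assistant",
    "tool_calls": [
        {"name": "read_file", "args": {"path": "src/app.py"}},
        {"name": "read_file", "args": {"path": "tests/test_app.py"}},
    ],
}
```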
We are committed to pushing the boundaries of what's possible in code intelligence:
Deep integration with popular IDEs, version control systems, and development workflows to create seamless coding experiences.
Extending our models' capabilities to cover emerging programming languages and frameworks, ensuring comprehensive language support.
Exploring multi-agent systems where KAT models can work together on complex software projects, enabling unprecedented collaboration.
Integrating visual understanding capabilities to process architecture diagrams, UI designs, debugging screenshots, and documentation images alongside code, making the development process more intuitive and efficient.