I wanted to call attention to ArliAI's success in applying my recent modifications to refusal ablation to a MoE model. Nice work, @OwenArli !

ArliAI/GLM-4.5-Air-Derestricted

Ablation on a MoE model is no small thing; I expect that preserving norms/magnitudes during the intervention respects expert routing better than naive refusal ablation does.
(I would have tagged their org earlier, but tagging via "@" seemed to be broken.)
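For context, here is a toy sketch of the general idea on a single weight matrix, written in plain PyTorch. It only illustrates the norm-preserving step: how the refusal direction is extracted and where the intervention is applied are omitted, and this is not the code used for the release.

```python
import torch

def ablate_direction_norm_preserving(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Toy sketch: project a refusal direction r out of each output row of W,
    then rescale rows back to their original norms so downstream magnitudes
    (e.g. what an MoE router sees) stay roughly intact."""
    r = r / r.norm()
    orig_norms = W.norm(dim=1, keepdim=True)
    W_ablated = W - (W @ r).unsqueeze(1) * r.unsqueeze(0)      # remove component along r
    new_norms = W_ablated.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return W_ablated * (orig_norms / new_norms)                # restore per-row norms
```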
Implemented a proof-of-concept sampler in pure PyTorch and transformers.

Max P is a dynamic token filter that applies Winsorization to cap the probabilities of top tokens. Specifically, a base probability in the range [0, 1] caps each individual token's probability, and the sampler then redistributes the excess mass proportionally.
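A minimal sketch of the filter in plain PyTorch (not the exact proof-of-concept code): the parameter name `base_p` is mine, and "redistributes proportionally" is read here as spreading the removed mass over the uncapped tokens in proportion to their current probabilities.

```python
import torch

def max_p_filter(logits: torch.Tensor, base_p: float = 0.1) -> torch.Tensor:
    """Cap token probabilities at base_p and redistribute the excess mass."""
    probs = torch.softmax(logits, dim=-1)
    capped = torch.clamp(probs, max=base_p)                      # Winsorize: cap top tokens
    excess = (probs - capped).sum(dim=-1, keepdim=True)          # mass removed by the cap
    uncapped = (probs < base_p).float()
    uncapped_mass = (probs * uncapped).sum(dim=-1, keepdim=True).clamp_min(1e-12)
    # hand the excess back to uncapped tokens, proportional to their probability
    redistributed = capped + excess * (probs * uncapped) / uncapped_mass
    return redistributed / redistributed.sum(dim=-1, keepdim=True)  # renormalize for safety
```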
AutoXLA - Accelerating Large Models on TPU

AutoXLA is an experimental library that automates the distribution, optimization, and quantization of large language models for TPUs using PyTorch/XLA. It extends the Hugging Face Transformers interface with TPU-aware features such as automatic sharding, custom attention kernels, and quantization-aware loading, making large-scale deployment and training both simpler and faster.

With quantization and Splash Attention kernels, AutoXLA achieves up to 4× speedups over standard Flash Attention implementations, significantly improving throughput for both inference and training workloads. Whether you're experimenting with distributed setups (FSDP, 2D, or 3D sharding) or optimizing memory via LanguageModelQuantizer, AutoXLA is built to make scaling LLMs on TPU seamless.

Note: This is an experimental repository. Expect rough edges! Please report bugs or unexpected behavior through GitHub issues.

GitHub Repository: https://github.com/Locutusque/AutoXLA
Instead of an architectural upgrade, each major model drop nowadays perfects one localized innovation. What Kimi brought into the spotlight this time is quantization-aware training (QAT). I wrote an article explaining what it is and why it matters for reasoning models.
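If QAT is new to you, the core trick is easy to show in a few lines. The sketch below is a generic fake-quantization step with a straight-through estimator, not Kimi's actual recipe: the forward pass sees quantized weights while the optimizer keeps updating the full-precision copy.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Generic QAT illustration: quantize weights in the forward pass,
    but let gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # forward uses w_q, backward behaves as identity w.r.t. w
    return w + (w_q - w).detach()
```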
Why did this problem become hot again? Because many of us assumed it had already been solved by long-context models, and that turns out not to be true.
Here we were misled by benchmarks. Most long-context benchmarks are built around QA scenarios, i.e. finding a needle in a haystack. But in agentic scenarios, the model needs to find EVERYTHING in the haystack, and it simply cannot allocate enough attention to that challenge.