Latest
Qwen3-Reranker CLS-style Refactor Explained: From LM Head to Score Head
A system-level deep dive into refactoring Qwen3-Reranker from an LM head (H→V) to a score head (H→1): correct pooling for a decoder-only model, FLOPs math for the 0.6B model, ~296 MiB of VRAM savings, bandwidth effects, pitfalls, and a practical deployment checklist, plus open-source repos and a preview of Triton deployment with an OpenAI-style API.
Wednesday, August 13, 2025 | 961 words | 5 min read
Welcome to My Blog!
Hi! I'm an AI developer who loves building real-time systems and writing about practical machine learning, systems, and tooling. Glad you're here.
Wednesday, August 6, 2025 | 207 words | 2 min read