Category: LLM

1 post here.

2025

08-13

Qwen3-Reranker CLS-style Refactor Explained: From LM Head to Score Head

A system-level deep dive into refactoring Qwen3-Reranker from an LM head (H→V) to a score head (H→1): correct pooling for a decoder-only model, FLOPs math for the 0.6B model, ~296 MiB of VRAM savings, bandwidth effects, pitfalls, and a practical deployment checklist, plus open-source repos and a preview of Triton deployment with an OpenAI-style API.