Category: LLM

1 post here.

2025

  • Qwen3-Reranker CLS-style Refactor Explained: From LM Head to Score Head
    08-13

    A system-level deep dive into refactoring Qwen3-Reranker from an LM head (H→V) to a score head (H→1): correct decoder-only pooling, 0.6B FLOPs math, ~296 MiB VRAM savings, bandwidth effects, pitfalls, and a practical deployment checklist, plus open-source repos and a preview of Triton deployment with an OpenAI-style API.

©2025 Shawn. RSS Sitemap