TopK Language Models

Tohoku University, RIKEN, MBZUAI

TL;DR

We replace the activation function in Transformer-based LMs with the TopK function,
resulting in sparse, SAE-like activations that are highly interpretable.

Why Sparsely-activated LMs?

Sparse Autoencoders (SAEs) are one of the go-to tools for interpreting the hidden states of LMs, since their sparse activations often (but not always) exhibit interpretable patterns. However, SAEs are trained post-hoc, which comes with several drawbacks: additional training cost, inconsistency of learned SAE features across random seeds, and the possibility that SAEs do not learn all features represented in the underlying LM.

So we asked: why not build the SAE into the LM right from the start? By replacing the LM's original activation function with the TopK activation function popularized by SAEs, we obtain a new LM architecture that combines the performance of Transformer-based LMs with the interpretability of sparse autoencoders, without requiring any post-hoc training.
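Concretely, the TopK activation keeps only the k largest entries of each token's MLP pre-activation vector and zeroes out the rest. Here is a minimal PyTorch sketch; the function name, the choice of k, and the per-token, last-dimension convention are our illustrative assumptions, not the paper's exact implementation:

```python
import torch

def topk_activation(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest pre-activations per position; zero out the rest.

    x: (..., d_mlp) MLP pre-activations; the last dimension indexes neurons.
    """
    # Values and indices of the k largest entries along the neuron dimension.
    values, indices = torch.topk(x, k, dim=-1)
    # Scatter the surviving values back into an otherwise all-zero tensor.
    sparse = torch.zeros_like(x)
    sparse.scatter_(-1, indices, values)
    return sparse
```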

Architecture

[Figure: TopK LM architecture]
  • Early and Middle Layers: Apply TopK activation to create sparse, interpretable representations
  • Final Layers: Maintain dense processing to preserve model expressivity and performance

The architecture applies the TopK activation function only in selected layers, producing sparse activation patterns while maintaining model expressivity. The key insight is to apply sparsity selectively: TopK activation in the early and middle layers for interpretability, dense activations in the final layers for performance.
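As a rough sketch of how this selective sparsity could look in code (the layer split, k, the GELU fallback, and all names below are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveTopKMLP(nn.Module):
    """Transformer MLP block: sparse (TopK) in early/middle layers, dense (GELU) in final layers."""

    def __init__(self, d_model: int, d_mlp: int, layer_idx: int,
                 n_layers: int, n_dense_final: int, k: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_mlp)
        self.w_out = nn.Linear(d_mlp, d_model)
        # Use TopK in all layers except the last `n_dense_final` ones.
        self.use_topk = layer_idx < n_layers - n_dense_final
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.w_in(x)
        if self.use_topk:
            # TopK activation: keep the k largest pre-activations per token, zero the rest.
            values, indices = torch.topk(h, self.k, dim=-1)
            h = torch.zeros_like(h).scatter_(-1, indices, values)
        else:
            # Final layers stay dense to preserve expressivity and performance.
            h = F.gelu(h)
        return self.w_out(h)
```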

Interpretability Advantages

[Figure: TopK LM overview]

TopK LMs enjoy SAE-like interpretability with three key benefits:

  1. Improved Neuron Interpretability: Individual neurons can be clearly interpreted as representing specific concepts (e.g., "Work")
  2. Concept Steering: Single-neuron interventions can steer text output towards specific concepts (e.g., steering towards "work" concepts); a sketch of such an intervention follows this list
  3. Traceability of Neuron Specialization: The formation of specialized neurons can be traced across training checkpoints and across different layers
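
To illustrate benefit 2, here is a hedged sketch of single-neuron steering implemented with a PyTorch forward hook. The checkpoint ("gpt2" as a stand-in for a trained TopK LM), the hooked submodule, and the LAYER, NEURON, and SCALE values are placeholders, not values from the paper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" stands in for a trained TopK LM checkpoint; LAYER, NEURON, and SCALE
# are placeholders for a layer index, a previously identified concept neuron
# (e.g. a "work" neuron), and the amount by which its activation is boosted.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
LAYER, NEURON, SCALE = 8, 1234, 10.0

def steer_hook(module, inputs, output):
    # Boost the chosen neuron's activation so generation drifts toward its concept.
    output = output.clone()
    output[..., NEURON] += SCALE
    return output

# Hook the MLP activation module whose output contains the neuron activations.
handle = model.transformer.h[LAYER].mlp.act.register_forward_hook(steer_hook)
try:
    inputs = tokenizer("This morning I", return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()
```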

BibTeX

@article{takahashi-2025-topklm,
  author = {Takahashi, Ryosuke and Inaba, Tatsuro and Inui, Kentaro and Heinzerling, Benjamin},
  title = {TopK Language Models},
  year = {2025},
  journal = {arXiv [cs.CL]},
  url = {https://arxiv.org/abs/2506.21468v1}
}