Fewer Than 0.1% of Neurons Predict When LLMs Hallucinate

11 min readCited...

Fewer than one in a thousand neurons in a large language model can predict whether it's about to hallucinate. Not a diffuse, emergent property of scale. Not a training data artifact. A behavior controlled by a remarkably sparse set of identifiable neurons in the feed-forward network.1

That's the core finding from Gao et al.'s recent paper on H-Neurons, tested across six LLMs, four benchmarks, and both base and instruction-tuned models.1 The neurons generalize. They transfer across training stages. And they trace back to pre-training, not RLHF.

But the real surprise isn't the sparsity. It's what these neurons encode: not factual errors, but a general tendency to prioritize compliance over truth.1

Identifying H-Neurons: CETT and Sparse Classification

The first challenge is measuring what individual neurons contribute during generation. Gao et al. developed CETT (Cumulative Element-wise Token Transformation), a metric that quantifies each FFN neuron's normalized contribution to the residual stream hidden state.1

The intuition: in a Transformer's FFN layer, each neuron multiplies an activation value by a column of the output projection matrix, producing a vector that gets added to the residual stream. CETT captures the cumulative magnitude of that contribution relative to the total hidden state. Some neurons consistently push representations toward hallucinated outputs; others push toward faithful ones.

Framework for identifying H-Neurons using the CETT metric and sparse classifier pipeline Figure 1: The H-Neuron identification pipeline. Part (a) shows CETT computation within the FFN, quantifying each neuron's contribution to the residual stream. Part (b) shows the sparse classifier trained on TriviaQA-generated contrastive responses. Source: Gao et al. (2025), Figure 1

To build the training signal, the researchers fed TriviaQA questions to each model, applying "a rigorous consistency check by sampling 10 distinct responses using probabilistic decoding parameters" per query.1 Only queries where all 10 responses were either fully correct or fully incorrect survived the filter, yielding "a high-quality contrastive set of 1,000 fully correct and 1,000 fully incorrect examples."1

A sparse logistic regression classifier trained on the CETT features then identifies which neurons discriminate between faithful and hallucinated outputs. The L1 penalty forces sparsity, and what falls out is sparse: the predictive neurons "typically account for less than 1 per mille of all neurons in the models, ranging from 0.01 per mille in 70B models to 0.35 per mille in 7B models."1

In a 70-billion-parameter model, the classifier isolates a few hundred neurons. The "classifiers built on H-Neurons consistently and substantially outperform those using randomly selected neurons across all models and scenarios."1

Here's a concrete implementation of CETT computation using PyTorch forward hooks on GPT-2 (the same pattern applies to any HuggingFace model):

"""
CETT (Cumulative Element-wise Token Transformation) Metric Computation

Demonstrates how to compute per-neuron CETT scores in a Transformer's
FFN layer using PyTorch forward hooks. CETT quantifies each neuron's
normalized contribution to the residual stream hidden state, following
the methodology from Gao et al. (2025) "H-Neurons."

In each FFN sub-layer the computation is:
    FFN(x) = activation(x @ W_gate) * (x @ W_up) @ W_down

Each neuron i produces a scalar activation a_i, which scales column i
of W_down. That column is the neuron's contribution vector to the
residual stream. CETT sums the L2 norms of these contributions across
all token positions and normalizes by the total FFN output norm.

Uses GPT-2 (124M parameters) for fast, reproducible execution.
"""

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM


def compute_cett_scores(model, tokenizer, text, layer_idx=5):
    """
    Compute CETT scores for every neuron in a single FFN layer.

    Args:
        model: HuggingFace causal LM
        tokenizer: corresponding tokenizer
        text: input text to analyze
        layer_idx: which transformer layer to inspect

    Returns:
        numpy array of shape (intermediate_size,) with CETT scores
    """
    # Storage for activations captured by hooks
    captured = {}

    # --- Hook 1: capture intermediate activations (post-activation) ---
    # In GPT-2, the FFN is: x -> fc1 -> GELU -> fc2 -> output
    # We hook after the activation function to get per-neuron scalars.
    def hook_intermediate(module, input, output):
        # output shape: (batch, seq_len, intermediate_size)
        captured["activations"] = output.detach()

    # --- Hook 2: capture the full FFN output (after W_down projection) ---
    def hook_ffn_output(module, input, output):
        # For GPT-2, the MLP module output is the projected result
        captured["ffn_output"] = output.detach()

    # Register hooks on the target layer's MLP components
    mlp = model.transformer.h[layer_idx].mlp
    h1 = mlp.act.register_forward_hook(hook_intermediate)   # after GELU
    h2 = mlp.register_forward_hook(hook_ffn_output)          # full MLP output

    # Run a forward pass
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)

    # Remove hooks to avoid side-effects on future calls
    h1.remove()
    h2.remove()

    # --- Compute CETT scores ---
    activations = captured["activations"]       # (1, seq_len, d_intermediate)
    ffn_output = captured["ffn_output"]          # (1, seq_len, d_hidden)

    # W_down is the output projection: (intermediate_size, hidden_size)
    w_down = mlp.c_proj.weight.detach()          # GPT-2 uses Conv1D: (d_in, d_out)

    seq_len = activations.shape[1]
    n_neurons = activations.shape[2]
    cett = np.zeros(n_neurons)

    for t in range(seq_len):
        a_t = activations[0, t, :]               # (d_intermediate,)
        ffn_out_t = ffn_output[0, t, :]           # (d_hidden,)
        total_norm = ffn_out_t.norm().item()

        if total_norm < 1e-10:
            continue

        # Each neuron i contributes: a_t[i] * W_down[i, :]
        # Its CETT contribution at this token = ||a_t[i] * W_down[i,:]|| / ||FFN(x_t)||
        for i in range(n_neurons):
            contribution = a_t[i] * w_down[i, :]
            cett[i] += contribution.norm().item() / total_norm

    # Normalize by number of tokens to get average per-token CETT
    cett /= seq_len
    return cett


# --- Main execution ---
print("Loading GPT-2 model...")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is Paris, which is known for the Eiffel Tower."
layer = 5

print(f"Computing CETT scores for layer {layer}...")
scores = compute_cett_scores(model, tokenizer, prompt, layer_idx=layer)

# Summary statistics
print(f"\nTotal neurons in FFN layer {layer}: {len(scores)}")
print(f"Mean CETT score:   {scores.mean():.6f}")
print(f"Max CETT score:    {scores.max():.6f}")

# Identify top contributing neurons (potential H-Neuron candidates)
top_k = 10
top_indices = np.argsort(scores)[-top_k:][::-1]
print(f"\nTop {top_k} neurons by CETT score:")
for rank, idx in enumerate(top_indices, 1):
    print(f"  {rank:2d}. Neuron {idx:4d}  CETT = {scores[idx]:.6f}")

# Show sparsity: what fraction of neurons account for 50% of total CETT
sorted_scores = np.sort(scores)[::-1]
cumulative = np.cumsum(sorted_scores) / sorted_scores.sum()
n_for_half = np.searchsorted(cumulative, 0.5) + 1
print(f"\nSparsity: {n_for_half} neurons ({100*n_for_half/len(scores):.1f}%)"
      f" account for 50% of total CETT contribution")
Total neurons in FFN layer 5: 3072
Mean CETT score:   0.017181
Max CETT score:    0.257303

Top 10 neurons by CETT score:
   1. Neuron 1888  CETT = 0.257303
   2. Neuron 1790  CETT = 0.234145
   3. Neuron  997  CETT = 0.128355
   4. Neuron  654  CETT = 0.117239
   5. Neuron 1960  CETT = 0.093406
   6. Neuron 1513  CETT = 0.083834
   7. Neuron 1169  CETT = 0.083565
   8. Neuron 2719  CETT = 0.080433
   9. Neuron   97  CETT = 0.070937
  10. Neuron 1180  CETT = 0.070619

Sparsity: 1075 neurons (35.0%) account for 50% of total CETT contribution

The distribution is heavy-tailed: the top 2 neurons (1888 and 1790) each contribute an order of magnitude more than the median neuron. In the paper's pipeline, these high-CETT neurons become features for the sparse classifier that identifies the final H-Neuron set.

Not Factual Errors, But Over-Compliance

You might expect hallucination-associated neurons to fire when the model lacks factual knowledge, flagging "I don't know this." Instead, "H-Neurons do not simply encode factual errors, but rather represent a general tendency to prioritize conversational compliance over factual accuracy."1

This reframes hallucination as a compliance problem, not a knowledge problem. The model often knows the right answer; it just doesn't assert it.

The researchers tested this through activation scaling experiments.1 During inference, they multiplied H-Neuron activations by a factor alpha, with values below 1.0 suppressing the neuron and values above 1.0 amplifying it. They swept alpha across four benchmarks measuring different failure modes:

  • FalseQA: Questions with invalid premises ("How many feathers does a cat have?")
  • FaithEval: Misleading contexts that contradict facts
  • Sycophancy: User challenges to correct answers
  • Jailbreak: Requests to bypass safety filters

Behavioral impact of H-Neuron intervention across four over-compliance dimensions Figure 2: Suppressing H-Neurons (left) produces factually grounded responses. Amplifying them (right) causes the model to accept invalid premises, adopt counterfactual claims, abandon correct answers under pressure, and bypass safety filters. Source: Gao et al. (2025), Figure 2

The result: "a consistent positive correlation between the scaling factor of neurons and model's compliance rate" across all six models and four benchmarks.1 This is a clean dose-response curve, not a binary switch. "Amplifying H-Neurons' activations systematically increases a spectrum of over-compliance behaviors, ranging from overcredulous acceptance of false premises to susceptibility to social pressure and weakened safety guardrails."1

Compliance rates across scaling factors and benchmarks showing H-Neuron control over model behavior Figure 3: Compliance rate vs. scaling factor across six LLMs and four benchmarks. The consistent upward trend confirms the causal relationship between H-Neuron activation and over-compliance. Source: Gao et al. (2025), Figure 3

Smaller models proved more sensitive to perturbation: "the susceptibility of models to perturbation on neurons generally exhibits an inverse correlation with parameter size."1 Larger models have more redundant pathways, making individual neurons less decisive.

The intervention is straightforward to implement. A forward hook on the FFN activation function scales only the identified H-Neuron outputs:

"""
Inference-Time Activation Scaling on H-Neurons

Registers a forward hook that selectively scales H-Neuron activations
by a factor alpha during generation. Suppressing (alpha < 1) reduces
over-compliance; amplifying (alpha > 1) increases it.
"""

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def make_scaling_hook(neuron_indices, alpha):
    """
    Create a hook that scales specific neuron activations by alpha.
    The hook intercepts FFN intermediate activations and multiplies
    only the H-Neuron outputs before they reach the output projection.
    """
    indices = torch.tensor(neuron_indices)

    def hook_fn(module, input, output):
        # output shape: (batch, seq_len, intermediate_size)
        output[:, :, indices] *= alpha
        return output

    return hook_fn


def generate_with_scaling(model, tokenizer, prompt, layer_idx,
                          neuron_indices, alpha, max_new_tokens=40):
    """Generate text with H-Neuron activation scaling applied."""
    mlp_act = model.transformer.h[layer_idx].mlp.act
    hook = mlp_act.register_forward_hook(
        make_scaling_hook(neuron_indices, alpha)
    )

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    hook.remove()  # always clean up
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# --- Compare baseline vs. suppressed vs. amplified ---
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Simulated H-Neuron indices (in practice, from CETT + sparse classifier)
h_neuron_indices = [654, 997, 1169, 1513, 1790, 1888, 1960, 2719]
prompt = "How many wings does a snake have? A snake has"

for alpha, label in [(1.0, "Baseline"), (0.5, "Suppressed"), (2.0, "Amplified")]:
    result = generate_with_scaling(
        model, tokenizer, prompt, layer_idx=5,
        neuron_indices=h_neuron_indices, alpha=alpha
    )
    print(f"{label} (alpha={alpha}): {prompt}{result}")
Baseline (alpha=1.0):
  ...a wingspan of about 10 feet. A snake has a tail length of about 5 feet.

Amplified (alpha=2.0):
  ...a wingspan of about 10 feet. A snake has a tail length of about 10 feet.

GPT-2 is a small base model without instruction tuning, so behavioral differences are subtle. On instruction-tuned models like Llama-3.1-8B-Instruct, Gao et al. report clear dose-response curves where suppressing H-Neurons reduces hallucination and amplifying them increases over-compliance across all benchmarks.1 The hook pattern itself generalizes to any HuggingFace model; only the module path changes.

Pre-Training, Not Alignment

If H-Neurons drive hallucination, where do they come from? A common assumption is that RLHF and instruction tuning introduce problematic behaviors. The data tells a different story.

The researchers applied classifiers trained on instruction-tuned models directly to the corresponding base models, without retraining. The classifiers transferred effectively, confirming that "the internal neurons distinguishing truth from hallucination in aligned models are already present, with similar functionality, in the corresponding base models."1

Cross-stage transferability (AUROC scores) and parameter evolution of H-Neurons Figure 4: (a) AUROC scores when instruction-tuned classifiers are applied to base models, showing strong cross-stage transfer. (b) H-Neuron parameter evolution from pre-training to alignment, with minimal drift. Source: Gao et al. (2025), Figure 4

Further, "H-Neurons undergo minimal parameter updates during the transition from base to instruction-tuned models."1 The authors describe this as "parameter inertia," suggesting that "standard instruction tuning does not effectively restructure the underlying neural mechanisms responsible for hallucination-related behaviors."1

Why would pre-training produce hallucination neurons? The training objective itself provides the answer: "this training paradigm does not distinguish between factually correct and incorrect continuations; it merely rewards fluent text production."1 Next-token prediction optimizes for plausibility, not truth. Neurons that help produce fluent, gap-filling text are useful for the training objective even when they enable confabulation.

The tendency to hallucinate is baked in during pre-training. RLHF doesn't fix it.

From Knowledge Neurons to Hallucination Neurons

H-Neurons sit at the intersection of several active research threads, and understanding where they fit clarifies both their significance and their limitations.

The foundational insight comes from Dai et al.'s knowledge neurons work. They showed that specific FFN neurons store factual knowledge and that "the activation of such knowledge neurons is positively correlated to the expression of their corresponding facts."2 Meng et al.'s ROME work built on this, establishing through causal tracing that "feedforward MLPs at a range of middle layers are decisive when processing the last token of the subject name" for factual recall.3

FFN module as key-value memory with knowledge neuron activation Figure 5: The FFN module as a key-value memory. Keys compute inner products with the hidden state; values determine output contributions. This is the architectural substrate where both knowledge neurons and H-Neurons operate. Source: Dai et al. (2022), Figure 2

H-Neurons extend this line of work from "where is knowledge stored?" to "where is the decision to hallucinate made?" If knowledge neurons store facts, H-Neurons control whether the model prioritizes those facts or complies with conversational pressure instead. Both inhabit the same FFN architecture, but they serve complementary functions.

The connection to sycophancy research deepens this picture. Wang et al. found that "when models are exposed to user opinions, their agreement rate with incorrect beliefs rises sharply, averaging 63.7% across models."4 They also showed that sycophancy "is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers."4 H-Neurons provide a potential mechanistic substrate: the neurons that make a model fabricate answers may be the same ones that make it agree with harmful premises. The H-Neurons paper frames this directly, proposing "that this risk-taking behavior is one manifestation of a more fundamental phenomenon" of over-compliance.1

Causal tracing methodology showing MLP modules' critical role in factual recall Figure 6: Causal tracing from ROME. Restoring MLP activations at early subject-token positions recovers factual predictions, establishing the key role of MLPs in knowledge storage that H-Neurons research builds upon. Source: Meng et al. (2022), Figure 1

Truth direction research adds another dimension. Marks and Tegmark showed that "at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements."5 If truth has a direction in activation space, H-Neurons may be the components that determine whether the model follows it. Both findings point toward the same conclusion: truthfulness is a structured, geometrically identifiable feature of internal representations, not a diffuse emergent property.

Practical Implications and Open Questions

For ML practitioners, several actionable directions follow.

Hallucination detection at inference time. The authors suggest that "neuron-level signals open new possibilities for token-level hallucination detection by enabling fine-grained identification of when a model is likely generating unfaithful content."1 You could monitor H-Neuron activation patterns during generation to flag high-risk outputs, which is substantially cheaper than running a separate fact-checking model.

Targeted intervention without retraining. The activation scaling experiments show that suppressing H-Neurons reduces hallucination across all tested benchmarks. But the authors are honest about the trade-off: "a critical challenge lies in balancing hallucination reduction with model helpfulness."1 Suppress too aggressively and you get a model that refuses to answer anything uncertain.

The helpfulness-honesty balance. The dose-response curves in Figure 3 are generally upward-trending, though the authors note the response "is not strictly monotonic for all cases," with some models showing fluctuations at intermediate scaling factors. There's no clean cutoff where hallucination drops without affecting helpfulness. Finding the right scaling factor will depend on your application's tolerance for refusals versus errors.

Several questions remain open. Can H-Neuron activation patterns enable per-token hallucination detection during generation? How do these neurons behave in mixture-of-experts architectures where routing adds another layer of complexity? And do interventions on H-Neurons interact with other behavioral dimensions like creativity or bias?

Here's a minimal probe that replicates the core H-Neuron identification pipeline using synthetic data:

"""
Hallucination Monitoring Probe

Trains a sparse logistic regression on CETT-style features to identify
H-Neurons. The L1 penalty forces most coefficients to zero, isolating
the sparse predictive set -- mirroring the paper's finding that
"less than 1 per mille" of neurons predict hallucination.
"""

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# --- Synthetic contrastive dataset ---
# 1000 faithful + 1000 hallucinated, matching the paper's setup
np.random.seed(42)
n_neurons = 3072  # GPT-2 intermediate size
X = np.random.randn(2000, n_neurons) * 0.01
y = np.array([0] * 1000 + [1] * 1000)  # 0=faithful, 1=hallucinated

# 20 ground-truth H-Neurons with differential activation
true_h_neurons = [97, 215, 654, 812, 997, 1042, 1169, 1280,
                  1513, 1600, 1790, 1888, 1960, 2100, 2350,
                  2500, 2650, 2719, 2900, 3010]
for idx in true_h_neurons:
    X[:1000, idx] += np.random.randn(1000) * 0.3 - 0.5  # faithful: lower
    X[1000:, idx] += np.random.randn(1000) * 0.3 + 0.5  # hallucinated: higher

shuffle = np.random.permutation(2000)
X, y = X[shuffle], y[shuffle]

# --- L1-penalized logistic regression ---
probe = LogisticRegression(penalty="l1", C=0.5, solver="liblinear",
                           max_iter=1000, random_state=42)
probe.fit(X, y)

# Which neurons did the classifier select?
selected = np.where(np.abs(probe.coef_[0]) > 1e-8)[0]
print(f"Total neurons: {n_neurons}")
print(f"Non-zero coefficients: {len(selected)} ({100*len(selected)/n_neurons:.2f}%)")
print(f"Recovered {len(set(selected) & set(true_h_neurons))}/{len(true_h_neurons)}"
      f" true H-Neurons")

cv = cross_val_score(probe, X, y, cv=5, scoring="accuracy")
print(f"5-fold CV accuracy: {cv.mean():.3f}")

# --- Live monitoring demo ---
test = np.random.randn(n_neurons) * 0.01
for idx in true_h_neurons:
    test[idx] = 0.8  # elevated H-Neuron activations
prob = probe.predict_proba(test.reshape(1, -1))[0, 1]
print(f"\nSample with elevated H-Neurons: P(hallucination) = {prob:.3f}")
Total neurons: 3072
Non-zero coefficients: 20 (0.65%)
Recovered 20/20 true H-Neurons
5-fold CV accuracy: 1.000
Sample with elevated H-Neurons: P(hallucination) = 1.000

The classifier selected exactly 20 neurons out of 3,072, zeroing out 99.35% of coefficients. With real CETT features the accuracy would be lower, but the sparsity pattern holds: the paper reports similar ratios across models from 7B to 70B parameters.

Where This Points

The H-Neurons result changes the practical calculus of hallucination mitigation. If hallucination is controlled by a sparse, identifiable subset of neurons that emerged during pre-training and persists through alignment, then the path to more reliable LLMs runs through understanding those circuits, not around them.

The connection between hallucination and over-compliance may be the most consequential insight. If the same neurons drive a model to hallucinate about cat feathers, agree with false premises, and bypass safety filters, then safety and factuality are more tightly coupled at the mechanistic level than the alignment community has typically assumed. Addressing one without understanding the other risks making both worse.

The code and benchmarks are available at the H-Neurons GitHub repository. For practitioners, the immediate next step is clear: replicate the CETT + sparse classification pipeline on your target model, identify its H-Neurons, and experiment with activation scaling to find the sweet spot between reduced hallucination and maintained helpfulness.


References

Footnotes

  1. Gao, C., Chen, H., Xiao, C., Chen, Z., Liu, Z., & Sun, M. (2025). H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs. arXiv preprint arXiv:2512.01797. https://arxiv.org/abs/2512.01797 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

  2. Dai, D., Dong, L., Ma, S., Zheng, B., Sui, Z., Chang, B., & Wei, F. (2022). Knowledge Neurons in Pretrained Transformers. ACL 2022. https://arxiv.org/abs/2104.08696

  3. Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022. https://arxiv.org/abs/2202.05262

  4. Wang, K., Li, J., Yang, S., Zhang, Z., & Wang, D. (2025). When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models. arXiv preprint arXiv:2508.02087. https://arxiv.org/abs/2508.02087 2

  5. Marks, S., & Tegmark, M. (2024). The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. COLM 2024. https://arxiv.org/abs/2310.06824

Related Articles