Probe Concepts Inside LLMs

A config-driven Python library for training concept directions in open-weights language models — then scoring, visualizing, and steering behavior, all in a few lines of code.


LLMs encode rich concepts in their hidden layers.
This library helps you extract and use them.

Concept probing lets researchers discover directional vectors inside transformer layers that correspond to abstract behaviors — and then leverage those vectors for analysis and control.

🔬

Interpretability

Identify which layers and directions encode specific behaviors like formality, empathy, or truthfulness.

🎯

Quantitative Scoring

Score any text token-by-token against learned concept vectors — see exactly where a concept activates.

🧭

Behavioral Steering

Inject concept vectors during generation to steer a model toward or away from a target behavior.

📊

Reproducible Experiments

Every run logs configs, metrics, tensors, and plots. Full artifact pipeline for reproducible research.

Four phases, one function call

workspace.train_concept(concept) runs a complete pipeline — from generation through statistical evaluation — automatically.

1

Generate Training Completions

The model generates responses using positive and negative system prompts — creating two behaviorally opposed sets of completions.

2

Extract Hidden States

A forward pass collects hidden states from every transformer layer for all completions. Configurable readout modes (mean over assistant tokens, last token, etc.) reduce each completion's per-token states to one fixed-size vector per layer.
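To make the readout step concrete, here is a minimal NumPy sketch of four reductions. Only the mode names come from this page; the function name `readout`, the `assistant_mask` argument, and the array shapes are illustrative assumptions, not the library's internals.

```python
import numpy as np

def readout(hidden, assistant_mask, mode="assistant_all_mean"):
    """Reduce per-token hidden states of shape (tokens, dim) to one vector.

    hidden:         one row per token (one layer, one completion)
    assistant_mask: True where the token belongs to the assistant reply
    """
    hidden = np.asarray(hidden, dtype=float)
    mask = np.asarray(assistant_mask, dtype=bool)
    if mode == "assistant_all_mean":
        return hidden[mask].mean(axis=0)          # average over assistant tokens
    if mode == "assistant_last":
        return hidden[np.flatnonzero(mask)[-1]]   # last assistant token
    if mode == "sequence_last":
        return hidden[-1]                         # last token of the sequence
    if mode == "sequence_all_mean":
        return hidden.mean(axis=0)                # average over every token
    raise ValueError(f"unknown readout mode: {mode}")

# Toy input: 4 tokens, 2 hidden dimensions; the last two tokens are the assistant's.
states = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 3.0], [3.0, 5.0]])
mask = np.array([False, False, True, True])
mean_vec = readout(states, mask, "assistant_all_mean")
last_vec = readout(states, mask, "assistant_last")
```

Whatever the mode, the result has the same length as the hidden dimension, which is what lets completions of different lengths be compared.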

3

Compute Concept Vectors

For each layer, compute v = normalize(μ_pos − μ_neg), where μ_pos and μ_neg are the mean readout vectors of the positive and negative completions. Then evaluate with Cohen's d effect size and t-test p-values to find the most discriminative layer.
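The difference-of-means vector and the effect size can be sketched in a few lines of NumPy. This is an illustration on synthetic data for a single layer, not the library's code; the helper names `concept_vector` and `cohens_d` are made up here, and the pooled-variance form of Cohen's d is an assumption about which variant is used.

```python
import numpy as np

def concept_vector(pos, neg):
    """Unit-norm difference of class means; pos and neg have shape (n, dim)."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation for two 1-D samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic readout vectors: the two classes differ along the first axis only.
rng = np.random.default_rng(0)
pos = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.5, size=(50, 3))
neg = rng.normal(loc=[-1.0, 0.0, 0.0], scale=0.5, size=(50, 3))

v = concept_vector(pos, neg)
d = cohens_d(pos @ v, neg @ v)   # effect size of the projections onto v
```

For the p-value, `scipy.stats.ttest_ind` applied to the same two projection arrays is one option when SciPy is installed.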

4

Save Artifacts & Visualizations

The probe, sweep plots, score histograms, configs, logs, and tensors are saved in a timestamped output directory — ready for analysis.

📄 config.json 📄 concept.json 📄 metrics.json 📦 tensors.npz 📝 log.jsonl 📈 sweep.png 📊 score_hist.png
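A quick way to check a run directory is to list which of these files exist and what `tensors.npz` contains. The sketch below assumes only the file names shown above; the array name `concept_vectors` in the demo is invented, and the real key names may differ.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

EXPECTED = ["config.json", "concept.json", "metrics.json", "tensors.npz",
            "log.jsonl", "sweep.png", "score_hist.png"]

def inspect_run(run_dir):
    """Report which expected artifacts exist and which arrays tensors.npz holds."""
    run_dir = Path(run_dir)
    present = [name for name in EXPECTED if (run_dir / name).exists()]
    arrays = {}
    npz_path = run_dir / "tensors.npz"
    if npz_path.exists():
        with np.load(npz_path) as data:
            arrays = {key: data[key].shape for key in data.files}
    return present, arrays

# Self-contained demo on a fake run directory (array name is made up).
demo = Path(tempfile.mkdtemp())
np.savez(demo / "tensors.npz", concept_vectors=np.zeros((32, 8)))
(demo / "metrics.json").write_text(json.dumps({}), encoding="utf-8")
present, arrays = inspect_run(demo)
```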

Automatically find the best layer

The library scans every transformer layer, computing Cohen's d effect size and p-values to identify where a concept is most strongly encoded. Optional best_layer_search intervals let you focus the search.
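Selecting the best layer from a sweep amounts to an argmax over per-layer effect sizes, optionally restricted to some layers. The sketch below assumes that intervals are inclusive `(start, end)` index pairs and that the absolute value of d is what gets ranked; both are guesses about the `best_layer_search` setting, and `pick_best_layer` is a hypothetical helper.

```python
import numpy as np

def pick_best_layer(effect_sizes, intervals=None):
    """Return the layer index with the largest |d|, optionally restricted.

    effect_sizes: one Cohen's d per layer
    intervals:    optional list of inclusive (start, end) layer ranges (assumed format)
    """
    d = np.abs(np.asarray(effect_sizes, dtype=float))
    allowed = np.ones(len(d), dtype=bool)
    if intervals:
        allowed[:] = False
        for start, end in intervals:
            allowed[start:end + 1] = True
    # Layers outside the search intervals can never win the argmax.
    masked = np.where(allowed, d, -np.inf)
    return int(np.argmax(masked))

sweep = [0.1, 0.4, 2.5, 1.8, 3.1, 0.9]
best_overall = pick_best_layer(sweep)                    # layer 4
best_early = pick_best_layer(sweep, intervals=[(0, 3)])  # layer 2
```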

[Figure: layer sweep from Layer 0 (embedding) to Layer 31 (output), with the best layer marked ★]

See concepts light up, token by token

Interactive HTML heatmaps highlight every token with its concept projection score. Chat-aware rendering separates system, user, and assistant blocks. Hover for exact values.

[Figure: token heatmap for the sad_vs_happy probe, colored from negative to positive]
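A per-token projection score is just a dot product between each token's hidden state and the concept vector. The sketch below uses toy numbers and a hypothetical `token_scores` helper; the library may center or normalize the scores differently.

```python
import numpy as np

def token_scores(hidden, v):
    """Project each token's hidden state (tokens, dim) onto unit vector v."""
    return np.asarray(hidden, dtype=float) @ np.asarray(v, dtype=float)

tokens = ["I", "feel", "empty"]
hidden = np.array([[0.1, 0.0], [0.5, 1.0], [2.0, -1.0]])
v = np.array([1.0, 0.0])            # toy concept direction

scores = token_scores(hidden, v)
for tok, s in zip(tokens, scores):
    print(f"{tok:>6}  {s:+.2f}")    # one score per token, ready for coloring
```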

Steer model behavior with concept vectors

Inject learned directions into hidden states during generation. Sigma-scaled alpha expresses steering strength in units of the projection standard deviation (σ), so the same alpha is comparable across models. Multi-layer injection can use a separate concept vector for each layer.

α = 0.0σ (neutral)
The ocean stretches out before me, vast and timeless. Its waves carry stories from distant shores, each crest a brief moment of light before settling back into the deep.
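The arithmetic behind sigma-scaled steering can be sketched as adding `alpha * sigma * v` to each hidden state, where sigma is the standard deviation of unsteered projections onto `v`. The `steer` function and the synthetic baseline are illustrative assumptions; in the library the shift is applied inside the model during generation.

```python
import numpy as np

def steer(hidden, v, alpha, sigma=1.0, alpha_unit="raw"):
    """Shift every token state along unit vector v by alpha (times sigma if requested)."""
    scale = alpha * sigma if alpha_unit == "sigma" else alpha
    return hidden + scale * v

rng = np.random.default_rng(0)
v = np.array([0.0, 1.0, 0.0])
# Synthetic unsteered states whose spread along v is larger than elsewhere.
baseline = rng.normal(size=(200, 3)) * np.array([1.0, 4.0, 1.0])
sigma = (baseline @ v).std()          # spread of unsteered projections

hidden = np.zeros((2, 3))
steered = steer(hidden, v, alpha=6.0, sigma=sigma, alpha_unit="sigma")
shift = (steered - hidden) @ v        # how far each token moved along v
```

With `alpha_unit="raw"` the same call moves each state by exactly `alpha`, regardless of how widely the model's projections vary.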

Layer Modes

"probe" — best layer only
"window" — radius around best
[10, 15, 20] — explicit list
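One way to read the three layer modes is as a function from the setting to a list of layer indices. The helper below is a hypothetical sketch; in particular the `radius` parameter name and the clamping at the first and last layer are assumptions.

```python
def resolve_steer_layers(mode, best_layer, n_layers, radius=1):
    """Turn a layer-mode setting into a sorted list of layer indices."""
    if mode == "probe":
        return [best_layer]
    if mode == "window":
        lo = max(0, best_layer - radius)
        hi = min(n_layers - 1, best_layer + radius)
        return list(range(lo, hi + 1))
    if isinstance(mode, (list, tuple)):
        return sorted(int(i) for i in mode)
    raise ValueError(f"unsupported layer mode: {mode!r}")

probe_layers = resolve_steer_layers("probe", best_layer=15, n_layers=32)
window_layers = resolve_steer_layers("window", best_layer=15, n_layers=32, radius=2)
explicit_layers = resolve_steer_layers([10, 15, 20], best_layer=15, n_layers=32)
```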

Alpha Units

"raw" — direct value
"sigma" — scaled by projection σ at best layer for interpretable, model-adaptive strength

Score multiple concepts in one pass

Generate text once, then score the same tokens against many concept probes simultaneously. Interactive HTML heatmaps include a probe-selector dropdown. Ideal for comparing how different concepts manifest in the same output.

  • empathy vs detachment
  • formal vs casual
  • truthful vs deceptive
  • confident vs uncertain
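Scoring several probes at once reduces to one matrix product when the probes read the same layer. The sketch below makes that simplifying assumption (real probes may each use their own best layer), and the probe names and numbers are toy values.

```python
import numpy as np

probe_names = ["empathy_vs_detachment", "formal_vs_casual"]
# One unit vector per probe, stacked as rows: shape (n_probes, dim).
V = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
# Hidden states for one generated text: shape (tokens, dim).
H = np.array([[0.2, -1.0, 3.0],
              [1.5, 0.5, 0.0]])

scores = H @ V.T                      # shape (tokens, n_probes)
by_probe = {name: scores[:, i] for i, name in enumerate(probe_names)}
```

Because the hidden states are computed once, adding another probe only adds a row to `V`.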

End-to-end eval pipeline with statistical analysis

Generate items, build prompts, score with multiple alphas, evaluate correctness, rate coherence via LLM, and produce publication-quality plots — all from a single function call.

📝

Generate Items

Custom generator or explicit list

🔨

Build Prompts

Template, prefix, or builder fn

🧮

Score + Steer

Multiple alphas & steer layers

Evaluate

Custom or marker-based evaluator

📊

Analyze

Stats, plots, coherence ratings

Auto-generated plots

Accuracy vs Alpha
Mean Score vs Alpha
Score by Correctness
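The first two plots are built from per-alpha aggregates, which can be sketched with the standard library alone. The record fields (`alpha`, `correct`, `score`) and the `summarize_by_alpha` helper are assumptions for illustration, not the library's output schema.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_alpha(records):
    """Group eval records by alpha; report accuracy and mean concept score."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["alpha"]].append(rec)
    return {
        alpha: {
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in rows),
            "mean_score": mean(r["score"] for r in rows),
        }
        for alpha, rows in sorted(groups.items())
    }

# Toy records: two items evaluated at two steering strengths.
records = [
    {"alpha": 0.0, "correct": True, "score": 0.5},
    {"alpha": 0.0, "correct": True, "score": 1.5},
    {"alpha": 6.0, "correct": True, "score": 4.0},
    {"alpha": 6.0, "correct": False, "score": 6.0},
]
summary = summarize_by_alpha(records)
```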

From zero to concept vectors in minutes

Define your concept with two opposing behaviors. Everything else uses smart defaults — but every parameter is overridable.

Python: minimal example
from concept_probe import ConceptSpec, ProbeWorkspace

workspace = ProbeWorkspace(model_id="meta-llama/Llama-3.2-3B-Instruct")

concept = ConceptSpec(
    name="sad_vs_happy",
    pos_label="sad",  neg_label="happy",
    pos_system="You are a sad, melancholic assistant...",
    neg_system="You are a cheerful, upbeat assistant...",
    eval_pos_texts=["I feel a heavy emptiness."],
    eval_neg_texts=["I feel light and excited!"],
)

probe = workspace.train_concept(concept)  # trains + sweeps + saves

probe.score_prompts(
    prompts=["Write about the ocean."],
    alphas=[0.0, 6.0, -6.0],
    alpha_unit="sigma",
)

Python: reloading a trained probe
from concept_probe import ProbeWorkspace

# No retraining needed — load from a previous run directory
workspace = ProbeWorkspace(
    project_directory="outputs/sad_vs_happy/20260109_150734"
)
probe = workspace.get_probe(name="sad_vs_happy")

probe.score_texts(texts=["Life feels meaningless today."])

Everything you need for probing research

Zero-Config Defaults

Smart defaults for everything — 20 training questions, readout modes, sweep settings. Override any field via JSON deep-merge.
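A deep merge updates nested settings key by key instead of replacing whole sections. The function below is a generic sketch of that idea; the section and field names in the example are hypothetical, and the library's own merge rules (for lists, for instance) may differ.

```python
import copy

def deep_merge(defaults, overrides):
    """Return defaults updated by overrides, merging nested dicts key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)   # recurse into sections
        else:
            merged[key] = copy.deepcopy(value)             # replace leaf values
    return merged

# Hypothetical section and field names, for illustration only.
defaults = {"training": {"n_questions": 20, "readout": "assistant_all_mean"},
            "sweep": {"enabled": True}}
overrides = {"training": {"n_questions": 40}}
config = deep_merge(defaults, overrides)
```

Only `n_questions` changes; the untouched `readout` and `sweep` settings keep their defaults.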

🔄

6 Readout Modes

assistant_all_mean, assistant_last, sequence_last, sequence_all_mean, and last-k variants for both training and reading.

📐

Layer Sweep

Automatic Cohen's d + p-value sweep across all layers. Configurable search intervals with visual shading on plots.

🎛️

Multi-Layer Steering

Inject concept vectors via PyTorch hooks. Use per-layer vectors, distribute alpha across layers, and scale by sigma for interpretability.

🔬

Multi-Probe Scoring

One forward pass, N concept probes. Interactive HTML heatmaps with probe-selector dropdown. Compare concepts on the same text.

Auto-Graded Evals

End-to-end evaluation with custom generators, evaluators, marker extraction, LLM coherence rating, and publication-quality plots.

💬

Conversation Format

Prompts can be plain strings or full multi-turn conversations. System prompt precedence is resolved automatically, with warnings.

📊

Turn-Level Segmentation

Save token scores segmented by role and turn for chat prompts. Per-message averages and prompt vs completion separation.
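Per-message averages can be sketched by grouping consecutive tokens that share a turn and role. The input layout (three parallel lists) and the `per_message_averages` helper are assumptions for illustration, not the saved file format.

```python
from itertools import groupby
from statistics import mean

def per_message_averages(token_roles, token_turns, token_scores):
    """Average token scores within each (turn, role) message, in order."""
    rows = zip(token_turns, token_roles, token_scores)
    out = []
    # groupby merges only consecutive rows, which matches message boundaries.
    for (turn, role), grp in groupby(rows, key=lambda r: (r[0], r[1])):
        out.append({"turn": turn, "role": role,
                    "mean_score": mean(s for _, _, s in grp)})
    return out

roles = ["user", "user", "assistant", "assistant", "assistant"]
turns = [0, 0, 0, 0, 0]
scores = [0.0, 1.0, 2.0, 4.0, 6.0]
segments = per_message_averages(roles, turns, scores)
```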

🔗

Batch Aggregation

Merge multiple evaluation batches into unified cross-batch analyses. Rehydrate analysis on existing batches at any time.

📋

JSON-Driven Training

Define concepts in JSON files and train directly with train_concept_from_json(). Perfect for batch experiments.
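A concept file could plausibly mirror the `ConceptSpec` fields shown in the minimal example. The sketch below only writes and re-reads such a file. Whether `train_concept_from_json()` expects exactly this layout is an assumption, and its call signature is not shown on this page, so it is not called here.

```python
import json
import tempfile
from pathlib import Path

# Field names copied from the ConceptSpec example on this page; whether the
# JSON loader expects exactly this layout is an assumption.
concept = {
    "name": "sad_vs_happy",
    "pos_label": "sad",
    "neg_label": "happy",
    "pos_system": "You are a sad, melancholic assistant...",
    "neg_system": "You are a cheerful, upbeat assistant...",
    "eval_pos_texts": ["I feel a heavy emptiness."],
    "eval_neg_texts": ["I feel light and excited!"],
}

# Write to a temporary directory so the example leaves the project untouched.
path = Path(tempfile.mkdtemp()) / "sad_vs_happy.json"
path.write_text(json.dumps(concept, indent=2), encoding="utf-8")
loaded = json.loads(path.read_text(encoding="utf-8"))
```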

🧩

4-bit Quantization

Supports bitsandbytes 4-bit loading for large models on limited hardware. Graceful fallback if unavailable.
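A graceful fallback can be sketched as a helper that returns 4-bit loading options only when the optional packages are importable. `quantization_kwargs` is a hypothetical helper, not the library's code; `BitsAndBytesConfig(load_in_4bit=True)` is the Hugging Face transformers option for 4-bit loading.

```python
import importlib.util

def quantization_kwargs(want_4bit=True):
    """Return extra kwargs for from_pretrained, or {} when 4-bit is unavailable."""
    if not want_4bit or importlib.util.find_spec("bitsandbytes") is None:
        return {}                      # fall back to a regular full-precision load
    try:
        from transformers import BitsAndBytesConfig
    except ImportError:
        return {}
    return {"quantization_config": BitsAndBytesConfig(load_in_4bit=True)}

kwargs = quantization_kwargs()
# Hypothetical usage: AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
```

Returning an empty dict keeps the calling code identical in both cases, so the fallback needs no separate branch at the load site.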

📁

Full Artifact Pipeline

Every run saves configs, JSONL logs, NPZ tensors, metrics, plots, and HTML — fully reproducible and inspectable.

Start probing in seconds

$ pip install git+https://github.com/mneuronico/concept-probe.git

Minimal deps: torch, transformers, numpy. Optional: scipy (p-values), matplotlib (plots), bitsandbytes (4-bit).