concept-probe — Config-Driven Concept Probing for LLMs

Why Concept Probing?

LLMs encode rich concepts in their hidden layers.
This library helps you extract and use them.

Concept probing lets researchers discover directional vectors inside transformer layers that correspond to abstract behaviors — and then leverage those vectors for analysis and control.

🔬

Interpretability

Identify which layers and directions encode specific behaviors like formality, empathy, or truthfulness.

🎯

Quantitative Scoring

Score any text token-by-token against learned concept vectors — see exactly where a concept activates.

🧭

Behavioral Steering

Inject concept vectors during generation to steer a model toward or away from a target behavior.

📊

Reproducible Experiments

Every run logs configs, metrics, tensors, and plots. Full artifact pipeline for reproducible research.

How It Works

Four phases, one function call

workspace.train_concept(concept) runs a complete pipeline — from generation through statistical evaluation — automatically.

1

Generate Training Completions

The model generates responses using positive and negative system prompts — creating two behaviorally opposed sets of completions.

2

Extract Hidden States

A forward pass collects hidden state representations from every transformer layer for all completions. Configurable readout modes (mean over assistant tokens, last token, etc.) reduce states to fixed-size vectors.

3

Compute Concept Vectors

For each layer, compute v = normalize(μ_pos − μ_neg). Then evaluate with Cohen's d effect size and t-test p-values to find the most discriminative layer.

4

Save Artifacts & Visualizations

The probe, sweep plots, score histograms, configs, logs, and tensors are saved in a timestamped output directory — ready for analysis.

📄 config.json 📄 concept.json 📄 metrics.json 📦 tensors.npz 📝 log.jsonl 📈 sweep.png 📊 score_hist.png
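Phase 3 above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the function names (`concept_vector`, `cohens_d`, `sweep`) and the synthetic data are hypothetical, and the sketch assumes the effect size is measured on separate evaluation readouts, since the quickstart passes `eval_pos_texts` and `eval_neg_texts`. The t-test p-values are left out to avoid the optional scipy dependency.

```python
import numpy as np

def concept_vector(pos, neg):
    """Unit vector from the negative-class mean to the positive-class mean."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cohens_d(pos_scores, neg_scores):
    """Standardized mean difference between two 1-D samples (pooled std)."""
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    pooled_var = (
        (n_pos - 1) * pos_scores.var(ddof=1) + (n_neg - 1) * neg_scores.var(ddof=1)
    ) / (n_pos + n_neg - 2)
    return (pos_scores.mean() - neg_scores.mean()) / np.sqrt(pooled_var)

def sweep(train_pos, train_neg, eval_pos, eval_neg):
    """Fit one vector per layer on training readouts, rate it on eval readouts.

    Every argument has shape (n_layers, n_examples, d_model).
    Returns (best_layer, list_of_d_values).
    """
    d_values = []
    for layer in range(train_pos.shape[0]):
        v = concept_vector(train_pos[layer], train_neg[layer])
        d_values.append(cohens_d(eval_pos[layer] @ v, eval_neg[layer] @ v))
    return int(np.argmax(d_values)), d_values

# Synthetic readouts (made up): the classes separate more strongly in later layers.
rng = np.random.default_rng(0)
shift = np.array([0.0, 0.5, 5.0])  # class gap along dimension 0, per layer

def fake_readouts(sign, n_examples=40, d_model=8):
    x = rng.normal(size=(len(shift), n_examples, d_model))
    x[:, :, 0] += sign * shift[:, None] / 2
    return x

best, d_values = sweep(fake_readouts(+1), fake_readouts(-1),
                       fake_readouts(+1), fake_readouts(-1))
```

Because each vector is normalized, the projection scores at a given layer are in the units of that layer's hidden states, and Cohen's d then makes layers comparable by dividing out the spread.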

Layer Sweep

Automatically find the best layer

The library scans every transformer layer, computing Cohen's d effect size and p-values to identify where a concept is most strongly encoded. Optional best_layer_search intervals let you focus the search.
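Restricting the search to intervals might look like the sketch below. The format of `best_layer_search` is not documented here, so the sketch assumes a list of inclusive `(start, end)` layer-index pairs; the helper name `best_layer` and the effect sizes are hypothetical.

```python
def best_layer(d_values, search_intervals=None):
    """Index of the layer with the largest Cohen's d.

    search_intervals (assumed format): list of inclusive (start, end) pairs.
    When omitted, every layer is a candidate.
    """
    if search_intervals is None:
        candidates = list(range(len(d_values)))
    else:
        candidates = sorted({
            layer
            for start, end in search_intervals
            for layer in range(start, end + 1)
            if 0 <= layer < len(d_values)
        })
    if not candidates:
        raise ValueError("search intervals exclude every layer")
    return max(candidates, key=lambda layer: d_values[layer])

# Made-up effect sizes for a six-layer model.
d_per_layer = [0.1, 0.4, 1.9, 2.6, 2.2, 3.1]
unrestricted = best_layer(d_per_layer)
middle_only = best_layer(d_per_layer, search_intervals=[(1, 4)])
```

Limiting the search can be useful when the final layers score highest but are less suitable for steering; that is a design judgment, not something the sweep decides on its own.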

[Interactive layer-sweep chart: layers run from 0 (embedding) to 31 (output), and a ★ marks the best layer.]

Token Heatmaps

See concepts light up, token by token

Interactive HTML heatmaps highlight every token with its concept projection score. Chat-aware rendering separates system, user, and assistant blocks. Hover for exact values.
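The per-token score behind such a heatmap is a dot product between each token's hidden state and the concept vector. The sketch below shows that projection plus a bare-bones HTML rendering; the function names, color scheme, and toy values are assumptions for illustration and do not reproduce the library's chat-aware renderer.

```python
import html
import numpy as np

def token_scores(hidden, v):
    """Project each token's hidden state (seq_len, d_model) onto the unit vector v."""
    return np.asarray(hidden, dtype=float) @ np.asarray(v, dtype=float)

def render_heatmap(tokens, scores):
    """Return an HTML string: red for positive scores, blue for negative.

    Opacity is proportional to |score|; the title attribute gives the
    exact value on hover.
    """
    scale = float(np.max(np.abs(scores))) or 1.0  # avoid dividing by zero
    spans = []
    for tok, s in zip(tokens, scores):
        rgb = "255,0,0" if s >= 0 else "0,0,255"
        opacity = abs(float(s)) / scale
        spans.append(
            f'<span title="{s:.3f}" style="background:rgba({rgb},{opacity:.2f})">'
            f"{html.escape(tok)}</span>"
        )
    return "".join(spans)

# Toy example: three tokens, two hidden dimensions, concept along dimension 0.
tokens = ["I", " feel", " <empty>"]
hidden = [[0.5, 1.0], [-1.0, 0.0], [2.0, 3.0]]
scores = token_scores(hidden, v=[1.0, 0.0])
page = render_heatmap(tokens, scores)
```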

[Heatmap legend: colors range from negative to positive scores; the example shows the sad_vs_happy probe.]

Activation Steering

Steer model behavior with concept vectors

Inject learned directions into hidden states during generation. Expressing alpha in sigma units scales the steering strength by the spread of projection scores, which keeps values comparable across models. Injection can target several layers at once, each with its own per-layer vector.

[Interactive demo: a slider moves α between happy and sad. At α = 0.0σ (neutral), the sample completion reads:]

The ocean stretches out before me, vast and timeless. Its waves carry stories from distant shores, each crest a brief moment of light before settling back into the deep.

Layer Modes

"probe" — best layer only
"window" — radius around best
[10, 15, 20] — explicit list
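One way to read the three layer modes is as a function that turns the setting into a concrete list of layer indices. The helper below is a hypothetical sketch: the name `resolve_steer_layers` and the `radius` parameter are assumptions, not the library's API.

```python
def resolve_steer_layers(mode, best_layer, num_layers, radius=2):
    """Turn a layer-mode setting into a sorted list of layer indices."""
    if mode == "probe":
        # Steer only at the layer the sweep selected.
        return [best_layer]
    if mode == "window":
        # Steer at every layer within `radius` of the best one, clipped to the model.
        low = max(0, best_layer - radius)
        high = min(num_layers - 1, best_layer + radius)
        return list(range(low, high + 1))
    if isinstance(mode, (list, tuple)):
        # Explicit list: validate and de-duplicate.
        invalid = [layer for layer in mode if not 0 <= layer < num_layers]
        if invalid:
            raise ValueError(f"layers out of range: {invalid}")
        return sorted(set(mode))
    raise ValueError(f"unknown layer mode: {mode!r}")

probe_only = resolve_steer_layers("probe", best_layer=15, num_layers=32)
window = resolve_steer_layers("window", best_layer=15, num_layers=32, radius=2)
explicit = resolve_steer_layers([20, 10, 15], best_layer=15, num_layers=32)
```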

Alpha Units

"raw" — direct value
"sigma" — scaled by projection σ at best layer for interpretable, model-adaptive strength
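The difference between the two alpha units can be shown with plain arrays. In this sketch, σ is taken to be the standard deviation of projection scores over some reference readouts at the best layer, and the same offset is added at every token position; both choices are assumptions, as are the function names.

```python
import numpy as np

def steering_delta(v, alpha, alpha_unit="raw", sigma=None):
    """Offset added to hidden states: alpha * v (raw) or alpha * sigma * v (sigma)."""
    if alpha_unit == "raw":
        strength = alpha
    elif alpha_unit == "sigma":
        if sigma is None:
            raise ValueError("alpha_unit='sigma' requires a sigma value")
        strength = alpha * sigma
    else:
        raise ValueError(f"unknown alpha unit: {alpha_unit!r}")
    return strength * np.asarray(v, dtype=float)

def steer(hidden, v, alpha, alpha_unit="raw", sigma=None):
    """Add the same offset to every token's hidden state (seq_len, d_model)."""
    return np.asarray(hidden, dtype=float) + steering_delta(v, alpha, alpha_unit, sigma)

v = np.array([1.0, 0.0])                      # unit concept vector
reference_projections = np.array([-2.0, 2.0])  # made-up scores at the best layer
sigma = float(np.std(reference_projections))   # spread of the projections

hidden = np.zeros((3, 2))
steered_sigma = steer(hidden, v, alpha=6.0, alpha_unit="sigma", sigma=sigma)
steered_raw = steer(hidden, v, alpha=6.0, alpha_unit="raw")
```

With sigma units, an alpha of 6.0 shifts each token's projection by six standard deviations regardless of how large the model's activations are, which is what makes the setting portable between models.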

Multi-Probe Scoring

Score multiple concepts in one pass

Generate text once, then score the same tokens against many concept probes simultaneously. Interactive HTML heatmaps include a probe-selector dropdown. Ideal for comparing how different concepts manifest in the same output.

empathy vs detachment
formal vs casual
truthful vs deceptive
confident vs uncertain
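Scoring several probes against one set of hidden states amounts to one projection per probe over activations that were computed once. The sketch below assumes each probe is stored as a `(layer, unit_vector)` pair, since different concepts may peak at different layers; the function name, probe names, and toy values are hypothetical.

```python
import numpy as np

def score_all_probes(hidden_by_layer, probes):
    """Score every token against every probe using already-computed hidden states.

    hidden_by_layer: array of shape (n_layers, seq_len, d_model)
    probes: dict mapping probe name -> (layer index, unit vector of length d_model)
    Returns a dict mapping probe name -> per-token scores of shape (seq_len,).
    """
    hidden_by_layer = np.asarray(hidden_by_layer, dtype=float)
    return {
        name: hidden_by_layer[layer] @ np.asarray(vector, dtype=float)
        for name, (layer, vector) in probes.items()
    }

# Toy activations: 2 layers, 3 tokens, 2 hidden dimensions.
hidden_by_layer = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # layer 0
    [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]],  # layer 1
]
probes = {
    "formal_vs_casual": (0, [1.0, 0.0]),
    "truthful_vs_deceptive": (1, [0.0, 1.0]),
}
scores = score_all_probes(hidden_by_layer, probes)
```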

Auto-Graded Evaluations

End-to-end eval pipeline with statistical analysis

Generate items, build prompts, score with multiple alphas, evaluate correctness, rate coherence via LLM, and produce publication-quality plots — all from a single function call.

📝

Generate Items

Custom generator or explicit list

→

🔨

Build Prompts

Template, prefix, or builder fn

→

🧮

Score + Steer

Multiple alphas & steer layers

→

✅

Evaluate

Custom or marker-based evaluator

→

📊

Analyze

Stats, plots, coherence ratings

Auto-generated plots

Accuracy vs Alpha

Mean Score vs Alpha

Score by Correctness
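The Evaluate and Analyze steps can be illustrated with a small stand-alone sketch that produces the numbers behind an accuracy-versus-alpha plot. The marker convention (`ANSWER:` followed by the answer), the record layout, and both function names are assumptions made for this example; the records themselves are invented toy data.

```python
from collections import defaultdict

def marker_answer(completion, marker="ANSWER:"):
    """Return the first word after the last occurrence of marker, or None."""
    _, found, tail = completion.rpartition(marker)
    if not found:
        return None
    words = tail.split()
    return words[0] if words else None

def accuracy_by_alpha(records):
    """records: iterable of dicts with keys 'alpha', 'completion', 'expected'."""
    hits = defaultdict(list)
    for record in records:
        correct = marker_answer(record["completion"]) == record["expected"]
        hits[record["alpha"]].append(correct)
    return {alpha: sum(flags) / len(flags) for alpha, flags in sorted(hits.items())}

# Invented toy records: two items evaluated at two steering strengths.
records = [
    {"alpha": 0.0, "completion": "2 + 2 is four. ANSWER: 4", "expected": "4"},
    {"alpha": 0.0, "completion": "ANSWER: 7", "expected": "7"},
    {"alpha": 6.0, "completion": "I can hardly count today. ANSWER: 5", "expected": "4"},
    {"alpha": 6.0, "completion": "ANSWER: 7", "expected": "7"},
]
accuracy = accuracy_by_alpha(records)
```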

Quickstart

From zero to concept vectors in minutes

Define your concept with two opposing behaviors. Everything else uses smart defaults — but every parameter is overridable.

Python — minimal example

from concept_probe import ConceptSpec, ProbeWorkspace

# Point the workspace at the model to probe.
workspace = ProbeWorkspace(model_id="meta-llama/Llama-3.2-3B-Instruct")

# A concept is a pair of opposing behaviors, each elicited by its own system prompt.
concept = ConceptSpec(
    name="sad_vs_happy",
    pos_label="sad",
    neg_label="happy",
    pos_system="You are a sad, melancholic assistant...",
    neg_system="You are a cheerful, upbeat assistant...",
    eval_pos_texts=["I feel a heavy emptiness."],
    eval_neg_texts=["I feel light and excited!"],
)

probe = workspace.train_concept(concept)  # trains + sweeps + saves

# Score the prompt at several steering strengths; 0.0 is the neutral baseline.
probe.score_prompts(
    prompts=["Write about the ocean."],
    alphas=[0.0, 6.0, -6.0],
    alpha_unit="sigma",  # alpha is measured in units of the projection σ
)

Python — reloading a trained probe

from concept_probe import ProbeWorkspace

# No retraining needed — load from a previous run directory
workspace = ProbeWorkspace(
    project_directory="outputs/sad_vs_happy/20260109_150734"
)
probe = workspace.get_probe(name="sad_vs_happy")

probe.score_texts(texts=["Life feels meaningless today."])

Features

Everything you need for probing research

⚡

Zero-Config Defaults

Smart defaults for everything — 20 training questions, readout modes, sweep settings. Override any field via JSON deep-merge.
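Deep-merge means that an override replaces only the keys it names and leaves sibling defaults in place. The sketch below shows the general technique; the config keys are hypothetical and do not reflect the library's actual schema.

```python
import copy

def deep_merge(defaults, overrides):
    """Return a new dict where overrides are merged into defaults recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge one level down instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

# Hypothetical default config; only the nested "readout" key is overridden.
defaults = {
    "training": {"n_questions": 20, "readout": "assistant_all_mean"},
    "sweep": {"enabled": True},
}
overrides = {"training": {"readout": "assistant_last"}}
config = deep_merge(defaults, overrides)
```

A shallow `dict.update` would have replaced the whole `training` section and dropped `n_questions`, which is the failure a deep merge avoids.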

🔄

6 Readout Modes

assistant_all_mean, assistant_last, sequence_last, sequence_all_mean, and last-k variants for both training and reading.
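The four named modes differ only in which token positions are pooled. The sketch below assumes the obvious reading of each name and a boolean mask marking assistant tokens; the `readout` function is hypothetical, and the last-k variants are omitted because their exact names are not given here.

```python
import numpy as np

def readout(hidden, assistant_mask, mode="assistant_all_mean"):
    """Reduce one layer's per-token states (seq_len, d_model) to one vector (d_model,)."""
    hidden = np.asarray(hidden, dtype=float)
    mask = np.asarray(assistant_mask, dtype=bool)
    if mode == "sequence_all_mean":
        return hidden.mean(axis=0)   # average over every token
    if mode == "sequence_last":
        return hidden[-1]            # final token of the whole sequence
    if mode in ("assistant_all_mean", "assistant_last"):
        if not mask.any():
            raise ValueError("no assistant tokens in this sequence")
        if mode == "assistant_all_mean":
            return hidden[mask].mean(axis=0)    # average over assistant tokens
        return hidden[np.flatnonzero(mask)[-1]]  # last assistant token
    raise ValueError(f"unknown readout mode: {mode!r}")

# Toy example: 4 tokens, 2 hidden dims; the last two tokens are the assistant's.
states = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
is_assistant = [False, False, True, True]
pooled = readout(states, is_assistant, "assistant_all_mean")
```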

📐

Layer Sweep

Automatic Cohen's d + p-value sweep across all layers. Configurable search intervals with visual shading on plots.

🎛️

Multi-Layer Steering

Inject concept vectors via PyTorch hooks. Per-layer vectors, distribute alpha across layers, sigma-scaled for interpretability.

🔬

Multi-Probe Scoring

One forward pass, N concept probes. Interactive HTML heatmaps with probe-selector dropdown. Compare concepts on the same text.

✅

Auto-Graded Evals

End-to-end evaluation with custom generators, evaluators, marker extraction, LLM coherence rating, and publication-quality plots.

💬

Conversation Format

Prompts can be plain strings or full multi-turn conversations. When more than one system prompt is supplied, the library resolves which one takes precedence and warns about the conflict.

📊

Turn-Level Segmentation

Save token scores segmented by role and turn for chat prompts. Per-message averages and prompt vs completion separation.

🔗

Batch Aggregation

Merge multiple evaluation batches into unified cross-batch analyses. Rehydrate analysis on existing batches at any time.

📋

JSON-Driven Training

Define concepts in JSON files and train directly with train_concept_from_json(). Perfect for batch experiments.

🧩

4-bit Quantization

Supports bitsandbytes 4-bit loading for large models on limited hardware. Graceful fallback if unavailable.

📁

Full Artifact Pipeline

Every run saves configs, JSONL logs, NPZ tensors, metrics, plots, and HTML — fully reproducible and inspectable.

Installation

Start probing in seconds

$ pip install git+https://github.com/mneuronico/concept-probe.git

Minimal deps: torch, transformers, numpy. Optional: scipy (p-values), matplotlib (plots), bitsandbytes (4-bit).

