A config-driven Python library for training concept directions in open-weights language models — then scoring, visualizing, and steering behavior, all in a few lines of code.
Concept probing lets researchers discover directional vectors inside transformer layers that correspond to abstract behaviors — and then leverage those vectors for analysis and control.
Identify which layers and directions encode specific behaviors like formality, empathy, or truthfulness.
Score any text token-by-token against learned concept vectors — see exactly where a concept activates.
Inject concept vectors during generation to steer a model toward or away from a target behavior.
Every run logs configs, metrics, tensors, and plots. Full artifact pipeline for reproducible research.
workspace.train_concept(concept) runs a complete pipeline — from generation through statistical evaluation — automatically.
The model generates responses using positive and negative system prompts — creating two behaviorally opposed sets of completions.
A forward pass collects hidden state representations from every transformer layer for all completions. Configurable readout modes (mean over assistant tokens, last token, etc.) reduce states to fixed-size vectors.
For each layer, compute v = normalize(μ_pos − μ_neg), where μ_pos and μ_neg are the mean readout vectors of the positive and negative completions. Then evaluate with Cohen's d effect size and t-test p-values to find the most discriminative layer.
The probe, sweep plots, score histograms, configs, logs, and tensors are saved in a timestamped output directory — ready for analysis.
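The readout, direction, and effect-size steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the function names (`readout`, `concept_direction`, `cohens_d`, `sweep_layers`) and the array layouts are hypothetical, only four of the readout modes are shown, and the t-test p-value is omitted.

```python
import numpy as np

def readout(hidden, assistant_mask, mode="assistant_all_mean"):
    """Reduce per-token states [tokens, dim] to one vector [dim]."""
    if mode == "assistant_all_mean":
        return hidden[assistant_mask].mean(axis=0)
    if mode == "assistant_last":
        return hidden[np.flatnonzero(assistant_mask)[-1]]
    if mode == "sequence_all_mean":
        return hidden.mean(axis=0)
    if mode == "sequence_last":
        return hidden[-1]
    raise ValueError(f"unknown readout mode: {mode}")

def concept_direction(pos, neg):
    """v = normalize(mean(pos) - mean(neg)) for arrays shaped [n, dim]."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation, for 1-D score arrays."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def sweep_layers(pos_by_layer, neg_by_layer):
    """Inputs shaped [layers, n, dim]; returns per-layer results and the best one.

    Assumption: directions are scored on the same vectors they were fit on,
    which inflates d; a held-out evaluation set would be stricter.
    """
    results = []
    for layer, (pos, neg) in enumerate(zip(pos_by_layer, neg_by_layer)):
        v = concept_direction(pos, neg)
        d = cohens_d(pos @ v, neg @ v)  # separation of projections onto v
        results.append((layer, v, d))
    best = max(results, key=lambda r: abs(r[2]))
    return results, best
```

Each entry of `results` is a `(layer, direction, d)` tuple, so the best layer's unit vector can be reused directly for scoring or steering.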
The library scans every transformer layer, computing Cohen's d effect size and p-values to identify where a concept is most strongly encoded. Optional best_layer_search intervals let you focus the search.
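To illustrate how a restricted search might work, the sketch below picks the layer with the largest absolute effect size, optionally limited to a set of intervals. The representation of the intervals as inclusive `(start, end)` layer-index pairs is an assumption made for this example; the actual format accepted by best_layer_search may differ, and the `best_layer` helper is hypothetical.

```python
def best_layer(effect_sizes, search_intervals=None):
    """Return the index of the layer with the largest |d|.

    effect_sizes: one Cohen's d value per layer.
    search_intervals: optional list of inclusive (start, end) index pairs
    (assumed format) that restrict which layers are considered.
    """
    candidates = list(range(len(effect_sizes)))
    if search_intervals:
        candidates = [
            i for i in candidates
            if any(start <= i <= end for start, end in search_intervals)
        ]
    if not candidates:
        raise ValueError("no layers fall inside the search intervals")
    return max(candidates, key=lambda i: abs(effect_sizes[i]))
```

Restricting the search this way can be useful when the globally strongest layer sits very early or very late in the network and a mid-depth layer is preferred for steering.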
Interactive HTML heatmaps highlight every token with its concept projection score. Chat-aware rendering separates system, user, and assistant blocks. Hover for exact values.
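A token-level score of this kind is, in essence, the projection of each token's hidden state onto the concept direction. The sketch below shows that projection and a bare-bones HTML rendering with hover values. It is an illustration only: the helper names, the colors, and the markup are assumptions, and the library's chat-aware rendering is not reproduced here.

```python
import html
import numpy as np

def token_scores(hidden, direction):
    """Project each token state [tokens, dim] onto a unit direction [dim]."""
    return hidden @ direction

def render_heatmap(tokens, scores):
    """Return HTML spans shaded by score; the title attribute shows the value on hover."""
    limit = max(float(np.max(np.abs(scores))), 1e-8)  # avoid division by zero
    spans = []
    for tok, s in zip(tokens, scores):
        s = float(s)
        opacity = abs(s) / limit
        r, g, b = (220, 60, 60) if s > 0 else (60, 90, 220)  # red = positive, blue = negative
        spans.append(
            f'<span style="background: rgba({r},{g},{b},{opacity:.2f})" '
            f'title="{s:.3f}">{html.escape(tok)}</span>'
        )
    return "".join(spans)
```

Token text is escaped with `html.escape` so that tokens containing markup characters cannot break the page.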
Inject learned directions into hidden states during generation. Sigma-scaled alpha expresses steering strength in units of the projection standard deviation (σ) at the best layer, so the same value is comparable across models. Multi-layer injection can use a separate vector for each layer.
"probe" — best layer only"window" — radius around best[10, 15, 20] — explicit list
"raw" — direct value"sigma" — scaled by projection σ at best layer for interpretable, model-adaptive strength
Generate text once, then score the same tokens against many concept probes simultaneously. Interactive HTML heatmaps include a probe-selector dropdown. Ideal for comparing how different concepts manifest in the same output.
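Scoring one set of tokens against many probes reduces to a single matrix product when the probes read from the same hidden states. The sketch below makes that simplifying assumption (probes trained on different layers would each need that layer's states), and the `score_all_probes` helper is hypothetical.

```python
import numpy as np

def score_all_probes(hidden, probes):
    """Score every token against every probe in one matrix product.

    hidden: [tokens, dim] states from one forward pass.
    probes: dict mapping probe name to a unit direction [dim].
    Returns a dict mapping probe name to a [tokens] score array.
    """
    names = list(probes)
    directions = np.stack([probes[name] for name in names])  # [n_probes, dim]
    scores = hidden @ directions.T                            # [tokens, n_probes]
    return {name: scores[:, i] for i, name in enumerate(names)}
```

Because the forward pass dominates the cost, adding more probes to the comparison is cheap relative to generating the text again.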
Generate items, build prompts, score with multiple alphas, evaluate correctness, rate coherence via LLM, and produce publication-quality plots — all from a single function call.
Custom generator or explicit list
Template, prefix, or builder fn
Multiple alphas & steer layers
Custom or marker-based evaluator
Stats, plots, coherence ratings
Define your concept with two opposing behaviors. Everything else uses smart defaults — but every parameter is overridable.
from concept_probe import ConceptSpec, ProbeWorkspace

workspace = ProbeWorkspace(model_id="meta-llama/Llama-3.2-3B-Instruct")

concept = ConceptSpec(
    name="sad_vs_happy",
    pos_label="sad",
    neg_label="happy",
    pos_system="You are a sad, melancholic assistant...",
    neg_system="You are a cheerful, upbeat assistant...",
    eval_pos_texts=["I feel a heavy emptiness."],
    eval_neg_texts=["I feel light and excited!"],
)

probe = workspace.train_concept(concept)  # trains + sweeps + saves

probe.score_prompts(
    prompts=["Write about the ocean."],
    alphas=[0.0, 6.0, -6.0],
    alpha_unit="sigma",
)
from concept_probe import ProbeWorkspace

# No retraining needed: load from a previous run directory
workspace = ProbeWorkspace(
    project_directory="outputs/sad_vs_happy/20260109_150734"
)

probe = workspace.get_probe(name="sad_vs_happy")
probe.score_texts(texts=["Life feels meaningless today."])
Smart defaults for everything — 20 training questions, readout modes, sweep settings. Override any field via JSON deep-merge.
assistant_all_mean, assistant_last, sequence_last, sequence_all_mean, and last-k variants for both training and reading.
Automatic Cohen's d + p-value sweep across all layers. Configurable search intervals with visual shading on plots.
Inject concept vectors via PyTorch hooks. Per-layer vectors, distribute alpha across layers, sigma-scaled for interpretability.
One forward pass, N concept probes. Interactive HTML heatmaps with probe-selector dropdown. Compare concepts on the same text.
End-to-end evaluation with custom generators, evaluators, marker extraction, LLM coherence rating, and publication-quality plots.
Prompts can be plain strings or full multi-turn conversations. System prompt precedence handled intelligently with warnings.
Save token scores segmented by role and turn for chat prompts. Per-message averages and prompt vs completion separation.
Merge multiple evaluation batches into unified cross-batch analyses. Rehydrate analysis on existing batches at any time.
Define concepts in JSON files and train directly with train_concept_from_json(). Perfect for batch experiments.
Supports bitsandbytes 4-bit loading for large models on limited hardware. Graceful fallback if unavailable.
Every run saves configs, JSONL logs, NPZ tensors, metrics, plots, and HTML — fully reproducible and inspectable.
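The feature list mentions overriding defaults via JSON deep-merge. As a rough sketch of what a deep merge means in practice, the example below recursively layers user overrides on top of a defaults dictionary so that untouched nested fields keep their default values. The config keys shown are hypothetical placeholders, not the library's actual schema.

```python
import copy
import json

def deep_merge(defaults, overrides):
    """Return a new dict with overrides applied recursively on top of defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested sections
        else:
            merged[key] = copy.deepcopy(value)            # replace leaf values
    return merged

# Hypothetical default settings, for illustration only.
DEFAULTS = {
    "training": {"n_questions": 20, "readout": "assistant_all_mean"},
    "steering": {"alpha_unit": "sigma"},
}

overrides = json.loads('{"training": {"readout": "assistant_last"}}')
config = deep_merge(DEFAULTS, overrides)
```

Only the `readout` field changes in this example; the number of training questions and the steering section are inherited from the defaults, and `DEFAULTS` itself is left unmodified.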
pip install git+https://github.com/mneuronico/concept-probe.git
Minimal deps: torch, transformers, numpy. Optional: scipy (p-values), matplotlib (plots), bitsandbytes (4-bit).