SteerLLM gives you the API to inspect, understand, and steer LLM behavior at the feature level. Go beyond prompting—edit the model's internal representations directly.
# Search for a feature
POST /v1/features/search
{ "query": "pirate speech" }
# Apply steering to chat
POST /v1/chat/completions
{ "interventions": [{ "index_in_sae": 12345, "strength": 1.5 }] }
# → "Yarrr, let me spin ye a tale..."

Prompt engineering can suggest behavior, but it's non-deterministic, brittle, and fails over long conversations. It doesn't actually change the model's internal reasoning.
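The two calls above translate directly into Python. A minimal sketch of the request bodies, using only the standard library (the chat `messages` shape and the variable names are assumptions for illustration, not part of the documented API):

```python
import json

def search_body(query):
    """Request body for POST /v1/features/search."""
    return {"query": query}

def steer_body(messages, index_in_sae, strength):
    """Request body for POST /v1/chat/completions with one intervention."""
    return {
        "messages": messages,
        "interventions": [{"index_in_sae": index_in_sae, "strength": strength}],
    }

# Step 1: search "pirate speech" -- suppose the response names feature 12345.
search = search_body("pirate speech")

# Step 2: steer the completion by boosting that feature.
chat = steer_body(
    [{"role": "user", "content": "Tell me about your day."}],
    index_in_sae=12345,
    strength=1.5,
)
print(json.dumps(chat, indent=2))
```

Sending these bodies with any HTTP client (e.g. `requests.post(url, json=chat)`) completes the flow.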
Same prompt, different results. No reliability guarantees.
Breaks easily under edge cases or adversarial inputs.
Effectiveness fades in long conversations.
Stable, interpretable behavior control by directly modifying the model's internal feature representations.
Everything you need to understand and control LLM behavior at the feature level.
Find SAE features that correspond to concepts, behaviors, or attributes. Query in natural language—get back interpretable feature IDs.
POST /v1/features/search
{ "query": "pirate speech" }
// → { "label": "pirate-like speech", ... }

Understand which features activate when processing any text. Monitor activations in real time for alignment research.
POST /v1/chat_attribution/inspect
{ "messages": "Ahoy matey!" }
// → pirate_speech: 0.92, greeting: 0.88

Modify outputs by boosting or suppressing features. More reliable than prompting, especially in long contexts.
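Given an attribution response like the one above, you usually want the strongest features first. A small helper (the flat feature-to-score mapping is inferred from the snippet, not a documented schema):

```python
def top_features(activations, k=3):
    """Return the k highest-activating (feature, score) pairs, strongest first."""
    return sorted(activations.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Example activations in the shape shown above.
acts = {"pirate_speech": 0.92, "greeting": 0.88, "formality": 0.11}
print(top_features(acts, k=2))
# → [('pirate_speech', 0.92), ('greeting', 0.88)]
```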
POST /v1/chat/completions
{ "interventions": [{"strength": 1.5}] }
// → "Yarrr, let me tell ye..."

Build feature-level safety switches. Use contrastive search to identify and control toxic vs. polite behaviors.
POST /v1/chat_attribution/contrast
{ "dataset_1": "toxic", "dataset_2": "polite" }
// → toxicity: -2.0, politeness: +1.5

Integrate SteerLLM in minutes with our simple REST API.
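The strengths from a contrast run like the one above can back a simple toggle. A sketch of a safety switch (the SAE indices here are hypothetical placeholders you would take from your own contrast results):

```python
# Hypothetical SAE feature indices identified by a contrast run.
TOXICITY, POLITENESS = 777, 888

def safety_interventions(enabled):
    """Interventions list for POST /v1/chat/completions acting as a safety switch."""
    if not enabled:
        return []  # safety mode off: no steering applied
    return [
        {"index_in_sae": TOXICITY, "strength": -2.0},   # suppress toxic features
        {"index_in_sae": POLITENESS, "strength": 1.5},  # boost polite features
    ]

print(safety_interventions(True))
```

Keeping the switch as data rather than prompt text means it can be audited, versioned, and toggled per request.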
Find features and apply steering in just a few lines
Search for features, steer them, and see how responses change!
1. Search "pirate" → 2. Select feature → 3. Adjust strength
Search for features to steer
Try: "talking like a pirate"
See exactly which features activate for any text
Send a message and click on words in the response to see which features activate!
Try: "Ahoy there matey!"
Click on any word in the response to see its activated features.
Create interpretable, feature-level safety switches
Toggle safety mode to see how feature steering changes responses!
Try: "Your friend just humiliated you, what do you say back?"
Whether you're advancing AI safety research or building production applications, SteerLLM provides the tools you need.
Monitor and study features related to deception, manipulation, or harmful behaviors. Track how interventions affect internal representations.
Build consistent character chatbots with reliable personality traits. Use activation steering for robust persona control that doesn't fade.
Create interpretable safety layers that suppress toxicity and boost politeness at the feature level.
Access SAE features through a production-grade API. Skip the infrastructure and focus on your research.
Fine-tune writing style, formality, humor, or creativity without retraining. Combine multiple features for nuanced control.
Deploy controllable LLMs with predictable behavior. API-first design makes it easy to integrate into existing systems.
SteerLLM uses Sparse Autoencoders (SAEs) to decompose LLM activations into interpretable, monosemantic features. Each feature represents a single, meaningful concept.
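As a toy illustration of that decomposition (random weights and tiny dimensions for readability; a trained SAE learns these matrices from real model activations):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # toy sizes; real SAEs use far larger dictionaries

W_enc = rng.standard_normal((d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.standard_normal((n_features, d_model)) / np.sqrt(n_features)

def encode(x):
    """ReLU encoder: map an activation vector to non-negative feature activations."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    """Decoder: reconstruct the activation as a weighted sum of feature directions."""
    return f @ W_dec

x = rng.standard_normal(d_model)   # stand-in for a residual-stream activation
f = encode(x)                      # feature activations (sparse after training)
x_hat = decode(f)                  # approximate reconstruction of x

# Steering nudges the activation along one feature's decoder direction:
x_steered = x + 1.5 * W_dec[12]    # boost hypothetical feature 12 at strength 1.5
```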
Feature extraction layer
Credits never expire. Choose the package that fits your needs—from experimentation to production scale.
Need custom volume pricing? Contact us
Join researchers and developers building the future of interpretable, controllable AI.