SteerLLM gives you the API to inspect, understand, and steer LLM behavior at the feature level. Go beyond prompting—edit the model's internal representations directly.
# Search for a feature
POST /v1/features/search
{ "query": "pirate speech" }
# Apply steering to chat
POST /v1/chat/completions
{ "interventions": [{ "index_in_sae": 12345, "strength": 1.5 }] }
# → "Yarrr, let me spin ye a tale..."

Prompt engineering can suggest behavior, but it's non-deterministic, brittle, and fails over long conversations. It doesn't actually change the model's internal reasoning.
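The two calls above translate directly into Python. A minimal sketch of the request bodies, using only the standard library (the chat `messages` shape and the variable names are assumptions for illustration, not part of the documented API):

```python
import json

def search_body(query):
    """Request body for POST /v1/features/search."""
    return {"query": query}

def steer_body(messages, index_in_sae, strength):
    """Request body for POST /v1/chat/completions with one intervention."""
    return {
        "messages": messages,
        "interventions": [{"index_in_sae": index_in_sae, "strength": strength}],
    }

# Step 1: search "pirate speech" -- suppose the response names feature 12345.
search = search_body("pirate speech")

# Step 2: steer the completion by boosting that feature.
chat = steer_body(
    [{"role": "user", "content": "Tell me about your day."}],
    index_in_sae=12345,
    strength=1.5,
)
print(json.dumps(chat, indent=2))
```

Sending these bodies with any HTTP client (e.g. `requests.post(url, json=chat)`) completes the flow.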
Same prompt, different results. No reliability guarantees.
Breaks easily under edge cases or adversarial inputs.
Effectiveness fades in long conversations.
Stable, interpretable behavior control by directly modifying the model's internal feature representations.
Everything you need to understand and control LLM behavior at the feature level.
Find SAE features that correspond to concepts, behaviors, or attributes. Query in natural language—get back interpretable feature IDs.
POST /v1/features/search
{ "query": "pirate speech" }
// → { "label": "pirate-like speech", ... }

Understand which features activate when processing any text. Monitor activations in real time for alignment research.
POST /v1/chat_attribution/inspect
{ "messages": "Ahoy matey!" }
// → pirate_speech: 0.92, greeting: 0.88

Modify outputs by boosting or suppressing features. More reliable than prompting, especially in long contexts.
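Given an attribution response like the one above, you usually want the strongest features first. A small helper (the flat feature-to-score mapping is inferred from the snippet, not a documented schema):

```python
def top_features(activations, k=3):
    """Return the k highest-activating (feature, score) pairs, strongest first."""
    return sorted(activations.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Example activations in the shape shown above.
acts = {"pirate_speech": 0.92, "greeting": 0.88, "formality": 0.11}
print(top_features(acts, k=2))
# → [('pirate_speech', 0.92), ('greeting', 0.88)]
```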
POST /v1/chat/completions
{ "interventions": [{"strength": 1.5}] }
// → "Yarrr, let me tell ye..."

Build feature-level safety switches. Use contrastive search to identify and control toxic vs. polite behaviors.
POST /v1/chat_attribution/contrast
{ "dataset_1": "toxic", "dataset_2": "polite" }
// → toxicity: -2.0, politeness: +1.5

Integrate SteerLLM in minutes with our simple REST API.
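The strengths from a contrast run like the one above can back a simple toggle. A sketch of a safety switch (the SAE indices here are hypothetical placeholders you would take from your own contrast results):

```python
# Hypothetical SAE feature indices identified by a contrast run.
TOXICITY, POLITENESS = 777, 888

def safety_interventions(enabled):
    """Interventions list for POST /v1/chat/completions acting as a safety switch."""
    if not enabled:
        return []  # safety mode off: no steering applied
    return [
        {"index_in_sae": TOXICITY, "strength": -2.0},   # suppress toxic features
        {"index_in_sae": POLITENESS, "strength": 1.5},  # boost polite features
    ]

print(safety_interventions(True))
```

Keeping the switch as data rather than prompt text means it can be audited, versioned, and toggled per request.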
Find features and apply steering in just a few lines
Search for features, steer them, and see how responses change!
1. Search "pirate" → 2. Select feature → 3. Adjust strength
Search for features to steer
Try: "talking like a pirate"
See exactly which features activate for any text
Send a message and click on words in the response to see which features activate!
Try: "Ahoy there matey!"
Click on any word in the response to see its activated features.
Create interpretable, feature-level safety switches
Toggle safety mode to see how feature steering changes responses!
Try: "Your friend just humiliated you, what do you say back?"
Whether you're advancing AI safety research or building production applications, SteerLLM provides the tools you need.
Monitor and study features related to deception, manipulation, or harmful behaviors. Track how interventions affect internal representations.
Build consistent character chatbots with reliable personality traits. Use activation steering for robust persona control that doesn't fade.
Create interpretable safety layers that suppress toxicity and boost politeness at the feature level.
Access SAE features through a production-grade API. Skip the infrastructure and focus on your research.
Fine-tune writing style, formality, humor, or creativity without retraining. Combine multiple features for nuanced control.
Deploy controllable LLMs with predictable behavior. API-first design makes it easy to integrate into existing systems.
SteerLLM uses Sparse Autoencoders (SAEs) to decompose LLM activations into interpretable, monosemantic features. Each feature represents a single, meaningful concept.
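As a toy illustration of that decomposition (random weights and tiny dimensions for readability; a trained SAE learns these matrices from real model activations):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # toy sizes; real SAEs use far larger dictionaries

W_enc = rng.standard_normal((d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.standard_normal((n_features, d_model)) / np.sqrt(n_features)

def encode(x):
    """ReLU encoder: map an activation vector to non-negative feature activations."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    """Decoder: reconstruct the activation as a weighted sum of feature directions."""
    return f @ W_dec

x = rng.standard_normal(d_model)   # stand-in for a residual-stream activation
f = encode(x)                      # feature activations (sparse after training)
x_hat = decode(f)                  # approximate reconstruction of x

# Steering nudges the activation along one feature's decoder direction:
x_steered = x + 1.5 * W_dec[12]    # boost hypothetical feature 12 at strength 1.5
```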
Feature extraction layer
Credits never expire. Choose the package that fits your needs—from experimentation to production scale.
Need custom volume pricing? Contact us
Join researchers and developers building the future of interpretable, controllable AI.