Virtual key management
Verified: Code examples on this page have been automatically tested and verified.
Issue API keys to users or applications and control token usage (also known as virtual keys).
About
Virtual key management allows you to issue API keys to users or applications, each with independent tracking and cost controls. Agentgateway achieves this by composing existing capabilities:
- API key authentication: Identify incoming requests by API key
- Token-based rate limiting: Enforce token budgets
- Observability metrics: Track per-key spending and usage
How virtual keys work
flowchart TD
A[Request arrives with API key] --> B[Validate API key]
B --> C{Key valid?}
C -->|Yes| D[Check token budget]
D --> E{Budget available?}
E -->|Yes| F[Forward to LLM]
F --> G[Track token usage]
G --> H[Deduct from budget]
E -->|No| I[Reject with 429]
C -->|No| J[Reject with 401]
subgraph refill["Budget refills periodically"]
H
end
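The decision flow above can be sketched as a small shell function. This is only an illustration of the order of the checks: the function name `gateway_decision` and its yes/no arguments are hypothetical and are not part of agentgateway.

```shell
# Hypothetical model of the flowchart: the key is checked first, then the budget.
# $1 = "yes" if the API key is valid, $2 = "yes" if token budget is available.
gateway_decision() {
  local key_valid="$1" budget_available="$2"
  if [ "$key_valid" != "yes" ]; then
    echo 401        # invalid or missing key: reject before looking at the budget
  elif [ "$budget_available" != "yes" ]; then
    echo 429        # valid key, but the budget is exhausted
  else
    echo forward    # request goes to the LLM; usage is deducted afterwards
  fi
}

gateway_decision yes no
```

Because the key check comes first, a request with an invalid key gets a 401 even when the budget is also exhausted.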
Before you begin
Install the agentgateway binary.
Set up virtual keys
Step 1: Configure API key authentication
Create a configuration with API key authentication. This example creates two virtual keys for Alice and Bob.
cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config
llm:
  policies:
    apiKey:
      mode: strict
      keys:
      - key: sk-alice-abc123def456
        metadata:
          user: alice
      - key: sk-bob-xyz789uvw012
        metadata:
          user: bob
  models:
  - name: "*"
    provider: openAI
    params:
      apiKey: "$OPENAI_API_KEY"
EOF
| Setting | Description |
|---|---|
| apiKey.mode | Set to strict to require a valid API key for all requests. Use optional to allow unauthenticated requests. |
| apiKey.keys | List of API keys. Each key has a key value and optional metadata. |
| key | The API key value that users include in the Authorization: Bearer <key> header. |
| metadata | Optional metadata associated with the key, such as a user identifier or tier. |
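The key values in the example are hand-written placeholders. As a minimal sketch, assuming that any sufficiently random string is acceptable as a key value (this guide does not state any format requirements), a helper like the hypothetical `new_virtual_key` below could generate one per user. The `sk-<user>-` prefix only mirrors the example keys and is an assumed naming convention.

```shell
# Hypothetical helper: print a key such as sk-alice-<24 hex characters>.
# The prefix is an assumed convention copied from the example keys,
# not a format that agentgateway is known to require.
new_virtual_key() {
  local user="$1" suffix
  # Read 12 random bytes and hex-encode them (24 characters).
  suffix="$(head -c 12 /dev/urandom | od -An -tx1 | tr -d ' \n')"
  printf 'sk-%s-%s\n' "$user" "$suffix"
}

new_virtual_key alice
```

The printed value can then be pasted into the `key` field of the configuration.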
Step 2: Start agentgateway
agentgateway -f config.yaml
Step 3: Test the virtual keys
Send a request with Alice’s API key. Verify that the request succeeds.
curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-alice-abc123def456" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }' | jq .
Example successful response:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    }
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 9,
    "total_tokens": 19
  }
}
Send a request without a valid API key. Verify that the request is rejected with a 401 status.
curl -s -o /dev/null -w "%{http_code}" http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer invalid-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Expected output (the command prints only the HTTP status code):
401
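The two manual checks above can be wrapped in a small smoke test. The following is a sketch, not an official tool: the function names `status_for_key` and `check_key` and the `GATEWAY_URL` variable are hypothetical, and the URL and request body simply repeat the values used in this guide.

```shell
# Endpoint from the examples in this guide; override GATEWAY_URL if yours differs.
GATEWAY_URL="${GATEWAY_URL:-http://localhost:4000/v1/chat/completions}"

# Print only the HTTP status code returned for the given API key.
status_for_key() {
  curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL" \
    -H "Authorization: Bearer $1" \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello!"}]}'
}

# Compare the observed status with the expected one.
# $1 = API key, $2 = expected status. Returns non-zero on mismatch.
check_key() {
  local actual
  actual="$(status_for_key "$1")"
  if [ "$actual" = "$2" ]; then
    echo "PASS: got $actual"
  else
    echo "FAIL: got $actual, expected $2"
    return 1
  fi
}

# Example (requires a running gateway):
#   check_key sk-alice-abc123def456 200
#   check_key invalid-key 401
```

The helper prints only the status, not the key, so the output is safe to paste into logs or tickets.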
Configure token budgets
LLMs typically charge per input and output token. Without spending control, users can quickly generate large bills by submitting long prompts, streaming or retrying requests, or running recursive agent loops. To protect against unexpected bills, scaling surprises, and abuse, use token-based rate limits to cap the number of tokens that can be used.
How rate limiting works
Agentgateway checks token-based rate limits in two phases:
At request time:
- When tokenize: true is not set or is set to false on the AI backend, the number of tokens that are used for the request cannot be calculated. Because of this, the request is always allowed, unless the rate limit is set to 0 tokens. The LLM typically returns the number of tokens that were used for the request when sending the response. Agentgateway verifies the number of tokens that were used in the request and the response to determine whether the rate limit was reached. By default, tokenize is set to false.
- When tokenize: true is set, agentgateway estimates the number of tokens at request time. Because of that, the request is only allowed if the estimated number of tokens does not exceed the set rate limit.
At response time:
When the LLM returns a response, it typically provides the number of tokens that were used during the request and response. Agentgateway uses these numbers to determine if the rate limit was reached.
Note that this determination happens after the response is returned. Even if the number of tokens used in the response exceeds the number of allowed tokens, the response is still returned to the user. Only subsequent requests are rate limited.
If tokenize: true is set on the AI backend and tokens were estimated during the request, agentgateway verifies the actual number of tokens that were used when the LLM returns its response. If the initial estimate was off, agentgateway adjusts the count so that the actual usage is counted against the rate limit.
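The two phases can be modeled in a few lines of shell. This is an illustrative model of the behavior described in this section, not agentgateway's implementation: the names `remaining`, `admit`, and `settle` are hypothetical, and clamping the budget at zero is an assumption made to keep the model simple.

```shell
remaining=10   # tokens left in the budget (hypothetical starting value)

# Request-time check. $1 = estimated prompt tokens; leave empty when
# no estimate is available (tokenize is off).
admit() {
  local estimate="${1:-}"
  if [ -z "$estimate" ]; then
    # No estimate: allow unless the budget is already exhausted.
    [ "$remaining" -gt 0 ]
  else
    # Estimate available: allow only if it fits in what is left.
    [ "$estimate" -le "$remaining" ]
  fi
}

# Response-time accounting. $1 = total tokens reported by the LLM.
settle() {
  remaining=$(( remaining - $1 ))
  # Assumption: the model clamps at zero instead of tracking a deficit.
  if [ "$remaining" -lt 0 ]; then remaining=0; fi
}

admit && echo "first request: allowed"
settle 260
admit || echo "second request: rate limit exceeded"
```

In this model the first response is delivered even though it costs far more than the budget, and only the following request is rejected. Passing an estimate to `admit` shows the stricter request-time behavior.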
Step 1: Add a token budget
Update your configuration to include a localRateLimit policy. The following example builds on the virtual keys configuration from the previous section and adds a token budget.
cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config
llm:
  policies:
    apiKey:
      mode: strict
      keys:
      - key: sk-alice-abc123def456
        metadata:
          user: alice
      - key: sk-bob-xyz789uvw012
        metadata:
          user: bob
    localRateLimit:
    - maxTokens: 10
      tokensPerFill: 1
      fillInterval: 60s
      type: tokens
  models:
  - name: "*"
    provider: openAI
    params:
      apiKey: "$OPENAI_API_KEY"
EOF
| Setting | Description |
|---|---|
| localRateLimit | Applies a token-based rate limit to all incoming LLM requests. |
| maxTokens | The maximum number of tokens that are available to use. |
| tokensPerFill | The number of tokens that are added during a refill. |
| fillInterval | The number of seconds after which the token bucket is refilled. |
| type | The type of rate limiting to apply. Use tokens for token-based rate limiting, or requests for request-based rate limiting. |
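To get a feel for how these three settings interact, the following sketch computes how long an empty budget takes to fill up again. It assumes the straightforward reading of the table above: every fillInterval, tokensPerFill tokens are added, up to maxTokens. The function name `seconds_to_full` is hypothetical.

```shell
# Seconds until an empty budget is full again:
#   ceil(maxTokens / tokensPerFill) * fillInterval
# $1 = maxTokens, $2 = tokensPerFill, $3 = fillInterval in seconds.
seconds_to_full() {
  local max_tokens="$1" tokens_per_fill="$2" fill_interval_s="$3"
  # Integer ceiling division: number of refills needed.
  local fills=$(( (max_tokens + tokens_per_fill - 1) / tokens_per_fill ))
  echo $(( fills * fill_interval_s ))
}

seconds_to_full 10 1 60   # values from the example configuration
```

With the example values, the budget regains one token per minute and needs 600 seconds (10 minutes) to recover fully, which is why the example budget is quickly exhausted.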
Step 2: Verify rate limits
Start agentgateway with the updated configuration.
agentgateway -f config.yaml
Send a prompt to the LLM. At the time the prompt is sent, the number of tokens required for the completion is unknown. Because tokenize: true is not set on the model, the prompt token count is not estimated. As a result, the prompt is allowed. The LLM typically returns the number of tokens required for completion in its response. Agentgateway uses this number and counts it against the rate limit. Because the configuration uses apiKey mode strict, include one of the virtual keys in the request.
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-alice-abc123def456" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      { "role": "user", "content": "Tell me a short story" }
    ]
  }'
Example output:
{
  "choices": [
    {
      "message": {
        "content": "Once upon a time, in a small village nestled between towering mountains...",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 248,
    "total_tokens": 260
  }
}
Repeat the same request. This time, the request is rate limited because the tokens used in the first request exceeded the budget.
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-alice-abc123def456" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      { "role": "user", "content": "Tell me a short story" }
    ]
  }'
Example output:
rate limit exceeded
Step 3: Enable request-time token estimation
By default, agentgateway does not estimate token counts at request time. To reject requests before they reach the LLM, set tokenize: true on your model.
cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config
llm:
  policies:
    apiKey:
      mode: strict
      keys:
      - key: sk-alice-abc123def456
        metadata:
          user: alice
      - key: sk-bob-xyz789uvw012
        metadata:
          user: bob
    localRateLimit:
    - maxTokens: 10
      tokensPerFill: 1
      fillInterval: 60s
      type: tokens
  models:
  - name: "*"
    provider: openAI
    params:
      apiKey: "$OPENAI_API_KEY"
      tokenize: true
EOF
With this setting, requests are denied immediately if the estimated prompt token count exceeds the available budget.
Add a global token budget
localRateLimit is a gateway-wide limit, not a per-key limit. It enforces a single shared token budget across all requests and API keys.
To add a token budget that limits total token usage across all requests while using more advanced routing options, use the routing-based configuration format with localRateLimit.
The following example uses the binds/listeners/routes configuration format because localRateLimit is an HTTP-level policy. For more information, see the Routing-based configuration guide.
cat <<'EOF' > config.yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config
binds:
- port: 4000
  listeners:
  - routes:
    - backends:
      - ai:
          name: openai
          provider:
            openAI:
              model: gpt-3.5-turbo
      policies:
        apiKey:
          mode: strict
          keys:
          - key: sk-alice-abc123def456
            metadata:
              user: alice
          - key: sk-bob-xyz789uvw012
            metadata:
              user: bob
        backendAuth:
          key: "$OPENAI_API_KEY"
        localRateLimit:
        - maxTokens: 100000
          tokensPerFill: 100000
          fillInterval: 86400s
          type: tokens
EOF
| Setting | Description |
|---|---|
| backendAuth | The API key used to authenticate with the LLM provider backend. For configuration options, see Manage API keys. |
| localRateLimit | Token-based rate limiting applied globally to all requests through this route, regardless of which API key is used. |
| maxTokens | The maximum number of tokens available in the shared budget. |
| tokensPerFill | The number of tokens added during each refill. |
| fillInterval | The interval between refills. Use 86400s for a daily budget. |
| type | Set to tokens for token-based limits. Use requests for request-based limits. |
For more information about rate limiting configuration options, see Rate limits.
Monitor per-key spending
Track token usage and spending for each virtual key using Prometheus metrics exposed by agentgateway.
Access the agentgateway metrics endpoint.
curl http://localhost:15000/metrics
Query token usage metrics. Each token type is summed separately before the two are added, because series with different gen_ai_token_type label values do not match each other in a direct vector addition.
# Total tokens consumed over the last 24 hours
sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h]))
+
sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h]))
Calculate costs by multiplying token counts by your provider’s pricing. For example, with OpenAI GPT-3.5:
# Estimated cost over the last 24 hours (assuming $0.50 per 1M input tokens, $1.50 per 1M output tokens)
sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h])) / 1000000 * 0.50
+
sum(increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h])) / 1000000 * 1.50
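The same cost arithmetic can be checked by hand outside of Prometheus. The sketch below is a hypothetical helper (`estimate_cost_usd` is not part of agentgateway) that takes token counts and per-million-token prices as arguments. The prices are the illustrative figures assumed above, so substitute your provider's current rates.

```shell
# Estimate cost in USD from token counts and per-1M-token prices.
# $1 = input tokens, $2 = output tokens,
# $3 = price per 1M input tokens, $4 = price per 1M output tokens.
estimate_cost_usd() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_price="$3" -v out_price="$4" \
    'BEGIN { printf "%.6f\n", in_tok / 1000000 * in_price + out_tok / 1000000 * out_price }'
}

# 1M input tokens and 500k output tokens at the assumed prices.
estimate_cost_usd 1000000 500000 0.50 1.50   # prints 1.250000
```

Output tokens are priced three times higher than input tokens in this example, so long completions dominate the bill even when prompts are large.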
What’s next
- Manage API keys for detailed authentication configuration
- Rate limits for advanced rate limiting configuration
- Set up observability to view token usage metrics and logs