AWS Budgets Has an 8-Hour Delay. Your Bedrock Bill Doesn't.

Published: May 30, 2026 · 6 min read

You set a $100 AWS Budget alert for your Bedrock usage. You feel safe. Then you wake up to a $2,300 bill.

This is not a hypothetical. It's the failure mode that AWS Budgets was designed to handle — and fundamentally cannot.

How AWS Budgets Actually Works

AWS Budgets monitors your account spend and sends alerts when you cross a threshold. Sounds exactly right for controlling Bedrock costs. Here's the catch buried in the AWS documentation:

"Budget alerts are typically delivered within 24 hours of your spending crossing the threshold."

In practice, the lag is 8–24 hours. AWS aggregates billing data in batch jobs. The notification pipeline has no concept of "real time."

This means the sequence of events is:

1. Your code enters a runaway loop at 11:47 PM
2. You go to sleep
3. Claude Sonnet serves about 30,000 requests at ~$0.09 each
4. You accumulate roughly $2,700 in spend over 3 hours
5. AWS sends you an email at 9 AM
6. The damage is done

AWS Budgets is a reporting tool wearing the costume of an enforcement tool. It doesn't stop anything. It tells you what already happened.

Why Bedrock Is Especially Dangerous

Most compute services have natural rate limits: you can only spin up so many EC2 instances before you hit account limits. Bedrock has no such ceiling on spend.

Claude 3.7 Sonnet costs $3.00 per million input tokens and $15.00 per million output tokens. A single request with 10,000 tokens in and 4,000 out costs about $0.09 ($0.03 for input plus $0.06 for output). That seems small.

Now consider:

A retry loop with no backoff: 1 request every 200ms
300 requests/minute × $0.09 = $27/minute
One hour of this: $1,620
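The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the prices are the per-million-token rates quoted in this post, and the function names are made up for illustration.

```python
# Illustrative only: prices are the per-million-token rates quoted above.
INPUT_PRICE_PER_M = 3.00    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 15.00  # USD per 1M output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the rates above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def hourly_burn(cost_per_request: float, interval_seconds: float) -> float:
    """Dollars per hour if one request is sent every `interval_seconds`."""
    requests_per_hour = 3600 / interval_seconds
    return cost_per_request * requests_per_hour


cost = request_cost(10_000, 4_000)
print(f"per request: ${cost:.2f}")
print(f"per hour at one request every 200 ms: ${hourly_burn(cost, 0.2):,.2f}")
```

Changing the interval or the token counts shows how quickly the hourly figure moves: the loop rate matters as much as the per-request price.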

This happens. A missing await, an exception handler that retries in a loop, a batch job that gets misconfigured — these are real bugs that hit developers every week.

And if you're using claude-opus-4-7 at $75/M output tokens, multiply those numbers by 5.

The Three Failures of the Standard Advice

When developers ask "how do I control Bedrock costs," the standard advice is:

1. Set AWS Budgets alerts. Already covered: these are retroactive, not preventive.

2. Set service quotas. Amazon Bedrock service quotas are measured in requests per minute and tokens per minute, not dollars. You can cap throughput, but not spend. A slow leak at 5 RPM with a large context can still cost hundreds per day.

3. Monitor CloudWatch metrics and set alarms. Bedrock's CloudWatch metrics arrive with a delay too. You can alarm on invocation count, but mapping invocations to dollars requires per-request token counts, and CloudWatch exposes token usage only as delayed, aggregated metrics rather than per request in real time.

None of these intercept a request before it hits Bedrock. They all tell you something happened. None of them stop it from happening.

What Actual Enforcement Looks Like

The only way to guarantee a dollar cap is to intercept at the request level.

When your code calls Bedrock, the request has to pass through something that knows:

1. How much you've spent so far (in real time)
2. How much this request will approximately cost (token estimate)
3. Whether allowing this request would exceed your cap

If the answer to (3) is yes, the request never reaches Bedrock. You get a 429. The token is never consumed. The money is never spent.
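The three questions above reduce to one comparison. The following is a minimal in-memory sketch of that decision, not LLMCap's actual implementation; the `BudgetGate` class and its method names are hypothetical, and a real proxy would keep the running total in a shared store.

```python
from dataclasses import dataclass


@dataclass
class BudgetGate:
    """Hypothetical in-memory gate; a real proxy would use a shared store."""
    limit_usd: float
    spent_usd: float = 0.0

    def check(self, estimated_cost_usd: float) -> int:
        """Return an HTTP-style status: 200 to forward, 429 to block."""
        if self.spent_usd + estimated_cost_usd > self.limit_usd:
            return 429  # blocked before any upstream call is made
        return 200

    def record(self, actual_cost_usd: float) -> None:
        """Add the real cost once the upstream response reports token usage."""
        self.spent_usd += actual_cost_usd


gate = BudgetGate(limit_usd=1.00)
statuses = []
for _ in range(15):
    status = gate.check(estimated_cost_usd=0.09)
    statuses.append(status)
    if status == 200:
        gate.record(actual_cost_usd=0.09)  # pretend the estimate was exact

print(statuses.count(200), "forwarded,", statuses.count(429), "blocked")
```

The check runs before the request is forwarded, so a blocked call costs nothing. Because the estimate is made up front, the recorded total can still drift from the estimate once real token counts come back.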

This is the difference between a smoke detector and a sprinkler system. AWS Budgets is a smoke detector. You need a sprinkler.

How We Built This for Bedrock

At LLMCap, we built a transparent proxy that sits between your code and AWS Bedrock Runtime. Your code swaps the boto3 client call for an HTTP request to the proxy:

# Before
import boto3
bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

# After — via LLMCap proxy
import httpx

# Pass your AWS credentials in headers, LLMCap signs the request
response = httpx.post(
    "https://proxy.llmcap.io/bedrock/us-east-1/model/anthropic.claude-sonnet-4-6/invoke",
    headers={
        "X-LLMCap-Key": "tg_live_...",
        "X-AWS-Access-Key-ID": "AKIA...",
        "X-AWS-Secret-Access-Key": "...",
    },
    json={"...": "..."}
)

Every request goes through three checks before reaching AWS:

1. Token estimation (<10ms). We estimate input tokens from the request body. For Anthropic Claude on Bedrock, we use the Anthropic token counting endpoint. For other models, we use character-count heuristics. This gives us a pre-request cost estimate.

2. Budget check (<5ms). We query Redis for your current spend in the active window (daily/weekly/monthly). If current_spend + estimated_cost > limit, we return 429 immediately. The request never leaves our servers.

3. SigV4 signing. Your AWS credentials pass through per-request in headers and are discarded after signing. We never store them. LLMCap holds only your token counts and costs.

Total added latency: <35ms in the median case.
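To make steps 1 and 2 more concrete, here is a rough sketch of a character-count estimate and a per-window counter key. Both are assumptions for illustration: the 4-characters-per-token ratio is a common rule of thumb for English text, and the `spend:...` key layout is invented here, not LLMCap's actual Redis schema.

```python
from datetime import datetime, timezone

CHARS_PER_TOKEN = 4  # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Character-count heuristic; rounds up so estimates err high."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def window_key(api_key: str, window: str, now: datetime) -> str:
    """Hypothetical spend-counter key for the active window (UTC)."""
    if window == "daily":
        suffix = now.strftime("%Y-%m-%d")
    elif window == "weekly":
        year, week, _ = now.isocalendar()
        suffix = f"{year}-W{week:02d}"
    elif window == "monthly":
        suffix = now.strftime("%Y-%m")
    else:
        raise ValueError(f"unknown window: {window}")
    return f"spend:{api_key}:{window}:{suffix}"


now = datetime(2026, 5, 30, tzinfo=timezone.utc)
print(estimate_tokens("Summarize this document."))
print(window_key("demo-key", "daily", now))
```

Encoding the window in the key means a new day, week, or month starts from a fresh counter without any reset job. Rounding the estimate up is a deliberate bias: overestimating blocks slightly early, while underestimating lets spend slip past the cap.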

What Happens Mid-Stream

Bedrock supports streaming responses via the binary event stream format. A streaming request can run for seconds and generate thousands of output tokens — more cost exposure than a single-turn request.

LLMCap handles streaming by checking budget periodically as chunks arrive. If spending crosses the cap mid-stream, we close the connection and send a final error event. Your code sees a disconnection (or a budget_exceeded error, depending on how you handle it). Because the request runs on AWS, it is Bedrock, not Anthropic's own API, that stops generating once the upstream connection closes.

The tokens already generated are charged. Everything after the connection close is not.
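A periodic mid-stream check might look roughly like the sketch below. It is illustrative only: the `(text, output_tokens)` chunk shape, the `capped_stream` name, and the check interval are assumptions, and a real proxy would parse Bedrock's binary event stream rather than a Python iterable.

```python
from typing import Iterable, Iterator

OUTPUT_PRICE_PER_M = 15.00  # USD per 1M output tokens, as quoted earlier


class BudgetExceeded(Exception):
    """Raised when the cap is crossed while a stream is in flight."""


def capped_stream(chunks: Iterable[tuple[str, int]],
                  spent_usd: float,
                  limit_usd: float,
                  check_every: int = 5) -> Iterator[str]:
    """Relay (text, output_tokens) chunks, checking the cap every few chunks."""
    for i, (text, tokens) in enumerate(chunks, start=1):
        spent_usd += tokens * OUTPUT_PRICE_PER_M / 1_000_000
        yield text
        if i % check_every == 0 and spent_usd > limit_usd:
            # A real proxy would close the upstream connection here.
            raise BudgetExceeded(f"cap of ${limit_usd:.2f} crossed mid-stream")


received = []
try:
    for piece in capped_stream([("x", 1000)] * 20, spent_usd=0.95, limit_usd=1.00):
        received.append(piece)
except BudgetExceeded as exc:
    print(f"stopped after {len(received)} chunks: {exc}")
```

Checking every few chunks instead of every chunk trades a small overshoot for less overhead, which is why the tokens generated before the cutoff still show up on the bill.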

The Developer Experience

Setting a $50/day cap on Bedrock looks like this in the LLMCap dashboard:

Provider: Bedrock
Window: Daily
Limit: $50.00
Action: Block

Once that rule is set, no Bedrock request through your proxy key can push you past $50 for the day. The cap is hard. Not a notification — a wall.

You can set separate caps per API key, per provider, per model, per time window. A staging key can have a $5/day limit while your production key has a $200/day limit. They're isolated.
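One way to picture isolated caps is as a list of rules matched per request. The sketch below is hypothetical: `CapRule`, its fields, and the key names are invented for illustration and do not describe LLMCap's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CapRule:
    """Hypothetical cap rule scoped to one API key and provider."""
    api_key: str
    provider: str
    window: str
    limit_usd: float
    model: Optional[str] = None  # None means "any model"


RULES = [
    CapRule("staging-key", "bedrock", "daily", 5.00),
    CapRule("prod-key", "bedrock", "daily", 200.00),
]


def rules_for(api_key: str, provider: str, model: str) -> list[CapRule]:
    """Return every rule that applies to this request."""
    return [r for r in RULES
            if r.api_key == api_key
            and r.provider == provider
            and r.model in (None, model)]


for rule in rules_for("staging-key", "bedrock", "some-model-id"):
    print(rule.window, rule.limit_usd)
```

Because each rule is keyed by API key, spend recorded against the staging key never counts toward the production limit.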

The Honest Tradeoffs

LLMCap adds a network hop. That's ~35ms of latency in the happy path. For interactive applications, this is usually invisible. For batch workloads that run millions of requests, it's worth measuring.

LLMCap also requires your AWS credentials to pass through the proxy on each request. We use them to sign the request and then discard them. We never store credentials, and this is auditable in our code. But you're trusting a third party with access to your credentials for the duration of each call. Some security policies won't allow this.

For those cases, self-hosted deployment is on our roadmap.

Summary

AWS Budgets is not a spending cap. It's a bill notification with an 8 to 24 hour delay. For Bedrock workloads where a runaway loop can cost thousands per hour, that's not protection. It's a post-mortem.

Real enforcement requires interception at the request level, before the token is consumed.

That's what we built. If you're running Bedrock in production and you don't have a hard cap in place, try LLMCap free for 3 days.

LLMCap supports Anthropic, OpenAI, Google Gemini, Mistral, Cohere, and AWS Bedrock. Setup takes under 15 minutes.
