<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[llmcap]]></title><description><![CDATA[llmcap]]></description><link>https://blog.llmcap.io</link><generator>RSS for Node</generator><lastBuildDate>Sat, 30 May 2026 23:25:41 GMT</lastBuildDate><atom:link href="https://blog.llmcap.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[AWS Budgets Has an 8-Hour Delay. Your Bedrock Bill Doesn't.]]></title><description><![CDATA[Published: May 30, 2026 · 6 min read

You set a \(100 AWS Budget alert for your Bedrock usage. You feel safe. Then you wake up to a \)2,300 bill.
This is not a hypothetical. It's the failure mode that]]></description><link>https://blog.llmcap.io/aws-budgets-has-an-8-hour-delay-your-bedrock-bill-doesn-t</link><guid isPermaLink="true">https://blog.llmcap.io/aws-budgets-has-an-8-hour-delay-your-bedrock-bill-doesn-t</guid><category><![CDATA[AWS Bedrock]]></category><category><![CDATA[AWS]]></category><category><![CDATA[llm]]></category><category><![CDATA[Devops]]></category><category><![CDATA[cloud cost optimization  ]]></category><category><![CDATA[Cloud Cost Management]]></category><dc:creator><![CDATA[Faruk Celikkanat]]></dc:creator><pubDate>Sat, 30 May 2026 21:39:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6a1b197a3e923106b7ff7433/221bf8e2-c1b9-419a-a27a-0e914def3368.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Published: May 30, 2026 · 6 min read</em></p>
<hr />
<p>You set a \(100 AWS Budget alert for your Bedrock usage. You feel safe. Then you wake up to a \)2,300 bill.</p>
<p>This is not a hypothetical. It's the failure mode that AWS Budgets was designed to handle — and fundamentally cannot.</p>
<hr />
<h2>How AWS Budgets Actually Works</h2>
<p>AWS Budgets monitors your account spend and sends alerts when you cross a threshold. Sounds exactly right for controlling Bedrock costs. Here's the catch buried in the AWS documentation:</p>
<blockquote>
<p><em>"Budget alerts are typically delivered within 24 hours of your spending crossing the threshold."</em></p>
</blockquote>
<p>In practice, the lag is 8–24 hours. AWS aggregates billing data in batch jobs. The notification pipeline has no concept of "real time."</p>
<p>This means the sequence of events is:</p>
<ol>
<li><p>Your code enters a runaway loop at 11:47 PM</p>
</li>
<li><p>Claude Sonnet runs 900 iterations at ~$0.003/request</p>
</li>
<li><p>You accumulate $2,700 in spend over 3 hours</p>
</li>
<li><p>You go to sleep</p>
</li>
<li><p>AWS sends you an email at 9 AM</p>
</li>
<li><p>The damage is done</p>
</li>
</ol>
<p>AWS Budgets is a <em>reporting</em> tool wearing the costume of an <em>enforcement</em> tool. It doesn't stop anything. It tells you what already happened.</p>
<hr />
<h2>Why Bedrock Is Especially Dangerous</h2>
<p>Most compute services have natural rate limits — you can only spin up so many EC2 instances before you hit account limits. Bedrock has no such floor.</p>
<p>Claude Sonnet 3.7 costs \(3.00 per million input tokens and \)15.00 per million output tokens. A single request with 10,000 tokens in and 4,000 out costs about $0.09. That seems small.</p>
<p>Now consider:</p>
<ul>
<li><p>A retry loop with no backoff: 1 request every 200ms</p>
</li>
<li><p>300 requests/minute × \(0.09 = <strong>\)27/minute</strong></p>
</li>
<li><p>One hour of this: <strong>$1,620</strong></p>
</li>
</ul>
<p>This happens. A missing <code>await</code>, an exception handler that retries in a loop, a batch job that gets misconfigured — these are real bugs that hit developers every week.</p>
<p>And if you're using <code>claude-opus-4-7</code> at $75/M output tokens, multiply those numbers by 5.</p>
<hr />
<h2>The Three Failures of the Standard Advice</h2>
<p>When developers ask "how do I control Bedrock costs," the standard advice is:</p>
<p><strong>1. Set AWS Budgets alerts</strong> Already covered — these are retroactive, not preventive.</p>
<p><strong>2. Set service quotas</strong> AWS Bedrock service quotas are measured in requests-per-minute, not dollars. You can cap throughput, but not spend. A slow leak at 5 RPM with a large context can still cost hundreds per day.</p>
<p><strong>3. Monitor CloudWatch metrics and set alarms</strong> CloudWatch's <code>InvokeModel</code> metrics have a delay too. You can alarm on invocation count, but mapping invocations to dollars requires knowing the exact token counts per request — which CloudWatch doesn't report in real time.</p>
<p>None of these intercept a request <em>before</em> it hits Bedrock. They all tell you something happened. None of them stop it from happening.</p>
<hr />
<h2>What Actual Enforcement Looks Like</h2>
<p>The only way to guarantee a dollar cap is to intercept at the request level.</p>
<p>When your code calls Bedrock, the request has to pass through something that knows:</p>
<ol>
<li><p>How much you've spent so far (in real time)</p>
</li>
<li><p>How much this request will approximately cost (token estimate)</p>
</li>
<li><p>Whether allowing this request would exceed your cap</p>
</li>
</ol>
<p>If the answer to (3) is yes, the request never reaches Bedrock. You get a 429. The token is never consumed. The money is never spent.</p>
<p>This is the difference between a smoke detector and a sprinkler system. AWS Budgets is a smoke detector. You need a sprinkler.</p>
<hr />
<h2>How We Built This for Bedrock</h2>
<p>At LLMCap, we built a transparent proxy that sits between your code and AWS Bedrock Runtime. Your code changes exactly one line:</p>
<pre><code class="language-python"># Before
import boto3
bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

# After — via LLMCap proxy
import httpx

# Pass your AWS credentials in headers, LLMCap signs the request
response = httpx.post(
    "https://proxy.llmcap.io/bedrock/us-east-1/model/anthropic.claude-sonnet-4-6/invoke",
    headers={
        "X-LLMCap-Key": "tg_live_...",
        "X-AWS-Access-Key-ID": "AKIA...",
        "X-AWS-Secret-Access-Key": "...",
    },
    json={"...": "..."}
)
</code></pre>
<p>Every request goes through three checks before reaching AWS:</p>
<p><strong>1. Token estimation</strong> (&lt;10ms) We estimate input tokens from the request body. For Anthropic Claude on Bedrock, we use the Anthropic token counting endpoint. For other models, we use character-count heuristics. This gives us a pre-request cost estimate.</p>
<p><strong>2. Budget check</strong> (&lt;5ms) We query Redis for your current spend in the active window (daily/weekly/monthly). If <code>current_spend + estimated_cost &gt; limit</code>, we return 429 immediately. The request never leaves our servers.</p>
<p><strong>3. SigV4 signing</strong> Your AWS credentials pass through per-request in headers and are discarded after signing. We never store them. LLMCap holds only your token counts and costs.</p>
<p>Total added latency: <strong>&lt;35ms</strong> in the median case.</p>
<hr />
<h2>What Happens Mid-Stream</h2>
<p>Bedrock supports streaming responses via the binary event stream format. A streaming request can run for seconds and generate thousands of output tokens — more cost exposure than a single-turn request.</p>
<p>LLMCap handles streaming by checking budget periodically as chunks arrive. If spending crosses the cap mid-stream, we close the connection and send a final error event. Your code sees a disconnection (or a <code>budget_exceeded</code> error, depending on how you handle it). Anthropic's servers close the generation.</p>
<p>The tokens already generated are charged. Everything after the connection close is not.</p>
<hr />
<h2>The Developer Experience</h2>
<p>Setting a $50/day cap on Bedrock looks like this in the LLMCap dashboard:</p>
<ul>
<li><p>Provider: Bedrock</p>
</li>
<li><p>Window: Daily</p>
</li>
<li><p>Limit: $50.00</p>
</li>
<li><p>Action: Block</p>
</li>
</ul>
<p>Once that rule is set, no Bedrock request through your proxy key can push you past $50 for the day. The cap is hard. Not a notification — a wall.</p>
<p>You can set separate caps per API key, per provider, per model, per time window. A staging key can have a \(5/day limit while your production key has a \)200/day limit. They're isolated.</p>
<hr />
<h2>The Honest Tradeoffs</h2>
<p>LLMCap adds a network hop. That's ~35ms of latency in the happy path. For interactive applications, this is usually invisible. For batch workloads that run millions of requests, it's worth measuring.</p>
<p>LLMCap also requires your AWS credentials to pass through the proxy on each request. We sign them and discard them — we never store credentials, and this is auditable in our code. But you're trusting a third party with temporary credential access on each call. Some security policies won't allow this.</p>
<p>For those cases, self-hosted deployment is on our roadmap.</p>
<hr />
<h2>Summary</h2>
<p>AWS Budgets is not a spending cap. It's a bill notification with a 24-hour delay. For Bedrock workloads where a runaway loop can cost thousands per hour, that's not protection — it's a post-mortem.</p>
<p>Real enforcement requires interception at the request level, before the token is consumed.</p>
<p>That's what we built. If you're running Bedrock in production and you don't have a hard cap in place, <a href="https://llmcap.io">try LLMCap free for 3 days</a>.</p>
<hr />
<p><em>LLMCap supports Anthropic, OpenAI, Google Gemini, Mistral, Cohere, and AWS Bedrock. Setup takes under 15 minutes.</em></p>
]]></content:encoded></item></channel></rss>