
The cost shape of an AI content pipeline

Where tokens actually go in a Branchpost generation run, why retries dominate the bill more than people expect, and how prompt caching changes the math.

Beka A.
Founder

A founder asked me last month what it costs Branchpost to generate one blog post. I gave the honest answer: between $0.04 and $0.71, depending almost entirely on whether the cache hits and how many times the draft retries. That 18x spread is the whole story of running an AI content pipeline in production. If your cost model treats generation as a flat per-post number, you are going to be wrong about your margins by an order of magnitude.

This post is a breakdown of where the tokens actually go in a single Branchpost run, what we cache, what we retry, and how those three variables compose into a bill.

The anatomy of one generation

A Branchpost run that ends in an opened pull request is not one model call. It's a small pipeline. Here's the rough shape of a single draft, with token counts from a representative run against Claude Sonnet:

1. Repo context assembly        (no model)        0 in / 0 out
2. Topic + context prompt       cached input      ~8,400 in
3. Outline generation           non-cached input  ~1,200 in / ~600 out
4. Draft generation             non-cached input  ~2,100 in / ~2,800 out
5. Self-check pass              non-cached input  ~3,400 in / ~400 out
6. Frontmatter validation       (no model)        0 in / 0 out
7. PR commit + open             (no model)        0 in / 0 out

The interesting number isn't the total. It's the split between the 8,400 cached input tokens and everything else. That cached block holds the publication context, the style guide, the internal link catalog, the claims-to-avoid list, and a handful of example posts. It barely changes between runs for the same repo. Anthropic's prompt caching prices cache reads at roughly 10% of the standard input rate, which means that block costs us about what 840 fresh input tokens would cost. The first run pays to write the cache, which Anthropic bills at a premium over the standard input rate. Every run that arrives within the cache's five-minute lifetime reads from it, and each read resets the clock.
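The 840-token equivalence is easy to check by hand. The sketch below works in fresh-input-token equivalents, not dollars, so it doesn't depend on any price sheet. The token counts come from the table above and the 0.10 read multiplier from the "roughly 10%" figure. The 1.25 write multiplier is an assumption based on Anthropic's published pricing for five-minute cache writes, so check the current pricing page before relying on it.

```python
# Input cost of one run, expressed in fresh-input-token equivalents.
# Token counts come from the table above; both multipliers are assumptions.
CACHED_TOKENS = 8_400
FRESH_INPUT_TOKENS = 1_200 + 2_100 + 3_400  # outline + draft + self-check

CACHE_READ_MULTIPLIER = 0.10   # "roughly 10%" of the standard input rate
CACHE_WRITE_MULTIPLIER = 1.25  # assumed premium for writing the cache


def input_equivalents(cache_is_warm: bool) -> float:
    """Return the run's input cost in fresh-input-token equivalents."""
    multiplier = CACHE_READ_MULTIPLIER if cache_is_warm else CACHE_WRITE_MULTIPLIER
    return CACHED_TOKENS * multiplier + FRESH_INPUT_TOKENS


warm = input_equivalents(cache_is_warm=True)
cold = input_equivalents(cache_is_warm=False)
print(f"warm run: {warm:,.0f} input-token equivalents")
print(f"cold run: {cold:,.0f} input-token equivalents")
```

Output tokens cost the same whether the cache is warm or cold, so this only models the input side of the bill.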

If you don't use prompt caching, or if your traffic pattern doesn't keep the cache warm, the publication context becomes the dominant line item. We learned this by accident: a quiet weekend with one generation every two hours had a per-post cost about 6x higher than a busy Tuesday with eight runs back to back. Same prompts, same model, same output. The cache was just cold most of the time on the weekend.
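To see why the traffic pattern matters, here is a minimal simulation of cache warmth. It assumes a five-minute lifetime that restarts on every run, hit or miss. The function name and the one-minute spacing of the Tuesday runs are hypothetical, chosen only to illustrate "back to back".

```python
from typing import Iterable

CACHE_TTL_SECONDS = 5 * 60  # five-minute cache lifetime


def count_cache_hits(run_times: Iterable[float], ttl: float = CACHE_TTL_SECONDS) -> int:
    """Count runs that arrive while the cache is still warm.

    run_times are seconds since an arbitrary start, in ascending order.
    Every run, hit or miss, leaves the cache warm for another `ttl` seconds.
    """
    hits = 0
    warm_until = float("-inf")  # cache starts cold
    for t in run_times:
        if t <= warm_until:
            hits += 1
        warm_until = t + ttl
    return hits


# Quiet weekend: one run every two hours over a day.
weekend = [hour * 3600 for hour in range(0, 24, 2)]
# Busy Tuesday: eight runs, one minute apart (hypothetical spacing).
tuesday = [minute * 60 for minute in range(8)]

print(count_cache_hits(weekend), "hits out of", len(weekend), "weekend runs")
print(count_cache_hits(tuesday), "hits out of", len(tuesday), "Tuesday runs")
```

Under these assumptions every weekend run pays to rewrite the cache, while seven of the eight Tuesday runs read from it.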

Retries are the part that ruins your spreadsheet

Naive cost modeling looks at the table above and multiplies. Real cost modeling adds a retry distribution on top of it.

Branchpost retries in three places, each with a different budget:

  • Transient API errors (rate limits, 529s, network blips). Cheap. The request never completed, so you pay nothing for the failed attempt. Budget: unlimited, with exponential backoff.
  • Schema validation failures on the structured output. The model returned, but the JSON was malformed or the frontmatter was missing a required field. You paid for the failed output tokens. Budget: 2 retries.
  • Self-check rejections. The draft generated successfully but failed our quality eval (a structure ghost, a hedge avalanche, a missing thesis, the patterns I wrote about in notes against AI slop in developer blogs). You paid for the full draft and the eval. Budget: 1 retry, then the draft goes to the PR with a warning label on it.
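The three budgets above can be written down as data. This is a sketch, not Branchpost's actual code: the class, the dictionary keys, and the backoff parameters are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryBudget:
    max_retries: Optional[int]  # None means unlimited
    billed_on_failure: bool     # did the failed attempt cost tokens?


# Budgets as described in the list above; the key names are hypothetical.
BUDGETS = {
    "transient_api_error": RetryBudget(max_retries=None, billed_on_failure=False),
    "schema_validation": RetryBudget(max_retries=2, billed_on_failure=True),
    "self_check_rejection": RetryBudget(max_retries=1, billed_on_failure=True),
}


def may_retry(kind: str, retries_used: int) -> bool:
    """Return True if another retry is allowed for this failure kind."""
    budget = BUDGETS[kind]
    return budget.max_retries is None or retries_used < budget.max_retries


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff without jitter: base * 2**attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))
```

Keeping `billed_on_failure` next to the budget makes it harder to forget that two of the three retry paths cost money every time they fire.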

A run that succeeds on the first try costs us roughly $0.04 with a warm cache. A run that hits one schema retry and one self-check retry costs about $0.19. A run that exhausts both retry budgets and still ships a flagged draft costs $0.71. The distribution is roughly 78% first-try, 18% one-retry, 4% worst-case, measured across our internal generation traffic over the last sixty days.

Two things follow from this:

  1. You cannot price per-post without pricing per-retry. If your eval gets stricter, your retry rate goes up, and your unit cost goes up with it. Quality and cost are coupled.
  2. The cheapest improvement is usually making the first try better, not making retries cheaper. Cutting our first-try failure rate from 22% to 15% saved more money than any prompt compression we tried.
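Both points can be made concrete with the numbers above. The sketch below computes the expected cost per post from the 78/18/4 distribution, then asks what happens if first-try success rises to 85%. The what-if assumes the mix among the failure outcomes stays the same, which is a simplification, and the function names are hypothetical.

```python
# (probability, cost in USD) for each outcome, from the figures above.
OUTCOMES = [
    (0.78, 0.04),  # first try succeeds
    (0.18, 0.19),  # retry path
    (0.04, 0.71),  # both retry budgets exhausted
]


def expected_cost(outcomes):
    """Probability-weighted cost per post."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * cost for p, cost in outcomes)


def with_first_try_rate(outcomes, new_first_try_rate):
    """Rescale the failure outcomes so first-try success hits the new rate.

    Assumes outcomes[0] is the first-try outcome and that the relative mix
    of the remaining outcomes does not change.
    """
    (p_first, c_first), failures = outcomes[0], outcomes[1:]
    scale = (1.0 - new_first_try_rate) / (1.0 - p_first)
    return [(new_first_try_rate, c_first)] + [(p * scale, c) for p, c in failures]


baseline = expected_cost(OUTCOMES)
improved = expected_cost(with_first_try_rate(OUTCOMES, 0.85))
print(f"baseline ${baseline:.4f} per post, improved ${improved:.4f} per post")
```

With these figures the expected cost is a little over nine cents per post, more than double the first-try price, and the 4% worst-case tail accounts for roughly 30% of it.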

The line items most people forget

Three costs hide outside the model call:

Embedding the repo context. We embed selected files (readmes, recent commit messages, post archive summaries) to retrieve the right context for a given topic. Embeddings are cheap per call but they happen on every push to the watched branch. For an active repo, this is a small but steady background cost, somewhere around 8 to 15% of total spend.

Failed PR opens. Octokit calls don't cost tokens, but they cost time, and time on a Vercel function is metered. A draft that generates fine but fails to commit (merge conflict, branch protection, expired installation token) wastes the generation cost entirely. We had a week where 6% of successful drafts failed at the PR step. That's a 6% tax on the model bill that doesn't show up in your Anthropic dashboard.
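One way to fold this into a unit cost is to divide by the share of drafts that actually become PRs. A minimal sketch, assuming a failed draft is discarded and not recovered (the function name is hypothetical):

```python
def cost_per_opened_pr(generation_cost: float, pr_failure_rate: float) -> float:
    """Generation cost spread over the drafts that actually reach a PR."""
    if not 0.0 <= pr_failure_rate < 1.0:
        raise ValueError("pr_failure_rate must be in [0, 1)")
    return generation_cost / (1.0 - pr_failure_rate)


# A 6% failure rate at the PR step, applied to one unit of generation cost.
multiplier = cost_per_opened_pr(1.0, 0.06)
print(f"effective cost multiplier: {multiplier:.3f}")
```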

The eval model itself. Our self-check uses a smaller, cheaper model, but it still costs something. We tried using the same model that wrote the draft to grade it, and the cost went up about 4x for a quality improvement we couldn't measure. Cheaper graders, used carefully, are almost always the right call.

A defensible model for your own pipeline

If you're building something similar, here's the math I'd start with. Pick a target post. Measure:

  • Cached input tokens (publication context, style, examples)
  • Non-cached input tokens per stage
  • Output tokens per stage
  • First-try success rate
  • Retry distribution at each stage
  • Embedding and infra overhead as a percentage of model spend

Multiply through with your provider's pricing. Then multiply the result by 1.5 to account for the retries you haven't measured yet, because you will discover new failure modes the moment you ship to real users. The structured output that worked perfectly on your test set will fail in interesting ways on a repo whose readme is in Portuguese. The self-check will reject drafts that are actually fine. Budget for the surprise.
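Here is that checklist as a small calculator. Treat it as a starting sketch, not a finished model: the class and function names are hypothetical, the prices and retry multiples are placeholders to replace with your own measurements, and it prices every stage at one rate even though a cheaper grader model would have its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    input_tokens: int   # non-cached input tokens
    output_tokens: int


@dataclass(frozen=True)
class Prices:
    input_per_mtok: float   # USD per million fresh input tokens
    output_per_mtok: float  # USD per million output tokens
    cache_read_multiplier: float = 0.10


def first_try_cost(cached_tokens: int, stages: List[Stage], prices: Prices) -> float:
    """Model cost in USD of one run with a warm cache and no retries."""
    cached = cached_tokens * prices.input_per_mtok * prices.cache_read_multiplier
    fresh = sum(s.input_tokens for s in stages) * prices.input_per_mtok
    output = sum(s.output_tokens for s in stages) * prices.output_per_mtok
    return (cached + fresh + output) / 1_000_000


def budgeted_cost(
    base_cost: float,
    outcomes: List[Tuple[float, float]],
    overhead_fraction: float,
    safety_factor: float = 1.5,
) -> float:
    """Per-post budget including retries, overhead, and a safety margin.

    outcomes holds (probability, cost as a multiple of base_cost) pairs.
    overhead_fraction is embedding and infra spend relative to model spend.
    """
    expected_multiple = sum(p * m for p, m in outcomes)
    return base_cost * expected_multiple * (1.0 + overhead_fraction) * safety_factor


# Placeholder inputs: substitute your own measurements and provider prices.
prices = Prices(input_per_mtok=1.00, output_per_mtok=5.00)
stages = [
    Stage("outline", 1_200, 600),
    Stage("draft", 2_100, 2_800),
    Stage("self_check", 3_400, 400),
]
base = first_try_cost(8_400, stages, prices)
budget = budgeted_cost(
    base,
    outcomes=[(0.78, 1.0), (0.18, 2.0), (0.04, 4.0)],
    overhead_fraction=0.10,
)
print(f"first-try cost ${base:.4f}, budgeted cost ${budget:.4f}")
```

The safety factor is deliberately the last multiplication, so you can lower it toward 1.0 as measured retry data replaces the guess.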

The thing I'd push back on, if you're evaluating AI tooling as a buyer, is any pricing page that quotes a single per-post number without explaining the variance. The variance is the product. A vendor who hasn't thought about retry budgets and cache warmth hasn't run the pipeline at scale yet.

Branchpost watches your repo and opens PRs with blog drafts. You review them like code. branchpost.dev
