Building AI-native products on Claude: a Honolulu studio's playbook
Six products live, all built against the Anthropic Claude API. Here's how we structure AI calls in production, what we cache and what we don't, how we keep latency under 2 seconds for user-facing flows, and the patterns that have actually shipped — not the hype.
The studio ships six products against Anthropic’s Claude API. They span B2B SaaS (BlueWave Projects, an AI-native platform for general contractors), consumer subscription (Property Brief, a weekly homeowner brief on Hawaii real estate), iOS (ProBuildCalc, a LiDAR scanner with AI estimating), and several smaller tools. Every one of them makes Claude calls in production paths that real users wait on.
This is the playbook we’ve converged on. None of it is theoretical. All of it is in shipping code.
Where AI lives in a product, and where it doesn’t
The first decision when integrating Claude into a product is which surface the model touches. We have a rule that’s served us well:
Claude lives behind the deterministic path, not in front of it. User intents that have a clear right answer — log in, view a record, save a form, list items — get handled by traditional code. Claude shows up only where the work is inherently ambiguous, generative, or judgment-heavy: writing a project scope from a sketch, summarizing a week of permit activity, drafting a contractor estimate from a LiDAR scan.
This isn’t a purity argument. It’s about user expectations. Users tolerate a 2-second wait when they know the system is thinking. They don’t tolerate it when they pressed “save.” Keeping AI off the deterministic paths preserves the snap of the rest of the product.
The four call patterns we actually use
After eighteen months and roughly 400,000 Claude API calls across our portfolio, we use four distinct integration patterns. Each fits a different latency and reliability profile.
Pattern 1: Inline generative (user is waiting)
Used when the user has explicitly asked Claude to write or think — generating a project scope, writing a weekly summary, drafting an email. Latency budget: 2–8 seconds.
The pattern:
- Single non-streaming call for outputs under ~500 tokens
- Streaming call (server-sent events) for anything longer, so the user sees progress
- Aggressive use of prompt caching on the system prompt and any large context (project history, document base) — typical cache hit rates run 70-95% in production
- Hard timeout at 15 seconds with a fallback message (“Claude is taking longer than usual — try again or refresh”)
The system prompt for inline calls is typically 800–1,500 tokens and includes role definition, output format, and constraints. We cache it. The user-specific context is usually 1,000–4,000 tokens — also cached after the first call of a session.
Pattern 2: Async generative (user moves on)
Used when the work is bigger but the user doesn’t need to watch it happen — generating a contractor’s weekly client report from logs, summarizing a month of permit activity for a Property Brief email, processing a large LiDAR scan into a draft estimate. Latency budget: 30 seconds to 5 minutes.
The pattern:
- Enqueue the job to a Postgres-backed queue (we use a simple
jobstable withpg_notifyfor triggering — no Redis needed for our scale) - A worker process picks it up, calls Claude with appropriate model selection (Sonnet for most of these, Opus when output quality matters more than cost)
- Results write back to the relevant record
- User gets a real-time notification or sees the result on next page load
Async lets us use higher-quality (slower) models without blocking anyone. We also use this pattern for batch operations — generating 100 Property Brief weekly emails on Tuesday night, for instance.
Pattern 3: Inline classification (user is waiting, but barely)
Used when Claude is making a decision rather than writing prose — categorizing a lead intent, classifying a permit type, deciding which template to apply. Latency budget: 400ms to 1.5 seconds.
The pattern:
- Haiku 4.5 almost always, unless the classification needs Opus-level reasoning (rare)
- Tight system prompt (300–800 tokens) with explicit output format (“Respond with exactly one of: [a, b, c, d]”)
- Use of
response_formator structured output when applicable - Hard timeout at 2 seconds with a deterministic fallback (apply the default template, log the timeout, move on)
The key with classification is keeping the prompt tight and using the smallest model that produces correct output. We benchmark every classification surface against a labeled set of 50–200 real production examples and only ship when accuracy is above 95% for the target class.
Pattern 4: Tool-using agent (background, multi-step)
Used for our most complex flows — the AI scope generator in BlueWave that reads a LiDAR scan, looks up the parcel TMK in hawaii-as-code, checks recent comparable projects, queries the supplier catalog, and produces a draft estimate with line items. Latency budget: 15 seconds to 2 minutes, async.
The pattern:
- Multi-turn loop with Claude calling tools defined via the Anthropic SDK’s tool-use schema
- Tools are thin wrappers over our internal API endpoints —
lookup_parcel,get_recent_projects,query_supplier_catalog,calculate_subtotals - Each tool returns structured JSON
- The loop terminates when Claude says it’s done or when we hit a max-iteration cap (we use 8 — anything more usually means the agent is stuck)
- We always log the full trace for debugging and quality review
Agentic loops are powerful but expensive in tokens. We use them when the work genuinely requires multi-step reasoning and the latency budget allows for it. Otherwise we prefer to do the orchestration in code and use single-shot Claude calls for the reasoning steps.
Prompt caching: the highest-leverage optimization
Anthropic’s prompt caching is the single highest-leverage optimization we’ve made. Most users underestimate it.
Our typical production call looks like:
- System prompt: 1,200 tokens, cached
- User context (project history, document base, etc.): 2,500 tokens, cached
- User input (the actual request): 200 tokens, not cached
Without caching: ~3,900 input tokens per call.
With caching (after the first call of a session): ~200 input tokens billed at full rate, ~3,700 tokens billed at the cached rate (90% discount).
Effective cost reduction: about 80% on input tokens for repeat calls in a session. For products like BlueWave where a user makes 20–50 Claude calls in a session, the savings compound enormously.
The cache TTL is 5 minutes by default and you can extend to 1 hour with the right header. We tune per-product based on session patterns.
Model selection by surface
We don’t use one model for everything. Our typical distribution:
- Haiku 4.5: classification, simple summarization, fast inline operations. Roughly 60% of our call volume.
- Sonnet 4.6: most generative work — scope writing, email drafting, report generation. Roughly 30% of our call volume.
- Opus 4.7: complex reasoning, agentic loops, anything where output quality is the bottleneck. Roughly 10% of our call volume, but a larger share of our cost.
We picked these allocations empirically. For each new surface, we start on Sonnet, then test against Haiku for cost reduction (does accuracy hold?) or against Opus for quality (does the upgrade matter?). About 40% of the surfaces we initially built on Sonnet eventually moved to Haiku once we tightened the prompts.
What we don’t do
A few patterns we explicitly avoid:
No LangChain, no LangGraph, no autogen. Direct SDK calls (@anthropic-ai/sdk on Node, anthropic on Python). Every layer of framework between us and the API is a layer that can break in a new SDK version, a new model release, or a corner case the framework didn’t anticipate. We’ve debugged enough framework issues to prefer the boring direct path.
No LLM-routing layers like OpenRouter or PortKey. We pay Anthropic directly. The reliability and rate-limit transparency of going direct is worth more than the unification of having one bill for multiple providers.
No vector databases for RAG unless we genuinely need them. Most of our “give the AI access to context” needs are served by passing the relevant text directly in the prompt (cached) and letting Claude work with it. Vector DBs add complexity and an extra failure mode. We use them in two places — Property Brief’s recommendation engine and ProBuildCalc’s similar-scan retrieval — and that’s it. The default is “just put the text in the prompt.”
No streaming for outputs under 200 tokens. The streaming overhead exceeds the latency savings for short responses. We benchmark per surface.
Reliability patterns
In production, Claude calls fail. Not often, but predictably. Our standard error handling:
- Retry on 5xx and 429 with exponential backoff: 3 retries, 1s/2s/4s intervals
- Retry on overloaded errors with a fallback to a smaller model: if Opus is overloaded, fall through to Sonnet
- Don’t retry on 4xx other than 429: those are usually our fault (bad prompt, exceeded context window) and retrying won’t help
- Circuit breaker per surface: if the surface has had 5 failures in the last 60 seconds, briefly degrade (show a “AI temporarily unavailable” message, fall back to non-AI path)
- Log every call: request, response, latency, token counts, cache hit rate. We use a simple Postgres table with an index on
created_atandsurface_id. Two months of logs at our volume is ~3GB. Worth every byte.
We learned the value of logging the hard way. The third time you debug a “Claude said something weird” report by trying to reconstruct what the prompt was, you start logging.
Cost in practice
For a working baseline: BlueWave Projects in private beta makes roughly 8,000 Claude calls per month across all tenants. The split is roughly:
- 60% Haiku ($0.80/M input + $4/M output, with cache discount)
- 30% Sonnet ($3/M input + $15/M output, with cache discount)
- 10% Opus ($15/M input + $75/M output, with cache discount)
Monthly Anthropic bill at this volume: about $180. That’s with aggressive caching. Without caching it would be roughly 3x.
By the time BlueWave is at GA with 200 tenants, we project the bill at $3,500–$5,500/month. Still trivial compared to the alternative of hand-rolling these capabilities or running self-hosted models for marginal quality gains.
What we’d tell a founder building their first Claude integration
Three things:
-
Start with a single Claude call in a single surface. Don’t try to AI-everything. Pick the one place where Claude is obviously the right tool — usually a generative task where the user is waiting — and ship that well before extending.
-
Read the Anthropic prompt engineering docs once a quarter. They get updated. The model-specific guidance for Sonnet 4.6 and Opus 4.7 is more current than most third-party tutorials.
-
Cache aggressively, log everything, and benchmark on real data. These three habits will outperform almost any framework choice or architectural pattern.
If you’re building something AI-native and want to walk through your architecture, the studio takes those calls. Bring your current stack, your latency targets, and your first user flow that needs AI. We’ll map it.
Ikena Design & Build is a software studio in Honolulu, Hawaiʻi. We design and build AI-native products on Anthropic Claude — every product in our portfolio runs against the Claude API in production. See the portfolio.
Building Hawaii Insurability Brief: A Free Property Hazard Lookup for Hawaii Homeowners
A builder's account of why we built a free 10-field hazard lookup for Hawaii properties — lava zone, FEMA flood, tsunami, coastline distance, hurricane wind, wildfire risk, BFE, shoreline management area, roof age, and carrier availability — and how the freemium model works.
Hawaii GET tax on SaaS subscriptions: a practical guide
Hawaii taxes SaaS revenue differently than most US states — and very few founders know the rules until their first G-45 return is overdue. Here's how Hawaii's General Excise Tax actually works for subscription software businesses, with the rates, the forms, and the traps.
From hand-cut joinery to AI-native software
Ikena Design & Build started in Alaska timber-frame and Honolulu finish carpentry. Today it ships software. The standard didn't change — the substrate did. Here's the through-line, and why a contractor's instincts make a good operator-engineer.