Leapfrog

The Labs

Experiments to run

Forty-six small, real tasks — one set per chapter — drawn straight from the book. No task needs more than an afternoon.

Part I

Why & Where

Why AI is a must, and where it actually belongs.

Chapter 1

The Convergence

Spend $5 and a lunch break. Get an API key from any one provider your organization already touches, and make a single call from your laptop that sends a real internal question and gets an answer back. The goal is not the answer — it is to feel, viscerally, how low the barrier to entry actually is compared to what the "yes, but" culture told you.
Find your smallest real task. Identify one genuinely small, genuinely real piece of work in your world — a report nobody enjoys writing, a triage step, a lookup — that a model could plausibly help with. Do not build it yet. Just write down what "delivered" would mean: who would use it, and how you'd know it worked. You are practicing the scarce skill — *knowing what to build* — before the easy one.
Break something on purpose. Ask a model the same question five times and keep the answers side by side. Watch it be non-deterministic. Sit with the discomfort. This is the new muscle; better to meet it in a throwaway experiment than in production.
Write your own "yes, but," then answer it. Take the objection you personally expect to hear first when you propose building something with AI at work, and draft the two-sentence rebuttal. You will use it sooner than you think.

Chapter 2

Where AI Belongs

Run the spellchecker test. Take an AI idea someone in your organization is excited about and ask: would this end up with a Chief Something Officer and a budget line, or would it just disappear into how the work gets done? If the former, you're probably staring at a chatbot. Go find its embedded version.
Map a workflow to the jagged frontier. Pick one real workflow, break it into its tasks, and label each: AI clearly helps, AI clearly hurts, or human-only. The labels will not match your first instinct — and that surprise is the entire point.
Find the redesign, not the add-on. For that same workflow, write how it would run if it were *rebuilt* around AI — different steps, different handoffs — rather than AI stapled onto today's version. That description is the co-invention. It's the actual work, and most teams never write it down.
Name the rung. For a feature you're considering, decide honestly which rung it should live on — assistant, embedded, or agentic — and whether what you're about to build is aimed there or quietly stuck on rung one.

Chapter 3

The Landscape Without the Hype

Same prompt, two providers, one interface. Put a gateway (hosted or a lightweight self-hosted one) in front of two different providers and send the same real internal prompt to each. Compare the answers, the latency, and the cost side by side. The goal is not to crown a winner — it's to feel how little your code cares which one answered.
Route by difficulty. Send an easy, high-volume task (say, classifying or summarizing) to a cheap small model, and a genuinely hard one to a frontier model. Measure the cost difference. That gap, multiplied by your real volume, is the money a multi-model strategy leaves on the table when you hard-code one model.
Run the exit test. Ask honestly: if your main provider doubled its price or pulled your model tomorrow, is switching a config change or a multi-sprint project? Then actually try to reroute one workload and time it. Discover the answer now, on your terms, not during an incident.
Break a provider on purpose. Wire up a fallback chain, then pull the credentials for your primary provider and watch traffic fail over to the backup. Resilience you haven't tested is a resilience you don't have.

Chapter 4

Cloud Foundations for the AI Engineer

Run a model on your own machine tonight. Install a local runtime, pull a small open-weight model, and call its local endpoint from a script. Total cost: zero. The point is to feel, physically, how low the on-ramp is.
Turn "local" into "hosted." Take that same little app and deploy it as a container behind a stable endpoint somewhere it stays up — however small. Notice the exact moment it stops being local, and notice that your application code barely changed. That's the "there is always a host" lesson, felt rather than read.
Version a prompt like code. Put a prompt in a repository, change it in a pull request, and require one automated check to pass before it can merge. That's governance-as-code in its smallest possible form — build the habit at toy scale.
Destroy it and bring it back. Define your environment entirely in infrastructure-as-code, tear the whole thing down, and recreate it identically from the code. The confidence that gives you *is* your license to experiment freely.

Chapter 5

How Models Are Served and Consumed

Read your own meter. Run one real task and log the input tokens, the output tokens, and the cost separately. Then ask the model for a concise or structured answer and run it again. Watch the bill fall from the output side. You'll never again forget which direction is expensive.
Pull all three levers. Take a workload you run repeatedly. Cache the system prompt, route the easy cases to a cheaper model, and move the non-urgent ones to batch. Measure cost per unit of work before and after — that delta is the difference between a pilot and a product.
Pay the reasoning tax on purpose. Run the same hard task with reasoning off, low, and high. Line up quality, tokens, latency, and cost side by side, and find the point where more thinking stops earning its keep. That point is a design parameter, not a mystery.
Find your real ceiling. Take a model with a big advertised window, bury one specific fact in the middle of a long input, and ask for it back. Whether it comes back tells you your *effective* context — the number the datasheet won't.
Meter an image. Send the same picture to a vision model at high detail and at low detail, and log the token cost of each. Then compare a high-volume image workload against its text-only equivalent. That gap is why resolution is a cost lever, not a formatting choice.

Part II

Build the Thing

Patterns that hold up, and grounding a model in your data.

Chapter 6

Patterns That Hold Up

Assemble the context on purpose. Take one task and deliberately design what goes into the window — the instructions, the few facts it actually needs, the output format — then strip half of it out and watch the quality move. That swing is context engineering, felt rather than read.
Enforce a contract. Make a model call return strict, schema-valid JSON that your code parses with no defensive gymnastics. You've just converted a text generator into a typed component you can build on.
Give it one tool — then sabotage it. Connect a single tool through the open protocol and watch the model call it. Then vague out the prompt until the model *has* the tool but stops using it well. That's the difference between integration and intelligence, in your hands.
Resist the agent. Take the thing you're itching to build as an autonomous agent and build it as a fixed workflow instead. Ship that first. Reach for agency only when the workflow provably can't handle the variance — and notice how rarely that turns out to be true.

Chapter 7

Data, Retrieval, and Grounding

Watch naive RAG fail. Build the simplest embed → top-k → stuff → generate pipeline on your own documents, then hunt for a query it gets wrong — one with an exact term, a part number, or a name, usually does it. That failure is the whole reason the rest of the stack exists.
Add hybrid search and a reranker. Put keyword search alongside your vector search, fuse the results, and rerank the top candidates. Measure retrieval quality before and after on a set of real questions. Feel where the twenty-to-thirty-five percent comes from.
Break the permission boundary — on purpose. Put two different users' documents in one index and check, honestly, whether user A can retrieve user B's data. Then fix it at retrieval, not at the interface. Far better you find this than an auditor does.
Ground it, then check faithfulness. Make the model cite its sources, and then verify that the citations actually support the answer. The gap between "cited" and "supported" is the gap between genuinely grounded and merely confident.
Run the fine-tuning decision honestly. Take a case where someone wants to fine-tune and sort the goal into *form* or *facts*. If it's facts, prove prompting and retrieval can't do it first. If it's form, check whether a stricter prompt and structured outputs (Chapter 6) already get you there. You'll usually talk yourself out of a training pipeline — which is the point.

Interlude

Ship Something Tiny This Week

Stop reading. Close one small loop end to end, before the deep chapters explain what you just did.

The Tiny Build

One workflow, one afternoon, training wheels on

Pick the workflow & define "better." One real task you're allowed to touch and can measure. Write down the baseline and the target first.
Make one real call. Send a real input to a hosted model and read your meter — tokens in, tokens out, cost.
Ground it in a little real data. Feed it a handful of your own documents with the simplest retrieval. Watch where it helps and where it fails.
Make it do one real thing. A structured output your code acts on, or one tool call. Now it's embedded, not a chat.
Add a tiny check & a tiny bound. One eval on a few cases; least privilege, a cap, a human on anything irreversible.
Show one real user. The person who does this workflow. Their reaction is your truest eval.

That's the whole loop this book is about — access, cost, grounding, action, evaluation, control — small enough to hold in your head. The gap between the 5% and the 95% was never knowledge. It was having shipped.

Part III

Survive Monday

Deploy it, keep it in control, and know it works.

Chapter 8

Deploying and Operating AI Workloads

Put your build behind a deployed endpoint. Take the thing you built in Chapters 6–7, containerize it, deploy it somewhere it stays up unattended, and call it from outside. Cross the local-to-hosted line from Chapter 4 for real, and notice everything that "worked in the notebook" now needs a second look.
Add an eval gate. Build a tiny golden dataset and wire it into your pipeline so a prompt change can't merge unless it passes. Then break a prompt on purpose and watch the gate stop you. That gate is what stands between you and a silent regression.
Shadow a model swap. Route a slice of traffic to a different model or version, compare its outputs against the current one without showing them to users, and decide with data instead of a hunch. You've just rehearsed your defense against prompt drift.
Give an agent a kill switch and a leash. Take a simple agent, scope its tools to least privilege, cap its steps and its spend, and add a stop you can hit. Then try to make it misbehave — and confirm the blast radius is exactly as small as you designed it. That confirmation is what "in control" feels like.

Chapter 9

Observability, Evaluation, and Debugging

Trace one real request end to end. Instrument a single request and look at the whole tree — generation, retrieval, tool calls, each with its inputs, outputs, and cost. Notice how much your old logs never showed you.
Score a golden set. Build a small set of representative inputs, pick one metric — faithfulness is a good first — and score your outputs to get a real number. Then curate five genuine failures back into the set. You've just started the feedback loop.
Alert on quality, not latency. Wire an alert that fires when an evaluation score drops rather than when a request is slow. Break something on purpose and watch it catch what your infrastructure dashboards never would.
Root-cause a bad answer. Take one wrong output and walk its trace to name the actual culprit. Bet, before you look, that it's the model — and notice how often it isn't.

Part IV

At Scale & Forward

Cost, performance, security, and the road ahead.

Chapter 10

Scaling, Performance, and Cost Control

Attribute one feature's spend. Pass team-and-feature metadata through your gateway, join it to the bill, and produce a single number: the cost per transaction of one real feature. Then say that number out loud in your CFO's vocabulary, not in tokens.
Find your value per token. For one workload, compute the cost per unit of value it produces — per resolved ticket, per booking, per document. Decide, with a straight face, whether it's worth it. That decision is the whole discipline in miniature.
Cache, then verify. Add semantic caching to a repeated workload and measure the cost drop — then run the Chapter 9 eval to confirm quality held. A saving that degrades the product isn't a saving.
Do the edge math honestly. Pick one latency-sensitive, high-volume workload and estimate centralized versus edge: latency, egress cost, residency — and the operational overhead of running a fleet. Let the full picture, not the latency number alone, make the call.

Chapter 11

Security, Governance, and the Road Ahead

Red-team your own thing. Take something you built and attack it. Type a direct injection; then hide an instruction inside a document it will retrieve and watch whether it obeys. Fifteen minutes here teaches more than any threat report.
Write the one-page risk note. For one system, on a single page: what it can do (its blast radius), what could go wrong (the two or three OWASP risks that actually apply), what controls bound it (Chapter 8), and who holds stop authority. That page is governance made real.
Inventory your AI. List every AI system and feature running in your corner of the organization and classify each by risk tier. You will find shadow AI you didn't know about — and that list is where governance actually begins.
Place yourself on the roadmap. Rate honestly where you and your organization are — one workflow, or attribution, or guardrails, or agents — then name the single next rung and go ship it. The roadmap only works if you take the first step on it.

The book is the why.This is where you build.

A map, and a place to walk it

The book is the shape

The labs are the reps

The pointers move

Experiments to run

Why & Where

The Convergence

Where AI Belongs

The Landscape Without the Hype

Cloud Foundations for the AI Engineer

How Models Are Served and Consumed

Build the Thing

Patterns That Hold Up

Data, Retrieval, and Grounding

Ship Something Tiny This Week

One workflow, one afternoon, training wheels on

Survive Monday

Deploying and Operating AI Workloads

Observability, Evaluation, and Debugging

At Scale & Forward

Scaling, Performance, and Cost Control

Security, Governance, and the Road Ahead

A companion that computes

Token & Cost Calculator

Workflow → Jagged-Frontier Mapper

One-Page Risk Note

Follow primary sources, not the hype

Stanford HAI — AI Index

MIT Sloan — Ideas Made to Matter

OWASP — LLM & Agentic Top 10

NIST — AI Risk Management Framework

Artificial Analysis

EU AI Act — official