it is task design for a next-token predictor. Few-shot examples set the format, Chain-of-Thought creates reasoning scaffolding, Self-Consistency samples multiple paths and votes, and modern agent prompts define tool boundaries and stopping conditions.

One-sentence summary: Prompt engineering works because LLMs are next-token predictors---a good prompt creates a context where the correct answer is the most likely continuation; Few-shot, CoT, Self-Consistency, and ReAct are the engineering tools that build that context systematically.


28.1 Why Prompting Still Matters

28.1.1 Same model, different outcomes

Consider asking a model to solve a multi-step scheduling problem.

Prompt A (direct):

A team has sprints of 2 weeks. They have 47 story points of work.
Their velocity is 18 points per sprint. How many sprints to finish?

Prompt B (scaffolded):

A team has sprints of 2 weeks. They have 47 story points of work.
Their velocity is 18 points per sprint.

Let's work through this:
1. How many full sprints cover the work?
2. Is there a partial sprint remaining?

On straightforward arithmetic GPT-class models get both right. On harder problems---multi-hop reasoning, constraint satisfaction, anything requiring several intermediate steps---Prompt B can outperform Prompt A by 20-50 percentage points. The difference is not the model; it is the context you give the model to work from.

28.1.2 The model does not "understand"---it continues

The mental model that makes all prompt techniques click:

An LLM is a conditional probability distribution. Given everything so far, it predicts the most likely next token.

The model is not solving problems in the human sense. It is continuing text. A good prompt is one where the correct answer happens to be the most probable continuation. A bad prompt is one where many continuations are plausible, including wrong ones.

This means:

  • Good prompt = well-defined context where the answer is the natural next thing
  • Bad prompt = ambiguous context where the model can "complete" in too many ways

28.1.3 Three layers of prompting skill

Layer        | Techniques                                 | Typical improvement
Basic        | Clear wording, format constraints          | Reduces ambiguity, stabilizes output
Intermediate | Few-shot, CoT, role prompting              | +10-40% on complex tasks
Advanced     | Self-Consistency, ToT, ReAct, agent design | Approaches human expert level on structured tasks

This chapter covers all three, with working code for each.


28.2 Zero-Shot and Few-Shot Prompting

28.2.1 The three prompt types

Type      | Examples given | When to use
Zero-shot | 0              | Task is well-understood by the model
One-shot  | 1              | You need to set the output format
Few-shot  | 2-10           | Classification, complex formatting, judgment with edge cases

28.2.2 Zero-shot

Direct request, no examples:

prompt = """
Classify the following pull request comment as: bug, feature, refactor, or question.

Comment: "The retry logic doesn't handle the case where the upstream returns 429 before the connection is established."
Category:
"""
# Model output: bug

Works well when the task is within the model's training distribution and the category names are self-explanatory. Fails when the output format matters or when the categories are domain-specific.

28.2.3 One-shot: setting the format

prompt = """
Classify the following pull request comment. Output exactly one word: bug, feature, refactor, or question.

Example:
Comment: "Can we add a timeout parameter here?"
Category: question

Now classify:
Comment: "The retry logic doesn't handle 429 before connection is established."
Category:
"""
# The example locks in the output format.

One example does a lot of work. It demonstrates the output format, signals what level of detail you want, and gives the model a template to continue.

28.2.4 Few-shot: covering the space

prompt = """
Classify each pull request comment. Output exactly one word: bug, feature, refactor, or question.

Comment: "Can we add a timeout parameter here?"
Category: question

Comment: "Connection pool leaks when the server closes the socket unexpectedly."
Category: bug

Comment: "Extract the validation logic into a separate function."
Category: refactor

Comment: "Add a --dry-run flag so engineers can preview changes."
Category: feature

Now classify:
Comment: "The agent loop exits without flushing the write buffer."
Category:
"""

Few-shot design rules:

  1. At least one example per class you care about
  2. 3-8 examples is usually the sweet spot; beyond 10, context cost grows without proportional gain
  3. Keep examples high quality---bad examples teach bad patterns
  4. The last example or two have slightly more influence; put your hardest class there
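These rules can be folded into a small builder; a minimal sketch (the function name and the (comment, label) example format are illustrative, not from any library):

```python
def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],  # (comment, label) pairs
    query: str,
) -> str:
    """Assemble a few-shot classification prompt from labeled examples.

    Per the rules above: keep 3-8 examples, cover every class,
    and put the hardest class last.
    """
    blocks = [instruction, ""]
    for comment, label in examples:
        blocks.append(f'Comment: "{comment}"\nCategory: {label}\n')
    blocks.append(f'Now classify:\nComment: "{query}"\nCategory:')
    return "\n".join(blocks)

prompt = build_few_shot_prompt(
    "Classify each pull request comment. Output exactly one word: "
    "bug, feature, refactor, or question.",
    [
        ("Can we add a timeout parameter here?", "question"),
        ("Connection pool leaks when the socket closes.", "bug"),
        ("Extract the validation logic into a function.", "refactor"),
    ],
    "The agent loop exits without flushing the write buffer.",
)
```

Keeping the builder separate from the example data also makes it easy to swap example sets per class distribution without touching the prompt scaffolding.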

28.2.5 When to use which

Scenario                                      | Recommendation
Simple generation task (translation, summary) | Zero-shot
Strict output format required                 | One-shot at minimum
Multi-class classification                    | Few-shot with all classes represented
Complex reasoning                             | Few-shot + CoT (next section)
Context budget is tight                       | Zero-shot or one-shot

A quick harness for comparing the strategies on a held-out set:

import openai

def test_prompts(prompts, test_cases):
    """Compare prompt strategies on the same test cases."""
    for name, prompt in prompts.items():
        correct = 0
        for text, expected in test_cases:
            response = openai.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt + text}],
                max_tokens=5
            )
            answer = response.choices[0].message.content.strip()
            if expected.lower() in answer.lower():
                correct += 1
        print(f"{name}: {correct}/{len(test_cases)}")

# Typical result pattern:
# Zero-shot:  75%
# One-shot:   83%
# Few-shot:   91%

28.3 Chain-of-Thought (CoT)

28.3.1 The discovery

In 2022 Google researchers published a finding that surprised many people:

Adding "Let's think step by step" to a prompt improved accuracy on math and logical reasoning tasks by 20-50 percentage points on models that were already strong.

No additional training. No architectural change. One sentence.

28.3.2 Why it works

Return to the core model: it predicts the next token from context.

Without CoT:

Q: An agent processes 12 requests per minute and has been running for 4 hours and 20 minutes.
   How many requests did it handle?
A:

The model jumps from question to answer in one step. For complex arithmetic this means a single token prediction must compress the whole computation. Easy to get wrong.

With CoT:

Q: An agent processes 12 requests per minute and has been running for 4 hours and 20 minutes.
   How many requests did it handle?
Let's think step by step:

Now the model generates reasoning: "4 hours = 240 minutes. 240 + 20 = 260 minutes. 260 × 12 = 3,120 requests." Each intermediate result appears in the context and becomes the foundation for the next token. The model is effectively "writing in the margin" before committing to an answer.

Three mechanisms at work:

  1. Intermediate outputs become inputs. Each computation step is in context for the next.
  2. Smaller jumps. Breaking a problem into steps reduces the per-step complexity.
  3. Self-consistency signal. If the visible reasoning contradicts the answer, that tension is detectable.

28.3.3 Zero-shot CoT

The minimal version: add a trigger phrase.

prompt_no_cot = """
A deployment pipeline has 5 stages. Each stage takes 8 minutes.
Two stages can run in parallel if they have no dependencies.
Stages 2 and 3 depend on stage 1 and can run in parallel with each other.
What is the minimum time to complete all stages?

Answer:
"""

prompt_with_cot = """
A deployment pipeline has 5 stages. Each stage takes 8 minutes.
Two stages can run in parallel if they have no dependencies.
Stages 2 and 3 depend on stage 1 and can run in parallel with each other.
What is the minimum time to complete all stages?

Let's think step by step:
"""

# Expected CoT output:
# Stage 1 must complete first: 8 minutes.
# Stages 2 and 3 can then run in parallel: 8 minutes.
# Stages 4 and 5 depend on... [continues]

Common zero-shot CoT triggers:

  • "Let's think step by step"
  • "Think carefully before answering"
  • "Work through this systematically"
  • "First, ... Then, ... Finally, ..."

28.3.4 Few-shot CoT

More powerful: show the model what a good reasoning trace looks like.

prompt = """
Q: An agent pipeline reads from 3 queues. Queue A delivers 5 msgs/s, Queue B delivers 8 msgs/s, Queue C delivers 3 msgs/s. After 10 seconds, how many messages have arrived?
Step-by-step:
1. Queue A: 5 × 10 = 50 messages
2. Queue B: 8 × 10 = 80 messages
3. Queue C: 3 × 10 = 30 messages
4. Total: 50 + 80 + 30 = 160 messages
Answer: 160 messages

Q: A pull request review cycle: author posts PR (day 0), first review in 1-3 days, fixes in 1 day, final review in 1 day, merge same day. What is the earliest day of merge?
Step-by-step:
1. PR posted: day 0
2. First review: earliest day 1
3. Fixes: day 2
4. Final review: day 3
5. Merge: day 3
Answer: day 3

Q: A team velocity is 22 points per sprint (2 weeks). They have a backlog of 80 points. One engineer goes on vacation for the first sprint, reducing velocity by 20%. How many total weeks until the backlog is cleared?
Step-by-step:
"""

# Model continues with a structured reasoning trace.

28.3.5 CoT comparison

Feature           | Zero-shot CoT     | Few-shot CoT
Examples needed   | none              | 2-8
Reasoning quality | good              | better
Preparation work  | minimal           | moderate
Context cost      | low               | medium
Best for          | quick exploration | production-quality reasoning

28.3.6 Concrete example: multi-step reasoning

import openai

def solve_with_cot(problem: str) -> str:
    prompt = f"""
You are an experienced software architect. Solve the problem below by showing every step.
Use the format: Step N: [calculation or reasoning]
Finish with: Answer: [final answer]

Problem: {problem}

Step 1:"""

    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0   # deterministic for math
    )
    return response.choices[0].message.content

problem = """
A microservice handles 1,000 requests per second at baseline.
During a traffic spike it needs to handle 3x load for 5 minutes,
then 2x load for the next 10 minutes, then returns to baseline.
If each instance can handle 250 RPS and startup takes 90 seconds,
how many extra instances must be running before the spike begins?
"""
print(solve_with_cot(problem))

28.4 Self-Consistency

28.4.1 The problem CoT does not fully solve

CoT dramatically improves accuracy, but a single CoT path can still go wrong. The model samples from a probability distribution. With temperature > 0, different runs produce different reasoning traces---some correct, some not.

Self-Consistency turns this variability from a problem into a strength.

28.4.2 The mechanism

Sample multiple independent reasoning paths for the same problem, then vote:

                  ┌──────────────────────┐
                  │     Same problem     │
                  └──────────┬───────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│    Path 1    │    │    Path 2    │    │    Path 3    │
│  Answer: 17  │    │  Answer: 17  │    │  Answer: 15  │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                           ▼
                  ┌──────────────────────┐
                  │  Majority vote: 17   │
                  └──────────────────────┘

Steps:

  1. Generate N CoT responses with temperature > 0 (diversity is desirable here)
  2. Extract the final answer from each
  3. Return the most frequent answer

28.4.3 Code implementation

import openai
from collections import Counter
import re

def self_consistency(
    problem: str,
    num_samples: int = 5,
    temperature: float = 0.7
) -> dict:
    """
    Run Self-Consistency over a reasoning problem.

    Args:
        problem:      the problem statement
        num_samples:  how many reasoning paths to sample
        temperature:  diversity of samples (0.5-0.8 recommended)

    Returns:
        dict with answer, confidence, and all sampled answers
    """
    prompt = f"""
Solve the following problem step by step.
At the end, write "Answer: X" where X is the numeric answer.

Problem: {problem}

Solution:
"""

    answers = []
    all_paths = []

    for _ in range(num_samples):
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=500
        )
        text = response.choices[0].message.content
        all_paths.append(text)

        # Extract the numeric answer
        match = re.search(r'Answer:\s*(\d+(?:\.\d+)?)', text)
        if match:
            answers.append(float(match.group(1)))

    if not answers:
        return {"answer": None, "confidence": 0, "all_answers": []}

    counter = Counter(answers)
    best_answer, count = counter.most_common(1)[0]

    return {
        "answer": best_answer,
        "confidence": count / len(answers),
        "all_answers": answers,
        "sample_paths": all_paths
    }

# Usage
problem = """
An agent batch-processes 1,000 files. Each file takes 200ms.
With 4 workers in parallel the effective rate is 4x.
How many seconds to complete the full batch?
"""

result = self_consistency(problem, num_samples=7)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"All sampled answers: {result['all_answers']}")

# Expected output:
# Answer: 50.0
# Confidence: 100%
# All sampled answers: [50.0, 50.0, 50.0, 50.0, 50.0, 50.0, 50.0]

28.4.4 Empirical performance gains

From the original Self-Consistency paper (Wang et al., 2022), on top of Chain-of-Thought:

Benchmark                       | CoT single | CoT + Self-Consistency | Gain (pp)
GSM8K (math word problems)      | 56.5%      | 74.4%                  | +17.9
SVAMP (arithmetic)              | 68.9%      | 81.6%                  | +12.7
AQuA (algebraic)                | 48.3%      | 57.9%                  | +9.6
StrategyQA (multihop reasoning) | 73.4%      | 81.3%                  | +7.9

These are significant gains from a technique that requires no fine-tuning.

28.4.5 Parameter choices

Parameter         | Recommended          | Rationale
num_samples       | 5-10                 | Diminishing returns above 10; 5 usually captures the distribution
temperature       | 0.5-0.8              | Needs diversity; near-zero produces identical paths
Answer extraction | regex on "Answer: X" | Standardize the format in your prompt

28.4.6 Cost and when to use it

Self-Consistency multiplies your API cost by num_samples. Use it when:

  • The task has a verifiable correct answer (math, logic, code outputs)
  • Accuracy matters more than cost (production code review, financial calculations)
  • The base CoT error rate is already moderate (10-40%); if single-path accuracy is already 95%, Self-Consistency adds little

Optimization trick: run a single CoT pass first. If the model is highly confident and the answer is straightforward, stop. Only invoke Self-Consistency for cases where you detect uncertainty (multiple plausible answers, hedging language in the output).
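The trick above can be sketched as a cheap-first wrapper. Here `sample_answer` stands in for one CoT pass plus answer extraction (like the body of `self_consistency`), and the probe count and escalation policy are illustrative assumptions, not from the paper:

```python
from collections import Counter
from typing import Callable

def adaptive_self_consistency(
    sample_answer: Callable[[], float],
    probe_samples: int = 2,
    full_samples: int = 7,
) -> dict:
    """Cheap-first Self-Consistency.

    Draw a couple of probe samples; if they agree, accept the answer
    and skip the full vote. Escalate to full_samples on disagreement.

    sample_answer: any callable that runs one CoT pass and returns the
    extracted final answer (e.g. a closure over an API call).
    """
    probes = [sample_answer() for _ in range(probe_samples)]
    if len(set(probes)) == 1:
        # Probes agree: no need to pay for the remaining samples.
        return {"answer": probes[0], "samples_used": probe_samples}

    # Probes disagree: draw the rest and take a majority vote.
    answers = probes + [
        sample_answer() for _ in range(full_samples - probe_samples)
    ]
    best, count = Counter(answers).most_common(1)[0]
    return {
        "answer": best,
        "samples_used": full_samples,
        "confidence": count / len(answers),
    }
```

On easy inputs this costs two calls instead of seven; the full price is paid only where the model is actually uncertain.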


28.5 Advanced Techniques

28.5.1 Role Prompting

Assigning a persona activates relevant knowledge patterns and shifts the register of the response.

# Generic prompt
prompt_generic = """
How should we handle database connection pooling in a high-traffic microservice?
"""

# Role-based prompt
prompt_role = """
You are a senior infrastructure engineer who has operated services at 100,000 RPS.
You have strong opinions about failure modes and have been burned by common mistakes.

A junior engineer asks you: "How should we handle database connection pooling
in a high-traffic microservice?"

Give practical advice, including the mistakes you've seen teams make.
"""

Role prompting does not grant the model capabilities it does not have. It does shift the style, the level of detail, the assumptions made about the audience, and the practical weight given to tradeoffs.

Useful roles for engineering contexts:

Task                  | Effective role
Code review           | "Senior engineer who has shipped this exact pattern in production"
Architecture decision | "Experienced architect who has been burned by premature abstraction"
Debugging             | "Engineer who has debugged this class of issue many times"
Documentation         | "Technical writer who values precision and hates ambiguity"

28.5.2 Tree-of-Thought (ToT)

CoT is a single linear reasoning path. Tree-of-Thought explores multiple branches, evaluates each, and backtracks from dead ends.

                    Problem
                       │
          ┌────────────┼────────────┐
          │            │            │
      Approach A   Approach B   Approach C
          │            │            │
      Evaluate     Evaluate     Evaluate
      (score: 7)   (score: 3)   (score: 9)
          │                         │
      Explore                   Explore
          │                         │
    [sub-branches]           [sub-branches]

Core algorithm:

  1. Generate N candidate next-steps from the current state
  2. Evaluate each step's promise (another LLM call or heuristic)
  3. Expand the most promising branch
  4. Backtrack if a branch leads nowhere

A compact implementation (greedy best-first expansion; real backtracking is omitted to keep it readable):

def tree_of_thought(problem: str, depth: int = 3, branches: int = 3) -> str:
    """Simplified Tree-of-Thought."""

    def generate_candidates(context: str, n: int) -> list[str]:
        prompt = f"""
{context}

Generate {n} distinct approaches for the next step. One per line, numbered.
"""
        resp = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8
        )
        return [l.strip() for l in resp.choices[0].message.content.strip().split('\n') if l.strip()]

    def score_candidate(context: str, candidate: str) -> float:
        prompt = f"""
Problem: {problem}
Progress so far: {context}
Proposed next step: {candidate}

Rate this step's promise from 1-10. Reply with only the number.
"""
        resp = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        try:
            return float(resp.choices[0].message.content.strip())
        except ValueError:
            return 0.0

    context = f"Problem: {problem}\n"
    for step_num in range(depth):
        candidates = generate_candidates(context, branches)
        scored = [(c, score_candidate(context, c)) for c in candidates]
        best = max(scored, key=lambda x: x[1])[0]
        context += f"\nStep {step_num + 1}: {best}"

    return context

ToT is most useful for open-ended planning, creative tasks, and multi-constraint optimization where linear reasoning often gets stuck in a local optimum.

28.5.3 Prompt Chaining

Break a complex task into a pipeline of simpler prompts, feeding outputs forward.

def analyze_pull_request(diff: str) -> str:
    """Multi-step PR analysis using prompt chaining."""

    # Step 1: Extract structure
    step1 = call_llm(f"""
Extract from this diff:
1. Files changed
2. Type of change (bug fix / feature / refactor)
3. Any security-sensitive patterns

Diff:
{diff}
""")

    # Step 2: Assess risk using step 1 output
    step2 = call_llm(f"""
Based on this analysis:
{step1}

Rate the review risk (low / medium / high) and explain why.
Focus on: correctness, security, test coverage implications.
""")

    # Step 3: Generate review comment from steps 1 and 2
    step3 = call_llm(f"""
Write a concise PR review comment based on:

Analysis: {step1}
Risk assessment: {step2}

Be direct, specific, and actionable. Skip praise. Raise real concerns.
""")

    return step3

Prompt chaining is cleaner to debug than a single monolithic prompt. Each step has its own output you can inspect. You can also inject human review between steps if the stakes are high.

28.5.4 ReAct: Reasoning and Acting

ReAct (Reasoning and Acting) is the foundational pattern for tool-using agents. The model alternates between thinking and taking an action.

Thought: I need to check the current queue depth before deciding.
Action: check_queue_depth(queue_name="task-processor")
Observation: queue depth = 4,521 messages, consumer_lag = 12 seconds

Thought: The lag is moderate but not critical. I should also check error rates.
Action: get_error_rate(service="task-processor", last_n_minutes=5)
Observation: error_rate = 0.3%, below threshold of 1%

Thought: Queue is accumulating but error rate is fine. The issue is throughput, not errors.
Action: scale_service(service="task-processor", replicas=4)
Observation: Scaling initiated. ETA: 90 seconds.

Thought: Scaling is underway. I should monitor and report back.
Answer: Queue depth was 4,521 with 12s lag. Error rate was healthy at 0.3%.
Scaled task-processor from 2 to 4 replicas. ETA 90 seconds.

The structured Thought/Action/Observation loop keeps the model grounded in real-world state rather than hallucinating actions it cannot take.

def react_agent(task: str, tools: dict) -> str:
    """
    Minimal ReAct agent loop.

    tools: dict mapping tool_name -> callable
    """
    history = [f"Task: {task}"]
    max_steps = 10

    for _ in range(max_steps):
        prompt = "\n".join(history) + "\nThought:"
        response = call_llm(prompt, stop=["Observation:"])

        history.append(f"Thought:{response}")

        # Parse action from response
        if "Action:" in response:
            action_line = [l for l in response.split('\n') if 'Action:' in l][0]
            tool_name, *args = parse_action(action_line)

            if tool_name in tools:
                result = tools[tool_name](*args)
                history.append(f"Observation: {result}")
            else:
                history.append(f"Observation: Unknown tool: {tool_name}")

        if "Answer:" in response:
            return response.split("Answer:")[-1].strip()

    return "Max steps reached."
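The loop relies on a `parse_action` helper that is not shown above; a minimal sketch, assuming tool calls are written as `name(arg="value", ...)` and handling double-quoted string arguments only:

```python
import re

def parse_action(action_line: str) -> tuple[str, ...]:
    """Parse 'Action: tool_name(arg="value", ...)' into (tool_name, *values).

    Deliberately simple: only double-quoted string arguments are
    extracted, in order, and passed positionally by the agent loop.
    Numeric or unquoted arguments would need a richer parser.
    """
    match = re.search(r'Action:\s*(\w+)\((.*)\)', action_line)
    if not match:
        raise ValueError(f"Unparseable action: {action_line!r}")
    tool_name, arg_str = match.group(1), match.group(2)
    values = re.findall(r'"([^"]*)"', arg_str)
    return (tool_name, *values)
```

In production you would instead use the API's native tool-calling support, which returns structured arguments and avoids parsing free text at all.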

28.5.5 Structured output constraints

For programmatic consumption, lock down the format:

# JSON output with explicit schema
prompt = """
Analyze this deployment event and return JSON matching this exact schema:

{
  "severity": "low" | "medium" | "high" | "critical",
  "affected_services": ["service1", "service2"],
  "root_cause": "one sentence",
  "recommended_action": "one sentence",
  "requires_human_review": true | false
}

Event:
[2026-04-24T14:32:01Z] Service payment-processor: latency p99 = 2,340ms (threshold: 500ms)
[2026-04-24T14:32:15Z] Service order-fulfillment: dependency timeout on payment-processor
[2026-04-24T14:32:20Z] Error rate on order-fulfillment: 12% (threshold: 1%)

Return only valid JSON. No explanation, no markdown fences.
"""

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    response_format={"type": "json_object"}  # enforce JSON mode
)

import json
result = json.loads(response.choices[0].message.content)

Best practices for structured output:

  1. Give an exact schema, not just "use JSON"
  2. Show the schema with types or enum values where precision matters
  3. Use response_format={"type": "json_object"} when the API supports it
  4. Validate the parsed output against your schema; do not trust implicit compliance
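Point 4 does not require a validation library; a minimal hand-rolled check against the schema above (the function name is illustrative):

```python
def validate_deploy_analysis(result: dict) -> list[str]:
    """Check the parsed JSON against the schema from the prompt.

    Returns a list of violations; an empty list means the output is usable.
    """
    errors = []
    if result.get("severity") not in {"low", "medium", "high", "critical"}:
        errors.append(f"bad severity: {result.get('severity')!r}")
    if not isinstance(result.get("affected_services"), list):
        errors.append("affected_services must be a list")
    for key in ("root_cause", "recommended_action"):
        value = result.get(key)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{key} must be a non-empty string")
    if not isinstance(result.get("requires_human_review"), bool):
        errors.append("requires_human_review must be a boolean")
    return errors
```

Returning violations rather than raising lets the caller decide whether to retry the model call with the errors appended to the prompt.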

28.6 Tool Use and Agent Prompts

28.6.1 How prompting changed when tools became default

Before tool use, a prompt was: "here is information, produce an answer." After tool use, a prompt is: "here is a task, here are tools you can call, here is when to stop."

The three new responsibilities in an agent prompt:

1. Tool specification: what each tool does, what arguments it accepts, what it returns. Vague tool descriptions produce unreliable tool calls.

2. Decision logic: when to call a tool vs. reason from context vs. ask for clarification. The model needs a decision framework, not just a list of tools.

3. Stopping conditions: when is the task complete? What counts as success? What is out of scope? Without clear boundaries the model loops or goes off-task.

28.6.2 A production-style agent prompt

You are a deployment-health agent for the Shannon production cluster.

## Your job
Monitor alerts, diagnose root causes, take conservative remediation actions, and escalate when uncertain.

## Tools available
- `get_metrics(service, metric, time_range)` - returns time series data
- `get_logs(service, level, last_n_lines)` - returns recent log lines
- `scale_replicas(service, count)` - changes replica count (max: 10)
- `restart_service(service)` - rolling restart of a service
- `page_oncall(team, severity, message)` - pages the on-call team

## Decision rules
1. Diagnose before acting. Call `get_metrics` and `get_logs` before any state change.
2. Conservative escalation: if p99 latency is > 2x baseline but error rate is < 1%, scale up rather than restart.
3. Restart only when error rate is > 5% or logs show fatal exceptions.
4. Page on-call for: data loss risk, cascading failures, any action beyond scaling or restart.
5. If you are uncertain about the root cause after two rounds of investigation, page on-call.

## Stopping conditions
- Issue is resolved (metrics within normal bounds for 2 minutes).
- Action is taken and you are waiting for it to take effect (state this clearly).
- You have escalated to on-call (provide full context in the page).
- Task is outside your authority (state clearly what you cannot do).

## Output format
Always respond with:
Thought: [your reasoning]
Action: [tool call or "escalate" or "done"]
[if Action is a tool call, wait for Observation before continuing]

This is a different genre of writing from a chat prompt. It is a contract between the engineer and the model.

28.6.3 MCP-style tool integration

Modern tool ecosystems like MCP (Model Context Protocol) formalize tool descriptions as structured schemas. The prompt still matters---it defines the high-level policy for when and how to use the tools the schema exposes.

The design principle: tool schemas define capability; prompts define judgment. Put access control in the tool layer. Put decision logic in the prompt. Do not blur those boundaries.
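As a concrete illustration of that split, the `scale_replicas` tool from 28.6.2 might be exposed as a structured schema (shown here in OpenAI function-calling style; MCP tool descriptors use a similar JSON Schema shape), while the decision rules stay in the prompt:

```python
# Capability lives in the schema: what the tool is, what it accepts,
# and the hard replica cap. Judgment (when to scale) stays in the prompt.
scale_replicas_tool = {
    "type": "function",
    "function": {
        "name": "scale_replicas",
        "description": "Change the replica count of a service. Hard-capped at 10.",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {"type": "string", "description": "Service name"},
                "count": {"type": "integer", "minimum": 1, "maximum": 10},
            },
            "required": ["service", "count"],
        },
    },
}
```

The cap belongs both here and in the tool's server-side implementation; never rely on the model honoring a limit stated only in prose.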


28.7 Practical Pitfalls and Best Practices

28.7.1 Common mistakes

Mistake                                | Problem                                                            | Fix
Prompt as a wall of text               | Model emphasizes the end; early instructions get diluted           | Structure with headers; repeat critical constraints at the end
Negative-only instructions             | "Do not hallucinate" does not specify what to do instead           | "If you do not know, say 'I do not have enough information'"
Hiding application logic in the prompt | Prompt bloat, drift between prompt versions and code               | Move hard rules to code; use the prompt for task context
Fixing with more words                 | Adding more verbose instructions often makes things worse          | Add a better example instead
Ignoring temperature                   | Using the default temperature for tasks that need determinism      | Set temperature=0 for code, math, structured output
Too many constraints                   | The model satisfies them in priority order; later ones get dropped | Rank and limit constraints; test what actually matters

28.7.2 Prompt structure that works

def build_prompt(
    role: str,
    context: str,
    task: str,
    examples: list[dict],
    constraints: list[str],
    output_format: str
) -> str:
    """
    Structured prompt builder.

    examples: list of {"input": ..., "output": ...} dicts
    constraints: list of constraint strings (ranked by importance)
    """
    parts = [f"## Role\n{role}", f"## Context\n{context}", f"## Task\n{task}"]

    if examples:
        ex_text = "\n\n".join(
            f"Input: {e['input']}\nOutput: {e['output']}" for e in examples
        )
        parts.append(f"## Examples\n{ex_text}")

    if constraints:
        c_text = "\n".join(f"- {c}" for c in constraints)
        parts.append(f"## Constraints\n{c_text}")

    parts.append(f"## Output format\n{output_format}")
    parts.append("---\nNow process the input:")

    return "\n\n".join(parts)

# Example use
prompt = build_prompt(
    role="Senior technical writer with a preference for concrete examples",
    context="Writing API documentation for a Python SDK",
    task="Generate a docstring for the provided function",
    examples=[
        {
            "input": "def connect(host, port, timeout=30): ...",
            "output": '"""Connect to the server at host:port.\n\n    Args:\n        host: hostname or IP\n        port: port number\n        timeout: seconds before connection attempt fails (default 30)\n\n    Returns:\n        Connection object ready for use\n\n    Raises:\n        ConnectionError: if the server is unreachable\n    """'
        }
    ],
    constraints=[
        "Use Google-style docstrings",
        "Include a Raises section if the function can raise exceptions",
        "Omit obvious information like 'returns None'",
    ],
    output_format="Only the docstring, wrapped in triple quotes."
)

28.7.3 Debugging prompts

When output is wrong, work through this checklist:

  1. Is the prompt ambiguous? Can you read it two ways? The model may be reading it the other way.
  2. Is the format shown? If you need a specific output format, show an example of it.
  3. Is the task too complex? Break it into two prompts.
  4. Is temperature too high? Set it to 0 for diagnostic runs.
  5. Is the model right and your expectation wrong? Run three times and see if the outputs cluster.

A small harness for step 5:

from collections import Counter

def diagnose_prompt(prompt: str, n_runs: int = 5) -> None:
    """Run a prompt multiple times to assess consistency."""
    results = []
    for i in range(n_runs):
        resp = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )
        results.append(resp.choices[0].message.content.strip())

    # Consistency = share of runs that produced the most common output
    counts = Counter(results)
    top_count = counts.most_common(1)[0][1]
    print(f"Prompt consistency: {top_count / n_runs:.0%}")
    print(f"Unique outputs: {len(counts)}")
    for i, r in enumerate(results, 1):
        preview = r[:200] + "..." if len(r) > 200 else r
        print(f"--- Run {i} ---\n{preview}\n")

28.7.4 When to move logic out of the prompt

A prompt is not a good place for:

  • Access control rules ("never do X for user type Y") - put in code
  • Exact string matching ("if the user says exactly 'cancel', do Z") - put in routing logic
  • Multi-step transaction logic - put in orchestration code
  • Rules that change frequently - put in a config file

A prompt is good for:

  • Task framing ("you are helping an engineer review code")
  • Judgment calls ("prefer explicit error handling over silent fallbacks")
  • Tone and format constraints
  • Few-shot examples of desired behavior

28.8 Chapter Summary

28.8.1 Technique comparison

Technique        | Core mechanism               | Best for                        | Typical gain
Zero-shot        | Direct task statement        | Simple, well-understood tasks   | baseline
Few-shot         | Examples as templates        | Format control, classification  | +10-20%
Chain-of-Thought | Visible reasoning trace      | Math, multi-step logic          | +20-40%
Self-Consistency | Majority vote over N samples | High-accuracy requirements      | +10-18%
Role Prompting   | Persona activation           | Tone, depth, domain register    | varies
Tree-of-Thought  | Branching + backtracking     | Open-ended planning             | +10-30%
ReAct            | Reason → Act → Observe loop  | Tool-using agents               | enables new capabilities

28.8.2 Decision flow

Is the task well-understood by the model?
  Yes → Zero-shot. If the format is specific, add one example.
  No  → Few-shot.

Does the task require multi-step reasoning?
  Yes → Add CoT ("step by step" or reasoning examples).

Is accuracy critical and is the answer verifiable?
  Yes → Add Self-Consistency (5-10 samples, majority vote).

Does the task need external data or actions?
  Yes → ReAct with defined tools and stopping conditions.

Is the task open-ended, or does it involve backtracking?
  Yes → Tree-of-Thought.

28.8.3 Key parameters

Parameter                | Good default                        | Notes
temperature              | 0 for math/code, 0.7 for generation | 0 = deterministic, 1.0 = creative
Few-shot examples        | 3-8                                 | Balanced across classes
Self-Consistency samples | 5-10                                | Diminishing returns above 10
CoT trigger              | "Let's think step by step"          | Or show reasoning examples
Max tokens               | set intentionally                   | Default is often too high for structured output

28.8.4 Core takeaway

Prompting is interface design for a next-token predictor. Few-shot examples show the model what correct output looks like. Chain-of-Thought creates space for intermediate computation. Self-Consistency samples multiple paths and takes the majority---cheap insurance on high-stakes outputs. ReAct and tool-use prompts define the agent's authority and stopping conditions. The common thread: every technique works by shaping the context so that the correct answer is the most natural continuation.


Chapter Checklist

After this chapter, you should be able to:

  • Explain why prompting works in terms of next-token prediction.
  • Design few-shot prompts with appropriate example count and diversity.
  • Apply zero-shot and few-shot Chain-of-Thought to reasoning tasks.
  • Implement Self-Consistency with majority voting and explain when it is worth the cost.
  • Describe Tree-of-Thought and the class of problems it addresses.
  • Write a ReAct-style agent prompt with clear tool definitions and stopping conditions.
  • Separate application logic from prompt instructions.
  • Diagnose a poorly performing prompt using the structured checklist.

See You in the Next Chapter

That is prompt engineering. If you can write a Self-Consistency loop from scratch and explain why temperature matters for it, you have the technique down.

Prompt engineering steers a model's behavior at inference time without changing its weights. The next chapter goes deeper: what if you want to change the model's values and preferences, not just its immediate behavior? Chapter 29 covers RLHF and DPO---the training-time alignment methods that made ChatGPT feel like it wants to help rather than just continue text.

Cite this page
Zhang, Wayland (2026). Chapter 28: Prompt Engineering In Practice. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-28-prompt-engineering
@incollection{zhang2026transformer_chapter_28_prompt_engineering,
  author = {Zhang, Wayland},
  title = {Chapter 28: Prompt Engineering In Practice},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-28-prompt-engineering}
}