Prompt engineering is task design for a next-token predictor. Few-shot examples set the format, Chain-of-Thought creates reasoning scaffolding, Self-Consistency samples multiple paths and votes, and modern agent prompts define tool boundaries and stopping conditions.
One-sentence summary: Prompt engineering works because LLMs are next-token predictors---a good prompt creates a context where the correct answer is the most likely continuation; Few-shot, CoT, Self-Consistency, and ReAct are the engineering tools that build that context systematically.
28.1 Why Prompting Still Matters
28.1.1 Same model, different outcomes
Consider asking a model to solve a multi-step scheduling problem.
Prompt A (direct):
A team has sprints of 2 weeks. They have 47 story points of work.
Their velocity is 18 points per sprint. How many sprints to finish?
Prompt B (scaffolded):
A team has sprints of 2 weeks. They have 47 story points of work.
Their velocity is 18 points per sprint.
Let's work through this:
1. How many full sprints cover the work?
2. Is there a partial sprint remaining?
On straightforward arithmetic GPT-class models get both right. On harder problems---multi-hop reasoning, constraint satisfaction, anything requiring several intermediate steps---Prompt B can outperform Prompt A by 20-50 percentage points. The difference is not the model; it is the context you give the model to work from.
28.1.2 The model does not "understand"---it continues
The mental model that makes all prompt techniques click:
An LLM is a conditional probability distribution. Given everything so far, it predicts the most likely next token.
The model is not solving problems in the human sense. It is continuing text. A good prompt is one where the correct answer happens to be the most probable continuation. A bad prompt is one where many continuations are plausible, including wrong ones.
This means:
- Good prompt = well-defined context where the answer is the natural next thing
- Bad prompt = ambiguous context where the model can "complete" in too many ways
28.1.3 Three layers of prompting skill
| Layer | Techniques | Typical improvement |
|---|---|---|
| Basic | Clear wording, format constraints | Reduces ambiguity, stabilizes output |
| Intermediate | Few-shot, CoT, role prompting | +10-40% on complex tasks |
| Advanced | Self-Consistency, ToT, ReAct, agent design | Approaches human expert level on structured tasks |
This chapter covers all three, with working code for each.
28.2 Zero-Shot and Few-Shot Prompting
28.2.1 The three prompt types
| Type | Examples given | When to use |
|---|---|---|
| Zero-shot | 0 | Task is well-understood by the model |
| One-shot | 1 | You need to set the output format |
| Few-shot | 2-10 | Classification, complex formatting, judgment with edge cases |
28.2.2 Zero-shot
Direct request, no examples:
prompt = """
Classify the following pull request comment as: bug, feature, refactor, or question.
Comment: "The retry logic doesn't handle the case where the upstream returns 429 before the connection is established."
Category:
"""
# Model output: bug
Works well when the task is within the model's training distribution and the category names are self-explanatory. Fails when the output format matters or when the categories are domain-specific.
28.2.3 One-shot: setting the format
prompt = """
Classify the following pull request comment. Output exactly one word: bug, feature, refactor, or question.
Example:
Comment: "Can we add a timeout parameter here?"
Category: question
Now classify:
Comment: "The retry logic doesn't handle 429 before connection is established."
Category:
"""
# The example locks in the output format.
One example does a lot of work. It demonstrates the output format, signals what level of detail you want, and gives the model a template to continue.
28.2.4 Few-shot: covering the space
prompt = """
Classify each pull request comment. Output exactly one word: bug, feature, refactor, or question.
Comment: "Can we add a timeout parameter here?"
Category: question
Comment: "Connection pool leaks when the server closes the socket unexpectedly."
Category: bug
Comment: "Extract the validation logic into a separate function."
Category: refactor
Comment: "Add a --dry-run flag so engineers can preview changes."
Category: feature
Now classify:
Comment: "The agent loop exits without flushing the write buffer."
Category:
"""
Few-shot design rules:
- At least one example per class you care about
- 3-8 examples is usually the sweet spot; beyond 10, context cost grows without proportional gain
- Keep examples high quality---bad examples teach bad patterns
- The last example or two have slightly more influence; put your hardest class there
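These rules can be mechanized. A minimal sketch of a few-shot prompt builder (the `build_few_shot_prompt` helper and its example format are illustrative, not from any library):

```python
def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],
    query: str,
) -> str:
    """Assemble a few-shot prompt: instruction, labeled examples, then the query."""
    blocks = [instruction.strip()]
    for comment, category in examples:
        blocks.append(f'Comment: "{comment}"\nCategory: {category}')
    blocks.append(f'Now classify:\nComment: "{query}"\nCategory:')
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    instruction="Classify each pull request comment. Output exactly one word: "
                "bug, feature, refactor, or question.",
    examples=[
        ("Can we add a timeout parameter here?", "question"),
        ("Connection pool leaks when the server closes the socket.", "bug"),
    ],
    query="The agent loop exits without flushing the write buffer.",
)
print(prompt)
```

Keeping examples in a plain data structure makes it easy to reorder them (hardest class last) and to swap sets per task without rewriting the prompt string.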
28.2.5 When to use which
| Scenario | Recommendation |
|---|---|
| Simple generation task (translation, summary) | Zero-shot |
| Strict output format required | One-shot at minimum |
| Multi-class classification | Few-shot with all classes represented |
| Complex reasoning | Few-shot + CoT (next section) |
| Context budget is tight | Zero-shot or one-shot |
import openai
def test_prompts(prompts, test_cases):
"""Compare prompt strategies on the same test cases."""
for name, prompt in prompts.items():
correct = 0
for text, expected in test_cases:
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt + text}],
max_tokens=5
)
answer = response.choices[0].message.content.strip()
if expected.lower() in answer.lower():
correct += 1
        print(f"{name}: {correct}/{len(test_cases)} ({correct / len(test_cases):.0%})")
# Typical result pattern:
# Zero-shot: 75%
# One-shot: 83%
# Few-shot: 91%
28.3 Chain-of-Thought (CoT)
28.3.1 The discovery
In 2022, researchers published a finding that surprised many people:
Adding "Let's think step by step" to a prompt improved accuracy on math and logical reasoning tasks by 20-50 percentage points on models that were already strong.
No additional training. No architectural change. One sentence.
28.3.2 Why it works
Return to the core model: it predicts the next token from context.
Without CoT:
Q: An agent processes 12 requests per minute and has been running for 4 hours and 20 minutes.
How many requests did it handle?
A:
The model jumps from question to answer in one step. For complex arithmetic this means a single token prediction must compress the whole computation. Easy to get wrong.
With CoT:
Q: An agent processes 12 requests per minute and has been running for 4 hours and 20 minutes.
How many requests did it handle?
Let's think step by step:
Now the model generates reasoning: "4 hours = 240 minutes. 240 + 20 = 260 minutes. 260 × 12 = 3,120 requests." Each intermediate result appears in the context and becomes the foundation for the next token. The model is effectively "writing in the margin" before committing to an answer.
Three mechanisms at work:
- Intermediate outputs become inputs. Each computation step is in context for the next.
- Smaller jumps. Breaking a problem into steps reduces the per-step complexity.
- Self-consistency signal. If the visible reasoning contradicts the answer, that tension is detectable.
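As a sanity check, the arithmetic in the trace above is easy to confirm outside the model:

```python
minutes = 4 * 60 + 20    # 4 hours 20 minutes = 260 minutes
requests = minutes * 12  # 12 requests per minute
print(requests)          # 3120
```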
28.3.3 Zero-shot CoT
The minimal version: add a trigger phrase.
prompt_no_cot = """
A deployment pipeline has 5 stages. Each stage takes 8 minutes.
Two stages can run in parallel if they have no dependencies.
Stages 2 and 3 depend on stage 1 and can run in parallel with each other.
What is the minimum time to complete all stages?
Answer:
"""
prompt_with_cot = """
A deployment pipeline has 5 stages. Each stage takes 8 minutes.
Two stages can run in parallel if they have no dependencies.
Stages 2 and 3 depend on stage 1 and can run in parallel with each other.
What is the minimum time to complete all stages?
Let's think step by step:
"""
# Expected CoT output:
# Stage 1 must complete first: 8 minutes.
# Stages 2 and 3 can then run in parallel: 8 minutes.
# Stages 4 and 5 depend on... [continues]
Common zero-shot CoT triggers:
- "Let's think step by step"
- "Think carefully before answering"
- "Work through this systematically"
- "First, ... Then, ... Finally, ..."
28.3.4 Few-shot CoT
More powerful: show the model what a good reasoning trace looks like.
prompt = """
Q: An agent pipeline reads from 3 queues. Queue A delivers 5 msgs/s, Queue B delivers 8 msgs/s, Queue C delivers 3 msgs/s. After 10 seconds, how many messages have arrived?
Step-by-step:
1. Queue A: 5 × 10 = 50 messages
2. Queue B: 8 × 10 = 80 messages
3. Queue C: 3 × 10 = 30 messages
4. Total: 50 + 80 + 30 = 160 messages
Answer: 160 messages
Q: A pull request review cycle: author posts PR (day 0), first review in 1-3 days, fixes in 1 day, final review in 1 day, merge same day. What is the earliest day of merge?
Step-by-step:
1. PR posted: day 0
2. First review: earliest day 1
3. Fixes: day 2
4. Final review: day 3
5. Merge: day 3
Answer: day 3
Q: A team velocity is 22 points per sprint (2 weeks). They have a backlog of 80 points. One engineer goes on vacation for the first sprint, reducing velocity by 20%. How many total weeks until the backlog is cleared?
Step-by-step:
"""
# Model continues with a structured reasoning trace.
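The third question has a checkable answer. A plain-Python verification, assuming a partial sprint still occupies a full sprint on the calendar:

```python
import math

backlog = 80
velocity = 22
first_sprint = velocity * 0.8       # vacation reduces velocity 20% -> 17.6 points
remaining = backlog - first_sprint  # 62.4 points left after sprint 1
more_sprints = math.ceil(remaining / velocity)  # 3 further full-velocity sprints
total_weeks = (1 + more_sprints) * 2            # each sprint is 2 weeks
print(total_weeks)                  # 8
```

Checking the model's trace against code like this is exactly the kind of verifiable setup that Self-Consistency (next section) exploits.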
28.3.5 CoT comparison
| Feature | Zero-shot CoT | Few-shot CoT |
|---|---|---|
| Examples needed | none | 2-8 |
| Reasoning quality | good | better |
| Preparation work | minimal | moderate |
| Context cost | low | medium |
| Best for | quick exploration | production-quality reasoning |
28.3.6 Concrete example: multi-step reasoning
import openai
def solve_with_cot(problem: str) -> str:
prompt = f"""
You are an experienced software architect. Solve the problem below by showing every step.
Use the format: Step N: [calculation or reasoning]
Finish with: Answer: [final answer]
Problem: {problem}
Step 1:"""
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0 # deterministic for math
)
return response.choices[0].message.content
problem = """
A microservice handles 1,000 requests per second at baseline.
During a traffic spike it needs to handle 3x load for 5 minutes,
then 2x load for the next 10 minutes, then returns to baseline.
If each instance can handle 250 RPS and startup takes 90 seconds,
how many extra instances must be running before the spike begins?
"""
print(solve_with_cot(problem))
28.4 Self-Consistency
28.4.1 The problem CoT does not fully solve
CoT dramatically improves accuracy, but a single CoT path can still go wrong. The model samples from a probability distribution. With temperature > 0, different runs produce different reasoning traces---some correct, some not.
Self-Consistency turns this variability from a problem into a strength.
28.4.2 The mechanism
Sample multiple independent reasoning paths for the same problem, then vote:
┌──────────────────────┐
│ Same problem │
└──────────┬───────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Path 1 │ │ Path 2 │ │ Path 3 │
│ Answer: 17 │ │ Answer: 17 │ │ Answer: 15 │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└────────────────────┼────────────────────┘
│
▼
┌──────────────────────┐
│ Majority vote: 17 ✓ │
└──────────────────────┘
Steps:
- Generate N CoT responses with temperature > 0 (diversity is desirable here)
- Extract the final answer from each
- Return the most frequent answer
28.4.3 Code implementation
import openai
from collections import Counter
import re
def self_consistency(
problem: str,
num_samples: int = 5,
temperature: float = 0.7
) -> dict:
"""
Run Self-Consistency over a reasoning problem.
Args:
problem: the problem statement
num_samples: how many reasoning paths to sample
temperature: diversity of samples (0.5-0.8 recommended)
Returns:
dict with answer, confidence, and all sampled answers
"""
prompt = f"""
Solve the following problem step by step.
At the end, write "Answer: X" where X is the numeric answer.
Problem: {problem}
Solution:
"""
answers = []
all_paths = []
for _ in range(num_samples):
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
max_tokens=500
)
text = response.choices[0].message.content
all_paths.append(text)
# Extract the numeric answer
match = re.search(r'Answer:\s*(\d+(?:\.\d+)?)', text)
if match:
answers.append(float(match.group(1)))
if not answers:
return {"answer": None, "confidence": 0, "all_answers": []}
counter = Counter(answers)
best_answer, count = counter.most_common(1)[0]
return {
"answer": best_answer,
"confidence": count / len(answers),
"all_answers": answers,
"sample_paths": all_paths
}
# Usage
problem = """
An agent batch-processes 1,000 files. Each file takes 200ms.
With 4 workers in parallel the effective rate is 4x.
How many seconds to complete the full batch?
"""
result = self_consistency(problem, num_samples=7)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"All sampled answers: {result['all_answers']}")
# Expected output:
# Answer: 50.0
# Confidence: 100%
# All sampled answers: [50.0, 50.0, 50.0, 50.0, 50.0, 50.0, 50.0]
28.4.4 Empirical performance gains
From the original Self-Consistency paper (Wang et al., 2022), on top of Chain-of-Thought:
| Benchmark | CoT single | CoT + Self-Consistency | Gain |
|---|---|---|---|
| GSM8K (math word problems) | 56.5% | 74.4% | +17.9% |
| SVAMP (arithmetic) | 68.9% | 81.6% | +12.7% |
| AQuA (algebraic) | 48.3% | 57.9% | +9.6% |
| StrategyQA (multihop reasoning) | 73.4% | 81.3% | +7.9% |
These are significant gains from a technique that requires no fine-tuning.
28.4.5 Parameter choices
| Parameter | Recommended | Rationale |
|---|---|---|
| num_samples | 5-10 | Diminishing returns above 10; 5 usually captures the distribution |
| temperature | 0.5-0.8 | Needs diversity; near-zero produces identical paths |
| Answer extraction | regex on "Answer: X" | Standardize the format in your prompt |
28.4.6 Cost and when to use it
Self-Consistency multiplies your API cost by num_samples. Use it when:
- The task has a verifiable correct answer (math, logic, code outputs)
- Accuracy matters more than cost (production code review, financial calculations)
- The base CoT error rate is already moderate (10-40%); if single-path accuracy is already 95%, Self-Consistency adds little
Optimization trick: run a single CoT pass first. If the model is highly confident and the answer is straightforward, stop. Only invoke Self-Consistency for cases where you detect uncertainty (multiple plausible answers, hedging language in the output).
28.5 Advanced Techniques
28.5.1 Role Prompting
Assigning a persona activates relevant knowledge patterns and shifts the register of the response.
# Generic prompt
prompt_generic = """
How should we handle database connection pooling in a high-traffic microservice?
"""
# Role-based prompt
prompt_role = """
You are a senior infrastructure engineer who has operated services at 100,000 RPS.
You have strong opinions about failure modes and have been burned by common mistakes.
A junior engineer asks you: "How should we handle database connection pooling
in a high-traffic microservice?"
Give practical advice, including the mistakes you've seen teams make.
"""
Role prompting does not grant the model capabilities it does not have. It does shift the style, the level of detail, the assumptions made about the audience, and the practical weight given to tradeoffs.
Useful roles for engineering contexts:
| Task | Effective role |
|---|---|
| Code review | "Senior engineer who has shipped this exact pattern in production" |
| Architecture decision | "Experienced architect who has been burned by premature abstraction" |
| Debugging | "Engineer who has debugged this class of issue many times" |
| Documentation | "Technical writer who values precision and hates ambiguity" |
28.5.2 Tree-of-Thought (ToT)
CoT is a single linear reasoning path. Tree-of-Thought explores multiple branches, evaluates each, and backtracks from dead ends.
Problem
│
┌────────────┼────────────┐
│ │ │
Approach A Approach B Approach C
│ │ │
Evaluate Evaluate Evaluate
(score: 7) (score: 3) (score: 9)
│ │
Explore Explore
│ │
[sub-branches] [sub-branches]
Core algorithm:
- Generate N candidate next-steps from the current state
- Evaluate each step's promise (another LLM call or heuristic)
- Expand the most promising branch
- Backtrack if a branch leads nowhere
def tree_of_thought(problem: str, depth: int = 3, branches: int = 3) -> str:
"""Simplified Tree-of-Thought."""
def generate_candidates(context: str, n: int) -> list[str]:
prompt = f"""
{context}
Generate {n} distinct approaches for the next step. One per line, numbered.
"""
resp = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0.8
)
return [l.strip() for l in resp.choices[0].message.content.strip().split('\n') if l.strip()]
def score_candidate(context: str, candidate: str) -> float:
prompt = f"""
Problem: {problem}
Progress so far: {context}
Proposed next step: {candidate}
Rate this step's promise from 1-10. Reply with only the number.
"""
resp = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
try:
return float(resp.choices[0].message.content.strip())
except ValueError:
return 0.0
context = f"Problem: {problem}\n"
for step_num in range(depth):
candidates = generate_candidates(context, branches)
scored = [(c, score_candidate(context, c)) for c in candidates]
best = max(scored, key=lambda x: x[1])[0]
context += f"\nStep {step_num + 1}: {best}"
return context
ToT is most useful for open-ended planning, creative tasks, and multi-constraint optimization where linear reasoning often gets stuck in a local optimum.
28.5.3 Prompt Chaining
Break a complex task into a pipeline of simpler prompts, feeding outputs forward.
def analyze_pull_request(diff: str) -> str:
"""Multi-step PR analysis using prompt chaining."""
# Step 1: Extract structure
step1 = call_llm(f"""
Extract from this diff:
1. Files changed
2. Type of change (bug fix / feature / refactor)
3. Any security-sensitive patterns
Diff:
{diff}
""")
# Step 2: Assess risk using step 1 output
step2 = call_llm(f"""
Based on this analysis:
{step1}
Rate the review risk (low / medium / high) and explain why.
Focus on: correctness, security, test coverage implications.
""")
# Step 3: Generate review comment from steps 1 and 2
step3 = call_llm(f"""
Write a concise PR review comment based on:
Analysis: {step1}
Risk assessment: {step2}
Be direct, specific, and actionable. Skip praise. Raise real concerns.
""")
return step3
Prompt chaining is cleaner to debug than a single monolithic prompt. Each step has its own output you can inspect. You can also inject human review between steps if the stakes are high.
28.5.4 ReAct: Reasoning and Acting
ReAct (Reasoning and Acting) is the foundational pattern for tool-using agents. The model alternates between thinking and taking an action.
Thought: I need to check the current queue depth before deciding.
Action: check_queue_depth(queue_name="task-processor")
Observation: queue depth = 4,521 messages, consumer_lag = 12 seconds
Thought: The lag is moderate but not critical. I should also check error rates.
Action: get_error_rate(service="task-processor", last_n_minutes=5)
Observation: error_rate = 0.3%, below threshold of 1%
Thought: Queue is accumulating but error rate is fine. The issue is throughput, not errors.
Action: scale_service(service="task-processor", replicas=4)
Observation: Scaling initiated. ETA: 90 seconds.
Thought: Scaling is underway. I should monitor and report back.
Answer: Queue depth was 4,521 with 12s lag. Error rate was healthy at 0.3%.
Scaled task-processor from 2 to 4 replicas. ETA 90 seconds.
The structured Thought/Action/Observation loop keeps the model grounded in real-world state rather than hallucinating actions it cannot take.
def react_agent(task: str, tools: dict) -> str:
"""
Minimal ReAct agent loop.
tools: dict mapping tool_name -> callable
"""
history = [f"Task: {task}"]
max_steps = 10
for _ in range(max_steps):
prompt = "\n".join(history) + "\nThought:"
response = call_llm(prompt, stop=["Observation:"])
history.append(f"Thought:{response}")
# Parse action from response
if "Action:" in response:
action_line = [l for l in response.split('\n') if 'Action:' in l][0]
tool_name, *args = parse_action(action_line)
if tool_name in tools:
result = tools[tool_name](*args)
history.append(f"Observation: {result}")
else:
history.append(f"Observation: Unknown tool: {tool_name}")
if "Answer:" in response:
return response.split("Answer:")[-1].strip()
return "Max steps reached."
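The loop above leans on a `parse_action` helper left undefined. A minimal sketch for lines like `Action: scale_service(service="task-processor", replicas=4)` — the exact call syntax your model emits is an assumption, so adjust the pattern to match the format your prompt demonstrates:

```python
import re


def parse_action(action_line: str):
    """Parse 'Action: tool(arg1, key=val)' into a flat tuple (tool, arg1, val)."""
    m = re.search(r"Action:\s*(\w+)\((.*)\)", action_line)
    if not m:
        return ("",)
    raw = m.group(2).strip()
    args = []
    for a in raw.split(",") if raw else []:
        a = a.strip()
        if "=" in a:                  # keyword form: keep just the value
            a = a.split("=", 1)[1]
        args.append(a.strip("\"'"))
    return (m.group(1), *args)
```

This naive split breaks on arguments that themselves contain commas; production agents usually have the model emit tool calls as JSON instead of free text to avoid parsing entirely.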
28.5.5 Structured output constraints
For programmatic consumption, lock down the format:
# JSON output with explicit schema
prompt = """
Analyze this deployment event and return JSON matching this exact schema:
{
"severity": "low" | "medium" | "high" | "critical",
"affected_services": ["service1", "service2"],
"root_cause": "one sentence",
"recommended_action": "one sentence",
"requires_human_review": true | false
}
Event:
[2026-04-24T14:32:01Z] Service payment-processor: latency p99 = 2,340ms (threshold: 500ms)
[2026-04-24T14:32:15Z] Service order-fulfillment: dependency timeout on payment-processor
[2026-04-24T14:32:20Z] Error rate on order-fulfillment: 12% (threshold: 1%)
Return only valid JSON. No explanation, no markdown fences.
"""
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0,
response_format={"type": "json_object"} # enforce JSON mode
)
import json
result = json.loads(response.choices[0].message.content)
Best practices for structured output:
- Give an exact schema, not just "use JSON"
- Show the schema with types or enum values where precision matters
- Use `response_format={"type": "json_object"}` when the API supports it
- Validate the parsed output against your schema; do not trust implicit compliance
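The validation step can be done in plain Python; a sketch for the schema above (no third-party validator assumed, and the error messages are illustrative):

```python
SEVERITIES = {"low", "medium", "high", "critical"}


def validate_event_analysis(data: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the output is valid."""
    errors = []
    if data.get("severity") not in SEVERITIES:
        errors.append(f"severity must be one of {sorted(SEVERITIES)}")
    services = data.get("affected_services")
    if not isinstance(services, list) or not all(isinstance(s, str) for s in services):
        errors.append("affected_services must be a list of strings")
    for key in ("root_cause", "recommended_action"):
        if not isinstance(data.get(key), str):
            errors.append(f"{key} must be a string")
    if not isinstance(data.get("requires_human_review"), bool):
        errors.append("requires_human_review must be a boolean")
    return errors
```

On violation, either retry the model call with the errors appended to the prompt or fail loudly; silently accepting malformed output is how bad data enters downstream systems.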
28.6 Tool Use and Agent Prompts
28.6.1 How prompting changed when tools became default
Before tool use, a prompt was: "here is information, produce an answer." After tool use, a prompt is: "here is a task, here are tools you can call, here is when to stop."
The three new responsibilities in an agent prompt:
1. Tool specification: what each tool does, what arguments it accepts, what it returns. Vague tool descriptions produce unreliable tool calls.
2. Decision logic: when to call a tool vs. reason from context vs. ask for clarification. The model needs a decision framework, not just a list of tools.
3. Stopping conditions: when is the task complete? What counts as success? What is out of scope? Without clear boundaries the model loops or goes off-task.
28.6.2 A production-style agent prompt
You are a deployment-health agent for the Shannon production cluster.
## Your job
Monitor alerts, diagnose root causes, take conservative remediation actions, and escalate when uncertain.
## Tools available
- `get_metrics(service, metric, time_range)` - returns time series data
- `get_logs(service, level, last_n_lines)` - returns recent log lines
- `scale_replicas(service, count)` - changes replica count (max: 10)
- `restart_service(service)` - rolling restart of a service
- `page_oncall(team, severity, message)` - pages the on-call team
## Decision rules
1. Diagnose before acting. Call `get_metrics` and `get_logs` before any state change.
2. Conservative escalation: if p99 latency is > 2x baseline but error rate is < 1%, scale up rather than restart.
3. Restart only when error rate is > 5% or logs show fatal exceptions.
4. Page on-call for: data loss risk, cascading failures, any action beyond scaling or restart.
5. If you are uncertain about the root cause after two rounds of investigation, page on-call.
## Stopping conditions
- Issue is resolved (metrics within normal bounds for 2 minutes).
- Action is taken and you are waiting for it to take effect (state this clearly).
- You have escalated to on-call (provide full context in the page).
- Task is outside your authority (state clearly what you cannot do).
## Output format
Always respond with:
Thought: [your reasoning]
Action: [tool call or "escalate" or "done"]
[if Action is a tool call, wait for Observation before continuing]
This is a different genre of writing from a chat prompt. It is a contract between the engineer and the model.
28.6.3 MCP-style tool integration
Modern tool ecosystems like MCP (Model Context Protocol) formalize tool descriptions as structured schemas. The prompt still matters---it defines the high-level policy for when and how to use the tools the schema exposes.
The design principle: tool schemas define capability; prompts define judgment. Put access control in the tool layer. Put decision logic in the prompt. Do not blur those boundaries.
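The split looks like this in practice. The tool declaration below is an illustrative MCP-style shape (not the exact protocol wire format): hard limits live in the schema, judgment lives in the prompt.

```python
# Capability: declared once, enforced by the tool layer
scale_replicas_tool = {
    "name": "scale_replicas",
    "description": "Change the replica count of a service.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "service": {"type": "string"},
            "count": {"type": "integer", "minimum": 1, "maximum": 10},
        },
        "required": ["service", "count"],
    },
}

# Judgment: stated in the agent prompt, not encoded in the schema
policy = "Diagnose before acting; scale up rather than restart when error rate is low."
```

If the model requests `count=50`, the schema rejects it regardless of what the prompt says; the prompt's job is to make the model rarely want to.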
28.7 Practical Pitfalls and Best Practices
28.7.1 Common mistakes
| Mistake | Problem | Fix |
|---|---|---|
| Prompt as a wall of text | Model emphasizes the end; early instructions get diluted | Structure with headers; repeat critical constraints at the end |
| Negative-only instructions | "Do not hallucinate" does not specify what to do instead | "If you do not know, say 'I do not have enough information'" |
| Hiding application logic in the prompt | Prompt bloat, drift between prompt versions and code | Move hard rules to code; use prompt for task context |
| Fixing with more words | Adding more verbose instructions often makes things worse | Add a better example instead |
| Ignoring temperature | Using default temp for tasks that need determinism | Set temperature=0 for code, math, structured output |
| Too many constraints | The model satisfies them in priority order; later ones get dropped | Rank and limit constraints; test what actually matters |
28.7.2 Prompt structure that works
def build_prompt(
role: str,
context: str,
task: str,
examples: list[dict],
constraints: list[str],
output_format: str
) -> str:
"""
Structured prompt builder.
examples: list of {"input": ..., "output": ...} dicts
constraints: list of constraint strings (ranked by importance)
"""
parts = [f"## Role\n{role}", f"## Context\n{context}", f"## Task\n{task}"]
if examples:
ex_text = "\n\n".join(
f"Input: {e['input']}\nOutput: {e['output']}" for e in examples
)
parts.append(f"## Examples\n{ex_text}")
if constraints:
c_text = "\n".join(f"- {c}" for c in constraints)
parts.append(f"## Constraints\n{c_text}")
parts.append(f"## Output format\n{output_format}")
parts.append("---\nNow process the input:")
return "\n\n".join(parts)
# Example use
prompt = build_prompt(
role="Senior technical writer with a preference for concrete examples",
context="Writing API documentation for a Python SDK",
task="Generate a docstring for the provided function",
examples=[
{
"input": "def connect(host, port, timeout=30): ...",
"output": '"""Connect to the server at host:port.\n\n Args:\n host: hostname or IP\n port: port number\n timeout: seconds before connection attempt fails (default 30)\n\n Returns:\n Connection object ready for use\n\n Raises:\n ConnectionError: if the server is unreachable\n """'
}
],
constraints=[
"Use Google-style docstrings",
"Include a Raises section if the function can raise exceptions",
"Omit obvious information like 'returns None'",
],
output_format="Only the docstring, wrapped in triple quotes."
)
28.7.3 Debugging prompts
When output is wrong, work through this checklist:
- Is the prompt ambiguous? Can you read it two ways? The model may be reading it the other way.
- Is the format shown? If you need a specific output format, show an example of it.
- Is the task too complex? Break it into two prompts.
- Is temperature too high? Set it to 0 for diagnostic runs.
- Is the model right and your expectation wrong? Run three times and see if the outputs cluster.
def diagnose_prompt(prompt: str, n_runs: int = 5) -> None:
"""Run a prompt multiple times to assess consistency."""
results = []
for i in range(n_runs):
resp = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0.3
)
results.append(resp.choices[0].message.content.strip())
    unique = set(results)
    # Consistency = share of runs that produced the most common output
    agreement = max(results.count(r) for r in unique)
    print(f"Prompt consistency: {agreement / n_runs:.0%}")
    print(f"Unique outputs: {len(unique)}")
for i, r in enumerate(results, 1):
preview = r[:200] + "..." if len(r) > 200 else r
print(f"--- Run {i} ---\n{preview}\n")
28.7.4 When to move logic out of the prompt
A prompt is not a good place for:
- Access control rules ("never do X for user type Y") - put in code
- Exact string matching ("if the user says exactly 'cancel', do Z") - put in routing logic
- Multi-step transaction logic - put in orchestration code
- Rules that change frequently - put in a config file
A prompt is good for:
- Task framing ("you are helping an engineer review code")
- Judgment calls ("prefer explicit error handling over silent fallbacks")
- Tone and format constraints
- Few-shot examples of desired behavior
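The split above can be made concrete with a routing layer. A sketch with an injectable `llm` callable (the command names are illustrative): exact-match rules run in code, and only genuine judgment calls reach the model.

```python
def handle_message(text: str, llm) -> str:
    """Route exact-match commands in code; send everything else to the model."""
    command = text.strip().lower()
    if command == "cancel":      # exact-string rule belongs in code...
        return "Cancelled."      # ...not in the prompt
    if command in {"help", "status"}:
        return f"Running built-in handler for '{command}'."
    # Only open-ended requests reach the model
    return llm(f"You are helping an engineer review code.\nUser: {text}")
```

This keeps the hard rules testable with ordinary unit tests and keeps the prompt short.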
28.8 Chapter Summary
28.8.1 Technique comparison
| Technique | Core mechanism | Best for | Typical gain |
|---|---|---|---|
| Zero-shot | Direct task statement | Simple, well-understood tasks | baseline |
| Few-shot | Examples as templates | Format control, classification | +10-20% |
| Chain-of-Thought | Visible reasoning trace | Math, multi-step logic | +20-40% |
| Self-Consistency | Majority vote over N samples | High-accuracy requirements | +10-18% |
| Role Prompting | Persona activation | Tone, depth, domain register | varies |
| Tree-of-Thought | Branching + backtracking | Open-ended planning | +10-30% |
| ReAct | Reason → Act → Observe loop | Tool-using agents | enables new capabilities |
28.8.2 Decision flow
Is the task well-understood by the model?
Yes → Zero-shot. If format is specific, add one example.
No → Few-shot.
Does the task require multi-step reasoning?
Yes → Add CoT ("step by step" or reasoning examples).
Is accuracy critical and is the answer verifiable?
Yes → Add Self-Consistency (5-10 samples, majority vote).
Does the task need external data or actions?
Yes → ReAct with defined tools and stopping conditions.
Is the task open-ended or involves backtracking?
Yes → Tree-of-Thought.
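The decision flow above can be written as a small function. A sketch (the question names and returned labels are illustrative, and techniques stack rather than exclude each other):

```python
def choose_technique(
    well_understood: bool,
    multi_step: bool,
    accuracy_critical: bool,
    verifiable: bool,
    needs_tools: bool,
    open_ended: bool,
) -> list[str]:
    """Translate the decision flow into a stack of techniques to combine."""
    stack = ["zero-shot" if well_understood else "few-shot"]
    if multi_step:
        stack.append("chain-of-thought")
    if accuracy_critical and verifiable:
        stack.append("self-consistency")
    if needs_tools:
        stack.append("react")
    if open_ended:
        stack.append("tree-of-thought")
    return stack
```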
28.8.3 Key parameters
| Parameter | Good default | Notes |
|---|---|---|
| temperature | 0 for math/code, 0.7 for generation | 0 = deterministic, 1.0 = creative |
| Few-shot examples | 3-8 | Balanced across classes |
| Self-Consistency samples | 5-10 | Diminishing returns above 10 |
| CoT trigger | "Let's think step by step" | Or show reasoning examples |
| Max tokens | set intentionally | Default is often too high for structured output |
28.8.4 Core takeaway
Prompting is interface design for a next-token predictor. Few-shot examples show the model what correct output looks like. Chain-of-Thought creates space for intermediate computation. Self-Consistency samples multiple paths and takes the majority---cheap insurance on high-stakes outputs. ReAct and tool-use prompts define the agent's authority and stopping conditions. The common thread: every technique works by shaping the context so that the correct answer is the most natural continuation.
Chapter Checklist
After this chapter, you should be able to:
- Explain why prompting works in terms of next-token prediction.
- Design few-shot prompts with appropriate example count and diversity.
- Apply zero-shot and few-shot Chain-of-Thought to reasoning tasks.
- Implement Self-Consistency with majority voting and explain when it is worth the cost.
- Describe Tree-of-Thought and the class of problems it addresses.
- Write a ReAct-style agent prompt with clear tool definitions and stopping conditions.
- Separate application logic from prompt instructions.
- Diagnose a poorly performing prompt using the structured checklist.
See You in the Next Chapter
That is prompt engineering. If you can write a Self-Consistency loop from scratch and explain why temperature matters for it, you have the technique down.
Prompt engineering steers a model's behavior at inference time without changing its weights. The next chapter goes deeper: what if you want to change the model's values and preferences, not just its immediate behavior? Chapter 29 covers RLHF and DPO---the training-time alignment methods that made ChatGPT feel like it wants to help rather than just continue text.