it is task design for a next-token predictor. Few-shot examples set the format, Chain-of-Thought creates reasoning scaffolding, Self-Consistency samples multiple paths and votes, and modern agent prompts define tool boundaries and stopping conditions.

One-sentence summary: Prompt engineering works because LLMs are next-token predictors---a good prompt creates a context where the correct answer is the most likely continuation; Few-shot, CoT, Self-Consistency, and ReAct are the engineering tools that build that context systematically.


28.1 Why Prompting Still Matters

28.1.1 Same model, different outcomes

Consider asking a model to solve a multi-step scheduling problem.

Prompt A (direct):

A team has sprints of 2 weeks. They have 47 story points of work.
Their velocity is 18 points per sprint. How many sprints to finish?

Prompt B (scaffolded):

A team has sprints of 2 weeks. They have 47 story points of work.
Their velocity is 18 points per sprint.

Let's work through this:
1. How many full sprints cover the work?
2. Is there a partial sprint remaining?

On straightforward arithmetic GPT-class models get both right. On harder problems---multi-hop reasoning, constraint satisfaction, anything requiring several intermediate steps---Prompt B can outperform Prompt A by 20-50 percentage points. The difference is not the model; it is the context you give the model to work from.

28.1.2 The model does not "understand"---it continues

The mental model that makes all prompt techniques click:

An LLM is a conditional probability distribution. Given everything so far, it predicts the most likely next token.

The model is not solving problems in the human sense. It is continuing text. A good prompt is one where the correct answer happens to be the most probable continuation. A bad prompt is one where many continuations are plausible, including wrong ones.

This means:

  • Good prompt = well-defined context where the answer is the natural next thing
  • Bad prompt = ambiguous context where the model can "complete" in too many ways

28.1.3 Three layers of prompting skill

Layer        | Techniques                                 | Typical improvement
Basic        | Clear wording, format constraints          | Reduces ambiguity, stabilizes output
Intermediate | Few-shot, CoT, role prompting              | +10-40% on complex tasks
Advanced     | Self-Consistency, ToT, ReAct, agent design | Approaches human expert level on structured tasks

This chapter covers all three, with working code for each.


28.2 Zero-Shot and Few-Shot Prompting

28.2.1 The three prompt types

Type      | Examples given | When to use
Zero-shot | 0              | Task is well-understood by the model
One-shot  | 1              | You need to set the output format
Few-shot  | 2-10           | Classification, complex formatting, judgment with edge cases

28.2.2 Zero-shot

Direct request, no examples:

prompt = """
Classify the following pull request comment as: bug, feature, refactor, or question.

Comment: "The retry logic doesn't handle the case where the upstream returns 429 before the connection is established."
Category:
"""
# Model output: bug

Works well when the task is within the model's training distribution and the category names are self-explanatory. Fails when the output format matters or when the categories are domain-specific.

28.2.3 One-shot: setting the format

prompt = """
Classify the following pull request comment. Output exactly one word: bug, feature, refactor, or question.

Example:
Comment: "Can we add a timeout parameter here?"
Category: question

Now classify:
Comment: "The retry logic doesn't handle 429 before connection is established."
Category:
"""
# The example locks in the output format.

One example does a lot of work. It demonstrates the output format, signals what level of detail you want, and gives the model a template to continue.

28.2.4 Few-shot: covering the space

prompt = """
Classify each pull request comment. Output exactly one word: bug, feature, refactor, or question.

Comment: "Can we add a timeout parameter here?"
Category: question

Comment: "Connection pool leaks when the server closes the socket unexpectedly."
Category: bug

Comment: "Extract the validation logic into a separate function."
Category: refactor

Comment: "Add a --dry-run flag so engineers can preview changes."
Category: feature

Now classify:
Comment: "The agent loop exits without flushing the write buffer."
Category:
"""

Few-shot design rules:

  1. At least one example per class you care about
  2. 3-8 examples is usually the sweet spot; beyond 10, context cost grows without proportional gain
  3. Keep examples high quality---bad examples teach bad patterns
  4. The last example or two have slightly more influence; put your hardest class there
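These rules can be folded into a small builder; a minimal sketch (the function name and the (comment, label) example format are illustrative, not from any library):

```python
def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],  # (comment, label) pairs
    query: str,
) -> str:
    """Assemble a few-shot classification prompt from labeled examples.

    Per the rules above: keep 3-8 examples, cover every class,
    and put the hardest class last.
    """
    blocks = [instruction, ""]
    for comment, label in examples:
        blocks.append(f'Comment: "{comment}"\nCategory: {label}\n')
    blocks.append(f'Now classify:\nComment: "{query}"\nCategory:')
    return "\n".join(blocks)

prompt = build_few_shot_prompt(
    "Classify each pull request comment. Output exactly one word: "
    "bug, feature, refactor, or question.",
    [
        ("Can we add a timeout parameter here?", "question"),
        ("Connection pool leaks when the socket closes.", "bug"),
        ("Extract the validation logic into a function.", "refactor"),
    ],
    "The agent loop exits without flushing the write buffer.",
)
```

Keeping the builder separate from the example data also makes it easy to swap example sets per class distribution without touching the prompt scaffolding.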

28.2.5 When to use which

Scenario                                      | Recommendation
Simple generation task (translation, summary) | Zero-shot
Strict output format required                 | One-shot at minimum
Multi-class classification                    | Few-shot with all classes represented
Complex reasoning                             | Few-shot + CoT (next section)
Context budget is tight                       | Zero-shot or one-shot

A quick harness for comparing the strategies on a held-out set:

import openai

def test_prompts(prompts, test_cases):
    """Compare prompt strategies on the same test cases."""
    for name, prompt in prompts.items():
        correct = 0
        for text, expected in test_cases:
            response = openai.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt + text}],
                max_tokens=5
            )
            answer = response.choices[0].message.content.strip()
            if expected.lower() in answer.lower():
                correct += 1
        print(f"{name}: {correct}/{len(test_cases)}")

# Typical result pattern:
# Zero-shot:  75%
# One-shot:   83%
# Few-shot:   91%

28.3 Chain-of-Thought (CoT)

28.3.1 The discovery

In 2022 Google researchers published a finding that surprised many people:

Adding "Let's think step by step" to a prompt improved accuracy on math and logical reasoning tasks by 20-50 percentage points on models that were already strong.

No additional training. No architectural change. One sentence.

28.3.2 Why it works

Return to the core model: it predicts the next token from context.

Without CoT:

Q: An agent processes 12 requests per minute and has been running for 4 hours and 20 minutes.
   How many requests did it handle?
A:

The model jumps from question to answer in one step. For complex arithmetic this means a single token prediction must compress the whole computation. Easy to get wrong.

With CoT:

Q: An agent processes 12 requests per minute and has been running for 4 hours and 20 minutes.
   How many requests did it handle?
Let's think step by step:

Now the model generates reasoning: "4 hours = 240 minutes. 240 + 20 = 260 minutes. 260 × 12 = 3,120 requests." Each intermediate result appears in the context and becomes the foundation for the next token. The model is effectively "writing in the margin" before committing to an answer.

Three mechanisms at work:

  1. Intermediate outputs become inputs. Each computation step is in context for the next.
  2. Smaller jumps. Breaking a problem into steps reduces the per-step complexity.
  3. Self-consistency signal. If the visible reasoning contradicts the answer, that tension is detectable.

28.3.3 Zero-shot CoT

The minimal version: add a trigger phrase.

prompt_no_cot = """
A deployment pipeline has 5 stages. Each stage takes 8 minutes.
Two stages can run in parallel if they have no dependencies.
Stages 2 and 3 depend on stage 1 and can run in parallel with each other.
What is the minimum time to complete all stages?

Answer:
"""

prompt_with_cot = """
A deployment pipeline has 5 stages. Each stage takes 8 minutes.
Two stages can run in parallel if they have no dependencies.
Stages 2 and 3 depend on stage 1 and can run in parallel with each other.
What is the minimum time to complete all stages?

Let's think step by step:
"""

# Expected CoT output:
# Stage 1 must complete first: 8 minutes.
# Stages 2 and 3 can then run in parallel: 8 minutes.
# Stages 4 and 5 depend on... [continues]

Common zero-shot CoT triggers:

  • "Let's think step by step"
  • "Think carefully before answering"
  • "Work through this systematically"
  • "First, ... Then, ... Finally, ..."

28.3.4 Few-shot CoT

More powerful: show the model what a good reasoning trace looks like.

prompt = """
Q: An agent pipeline reads from 3 queues. Queue A delivers 5 msgs/s, Queue B delivers 8 msgs/s, Queue C delivers 3 msgs/s. After 10 seconds, how many messages have arrived?
Step-by-step:
1. Queue A: 5 × 10 = 50 messages
2. Queue B: 8 × 10 = 80 messages
3. Queue C: 3 × 10 = 30 messages
4. Total: 50 + 80 + 30 = 160 messages
Answer: 160 messages

Q: A pull request review cycle: author posts PR (day 0), first review in 1-3 days, fixes in 1 day, final review in 1 day, merge same day. What is the earliest day of merge?
Step-by-step:
1. PR posted: day 0
2. First review: earliest day 1
3. Fixes: day 2
4. Final review: day 3
5. Merge: day 3
Answer: day 3

Q: A team velocity is 22 points per sprint (2 weeks). They have a backlog of 80 points. One engineer goes on vacation for the first sprint, reducing velocity by 20%. How many total weeks until the backlog is cleared?
Step-by-step:
"""

# Model continues with a structured reasoning trace.

28.3.5 CoT comparison

Feature           | Zero-shot CoT     | Few-shot CoT
Examples needed   | none              | 2-8
Reasoning quality | good              | better
Preparation work  | minimal           | moderate
Context cost      | low               | medium
Best for          | quick exploration | production-quality reasoning

28.3.6 Concrete example: multi-step reasoning

import openai

def solve_with_cot(problem: str) -> str:
    prompt = f"""
You are an experienced software architect. Solve the problem below by showing every step.
Use the format: Step N: [calculation or reasoning]
Finish with: Answer: [final answer]

Problem: {problem}

Step 1:"""

    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0   # deterministic for math
    )
    return response.choices[0].message.content

problem = """
A microservice handles 1,000 requests per second at baseline.
During a traffic spike it needs to handle 3x load for 5 minutes,
then 2x load for the next 10 minutes, then returns to baseline.
If each instance can handle 250 RPS and startup takes 90 seconds,
how many extra instances must be running before the spike begins?
"""
print(solve_with_cot(problem))

28.4 Self-Consistency

28.4.1 The problem CoT does not fully solve

CoT dramatically improves accuracy, but a single CoT path can still go wrong. The model samples from a probability distribution. With temperature > 0, different runs produce different reasoning traces---some correct, some not.

Self-Consistency turns this variability from a problem into a strength.

28.4.2 The mechanism

Sample multiple independent reasoning paths for the same problem, then vote:

                  ┌──────────────────────┐
                  │     Same problem     │
                  └──────────┬───────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│    Path 1    │    │    Path 2    │    │    Path 3    │
│  Answer: 17  │    │  Answer: 17  │    │  Answer: 15  │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                           ▼
                  ┌──────────────────────┐
                  │  Majority vote: 17   │
                  └──────────────────────┘

Steps:

  1. Generate N CoT responses with temperature > 0 (diversity is desirable here)
  2. Extract the final answer from each
  3. Return the most frequent answer

28.4.3 Code implementation

import openai
from collections import Counter
import re

def self_consistency(
    problem: str,
    num_samples: int = 5,
    temperature: float = 0.7
) -> dict:
    """
    Run Self-Consistency over a reasoning problem.

    Args:
        problem:      the problem statement
        num_samples:  how many reasoning paths to sample
        temperature:  diversity of samples (0.5-0.8 recommended)

    Returns:
        dict with answer, confidence, and all sampled answers
    """
    prompt = f"""
Solve the following problem step by step.
At the end, write "Answer: X" where X is the numeric answer.

Problem: {problem}

Solution:
"""

    answers = []
    all_paths = []

    for _ in range(num_samples):
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=500
        )
        text = response.choices[0].message.content
        all_paths.append(text)

        # Extract the numeric answer
        match = re.search(r'Answer:\s*(\d+(?:\.\d+)?)', text)
        if match:
            answers.append(float(match.group(1)))

    if not answers:
        return {"answer": None, "confidence": 0, "all_answers": []}

    counter = Counter(answers)
    best_answer, count = counter.most_common(1)[0]

    return {
        "answer": best_answer,
        "confidence": count / len(answers),
        "all_answers": answers,
        "sample_paths": all_paths
    }

# Usage
problem = """
An agent batch-processes 1,000 files. Each file takes 200ms.
With 4 workers in parallel the effective rate is 4x.
How many seconds to complete the full batch?
"""

result = self_consistency(problem, num_samples=7)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"All sampled answers: {result['all_answers']}")

# Expected output:
# Answer: 50.0
# Confidence: 100%
# All sampled answers: [50.0, 50.0, 50.0, 50.0, 50.0, 50.0, 50.0]

28.4.4 Empirical performance gains

From the original Self-Consistency paper (Wang et al., 2022), on top of Chain-of-Thought:

Benchmark                       | CoT single | CoT + Self-Consistency | Gain (pp)
GSM8K (math word problems)      | 56.5%      | 74.4%                  | +17.9
SVAMP (arithmetic)              | 68.9%      | 81.6%                  | +12.7
AQuA (algebraic)                | 48.3%      | 57.9%                  | +9.6
StrategyQA (multihop reasoning) | 73.4%      | 81.3%                  | +7.9

These are significant gains from a technique that requires no fine-tuning.

28.4.5 Parameter choices

Parameter         | Recommended          | Rationale
num_samples       | 5-10                 | Diminishing returns above 10; 5 usually captures the distribution
temperature       | 0.5-0.8              | Needs diversity; near-zero produces identical paths
Answer extraction | regex on "Answer: X" | Standardize the format in your prompt

28.4.6 Cost and when to use it

Self-Consistency multiplies your API cost by num_samples. Use it when:

  • The task has a verifiable correct answer (math, logic, code outputs)
  • Accuracy matters more than cost (production code review, financial calculations)
  • The base CoT error rate is already moderate (10-40%); if single-path accuracy is already 95%, Self-Consistency adds little

Optimization trick: run a single CoT pass first. If the model is highly confident and the answer is straightforward, stop. Only invoke Self-Consistency for cases where you detect uncertainty (multiple plausible answers, hedging language in the output).
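The trick above can be sketched as a cheap-first wrapper. Here `sample_answer` stands in for one CoT pass plus answer extraction (like the body of `self_consistency`), and the probe count and escalation policy are illustrative assumptions, not from the paper:

```python
from collections import Counter
from typing import Callable

def adaptive_self_consistency(
    sample_answer: Callable[[], float],
    probe_samples: int = 2,
    full_samples: int = 7,
) -> dict:
    """Cheap-first Self-Consistency.

    Draw a couple of probe samples; if they agree, accept the answer
    and skip the full vote. Escalate to full_samples on disagreement.

    sample_answer: any callable that runs one CoT pass and returns the
    extracted final answer (e.g. a closure over an API call).
    """
    probes = [sample_answer() for _ in range(probe_samples)]
    if len(set(probes)) == 1:
        # Probes agree: no need to pay for the remaining samples.
        return {"answer": probes[0], "samples_used": probe_samples}

    # Probes disagree: draw the rest and take a majority vote.
    answers = probes + [
        sample_answer() for _ in range(full_samples - probe_samples)
    ]
    best, count = Counter(answers).most_common(1)[0]
    return {
        "answer": best,
        "samples_used": full_samples,
        "confidence": count / len(answers),
    }
```

On easy inputs this costs two calls instead of seven; the full price is paid only where the model is actually uncertain.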


28.5 Advanced Techniques

28.5.1 Role Prompting

Assigning a persona activates relevant knowledge patterns and shifts the register of the response.

# Generic prompt
prompt_generic = """
How should we handle database connection pooling in a high-traffic microservice?
"""

# Role-based prompt
prompt_role = """
You are a senior infrastructure engineer who has operated services at 100,000 RPS.
You have strong opinions about failure modes and have been burned by common mistakes.

A junior engineer asks you: "How should we handle database connection pooling
in a high-traffic microservice?"

Give practical advice, including the mistakes you've seen teams make.
"""

Role prompting does not grant the model capabilities it does not have. It does shift the style, the level of detail, the assumptions made about the audience, and the practical weight given to tradeoffs.

Useful roles for engineering contexts:

Task                  | Effective role
Code review           | "Senior engineer who has shipped this exact pattern in production"
Architecture decision | "Experienced architect who has been burned by premature abstraction"
Debugging             | "Engineer who has debugged this class of issue many times"
Documentation         | "Technical writer who values precision and hates ambiguity"

28.5.2 Tree-of-Thought (ToT)

CoT is a single linear reasoning path. Tree-of-Thought explores multiple branches, evaluates each, and backtracks from dead ends.

                    Problem
                       │
          ┌────────────┼────────────┐
          │            │            │
      Approach A   Approach B   Approach C
          │            │            │
      Evaluate     Evaluate     Evaluate
      (score: 7)   (score: 3)   (score: 9)
          │                         │
      Explore                   Explore
          │                         │
    [sub-branches]           [sub-branches]

Core algorithm:

  1. Generate N candidate next-steps from the current state
  2. Evaluate each step's promise (another LLM call or heuristic)
  3. Expand the most promising branch
  4. Backtrack if a branch leads nowhere

A compact implementation (greedy best-first expansion; real backtracking is omitted to keep it readable):

def tree_of_thought(problem: str, depth: int = 3, branches: int = 3) -> str:
    """Simplified Tree-of-Thought."""

    def generate_candidates(context: str, n: int) -> list[str]:
        prompt = f"""
{context}

Generate {n} distinct approaches for the next step. One per line, numbered.
"""
        resp = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8
        )
        return [l.strip() for l in resp.choices[0].message.content.strip().split('\n') if l.strip()]

    def score_candidate(context: str, candidate: str) -> float:
        prompt = f"""
Problem: {problem}
Progress so far: {context}
Proposed next step: {candidate}

Rate this step's promise from 1-10. Reply with only the number.
"""
        resp = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        try:
            return float(resp.choices[0].message.content.strip())
        except ValueError:
            return 0.0

    context = f"Problem: {problem}\n"
    for step_num in range(depth):
        candidates = generate_candidates(context, branches)
        scored = [(c, score_candidate(context, c)) for c in candidates]
        best = max(scored, key=lambda x: x[1])[0]
        context += f"\nStep {step_num + 1}: {best}"

    return context

ToT is most useful for open-ended planning, creative tasks, and multi-constraint optimization where linear reasoning often gets stuck in a local optimum.

28.5.3 Prompt Chaining

Break a complex task into a pipeline of simpler prompts, feeding outputs forward.

def analyze_pull_request(diff: str) -> str:
    """Multi-step PR analysis using prompt chaining."""

    # Step 1: Extract structure
    step1 = call_llm(f"""
Extract from this diff:
1. Files changed
2. Type of change (bug fix / feature / refactor)
3. Any security-sensitive patterns

Diff:
{diff}
""")

    # Step 2: Assess risk using step 1 output
    step2 = call_llm(f"""
Based on this analysis:
{step1}

Rate the review risk (low / medium / high) and explain why.
Focus on: correctness, security, test coverage implications.
""")

    # Step 3: Generate review comment from steps 1 and 2
    step3 = call_llm(f"""
Write a concise PR review comment based on:

Analysis: {step1}
Risk assessment: {step2}

Be direct, specific, and actionable. Skip praise. Raise real concerns.
""")

    return step3

Prompt chaining is cleaner to debug than a single monolithic prompt. Each step has its own output you can inspect. You can also inject human review between steps if the stakes are high.

28.5.4 ReAct: Reasoning and Acting

ReAct (Reasoning and Acting) is the foundational pattern for tool-using agents. The model alternates between thinking and taking an action.

Thought: I need to check the current queue depth before deciding.
Action: check_queue_depth(queue_name="task-processor")
Observation: queue depth = 4,521 messages, consumer_lag = 12 seconds

Thought: The lag is moderate but not critical. I should also check error rates.
Action: get_error_rate(service="task-processor", last_n_minutes=5)
Observation: error_rate = 0.3%, below threshold of 1%

Thought: Queue is accumulating but error rate is fine. The issue is throughput, not errors.
Action: scale_service(service="task-processor", replicas=4)
Observation: Scaling initiated. ETA: 90 seconds.

Thought: Scaling is underway. I should monitor and report back.
Answer: Queue depth was 4,521 with 12s lag. Error rate was healthy at 0.3%.
Scaled task-processor from 2 to 4 replicas. ETA 90 seconds.

The structured Thought/Action/Observation loop keeps the model grounded in real-world state rather than hallucinating actions it cannot take.

def react_agent(task: str, tools: dict) -> str:
    """
    Minimal ReAct agent loop.

    tools: dict mapping tool_name -> callable
    """
    history = [f"Task: {task}"]
    max_steps = 10

    for _ in range(max_steps):
        prompt = "\n".join(history) + "\nThought:"
        response = call_llm(prompt, stop=["Observation:"])

        history.append(f"Thought:{response}")

        # Parse action from response
        if "Action:" in response:
            action_line = [l for l in response.split('\n') if 'Action:' in l][0]
            tool_name, *args = parse_action(action_line)

            if tool_name in tools:
                result = tools[tool_name](*args)
                history.append(f"Observation: {result}")
            else:
                history.append(f"Observation: Unknown tool: {tool_name}")

        if "Answer:" in response:
            return response.split("Answer:")[-1].strip()

    return "Max steps reached."
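The loop relies on a `parse_action` helper that is not shown above; a minimal sketch, assuming tool calls are written as `name(arg="value", ...)` and handling double-quoted string arguments only:

```python
import re

def parse_action(action_line: str) -> tuple[str, ...]:
    """Parse 'Action: tool_name(arg="value", ...)' into (tool_name, *values).

    Deliberately simple: only double-quoted string arguments are
    extracted, in order, and passed positionally by the agent loop.
    Numeric or unquoted arguments would need a richer parser.
    """
    match = re.search(r'Action:\s*(\w+)\((.*)\)', action_line)
    if not match:
        raise ValueError(f"Unparseable action: {action_line!r}")
    tool_name, arg_str = match.group(1), match.group(2)
    values = re.findall(r'"([^"]*)"', arg_str)
    return (tool_name, *values)
```

In production you would instead use the API's native tool-calling support, which returns structured arguments and avoids parsing free text at all.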

28.5.5 Structured output constraints

For programmatic consumption, lock down the format:

# JSON output with explicit schema
prompt = """
Analyze this deployment event and return JSON matching this exact schema:

{
  "severity": "low" | "medium" | "high" | "critical",
  "affected_services": ["service1", "service2"],
  "root_cause": "one sentence",
  "recommended_action": "one sentence",
  "requires_human_review": true | false
}

Event:
[2026-04-24T14:32:01Z] Service payment-processor: latency p99 = 2,340ms (threshold: 500ms)
[2026-04-24T14:32:15Z] Service order-fulfillment: dependency timeout on payment-processor
[2026-04-24T14:32:20Z] Error rate on order-fulfillment: 12% (threshold: 1%)

Return only valid JSON. No explanation, no markdown fences.
"""

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    response_format={"type": "json_object"}  # enforce JSON mode
)

import json
result = json.loads(response.choices[0].message.content)

Best practices for structured output:

  1. Give an exact schema, not just "use JSON"
  2. Show the schema with types or enum values where precision matters
  3. Use response_format={"type": "json_object"} when the API supports it
  4. Validate the parsed output against your schema; do not trust implicit compliance
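Point 4 does not require a validation library; a minimal hand-rolled check against the schema above (the function name is illustrative):

```python
def validate_deploy_analysis(result: dict) -> list[str]:
    """Check the parsed JSON against the schema from the prompt.

    Returns a list of violations; an empty list means the output is usable.
    """
    errors = []
    if result.get("severity") not in {"low", "medium", "high", "critical"}:
        errors.append(f"bad severity: {result.get('severity')!r}")
    if not isinstance(result.get("affected_services"), list):
        errors.append("affected_services must be a list")
    for key in ("root_cause", "recommended_action"):
        value = result.get(key)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{key} must be a non-empty string")
    if not isinstance(result.get("requires_human_review"), bool):
        errors.append("requires_human_review must be a boolean")
    return errors
```

Returning violations rather than raising lets the caller decide whether to retry the model call with the errors appended to the prompt.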

28.6 Tool Use and Agent Prompts

28.6.1 How prompting changed when tools became default

Before tool use, a prompt was: "here is information, produce an answer." After tool use, a prompt is: "here is a task, here are tools you can call, here is when to stop."

The three new responsibilities in an agent prompt:

1. Tool specification: what each tool does, what arguments it accepts, what it returns. Vague tool descriptions produce unreliable tool calls.

2. Decision logic: when to call a tool vs. reason from context vs. ask for clarification. The model needs a decision framework, not just a list of tools.

3. Stopping conditions: when is the task complete? What counts as success? What is out of scope? Without clear boundaries the model loops or goes off-task.

28.6.2 A production-style agent prompt

You are a deployment-health agent for the Shannon production cluster.

## Your job
Monitor alerts, diagnose root causes, take conservative remediation actions, and escalate when uncertain.

## Tools available
- `get_metrics(service, metric, time_range)` - returns time series data
- `get_logs(service, level, last_n_lines)` - returns recent log lines
- `scale_replicas(service, count)` - changes replica count (max: 10)
- `restart_service(service)` - rolling restart of a service
- `page_oncall(team, severity, message)` - pages the on-call team

## Decision rules
1. Diagnose before acting. Call `get_metrics` and `get_logs` before any state change.
2. Conservative escalation: if p99 latency is > 2x baseline but error rate is < 1%, scale up rather than restart.
3. Restart only when error rate is > 5% or logs show fatal exceptions.
4. Page on-call for: data loss risk, cascading failures, any action beyond scaling or restart.
5. If you are uncertain about the root cause after two rounds of investigation, page on-call.

## Stopping conditions
- Issue is resolved (metrics within normal bounds for 2 minutes).
- Action is taken and you are waiting for it to take effect (state this clearly).
- You have escalated to on-call (provide full context in the page).
- Task is outside your authority (state clearly what you cannot do).

## Output format
Always respond with:
Thought: [your reasoning]
Action: [tool call or "escalate" or "done"]
[if Action is a tool call, wait for Observation before continuing]

This is a different genre of writing from a chat prompt. It is a contract between the engineer and the model.

28.6.3 MCP-style tool integration

Modern tool ecosystems like MCP (Model Context Protocol) formalize tool descriptions as structured schemas. The prompt still matters---it defines the high-level policy for when and how to use the tools the schema exposes.

The design principle: tool schemas define capability; prompts define judgment. Put access control in the tool layer. Put decision logic in the prompt. Do not blur those boundaries.
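As a concrete illustration of that split, the `scale_replicas` tool from 28.6.2 might be exposed as a structured schema (shown here in OpenAI function-calling style; MCP tool descriptors use a similar JSON Schema shape), while the decision rules stay in the prompt:

```python
# Capability lives in the schema: what the tool is, what it accepts,
# and the hard replica cap. Judgment (when to scale) stays in the prompt.
scale_replicas_tool = {
    "type": "function",
    "function": {
        "name": "scale_replicas",
        "description": "Change the replica count of a service. Hard-capped at 10.",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {"type": "string", "description": "Service name"},
                "count": {"type": "integer", "minimum": 1, "maximum": 10},
            },
            "required": ["service", "count"],
        },
    },
}
```

The cap belongs both here and in the tool's server-side implementation; never rely on the model honoring a limit stated only in prose.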


28.7 Practical Pitfalls and Best Practices

28.7.1 Common mistakes

Mistake                                | Problem                                                            | Fix
Prompt as a wall of text               | Model emphasizes the end; early instructions get diluted           | Structure with headers; repeat critical constraints at the end
Negative-only instructions             | "Do not hallucinate" does not specify what to do instead           | "If you do not know, say 'I do not have enough information'"
Hiding application logic in the prompt | Prompt bloat, drift between prompt versions and code               | Move hard rules to code; use the prompt for task context
Fixing with more words                 | Adding more verbose instructions often makes things worse          | Add a better example instead
Ignoring temperature                   | Using the default temperature for tasks that need determinism      | Set temperature=0 for code, math, structured output
Too many constraints                   | The model satisfies them in priority order; later ones get dropped | Rank and limit constraints; test what actually matters

28.7.2 Prompt structure that works

def build_prompt(
    role: str,
    context: str,
    task: str,
    examples: list[dict],
    constraints: list[str],
    output_format: str
) -> str:
    """
    Structured prompt builder.

    examples: list of {"input": ..., "output": ...} dicts
    constraints: list of constraint strings (ranked by importance)
    """
    parts = [f"## Role\n{role}", f"## Context\n{context}", f"## Task\n{task}"]

    if examples:
        ex_text = "\n\n".join(
            f"Input: {e['input']}\nOutput: {e['output']}" for e in examples
        )
        parts.append(f"## Examples\n{ex_text}")

    if constraints:
        c_text = "\n".join(f"- {c}" for c in constraints)
        parts.append(f"## Constraints\n{c_text}")

    parts.append(f"## Output format\n{output_format}")
    parts.append("---\nNow process the input:")

    return "\n\n".join(parts)

# Example use
prompt = build_prompt(
    role="Senior technical writer with a preference for concrete examples",
    context="Writing API documentation for a Python SDK",
    task="Generate a docstring for the provided function",
    examples=[
        {
            "input": "def connect(host, port, timeout=30): ...",
            "output": '"""Connect to the server at host:port.\n\n    Args:\n        host: hostname or IP\n        port: port number\n        timeout: seconds before connection attempt fails (default 30)\n\n    Returns:\n        Connection object ready for use\n\n    Raises:\n        ConnectionError: if the server is unreachable\n    """'
        }
    ],
    constraints=[
        "Use Google-style docstrings",
        "Include a Raises section if the function can raise exceptions",
        "Omit obvious information like 'returns None'",
    ],
    output_format="Only the docstring, wrapped in triple quotes."
)

28.7.3 Debugging prompts

When output is wrong, work through this checklist:

  1. Is the prompt ambiguous? Can you read it two ways? The model may be reading it the other way.
  2. Is the format shown? If you need a specific output format, show an example of it.
  3. Is the task too complex? Break it into two prompts.
  4. Is temperature too high? Set it to 0 for diagnostic runs.
  5. Is the model right and your expectation wrong? Run three times and see if the outputs cluster.

A small harness for step 5:

from collections import Counter

def diagnose_prompt(prompt: str, n_runs: int = 5) -> None:
    """Run a prompt multiple times to assess consistency."""
    results = []
    for i in range(n_runs):
        resp = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )
        results.append(resp.choices[0].message.content.strip())

    # Consistency = share of runs that produced the most common output
    counts = Counter(results)
    top_count = counts.most_common(1)[0][1]
    print(f"Prompt consistency: {top_count / n_runs:.0%}")
    print(f"Unique outputs: {len(counts)}")
    for i, r in enumerate(results, 1):
        preview = r[:200] + "..." if len(r) > 200 else r
        print(f"--- Run {i} ---\n{preview}\n")

28.7.4 When to move logic out of the prompt

A prompt is not a good place for:

  • Access control rules ("never do X for user type Y") - put in code
  • Exact string matching ("if the user says exactly 'cancel', do Z") - put in routing logic
  • Multi-step transaction logic - put in orchestration code
  • Rules that change frequently - put in a config file

A prompt is good for:

  • Task framing ("you are helping an engineer review code")
  • Judgment calls ("prefer explicit error handling over silent fallbacks")
  • Tone and format constraints
  • Few-shot examples of desired behavior

28.8 Chapter Summary

28.8.1 Technique comparison

Technique        | Core mechanism               | Best for                        | Typical gain
Zero-shot        | Direct task statement        | Simple, well-understood tasks   | baseline
Few-shot         | Examples as templates        | Format control, classification  | +10-20%
Chain-of-Thought | Visible reasoning trace      | Math, multi-step logic          | +20-40%
Self-Consistency | Majority vote over N samples | High-accuracy requirements      | +10-18%
Role Prompting   | Persona activation           | Tone, depth, domain register    | varies
Tree-of-Thought  | Branching + backtracking     | Open-ended planning             | +10-30%
ReAct            | Reason → Act → Observe loop  | Tool-using agents               | enables new capabilities

28.8.2 Decision flow

Is the task well-understood by the model?
  Yes → Zero-shot. If the format is specific, add one example.
  No  → Few-shot.

Does the task require multi-step reasoning?
  Yes → Add CoT ("step by step" or reasoning examples).

Is accuracy critical and is the answer verifiable?
  Yes → Add Self-Consistency (5-10 samples, majority vote).

Does the task need external data or actions?
  Yes → ReAct with defined tools and stopping conditions.

Is the task open-ended, or does it involve backtracking?
  Yes → Tree-of-Thought.

28.8.3 Key parameters

Parameter                | Good default                        | Notes
temperature              | 0 for math/code, 0.7 for generation | 0 = deterministic, 1.0 = creative
Few-shot examples        | 3-8                                 | Balanced across classes
Self-Consistency samples | 5-10                                | Diminishing returns above 10
CoT trigger              | "Let's think step by step"          | Or show reasoning examples
Max tokens               | set intentionally                   | Default is often too high for structured output

28.8.4 Core takeaway

Prompting is interface design for a next-token predictor. Few-shot examples show the model what correct output looks like. Chain-of-Thought creates space for intermediate computation. Self-Consistency samples multiple paths and takes the majority---cheap insurance on high-stakes outputs. ReAct and tool-use prompts define the agent's authority and stopping conditions. The common thread: every technique works by shaping the context so that the correct answer is the most natural continuation.


Chapter Checklist

After this chapter, you should be able to:

  • Explain why prompting works in terms of next-token prediction.
  • Design few-shot prompts with appropriate example count and diversity.
  • Apply zero-shot and few-shot Chain-of-Thought to reasoning tasks.
  • Implement Self-Consistency with majority voting and explain when it is worth the cost.
  • Describe Tree-of-Thought and the class of problems it addresses.
  • Write a ReAct-style agent prompt with clear tool definitions and stopping conditions.
  • Separate application logic from prompt instructions.
  • Diagnose a poorly performing prompt using the structured checklist.

See You in the Next Chapter

That is prompt engineering. If you can write a Self-Consistency loop from scratch and explain why temperature matters for it, you have the technique down.

Prompt engineering steers a model's behavior at inference time without changing its weights. The next chapter goes deeper: what if you want to change the model's values and preferences, not just its immediate behavior? Chapter 29 covers RLHF and DPO---the training-time alignment methods that made ChatGPT feel like it wants to help rather than just continue text.

Cite this page
Zhang, Wayland (2026). Chapter 28: Prompt Engineering In Practice. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-28-prompt-engineering
@incollection{zhang2026transformer_chapter_28_prompt_engineering,
  author = {Zhang, Wayland},
  title = {Chapter 28: Prompt Engineering In Practice},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-28-prompt-engineering}
}