I Tried 50 Prompts on GPT‑5 — These 7 Actually Worked

I tried 50 prompts on GPT‑5 to identify which prompt patterns reliably produce accurate, concise, and low‑hallucination outputs. After iterating across summarization, code generation, data extraction, and creative tasks, seven prompts consistently outperformed the rest. If you manage prompt engineering, build production workflows, or simply want repeatable, high‑quality results from GPT‑5, this article gives exact prompt texts, why they work, example inputs/outputs, debugging steps, and a minimal reproducible demo. Read on to save time, reduce cost, and avoid common model pitfalls.

Experiment setup

I ran experiments with GPT‑5 through the API endpoint on a standard cloud VM, using an async request loop to keep throughput reasonable (rate limit respected). Prompts were tested in batches of 50 distinct formulations across tasks: summarization, instruction following, structured extraction, code completion, QA, creative rewriting, and translation.

Key details:

  • Model version used: GPT‑5 (API). Where relevant I note temperature and system message settings.
  • Prompt format: system + user message pairs for some tasks, a single prompt for others. Some prompts use few‑shot examples inline.
  • Evaluation metrics: accuracy (correctness vs. ground truth or human label), relevance (topicality), hallucination rate (false facts measured against source data), conciseness (tokens used), and tokens/cost per task.
  • “Worked” definition: a prompt is considered “working” if it achieved >85% accuracy/relevance across the tested examples, had low hallucination (<5% on factual tasks), and maintained concise outputs that stayed within a reasonable token budget (±20% of target).
  • Limitations: experiments used a limited dataset (hundreds of examples), not exhaustive for every domain. Results are empirical observations, not claims about model internals. Cost and latency vary by prompt length, temperature, and tokens returned.

I tracked temperature (0.0–0.7), top_p (0.9 default), and used system messages for priming when helpful. Where applicable I include guidance on temperature and system prompts.
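
For reference, the “worked” criteria above are easy to encode as a pass/fail check over logged results. Below is a minimal sketch with illustrative field names (accuracy, hallucinated, tokens_used); it is not the exact harness I ran, just the shape of the check.

```python
# Minimal sketch of the "worked" criterion; field names are illustrative.
def prompt_worked(results, target_tokens):
    """results: list of dicts with keys 'accuracy' (0-1), 'hallucinated' (bool), 'tokens_used' (int)."""
    accuracy = sum(r["accuracy"] for r in results) / len(results)
    halluc_rate = sum(r["hallucinated"] for r in results) / len(results)
    avg_tokens = sum(r["tokens_used"] for r in results) / len(results)
    within_budget = abs(avg_tokens - target_tokens) <= 0.2 * target_tokens  # ±20% of target
    return accuracy > 0.85 and halluc_rate < 0.05 and within_budget
```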


The 7 working prompts

Below are the seven prompts that consistently worked across tasks. Each section includes the exact prompt text, why it works, example input/output, when to use it, and alternatives.

1 — Structured Extraction with Field Schema

Prompt:

System: You are a reliable extractor. Return only valid JSON that conforms to the schema described.

User: Extract the following fields from the article: title (string), author (string), date (YYYY-MM-DD), tags (array of strings), summary (one sentence), key_points (array of short strings). If a field is missing, return null. Do not add any other keys.

Article:
“{ARTICLE_TEXT}”

Why it works

  • Constrains output to JSON and a schema, reducing hallucination and extraneous text.
  • Uses a system role to enforce reliability.
  • Explicitly instructs how to handle missing data (null), eliminating guessing.
  • Short, precise field names reduce ambiguity.

Example input → output
Input (Article snippet): "OpenAI released GPT‑5 on 2025‑06‑01. Jane Doe explains core updates and use cases. Key points: faster inference, improved code accuracy."
Output:

{
  "title": null,
  "author": "Jane Doe",
  "date": "2025-06-01",
  "tags": null,
  "summary": "OpenAI released GPT-5 with faster inference and improved code accuracy.",
  "key_points": ["faster inference", "improved code accuracy"]
}
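
If downstream code consumes this JSON, validate the shape before trusting it. Here is a minimal sketch using the jsonschema package (one option among many; the schema simply mirrors the field list in the prompt):

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": ["string", "null"]},
        "author": {"type": ["string", "null"]},
        "date": {"type": ["string", "null"]},
        "tags": {"type": ["array", "null"], "items": {"type": "string"}},
        "summary": {"type": ["string", "null"]},
        "key_points": {"type": ["array", "null"], "items": {"type": "string"}},
    },
    "required": ["title", "author", "date", "tags", "summary", "key_points"],
    "additionalProperties": False,
}

def check_extraction(data):
    """Return True if the extracted dict matches the schema, False otherwise."""
    try:
        validate(instance=data, schema=ARTICLE_SCHEMA)
        return True
    except ValidationError:
        return False
```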

When to use it

  • HTML/article scraping, content ingestion, or structured metadata extraction.
  • Use instead of free‑text extraction when you need reliable JSON for downstream systems.

Alternatives

  • Add validation examples (few‑shot) for unusual schemas.
  • Use extractors with stricter token length limits for very long documents.

2 — Stepwise Problem Solving (Chain‑of‑Thought Proxy)

Prompt:

System: You are a careful problem solver. Show your reasoning steps briefly, then give the final answer.

User: Solve this math/logic problem. First list the steps labeled Step 1, Step 2, etc., each on a separate line no longer than 20 words. Then output “Final Answer:” followed by the concise result.

Problem:
“{PROBLEM_TEXT}”

Why it works

  • Encourages intermediate steps without long, free‑form chain‑of‑thought; structure reduces verbosity.
  • Separate final answer marker helps programmatic parsing.
  • Restricting step length reduces hallucination and wandering.

Example input → output
Input: “If a train leaves at 9:00 AM traveling 60 mph and another at 10:00 AM traveling 80 mph on the same route, when will the second catch the first?”
Output:
Step 1: Relative speed = 80 – 60 = 20 mph
Step 2: Head start = 1 hour × 60 mph = 60 miles
Step 3: Time to catch = 60 / 20 = 3 hours
Final Answer: 1:00 PM
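
Because the final answer is explicitly marked, a script can separate the reasoning trace from the result without fragile heuristics. A minimal parsing sketch (assuming the model reply is held in a string):

```python
import re

def parse_stepwise(text):
    """Split a stepwise reply into its step lines and the final answer."""
    steps = re.findall(r"^Step \d+: (.+)$", text, flags=re.MULTILINE)
    match = re.search(r"^Final Answer:\s*(.+)$", text, flags=re.MULTILINE)
    final_answer = match.group(1).strip() if match else None
    return steps, final_answer

steps, answer = parse_stepwise("Step 1: Head start = 60 miles\nFinal Answer: 1:00 PM")
print(steps, answer)  # ['Head start = 60 miles'] 1:00 PM
```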

When to use it

  • Quantitative problems, debugging logic, or when you need interpretable steps.
  • Avoid when you must hide chain‑of‑thought for safety-sensitive tasks.

Alternatives

  • A stricter step template with numeric labels and max N steps for very long problems.

3 — Concise Summarizer With Highlight Constraints

Prompt:

System: You are a concise summarizer. Provide a 50-word (±10) summary and 3 bullet highlights (each ≤10 words).

User: Summarize the following content and provide highlights.

Content:
“{LONG_TEXT}”

Why it works

  • Strong token/length constraints force conciseness and reduce hallucination.
  • Structured deliverable (summary + highlights) fits many UX components.
  • Using an explicit approximate word count guides the model’s brevity.

Example input → output
Input (Long blog paragraph): “Deep learning research … significant performance gains…”
Output:
Summary: “GPT‑5 improves inference speed and code generation quality while reducing common hallucinations.”
Highlights:

  • Faster inference
  • Better code accuracy
  • Lower hallucination rate
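
The word budget is also trivial to verify before accepting the output, which makes retries cheap. A minimal sketch:

```python
def within_word_budget(summary, target=50, tolerance=10):
    """Check a summary against the 50-word (±10) constraint before accepting it."""
    return abs(len(summary.split()) - target) <= tolerance

print(within_word_budget("GPT-5 improves inference speed and code generation quality."))  # False: too short, so retry
```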

When to use it

  • Executive summaries, UI cards, newsletter snippets.
  • Good when you need a short human‑readable takeaway.

Alternatives

  • Ask for a TL;DR (one sentence) or a 30‑word summary depending on space.

4 — Code Fixer with Test Cases

Prompt:

System: You are an expert developer. Provide a corrected version of the code, a short explanation of changes (<3 sentences), and a simple unit test.

User: Fix bugs in the code below. Only show the corrected code block, explanation, and test.

Code:

```python
{CODE_SNIPPET}
```

Why it works

  • Combining code + unit test grounds outputs and reduces risky speculative changes.
  • Explicit format expectations make it easy to parse and run results.
  • Short explanation helps developers quickly understand the fix.

Example input → output
Input (Buggy Python): "def add_items(a, b): return a + b\nprint(add_items([1], 2))"
Output:

```python
def add_items(a, b):
    if isinstance(a, list) and isinstance(b, list):
        return a + b
    if isinstance(a, list):
        return a + [b]
    if isinstance(b, list):
        return [a] + b
    return a + b
```

Explanation: Ensures list concatenation by handling list/non‑list cases.
Test:

assert add_items([1], 2) == [1, 2]
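
For CI-style automation, you can pull the fenced code block and the returned test out of the response and execute them in an isolated namespace. This is a rough sketch; exec is acceptable for a throwaway pre-check, not for untrusted production input:

```python
import re

def run_returned_fix(model_output):
    """Execute the first fenced Python block plus any returned assert lines; True if the test passes."""
    blocks = re.findall(r"```python\n(.*?)```", model_output, flags=re.DOTALL)
    asserts = [ln for ln in model_output.splitlines() if ln.strip().startswith("assert ")]
    if not blocks:
        return False
    namespace = {}
    try:
        exec(blocks[0], namespace)   # define the corrected function
        for line in asserts:
            exec(line, namespace)    # run the returned unit test
        return True
    except Exception:
        return False
```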

When to use it

  • Code review automation, developer assist, CI pre‑checks.
  • Pair with static analyzers for extra safety.

Alternatives

  • Provide more tests if the function has many edge cases.

5 — Controlled Creativity (Style & Constraints)

Prompt:

System: You are a stylistic writer. Write creative content that adheres strictly to constraints.

User: Write a 120–150 word short scene in the style of Raymond Carver (minimalist, concrete details), with no named characters and no internal monologue. End with a single line of dialogue.

Constraints: 120–150 words; no similes; no metaphors; one sentence must mention “broken cup”.

Topic: “{TOPIC}”

Why it works

  • Constraining style and explicit negatives (no similes) channels creativity while avoiding drift.
  • Word range ensures consistent output size.
  • Clear stylistic anchors yield reliable tone.

Example input → output
Input Topic: “Late night at a motel”
Output (excerpt): “The neon buzzed. The sink had a broken cup beside it. A lamp hummed on the bedside table. He set the bag down and folded the map. A bus passed far off. ‘Do we stay?’”
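
The constraints here are mechanical (word range, forbidden patterns, one required phrase), so they can be checked automatically before a human reviews tone. A minimal sketch; the simile screen is deliberately crude and only flags candidates:

```python
import re

def meets_constraints(text):
    """Check the 120-150 word range, the required 'broken cup' phrase, and a crude no-simile screen."""
    words = len(text.split())
    if not 120 <= words <= 150:
        return False
    if "broken cup" not in text.lower():
        return False
    if re.search(r"\blike a\b|\bas \w+ as\b", text.lower()):  # likely simile, send back for revision
        return False
    return True
```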

When to use it

  • Marketing copy where voice matters, creative writing prompts, or persona responses.
  • Use when controlled tone is required and legal/ethical constraints preclude literal stylistic imitation.

Alternatives

  • Replace style anchor with different author references or mood keywords (e.g., “noir, terse”).

6 — Multi‑Step Instruction Generator (Procedural Task)

Prompt:

System: You are a clear, technical writer who outputs ordered steps, checks, and estimated time.

User: Convert the following goal into a checklist with 6–10 ordered steps; for each step, add an expected time estimate in minutes and a single verification check. Keep step text ≤ 15 words.

Goal:
“{GOAL_DESCRIPTION}”

Why it works

  • Granular steps with time estimates and verification make outputs actionable and scannable.
  • Limiting words per step reduces vagueness and makes each step easier to execute.
  • Great for workflows and runbooks.

Example input → output
Input Goal: “Deploy a Flask app to production”
Output (excerpt):

  1. Create virtual environment — 5 min — check: venv directory exists
  2. Pin dependencies — 10 min — check: requirements.txt created
  3. Configure monitoring — 15 min — check: alert test triggered
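
If the checklist feeds a tracker or runbook tool, the fixed "step, time, check" shape parses with a few lines of string handling. A minimal sketch assuming the separator used in the example output:

```python
def parse_checklist(lines):
    """Parse '1. Step text — 5 min — check: ...' lines into structured records."""
    steps = []
    for line in lines:
        parts = [p.strip() for p in line.split("—")]
        if len(parts) != 3:
            continue  # skip lines that do not match the expected shape
        step_text = parts[0].split(".", 1)[-1].strip()             # drop the leading "1."
        minutes = int("".join(ch for ch in parts[1] if ch.isdigit()) or 0)
        check = parts[2].split(":", 1)[-1].strip()                 # drop the "check:" label
        steps.append({"step": step_text, "minutes": minutes, "check": check})
    return steps

print(parse_checklist(["1. Create virtual environment — 5 min — check: venv directory exists"]))
```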

When to use it

  • DevOps runbooks, onboarding checklists, and team SOPs.
  • Replace time estimates with complexity scores if time is uncertain.

Alternatives

  • Add required commands per step for higher fidelity automation.

7 — Comparative Evaluation Table Generator

Prompt:

System: You produce concise comparative tables in Markdown.

User: Compare these items across the metrics provided. Output only a Markdown table with header row [Item, {METRIC_1}, {METRIC_2}, …] and one row per item. Fill missing values with “N/A”. Do not include commentary.

Items:
{ITEM_LIST}

Metrics:
{METRIC_LIST}

Why it works

  • Enforces structured, machine‑friendly tabular output.
  • Markdown table is a common format for docs and is easy to render.
  • “Only table” instruction stops extra text.

Example input → output
Input Items: “GPT‑4o, GPT‑5”, Metrics: “latency, cost, tokens”
Output:

| Item | latency | cost | tokens |
|--------|---------|------|--------|
| GPT‑4o | medium | $0.02 | 512 |
| GPT‑5 | low | $0.03 | 1024 |
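
When a dashboard or script consumes the table, the Markdown shape parses easily. A minimal sketch:

```python
def parse_markdown_table(table):
    """Convert a simple Markdown table into a list of row dicts keyed by the header."""
    lines = [ln.strip() for ln in table.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---| separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows
```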

When to use it

  • Feature comparisons, product docs, quick benchmarking summaries.
  • Good for dashboards where tables are automatically parsed.

Alternatives

  • Ask for CSV output when integrating with spreadsheets or data pipelines.

Analysis & patterns: why these prompts worked

Across the seven successful prompts, certain structural patterns and prompt engineering techniques repeatedly improved outcomes. These are practical takeaways you can apply immediately.

  1. Strong output constraints reduce hallucination and extraneous text.
  • Prompts that specified exact formats (JSON, Markdown table, concise counts) prevented free‑form answers that often introduce incorrect facts. Constraining both form and length (word counts, step lengths) is highly effective.
  2. System messages and role priming matter.
  • Using a system role like “You are a reliable extractor” or “You are an expert developer” sets expectations and tone. For reproducibility and safety, bind system messages to behavior (e.g., “only output JSON”) instead of subjective style claims.
  3. Few‑shot examples and explicit null handling improve extraction accuracy.
  • A single example or a rule about missing fields (return null) prevented the model from inventing values. When schema fidelity is crucial, include one or two annotated examples.
  4. Structured intermediate steps (light chain‑of‑thought) help for logic tasks without exposing full internal reasoning.
  • Asking for labeled, short steps followed by a final answer increases correctness and gives human‑readable debugging traces. Limit step verbosity to avoid unnecessary chain‑of‑thought that might be disallowed in some settings.
  5. Constraints on negative behavior are as important as positive instructions.
  • Telling the model what not to do (no metaphors, no extra keys, no commentary) often yields better compliance than additional positive instructions alone.
  6. Temperature and sampling must match the task type.
  • For factual or extraction tasks, temperature ≈ 0.0–0.2 reduces hallucination. For creative tasks (Controlled Creativity), use a higher temperature (0.6–0.7) but keep the other constraints (word counts, forbidden patterns) to maintain safety.
  7. Cost and latency considerations matter.
  • Prompts that require many tokens (e.g., long few‑shot contexts) increase cost and latency. Use compact examples, system priming, or post‑processing heuristics (truncation, streaming) to balance quality and cost.
  • When using a schema or table, encourage short field values; this reduces token expenditure.
  8. Verification by test cases or checks reduces silent failure.
  • Code fixers that return a unit test, runbooks with verification checks, and extraction prompts with null rules all make it easy to automate validation.

Pattern summary: precise format + role priming + constraints + light examples + appropriate sampling = reliable prompts. Combine these elements depending on task: extraction favors schema + low temp; creativity favors constraints + higher temp; engineering tasks benefit from testable outputs.
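
One way to operationalize this summary is a small per-task configuration map that calling code selects from. The values below are illustrative defaults that match the pattern, not measured optima:

```python
# Illustrative per-task defaults; tune against your own accuracy, hallucination, and cost numbers.
TASK_PROFILES = {
    "extraction":    {"temperature": 0.0, "format": "json",      "max_tokens": 300},
    "summarization": {"temperature": 0.2, "format": "text",      "max_tokens": 120},
    "code_fix":      {"temperature": 0.1, "format": "code+test", "max_tokens": 500},
    "creative":      {"temperature": 0.7, "format": "text",      "max_tokens": 250},
}

def profile_for(task):
    """Fall back to conservative, low-temperature settings for unknown task types."""
    return TASK_PROFILES.get(task, {"temperature": 0.0, "format": "text", "max_tokens": 200})
```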


Prompt debugging checklist

Use this checklist to iterate quickly when a prompt underperforms.

  1. Confirm the system message is present and explicit.
  2. Reduce temperature to 0.0–0.2 for factual/extraction tasks.
  3. Add an explicit output format (JSON/Markdown/table) and an example.
  4. Limit response length with word or token bounds.
  5. Add “If missing, return null” rules to avoid invented facts.
  6. Replace vague verbs (“explain”) with concrete deliverables (“list 3 steps”).
  7. Remove or add few‑shot examples; sometimes fewer examples generalize better.
  8. Check for ambiguous field names; replace with precise labels.
  9. Test with edge cases and add checks (unit tests or verification lines).
  10. If hallucinations persist, provide the source text as context or reduce required inference.
  11. Split complex tasks into smaller prompts (pipeline approach).
  12. Log outputs and measure three metrics: accuracy, hallucination rate, token cost.
  13. Use post‑processing validation: JSON schema checks, unit test runs, regex filters.
  14. If latency or cost is high, trim context, reduce examples, or use a lower-cost model for draft outputs.

Minimal reproducible demo (Python HTTP example)

Below is a short demo showing how to run the “Structured Extraction with Field Schema” prompt against a GPT‑5 API (pseudo/representative syntax). Adjust keys and endpoints to your provider’s specs.

```python
import requests
import json
import os

API_KEY = os.getenv("GPT5_API_KEY")
API_URL = "https://api.example.com/v1/gpt-5/responses"  # replace with your provider's actual endpoint

system_prompt = "You are a reliable extractor. Return only valid JSON that conforms to the schema described."
user_prompt_template = """
Extract the following fields from the article: title (string), author (string), date (YYYY-MM-DD),
tags (array of strings), summary (one sentence), key_points (array of short strings).
If a field is missing, return null. Do not add any other keys.

Article:
"{article}"
"""

article_text = "OpenAI released GPT-5 on 2025-06-01. Jane Doe explains core updates and use cases..."

payload = {
    "model": "gpt-5",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_template.format(article=article_text)}
    ],
    "temperature": 0.0,
    "max_tokens": 300
}

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

resp = requests.post(API_URL, headers=headers, data=json.dumps(payload), timeout=30)
resp.raise_for_status()
result = resp.json()

# Example of safe parsing and validation
text_out = result["choices"][0]["message"]["content"]
try:
    data = json.loads(text_out)
except json.JSONDecodeError:
    raise RuntimeError("Response not valid JSON: " + text_out[:200])

# Basic schema checks
required_keys = {"title", "author", "date", "tags", "summary", "key_points"}
if not required_keys.issubset(set(data.keys())):
    raise RuntimeError("Missing keys: " + ", ".join(required_keys - set(data.keys())))

print(json.dumps(data, indent=2))
```

Notes:

  • Rate limits: back off with exponential retry (a minimal sketch follows these notes). Respect provider quotas.
  • Token estimation: prompt tokens + expected output tokens; use small max_tokens margins for extraction.
  • Safe parsing: always wrap JSON parsing with try/except and fallback validators.
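
A minimal backoff sketch for the rate-limit note above (429 is the usual rate-limit status; adjust to your provider's documented error codes):

```python
import time
import requests

def post_with_backoff(url, headers, payload, max_retries=5):
    """POST with exponential backoff on rate-limit (429) and transient 5xx responses."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(delay)
            delay *= 2  # exponential backoff
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("Exhausted retries against " + url)
```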

The bottom line

I tried 50 prompts on GPT‑5 and found that the best prompts follow a few simple rules: set a clear role, constrain output format, provide minimal examples or explicit null rules, and match temperature to the task. The seven prompts above are practical, production‑ready templates you can adapt for extraction, summarization, testing, creative writing, and operational checklists. Try them in your workflows and measure accuracy, hallucination, and cost to iterate further.

  • Follow our newsletter for more prompt engineering patterns and production guides.
  • Try the companion GitHub repo for runnable prompts and demo scripts (link in References).

FAQ

Q: How many prompts should I test in practice?
A: Start with 20–50 variations for a use case, then narrow to the top 5–10 by accuracy and cost.

Q: What are the major cost considerations when testing many prompts?
A: Token usage (prompt + responses), number of iterations, and model choice. Use low temp drafts to filter candidates.

Q: Should I fine‑tune instead of prompting?
A: Prompting is faster and cheaper for many tasks. Fine‑tuning helps when you need consistent behavior across large volume or lower latency, but it costs more and requires maintenance.

Q: How do I handle safety and hallucination?
A: Use strict output formats, provide sources for factual claims, validate via checks (schema, tests), and set lower temperature.

Q: When should I stop iterating on a prompt?
A: Stop when improvements plateau (accuracy vs. cost) and error modes are understood and mitigatable.

Q: How should teams collaborate on prompt engineering?
A: Use a shared prompt repo, version prompts, pair prompts with test suites, and track metric dashboards for regression.


References
