
Jun 25 2025

Prompt Engineering or Fine-Tuning: Which Is Best for Your AI Project?

Image generated using ChatGPT

AI developers today have two primary strategies for customizing large language models (LLMs): prompt engineering and fine-tuning. While both influence model output, they operate at different levels and serve distinct use cases.

This guide explores the differences through real-world examples and includes a practical quiz to help you determine which strategy suits your project.

Prompt Engineering

What Is It?

Prompt engineering is the practice of designing inputs to guide a model’s output without modifying the model itself. It involves strategically phrasing requests, providing context, and specifying output formatting to improve performance.

Is It Enough?

When evaluating whether prompt engineering alone will suffice for your project, consider whether the information your users need is likely well-represented in the model’s training data. Most modern LLMs are trained on massive datasets like Common Crawl, which contains petabytes of web content, along with books, academic papers, and other text sources.

If your use case involves general knowledge, common business scenarios, widely discussed technical topics, or well-documented procedures, prompt engineering will likely be sufficient. However, if you need the model to work with proprietary information, recent events after the training cutoff, specialized domain knowledge that’s rarely discussed online, or company-specific processes and data, you’ll need to supplement prompt engineering with techniques like fine-tuning, retrieval-augmented generation (RAG), or real-time data integration.

I’ll address RAG and real-time data integration in a follow-up post. This one will focus on deciding between prompt engineering and fine-tuning.

Structuring the API Call

Standard LLMs like Claude, GPT-4, and Llama are accessed through APIs, which allow developers to send text prompts and receive AI-generated responses programmatically. These API calls are structured as JSON objects that specify the model to use, the messages to send, and various parameters that control the model’s behavior.

Multimodal models—which are a distinct category—can handle multiple input types (e.g., text + image, text + audio, text + video, etc.). But those are outside the scope of this post.

When working with different LLM APIs, it’s important to understand that system prompts are formatted differently depending on the provider. OpenAI-compatible APIs (used to serve models like Llama and Phi) require the system prompt to be included as the first message in the messages array with "role": "system", whereas Anthropic’s Claude API uses a separate system parameter outside the messages array. (I personally find the latter approach cleaner.)
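If you’re calling these APIs from Python, here’s a minimal sketch of the two placements using the official openai and anthropic SDKs. The model names, prompts, and client setup are illustrative placeholders rather than code from this post:

# Minimal sketch (not from the original post): the same system prompt sent two ways.
from openai import OpenAI
from anthropic import Anthropic

SYSTEM_PROMPT = "You are an experienced English teacher evaluating student essays."
USER_PROMPT = "Score this essay from 1-6 on clarity, structure, and argument strength: ..."

# OpenAI-compatible format: the system prompt is the first item in the messages array.
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
openai_response = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
    max_tokens=800,
    temperature=0.1,
)
print(openai_response.choices[0].message.content)

# Anthropic format: the system prompt is a top-level parameter outside the messages array.
anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
claude_response = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=800,
    temperature=0.1,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": USER_PROMPT}],
)
print(claude_response.content[0].text)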

Both formats deliver the same instructions to the model with equal effectiveness. The LLM processes the system prompt the same way regardless of placement. What matters is that these system instructions remain consistent across all your examples, whether you’re using zero-shot, one-shot, or few-shot prompting, as they establish the foundational behavior and constraints that guide the model’s responses throughout the entire conversation. (More on those distinctions in a moment.)

Real-World Examples

I’ve included three real-world API call examples that are prime candidates for prompt engineering: one zero-shot prompt (no example responses), one one-shot prompt (one example response), and one few-shot prompt (a few example responses). The zero-shot and one-shot examples score the same student essay, so you can compare the two approaches directly.

Zero-Shot: Educational Assessment

Note that there is a prompt instruction (assigned to the system role) and a user prompt. This is a good place to start. If you run test prompts and get acceptable responses, there’s no need to bloat your prompt instruction with examples. Because LLMs are stateless (meaning they don’t retain information from previous conversations or API calls), the complete prompt instruction and any examples typically must be sent with every API call, which increases inference costs. While some providers like Anthropic offer prompt caching to reduce this overhead, you should still start with zero-shot prompting and only add examples if the quality improvement justifies the additional cost.

{
  "model": "microsoft/phi-3.5-mini-instruct",
  "max_tokens": 800,
  "temperature": 0.1,
  "messages": [
    {
      "role": "system",
      "content": "You are an experienced English teacher evaluating student essays. Score each essay from 1-6 based on three criteria: clarity (how well ideas are communicated), structure (organization and flow), and argument strength (evidence and reasoning). Provide specific feedback for each criterion and an overall score."
    },
    {
      "role": "user",
      "content": "Score the following essay:\n\n'Climate change is a big problem. Many scientists say the earth is getting warmer because of pollution. Cars and factories make pollution. We should use less cars and more bikes. Also solar panels are good. In conclusion, climate change is bad and we need to fix it.'\n\nProvide scores for clarity, structure, and argument strength, plus overall feedback."
    }
  ]
}

I don’t want to get too far into the weeds here, but it’s worth noting that you don’t need to wait for your LLM vendor to expose a caching API. You can build prompt caching entirely within your own service layer. By hashing the complete prompt (system + user messages) into a cache key, you can store and immediately replay high-frequency prompt/response pairs whenever a similar request is submitted. Simply attach a sensible TTL or eviction policy to prevent stale replies from persisting, and you’ll eliminate repeated embedding and generation costs on your most common queries. Whether or not your provider offers managed caching, this local approach slashes both latency and token-usage fees without sacrificing freshness for dynamic interactions. You can learn more about the benefits of caching in this follow-up post.
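Here’s a minimal sketch of what that service-layer cache might look like. It’s illustrative only: the hashing scheme, TTL value, and the call_llm function are assumptions rather than any provider’s API.

# Illustrative sketch of a local prompt cache (not a provider feature).
# call_llm is a placeholder for whatever function actually hits your LLM API.
import hashlib
import json
import time

CACHE_TTL_SECONDS = 3600  # evict entries after an hour; tune to your traffic
_cache: dict[str, tuple[float, str]] = {}  # cache_key -> (timestamp, response)

def _cache_key(payload: dict) -> str:
    """Hash the full request payload (system + user messages) into a stable key."""
    serialized = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

def cached_completion(payload: dict, call_llm) -> str:
    """Return a cached response for an identical payload, or call the LLM and store the result."""
    key = _cache_key(payload)
    entry = _cache.get(key)
    if entry is not None:
        timestamp, response = entry
        if time.time() - timestamp < CACHE_TTL_SECONDS:
            return response  # cache hit: no tokens spent
        del _cache[key]      # stale entry: evict and regenerate
    response = call_llm(payload)  # cache miss: pay for one generation
    _cache[key] = (time.time(), response)
    return response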

One-Shot: Educational Assessment

In this example, we’re sending one example exchange to show the LLM how we want it to respond. The example is the first user/assistant pair in the code snippet.

{
  "model": "meta-llama/llama-3.1-70b-instruct",
  "max_tokens": 800,
  "temperature": 0.1,
  "messages": [
    {
      "role": "system",
      "content": "You are an experienced English teacher evaluating student essays. Score each essay from 1-6 based on three criteria: clarity (how well ideas are communicated), structure (organization and flow), and argument strength (evidence and reasoning). Provide specific feedback for each criterion and an overall score."
    },
    {
      "role": "user",
      "content": "Score this essay from 1-6 on clarity, structure, and argument strength:\n\n'Technology has revolutionized education by providing students with instant access to information, enabling personalized learning experiences, and facilitating collaboration across geographical boundaries. However, excessive screen time may impact social development and critical thinking skills.'"
    },
    {
      "role": "assistant",
      "content": "**Clarity: 5/6** - Clear, well-articulated ideas with sophisticated vocabulary and precise language.\n\n**Structure: 4/6** - Coherent flow with balanced presentation of benefits and drawbacks, but could benefit from more detailed organization.\n\n**Argument Strength: 4/6** - Presents both advantages and concerns with reasonable examples, though could use more specific evidence.\n\n**Overall Score: 4.3/6**\n\n**Feedback:** Strong essay with clear communication and balanced perspective. To improve, add specific examples and organize points into distinct paragraphs."
    },
    {
      "role": "user",
      "content": "Score this essay from 1-6 on clarity, structure, and argument strength:\n\n'Climate change is a big problem. Many scientists say the earth is getting warmer because of pollution. Cars and factories make pollution. We should use less cars and more bikes. Also solar panels are good. In conclusion climate change is bad and we need to fix it.'"
    }
  ]
}

Few-Shot: Code Generation

In this example, we’re sending three example exchanges to show the LLM how we want it to respond. The examples are the three user/assistant pairs in the code snippet.

{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 1000,
  "temperature": 0.2,
  "system": "You are an expert Python developer. Write clean, well-documented functions with proper error handling. Include comprehensive docstrings with Args, Returns, and Raises sections. Add example usage demonstrating the function.",
  "messages": [
    {
      "role": "user",
      "content": "Write a Python function that checks if a number is even. Include error handling and documentation."
    },
    {
      "role": "assistant",
      "content": "```python\ndef is_even(number):\n    \"\"\"\n    Check if a number is even.\n    \n    Args:\n        number (int or float): The number to check\n        \n    Returns:\n        bool: True if the number is even, False otherwise\n        \n    Raises:\n        TypeError: If input is not a number\n    \"\"\"\n    if not isinstance(number, (int, float)):\n        raise TypeError(\"Input must be a number\")\n    \n    return number % 2 == 0\n\n# Example usage:\nprint(is_even(4))    # Output: True\nprint(is_even(7))    # Output: False\n```"
    },
    {
      "role": "user",
      "content": "Write a Python function that removes duplicates from a list while preserving order. Include error handling and documentation."
    },
    {
      "role": "assistant",
      "content": "```python\ndef remove_duplicates(items):\n    \"\"\"\n    Remove duplicate items from a list while preserving order.\n    \n    Args:\n        items (list): Input list that may contain duplicates\n        \n    Returns:\n        list: New list with duplicates removed\n        \n    Raises:\n        TypeError: If input is not a list\n    \"\"\"\n    if not isinstance(items, list):\n        raise TypeError(\"Input must be a list\")\n    \n    seen = set()\n    result = []\n    for item in items:\n        if item not in seen:\n            seen.add(item)\n            result.append(item)\n    return result\n\n# Example usage:\noriginal = [1, 2, 2, 3, 1, 4]\nresult = remove_duplicates(original)\nprint(result)  # Output: [1, 2, 3, 4]\n```"
    },
    {
      "role": "user",
      "content": "Write a Python function that finds the factorial of a number. Include error handling and documentation."
    },
    {
      "role": "assistant",
      "content": "```python\ndef factorial(n):\n    \"\"\"\n    Calculate the factorial of a non-negative integer.\n    \n    Args:\n        n (int): Non-negative integer to calculate factorial for\n        \n    Returns:\n        int: The factorial of n\n        \n    Raises:\n        TypeError: If input is not an integer\n        ValueError: If input is negative\n    \"\"\"\n    if not isinstance(n, int):\n        raise TypeError(\"Input must be an integer\")\n    \n    if n < 0:\n        raise ValueError(\"Input must be non-negative\")\n    \n    if n == 0 or n == 1:\n        return 1\n    \n    result = 1\n    for i in range(2, n + 1):\n        result *= i\n    return result\n\n# Example usage:\nprint(factorial(5))   # Output: 120\nprint(factorial(0))   # Output: 1\n```"
    },
    {
      "role": "user",
      "content": "Write a Python function that takes a list of integers and returns only the prime numbers. Include error handling and documentation."
    }
  ]
}

Fine-Tuning

What Is It?

Fine-tuning involves updating a pre-trained model’s parameters using custom data. Unlike prompting, this process modifies the model’s internal weights to align with specific language patterns, terminology, or behavior.

Traditional fine-tuning updates all model parameters, which can be computationally expensive and require significant resources. Low-Rank Adaptation (LoRA) has become increasingly popular in the past couple of years because it achieves similar results by training only small adapter matrices that are added to the frozen pre-trained model, dramatically reducing memory requirements and training costs, often by 90% or more.

Explained in simpler terms, traditional fine-tuning could be viewed as gutting and renovating your home to give it a new look and feel, whereas LoRA is more like painting the walls, installing new flooring, and renovating the kitchen. If these less disruptive modifications can adequately transform your home and personalize the style without touching the underlying structure, you save significant money and avoid the disruption of a full renovation.

Image generated using ChatGPT

This efficiency makes fine-tuning accessible to organizations without massive GPU clusters, lowering both the technical barrier to entry and ongoing operational costs. In fact, I fine-tuned a model to translate medical notes to patient-friendly language for just $50 and an afternoon of model tweaking. Additionally, LoRA adapters can be easily swapped in and out, allowing the same base model to serve multiple specialized use cases without maintaining separate full model copies.
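To make that concrete, here’s a heavily abridged sketch of a LoRA setup using Hugging Face’s transformers and peft libraries. The base model, rank, and target modules are illustrative choices, not a recipe from this post:

# Abridged LoRA fine-tuning sketch using Hugging Face transformers + peft.
# The model name, rank, and target modules below are illustrative, not prescriptive.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA freezes the base weights and trains small low-rank adapter matrices instead.
lora_config = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor for adapter updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a tiny fraction of total parameters

# From here you'd train with your usual loop or the Trainer API, then save just
# the adapter, which can later be swapped onto the same frozen base model:
# model.save_pretrained("medical-notes-adapter")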

Real-World Examples

Below are a few examples of where fine-tuning a model may be a better choice than prompt engineering:

  • Code documentation: A software company fine-tunes on their best-documented codebases to automatically generate consistent API documentation and code comments.
  • Script adaptation: A streaming service fine-tunes on successfully adapted screenplays to help convert novels into scripts that are engaging and genre appropriate.
  • Earnings call summaries: An investment firm fine-tunes on transcripts and analyst notes to produce standardized earnings summaries that highlight key metrics and forward guidance.
  • Product descriptions: An online retailer fine-tunes on their top-converting product pages to generate compelling descriptions that match their brand voice and boost sales.
  • Manufacturing: A factory fine-tunes on safety incident documentation to generate comprehensive reports that meet OSHA requirements and internal standards.
  • Interactive entertainment: A game studio fine-tunes on existing lore and character scripts to generate contextually appropriate dialog during gameplay.
  • Socratic math app: An edtech company fine-tunes on the free MATH dataset of 12,500 math problems to teach the model to approach problems Socratically, the way a human tutor would (see the data-prep sketch after this list).
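Whichever use case you’re tackling, most of the work is assembling training pairs. A common format is chat-style JSONL with one example per line. Here’s an illustrative sketch for the Socratic math tutor; the messages schema shown is widely used, but check what your provider or training framework expects:

# Illustrative sketch: writing chat-style training examples to a JSONL file.
# The exact schema depends on your provider or training framework.
import json

training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a Socratic math tutor. Guide the student with questions instead of giving the answer."},
            {"role": "user", "content": "How do I solve 2x + 6 = 14?"},
            {"role": "assistant", "content": "What could you do to both sides of the equation to isolate the term with x?"},
        ]
    },
    # ...hundreds to thousands more examples drawn from your own data...
]

with open("fine_tune_data.jsonl", "w", encoding="utf-8") as f:
    for example in training_examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")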

Fine-tuning requires more resources but delivers superior accuracy and efficiency for domain-specific applications at scale.

Test Your Understanding with a Quiz 📝

Test your understanding with these real-world scenarios. Decide which approach you’d choose, then check the answer and explanation under each one.

1. A startup is building a prototype that summarizes news articles into three key takeaways.
Answer: Prompt Engineering
No specialized domain knowledge required, and rapid iteration is valuable. A well-designed prompt like “Summarize this article into 3 bullet points focusing on cause, effect, and impact” typically suffices.
2. A software company is creating an AI-powered tool that automates the creation of user stories and acceptance criteria from an SOW.
Answer: Fine-Tuning
Converting SOWs to user stories requires understanding of the company’s specific project management methodology, terminology, and formatting standards. Training on historical SOWs and their corresponding user stories ensures consistent output that matches the team’s workflow.
3. A healthcare company is developing a voice assistant for doctors to record and summarize patient visits.
Answer: Fine-Tuning
Medical applications demand precision, regulatory compliance, and reliable terminology. Training on transcribed clinical data is essential.
4. A restaurant wants to create seasonal menu descriptions that highlight ingredients and cooking methods.
Answer: Prompt Engineering
Menu descriptions can be effectively generated using prompts that specify ingredients, cooking style, and desired tone without requiring specialized culinary training data.
5. An e-commerce site needs SEO-optimized product descriptions.
Answer: Prompt Engineering
Templates like “Write a 3-sentence product description for [product], incorporating keywords [X, Y, Z]” provide flexibility across diverse products.
6. An insurance company needs to process claims and generate settlement letters following their specific approval workflows.
Answer: Fine-Tuning
Insurance settlement processes involve company-specific workflows, legal language, and approval criteria that require training on the firm’s historical claims data.
7. A gaming company’s application helps writers generate character names based on genre and personality traits.
Answer: Prompt Engineering
Creative tasks respond well to prompts like “Generate 5 fantasy-style names for a shy, clever elf” without requiring training data.
8. A logistics company is deploying a model on devices without internet connectivity (warehouses, vehicles).
Answer: Fine-Tuning
Offline deployment requires a self-contained model. Fine-tune a compact LLM optimized for edge computing.
9. A law firm needs a system that translates Spanish legal questions to English and summarizes them for US attorneys.
Answer: Fine-Tuning
The combination of legal terminology and bilingual requirements benefits from specialized training, especially when consistency is critical.
10. A fitness app wants to create motivational workout descriptions for different exercise routines.
Answer: Prompt Engineering
Workout descriptions can be generated using prompts that specify exercise types, intensity levels, and motivational tone without requiring specialized fitness training data.

Conclusion

The key to successful AI implementation lies in mapping your AI strategy to the requirements of your app. Prompt engineering excels when you need flexibility, speed, and general capabilities, while fine-tuning is preferable when precision, consistency, and domain expertise are required.

Written by Annie Cushing · Categorized: AI
