Fine-tuning is the process of taking a pre-trained large language model and further training it on a smaller, domain-specific dataset to specialize its behavior, knowledge, or output style. The model weights are updated to reflect the new training data, creating a customized version of the base model.
Pre-trained LLMs are generalists: they know a lot about many topics but are not experts in any specific domain. Fine-tuning narrows this generality:
1. **Start with a pre-trained model** (e.g., GPT-5, Llama) that already understands language
2. **Provide domain-specific training examples**, typically hundreds to thousands of input-output pairs showing desired behavior
3. **Train for a few epochs**: the model adjusts its weights to perform better on your specific task
4. **Result**: A specialized model that retains general language ability but excels at your domain
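Step 2 is mostly data preparation. A minimal sketch of what that looks like, assuming the chat-style JSONL format that several fine-tuning APIs accept (the support examples and file name are illustrative, not from a real dataset):

```python
import json

# Hypothetical input-output pairs: each record is one training example
# in the chat-style JSONL format used by several fine-tuning APIs.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a formal, concise support agent."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "We are pleased to assist. Your order ships within 2 business days."},
    ]},
]

# Write one JSON object per line (JSONL).
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity check: every line must parse and end with an assistant turn,
# since that final turn is what the model learns to produce.
with open("train.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        assert rec["messages"][-1]["role"] == "assistant"
```

The resulting file is what you upload to the provider's training endpoint; exact upload and job-creation calls vary by provider.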
Fine-tuning is commonly used for: adapting tone and style (matching brand voice), teaching specific output formats (JSON, structured responses), improving performance on niche domains (medical, legal, financial), and reducing latency by using smaller fine-tuned models instead of larger general ones.
In practice, fine-tuning should be evaluated by what it changes in the support workflow. Ask whether it improves answer accuracy, reduces repeated agent work, clarifies handoff decisions, or makes reporting easier. If the answer is only "it sounds modern," the concept is not yet operational.
A concrete example is brand voice adaptation: A luxury brand fine-tunes a model on 5,000 examples of their customer communications to match their formal, elegant tone. The fine-tuned model consistently produces responses in the brand voice without needing extensive tone instructions in every prompt, reducing token usage and latency.
The simplest takeaway: fine-tuning further trains a pre-trained model on domain-specific data to specialize its behavior.
Use fine-tuning when you need consistent behavior changes (tone, style, format) rather than factual knowledge updates. Fine-tuning is better for teaching the model how to respond, while RAG is better for providing what to respond with. Most customer support use cases are better served by RAG or a combination of both.
Effective fine-tuning typically requires 500-5,000 high-quality input-output examples. More data generally improves results, but quality matters more than quantity. Poorly curated training data produces a model that is confidently wrong, which is worse than the base model.
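Because data curation is where quality is won or lost, even basic automated checks pay off. A sketch of the idea, assuming input-output pairs as simple tuples (the filter thresholds are illustrative and should be tuned to your data):

```python
def curate(examples, min_len=10, max_len=4000):
    """Drop duplicate and degenerate input-output pairs before training.

    examples: list of (prompt, completion) tuples.
    min_len / max_len: completion-length bounds; values are illustrative.
    """
    seen = set()
    kept = []
    for prompt, completion in examples:
        key = (prompt.strip().lower(), completion.strip().lower())
        if key in seen:
            continue  # exact duplicate teaches nothing new
        if not (min_len <= len(completion) <= max_len):
            continue  # too short to teach anything, or a runaway completion
        seen.add(key)
        kept.append((prompt, completion))
    return kept

raw = [
    ("Where is my order?", "It ships within 2 business days."),
    ("Where is my order?", "It ships within 2 business days."),  # duplicate
    ("Refund?", "ok"),  # degenerate completion
]
print(curate(raw))  # keeps only the first example
```

Real curation also involves human review for factual correctness, which no length filter can catch; this only removes the mechanically bad cases.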
A ticketing system fine-tunes a model to always output responses in a specific JSON format with fields for category, priority, summary, and suggested_action. The fine-tuned model produces valid JSON 99.5% of the time vs 85% for the base model with prompt-only instructions.
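Even at 99.5% validity, production code still needs a guardrail for the remaining failures. A minimal validator for the fields named above (the field names come from the example; the parsing logic is a sketch, not a particular ticketing system's implementation):

```python
import json

# Exactly the fields the fine-tuned model was trained to emit.
REQUIRED = {"category", "priority", "summary", "suggested_action"}

def parse_ticket(raw: str):
    """Return the parsed ticket dict, or None if the model output is not
    valid JSON with exactly the expected fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != REQUIRED:
        return None
    return data

good = ('{"category": "billing", "priority": "high", '
        '"summary": "Double charge", "suggested_action": "Refund duplicate payment"}')
bad = "Sure! Here is the JSON: {category: billing}"
assert parse_ticket(good) is not None
assert parse_ticket(bad) is None
```

On a `None` result the caller can retry the request or fall back to a default routing rule, so the 0.5% of malformed outputs never reach the ticket queue unparsed.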
Fine-tuning costs include training compute ($50-$500+ per training run depending on model size and data volume), hosting the fine-tuned model ($100-$1,000+/month for inference), and data curation time (often the largest hidden cost). RAG on a base model is typically 5-10x cheaper for customer support use cases.
Not all LLMs support fine-tuning. OpenAI offers fine-tuning for GPT-4o and GPT-4o-mini. Open-source models like Llama and Mistral can be fine-tuned freely. Anthropic Claude and Google Gemini have more limited fine-tuning access. Check each provider for current availability and pricing.
Fine-tuning is taking a model that already understands language broadly and giving it extra practice on your specific examples so it gets better at your particular task. Think of it like sending a generalist new hire through onboarding focused on your products and tone before they handle real customer messages.
Fine-tuning is best for shaping how the model responds: matching brand voice, locking in a structured output format (JSON, fixed sections), or specializing on a narrow domain like medical or legal language. It is a poor fit for keeping the model up to date with changing facts; that job belongs to RAG.