Every business owner we talk to has tried ChatGPT. Most of them are impressed. Some of them have built internal tools around it. A few have discovered its limits the hard way. Here's an honest breakdown of when the generic model is enough — and when it isn't.
Generic AI is genuinely good
Let's be clear: GPT-4, Claude, Gemini — they're remarkable. For drafting emails, summarising documents, writing code in common languages, answering general questions — they perform at a level that would have seemed impossible five years ago. If your use case is general-purpose, using one of these APIs is absolutely the right call.
The problem arises when your use case is specific.
Where generic models fall short
- Domain knowledge gaps. A general model knows a little about everything. It does not have deep, reliable knowledge of your industry's regulations, your product catalogue, or your internal processes. It will confidently fill those gaps with plausible-sounding fiction.
- Inconsistent output format. You ask for JSON, you get JSON, until you don't. Prompt engineering can improve consistency but cannot guarantee it, so any downstream system that parses model output will eventually hit a malformed response and break.
- Cost at scale. Hosted models like GPT-4 bill per token, so cost grows linearly with volume. A fine-tuned 7B model running on a single GPU costs a fraction of that at the same volume, often 10–20× cheaper after the initial training investment.
- Privacy. Sending your customer data, legal documents, or financial records to a third-party API is a compliance risk. A self-hosted fine-tuned model keeps your data on your infrastructure.
The core problem: generic models are optimised to be helpful across every possible topic. That breadth is exactly what makes them unreliable for specific ones.
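To make the output-format point concrete, here is a minimal sketch of the guard a downstream system ends up needing when it parses model output. The field names `answer` and `source` are hypothetical, chosen only for illustration; substitute whatever schema your own pipeline expects.

```python
import json

# Hypothetical schema for illustration; replace with your own required fields.
REQUIRED_FIELDS = ("answer", "source")


def parse_model_output(raw: str) -> dict:
    """Parse a model response, rejecting anything that is not the expected JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Typical failure: the model wrapped the JSON in chatty prose.
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    return data
```

A check like this does not make the model more consistent. It only turns a silent parsing failure into an explicit error you can log, retry, or count.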
When fine-tuning makes sense
Fine-tuning is worth the investment when at least two of these are true:
- You need structured, predictable output — specific JSON fields, citations, section numbers
- Your domain has specialised terminology the base model doesn't know reliably — legal, medical, financial, or proprietary
- You're making more than a few hundred API calls per day — cost savings start to compound
- You need the model to run offline or on private infrastructure
- Prompt engineering alone isn't giving you the consistency you need
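The call-volume criterion is really a break-even question, and it can be roughed out in a few lines. The sketch below assumes a simple model: per-token API billing on one side, a one-off training cost plus a flat monthly hosting bill on the other. Every number in it is a placeholder for illustration, not a real price or a figure from our projects.

```python
def monthly_api_cost(calls_per_day: float, tokens_per_call: float,
                     price_per_1k_tokens: float, days: int = 30) -> float:
    """Per-token billing: cost scales linearly with call volume."""
    return calls_per_day * days * tokens_per_call / 1000 * price_per_1k_tokens


def breakeven_months(training_cost: float, monthly_api: float,
                     monthly_hosting: float):
    """Months until the one-off training cost is recovered.

    Returns None when self-hosting is not cheaper per month, in which
    case the training investment is never recovered on cost alone.
    """
    monthly_saving = monthly_api - monthly_hosting
    if monthly_saving <= 0:
        return None
    return training_cost / monthly_saving


# Placeholder figures for illustration only.
api = monthly_api_cost(calls_per_day=500, tokens_per_call=1000,
                       price_per_1k_tokens=0.03)
months = breakeven_months(training_cost=2000, monthly_api=api,
                          monthly_hosting=50)
print(f"API: ${api:.2f}/month, break-even after {months:.1f} months")
```

Plug in your own volumes and quotes. If `breakeven_months` returns `None`, cost alone does not justify fine-tuning and one of the other criteria has to carry the decision.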
A real example: legal Q&A
We built a legal AI assistant for Pakistan's Penal Code. The requirement was simple: a user types a legal question and gets back the relevant section number, section title, and punishment — every time, in a consistent format.
We tested the base Llama 3.2 8B model first. It knew about Pakistani law in a vague, general sense — but it hallucinated section numbers, mixed up punishments, and returned prose answers when we needed structured data. Prompt engineering helped marginally but never consistently.
After fine-tuning on a structured dataset of all 511 PPC sections using Unsloth LoRA, the model returned perfectly formatted section/title/punishment triples on every query. No hallucinations. No format deviations. We exported the model to GGUF and deployed it on Hugging Face Spaces, where the whole inference pipeline costs less than $5/month.
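For readers wondering what a "structured dataset" looks like in practice, here is a rough sketch of turning section records into instruction/response pairs stored as JSONL. The prompt wording, the `instruction`/`output` keys, and the sample records are all assumptions made for illustration; this is not the exact schema or template from our project, and the records are placeholders rather than real PPC text.

```python
import json


def to_training_example(section: dict) -> dict:
    """Build one instruction/response pair from a section record.

    The target is serialised JSON so the model learns to emit the same
    three fields, in the same order, every time.
    """
    target = json.dumps(
        {
            "section": section["section"],
            "title": section["title"],
            "punishment": section["punishment"],
        },
        ensure_ascii=False,
    )
    # Hypothetical prompt template; a real dataset would vary the phrasing.
    prompt = f"Which section covers this offence: {section['title']}?"
    return {"instruction": prompt, "output": target}


def to_jsonl(sections: list) -> str:
    """Serialise all examples as JSON Lines, one training example per line."""
    return "\n".join(
        json.dumps(to_training_example(s), ensure_ascii=False) for s in sections
    )


# Placeholder records, not real legal text.
sample = [
    {"section": "000", "title": "Placeholder offence A", "punishment": "Placeholder punishment A"},
    {"section": "001", "title": "Placeholder offence B", "punishment": "Placeholder punishment B"},
]
print(to_jsonl(sample))
```

The design choice that matters is that every target has an identical shape. Consistency in the training data is what buys consistency in the output.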
The honest trade-off
Fine-tuning takes time and expertise. You need a good dataset, a training pipeline, evaluation metrics, and somewhere to host the model. It's not an afternoon project. For a simple internal chatbot or a one-off summary task, a well-prompted GPT-4 call is probably faster and cheaper.
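"Evaluation metrics" can sound heavier than it is. For a structured-output task like the legal example, two numbers go a long way: how often the response parses into the expected fields, and how often those fields exactly match the reference. The sketch below is one simple way to compute both, assuming the model returns JSON with `section`, `title`, and `punishment` keys; it is an illustration, not the evaluation harness we used.

```python
import json

FIELDS = ("section", "title", "punishment")


def evaluate(predictions: list, references: list) -> dict:
    """Score raw model responses against reference records.

    format_valid_rate: share of responses that are JSON objects with all fields.
    exact_match_rate:  share of responses whose fields all equal the reference.
    Both rates are computed over the total number of references.
    """
    valid = 0
    exact = 0
    for raw, ref in zip(predictions, references):
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output counts as a format failure
        if not isinstance(pred, dict) or any(f not in pred for f in FIELDS):
            continue
        valid += 1
        if all(pred[f] == ref[f] for f in FIELDS):
            exact += 1
    total = len(references)
    if total == 0:
        return {"format_valid_rate": 0.0, "exact_match_rate": 0.0}
    return {"format_valid_rate": valid / total, "exact_match_rate": exact / total}
```

Run the same function on the base model and on the fine-tuned model with a held-out set of questions, and the trade-off stops being a matter of opinion.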
But if you're building a product feature that's central to your business, needs to run reliably at scale, and works in a specific domain — you're leaving quality, cost, and control on the table by not owning your model.
The question to ask
Not “can ChatGPT do this?” It probably can, loosely. Ask instead: “does it do this reliably enough to bet my product on?” If the answer is no, it's time to talk about a custom model.