Lesson 5: Integration Patterns — APIs, RAG, and Fine-Tuning
Course: Enterprise AI Implementation Guide | Lesson 5 of 6
What You'll Learn
By the end of this lesson, you will be able to:
- Evaluate the three core AI integration patterns and select the right one for a given use case
- Design production-ready architectures for API wrapping, RAG, and fine-tuning
- Calculate the cost tradeoffs between patterns at different usage scales
- Avoid the five integration mistakes behind most enterprise AI project failures
Prerequisites
Before starting this lesson, make sure you've completed:
- Lesson 3: Building Your AI Team — your team structure determines which patterns you can support
- Lesson 4: Data Strategy — your data readiness score determines which patterns are viable
Or have equivalent experience with:
- Building applications that call external APIs
- Basic understanding of how large language models work
The Integration Decision That Determines Everything Else
You've built the business case. You've hired the team. Your data is ready. Now comes the decision that will determine your project's timeline, cost, and ceiling: how do you actually connect AI to your systems?
There are three core patterns. Every enterprise AI deployment uses one or a combination of them. The wrong choice doesn't just waste money — it locks you into an architecture that's expensive to undo. Menlo Ventures reports that enterprises spent $37 billion on AI in 2025, a 3.2x increase from 2024. A significant portion of that spend went to rebuilding integrations that started with the wrong pattern.
Here's the reality most vendors won't tell you: 76% of enterprise AI use cases in 2025 were solved by buying API access to existing models, not building custom ones. The pattern that requires the least engineering almost always wins — unless you have a specific, documented reason to go deeper.
Pattern 1: API Wrapping (Prompt Engineering)
What it is: You send text to a hosted model (Claude, GPT, Gemini) via API and get a response back. Your application logic lives in the prompt — the model itself stays unchanged.
Architecture:
User Input → Your Application → Prompt Template → LLM API → Parse Response → Action
That's it. No training. No infrastructure. No GPU clusters. You write a prompt, call an API, and handle the response.
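The whole pattern fits in a few dozen lines. Here is a minimal sketch, assuming a generic JSON-over-HTTP chat endpoint; the endpoint URL, model name, and auth header are placeholders (providers differ — for example, some use an `x-api-key` header instead of `Authorization`), so substitute your provider's actual values:

```python
# Minimal API-wrapping sketch: all application logic lives in the
# prompt template; the hosted model itself is never modified.
import json
import urllib.request

PROMPT_TEMPLATE = (
    "You are a support assistant for {company}.\n"
    "Classify the ticket below as one of: billing, technical, other.\n"
    "Ticket: {ticket}\n"
    "Answer with the category only."
)

def build_request(company: str, ticket: str, model: str = "example-model") -> dict:
    """Assemble the JSON payload sent to the hosted LLM API."""
    return {
        "model": model,
        "max_tokens": 50,
        "messages": [
            {"role": "user",
             "content": PROMPT_TEMPLATE.format(company=company, ticket=ticket)},
        ],
    }

def call_llm(payload: dict, url: str, api_key: str) -> str:
    """POST the payload and return the raw response body (network call).
    Header names here are illustrative; check your provider's docs."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

payload = build_request("Acme Corp", "I was charged twice this month.")
print(payload["messages"][0]["content"].splitlines()[0])
```

Changing the application's behavior means editing `PROMPT_TEMPLATE`, not retraining anything — which is exactly why this pattern ships in days.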
When it works:
- The task is general-purpose (summarization, classification, content generation, translation)
- You don't need specialized domain knowledge beyond what's in the prompt
- Your volume is under 10 million tokens per month
- You need something working in days, not months
When it doesn't:
- The model needs to know things that aren't in the prompt or conversation (your company's products, internal policies, customer history)
- You need consistent formatting that the model can't reliably produce with prompting alone
- Per-query costs at your volume exceed what a self-hosted model would cost
Real costs: Claude Sonnet runs $3/$15 per million tokens (input/output). For a customer support bot handling 1,000 queries per day with 500-token prompts and 200-token responses: roughly $135/month in API costs ($45 for input tokens, $90 for output). At 10,000 queries per day, that's $1,350/month, still cheaper than one junior engineer's monthly salary.
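The cost math is worth parameterizing so you can rerun it for your own volumes. A minimal estimator, using the per-million-token rates stated above:

```python
# Monthly API-cost estimate at $3 / $15 per million input / output tokens.
def monthly_api_cost(queries_per_day: int,
                     input_tokens: int,
                     output_tokens: int,
                     in_price_per_m: float = 3.0,
                     out_price_per_m: float = 15.0,
                     days: int = 30) -> float:
    """Return estimated monthly API spend in dollars."""
    queries = queries_per_day * days
    cost_in = queries * input_tokens / 1_000_000 * in_price_per_m
    cost_out = queries * output_tokens / 1_000_000 * out_price_per_m
    return cost_in + cost_out

print(monthly_api_cost(1_000, 500, 200))   # -> 135.0
print(monthly_api_cost(10_000, 500, 200))  # -> 1350.0
```

Note that output tokens dominate the bill at these rates even though responses are shorter — a reason to constrain response length in the prompt.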
The trap: Teams that start here often stay here too long. When the prompt grows past 2,000 tokens of context-stuffing (pasting documents, examples, and rules into every request), you've outgrown API wrapping. You need RAG.
Pattern 2: RAG (Retrieval-Augmented Generation)
What it is: Before sending a query to the LLM, you search your own knowledge base for relevant information and include it in the prompt. The model generates answers grounded in your data.
Architecture (production-grade):
User Query
→ Query Transformation (reformulate for better retrieval)
→ Hybrid Search (vector similarity + keyword matching)
→ Re-ranking (score and filter results)
→ Context Assembly (combine top results with system prompt)
→ LLM Generation (with citations)
→ Guardrails Check (hallucination, toxicity, PII)
→ Response with Sources
Notice this is 7 steps, not 2. The gap between a demo RAG system and a production RAG system is where most enterprise projects die.
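To make the retrieval and context-assembly steps concrete, here is a toy end-to-end skeleton. Word-overlap scoring stands in for the hybrid vector-plus-keyword search a real system would use, and the document store and prompt wording are illustrative only:

```python
# Toy RAG skeleton: retrieve -> assemble context -> (send to LLM).
def retrieve(query: str, docs: dict[str, str], top_k: int = 2) -> list[str]:
    """Score each document by word overlap with the query; return top_k ids.
    A production system would use embeddings + keyword search + re-ranking."""
    q_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda doc_id: len(q_words & set(docs[doc_id].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def assemble_context(query: str, docs: dict[str, str], top_ids: list[str]) -> str:
    """Combine retrieved chunks with the user query, tagging each source
    so the model can cite it by id."""
    chunks = "\n".join(f"[{i}] {docs[i]}" for i in top_ids)
    return (f"Answer using only the sources below and cite them by id.\n"
            f"{chunks}\nQuestion: {query}")

docs = {
    "policy-7": "Refunds are issued within 14 days of purchase.",
    "policy-9": "Enterprise plans include 24/7 phone support.",
}
ids = retrieve("how long do refunds take", docs)
prompt = assemble_context("How long do refunds take?", docs, ids)
print(ids[0])  # -> policy-7
```

Every production step missing here — query transformation, re-ranking, guardrails — slots in between these two functions, which is why the gap between demo and production is so wide.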
When it works:
- Your knowledge changes frequently (daily policy updates, new products, changing regulations)
- You need answers grounded in specific documents with citations
- Compliance requires traceability — you must show where every answer came from
- Your corpus is under 100,000 documents (for most retrieval systems)
When it doesn't:
- The task requires reasoning patterns the base model can't do (domain-specific logic, specialized output formats)
- Your documents are highly technical and the model consistently misinterprets terminology
- Retrieval quality can't reach 85%+ accuracy on your data (some domains have documents too similar to disambiguate)
The numbers: The RAG market hit $1.92 billion in 2025 and is growing at 40% annually. 70% of enterprises using generative AI now use some form of RAG. It reduces hallucinations by 70-90% compared to base models. Typical response times: 1.2-2.5 seconds including retrieval.
Real costs: Vector database hosting ($70-$300/month on Pinecone, Weaviate, or Qdrant), embedding generation ($0.10-$0.50 per million tokens), plus LLM API costs. A European bank using Squirro's RAG platform saved EUR 20M over 3 years by automating compliance document review — ROI achieved in 2 months.
The production gap: 80% of enterprise RAG projects experience critical failures. The top reasons:
- Chunking destroys context — A compliance clause retrieved without its governing condition is worse than no retrieval at all
- Stale indexes — Documents updated but the vector index wasn't refreshed
- Noise dilution — Irrelevant chunks fill the context window, pushing useful information out
- No evaluation pipeline — Teams ship RAG without measuring retrieval accuracy, then wonder why answers are wrong
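The last failure mode is the cheapest to fix. A minimal retrieval-accuracy check needs only a labeled set of queries: for each one, what fraction of the top-k chunks returned is actually relevant? The retriever output and relevance labels below are made-up placeholders for your own:

```python
# Minimal top-k retrieval-accuracy evaluation.
def topk_accuracy(results: dict[str, list[str]],
                  relevant: dict[str, set[str]],
                  k: int = 5) -> float:
    """Average fraction of relevant chunks among the top-k, per query."""
    scores = []
    for query, chunk_ids in results.items():
        top = chunk_ids[:k]
        hits = sum(1 for c in top if c in relevant[query])
        scores.append(hits / len(top))
    return sum(scores) / len(scores)

# Hypothetical retriever output and expert-labeled relevant chunks.
results = {
    "refund window?":  ["c1", "c4", "c9", "c2", "c7"],
    "support hours?":  ["c3", "c5", "c1", "c8", "c6"],
}
relevant = {
    "refund window?": {"c1", "c2", "c9"},
    "support hours?": {"c3", "c5"},
}
print(topk_accuracy(results, relevant))  # -> 0.5
```

Run this on 200+ labeled queries before touching any model parameter; it also catches stale-index and noise-dilution problems early.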
Pattern 3: Fine-Tuning
What it is: You train an existing model on your data so it learns your domain's patterns, vocabulary, and reasoning. The model itself changes — it internalizes knowledge rather than receiving it at query time.
Architecture:
Training Data (1,000-100,000+ examples)
→ Data Cleaning & Formatting
→ Base Model Selection
→ Training Run (hours to days on GPU)
→ Evaluation (accuracy, latency, cost)
→ Deployment (self-hosted or managed)
→ Monitoring & Retraining Pipeline
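The "Data Cleaning & Formatting" step is where 40-60% of the effort goes. A sketch of one common shape — validating examples and emitting prompt/completion JSONL — follows; field names and minimum-length rules vary by fine-tuning provider, so check your platform's expected schema before reusing this layout:

```python
# Validate raw training examples and emit prompt/completion JSONL.
import json

def to_jsonl(examples: list[dict], min_len: int = 10) -> str:
    """Drop malformed or too-short examples; return JSONL text."""
    lines = []
    for ex in examples:
        prompt, completion = ex.get("prompt", ""), ex.get("completion", "")
        if len(prompt) < min_len or not completion:
            continue  # skip low-quality rows rather than train on them
        lines.append(json.dumps({"prompt": prompt.strip(),
                                 "completion": completion.strip()}))
    return "\n".join(lines)

raw = [
    {"prompt": "Summarize the discharge note: patient stable, afebrile ...",
     "completion": "Patient discharged in stable condition."},
    {"prompt": "bad", "completion": "too short a prompt to keep"},
]
print(len(to_jsonl(raw).splitlines()))  # -> 1
```

Silently dropping bad rows (and counting them) matters more than it looks: quality beats quantity, and a validation pass like this is how 1,000 curated examples stay curated.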
When it works:
- Your domain has stable, specialized knowledge that rarely changes (medical coding, legal citation formats, financial regulations)
- You need a specific output format or reasoning pattern that prompting can't reliably produce
- Your volume exceeds 10 million tokens per month (the break-even point where fine-tuning becomes cheaper than API calls)
- You need sub-100ms latency (smaller fine-tuned models are faster than large API models)
When it doesn't:
- Your knowledge changes weekly or monthly (fine-tuning can't keep up; use RAG)
- You have fewer than 1,000 high-quality training examples
- You need the model to cite sources (fine-tuned models generate from learned patterns, not from retrieved documents)
- Your team doesn't include an ML engineer who can manage training and evaluation
Real costs: Fine-tuning GPT-4 costs roughly $0.008 per 1,000 training tokens. A 10,000-example dataset with 500 tokens each: about $40 per training run. But the hidden costs are data preparation (40-60% of total effort), evaluation infrastructure, and retraining every quarter. Total first-year cost for a production fine-tuned model: $50K-$500K depending on scale and complexity.
The hard truth: 77% of enterprises using open-source models choose models with 13 billion parameters or fewer. Not because smaller models are better — because the operational cost of running and maintaining larger models exceeds their value for most use cases. Fine-tune the smallest model that meets your accuracy threshold.
The Decision Framework
Stop thinking about which pattern is "best." Think about which constraints matter most for your use case.
| Constraint | API Wrapping | RAG | Fine-Tuning |
|---|---|---|---|
| Time to production | Days | 2-8 weeks | 2-6 months |
| Setup cost | Near zero | $10K-$50K | $50K-$500K |
| Monthly run cost (10K queries/day) | $1,350 | $500-$1,500 | $200-$800 |
| Knowledge freshness | Real-time (in prompt) | Hours to days | Months (retrain cycle) |
| Citation / traceability | No | Yes | No |
| Domain specialization | Low | Medium | High |
| Team required | 1 engineer | 2-3 engineers | 3-5 engineers + ML |
| Data requirement | None | Documents exist | 1,000+ labeled examples |
The decision tree:
- Can you solve it with a good prompt? Start there. Test with 50 real queries. If accuracy exceeds 85%, ship it.
- Does the model need to know things it doesn't? Add RAG. Your data readiness score from Lesson 4 must be 16+ on the dimensions that matter.
- Does the model need to behave differently? Fine-tune. But only after proving RAG isn't enough with measurable evaluation.
- Need both knowledge and behavior? Use hybrid — fine-tune for behavior (tone, format, reasoning) and RAG for knowledge (facts, documents, current data).
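The decision tree above can be written down as a function. The thresholds mirror the ones in the text (85% prompt accuracy, data-readiness score of 16+); treat them as starting defaults, not hard rules:

```python
# The pattern-selection decision tree as code.
def choose_pattern(prompt_accuracy: float,
                   needs_external_knowledge: bool,
                   needs_new_behavior: bool,
                   data_readiness: int = 0) -> str:
    """Return the simplest integration pattern that fits the constraints."""
    # 1. A good prompt alone clears the bar: ship API wrapping.
    if (prompt_accuracy >= 0.85
            and not needs_external_knowledge
            and not needs_new_behavior):
        return "api_wrapping"
    # 4. Both knowledge and behavior gaps: hybrid.
    if needs_external_knowledge and needs_new_behavior:
        return "hybrid"
    # 2. Knowledge gap only: RAG, if the data is ready for it.
    if needs_external_knowledge:
        return "rag" if data_readiness >= 16 else "fix_data_first"
    # 3. Behavior gap only: fine-tune.
    if needs_new_behavior:
        return "fine_tuning"
    return "api_wrapping"

print(choose_pattern(0.9, False, False))     # -> api_wrapping
print(choose_pattern(0.6, True, False, 18))  # -> rag
print(choose_pattern(0.6, True, True, 18))   # -> hybrid
```

The `fix_data_first` branch is deliberate: a knowledge gap plus unready data is a data problem, not an architecture problem.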
The Hybrid Pattern: What Production Systems Actually Use
In practice, the best enterprise AI systems combine patterns. Pure fine-tuning without retrieval produces confidently wrong answers when facts change. Pure RAG without behavioral training produces generic-sounding responses with poor formatting.
The hybrid architecture:
User Query
→ Fine-Tuned Model (understands your domain's vocabulary, output format, reasoning)
→ RAG Pipeline (retrieves current facts, policies, documents)
→ Combined Generation (domain behavior + current knowledge)
→ Response
A healthcare company fine-tuned a model on 50,000 clinical notes to learn medical terminology and report formatting. They added RAG to retrieve current drug interaction databases and treatment protocols. The fine-tuned model alone hallucinated drug names 12% of the time. With RAG, hallucination dropped to under 1%.
When hybrid is worth the complexity:
- Your domain has both stable patterns (how to structure a medical report) AND changing facts (current drug protocols)
- Volume exceeds 50,000 queries per month (justifies the infrastructure cost)
- You have both labeled training data AND a document corpus
- Accuracy requirements exceed 95%
Five Integration Mistakes That Kill Projects
Mistake 1: Fine-Tuning for Knowledge
Teams fine-tune models on company documentation hoping the model will "memorize" the content. It doesn't work. Models learn patterns, not facts. A fine-tuned model that learned from your Q1 policy documents will confidently cite Q1 policies even after Q3 updates. Use RAG for knowledge. Use fine-tuning for behavior.
Mistake 2: Building RAG Before Testing Prompting
A financial services company spent 3 months building a RAG system for internal Q&A. When they finally tested, they discovered that 70% of the questions could be answered by a well-crafted prompt with a few examples. The RAG infrastructure added latency and maintenance burden for marginal improvement on those queries. Always benchmark the simpler pattern first.
Mistake 3: Ignoring Retrieval Quality
Teams obsess over the LLM choice and ignore retrieval. If your retrieval returns irrelevant documents, it doesn't matter how good the model is. Before tuning any model parameter, measure your retrieval accuracy: of the top 5 chunks returned, how many are actually relevant to the query? If it's under 3 out of 5, fix retrieval first.
Mistake 4: No Evaluation Pipeline
42% of companies abandoned AI initiatives in 2025, up from 17% in 2024. The leading cause: no way to measure whether the system was working. Before building any integration, define your evaluation:
- Accuracy: percentage of correct responses on a held-out test set of 200+ queries
- Latency: 95th percentile response time
- Cost: per-query cost at projected volume
- Freshness: time between knowledge update and system availability
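Two of the four metrics above — held-out accuracy and p95 latency — reduce to a few lines once you log (expected, actual) pairs and per-request timings. The sample data is made up for illustration:

```python
# Held-out accuracy and 95th-percentile latency from logged results.
def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (expected, actual) pairs that match exactly."""
    return sum(1 for exp, act in pairs if exp == act) / len(pairs)

def p95_latency(latencies: list[float]) -> float:
    """95th-percentile latency in seconds, nearest-rank method."""
    ordered = sorted(latencies)
    rank = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[rank]

answers = [("billing", "billing"), ("technical", "technical"),
           ("other", "billing"), ("billing", "billing")]
print(accuracy(answers))                            # -> 0.75
print(p95_latency([0.8, 1.1, 0.9, 1.4, 2.3, 1.0])) # -> 2.3
```

Per-query cost comes from the same pricing arithmetic shown earlier, and freshness is a timestamp diff between a document update and its appearance in the index — all four belong in one dashboard before launch.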
Mistake 5: Over-Architecting from Day One
You don't need agentic RAG with multi-model orchestration for your first deployment. Start with the simplest pattern that clears your accuracy threshold. Add complexity only when you hit measured limitations. The companies that ship AI into production fastest are the ones that resist the urge to build the "ultimate" system.
Exercise: Pattern Selection for Your Use Case
Put your learning into practice:
Task: Take the AI use case you defined in Lesson 2 and select the right integration pattern.
Steps:
- Write down the 5 constraints that matter most (from the decision framework table)
- Score each pattern (1-5) against those constraints
- Test the simplest viable pattern: write 10 representative queries and evaluate responses
- Document your decision with evidence: "We chose [pattern] because [measured result on constraint]"
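Step 2 of the exercise can be captured as a weighted scoring matrix. The constraint weights and 1-5 scores below are a worked example, not a recommendation — replace them with your own:

```python
# Weighted pattern-scoring matrix for the selection exercise.
def best_pattern(scores: dict[str, dict[str, int]],
                 weights: dict[str, float]) -> str:
    """Return the pattern with the highest weighted constraint score."""
    totals = {
        pattern: sum(weights[c] * s for c, s in per_constraint.items())
        for pattern, per_constraint in scores.items()
    }
    return max(totals, key=totals.get)

# Example: time-to-production matters most, traceability least.
weights = {"time_to_prod": 3.0, "run_cost": 2.0, "traceability": 1.0}
scores = {
    "api_wrapping": {"time_to_prod": 5, "run_cost": 4, "traceability": 1},
    "rag":          {"time_to_prod": 3, "run_cost": 3, "traceability": 5},
    "fine_tuning":  {"time_to_prod": 1, "run_cost": 4, "traceability": 1},
}
print(best_pattern(scores, weights))  # -> api_wrapping
```

Keep the filled-in matrix in your decision document; it is the evidence line in "We chose [pattern] because [measured result on constraint]".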
Expected Outcome: A documented pattern selection with test results showing accuracy on real queries.
Time Required: 1-2 days (includes testing with real queries)
Key Takeaways
- Start with the simplest pattern: API wrapping solves 76% of enterprise use cases. Test it first — if accuracy exceeds 85% on real queries, ship it. Don't over-engineer.
- RAG is for knowledge, fine-tuning is for behavior: This distinction prevents the most expensive integration mistake. Need current facts? RAG. Need domain-specific reasoning or formatting? Fine-tune. Need both? Hybrid.
- Measure retrieval before tuning models: In RAG systems, retrieval quality determines output quality. If your top-5 retrieval accuracy is under 60%, no model improvement will fix the problem.
- Build evaluation before building integration: Define accuracy, latency, cost, and freshness metrics before writing any integration code. Without measurement, you can't distinguish progress from waste.
Quick Reference
| Concept | Definition | Example |
|---|---|---|
| API Wrapping | Sending prompts to a hosted model via API; no model modification | Customer support bot using Claude API with prompt templates |
| RAG | Retrieving relevant documents before generation; grounds answers in your data | Compliance Q&A that searches 10,000 policy documents |
| Fine-Tuning | Training an existing model on your data to learn domain patterns | Medical report generator trained on 50,000 clinical notes |
| Hybrid | Combining fine-tuning (behavior) with RAG (knowledge) | Healthcare AI with learned formatting + current drug databases |
| Retrieval Accuracy | Percentage of retrieved chunks relevant to the query | 4 out of 5 top chunks are relevant = 80% accuracy |
Up Next
In Lesson 6: Pilot to Production — Deploying and Monitoring Your AI System, we'll cover:
- How to scope a pilot that proves value in 6-8 weeks
- The deployment architecture that scales from pilot to production
- Monitoring and observability for AI systems
- When to pull the plug on a pilot vs. when to push through
FAQ
Should we start with RAG or fine-tuning for our first AI project?
Start with neither. Start with API wrapping (prompt engineering). Most teams overestimate the complexity they need. Test your use case with a well-crafted prompt and 50 real queries. Measure accuracy. If it hits 85%+, ship it. If the model needs company-specific knowledge to answer correctly, add RAG. If it needs to reason or format outputs in domain-specific ways that prompting can't achieve, then consider fine-tuning. The pattern that ships fastest and meets your accuracy threshold is the right choice.
How much data do we need for fine-tuning?
For behavioral fine-tuning (output format, tone, reasoning patterns): 1,000-5,000 high-quality examples is enough for most commercial LLMs. For deep domain specialization (medical coding, legal citation): 10,000-50,000 examples with expert-validated labels. Quality matters more than quantity — 1,000 carefully curated examples outperform 50,000 noisy ones. Budget 40-60% of your fine-tuning timeline for data preparation and validation, not training.
What's the typical cost difference between RAG and fine-tuning at enterprise scale?
At 10,000 queries per day: RAG costs $500-$1,500/month (vector database + embeddings + API calls). Fine-tuning costs $200-$800/month in inference (cheaper per query) plus $50K-$100K upfront for data preparation and initial training. RAG is 40% cheaper in Year 1. Fine-tuning becomes cheaper after roughly 18 months for high-volume, stable use cases. For most enterprises starting their first AI project, RAG offers better economics and faster iteration. Fine-tuning only wins when you have stable requirements and volume that justifies the upfront investment.
Need help with AI implementation?
We build production AI systems that actually ship. Not demos, not POCs—real systems that run your business.
Get in Touch