AI Engineer

What I learned from two years of building with LLMs in production.

intermediate · last updated: Jul 1, 2026

My thoughts — read this first.

This roadmap is pulled from two years of my own experience building with LLMs in production. It's not a textbook list — it's what actually mattered. This might look overwhelming. You don't need to know 100% of it. But aim for at least 60–70% to be job-ready these days, and 70–80% to be competitive. Even as a fresher, companies expect a strong portfolio of real AI projects — RAG pipelines, LLM-integrated apps, agents — not just coursework. Build projects alongside learning; that's what gets you hired.

1. Python

The language you’ll live in. Master the native parts before anything else.

Functions & lambda functions — definitions, arguments, closures.
Classes & data classes — OOP, @dataclass, when to use which.
Dictionaries, lists & data types — the core built-ins you’ll reach for daily.
Decorators — how they wrap functions, writing your own.
Iteration & generators — lazy evaluation, yield, memory-efficient pipelines.
Async coroutines — async/await, the backbone of FastAPI and concurrent LLM calls.
Exception handling — try/except, and how exceptions propagate from child to parent.
Logging — the native logging module, different log levels, structured logging.
Date/time packages — datetime, timezones, formatting.
Regular expressions — pattern matching, validation.

2. Tooling & packaging

The non-language essentials around your code.

Modules & packaging — imports, virtual environments, pyproject.toml. Very important in production.
Pydantic — data validation and typed schemas; the backbone of structured LLM output and FastAPI models. Learn it well.
FastAPI — the most important web framework for AI engineers. Build APIs that serve LLMs, handle streaming, and expose structured endpoints.
Async — async/await in the context of FastAPI; concurrent LLM calls, non-blocking I/O, background tasks.
Hugging Face — the hub for models, datasets, and demos.

3. Git & GitHub

Version control you’ll use daily.

Branching — creating, switching, and managing branches.
Good commits — clear messages, atomic commits, telling a story.
Merge conflict resolution — fixing conflicts without losing work.
Projects on GitHub — having at least one or two good projects deployed on your profile is very important.

4. Production systems

Think beyond a single server from day one.

Main database — Postgres (recommended) or MySQL. Pick one and learn it well.
Redis — caching, semantic caching, and the different types of caching you’ll do in production.
Asynchronous workers & queues — background processing is essential for production AI apps:
- Celery — Python async task queue; runs jobs in the background.
- SQS — Amazon Simple Queue Service; managed queue for decoupling services.
- Redis as a queue — using Redis as a lightweight message broker.
- BullMQ — Node.js queue library; explore others on your own.
- Why it matters — async processing is critical for chunking in RAG systems, background analysis, and any long-running task that shouldn’t block the request.
Distributed systems mindset — never think from the perspective of a single server:
- Always assume multiple servers running at the same time.
- How data syncs between them when something fails or is dependent on them.
- Stateless servers — always aim for stateless; keep state in the database/cache, not in memory.

5. How LLMs actually work

Theory first — you don’t need to train one, but you need to understand what’s inside.

Transformers architecture — attention, encoders/decoders, how tokens flow through a model.
How can we use LLMs? — inference vs training, the API mental model.
Small language models — why smaller matters, when to reach for one. Models under 1 billion parameters are especially important for specific, narrow use cases (e.g., Gemma function-calling variants and similar small models). At least be aware of what’s available.
Quantization — at least the theory: shrinking models to fit on smaller hardware.
Types of model weights — full precision, FP16, INT8, INT4, GGUF, AWQ — what they mean and when each matters.
Benchmarks — what different benchmarks evaluate and why they matter:
- MMLU — broad multi-task language understanding across many subjects.
- HumanEval — code generation and functional correctness.
- GSM8K — grade-school math word problems; reasoning ability.
- GPQA — graduate-level science questions; deep reasoning.
- SWE-bench — real-world software engineering tasks; agentic coding ability.
- Chatbot Arena (LMSYS) — human-preference-based head-to-head model ranking.
- Know which benchmark measures what — don’t trust a single number.
Keeping up with the model landscape — always stay aware of the latest open-source and local models. For example, right now Gemma 4, GLM-5.2, Kimi K2.7 and others are in the news. You don’t need to track every model daily, but get in sync with the news and learn how different models differ from each other — that gives you an edge. Use paid platforms where possible for coding access to these models, or use OpenRouter to try many models through a single API.
Multimodal models — models that handle text, images, audio, and video; know how multimodal inference differs from text-only inference.
Inference hardware companies — know the companies doing research at the hardware level to accelerate LLM inference:
- Cerebras — wafer-scale chips (WSE-3) with massive on-chip SRAM; ultra-fast inference, partnered with OpenAI and AWS.
- Groq — LPU (Language Processing Unit); deterministic, compiler-driven execution for sub-100ms latency.
- SambaNova — SN-40 accelerator; high-bandwidth architecture for inference at scale.
- These matter because the hardware your inference runs on directly affects speed, cost, and latency.
Image generation models — stay aware of the latest image generation models if your target company or use case needs them (e.g., Nano Banana 2.0, HiDream-O1-Image, FLUX.2, Qwen-Image, Seedream 4.0). You don’t need to master all of them, but know what’s out there.
OCR models — if your use case involves document parsing or text extraction, know the latest OCR models (e.g., Nanonets OCR-3, PaddleOCR-VL, PP-OCRv6, DeepSeek-OCR). Specialized OCR models often outperform general VLMs at a fraction of the size.
TTS & STT models — text-to-speech and speech-to-text, both cloud APIs and local models. Know the major cloud providers and how to call them, plus what’s available to run locally:
- Cloud TTS/STT — OpenAI TTS, Google, Azure, ElevenLabs; know the major providers and their APIs.
- Local TTS — Chatterbox (open source, MIT licensed, zero-shot voice cloning from 5 seconds of audio, 23+ languages), Chatterbox Turbo (350M, low-latency for voice agents), Qwen3-TTS (multilingual with delivery instructions), Kokoro (82M, fast CPU inference).
- Local STT — Whisper (OpenAI, the standard, 100+ languages), Whisper Large V3 Turbo (809M, faster), Parakeet TDT (NVIDIA, ultra-low-latency streaming), Moonshine (245M, edge/mobile), Qwen3-ASR (52 languages, state-of-the-art open-source accuracy), Voxtral Mini (4B, real-time streaming).

6. Calling LLMs

The first real step: talk to a model.

OpenAI SDK — the openai Python package (or NPM package). The most important one to master.
REST APIs — raw HTTP calls to any provider.
- All HTTP methods — GET, POST, PUT, PATCH, DELETE; when to use which.
- Route naming conventions — consistent, predictable, RESTful paths.
- Swagger / OpenAPI — API documentation and specification; the basics every AI engineer should know.
Hosted inference — simple calls to LLMs hosted by OpenAI, Anthropic, AWS Bedrock, and Google.
OpenAI-compatible endpoints — the de-facto API shape most providers mirror.
Specifications of router endpoints — how OpenRouter and similar gateways route requests.
Local inference engines — running models on your own hardware:
- vLLM — production-grade multi-user serving on NVIDIA GPUs; PagedAttention, continuous batching, OpenAI-compatible API.
- SGLang — optimized for structured generation, agentic loops, and shared-prefix workloads like RAG.
- llama.cpp — C/C++ inference with maximal hardware coverage (CPU, Apple Silicon via Metal, NVIDIA via CUDA, AMD via ROCm, Vulkan); GGUF models.
- Ollama — wraps llama.cpp with one-command model management and a local OpenAI-compatible API server; great for local dev.
- LM Studio — GUI-first desktop app for running local models; good for non-CLI workflows and small-team serving.
- MLX — Apple Silicon-only inference framework (mlx-lm); max throughput on M1+ chips. Good to know, not necessary for production.

7. Prompting

How you ask is what you get.

How prompting works — the request/response loop, system vs user messages.
Different types of prompts — zero-shot, few-shot, chain-of-thought, role-based.
Temperature — randomness vs determinism.
TopK — limiting the candidate token pool.
TopP — nucleus sampling.

8. Response handling

Real apps don’t just print() the answer.

Streaming & non-streaming calls — when to use each.
Response styles for streaming — chunked text, delta formats.
Server-side events (SSE) — the transport behind most streaming LLM responses.

9. Context management

Context windows are finite and expensive. Learn to manage them well.

LLM, agents & other patterns — the common patterns for structuring LLM calls and pipelines.
Context compression methods — reducing the tokens you send without losing meaning.
Sliding window — keeping only the most recent relevant context.
Preserving context — different techniques for maintaining useful context across long conversations or multi-step agents.

10. Structured output & tool use

Make the model return data your code can trust.

Structured output — JSON schemas, Pydantic models as the contract.
Tool calling — letting the model invoke functions you define.
Thinking mode — extended reasoning, when to enable it.
Non-thinking mode — cheaper/faster paths when reasoning isn’t needed.

11. RAG & vector stores

Ground the model in your own data.

RAG system — retrieval-augmented generation end to end.
Embeddings & embedding models — how text becomes vectors, choosing the right embedding model, dimensions, and cost trade-offs:
- Cloud / managed — OpenAI text-embedding-3-small/large, Cohere Embed v4, Voyage AI voyage-3.5, Gemini Embedding 001.
- Open source / self-hosted — BGE-M3 (multilingual workhorse), Qwen3-Embedding-8B (#1 on MTEB multilingual), EmbeddingGemma-300M (on-device/edge).
- Pick based on volume, language coverage, and whether you self-host or use an API.
Re-ranking models — after retrieval, re-rank the candidates for better accuracy. Know the major ones:
- Cloud / managed — Cohere Rerank 4 Pro, Voyage Rerank 2.5, Jina Reranker v3.
- Open source / self-hosted — BGE-Reranker-v2-m3 (default open-source choice), BGE-Reranker-v2-Gemma (max quality, 9B), mxbai-rerank-large-v2, Qwen3 Reranker.
HyDE — Hypothetical Document Embeddings; generating synthetic documents to improve retrieval.
Chunking mechanisms — different ways to split documents for retrieval; when to use which.
Vector databases — what they are and why they matter. Learn whichever you want, but at least be aware of these:
- Open source options — self-hostable vector stores.
- pgvector — Postgres as a vector database.
- Chroma — lightweight embedded vector store.
Graph databases — when relationships matter more than embeddings.
Caching in LLM — avoiding redundant calls, semantic caching.
Retry mechanisms — handling rate limits, timeouts, and transient failures.

12. Model selection & capabilities

Choosing the right model for the job saves money and improves results.

When to use a small language model vs a large language model — cost, latency, task complexity.
When to use BERT & when to train BERT — not commonly asked, but saves cost at scale in production. Worth knowing.
Translation — how well each LLM performs on translation, if your product or company needs it. Depends on the company.
PII redaction — removing personally identifiable information in production using local models or APIs.

13. LLM frameworks

Don’t hand-wire every pipeline.

LangChain — the broadest ecosystem.
LangGraph — stateful, graph-based agent workflows.
Strands Agents — alternative orchestration.
Boto3 — AWS SDK, essential for Bedrock integrations.
Others — learn as per the use case; don’t collect frameworks for the sake of it.

14. Developer platforms

Where you’ll get API keys, test prompts, and watch usage.

Google AI Studio — Gemini playground and keys.
Groq — fast inference for open models.
OpenRouter — one API, many models; the minimum you should try.
AWS Bedrock — managed multi-model access.
OpenAI dashboard / developer dashboard — keys, usage, billing.
Anthropic developer dashboard — Claude keys and console.

15. Agents

When a single call isn’t enough.

Agent with tools — giving a model the ability to act, not just talk.
Multi-agent systems — multiple agents working together, each with a role; at least know what they are and when they’re useful.
Sub-agents — spawning focused sub-agents for specific tasks; important for context management — each sub-agent gets its own clean context window instead of bloating the main agent’s.
When to use an agent — complexity, multi-step tasks, dynamic decisions.
When to use normal LLM calling chains — simpler, cheaper, more predictable; reach for this first.
Cost reduction — model routing, caching, prompt compression.
Token management — counting, budgeting, context window discipline.
Token saving frameworks — removing excessive tokens using different frameworks.
Alerting system for token usage — know before you overspend.
Agent permission scoping — define at what level an agent can perform its functions; restrict it using de-facto rules at initialization. Don’t give an agent more access than it needs.
Multi-level guardrails — different layers of guardrails at different levels (input, output, tool execution, system level); not just one filter, but defense in depth.
Rate limiting — proactively manage API call rates; prevent abuse and avoid hitting provider limits.

16. Prompt management

Prompts are code. Treat them like it.

Prompt templates & versioning — manage prompts as code, not hardcoded strings.
LangSmith — version, test, and track prompt changes (also covers observability, see section 18).
Git versioning for prompts — store prompts in your repo, review changes in pull requests.
Using databases for prompts — store and retrieve prompts from a database for dynamic updates without redeploying.

17. Testing & CI/CD

Ship with confidence.

pytest — the standard Python testing framework; write tests for your LLM pipelines, API endpoints, and utility functions.
LLM-as-a-judge — using an LLM to evaluate the output of another LLM; a practical pattern for evaluating generation quality at scale.
CI/CD pipelines — automate testing and deployment:
- GitHub Actions — the most common CI/CD; know what it does and how to write a basic workflow.
- Bitbucket Pipelines — optional; same concept if your company uses Bitbucket instead of GitHub.
- Run tests, lint, and type-check on every push before merging.

18. Security, governance & observability

AI in production introduces new attack surfaces and new things to monitor.

Security

Prompt injection — malicious inputs that hijack model behavior.
OWASP Top 10 for LLMs — the standard vulnerability list for LLM applications.
Cybersecurity in generative & agentic AI — how attack surfaces change when models can call tools and take actions.

Governance

Governance issues — accountability, compliance, audit trails, model usage policies.
GDPR & European Union AI guidelines — good to know, especially if your product serves EU users.

Evaluation

RAGAS — RAG Assessment; evaluating retrieval quality and generation faithfulness.

Observability

LangSmith — tracing and debugging for LangChain/LangGraph pipelines; also used for prompt versioning (see section 16).
Langfuse — open-source LLM observability and analytics.
OpenTelemetry — industry-standard instrumentation for distributed tracing across your whole stack.

19. Fine-tuning & deployment

Take it to production.

When to fine-tune vs prompt engineering — always try prompt engineering, RAG, and structured output first. Fine-tune only when you’ve exhausted those and have a clear, repeatable task that needs it.
LoRA — Low-Rank Adaptation; fine-tune a small set of adapter weights instead of the full model. Efficient, cheap, and the standard approach these days.
QLoRA — Quantized LoRA; fine-tune in 4-bit precision, making it possible to fine-tune large models on consumer GPUs. Knowing this is a bonus.
Unsloth — fast, memory-efficient fine-tuning of open models (uses LoRA/QLoRA under the hood). Just knowing it exists is enough for now.
Unsloth Studio — a no-code fine-tuning interface on top of Unsloth. Good to know it exists; leave for the future.
Docker deployment — containerize your LLM app for reproducible shipping.
AWS deployment — EC2 instances and ECS for hosting models and services on the server.
Lambda functions — when to use them and when not to:
- Good for lightweight, event-driven tasks.
- Avoid for long-running agents, persistent LLM calls, or custom RAG systems that need state or long connections.
ASGI / WSGI servers — uvicorn for ASGI apps (FastAPI), gunicorn for WSGI apps (Flask/Django); know how to use them.
Nginx — reverse proxy in front of your app server. Streaming does not work by default — you must disable buffering:
- proxy_buffering off — tokens flow directly to the client, not held in a buffer.
- proxy_cache off — each streaming response is unique, don’t cache it.
- chunked_transfer_encoding on — required for Server-Sent Events (SSE).
- proxy_read_timeout 600s — long timeouts for streaming generation (5–10 minutes common).
- proxy_http_version 1.1 — keep-alive for streaming connections.
Free & low-cost hosting — for testing when your PC isn’t good enough:
- Vercel — free hosting for frontend and serverless functions.
- Render — free tier for small web services and APIs.
- Google Colab — free GPU access for testing models and code when your PC can’t handle it.
- GitHub Codespaces — cloud dev environment when your local machine isn’t powerful enough.
- Hugging Face — host and serve models via Spaces and inference endpoints.
Indian GPU cloud platforms — if you’re in India, know the local options for GPU hosting and model deployment. They often offer cheaper INR billing and data residency. Explore them as alternatives to US-based providers.
Linux environment — Ubuntu is the most common server OS; know your way around the terminal.
SSH — connecting to remote servers, key-based auth, secure access to your production machines.
Cron jobs — scheduling recurring tasks on Linux (e.g., periodic data syncs, cleanup jobs, scheduled model runs). Very common in production.

20. MCPs

Model Context Protocol — the standard for connecting models to external tools and data sources. Very common these days.

What are MCPs? — how and where to use them; connecting LLMs to files, databases, APIs, and services.
Build at least one MCP — if not a big project, at least a small one. Hands-on experience matters here.
Authentication for MCPs — how to add auth to your MCP server. Not needed by default, but good to know.

21. Staying updated

AI moves fast. Make awareness a daily habit.

News & content

Newsletters — subscribe to a few good AI newsletters.
Major AI speakers — be aware of the leading voices in the industry.
YouTube channels — watch tutorials, talks, and breakdowns.
Blogs & podcasts — wherever and whenever you can.

Conferences & announcements

Major conferences — watch big announcements from Nvidia, Google, Anthropic, OpenAI, Microsoft, and others.

Security awareness

Vulnerabilities — stay aware of new vulnerabilities as they emerge.
Model behavior differences — how different models behave based on different prompts.
Prompt injection — be aware of famous prompt injection techniques and researchers on X (Twitter); awareness is enough, you don’t need to follow everyone.

Agentic coding tools

Agentic coding IDEs — know what’s out there and what’s changing: Cursor, Windsurf, Devin, Claude Code, AntiGravity, Codex, OpenCode.
Coding platform news — keep up with what’s going on with these platforms; the landscape shifts quickly.
Skills in code editors & CLIs — be aware of “skills” in VS Code and other agentic coding editors, CLIs, and agents; how they extend what an agent can do.