LLM Development Guide: Fine-tuning to Deployment
Practical guide to working with large language models for Indian developers.
5 Dec 2025 • 25 min read
Working with LLMs
A comprehensive guide to fine-tuning and deploying LLMs for production use.
Part 1: Understanding LLMs
Key Concepts
- Transformer architecture basics
- Attention mechanisms
- Tokenization for Indian languages
- Context windows and limitations
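One reason tokenization matters for Indian languages: byte-level BPE tokenizers start from UTF-8 bytes, and Devanagari characters occupy 3 bytes each versus 1 for ASCII, so Indic text typically costs more tokens for similar meaning. A minimal, dependency-free sketch of that effect (real tokenizers apply BPE merges on top of this, so actual counts will be lower, but the asymmetry persists):

```python
# Rough illustration: UTF-8 byte counts, an upper bound on the number
# of byte-level tokens before any BPE merges are applied.

def utf8_byte_count(text: str) -> int:
    """Bytes in the UTF-8 encoding of `text`."""
    return len(text.encode("utf-8"))

english = "Hello, how are you?"
hindi = "नमस्ते, आप कैसे हैं?"

print(utf8_byte_count(english))  # one byte per ASCII character
print(utf8_byte_count(hindi))    # roughly 3x per Devanagari character
```

This is why context windows fill up faster for Indic scripts, and why Indic-specific tokenizers (covered in Part 4) help.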
Popular Models
- Llama 2/3 (Meta)
- Mistral models
- Gemma (Google)
- Indic models (AI4Bharat, Sarvam)
Part 2: Fine-tuning
When to Fine-tune
- Domain-specific knowledge needed
- Specific output format required
- Cost optimization at scale
Techniques
- Full Fine-tuning: Update all weights (expensive)
- LoRA: Low-rank adaptation (recommended)
- QLoRA: Quantized LoRA (memory efficient)
- Prefix Tuning: Add learned prefixes
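To see why LoRA is so much cheaper than full fine-tuning, compare trainable parameter counts for a single d × d weight matrix: full fine-tuning updates all d² weights, while LoRA trains only two low-rank factors A (r × d) and B (d × r). A back-of-envelope sketch (the 4096-dim projection is an illustrative size, not tied to any specific model):

```python
# Trainable-parameter comparison for one d x d weight matrix.

def full_params(d: int) -> int:
    """Full fine-tuning: every weight in the d x d matrix is trainable."""
    return d * d

def lora_params(d: int, r: int) -> int:
    """LoRA: only A (r x d) and B (d x r) are trainable."""
    return 2 * d * r

d, r = 4096, 16           # e.g. a 4096-dim attention projection, rank 16
print(full_params(d))     # 16_777_216
print(lora_params(d, r))  # 131_072 -- under 1% of the full matrix
```

With rank 16 on a 4096-dim projection, LoRA trains roughly 0.8% of the weights full fine-tuning would touch, which is what makes single-GPU fine-tuning feasible.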
Code Example (LoRA with Hugging Face)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen base model ("base-model" is a placeholder checkpoint id)
model = AutoModelForCausalLM.from_pretrained("base-model")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor (alpha / r)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
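Under the hood, LoRA learns an additive update W' = W + (alpha / r) · B · A on top of the frozen weight W; after training, PEFT can fold this update back into the base weights (via `merge_and_unload()`) so inference pays no extra cost. A toy, dependency-free sketch of that merge on 2×2 matrices:

```python
# Toy illustration of the LoRA merge W' = W + (alpha / r) * B @ A,
# using plain Python lists so it runs anywhere.

def matmul(X, Y):
    """Naive matrix multiply for lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_merge(W, A, B, alpha, r):
    """Fold the scaled low-rank update into the base weight."""
    delta = matmul(B, A)          # d x d update from B (d x r) @ A (r x d)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity for clarity)
A = [[1.0, 2.0]]               # r x d, here r = 1
B = [[0.5], [0.5]]             # d x r
print(lora_merge(W, A, B, alpha=2, r=1))  # [[2.0, 2.0], [1.0, 3.0]]
```

The merged matrix behaves exactly like the base weight plus the learned adapter, which is why serving a merged LoRA model needs no special runtime support.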
Part 3: Deployment
Inference Options
- vLLM: Fast inference with PagedAttention
- TGI: Hugging Face Text Generation Inference
- llama.cpp: Efficient CPU inference on GGUF models, with optional GPU offload
- Ollama: Local deployment
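As a quick taste of local deployment, Ollama exposes an HTTP API on port 11434; a minimal stdlib-only client sketch is below. The endpoint and payload fields (`model`, `prompt`, `stream`) follow Ollama's documented REST API, but the model name `"llama3"` is just an example, and `generate()` assumes a server is already running:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> str:
    """Send one prompt to a locally running Ollama server."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Requires `ollama serve` and a pulled model, e.g.:
# print(generate("llama3", "Introduce yourself in one line."))
```

vLLM and TGI expose OpenAI-compatible endpoints instead, so the same pattern applies with a different URL and schema.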
Optimization
- Quantization (INT8, INT4)
- KV cache optimization
- Batching strategies
- Speculative decoding
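Of the optimizations above, quantization is the easiest to reason about from first principles: symmetric INT8 maps each float to an integer in [-127, 127] using a single per-tensor scale. A minimal sketch of the round trip and the rounding error it introduces (production schemes add per-channel scales, zero-points, and outlier handling):

```python
# Symmetric per-tensor INT8 quantization: q = round(v / scale),
# where scale maps the largest-magnitude value to 127.

def quantize_int8(values):
    """Return (int8 values, scale) for a list of floats."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid zero scale
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate floats."""
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 1.0]
q, scale = quantize_int8(weights)
print(q)                     # small integers, 1 byte each vs 4 for float32
print(dequantize(q, scale))  # close to the originals, within half a scale step
```

The payoff is a 4x memory reduction versus float32 (so bigger models fit on the same GPU), at the cost of bounded rounding error per weight.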
Part 4: Indian Language Considerations
- Use Indic-aware tokenizers (e.g. the IndicBERT tokenizer)
- Consider romanized text handling
- Test on code-mixed data
- Evaluate with native speakers
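A concrete starting point for code-mixed handling: detect whether input mixes Latin and Devanagari characters, so you can route it to the right tokenizer or preprocessing path. A simple stdlib-only sketch (the Unicode range covers only Devanagari; other Indic scripts would need their own ranges):

```python
# Detect code-mixed (Hinglish-style) text by checking which scripts appear.

DEVANAGARI = range(0x0900, 0x0980)  # Unicode Devanagari block

def scripts_used(text: str) -> set:
    """Return the set of scripts ('latin', 'devanagari') found in text."""
    found = set()
    for ch in text:
        if ch.isascii() and ch.isalpha():
            found.add("latin")
        elif ord(ch) in DEVANAGARI:
            found.add("devanagari")
    return found

def is_code_mixed(text: str) -> bool:
    """True if the text contains both Latin and Devanagari characters."""
    return {"latin", "devanagari"} <= scripts_used(text)

print(is_code_mixed("kal office नहीं आ रहा हूँ"))  # True
print(is_code_mixed("मैं कल नहीं आ रहा हूँ"))      # False
```

Checks like this are also useful for building evaluation splits: pure-Hindi, pure-romanized, and code-mixed inputs often behave very differently under the same model.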
TheIndian.AI Team
Editorial
Curated resources and guides to help you navigate your AI career in India.