How We Trained FOMOA: 86,000 Samples for India-Centric AI
Inside FOMOA's training methodology - Qwen2.5-7B base, QLoRA fine-tuning, 65% Hindi data, 113 hours of training. Technical deep-dive into building India-first AI.
Why Custom Training Matters
Most "Indian" AI assistants follow a simple pattern: take a Western-trained model, add a translation layer, and call it localized. This approach fails because:
- Translation loses nuance and context
- Cultural references get mangled
- Hindi idioms translate literally (and incorrectly)
- Indian number systems confuse the base model
Our model doesn't translate - it thinks in Hindi and English simultaneously, understanding both languages at a foundational level.
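To make the number-system point concrete: Indian digit grouping places a comma after the first three digits and then every two digits (lakh/crore), which Western-trained models routinely mangle. A minimal illustrative sketch, not part of the FOMOA pipeline:

# Illustrative only: Indian vs Western digit grouping (not FOMOA code)
def format_indian(n: int) -> str:
    """Format an integer with Indian grouping, e.g. 12,34,567 (12 lakh 34 thousand 567)."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups) + "," + tail

print(format_indian(1234567))   # 12,34,567  (Indian grouping)
print(f"{1234567:,}")           # 1,234,567  (Western grouping)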
The Architecture Decision
Base Model: Qwen2.5-7B-Instruct
After evaluating 15+ open-source models, we chose Qwen2.5-7B-Instruct:
Model Selection Criteria
========================
Evaluated models:
├── Llama 3.1 8B - Good English, weak multilingual
├── Mistral 7B - Fast, limited Hindi
├── Gemma 7B - Google quality, license restrictions
├── Falcon 7B - Open, but training instability
└── Qwen2.5-7B-Instruct ✓ - Best multilingual, Apache license
Qwen2.5-7B Advantages:
├── Native multilingual architecture
├── Strong Hindi baseline (pre-training on Indian content)
├── Efficient attention mechanism
├── Apache 2.0 license (commercial use allowed)
├── 7B parameters = runnable on single GPU
└── Instruction-tuned variant available
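A quick way to sanity-check multilingual fit before committing to a base model is to compare how each candidate's tokenizer segments Hindi text; fewer tokens per sentence usually indicates the script was well represented in pre-training. A hedged sketch of that check (the model IDs are the public Hugging Face Hub names, some are gated, and the actual counts depend on tokenizer versions):

# Sketch: compare Hindi tokenization efficiency across candidate base models
from transformers import AutoTokenizer

candidates = [
    "Qwen/Qwen2.5-7B-Instruct",
    "meta-llama/Llama-3.1-8B-Instruct",     # gated; requires accepting the license
    "mistralai/Mistral-7B-Instruct-v0.3",
]

hindi_sentence = "भारत की राजधानी नई दिल्ली है और यह एक ऐतिहासिक शहर है।"

for model_id in candidates:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tokenizer(hindi_sentence)["input_ids"])
    print(f"{model_id}: {n_tokens} tokens for {len(hindi_sentence)} characters")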
Why Not Bigger Models?
Model Size vs. Practical Deployment
===================================
70B models:
├── Require 4x A100 80GB GPUs
├── $15-20/hour inference cost
├── 200ms+ latency
└── Not practical for production
13B models:
├── Require 2x A100 40GB
├── $8-10/hour inference cost
├── 150ms latency
└── Marginal quality improvement
7B models (FOMOA choice):
├── Single L4 GPU sufficient
├── $1-2/hour inference cost
├── 50-80ms latency
└── Sweet spot for production
Training Data Composition
Total training samples: 86,760
FOMOA Training Data Breakdown
=============================
┌────────────────────────────────────────────────────┐
│ │
│ Hindi/Hinglish (65%) │
│ ████████████████████████████████░░░░░░░░░ │
│ 56,760 samples │
│ │
│ Analytical Reasoning (23%) │
│ ████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ │
│ 20,000 samples │
│ │
│ Diverse Knowledge (12%) │
│ ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
│ 10,000 samples │
│ │
└────────────────────────────────────────────────────┘
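For reference, the same mixture expressed as a small sanity check (numbers taken directly from the breakdown above):

# Training mixture sanity check (counts from the chart above)
mixture = {
    "hindi_hinglish": 56_760,
    "analytical_reasoning": 20_000,
    "diverse_knowledge": 10_000,
}

total = sum(mixture.values())
assert total == 86_760

for name, count in mixture.items():
    print(f"{name}: {count} samples ({count / total:.0%})")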
Hindi/Hinglish Component (56,760 samples)
Hindi Training Data Sources
===========================
1. Hindi Alpaca Dataset (51,760 samples)
├── Source: Stanford Alpaca translated + verified
├── Format: Instruction → Response pairs
├── Quality: Native Hindi speakers validated
├── Topics: General knowledge, tasks, explanations
└── Example:
    Instruction: "भारत की राजधानी क्या है और इसका इतिहास बताइए"
    ("What is the capital of India, and tell me about its history?")
    Response: "भारत की राजधानी नई दिल्ली है..."
    ("The capital of India is New Delhi...")
Instruction: "भारत की राजधानी क्या है और
इसका इतिहास बताइए"
Response: "भारत की राजधानी नई दिल्ली है..."
2. Hindi Wikipedia QA (5,000 samples)
├── Source: Hindi Wikipedia articles
├── Format: Question-Answer pairs
├── Topics: Indian history, geography, science
├── Verification: Cross-referenced with sources
└── Example:
    Question: "महाभारत के रचयिता कौन थे?"
    ("Who composed the Mahabharata?")
    Answer: "महाभारत की रचना महर्षि वेदव्यास ने की थी..."
    ("The Mahabharata was composed by Maharishi Ved Vyasa...")
Analytical Reasoning (20,000 samples)
Reasoning Training Data
=======================
1. Open-Orca (10,000 samples)
├── Complex reasoning chains
├── Step-by-step problem solving
├── Mathematical reasoning
└── Logical deduction
2. SlimOrca (10,000 samples)
├── Refined, high-quality subset
├── Reduced noise and errors
├── Focus on clear reasoning
└── Diverse problem types
Diverse Knowledge (10,000 samples)
General Knowledge Data
======================
UltraChat Dataset:
├── Natural conversation flows
├── Multi-turn dialogues
├── Real-world scenarios
├── Diverse topic coverage
└── Conversational AI patterns
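The reasoning and conversation components come from public corpora. As a hedged sketch of pulling comparable subsets from the Hugging Face Hub (the dataset IDs below are the public Hub versions; FOMOA's exact sampling and filtering are not shown here):

# Sketch: sampling reasoning/conversation subsets from public Hub datasets
from datasets import load_dataset

open_orca = load_dataset("Open-Orca/OpenOrca", split="train").shuffle(seed=42).select(range(10_000))
slim_orca = load_dataset("Open-Orca/SlimOrca", split="train").shuffle(seed=42).select(range(10_000))
ultrachat = load_dataset("stingning/ultrachat", split="train").shuffle(seed=42).select(range(10_000))

print(len(open_orca), len(slim_orca), len(ultrachat))   # 10000 each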
Training Methodology: QLoRA
We used QLoRA (Quantized Low-Rank Adaptation) for efficient fine-tuning.
Why QLoRA?
Training Method Comparison
==========================
Full Fine-Tuning:
├── Updates all 7B parameters
├── Requires 8x A100 80GB GPUs
├── 500GB+ memory footprint
├── Risk of catastrophic forgetting
└── Cost: ~$5,000 for one training run
LoRA (Low-Rank Adaptation):
├── Updates only adapter weights
├── Requires 4x A100 40GB GPUs
├── Much smaller memory footprint
├── Preserves base model knowledge
└── Cost: ~$1,500 for one training run
QLoRA (Quantized LoRA): ✓ Our choice
├── 4-bit quantized base model
├── Single L4 24GB GPU sufficient
├── 20GB memory footprint
├── Best cost-efficiency ratio
└── Cost: ~$400 for one training run
QLoRA Technical Configuration
# FOMOA QLoRA Configuration
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # Normal Float 4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True           # Nested quantization
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                  # Rank - balance between capacity and efficiency
    lora_alpha=32,         # Scaling factor
    lora_dropout=0.05,     # Regularization
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj"
    ]
)

# Total trainable parameters
# Base model: 7B parameters (frozen in 4-bit)
# LoRA adapters: ~20M parameters (trainable)
# Effective training: 0.3% of total parameters
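Putting the two configs together, the base model is loaded in 4-bit and wrapped with the LoRA adapters. The loading step isn't shown above, so this is a sketch of the standard transformers + peft pattern rather than FOMOA's exact script:

# Sketch: load Qwen2.5-7B-Instruct in 4-bit and attach the LoRA adapters
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import prepare_model_for_kbit_training

base_id = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,   # BitsAndBytesConfig from above
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)   # prep quantized weights for training
model = get_peft_model(model, lora_config)       # LoraConfig from above
model.print_trainable_parameters()               # ~20M trainable vs 7B frozen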
Training Hyperparameters
# Training configuration that worked
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./fomoa-checkpoints",

    # Batch settings
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    # Effective batch size: 4 × 8 = 32

    # Learning rate
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,

    # Training duration
    num_train_epochs=3,
    max_steps=-1,              # Use epochs, not steps

    # Precision
    bf16=True,                 # bfloat16 for stability
    tf32=True,

    # Optimization
    optim="paged_adamw_8bit",
    weight_decay=0.01,
    max_grad_norm=0.3,

    # Logging
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,

    # Memory optimization
    gradient_checkpointing=True,
    group_by_length=True,

    # Reproducibility
    seed=42
)
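These arguments feed a standard causal-LM training loop. FOMOA's full training script isn't reproduced here, so the wiring below is a generic sketch using transformers' Trainer, reusing the model and tokenizer from the loading sketch above and the prepare_training_data() pipeline shown in the next section; the 2,048-token max length and 2% held-out split are assumptions, not published settings:

# Sketch: tokenize the formatted dataset and run a standard Trainer loop
from transformers import Trainer, DataCollatorForLanguageModeling

dataset = prepare_training_data()          # defined in the data pipeline below

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
split = tokenized.train_test_split(test_size=0.02, seed=42)

trainer = Trainer(
    model=model,                           # 4-bit base + LoRA adapters from above
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],            # needed because evaluation_strategy="steps"
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()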
Training Infrastructure
Hardware Setup
==============
GPU: NVIDIA L4 (24GB VRAM)
├── Sufficient for 7B QLoRA
├── Cost-effective ($0.50-0.80/hour)
└── Available on GCP, AWS
CPU: 8 vCPUs
RAM: 32 GB
Storage: 200 GB SSD
Training Duration: ~113 hours
├── 3 epochs over 86,760 samples
├── ~38 hours per epoch
└── Total cost: ~$80-90
Training Progress
FOMOA Training Timeline
=======================
Hour 0-10: Warmup phase, loss stabilizing
Hour 10-38: Epoch 1, Hindi patterns emerging
Hour 38-76: Epoch 2, reasoning improving
Hour 76-113: Epoch 3, final refinement
Loss Curve:
├── Start: 2.45
├── Epoch 1 end: 1.82
├── Epoch 2 end: 1.54
└── Final: 1.38
Validation Metrics:
├── Hindi accuracy: 89%
├── English accuracy: 91%
├── Reasoning benchmark: 78%
└── Overall improvement: +34% over base
Data Processing Pipeline
# Data preparation for FOMOA training
import json
from datasets import Dataset


def prepare_training_data():
    """
    Combine and format all training datasets
    """
    all_samples = []

    # 1. Load Hindi Alpaca
    with open("hindi_alpaca_cleaned.json") as f:
        hindi_alpaca = json.load(f)

    for item in hindi_alpaca:
        all_samples.append({
            "instruction": item["instruction"],
            "input": item.get("input", ""),
            "output": item["output"],
            "language": "hindi"
        })

    # 2. Load Hindi Wikipedia QA
    with open("hindi_wiki_qa.json") as f:
        wiki_qa = json.load(f)

    for item in wiki_qa:
        all_samples.append({
            "instruction": item["question"],
            "input": "",
            "output": item["answer"],
            "language": "hindi"
        })

    # 3. Load reasoning datasets
    # ... similar processing for Orca datasets

    # 4. Format for training
    formatted = []
    for sample in all_samples:
        formatted.append({
            "text": format_prompt(sample)
        })

    return Dataset.from_list(formatted)


def format_prompt(sample: dict) -> str:
    """
    Format sample into training prompt
    """
    if sample["input"]:
        return f"""<|im_start|>system
You are FOMOA, an India-first AI assistant.<|im_end|>
<|im_start|>user
{sample["instruction"]}
{sample["input"]}<|im_end|>
<|im_start|>assistant
{sample["output"]}<|im_end|>"""
    else:
        return f"""<|im_start|>system
You are FOMOA, an India-first AI assistant.<|im_end|>
<|im_start|>user
{sample["instruction"]}<|im_end|>
<|im_start|>assistant
{sample["output"]}<|im_end|>"""
Quality Assurance
Pre-Training Data Validation
def validate_sample(sample: dict) -> bool:
    """
    Quality checks before including in training
    """
    # check_repetition, detect_language, contains_harmful_content and
    # grammar_score are helper functions defined elsewhere in the pipeline
    checks = {
        "min_length": len(sample["output"]) > 50,
        "max_length": len(sample["output"]) < 5000,
        "no_repetition": check_repetition(sample["output"]) < 0.3,
        "language_match": detect_language(sample["output"]) == sample["language"],
        "no_harmful": not contains_harmful_content(sample["output"]),
        "grammatical": grammar_score(sample["output"]) > 0.7
    }
    return all(checks.values())

# Rejection rate: ~15% of raw data
# Final dataset: 86,760 validated samples
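The helpers referenced above are not published; as a purely hypothetical illustration of one of them, a repetition score could be computed as the fraction of repeated word trigrams:

# Hypothetical sketch of check_repetition: fraction of repeated word trigrams
# (the actual FOMOA implementation is not published; illustrative only)
def check_repetition(text: str, n: int = 3) -> float:
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    repeated = len(ngrams) - len(set(ngrams))
    return repeated / len(ngrams)

# A value above 0.3 would reject the sample in validate_sample()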
Post-Training Evaluation
FOMOA Evaluation Benchmarks
===========================
1. Hindi Factual Accuracy
├── Test set: 500 questions
├── Sources: Wikipedia, textbooks, gov sites
└── Score: 89% (vs 45% competitor baseline)
2. English Reasoning (MMLU subset)
├── Test set: 1000 questions
├── Categories: Science, math, logic
└── Score: 78% (vs 72% base model)
3. Code Generation
├── Test set: 200 coding tasks
├── Languages: Python, JavaScript
└── Score: 75% correct (maintained from base)
4. Hinglish Understanding
├── Test set: 300 mixed-language queries
├── Native speaker evaluation
└── Score: 94% appropriate responses
5. Indian Context
├── Test set: 250 India-specific queries
├── Topics: Schemes, geography, culture
└── Score: 92% (vs 55% GPT-4)
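The numbers above come from FOMOA's internal evaluation harness, which is not published. As a rough, hypothetical illustration of how a factual-accuracy check like benchmark 1 can be scored, a simple reference-containment metric looks like this (the real test set, prompts, and matching rules may differ):

# Hypothetical sketch: scoring factual QA by reference containment
def score_factual_qa(model_answers: list[str], references: list[str]) -> float:
    correct = sum(
        1 for answer, ref in zip(model_answers, references)
        if ref.strip().lower() in answer.strip().lower()
    )
    return correct / len(references)

answers = ["भारत की राजधानी नई दिल्ली है।"]   # "The capital of India is New Delhi."
refs = ["नई दिल्ली"]                          # "New Delhi"
print(f"accuracy: {score_factual_qa(answers, refs):.0%}")   # 100% on this toy pair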
Why This Approach Works
Native vs. Translated
Approach Comparison
===================
Translation-Based ("Indian" ChatGPT wrapper):
├── Base model has no Hindi context
├── Translate input → Process → Translate output
├── Latency: +200ms for translation
├── Context loss in translation
├── Idioms translated literally
├── Numbers confused (lakh vs 100k)
└── Cultural references lost
Native Training (FOMOA):
├── Hindi in core training data
├── Direct understanding, no translation
├── Latency: Same as English queries
├── Context preserved
├── Idioms understood natively
├── Indian number system native
└── Cultural context embedded
Real Example: Idiom Understanding
Query: "वो तो अंधों में काना राजा है"
(He's like a one-eyed king among the blind)
Translation-Based Response:
"He is the one-eyed king among the blind people."
→ Literal translation, misses the idiom meaning
FOMOA Response:
"यह मुहावरे का अर्थ है कि वह अन्य अयोग्य लोगों में
सबसे कम अयोग्य है, इसलिए नेता बन गया।"
→ Correctly explains the idiom's meaning
Lessons Learned
What Worked
Successful Decisions
====================
1. 65% Hindi training data
└── Critical mass for native understanding
2. QLoRA over full fine-tuning
└── 10x cost reduction, same quality
3. Qwen2.5 as base
└── Best multilingual foundation
4. 3 epochs (not more)
└── Diminishing returns after epoch 3
5. Diverse reasoning data
└── Maintained English quality
What We'd Do Differently
Future Improvements
===================
1. More regional languages
└── Tamil, Telugu, Bengali next
2. Domain-specific fine-tuning
└── Legal, medical, financial variants
3. RLHF phase
└── Human feedback for preference alignment
4. Larger validation set
└── More robust evaluation
Open Source Considerations
While FOMOA's API is free, we're evaluating open-sourcing the training recipes:
Open Source Plan
================
Phase 1 (Current):
├── Free API access
├── Documentation published
└── Integration guides
Phase 2 (Planned):
├── Training scripts
├── Data preprocessing code
└── Evaluation benchmarks
Phase 3 (Considering):
├── Model weights (with license)
├── Fine-tuning guides
└── Community contributions
---
Building AI that truly understands India requires more than translation - it requires native training on Indian content, in Indian languages, with Indian context.
That's what FOMOA delivers.
Try it free at fomoa.cloud.
Interested in the technical details or collaboration opportunities? Connect on LinkedIn.