
Fine-Tuning LLMs with LoRA: 2025 Guide

LoRA fine-tuning for efficient large language model training and optimization

Low-Rank Adaptation (LoRA) has revolutionized how we fine-tune large language models in 2025. The technique allows developers to adapt models like Llama 3.1 70B on a single GPU, and with 4-bit quantization (QLoRA, covered below) even on consumer hardware, while maintaining performance.

Why LoRA Changed Everything

Traditional fine-tuning updates all model parameters, requiring massive GPU memory and compute. LoRA freezes the base model and injects small trainable low-rank matrices into selected layers (the target modules).

Resource comparison for Llama 3.1 70B:

  • Full fine-tuning: 8x A100 80GB GPUs, 48 hours
  • LoRA: 1x A100 40GB GPU, 4 hours
  • Cost: $15,000 → $500

How LoRA Works

Instead of updating weight matrix W, LoRA adds a low-rank decomposition:

W_new = W + BA

Where:
- W: frozen pretrained weights (d × k)
- B: trainable matrix (d × r)
- A: trainable matrix (r × k)
- r: rank (typically 8-64, much smaller than d or k)

In the standard implementation the BA update is also scaled by lora_alpha / r (the scaling factor configured below), which keeps the size of the update stable as the rank changes.

This reduces trainable parameters by 99%+ while achieving 95-98% of full fine-tuning performance.
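
To make the savings concrete, here is a back-of-the-envelope count for a single large projection matrix (the dimensions are illustrative, not taken from any specific model):

# Rough parameter count for one weight matrix, with and without LoRA
d, k = 8192, 8192   # illustrative dimensions of a large projection
r = 16              # LoRA rank

full_params = d * k            # updated by full fine-tuning
lora_params = d * r + r * k    # B (d x r) plus A (r x k)

print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {lora_params / full_params:.4%}")
# full: 67,108,864  lora: 262,144  ratio: 0.3906%

Summed over only the adapted layers and divided by all 70B parameters, the trainable fraction drops even lower, which is where figures like the 0.124% shown below come from.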

Practical Implementation

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70b",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,  # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Wrap model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 87M || all params: 70B || trainable%: 0.124%

Choosing Hyperparameters

Parameter      | Value     | Use Case
r (rank)       | 4-8       | Simple tasks, limited data
r (rank)       | 16-32     | Complex tasks, sufficient data
r (rank)       | 64+       | Maximum quality, lots of data
lora_alpha     | 2r        | Standard scaling
lora_dropout   | 0.05-0.1  | Prevent overfitting

Target Modules Selection

Which layers you apply LoRA to matters significantly:

  • Minimal (q_proj, v_proj): Fastest, works for simple tasks
  • Standard (q_proj, k_proj, v_proj, o_proj): Best balance
  • Comprehensive (all linear layers): Maximum adaptation

For domain adaptation, target all attention layers. For task-specific tuning, attention query/value is often sufficient.
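
If you are not sure what the projection layers in your model are called, one quick way to list candidate names (a minimal sketch, assuming the model object loaded earlier) is:

import torch

# Collect the leaf names of every linear layer; these are the strings
# that LoraConfig's target_modules expects
linear_names = set()
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        linear_names.add(name.split(".")[-1])

print(sorted(linear_names))
# Llama-style models typically report names such as: down_proj, gate_proj,
# k_proj, lm_head, o_proj, q_proj, up_proj, v_proj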

Dataset Preparation

Quality over quantity. 1,000 high-quality examples beat 100,000 mediocre ones.

// Training format for instruction tuning
{
  "instruction": "Explain quantum entanglement in simple terms",
  "input": "",
  "output": "Quantum entanglement is when two particles..."
}
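
Each record still has to be rendered into a single training string before tokenization. A minimal Alpaca-style template (the exact template is a project choice, not a requirement) might look like:

def format_example(example: dict) -> str:
    # Render one instruction record into a single training string
    if example["input"]:
        prompt = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:\n"
        )
    else:
        prompt = (
            f"### Instruction:\n{example['instruction']}\n\n"
            "### Response:\n"
        )
    return prompt + example["output"]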

Use tools like Lilac or Argilla to identify and remove low-quality samples.

Training Configuration

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./llama-lora-medical",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size: 16
    learning_rate=2e-4,  # higher than full fine-tuning
    num_train_epochs=3,
    fp16=True,  # mixed precision
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.05
)

# train_dataset / eval_dataset are assumed to be already tokenized (input_ids + labels)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()
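
Saving the PEFT-wrapped model after training writes only the adapter weights (a few hundred megabytes rather than the full 70B parameters), and this is the checkpoint the merging step below loads:

# Saves adapter_config.json plus the adapter weights only
model.save_pretrained("./lora-checkpoint")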

QLoRA: Even More Efficient

QLoRA combines LoRA with 4-bit quantization, enabling 70B model fine-tuning on a single 24GB GPU.

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70b",
    quantization_config=bnb_config,
    device_map="auto"
)
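
Quantization only changes how the frozen base is stored; the LoRA adapter still has to be attached on top. With PEFT this typically looks like the following (a sketch that reuses the lora_config defined earlier):

from peft import get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training (casts norm layers, enables
# gradient checkpointing), then attach the same LoRA configuration
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)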

Evaluation Best Practices

Always evaluate on held-out test sets with multiple metrics:

  • Perplexity: General language modeling quality (see the sketch after this list)
  • Task-specific: Accuracy, F1, ROUGE for your use case
  • Human eval: Sample 100 outputs for quality review
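
As a rough sketch of the perplexity metric, assuming a DataLoader of already-tokenized evaluation batches (without padding) and the trained model from above:

import math
import torch

@torch.no_grad()
def perplexity(model, dataloader) -> float:
    # Exponentiated mean cross-entropy over the evaluation set
    model.eval()
    losses = []
    for batch in dataloader:
        input_ids = batch["input_ids"].to(model.device)
        outputs = model(input_ids=input_ids, labels=input_ids)
        losses.append(outputs.loss.item())
    return math.exp(sum(losses) / len(losses))

Lower is better; compare against the base model to catch regressions in general language modeling quality.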

Merging LoRA Weights

After training, merge LoRA weights into base model for faster inference:

import torch
from peft import PeftModel

# Load base model in half precision, then attach the trained LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70b",
    torch_dtype=torch.float16
)
lora_model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")

# Merge and save
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("./llama-3.1-70b-medical")

Production Deployment

Serve LoRA-adapted models with vLLM or TGI:

vllm serve ./llama-3.1-70b-medical \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
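
The server exposes an OpenAI-compatible API (on port 8000 by default), so any OpenAI client can call the fine-tuned model; a minimal sketch:

from openai import OpenAI

# vLLM does not require a real API key; any placeholder string works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./llama-3.1-70b-medical",  # the path passed to `vllm serve`
    messages=[{"role": "user", "content": "Summarize the key findings."}],
    max_tokens=256,
)
print(response.choices[0].message.content)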

Common Pitfalls

  • Too low rank: Model can’t adapt sufficiently
  • Too high learning rate: Catastrophic forgetting of base knowledge
  • Insufficient data diversity: Overfitting to training distribution
  • Wrong target modules: Not adapting the right layers for your task

Advanced: Multi-LoRA Serving

Serve multiple LoRA adapters on one base model with dynamic switching:

# Illustrative pseudocode for S-LoRA-style multi-adapter serving:
# load the base model once, swap LoRA adapters per request
# (register_adapter / generate are placeholders, not a specific library's API)
server.register_adapter("medical", "./lora-medical")
server.register_adapter("legal", "./lora-legal")
server.register_adapter("code", "./lora-code")

# Each request specifies which adapter to use
response = server.generate(prompt, adapter="medical")

This architecture supports hundreds of specialized models on shared infrastructure.
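
vLLM offers one concrete implementation of this pattern. A minimal sketch of its offline multi-LoRA API (the adapter paths are the hypothetical ones from above):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the shared base model once with LoRA support enabled
llm = LLM(model="meta-llama/Llama-3.1-70b", enable_lora=True)
params = SamplingParams(max_tokens=256)

# Each request names the adapter it wants; the base weights stay shared
medical = LoRARequest("medical", 1, "./lora-medical")
outputs = llm.generate(["Explain contraindications for aspirin."], params,
                       lora_request=medical)
print(outputs[0].outputs[0].text)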

Cost Analysis

Training a Llama 3.1 70B LoRA adapter:

  • AWS p4d.24xlarge (8x A100): $32/hour × 4 hours = $128
  • Lambda Labs A100: $1.29/hour × 4 hours = $5.16
  • Local RTX 4090 with QLoRA: electricity only (~$2)

LoRA makes fine-tuning accessible to teams of any size.