Model Evaluation Guide
Overview
This directory contains comprehensive evaluation scripts for your FSDP QDoRA fine-tuned Llama 3 70B model trained on the Uganda Clinical Guidelines dataset.
Files
- evaluate_model.py - Main comprehensive evaluation script
- run_evaluation.py - Quick evaluation and interactive testing script
- requirements_eval.txt - Required dependencies
- README_evaluation.md - This file
Quick Start
1. Install Dependencies
pip install -r requirements_eval.txt
2. Run Quick Evaluation
Test your model with predefined medical scenarios:
python run_evaluation.py --model_path models/Llama-3-70b-ucg-bnb-QDoRA
3. Interactive Testing
Test your model interactively:
python run_evaluation.py --model_path models/Llama-3-70b-ucg-bnb-QDoRA --interactive
4. Comprehensive Evaluation
Run full evaluation with metrics:
python evaluate_model.py --model_path models/Llama-3-70b-ucg-bnb-QDoRA
Evaluation Features
Comprehensive Metrics
The evaluation script calculates:
- Content Similarity: ROUGE-L and BLEU scores
- Medical Relevance: Medical terminology density and Uganda-specific terms
- Response Quality: Structure, specificity, and advice quality
- Length Analysis: Response length vs reference comparison
Medical-Specific Assessment
The evaluator includes:
- Medical terminology detection
- Uganda-specific medical condition recognition
- Clinical advice quality assessment
- Response structure analysis
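The terminology-detection idea above can be sketched as follows. This is a hypothetical illustration, not the actual implementation: the term list and the `medical_term_density` function are invented here, and evaluate_model.py may compute the metric differently.

```python
# Illustrative medical-term-density check; the real term list in
# evaluate_model.py is likely much larger and Uganda-specific.
MEDICAL_TERMS = {"malaria", "fever", "diagnosis", "treatment", "dosage", "antimalarial"}

def medical_term_density(response: str) -> float:
    """Ratio of recognized medical terms to total words in a response."""
    words = [w.strip(".,;:!?").lower() for w in response.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in MEDICAL_TERMS)
    return hits / len(words)
```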
Output Files
Results are saved to evaluation_results/:
- detailed_results.json - Complete evaluation data
- aggregate_metrics.json - Summary statistics
- evaluation_results.csv - Spreadsheet format for analysis
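After a run, the JSON outputs can be inspected programmatically. A minimal sketch, assuming the file layout above (the `load_metrics` helper and its empty-dict fallback are assumptions, not part of the scripts):

```python
import json
from pathlib import Path

def load_metrics(results_dir: str = "evaluation_results") -> dict:
    """Load aggregate metrics from a previous run, or {} if none exist yet."""
    path = Path(results_dir) / "aggregate_metrics.json"
    if not path.exists():
        return {}
    return json.loads(path.read_text())
```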
Usage Examples
Basic Evaluation
# Evaluate with default settings
python evaluate_model.py --model_path models/Llama-3-70b-ucg-bnb-QDoRA
Custom Settings
# Custom evaluation parameters
python evaluate_model.py \
--model_path models/Llama-3-70b-ucg-bnb-QDoRA \
--base_model meta-llama/Meta-Llama-3-70B \
--dataset silvaKenpachi/uganda-clinical-guidelines \
--output_dir my_evaluation_results \
--max_tokens 1024 \
--temperature 0.5 \
--test_split 0.3
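The flags above suggest a CLI surface roughly like the following argparse sketch. The defaults shown here are assumptions for illustration; check evaluate_model.py for the actual parser, options, and default values.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of evaluate_model.py's CLI; defaults are guesses."""
    p = argparse.ArgumentParser(description="Evaluate a fine-tuned model")
    p.add_argument("--model_path", required=True)
    p.add_argument("--base_model", default="meta-llama/Meta-Llama-3-70B")
    p.add_argument("--dataset", default="silvaKenpachi/uganda-clinical-guidelines")
    p.add_argument("--output_dir", default="evaluation_results")
    p.add_argument("--max_tokens", type=int, default=512)
    p.add_argument("--temperature", type=float, default=0.5)
    p.add_argument("--test_split", type=float, default=0.2)
    return p
```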
Interactive Mode
# Test specific questions interactively
python run_evaluation.py --model_path models/Llama-3-70b-ucg-bnb-QDoRA --interactive
Understanding Results
Key Metrics
- ROUGE-L Score (0-1): Measures longest common subsequence overlap with reference
- BLEU-1 Score (0-1): Measures unigram precision vs reference
- Medical Term Density: Ratio of medical terms to total words
- Response Structure: Percentage of responses with organized structure
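For intuition, ROUGE-L can be computed as an F1 score over the longest common subsequence of candidate and reference tokens. The evaluation script most likely uses an existing library rather than this hand-rolled version, so treat it as illustrative only:

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens, in [0, 1]."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    precision, recall = lcs / len(c), lcs / len(r)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```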
Good Performance Indicators
- ROUGE-L > 0.3: Good content overlap
- Medical Term Density > 0.1: Medically relevant responses
- 80%+ responses with medical terms: Consistent medical focus
- Structured responses: Clear, organized answers
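The thresholds above can be bundled into a simple pass/fail check. The metric key names below are hypothetical; map them to whatever keys aggregate_metrics.json actually uses:

```python
def meets_targets(metrics: dict) -> bool:
    """Apply the guide's rule-of-thumb thresholds (key names are assumed)."""
    return (metrics.get("rouge_l", 0.0) > 0.3
            and metrics.get("medical_term_density", 0.0) > 0.1
            and metrics.get("pct_with_medical_terms", 0.0) >= 0.8)
```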
Troubleshooting
Memory Issues
If you encounter CUDA out of memory errors:
# Disable quantization (requires more memory but may be more stable)
python evaluate_model.py --model_path models/Llama-3-70b-ucg-bnb-QDoRA --no_quantization
Model Loading Issues
If the adapter fails to load:
- Check that the model path contains PEFT adapter files
- Verify the base model name matches training
- Ensure all dependencies are installed
Dataset Loading Issues
If the dataset fails to load:
- Check internet connection for remote datasets
- Verify dataset name/path is correct
- Try using a local dataset file
Customization
Adding Custom Test Cases
Edit run_evaluation.py to add your own test cases:
test_cases = [
    {
        "instruction": "Your custom medical question?",
        "input": "Additional context if needed",
    },
    # Add more cases...
]
Custom Metrics
Extend the MedicalEvaluator class in evaluate_model.py to add:
- Domain-specific terminology detection
- Clinical reasoning assessment
- Safety evaluation metrics
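As one concrete example, a safety metric could flag responses containing known-dangerous phrasing. Both the phrase list and the `safety_score` function are invented for illustration; how it plugs into MedicalEvaluator depends on that class's actual interface:

```python
# Illustrative safety metric; extend the phrase list with clinically
# reviewed red-flag phrasing before relying on it.
DANGER_PHRASES = ("double the dose", "self-medicate", "no need to see a doctor")

def safety_score(response: str) -> float:
    """1.0 if no flagged phrase appears, else 0.0."""
    text = response.lower()
    return 0.0 if any(p in text for p in DANGER_PHRASES) else 1.0
```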
Performance Notes
- Evaluating the full dataset may take 30-60 minutes, depending on hardware
- Use smaller test splits for faster iterations during development
- Interactive mode provides immediate feedback for qualitative assessment
Support
For issues or questions about the evaluation scripts, verify that:
- The model path and adapter files exist
- All dependencies are installed
- CUDA is available if using a GPU
- There is sufficient GPU memory for the model plus quantization overhead