Fine-tuning the Qwen2.5-3B-Instruct model with LoRA (Low-Rank Adaptation) and Group Relative Policy Optimization (GRPO)