GLOCON-Reasoning: Qwen2.5-3B with GRPO Reinforcement Learning
Reinforcement Learning Highlights
Unlike traditional supervised fine-tuning (used in ConflLlama), this model uses GRPO (Group Relative Policy Optimization) to:
- Optimize multiple reward signals simultaneously
- Enforce structured reasoning format through reinforcement signals
- Improve output consistency with formatted XML responses
- Self-improve through reinforcement rather than direct imitation
Training Data
- Dataset: GLOCON event classification dataset
- Time Period: Contemporary civil conflict events
- Format: News articles with associated event categories
- Labels: Five main event categories:
  - Demonstration
  - Armed Militancy
  - Group Clash
  - Industrial Action
  - Other
Data Processing
- Train/Test Split:
  - 80% training, 20% testing
  - Consistent random seed (42) for reproducibility
- Format Standardization:
  - System prompt with structured reasoning requirements
  - Consistent XML output format
- Answer Extraction:
  - Specialized extraction of the final category from the structured response
  - Validation against the five known categories (see the sketch below)
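A minimal sketch of this pipeline, assuming the GLOCON articles are loaded with the Hugging Face datasets library (the file path and column names are illustrative, not the exact training code):

from datasets import load_dataset

VALID_CATEGORIES = {"Demonstration", "Armed Militancy", "Group Clash", "Industrial Action", "Other"}

# Hypothetical loading step; the actual dataset source is not specified in this card.
dataset = load_dataset("json", data_files="glocon_events.json", split="train")

# 80/20 split with a fixed seed for reproducibility.
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_data, test_data = splits["train"], splits["test"]

def is_valid_category(extracted_answer: str) -> bool:
    # An extracted answer only counts if it matches one of the five known labels.
    return extracted_answer.strip() in VALID_CATEGORIES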
Training Format
- Input: News article describing potential conflict event
- Output: Structured XML with reasoning and final category
- System prompt template:
Respond in the following format:
<reasoning>
1. Triggers detected: [List any event triggers]
2. Participants and organizers: [List any actors involved]
3. Location details: [Specify the location]
4. Violence assessment: [Indicate if violent or non-violent]
5. Event category determination: [State and justify the category]
</reasoning>
<answer>
[Final category]
</answer>
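A sketch of how a single training example can be assembled from this template for a chat-style GRPO trainer (the variable and column names are illustrative assumptions):

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
1. Triggers detected: [List any event triggers]
2. Participants and organizers: [List any actors involved]
3. Location details: [Specify the location]
4. Violence assessment: [Indicate if violent or non-violent]
5. Event category determination: [State and justify the category]
</reasoning>
<answer>
[Final category]
</answer>"""

def format_example(example: dict) -> dict:
    # Pair each news article with the system prompt; keep the gold label for the reward functions.
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["text"]},
        ],
        "answer": example["label"],
    }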
Key Mathematical Concepts
Policy Gradient with Multiple Rewards
For each prompt, GRPO samples a group of completions, scores each one with the reward functions below, and updates the policy toward completions whose combined reward exceeds the group average.
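A simplified, sequence-level statement of the GRPO objective (standard in the GRPO literature; the notation is illustrative and not taken from the training code):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(\rho_i\,\hat{A}_i,\ \mathrm{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\Big)\right]-\beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
$$

$$
\rho_i=\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\qquad
\hat{A}_i=\frac{r_i-\mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}
$$

Here $q$ is the prompt, $o_1,\dots,o_G$ are the $G$ sampled completions ($G = 6$ in this setup), and $r_i$ is the sum of the reward functions below evaluated on $o_i$. Computing the advantage relative to the group mean removes the need for a learned value function.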
Reward Functions
Our implementation uses five specialized reward functions:
- Correctness Reward: 2.0 points for an accurate classification
- Category Format Reward: 0.5 points for outputting a valid category
- Format Rewards: a combined 1.0 points across two checks for proper XML structure (one is sketched below)
- XML Micro-rewards: small incentives for correct tag placement and structure
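For illustration, one of the XML-structure checks could be written as a regex-based reward along these lines (a sketch modeled on the common GRPO recipe, not necessarily the exact function used here):

import re

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    # 0.5 points when the response contains a <reasoning> block followed by an <answer> block.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]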
Training Details
- Framework: Unsloth GRPO
- Hardware: Single NVIDIA GPU with vLLM acceleration
- Training Configuration (see the sketch below):
  - Batch Size: 1 per device
  - Gradient Accumulation Steps: 4
  - Learning Rate: 5e-6
  - Max Steps: 1,000
  - Save Steps: 500
  - Logging Steps: 1
  - Samples per Prompt: 6
  - GPU Memory Utilization: 60%
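These settings map roughly onto trl's GRPOConfig as used by the Unsloth GRPO recipe (a sketch under that assumption; anything not listed above is left at its default):

from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=6,        # samples per prompt
    max_prompt_length=512,
    max_completion_length=256,
    max_steps=1000,
    save_steps=500,
    logging_steps=1,
    use_vllm=True,            # fast rollouts during training
    output_dir="outputs",
)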
LoRA Configuration
- Rank: 64 (significantly larger than ConflLlama's rank 8)
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Alpha Scaling: 64
- Quantization: 4-bit training
- Gradient Checkpointing: Enabled ("unsloth" mode)
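A sketch of how these adapter settings are typically passed through Unsloth (argument names follow Unsloth's API; the values come from this card, the rest are assumptions):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,           # reduced sequence length (see Memory Optimizations)
    load_in_4bit=True,             # 4-bit training
    fast_inference=True,           # enable vLLM-backed generation
    gpu_memory_utilization=0.6,    # cap GPU memory at 60%
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)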
Generation Parameters
- Temperature: 0.8
- Top-p: 0.95
- Max tokens: 256
- Max prompt length: 512
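At inference time these parameters correspond to a vLLM sampling configuration; a hedged sketch assuming Unsloth's fast_generate path and the SYSTEM_PROMPT constant shown earlier:

from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)

text = tokenizer.apply_chat_template(
    [{"role": "system", "content": SYSTEM_PROMPT},
     {"role": "user", "content": "Protesters gathered outside the ministry on Tuesday ..."}],
    tokenize=False,
    add_generation_prompt=True,
)

# fast_generate returns vLLM-style request outputs.
output = model.fast_generate(text, sampling_params=sampling_params)[0].outputs[0].text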
Model Architecture
The training architecture combines GRPO reinforcement learning with parameter-efficient LoRA fine-tuning of the Qwen2.5-3B-Instruct base model.
Reinforcement Learning Benefits
This model demonstrates key advantages over supervised fine-tuning:
Structured Output Enforcement
- Consistent XML formatting:
<reasoning>
1. Triggers detected: [...]
2. Participants and organizers: [...]
3. Location details: [...]
4. Violence assessment: [...]
5. Event category determination: [...]
</reasoning>
<answer>
[Final category]
</answer>
Improved Reasoning Capability
- Explicit step-by-step reasoning before final classification
- Consideration of multiple factors (violence, participants, location)
- Transparent justification process
Reward-Based Improvement
- Self-correcting behavior through multiple reward signals
- Balance between format adherence and classification accuracy
- Incentivizes proper structure without sacrificing correctness
Implementation Details
The reward functions are implemented as lightweight list operations over each batch of sampled completions; the correctness reward, for example:
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Each completion is a list of chat messages; take the generated text.
    responses = [completion[0]['content'] for completion in completions]
    # Pull out the text between the <answer> tags.
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # 2.0 points when the extracted category matches the gold label, else 0.0.
    return [2.0 if r.strip() == a.strip() else 0.0
            for r, a in zip(extracted_responses, answer)]
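The helper extract_xml_answer is not shown in this card; a minimal version consistent with the output format above might look like this (an illustrative sketch, not the exact training code):

def extract_xml_answer(text: str) -> str:
    # Keep only the text between the <answer> and </answer> tags.
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()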
Memory Optimizations
- Used 4-bit quantization
- Gradient accumulation steps: 4
- Memory-efficient gradient checkpointing
- Reduced maximum sequence length to 1024
- GPU memory utilization capped at 60%
- Fast inference with vLLM
Intended Use
This model is designed for:
- Classification of civil conflict events with reasoning
- Academic research requiring transparent decision processes
- Event analysis with structured outputs
- Educational demonstration of RL-based classification
Limitations
- Fixed output structure may limit flexibility
- Performance dependent on quality of reward functions
- Maximum sequence length limited to 1024 tokens
- Reinforcement may overoptimize for reward signals rather than true understanding
- Limited to five predefined event categories
- May not generalize well to conflict events outside training distribution
Ethical Considerations
- Model trained on conflict event data
- Should be used responsibly for research purposes only
- Not intended for operational security decisions
- Results should be interpreted with appropriate context
- May contain biases present in training data
Citation
@misc{glocon-reasoning,
  author    = {Meher, Shreyas},
  title     = {GLOCON-Reasoning: Qwen2.5-3B with GRPO Reinforcement Learning},
  year      = {2024},
  publisher = {HuggingFace},
  note      = {Based on Qwen2.5-3B-Instruct and GRPO framework}
}
Acknowledgments
- Unsloth for GRPO implementation and optimization framework
- Qwen team for the base model
- Hugging Face for transformers infrastructure
- vLLM team for fast inference capabilities
- This research was supported by NSF award 2311142
