0.5B model, 4x3090. If you have 4 GPUs, set `--num_processes=3`: three GPUs run training, and the remaining GPU deploys vLLM as an online inference engine for faster GRPO sampling. Example: 4x4090, 3 epochs, training time ~1h20min ...
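For reference, a 4-GPU launch along these lines might look like the sketch below. The deepspeed config, training script, and recipe paths are illustrative placeholders, not the repo's exact filenames; `accelerate launch --num_processes` is the standard Hugging Face Accelerate flag.

```bash
# Minimal sketch of a 4-GPU GRPO launch, assuming an accelerate-based trainer.
# --num_processes=3 leaves one GPU free, which the trainer is expected to use
# for the vLLM engine that handles online sampling during GRPO.
# File paths below are placeholders for illustration only.
accelerate launch \
    --config_file recipes/zero3.yaml \
    --num_processes=3 \
    src/x_r1/grpo.py \
    --config recipes/grpo_0.5b.yaml
```

The general rule: with N GPUs, set `--num_processes` to N-1 so one GPU stays dedicated to vLLM sampling.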
X-R1 aims to build an easy-to-use, low-cost training framework based on end-to-end reinforcement learning to accelerate the development of Scaling Post-Training. Inspired by DeepSeek-R1 and open-r1, we ...