RL for LLM: A Collection of High-Quality Articles
Anthropic Skills: Interpretation and Practice
The Evolution of Reinforcement Learning Algorithms for LLMs: MC -> TD -> Q-Learning -> DQN -> PG -> AC -> TRPO -> PPO -> DPO -> GRPO
Learning PyTorch
WebDancer: Towards Autonomous Information Seeking Agency
TongSearch-QR: Reinforced Query Reasoning for Retrieval