On-Policy
#Analysis#Pocket#NLP#LanguageModel#ReinforcementLearning#TransferLearning#DPO#GRPO#VerifiableRewards#Off-Policy#Non-VerifiableRewards
Issue Date: 2025-06-30 Bridging Offline and Online Reinforcement Learning for LLMs, Jack Lanchantin+, arXiv25 Comment: Original post: https://x.com/jaseweston/status/1939673136842313960?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...
#Analysis#Pocket#NLP#LanguageModel#Alignment#ReinforcementLearning#PPO (ProximalPolicyOptimization)#ICML#DPO
Issue Date: 2025-06-25 Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data, Fahim Tajwar+, ICML24