GRPO
#Pocket#NLP#LanguageModel#ReinforcementLearning#LLM-as-a-Judge#PostTraining#VerifiableRewards
Issue Date: 2025-05-16 J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning, Chenxi Whitehouse+, arXiv25 Comment Original post: https://x.com/jaseweston/status/1923186392420450545?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Claims to be the first work to apply RL to a recipe for training models for LLM-as-a-Judge, yielding higher-quality reasoni ...
#EfficiencyImprovement#Pocket#NLP#ReinforcementLearning#Reasoning#PEFT(Adaptor/LoRA)
Issue Date: 2025-05-07 Tina: Tiny Reasoning Models via LoRA, Shangshang Wang+, arXiv25 Comment Original post: https://x.com/rasbt/status/1920107023980462575?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q (Probably) the first study to improve a reasoning model's reasoning ability by combining LoRA and RL ...
#Survey#Pocket#LanguageModel#Supervised-FineTuning (SFT)#ReinforcementLearning#Chain-of-Thought#InstructionTuning#PPO (ProximalPolicyOptimization)#Reasoning#LongSequence#RewardHacking#Contamination#VerifiableRewards#CurriculumLearning
Issue Date: 2025-05-06 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models, Chong Zhang+, arXiv25 Comment Original post: https://x.com/_philschmid/status/1918898257406709983?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q The survey's takeaways are listed as bullet points. ...
#Pocket#NLP#LanguageModel#Supervised-FineTuning (SFT)#ReinforcementLearning#DiffusionModel#Reasoning#PostTraining
Issue Date: 2025-04-18 d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning, Siyan Zhao+, arXiv25 Comment Original post: https://x.com/iscienceluvr/status/1912785180504535121?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Proposes diffuGRPO, a method that applies GRPO to dLLMs. After SFT on long CoT data, reasoni ...
#MachineLearning#Pocket#LanguageModel#ReinforcementLearning#Reasoning#LongSequence#read-later
Issue Date: 2025-03-20 DAPO: An Open-Source LLM Reinforcement Learning System at Scale, Qiying Yu+, arXiv25 Comment Argues that existing reasoning-model technical reports hide the recipes that are key to scalable RL training; indeed, when the authors ran GRPO as their baseline, compared with the AIME2024 performance reported by DeepSeek (47 points) it achieved substantially lower performance (30 points ...
#NLP#LanguageModel#RLHF#Reasoning#Mathematics#read-later
Issue Date: 2025-01-04 DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, Zhihong Shao+, arXiv24 Comment Original post: https://www.linkedin.com/posts/philipp-schmid-a6a2bb196_the-rlhf-method-behind-the-best-open-models-activity-7280850174522843137-3V9v?utm_source= ...
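For reference, a sketch of the group-relative advantage that gives GRPO its name, as introduced in the DeepSeekMath paper above (written from memory, so the notation may differ slightly from the paper): for each prompt $q$, a group of $G$ responses $o_1,\dots,o_G$ is sampled from the old policy, and each response's reward is standardized against the group, replacing PPO's learned value function; the policy is then updated with a PPO-style clipped surrogate plus a KL penalty toward a reference policy, without a critic.

$$\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}, \qquad \rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})}$$

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\hat{A}_{i,t},\ \operatorname{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_{i,t}\big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]\Big)\right]$$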
#Article#NLP#LanguageModel#Supervised-FineTuning (SFT)#ReinforcementLearning#Reasoning#SmallModel#OpenWeightLLM
Issue Date: 2025-05-01 Phi-4-reasoning Technical Report, 2025.04 Comment Original post: https://x.com/dimitrispapail/status/1917731614899028190?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q This explainer is a very good summary: https://x.com/_philschmid/status/1918216 ...
#Article#MachineLearning#Pocket#LanguageModel#Reasoning#read-later
Issue Date: 2025-03-22 Understanding R1-Zero-Like Training: A Critical Perspective, 2025.03 Comment Related work: #1815 Explainer post: https://x.com/wenhuchen/status/1903464313391624668?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Reading the explainer post, something like the Token Level Policy Update in DAPO, with respect to length ...
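To make the length point above concrete (my own reading, so treat the notation as an assumption rather than the paper's): vanilla GRPO averages the per-token loss $\ell_{i,t}$ within each response before averaging over the group, whereas a token-level aggregation like DAPO's pools all tokens in the group, so tokens in long responses are no longer down-weighted by $1/|o_i|$.

$$\underbrace{\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\ell_{i,t}}_{\text{sequence-level (GRPO)}} \quad\text{vs.}\quad \underbrace{\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\ell_{i,t}}_{\text{token-level (DAPO)}}$$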
#Article#MachineLearning#NLP#LanguageModel#ReinforcementLearning#Article
Issue Date: 2025-03-05 GRPO Judge Experiments: Findings & Empirical Observations, kalomaze's kalomazing blog, 2025.03 Comment Original post: https://www.linkedin.com/posts/philipp-schmid-a6a2bb196_forget-basic-math-problems-grpo-can-do-more-activity-7302608410875691009-nntf?utm_source= ...
#Article#NLP#LanguageModel#Supervised-FineTuning (SFT)#ReinforcementLearning#Article
Issue Date: 2025-02-19 Explaining the reinforcement learning method "GRPO" while implementing it on the CartPole task, 小川雄太郎, 2025.02 Comment Original post: https://x.com/ogawa_yutaro_22/status/1892059174789407213?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...
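Not the code from the article above, but a minimal sketch of what a GRPO-style update can look like on CartPole with Gymnasium and PyTorch (all names and hyperparameters here are hypothetical): sample a group of episodes under the current policy, use each episode's return standardized within the group as its advantage, and take a REINFORCE-style policy-gradient step; the PPO-style clipping and KL-to-reference terms of full GRPO are omitted for brevity.

```python
# A minimal GRPO-style sketch on CartPole (Gymnasium + PyTorch).
# Hypothetical illustration, not the code from the article above:
# each update samples a *group* of episodes, standardizes their returns
# within the group to get advantages, and applies a REINFORCE-style
# surrogate (PPO clipping / KL-to-reference terms omitted for brevity).
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
GROUP_SIZE = 8  # number of episodes per group ("G" in GRPO)

def run_episode():
    """Roll out one episode; return its total reward and per-step log-probs."""
    obs, _ = env.reset()
    log_probs, total_reward, done = [], 0.0, False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        total_reward += reward
        done = terminated or truncated
    return total_reward, log_probs

for update in range(200):
    returns, episode_log_probs = [], []
    for _ in range(GROUP_SIZE):
        ret, lps = run_episode()
        returns.append(ret)
        episode_log_probs.append(lps)

    # Group-relative advantage: standardize episode returns within the group
    # (no learned value function / critic, which is the point of GRPO).
    r = torch.tensor(returns)
    adv = (r - r.mean()) / (r.std() + 1e-8)

    # REINFORCE-style surrogate: every time step of episode i shares the
    # same group-relative advantage adv[i].
    loss = -torch.stack(
        [adv[i] * torch.stack(lps).sum() for i, lps in enumerate(episode_log_probs)]
    ).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if update % 20 == 0:
        print(f"update {update}: mean return {r.mean().item():.1f}")
```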