RewardHacking

#Survey#Pocket#LanguageModel#Supervised-FineTuning (SFT)#ReinforcementLearning#Chain-of-Thought#InstructionTuning#PPO (ProximalPolicyOptimization)#Reasoning#LongSequence#GRPO#Contamination#VerifiableRewards#CurriculumLearning
Issue Date: 2025-05-06 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models, Chong Zhang+, arXiv25 Comment: Original post: https://x.com/_philschmid/status/1918898257406709983?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q The survey's takeaways are listed as bullet points. ... #Pocket#NLP#ICLR
Issue Date: 2025-04-06 CREAM: Consistency Regularized Self-Rewarding Language Models, Zhaoyang Wang+, ICLR25 Comment: A study that improves on #1212. OpenReview: https://openreview.net/forum?id=Vf6RDObyEF Research in this direction is interesting. ... #Analysis#NLP#LanguageModel#Supervised-FineTuning (SFT)#ReinforcementLearning#Chain-of-Thought#Reasoning#LongSequence#PostTraining
Issue Date: 2025-02-07 Demystifying Long Chain-of-Thought Reasoning in LLMs, Edward Yeo+, arXiv25 Comment: Original post: https://x.com/xiangyue96/status/1887332772198371514?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q The thread of the original post lists 11 findings from the paper. All of them are very interesting. As with the DeepSeek-R1 technical paper ...