CurriculumLearning
#Survey#Pocket#LanguageModel#Supervised-FineTuning (SFT)#ReinforcementLearning#Chain-of-Thought#InstructionTuning#PPO (ProximalPolicyOptimization)#Reasoning#LongSequence#RewardHacking#GRPO#Contamination#VerifiableRewards
Issue Date: 2025-05-06 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models, Chong Zhang+, arXiv25 Comment Original post: https://x.com/_philschmid/status/1918898257406709983?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q The survey's takeaways are listed as bullet points. ...