RLVR
#Pocket#NLP#LanguageModel
Issue Date: 2025-06-05 Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards, Xun Lu, arXiv25 Comment: Original post: https://x.com/grad62304977/status/1929996614883783170?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Writes a critique based on Writing Principles (e.g., coherence, creativity?), then, for the given pairwise ...
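A minimal sketch of what such a principle-based pairwise reward could look like (my reading of the truncated note, not the paper's implementation; `llm`, `pairwise_writing_reward`, and the principle list are hypothetical):

```python
# Hypothetical sketch: an LLM judge critiques two responses against
# stated writing principles, then emits a pairwise verdict that serves
# as a verifiable-style reward signal for a non-verifiable writing task.
PRINCIPLES = ["coherence", "creativity"]  # assumed examples from the note

def pairwise_writing_reward(llm, task: str, answer_a: str, answer_b: str) -> float:
    """Return +1.0 if A wins, -1.0 if B wins, 0.0 on a tie (or no verdict)."""
    judge_prompt = (
        f"Principles: {', '.join(PRINCIPLES)}\n"
        f"Task: {task}\n\nResponse A:\n{answer_a}\n\nResponse B:\n{answer_b}\n\n"
        "Critique each response against every principle, then end with exactly "
        "one line: 'WINNER: A', 'WINNER: B', or 'WINNER: TIE'."
    )
    lines = llm(judge_prompt).strip().splitlines()
    verdict = lines[-1] if lines else "WINNER: TIE"
    if "WINNER: A" in verdict:
        return 1.0
    if "WINNER: B" in verdict:
        return -1.0
    return 0.0
```

Forcing the critique before the verdict is the part that matters: the verdict is conditioned on an explicit principle-by-principle comparison rather than an unexplained preference.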
#ComputerVision#Pocket#NLP#LanguageModel#MultiModal#DataMixture
Issue Date: 2025-06-05 MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning, Yiqing Liang+, arXiv25 Comment: Original post: https://x.com/_vztu/status/1930312780701413498?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q When RLVR is applied in a multimodal setting, training on data for only a specific task works better on that task than training on all datasets ...
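A minimal sketch of per-domain mixture sampling that this finding motivates (not MoDoMoDo's actual method; `sample_batch` and the weight values are hypothetical); setting one weight near 1.0 approximates single-task training:

```python
import random

# Hypothetical sketch: draw RLVR training examples from a weighted
# mixture of per-domain datasets. The mixture weights are roughly the
# knob that data-mixture work like this tunes per target task, instead
# of naively pooling everything with uniform weight.
def sample_batch(datasets: dict[str, list], weights: dict[str, float],
                 batch_size: int) -> list:
    names = list(datasets)
    probs = [weights[n] for n in names]
    return [
        random.choice(datasets[random.choices(names, weights=probs, k=1)[0]])
        for _ in range(batch_size)
    ]

# e.g. weights = {"math_vqa": 0.8, "chart_qa": 0.1, "ocr": 0.1}  # assumed values
```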
#Pocket#NLP#LanguageModel#read-later#VerifiableRewards#Verification
Issue Date: 2025-06-03 Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning, Yuzhen Huang+, arXiv25 Comment: Original post: https://x.com/junxian_he/status/1929371821767586284?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q A Discriminative Classifier finetuned specifically for the verification task, as a rewa ...
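A minimal sketch of one rule-based-verifier pitfall the title points at (an assumed example, not taken from the paper): exact string matching produces false negatives on mathematically equivalent answer formats, which symbolic normalization (here via sympy) avoids, while model-based verifiers trade that robustness for being easier to fool:

```python
# Hypothetical sketch of two reward verifiers for a math answer.
from sympy import simplify, sympify

def exact_match_verify(pred: str, gold: str) -> bool:
    # Rule-based: "0.5" != "1/2", so an equivalent answer gets reward 0.
    return pred.strip() == gold.strip()

def symbolic_verify(pred: str, gold: str) -> bool:
    # Normalize both sides symbolically; reject anything unparsable.
    try:
        return simplify(sympify(pred) - sympify(gold)) == 0
    except Exception:
        return False

assert not exact_match_verify("0.5", "1/2")  # false negative
assert symbolic_verify("0.5", "1/2")         # equivalence recovered
```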
#NLP#LanguageModel
Issue Date: 2025-06-01 Can Large Reasoning Models Self-Train?, Sheikh Shafayat+, arXiv25 Comment: Original post: https://x.com/askalphaxiv/status/1928487492291829809?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Looks similar to #1995. Estimates the ground truth via self-consistency, then uses the estimated ground tr ...
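A minimal sketch of the self-consistency pseudo-labeling loop described here (my reading of the truncated note, not the paper's code; `generate` and `self_consistency_rewards` are hypothetical):

```python
from collections import Counter

# Hypothetical sketch: sample n reasoning traces, take the majority
# final answer as an *estimated* ground truth, and reward agreement
# with it -- no human labels involved. `generate` is assumed to return
# a (reasoning_trace, final_answer) pair for each sampled rollout.
def self_consistency_rewards(generate, prompt: str, n: int = 8) -> list[tuple[str, float]]:
    samples = [generate(prompt) for _ in range(n)]
    pseudo_label, _ = Counter(ans for _, ans in samples).most_common(1)[0]
    # Binary reward against the estimated (not true) ground truth.
    return [(trace, 1.0 if ans == pseudo_label else 0.0) for trace, ans in samples]
```

The obvious failure mode is that the majority answer can be confidently wrong, in which case the model trains on its own mistakes.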
#Pocket#LanguageModel#read-later
Issue Date: 2025-05-08 Absolute Zero: Reinforced Self-play Reasoning with Zero Data, Andrew Zhao+, arXiv25 Comment: Original post: https://x.com/arankomatsuzaki/status/1919946713567264917?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...
#Article#Analysis#NLP#LanguageModel#Mathematics#SmallModel
Issue Date: 2025-05-27 Spurious Rewards: Rethinking Training Signals in RLVR, Shao+, 2025.05 Comment: Original post: https://x.com/stellalisy/status/1927392717593526780?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Reference (analysis): https://x.com/weiliu99/status/1930826904522875309?s=46&t ...