DPO

#Analysis#Pocket#NLP#LanguageModel#ReinforcementLearning#TransferLearning#GRPO#VerifiableRewards#Off-Policy#On-Policy#Non-VerifiableRewards
Issue Date: 2025-06-30 Bridging Offline and Online Reinforcement Learning for LLMs, Jack Lanchantin+, arXiv25 Comment: Original post: https://x.com/jaseweston/status/1939673136842313960?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...

#ComputerVision#Analysis#Pocket#NLP#LanguageModel#Supervised-FineTuning (SFT)#SyntheticData#ACL#PostTraining#Probing
Issue Date: 2025-05-18 Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding, Kung-Hsiang Huang+, ACL25 Comment: Original post: https://x.com/steeve__huang/status/1923543884367306763?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Existing LLMs (both proprietary and open-weight) struggle with simple visual arithmetic ...

#Pretraining#Pocket#NLP#LanguageModel#Supervised-FineTuning (SFT)#Safety#Toxicity#ITI (Inference Time Intervention)
Issue Date: 2025-05-09 When Bad Data Leads to Good Models, Kenneth Li+, arXiv25 Comment: Original post: https://x.com/ke_li_2021/status/1920646069613957606?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q This looks interesting. When web corpora are used for pretraining, the conventional wisdom is that keeping only high-quality data yields better models, but sources like 4chan ...

#Analysis#MachineLearning#Pocket#NLP#LanguageModel#Alignment#Hallucination#ICLR#Repetition
Issue Date: 2025-04-18 Learning Dynamics of LLM Finetuning, Yi Ren+, ICLR25 Comment: Original post: https://x.com/joshuarenyi/status/1913033476275925414?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Explanatory post: https://x.com/hillbig/status/1917189793588613299?s=46&t=Y ...

#Pocket#NLP#LanguageModel#Alignment#ICLR#PostTraining#Diversity
Issue Date: 2025-02-01 Diverse Preference Optimization, Jack Lanchantin+, ICLR25 Comment: Original post: https://x.com/jaseweston/status/1885399530419450257?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q OpenReview: https://openreview.net/forum?id=pOq9vDIYev Uses the same optimization method as DPO ...

#Analysis#Pocket#NLP#LanguageModel#Alignment#ReinforcementLearning#PPO (ProximalPolicyOptimization)#ICML#On-Policy
Issue Date: 2025-06-25 Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data, Fahim Tajwar+, ICML24

#NLP#LanguageModel#Alignment#PostTraining#read-later#Admin'sPick
Issue Date: 2024-09-25 Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Rafael Rafailov+, N_A, NeurIPS24 Comment: The paper that proposed DPO. (image) Explanatory post: https://x.com/thet ...
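
Since this is the entry that introduced DPO, a minimal sketch of its objective may be helpful. This is an illustrative PyTorch-style implementation, not the authors' code; the tensors are assumed to hold per-example summed log-probabilities of the chosen/rejected responses under the trained policy and a frozen reference model.

```python
# Minimal sketch of the DPO objective: the implicit reward of a response is
# beta * (log pi_theta - log pi_ref), and the loss is a logistic loss on the
# reward margin between the chosen and rejected responses.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # -log sigmoid(logits): pushes the chosen response's implicit reward above
    # the rejected one's; beta controls how strongly the policy can deviate
    # from the reference model.
    return -F.logsigmoid(logits).mean()
```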
#Pocket#NLP#LanguageModel#Alignment#Supervised-FineTuning (SFT)#Safety#PostTraining
Issue Date: 2024-09-24 Backtracking Improves Generation Safety, Yiming Zhang+, N_A, arXiv24 Comment: Original post: https://x.com/jaseweston/status/1838415378529112330?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...

#Article#NLP#LanguageModel#Alignment#Supervised-FineTuning (SFT)#Blog#PostTraining
Issue Date: 2025-01-25 How to align open LLMs in 2025 with DPO & synthetic data, PHILSCHMID, 2025.01 Comment: Original post: https://x.com/_philschmid/status/1882428447877705908?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Overview of DPO and its advantages over RLHF; on-policy prefer ... generated with rule-based checks or an LLM-as-a-Judge
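
The recipe described in the post, generating on-policy preference pairs with a rule-based check or an LLM judge and then training with DPO, can be summarized with a short sketch. This is only an illustration under assumed helpers; `prompts`, `generate`, and `score` are hypothetical placeholders, not the post's actual code.

```python
# Sketch of on-policy preference-data construction: sample candidates from the
# current policy, rank them with a rule-based check or an LLM-as-a-Judge score,
# and keep (chosen, rejected) pairs for a subsequent DPO run.
def build_preference_pairs(prompts, generate, score, n_samples=4):
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]  # on-policy samples
        ranked = sorted(candidates, key=score, reverse=True)       # judge or rule-based score
        if score(ranked[0]) > score(ranked[-1]):                   # require a clear winner
            pairs.append({"prompt": prompt,
                          "chosen": ranked[0],
                          "rejected": ranked[-1]})
    return pairs  # pairs in this format can be fed to a DPO trainer (e.g., TRL's DPOTrainer)
```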
#Article#MachineLearning#NLP#LanguageModel#Alignment#RLHF#Blog
Issue Date: 2024-12-18 RLHF_DPO 小話 (short notes on RLHF/DPO), 和地瞭良 / Akifumi Wachi, 2024.04 Comment: Extremely informative ...