MoE (Mixture-of-Experts)

#EfficiencyImprovement#Pretraining#Pocket#NLP#LanguageModel#ICLR
Issue Date: 2025-06-25 Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization, Taishi Nakamura+, ICLR25 Comment: OpenReview: https://openreview.net/forum?id=gx1wHnf5Vp Related: #1546 Overview of the proposed method and of Diversity re-initialization. In the original Upcycling, every expert was replicated with identical weights, so this ...
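
A minimal sketch of the partial (diversity) re-initialization idea when upcycling a dense FFN into experts, assuming plain two-layer PyTorch linear experts; the helper name `drop_upcycle_expert`, the 0.5 re-initialization ratio, and the std=0.02 normal init are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def drop_upcycle_expert(dense_up: nn.Linear, dense_down: nn.Linear,
                        reinit_ratio: float = 0.5) -> nn.ModuleDict:
    """Build one expert from a dense FFN: copy the weights, then re-initialize
    a per-expert random subset of the intermediate dimensions (hypothetical helper)."""
    d_ff, d_model = dense_up.out_features, dense_up.in_features
    up = nn.Linear(d_model, d_ff, bias=False)
    down = nn.Linear(d_ff, d_model, bias=False)
    with torch.no_grad():
        up.weight.copy_(dense_up.weight)      # (d_ff, d_model), identical to the dense model
        down.weight.copy_(dense_down.weight)  # (d_model, d_ff), identical to the dense model
        # Pick a different random subset of intermediate units for every expert ...
        idx = torch.randperm(d_ff)[: int(reinit_ratio * d_ff)]
        # ... and re-initialize the matching rows/columns so the experts diverge.
        up.weight[idx] = torch.randn(len(idx), d_model) * 0.02
        down.weight[:, idx] = torch.randn(d_model, len(idx)) * 0.02
    return nn.ModuleDict({"up": up, "down": down})

# Replicate one dense FFN (d_model=1024, d_ff=4096) into 8 partially re-initialized experts.
dense_up, dense_down = nn.Linear(1024, 4096, bias=False), nn.Linear(4096, 1024, bias=False)
experts = nn.ModuleList(drop_upcycle_expert(dense_up, dense_down) for _ in range(8))
```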

#Pocket#NLP#LanguageModel#ICML#Scaling Laws
Issue Date: 2025-06-21 Scaling Laws for Upcycling Mixture-of-Experts Language Models, Seng Pei Liew+, ICML25 Comment: Original post: https://x.com/sbintuitions/status/1935970879923540248?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q OpenReview: https://openreview.net/forum?id=ZBBo19jldX Related: #1546 ...

#EfficiencyImprovement#Pocket#NLP#LanguageModel#Transformer#Attention#LLMServing#Architecture#SoftwareEngineering
Issue Date: 2025-05-20 Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures, Chenggang Zhao+, arXiv25 Comment: Original post: https://x.com/deedydas/status/1924512147947848039?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...

#Pocket#NLP#LanguageModel#ACL
Issue Date: 2025-01-06 DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, Damai Dai+, ACL24, 2024.08 Comment: In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model param ...

#Pretraining#MachineLearning#Pocket#NLP#LanguageModel#Supervised-FineTuning (SFT)#PostTraining
Issue Date: 2024-11-25 Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints, Aran Komatsuzaki+, ICLR23 Comment: I have only skimmed this, but it proposes a way to make SFT/pretraining of MoE models more efficient and more performant by reusing the weights of an existing dense checkpoint: every MLP in the MoE layer is initialized by copying the MLP weights of the existing checkpoint ...
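
A minimal sketch of the initialization described above: every expert in the MoE layer starts as an exact copy of the dense checkpoint's MLP, and only the router is freshly initialized. The module layout and names (`upcycle_ffn_to_moe`, `router`) are assumptions for illustration, not the paper's code.

```python
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, d_model: int, num_experts: int = 8) -> nn.ModuleDict:
    """Turn one dense FFN block into an MoE block by replicating its weights."""
    # Each expert is an identical copy of the pretrained dense FFN.
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    # The router has no dense counterpart, so it is initialized from scratch.
    router = nn.Linear(d_model, num_experts, bias=False)
    return nn.ModuleDict({"experts": experts, "router": router})

# Example: upcycle a dense transformer FFN (d_model=1024, d_ff=4096) into 8 experts.
dense_ffn = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
moe_block = upcycle_ffn_to_moe(dense_ffn, d_model=1024, num_experts=8)
```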

#EfficiencyImprovement#Pretraining#Pocket#NLP#Transformer#Architecture
Issue Date: 2025-02-11 Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus+, JMLR22

#NeuralNetwork#Pocket#NLP#ICLR
Issue Date: 2025-04-29 Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer+, ICLR17 Comment: The work that proposed the Mixture-of-Experts (MoE) Layer ...
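
For reference, a minimal top-k gated MoE layer in the spirit of this paper, assuming simple two-layer ReLU experts; the noisy gating term and the load-balancing auxiliary loss from the paper are omitted, and the per-token dispatch loop is written for clarity rather than efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparselyGatedMoE(nn.Module):
    """Minimal top-k gated MoE layer (noisy gating and auxiliary losses omitted)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, d_model)
        logits = self.gate(x)                             # (num_tokens, num_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)          # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                        # send each token to its k chosen experts
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route 16 token vectors of width 512 through 8 experts, top-2 per token.
moe = SparselyGatedMoE(d_model=512, d_ff=2048, num_experts=8, k=2)
y = moe(torch.randn(16, 512))
```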

#NeuralNetwork#MachineLearning#Pocket
Issue Date: 2025-04-29 Adaptive Mixture of Local Experts, Jacobs+, Neural Computation91 Comment: I thought this was the origin of Mixture of Experts, but the work below appears to have an earlier year, so is that one the origin instead ...? However, its abstract says it compares the performance of the MoE proposed in the above paper, so the chronology is unclear to me. [Evaluation of Adaptive Mixtu ...

#Article#ComputerVision#NLP#LanguageModel#MulltiModal#OpenWeight
Issue Date: 2025-06-30 ERNIE 4.5 Series, ERNIE TEAM, 2025.06 Comment: Tech Report: https://yiyan.baidu.com/blog/publication/ERNIE_Technical_Report.pdf Original post: https://x.com/paddlepaddle/status/1939535276197744952?s=46&t=Y6UuI ...

#Article#NLP#LanguageModel#Reasoning#OpenWeight
Issue Date: 2025-06-17 MiniMax-M1, MiniMax, 2025.06 Comment: Original post: https://x.com/arankomatsuzaki/status/1934642204397744137?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Benchmarks: ![image](https://github.com/user-attachments/assets/e ...

#Article#NLP#Library#Supervised-FineTuning (SFT)#Blog#OpenWeight#PostTraining
Issue Date: 2025-05-11 Megatron-LM-based Fine-Tuning of Qwen3 with ms-swift, Aratako, 2025.05 Comment: Original post: https://x.com/aratako_lm/status/1921401994532487174?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Explains how to run continued pretraining and SFT of Qwen3 using Megatron-SWIFT, a library from Alibaba, together with best pr ...

#Article#NLP#LanguageModel#Alignment#Supervised-FineTuning (SFT)#ReinforcementLearning#InstructionTuning#Blog#LongSequence#MultiLingual#OpenWeight#PostTraining
Issue Date: 2025-04-29 Qwen3, Qwen Team, 2025.04 Comment: Supports 119 languages. MoE models #1911: 30B-A3B / 235B-A22B, 128K context window. Since Qwen2.5 did not use MoE, this is a new architecture. Dense (non-MoE) models are also released. Post on best practices: http ...