Architecture
#EfficiencyImprovement#Pocket#NLP#LanguageModel
Issue Date: 2025-06-28 Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models, Zihan Wang+, arXiv25 Comment Original post: https://x.com/theturingpost/status/1938728784351658087?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #Pocket#NLP#LanguageModel#Transformer#ACL
Issue Date: 2025-06-12 Value Residual Learning, Zhanchao Zhou+, ACL25 Comment Original post: https://x.com/zhanchaozhou/status/1932829678081098079?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q #SoftwareEngineering
Issue Date: 2025-05-20 Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures, Chenggang Zhao+, arXiv25 Comment Original post: https://x.com/deedydas/status/1924512147947848039?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #Pocket#NLP#LanguageModel#Attention
Issue Date: 2025-04-07 KAA: Kolmogorov-Arnold Attention for Enhancing Attentive Graph Neural Networks, Taoran Fang+, arXiv25 Comment Original post: https://x.com/theturingpost/status/1908966571227398449?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #Pocket#NLP#LanguageModel#Attention
Issue Date: 2025-04-07 XAttention: Block Sparse Attention with Antidiagonal Scoring, Ruyi Xu+, arXiv25 Comment Original post: https://x.com/theturingpost/status/1908966571227398449?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #Pocket#NLP#LanguageModel#Attention
Issue Date: 2025-04-07 Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA, Nils Graef+, arXiv25 Comment Original post: https://x.com/theturingpost/status/1908966571227398449?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #EfficiencyImprovement#Pocket#NLP#Transformer#LongSequence
Issue Date: 2025-04-06 Scalable-Softmax Is Superior for Attention, Ken M. Nakanishi, arXiv25 Comment The method adopted in #1863 and cited in that blog post. Proposes a technique to keep the softmax distribution from flattening out as the context gets long (which would erode the ability to attend to important information). Explanation post: https://x.com/nrehiew_/status/1908 ... #Pocket#NLP#LanguageModel#Transformer#Attention
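To make the comment above concrete, here is a minimal sketch of Scalable-Softmax as I understand it from the paper: the attention logits are multiplied by s·log(n) before the softmax, where n is the number of attended positions and s is a (normally learnable) scaling parameter, so the distribution stays peaked as the context grows. The function names and the toy comparison are illustrative, not code from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scalable_softmax(z, s=1.0):
    """Scale the logits by s * log(n) before the softmax, where n is the
    number of positions; this keeps the distribution peaked as n grows."""
    n = z.shape[-1]
    return softmax(s * np.log(n) * z)

# Toy comparison: with many positions, plain softmax flattens while
# the scaled version keeps probability mass on the largest logits.
rng = np.random.default_rng(0)
for n in (16, 4096):
    z = rng.normal(size=n)
    print(n, round(float(softmax(z).max()), 3), round(float(scalable_softmax(z).max()), 3))
```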
Issue Date: 2025-04-02 Multi-Token Attention, Olga Golovneva+, arXiv25 Comment Original post: https://x.com/jaseweston/status/1907260086017237207?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Whereas conventional Multi-Head Attention relies on a single query-key pair per attention weight, this lets the model convolve and exploit information from multiple query-key pairs ... #Pocket#NLP#LanguageModel#Test-Time Scaling
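A hedged sketch of the idea in the comment above: rather than letting each attention weight depend on a single query-key dot product, a small convolution over the (query, key) logit map mixes information from neighboring query-key pairs before the softmax. In the paper the convolution kernels are learned (with additional head mixing); the fixed averaging kernel, mask handling, and shapes below are simplifying assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def multi_token_attention_scores(q, k, kernel_size=3):
    """Sketch: convolve the query-key logit map so each attention weight can
    depend on several neighboring query-key scores, not a single dot product.
    q, k: (batch, seq, dim). Returns attention weights of shape (batch, seq, seq)."""
    T = q.shape[-2]
    logits = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5           # (B, T, T)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    logits = logits.masked_fill(mask, -1e9)                         # causal mask (assumption)
    # Fixed averaging kernel stands in for the paper's learned convolution.
    weight = torch.full((1, 1, kernel_size, kernel_size), 1.0 / kernel_size**2)
    mixed = F.conv2d(logits.unsqueeze(1), weight, padding=kernel_size // 2).squeeze(1)
    mixed = mixed.masked_fill(mask, float("-inf"))                  # re-mask after mixing
    return mixed.softmax(dim=-1)

attn = multi_token_attention_scores(torch.randn(2, 8, 16), torch.randn(2, 8, 16))
print(attn.shape)  # torch.Size([2, 8, 8])
```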
Issue Date: 2025-02-10 Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Jonas Geiping+, arXiv25 #NLP#LanguageModel#Transformer
Issue Date: 2024-10-21 Differential Transformer, Tianzhu Ye+, N_A, ICLR25 Comment MS has been quite impressive lately (just my impression). # Overview Proposes Differential Attention, an architecture for reducing noise in attention scores: two sets of QKV are prepared and the final attention score is computed as the difference between the two. The attention nois ... #ComputerVision#EfficiencyImprovement#NLP#Transformer#MulltiModal#SpeechProcessing
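A rough sketch of the Differential Attention mechanism summarized above, assuming the published formulation: two independent attention maps are computed from two sets of query/key projections, and their λ-weighted difference is applied to the values, so noise common to both maps cancels. λ is treated as a fixed scalar here; in the paper it is reparameterized and learned per layer. All projection names and shapes below are illustrative.

```python
import torch

def differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    """Sketch: (softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) @ V.
    x: (batch, seq, dim); wq*/wk*: (dim, d); wv: (dim, dim)."""
    d = wq1.shape[-1]
    q1, k1 = x @ wq1, x @ wk1
    q2, k2 = x @ wq2, x @ wk2
    v = x @ wv
    a1 = torch.softmax(q1 @ k1.transpose(-1, -2) / d ** 0.5, dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(-1, -2) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v   # noise present in both maps cancels in the difference

B, T, D, d = 2, 8, 32, 16
x = torch.randn(B, T, D)
out = differential_attention(x, *(torch.randn(D, d) for _ in range(4)), torch.randn(D, D))
print(out.shape)  # torch.Size([2, 8, 32])
```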
Issue Date: 2024-11-12 Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models, Weixin Liang+, arXiv24 Comment ... #ComputerVision#Pocket#NLP#Transformer#MulltiModal#SpeechProcessing#Normalization
Issue Date: 2025-04-19 Foundation Transformers, Hongyu Wang+, PMLR23 Comment Related: #1900 ... #NLP#Transformer#Normalization
Issue Date: 2025-04-19 DeepNet: Scaling Transformers to 1,000 Layers, Hongyu Wang+, arXiv22 Comment Explanation by the State of AI Guide: https://ja.stateofaiguides.com/20220308-deepnet-transformer/ ... #EfficiencyImprovement#Pretraining#Pocket#NLP#Transformer#MoE(Mixture-of-Experts)
Issue Date: 2025-02-11 Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus+, JMLR22