Architecture
#Pocket#NLP#LanguageModel#Attention
Issue Date: 2025-04-07 KAA: Kolmogorov-Arnold Attention for Enhancing Attentive Graph Neural Networks, Taoran Fang+, arXiv25 Comment: Original post: https://x.com/theturingpost/status/1908966571227398449?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...
#Pocket#NLP#LanguageModel#Attention
Issue Date: 2025-04-07 XAttention: Block Sparse Attention with Antidiagonal Scoring, Ruyi Xu+, arXiv25 Comment: Original post: https://x.com/theturingpost/status/1908966571227398449?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...
#Pocket#NLP#LanguageModel#Attention
Issue Date: 2025-04-07 Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA, Nils Graef+, arXiv25 Comment: Original post: https://x.com/theturingpost/status/1908966571227398449?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...
#EfficiencyImprovement#Pocket#NLP#Transformer#LongSequence
Issue Date: 2025-04-06 Scalable-Softmax Is Superior for Attention, Ken M. Nakanishi, arXiv25 Comment: The method adopted in #1863 and cited in that blog post. Proposes a technique to keep the softmax distribution from becoming uniform with long contexts (i.e., to keep the model from losing its ability to attend to the important information). Explanation post: https://x.com/nrehiew_/status/1908 ...
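A minimal PyTorch sketch of the idea as I understand it: the attention logits are multiplied by s * log(n) before the softmax, so the distribution does not flatten as the context length n grows. The learnable per-head scalar `s` and the exact placement of the scaling are assumptions of this sketch, not details confirmed in the note above.

```python
import math
import torch.nn.functional as F

def ssmax_attention(q, k, v, s):
    """Attention with Scalable-Softmax-style scaling of the logits."""
    # q, k, v: (batch, heads, n, d_head); s: (heads,) learnable per-head scale (assumed)
    n, d = q.shape[-2], q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / math.sqrt(d)       # (b, h, n, n) standard scaled dot product
    logits = logits * (s.view(1, -1, 1, 1) * math.log(n))   # scale by s * log(n) so softmax stays peaked at long n
    return F.softmax(logits, dim=-1) @ v
```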
#Pocket#NLP#LanguageModel#Transformer#Attention
Issue Date: 2025-04-02 Multi-Token Attention, Olga Golovneva+, arXiv25 Comment: Original post: https://x.com/jaseweston/status/1907260086017237207?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q Whereas conventional Multi-Head Attention uses only a single query-key pair at a time, this lets the information of multiple query-key pairs be convolved together and exploited ...
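A minimal PyTorch sketch of the key-query-convolution idea described above: neighboring query-key logits are mixed with a small depthwise 2D convolution before the softmax. The kernel size, the masking scheme, and the omission of the paper's head-mixing and normalization details are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyQueryConvAttention(nn.Module):
    """Mix attention logits of neighboring queries/keys with a depthwise 2D conv."""

    def __init__(self, n_heads, kernel=(3, 5)):
        super().__init__()
        # one small conv filter per head over the (query, key) score map
        self.conv = nn.Conv2d(n_heads, n_heads, kernel,
                              padding=(kernel[0] // 2, kernel[1] // 2),
                              groups=n_heads)

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, n, d_head)
        n, d = q.shape[-2], q.shape[-1]
        logits = (q @ k.transpose(-2, -1)) / math.sqrt(d)                 # (b, h, n, n)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), diagonal=1)
        logits = logits.masked_fill(causal, 0.0)                          # zero future scores before mixing
        logits = self.conv(logits)                                        # convolve neighboring QK logits
        logits = logits.masked_fill(causal, float("-inf"))                # re-mask so attention stays causal
        return F.softmax(logits, dim=-1) @ v
```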
#Pocket#NLP#LanguageModel#Test-time Compute
Issue Date: 2025-02-10 Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, Jonas Geiping+, arXiv25
#NLP#LanguageModel#Transformer
Issue Date: 2024-10-21 Differential Transformer, Tianzhu Ye+, N_A, ICLR25 Comment: MS has been quite impressive lately (just a casual impression). # Overview: As an architecture for reducing noise in attention scores, they propose Differential Attention, which prepares two sets of QKV and computes the final attention score as the difference between the two. The noise in attention ...
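A minimal PyTorch sketch of Differential Attention as summarized above: two attention maps are computed from two sets of queries/keys and their lambda-weighted difference is applied to V. The fixed `lam` value and the omission of the paper's lambda re-parameterization and per-head normalization are simplifications of this sketch.

```python
import math
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Difference of two softmax attention maps, applied to the values."""
    # q1, k1, q2, k2: (batch, heads, n, d_head); v: (batch, heads, n, d_v)
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / math.sqrt(d), dim=-1)  # first attention map
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / math.sqrt(d), dim=-1)  # second ("noise") attention map
    return (a1 - lam * a2) @ v   # subtracting cancels attention noise common to both maps
```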
#ComputerVision#EfficiencyImprovement#NLP#Transformer#MulltiModal#SpeechProcessing
Issue Date: 2024-11-12 Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models, Weixin Liang+, arXiv24 Comment: ...
#NLP#Transformer#Normalization
Issue Date: 2025-04-19 DeepNet: Scaling Transformers to 1,000 Layers, Hongyu Wang+, arXiv22 Comment: Explanation by the State of AI Guide: https://ja.stateofaiguides.com/20220308-deepnet-transformer/ ...
#ComputerVision#Pocket#NLP#Transformer#MulltiModal#SpeechProcessing#Normalization
Issue Date: 2025-04-19 Foundation Transformers, Hongyu Wang+, arXiv22 Comment: Related: #1900 ...
#EfficiencyImprovement#Pretraining#Pocket#NLP#Transformer#MoE(Mixture-of-Experts)
Issue Date: 2025-02-11 Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus+, JMLR22