Attentionに関する論文・技術記事メモの一覧

Attention

#EfficiencyImprovement #Pocket #NLP #Transformer #Architecture
Issue Date: 2025-06-10 Log-Linear Attention, Han Guo+, arXiv25 Comment元ポスト:https://x.com/hillbig/status/1932194773559107911?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q解説ポスト:https://x.com/theturingpost/status/1931432543766847887?s=46&t ... #EfficiencyImprovement #Pocket #NLP #LanguageModel #Transformer #LLMServing #Architecture #MoE(Mixture-of-Experts)#SoftwareEngineering
Issue Date: 2025-05-20 Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures, Chenggang Zhao+, arXiv25 Comment元ポスト:https://x.com/deedydas/status/1924512147947848039?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #Pocket #NLP #LanguageModel #AttentionSinks
Issue Date: 2025-04-09 Using Attention Sinks to Identify and Evaluate Dormant Heads in Pretrained LLMs, Pedro Sandoval-Segura+, arXiv25 Comment元ポスト:https://x.com/psandovalsegura/status/1909652533334712691?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...

#Pocket #NLP #LanguageModel #Architecture
Issue Date: 2025-04-07 KAA: Kolmogorov-Arnold Attention for Enhancing Attentive Graph Neural Networks, Taoran Fang+, arXiv25 Comment元ポスト:https://x.com/theturingpost/status/1908966571227398449?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #Pocket #NLP #LanguageModel #Architecture
Issue Date: 2025-04-07 XAttention: Block Sparse Attention with Antidiagonal Scoring, Ruyi Xu+, arXiv25 Comment元ポスト:https://x.com/theturingpost/status/1908966571227398449?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #Pocket #NLP #LanguageModel #Architecture
Issue Date: 2025-04-07 Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA, Nils Graef+, arXiv25 Comment元ポスト:https://x.com/theturingpost/status/1908966571227398449?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #Pocket #NLP #LanguageModel #ICLR #AttentionSinks
Issue Date: 2025-04-05 When Attention Sink Emerges in Language Models: An Empirical View, Xiangming Gu+, ICLR25 CommentSink Rateと呼ばれる、全てのheadのFirst Tokenに対するattention scoreのうち（layer l * head h個存在する）、どの程度の割合のスコアが閾値を上回っているかを表す指標を提案#1860の先行研究 ... #Analysis #NLP #LanguageModel #AttentionSinks
Issue Date: 2025-04-05 Why do LLMs attend to the first token?, Federico Barbero+, arXiv25 Comment元ポスト:https://x.com/omarsar0/status/1908187563422261411?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-QAttention Sinkによって、トークンの情報がover-mixingされることが抑制され、Decoder-only LLMの ... #Pocket #NLP #LanguageModel #Transformer #Architecture
Issue Date: 2025-04-02 Multi-Token Attention, Olga Golovneva+, arXiv25 Comment元ポスト:https://x.com/jaseweston/status/1907260086017237207?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q従来のMulti Head Attentionでは、単体のQKのみを利用していたけど、複数のQKの情報を畳み込んで活用できるよう ... #EfficiencyImprovement #MachineLearning #Pocket #NLP #LanguageModel
Issue Date: 2025-03-02 Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention, Jingyang Yuan+, arXiv25 Comment元ポスト:https://x.com/dair_ai/status/1893698286545969311?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #Pocket #NLP #Transformer
Issue Date: 2025-04-06 Flex Attention: A Programming Model for Generating Optimized Attention Kernels, Juechu Dong+, arXiv24 Comment#1863で利用されているAttentionpytochによる解説:https://pytorch.org/blog/flexattention/Flex AttentionはオリジナルのAttentionのQK/sqrt(d_k)の計算後にユーザが定義した関数score_modを適用するscore ... #Pocket #ICLR #AttentionSinks
Issue Date: 2025-04-05 Efficient Streaming Language Models with Attention Sinks, Guangxuan Xiao+, ICLR24 CommentAttention Sinkという用語を提言した研究#1860の先行研究 ... #Survey #EfficiencyImprovement #NLP #LanguageModel #Transformer
Issue Date: 2024-11-17 Understanding LLMs: A Comprehensive Overview from Training to Inference, Yiheng Liu+, arXiv24 Comment[Perplexity（参考;Hallucinationに注意）](https://www.perplexity.ai/search/yi-xia-nolun-wen-wodu-minei-ro-7vGwDK_AQX.HDO7j9H8iNA)単なるLLMの理論的な説明にとどまらず、実用的に必要な各種 ... #EfficiencyImprovement #Pocket #NLP #LanguageModel #Transformer
Issue Date: 2024-04-07 Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, Piotr Nawrot+, N_A, arXiv24 Summaryトランスフォーマーの生成効率を向上させるために、Dynamic Memory Compression（DMC）が提案された。DMCは、異なるヘッドとレイヤーで異なる圧縮率を適用する方法を学習し、事前学習済みLLMsに適用される。DMCは、元の下流パフォーマンスを最大4倍のキャッシュ圧縮で維持しつつ、スループットを向上させることができる。DMCは、GQAと組み合わせることでさらなる利益をもたらす可能性があり、長いコンテキストと大きなバッチを処理する際に有用である。 Comment参考: https://x.com/hillbig/status/1776755029581676943?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q論文中のFigure1が非常にわかりやすい。GQA #1271 と比較して、2~4倍キャッシュを圧縮しつつ、より高い性能を実現。70Bモ ...

#EfficiencyImprovement #Pocket #NLP #LanguageModel #Transformer
Issue Date: 2024-04-07 GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, Joshua Ainslie+, N_A, arXiv23 SummaryMulti-query attention（MQA）は、単一のkey-value headのみを使用しており、デコーダーの推論を劇的に高速化しています。ただし、MQAは品質の低下を引き起こす可能性があり、さらには、より速い推論のためだけに別個のモデルをトレーニングすることが望ましくない場合もあります。既存のマルチヘッド言語モデルのチェックポイントを、オリジナルの事前トレーニング計量の5%を使用してMQAを持つモデルにアップトレーニングするためのレシピを提案し、さらに、複数のkey-value headを使用するマルチクエリアテンションの一般化であるグループ化クエリアテンション（GQA）を紹介します。アップトレーニングされたGQAが、MQAと同等の速度でマルチヘッドアテンションに匹敵する品質を達成することを示しています。 Comment通常のMulti-Head AttentionがQKVが1対1対応なのに対し、Multi Query Attention (MQA) #1272 は全てのQに対してKVを共有する。一方、GQAはグループごとにKVを共有する点で異なる。MQAは大幅にInfeerence` speedが改善するが、精 ...

#Pocket #NLP #LanguageModel
Issue Date: 2023-11-10 Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs, Qingru Zhang+, N_A, arXiv23 SummaryPASTAは、大規模言語モデル（LLMs）において、ユーザーが指定した強調マークのあるテキストを読むことを可能にする手法です。PASTAは、注意の一部を特定し、再重み付けを適用してモデルの注意をユーザーが指定した部分に向けます。実験では、PASTAがLLMの性能を大幅に向上させることが示されています。 Commentユーザがprompt中で強調したいした部分がより考慮されるようにattention weightを調整することで、より応答性能が向上しましたという話っぽい。かなり重要な技術だと思われる。後でしっかり読む。 ...

#MachineLearning #NLP #LanguageModel
Issue Date: 2023-08-08 The Hydra Effect: Emergent Self-repair in Language Model Computations, Thomas McGrath+, N_A, arXiv23 Summary私たちは、言語モデルの内部構造を調査し、言語モデルの計算における特定の効果を示しました。具体的には、1つの層の削除が他の層によって補完される「Hydra効果」と、遅いMLP層が最大尤度トークンを制御する役割を持つことを示しました。また、ドロップアウトを使用しない言語モデルでも同様の効果が見られることを示しました。これらの効果を事実の回想の文脈で分析し、言語モデルの回路レベルの属性付与について考察しました。 CommentLLMからattention layerを一つ取り除くと、後続の層が取り除かれたlayerの機能を引き継ぐような働きをすることがわかった。これはLLMの自己修復機能のようなものであり、HydraEffectと命名された。 ... #NLP #DataDistillation #Zero/FewShotLearning
Issue Date: 2023-07-14 Dataset Distillation with Attention Labels for Fine-tuning BERT, ACL23 Summary本研究では、データセットの蒸留を使用して、元のデータセットのパフォーマンスを保持しながら、ニューラルネットワークを迅速にトレーニングするための小さなデータセットを作成する方法に焦点を当てています。具体的には、事前学習済みのトランスフォーマーを微調整するための自然言語処理タスクの蒸留されたfew-shotデータセットの構築を提案しています。実験結果では、注意ラベルを使用してfew-shotデータセットを作成し、BERTの微調整において印象的なパフォーマンスを実現できることを示しました。例えば、ニュース分類タスクでは、わずか1つのサンプルとわずか1つの勾配ステップのみで、元のデータセットの98.5％のパフォーマンスを達成しました。 CommentDatadistillationしたら、データセットのうち1サンプルのみで、元のデータセットの98.5%の性能を発揮できたという驚異的な研究（まえかわ君） ... #EfficiencyImprovement #Pocket #NLP #LanguageModel #Transformer #LongSequence #Inference
Issue Date: 2023-04-30 Efficiently Scaling Transformer Inference, Reiner Pope+, N_A, MLSys23 Summary大規模Transformerベースのモデルの推論のエンジニアリングのトレードオフを理解するために、最適な多次元分割技術を選択するための単純な解析モデルを開発低レベルの最適化と組み合わせることで、500B+パラメータモデルのレイテンシーとモデルFLOPS利用率のトレードオフにおいて、FasterTransformerベンチマークスイートを上回る新しいParetoフロンティアを実現適切な分割により、マルチクエリアテンションの低いメモリ要件により、32倍の大きなコンテキスト長にスケーリング可能int8ウェイト量子化を使用した生成中の低バッチサイズレイテンシーは、トークンあたり29msであり、入力トークンの大バッチサイズ処理において76％のMFUを実現し、PaLM 540Bパラメータモデルにおいて2048トークンの長いコンテキスト長をサポートしている。 Comment特にMultiquery Attentionという技術がTransformerのinferenceのコスト削減に有効らしい ... #EfficiencyImprovement #MachineLearning
Issue Date: 2023-05-20 FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, Tri Dao+, N_A, arXiv22 Summaryトランスフォーマーは、長いシーケンスに対して遅く、メモリを多く消費するため、注意アルゴリズムを改善する必要がある。FlashAttentionは、タイリングを使用して、GPUの高帯域幅メモリ（HBM）とGPUのオンチップSRAM間のメモリ読み取り/書き込みの数を減らし、トランスフォーマーを高速にトレーニングできる。FlashAttentionは、トランスフォーマーでより長い文脈を可能にし、より高品質なモデルや、完全に新しい機能を提供する。 Commentより高速なGPU上のSRAM上で計算できるようにQKVをブロック単位に分割して計算することで、より高い計算効率を実現するFlashAttentionを提案[^1][^1]: （2025.05.24追記)下記日本語ブログを参考に一部文言を訂正しました。ありがとうございます。日本語解説:https:// ...

#EfficiencyImprovement #Pocket #NLP #LanguageModel #Transformer
Issue Date: 2024-04-07 Fast Transformer Decoding: One Write-Head is All You Need, Noam Shazeer, N_A, arXiv19 Summaryマルチヘッドアテンションレイヤーのトレーニングは高速かつ簡単だが、増分推論は大きな"keys"と"values"テンソルを繰り返し読み込むために遅くなることがある。そこで、キーと値を共有するマルチクエリアテンションを提案し、メモリ帯域幅要件を低減する。実験により、高速なデコードが可能で、わずかな品質の低下しかないことが確認された。 CommentMulti Query Attention論文。KVのsetに対して、単一のQueryのみでMulti-Head Attentionを代替する。劇的にDecoderのInferenceが早くなりメモリ使用量が減るが、論文中では言及されていない？ようだが、性能と学習の安定性が課題となるようである。 ...

#NeuralNetwork #MachineTranslation #NLP #Transformer #FoundationModel #NeurIPS #Admin'sPick
Issue Date: 2018-01-19 Attention is all you need, Vaswani+, NIPS17 CommentTransformer (self-attentionを利用) 論文解説スライド：https://www.slideshare.net/DeepLearningJP2016/dlattention-is-all-you-need 解説記事：https://qiita.com/nishiba/i分か ... #NeuralNetwork #MachineTranslation #Pocket #NLP #ICLR #Admin'sPick
Issue Date: 2025-05-12 Neural Machine Translation by Jointly Learning to Align and Translate, Dzmitry Bahdanau+, ICLR15 Comment(Cross-)Attentionを初めて提案した研究。メモってなかったので今更ながら追加。Attentionはここからはじまった（と認識している） ... #Article #Tutorial #Pretraining #MachineLearning #NLP #LanguageModel #Transformer #Chain-of-Thought #In-ContextLearning #DiffusionModel #SSM (StateSpaceModel)#Scaling Laws #PostTraining
Issue Date: 2025-05-31 2025年度人工知能学会全国大会チュートリアル講演「深層基盤モデルの数理」, Taiji Suzuki, 2025.05 Comment元ポスト:https://x.com/btreetaiji/status/1927678122817921442?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ... #Article #Survey #Blog
Issue Date: 2025-03-18 15 types of attention mechanisms, Kseniase, 2025.03 CommentLuongらのアテンションやsoft, globalアテンションなど、古くからあるattentionも含まれている。 ... #Article #Tutorial #NLP #LanguageModel #Blog
Issue Date: 2024-12-28 MHA vs MQA vs GQA vs MLA, Zain ul Abideen, 2024.07 CommentDeepSeekで使われているMulti Head Latent Attention（MLA）ってなんだ？と思い読んだ。端的に言うと、GQAやMQAは、KVのヘッドをそもそも減らしてKV Cacheを抑えよう、という手法だったが、MLAはKVを低ランクなベクトルに圧縮して保持し、使う時に復元するとい ... #Article #EfficiencyImprovement #NLP #LanguageModel
Issue Date: 2023-12-14 【続】Flash Attentionを使ってLLMの推論を高速・軽量化できるか？ Commentuse_cacheがTrue/Falseの場合のFlashAttention2のinference timeとVRAM使用量の傾向をsequence_lengthごとに考察している。use_cacheはKey Value cacheのオンオフを切り替えられるオプションである。autoregresFl ... #Article #EfficiencyImprovement #MachineLearning #NLP #Transformer
Issue Date: 2023-07-23 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, 2023 SummaryFlashAttention-2は、長いシーケンス長におけるTransformerのスケーリングの問題に対処するために提案された手法です。FlashAttention-2は、非対称なGPUメモリ階層を利用してメモリの節約とランタイムの高速化を実現し、最適化された行列乗算に比べて約2倍の高速化を達成します。また、FlashAttention-2はGPTスタイルのモデルのトレーニングにおいても高速化を実現し、最大225 TFLOPs/sのトレーニング速度に達します。 CommentFlash Attention1よりも2倍高速なFlash Attention 2Flash Attention1はこちらを参照https://arxiv.org/pdf/2205.14135.pdfQK Matrixの計算をブロックに分けてSRAMに送って処理することで、3倍高速化し、メモリ効率を ...