LLMServing

#EfficiencyImprovement#Pocket#NLP#LanguageModel#Transformer#Attention#Architecture#MoE(Mixture-of-Experts)#SoftwareEngineering
Issue Date: 2025-05-20 Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures, Chenggang Zhao+, arXiv25 Comment: Original post: https://x.com/deedydas/status/1924512147947848039?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q ...

#Article#NLP#LanguageModel#Blog#Repository
Issue Date: 2025-06-22 Nano-vLLM, GeeeekExplorer, 2025.06 Comment: Original post: https://x.com/marktechpost/status/1936689592507543643?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q A minimal, clean implementation that achieves inference speed comparable to vLLM. Looks like a good resource for studying how LLM serving works. ...
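The repo advertises an interface that mirrors vLLM's `LLM` / `SamplingParams` API; below is a minimal usage sketch under that assumption. The model path and sampling values are placeholders, not taken from the repo.

```python
# Minimal sketch, assuming nano-vllm mirrors vLLM's LLM / SamplingParams API.
# "/path/to/your/model" is a hypothetical local checkpoint path.
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/your/model", enforce_eager=True)
params = SamplingParams(temperature=0.6, max_tokens=128)

# generate() takes a list of prompts and returns one output per prompt.
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0]["text"])
```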

#Article#NLP
Issue Date: 2025-06-20 Mirage Persistent Kernel: Compiling LLMs into a MegaKernel, 2025.06 Comment: Decoding appears to be faster than vLLM and SGLang (figure below cited from the blog). ![image](https://github.com/user-attachments/assets/0a2bf0e5-0d3f-4dd0-a912-6ce05ead2cad) Blog: https://zhihao... Original post: ...

#Article#LanguageModel
Issue Date: 2025-02-12 SGLang, sgl-project, 2024.01 Comment: SGLang is a fast serving framework for large language models and vision language models. It makes your interaction with models faster and more controllable ...
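SGLang serves models behind an OpenAI-compatible HTTP endpoint, so the standard `openai` client can talk to it. A small sketch, assuming a server was launched separately; the model path, port, and prompt here are illustrative choices, not from the entry above.

```python
# Assumes an SGLang server is already running, launched e.g. with:
#   python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --port 30000
import openai

# SGLang exposes an OpenAI-compatible API, so we just point the client at it.
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="default",  # assumption: the launched model is addressable by a default name
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    temperature=0.2,
    max_tokens=128,
)
print(resp.choices[0].message.content)
```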

#Article#NLP#LanguageModel#Library#Repository
Issue Date: 2024-08-31 NanoFlow, 2024.08 Comment: An LLM serving framework roughly 2x faster than vLLM. Offline evaluation: ![image](https://github.com/user-attachments/assets/93d8362d-e0e4-4bdb-9de4-178e1eef2e33) Online latency ... Original post: ...

#Article#EfficiencyImprovement#Library#Blog#OpenWeight
Issue Date: 2024-08-05 Comparing the generation speed of rinna 3.6b with DeepSpeed, vLLM, and CTranslate2, 2024.06 Comment: Using [vllm](https://github.com/vllm-project/vllm) is the easiest option, and its inference speed seems to be the fastest. It apparently achieves the speedup with a caching scheme called PagedAttention. (Figure cited from the blog.) ...
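The PagedAttention idea mentioned above: instead of reserving a contiguous max-length KV-cache region per sequence, vLLM splits the cache into fixed-size blocks and gives each sequence a block table mapping logical positions to physical blocks, so memory is allocated on demand and freed immediately when a sequence finishes. A conceptual sketch of that bookkeeping, not vLLM's actual code; the class names and block count are invented for illustration (the block size of 16 tokens matches vLLM's default):

```python
# Illustrative sketch of PagedAttention-style KV-cache management (not vLLM code).
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block; 16 is vLLM's default

@dataclass
class Sequence:
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)  # logical -> physical block ids

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks

    def append_token(self, seq: Sequence) -> None:
        # Allocate a new physical block only when the current one fills up.
        if seq.num_tokens % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            seq.block_table.append(self.free_blocks.pop())
        seq.num_tokens += 1

    def free(self, seq: Sequence) -> None:
        # A finished sequence returns all its blocks to the pool at once.
        self.free_blocks.extend(seq.block_table)
        seq.block_table.clear()

cache = PagedKVCache(num_blocks=8)
seq = Sequence()
for _ in range(40):          # decoding 40 tokens needs ceil(40/16) = 3 blocks
    cache.append_token(seq)
print(seq.block_table)       # e.g. [7, 6, 5]: physically non-contiguous blocks
cache.free(seq)
```

The payoff is that memory waste is bounded by one partially filled block per sequence, which is what lets vLLM batch many more concurrent sequences than a contiguous-allocation cache.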