List of paper and technical article notes on TextToVideoGeneration

TextToVideoGeneration

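#ComputerVision #Pocket #NLP #Dataset #Evaluation #FoundationModel #TextToImageGeneration #2D (Image) #3D (Scene) #WorldModels #KeyPoint Notes
Issue Date: 2025-12-19
GPT Summary: Introduces MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), an evaluation framework for video foundation models based on physical, logical, spatial, and temporal reasoning abilities. It accounts for violations of causality and physical laws that existing metrics overlook. Benchmarking leading video and image models shows that they perform poorly on abstract reasoning. MMGR provides a unified diagnostic benchmark aimed at improving the reasoning abilities of generative world models.
Comment: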

pj page: https://zefan-cai.github.io/MMGR.github.io/

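A framework for evaluating video/image generation models from the perspective of world models, rather than merely as video generators, along five axes:
- Physical reasoning: the understanding of the physical world needed for robot simulation and interaction
- Logical (abstract) reasoning: the ability to follow abstract concepts and rules, as needed for System 2 thinking (if A happens, B follows)
- 3D spatial reasoning: the ability to grasp relationships in 3D space, navigate environments, and understand the structure and overall picture of things, as needed to hold a cognitive map of the world
- 2D spatial reasoning: the ability to understand layouts, shapes, and relative positions projected onto 2D space, as needed to ground complex prompts
- Temporal reasoning: the ability to capture causality, event ordering, and long-range dependencies, as needed to maintain coherence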
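
As a rough illustration of what reporting results per axis could look like, the sketch below groups pass/fail judgments by the five axes listed above and computes a pass rate for each. The axis identifiers, the `Judgment` record, and the pass-rate metric are assumptions made for illustration. They are not taken from the MMGR paper or its code.

```python
from collections import defaultdict
from dataclasses import dataclass

# Identifiers for the five axes; the names are hypothetical.
AXES = ("physical", "logical", "spatial_3d", "spatial_2d", "temporal")


@dataclass(frozen=True)
class Judgment:
    """One pass/fail verdict for a generated sample (hypothetical record)."""
    axis: str
    passed: bool


def per_axis_pass_rate(judgments):
    """Return {axis: pass rate} for every axis with at least one judgment."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for j in judgments:
        if j.axis not in AXES:
            raise ValueError(f"unknown axis: {j.axis}")
        totals[j.axis] += 1
        passes[j.axis] += int(j.passed)
    # Axes without any judgment are omitted rather than reported as 0.
    return {axis: passes[axis] / totals[axis] for axis in AXES if totals[axis]}
```

Keeping the axes separate, instead of averaging them into one score, is what lets a weakness on a single axis (such as abstract reasoning) stay visible.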

[Paper Note] Paper2Video: Automatic Video Generation from Scientific Papers, Zeyu Zhu+, arXiv'25, 2025.10

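#ComputerVision #Pocket #NLP #SpeechProcessing #VideoGeneration/Understandings #VisionLanguageModel #Science #TTS #4D (Video)
Issue Date: 2025-11-29
GPT Summary: Paper2Video proposes a new benchmark and framework for automatically generating academic presentation videos from research papers. It uses a dataset built from 101 research papers and designs evaluation metrics for the generated videos. PaperTalker integrates slide generation, subtitles, and speech synthesis to make generation efficient. Experiments show that the proposed method produces videos that are more informative and more faithful than those of existing methods. The dataset and code are publicly available.
Comment: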

pj page: https://showlab.github.io/Paper2Video/

[Paper Note] LongCat-Video Technical Report, Meituan LongCat Team+, arXiv'25, 2025.10

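#ComputerVision #Pocket #DiffusionModel #OpenWeight #VideoGeneration/Understandings #WorldModels #4D (Video) #SparseAttention #Video Continuation #ImageToVideoGeneration
Issue Date: 2025-11-02
GPT Summary: LongCat-Video is a 13.6B-parameter video generation model that performs strongly across multiple video generation tasks. Built on the Diffusion Transformer framework, it generates video from text or images and maintains high quality and consistency even for long videos. For efficient inference it adopts a coarse-to-fine generation strategy and block sparse attention, and it can generate 720p, 30fps video within minutes. Training with multi-reward RLHF brings its performance on par with the latest models. The code and model weights are publicly available.
Comment: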

pj page: https://github.com/meituan-longcat/LongCat-Video

[Paper Note] Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models, Rohan Dhesikan+, arXiv'23, 2023.05

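#NeuralNetwork #ComputerVision #Controllable #Pocket #NLP #VideoGeneration/Understandings
Issue Date: 2023-05-12
GPT Summary: Combines ControlNet with zero-shot text-to-video generation to produce high-quality, consistent video that matches the flow of the frames. It interpolates the sketch inputs and then runs Text-to-Video Zero. Experimental results show that the output adheres closely to user intent. A demo and open-source resources are also provided.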
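
The summary mentions interpolating the sketch inputs before running Text-to-Video Zero, but this note does not say how the interpolation is done. As a minimal sketch of the general idea, the example below linearly blends two keyframe sketches into a sequence of per-frame control images. Linear blending, the function name `interpolate_sketches`, and the nested-list grayscale representation are all assumptions for illustration, not the method used in the paper.

```python
def interpolate_sketches(start, end, num_frames):
    """Linearly blend two same-sized grayscale sketches into num_frames frames.

    `start` and `end` are 2D lists of floats (hypothetical representation).
    The first returned frame equals `start` and the last equals `end`.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    if len(start) != len(end) or any(len(a) != len(b) for a, b in zip(start, end)):
        raise ValueError("sketches must have the same shape")

    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frame = [
            [(1 - t) * a + t * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(start, end)
        ]
        frames.append(frame)
    return frames
```

In a pipeline of this kind, each interpolated frame would serve as the conditioning image for the corresponding video frame.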