MOTIFVIDEO PREVIEW
Preview Showcase
GPU Days: 560 (H200)
Model Scale: 1.9B parameters
Resolution Roadmap: 480p; 720p and 1080p coming soon
Generation Time: 5s @ 24fps → 10s @ 24fps
MotifVideo is an open-source text-to-video generation model designed for rapid experimentation, strong text–video alignment, and scalable deployment under constrained compute budgets. This page presents a December 2025 technical preview that shares the current architecture and performance trajectory while the model remains in active training. A full open-source release, including weights and code, will follow once the development phase is complete.
The model is built at a compact scale of 1.9B parameters, enabling cost-effective training and inference while preserving long-range temporal coherence. To date, development has accumulated 560 GPU-days on H200-class hardware, and we intend to train the final model with several times more compute.
At its core, MotifVideo adopts a Hybrid Stream DiT architecture with a modified Decoupled Diffusion Transformer (DDT) head as the diffusion component. This hybrid design separates representation learning (text–video alignment, visual content generation, and high-frequency detail modeling) from the diffusion and flow-matching dynamics. By decoupling semantic representations from noise dynamics, the model achieves improved training stability, controllability, and visual fidelity even at limited scale.
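To make the split between representation learning and flow-matching dynamics concrete, here is a minimal NumPy sketch. It is not the MotifVideo architecture: `ToyDecoupledModel`, its layer sizes, the linear noise path, and the omission of text conditioning are all assumptions chosen for illustration. It only shows the general pattern of an encoder that builds a representation and a separate head that predicts a flow-matching velocity from it.

```python
import numpy as np

def interpolate(x0, noise, t):
    """Linear path from data (t=0) to noise (t=1); an assumed convention."""
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    return (1.0 - t) * x0 + t * noise

def target_velocity(x0, noise):
    """Time derivative of the linear path above: d x_t / dt = noise - x0."""
    return noise - x0

class ToyDecoupledModel:
    """Hypothetical two-part model: an encoder and a separate velocity head."""

    def __init__(self, dim, hidden, rng):
        self.w_enc = rng.normal(scale=0.1, size=(dim + 1, hidden))
        self.w_head = rng.normal(scale=0.1, size=(dim + hidden, dim))

    def encode(self, x_t, t):
        # Representation branch: sees the noisy sample and the timestep.
        inp = np.concatenate([x_t, t[:, None]], axis=1)
        return np.tanh(inp @ self.w_enc)

    def head(self, x_t, z):
        # Dynamics branch: predicts velocity from the sample and representation.
        return np.concatenate([x_t, z], axis=1) @ self.w_head

    def __call__(self, x_t, t):
        return self.head(x_t, self.encode(x_t, t))

def flow_matching_loss(model, x0, noise, t):
    """Mean squared error between predicted and target velocity."""
    x_t = interpolate(x0, noise, t)
    return float(np.mean((model(x_t, t) - target_velocity(x0, noise)) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))      # toy "data" batch
noise = rng.normal(size=(4, 8))   # matching noise batch
t = rng.uniform(size=4)           # one timestep per sample
model = ToyDecoupledModel(dim=8, hidden=16, rng=rng)
loss = flow_matching_loss(model, x0, noise, t)
```

In a design of this kind the encoder and the head can be sized and trained with different priorities, which is the property the paragraph above attributes to decoupling.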
MotifVideo employs a coarse-to-fine alignment strategy that switches between text encoders: global semantic intent is established first, followed by progressive refinement of motion, composition, and fine-grained visual details. This ordering lets the model maintain robust text alignment over long temporal horizons before allocating capacity to local motion and texture realism.
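The page does not say which encoders are used or what triggers the switch, so the following is only a sketch of one possible reading: a coarse encoder is used early and a fine one later, selected by a progress value in [0, 1]. The names `coarse_encoder`, `fine_encoder`, `select_encoder`, and the `switch_at` threshold are hypothetical, and the encoders are toy stand-ins rather than real text models.

```python
from typing import Callable, List

Encoder = Callable[[str], List[float]]

def coarse_encoder(prompt: str) -> List[float]:
    # Toy stand-in: a single global feature for the whole prompt.
    return [float(len(prompt.split()))]

def fine_encoder(prompt: str) -> List[float]:
    # Toy stand-in: one feature per word, keeping local detail.
    return [float(len(word)) for word in prompt.split()]

def select_encoder(progress: float, switch_at: float = 0.5) -> Encoder:
    """Use the coarse encoder before `switch_at`, the fine one from then on.

    `progress` is an assumed abstraction; the source does not specify
    whether the switch follows training stage or something else.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return coarse_encoder if progress < switch_at else fine_encoder

prompt = "golden retriever in forest"
early = select_encoder(0.1)(prompt)  # global summary only
late = select_encoder(0.9)(prompt)   # per-word detail
```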
Training is supported by the in-house Motif-Video-Dataset, collected from publicly available videos and filtered using multiple quality metrics, including aesthetic scores and motion dynamics. We further apply model-based filtering with a learned video-quality scoring system designed to approximate human curation. Combined with an optimized data pipeline that uses improved bin-packing to reduce padding waste, this approach significantly improves training efficiency and hardware utilization.
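The page does not describe the packing algorithm itself. As an illustration of how bin-packing can reduce padding waste, the sketch below uses first-fit decreasing, a standard heuristic swapped in here as an assumption. The sequence lengths and the `capacity` of 1000 tokens are made-up values, and the function names are hypothetical.

```python
def pack_first_fit_decreasing(lengths, capacity):
    """Pack sequence lengths into bins holding at most `capacity` tokens."""
    bins = []  # each bin is a list of sequence lengths
    free = []  # remaining capacity of each bin
    for n in sorted(lengths, reverse=True):
        if n > capacity:
            raise ValueError(f"length {n} exceeds capacity {capacity}")
        for i, room in enumerate(free):
            if n <= room:
                # Place the sequence in the first bin with enough room.
                bins[i].append(n)
                free[i] -= n
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([n])
            free.append(capacity - n)
    return bins

def padding_fraction(bins, capacity):
    """Share of allocated token slots that would be filled with padding."""
    used = sum(sum(b) for b in bins)
    return 1.0 - used / (len(bins) * capacity)

lengths = [900, 300, 700, 100, 500, 500]  # illustrative token counts
capacity = 1000

packed = pack_first_fit_decreasing(lengths, capacity)
one_per_bin = [[n] for n in lengths]  # baseline: pad every sample alone

print(packed)                                   # [[900, 100], [700, 300], [500, 500]]
print(padding_fraction(one_per_bin, capacity))  # 0.5
print(padding_fraction(packed, capacity))       # 0.0
```

On this toy input, padding each sample separately wastes half of the allocated slots, while packing fills three bins exactly. Real workloads will fall somewhere in between.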
The current roadmap begins with 480p video generation, with planned expansion to 720p and 1080p. Further planned steps are support for synchronized audio generation and playback, and a transition toward a Mixture-of-Experts (MoE) architecture to increase model capacity and specialization.
Cozy Bar with Jazz Trio
Twilight Ocean Horizon
Serene Willow Pond with Ducks
Plump Rabbit in Fantasy Landscape
Knights Charging on Battlefield
Cyberpunk Coastal Beach
Astronaut on Martian Landscape
Aerial View of Spring Park
Golden Retriever in Yellow Turtleneck
Elderly Man at Marketplace
Golden Retriever in Forest
Cozy Restaurant with Fairy Lights
Spaceship Landing on Mars
Tired Man on Subway Train
Young Musician Playing Guitar
Person in Dawn Landscape
Scientists in High-Tech Lab
Person in Dawn Landscape
Post-Apocalyptic Cityscape
Baby Crawling in Sunlit Room
Astronaut on Martian Landscape
Knights Charging on Battlefield
Plaka’s Historic Cobblestone Square
Colossal Robot in Cyberpunk Beijing
Young Musician Playing Guitar
Arctic Fox in Snowy Tundra
Surreal Grassland with Monolith
Arctic Fox in Snowy Tundra
Artisan Shaping Metal
Person Petting Golden Retriever
Underwater Ballroom Dancers
Astronaut Riding Horse in Space
Stylish Woman Walking Tokyo Street
Teenage Girl on Park Bench
Crystal Cavern with Wooden Boat
Vibrant Coral Reef Ecosystem
Woman in Green Sweater at Airport
Futuristic Megacity at Night
Hot Air Balloon Over Valley
Teenage Girl on Park Bench
Determined Girl on Park Bench
Futuristic Megacity at Night
Cliffside Ocean Sunset View
Autumn Countryside Aerial View
Post-Apocalyptic Cityscape
Futuristic Megacity at Night
Tired Man on Subway Train
Person Petting Golden Retriever
Woman in Green Sweater at Airport
Man Savoring Amber Beer at Bar
Award Ceremony Triumph Moment
Mahogany Dining Table Setup
Fog-Covered Forest at Dawn
Teddy Bear Playing Drums in Times Square
Person Standing in Rain at Night
Metropolis Awakening at Dawn