SINGAPORE, July 8, 2025 /PRNewswire/ -- Vidu, the flagship product of ShengShu Technology and a pioneer in generative AI video, today announced a major update to its latest Vidu Q1 model. The update adds an advanced ‘Reference-to-Video’ feature powered by semantic understanding. In an industry first, the feature can generate videos at scale from up to seven image inputs.

Producing complex, multi-character films with AI is quickly becoming a reality. But until now, maintaining visual consistency across multiple scenes and videos has been one of the field’s most difficult challenges. Characters would shift subtly between shots, objects would change, and continuity would break down entirely from one video to the next.

Vidu Q1 changes that with its Reference-to-Video feature. This breakthrough allows creators to preserve a high level of consistency between the first video and those generated afterward, covering everything from character appearance and behavior to background elements and props. The model understands and tracks visual identity across frames, so new additions to a scene won’t disrupt the continuity of what’s already been established.

This capability is what makes the update to Vidu’s Q1 generative video model a transformative milestone. Scenes that once required tens of millions of dollars and months of production can now be produced for hundreds of dollars in a matter of minutes. In fact, generating a 5-second 1080p video clip with Vidu Q1’s Reference-to-Video feature can cost as little as $0.14, less than the price of a can of soda for a high-definition video.

Imagine recreating the iconic Battle of Helm’s Deep from The Lord of the Rings, which originally took 120 days to film and required hundreds of actors and extras. Vidu Q1 brings us closer to achieving a similar scale on a shoestring production budget, all from the comfort of your home. It signals a paradigm shift: AI models are becoming sophisticated and perceptive enough to synthesize cinematic-scale detail with minimal human input and visual cues, redefining the boundaries of what’s possible in filmmaking.

Inferring the Unseen: Generating Objects Without Reference Images

At the heart of the Reference-to-Video feature is Vidu Q1’s enhanced semantic understanding engine. By understanding how reference images relate to text prompts, Vidu Q1 can automatically infer missing visual elements and generate key objects described in the prompt, even if they aren’t explicitly present in the input images.

For example, a creator might upload an image of a man, a bird, and a cityscape, then prompt: “The man plays a violin while the bird lands on his shoulder in the city at sunset.” Even if no violin image is provided, Vidu’s semantic core generates and integrates a violin seamlessly, preserving visual consistency and narrative clarity throughout the clip.

With this breakthrough, creators no longer face steep technical hurdles when attempting to create complex scenes. A simple text prompt is interpreted and realized by the model’s robust semantic layer, removing the need to upload every individual prop or element. This way, users can focus on storytelling and visual impact, letting Vidu handle asset generation and making sure each scene is coherent.

Unlocking Next-Level Consistency: Multi-Reference for Up to Seven Images

Vidu Q1’s expanded multi-image reference capability, which supports up to seven reference images per video sequence, is a major leap forward for AI filmmaking. Creators can build visually richer scenes featuring multiple characters, props, or backgrounds, without ever needing them in the same room. This is more than just a feature. It’s a creative unlock that pushes generative video closer to the complexity of traditional film production using only prompts and reference visuals.

“This update breaks through the limits of what creators thought they could do with AI video. We’re getting closer to enabling users to create fully realized scenes, complete with a detailed cast of characters, objects, and backgrounds, by expanding multi-image referencing to support up to seven inputs,” said Luo Yihang, CEO at ShengShu Technology. “Combined with advanced semantic understanding, this marks a major step toward bridging pure imagination with precise execution. This will allow users to craft scenes with deliberate structure and consistency, progressing from isolated clips to fully formed, narratively cohesive videos.”

In addition, references can be saved to a personal image library and reused in future projects to generate more detailed scenarios with layered visual elements, while maintaining stable frame-to-frame continuity. The Vidu Q1 model supports up to seven images for now, but this is just the start: the capability is continuously being optimized to achieve greater stability.

About ShengShu Technology

Founded in March 2023, ShengShu Technology is a world-leading artificial intelligence company, specializing in the development of Multimodal Large Language Models. Driven by innovation, the company delivers cutting-edge MaaS and SaaS products that revolutionize creative production by enabling smarter, faster, and more scalable content creation. With its flagship video generation platform Vidu, ShengShu Technology’s solutions have reached more than 200 countries and regions around the world, spanning fields including interactive entertainment, advertising, film, animation, cultural tourism, and more.

