Alibaba Cloud Backs ShengShu's Ambitious Multimodal AI for Real-World Applications

April 10, 2026
  • Alibaba Cloud leads ShengShu Technology’s funding round to accelerate the development of multimodal AI models that process video, audio, and text, signaling a broader shift toward AI that moves beyond language to act in the real world.

  • ShengShu aims to build a general-purpose system that understands physical environments and acts within them, bridging digital and real-world contexts through its Vidu video platform and related technologies.

  • Industry observers note that multimodal world models could become the next major phase of AI, addressing the real-world grounding gaps seen in traditional LLMs and expanding into robotics and embodied AI.

  • ShengShu did not provide a timeline for when commercially viable world-model systems might be ready.

  • HappyHorse 1.0 has debuted with strong performance in video-generation rankings, and more ATH products are anticipated.

  • More broadly, LLMs excel at text but struggle with real-time understanding of the physical world, which is prompting investment in multimodal world models.

  • ShengShu reported significant growth in users and revenue in 2025, though exact figures were not disclosed.

  • Industry watchers expect world models to define AI’s next stage if they can successfully integrate multiple modalities with real-world grounding.

  • ShengShu’s Vidu Q3 Pro ranks competitively in text-to-video and image-to-video benchmarks, with audio-enabled versions improving rankings.

  • Vidu targets independent creators and animators, promising effortless content production across styles while aiming to build a general world model using multimodal data.

  • ShengShu’s core concept is a multimodal world model trained on vision, audio, and touch data to simulate real-world physics and causal reasoning, potentially outperforming text-only LLMs.

  • The company emphasizes that combining vision, audio, and touch data enables perception-action understanding, connecting what a system perceives to how it can interact with the real world; a rough sketch of this idea follows below.
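
To make the perception-action idea concrete, here is a minimal, purely illustrative sketch of a multimodal fusion model in PyTorch. Every module name, dimension, and the overall architecture here is a hypothetical assumption for illustration; ShengShu has not published details of its world-model design.

```python
# Illustrative sketch only: a toy model fusing vision, audio, and touch
# features into a shared latent space, then predicting both the next world
# state (a crude dynamics model) and an action (perception-to-action link).
# All dimensions and layer choices are hypothetical, not ShengShu's design.
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    def __init__(self, vision_dim=512, audio_dim=128, touch_dim=32,
                 latent_dim=256, action_dim=8):
        super().__init__()
        # One encoder per modality projects raw features into a shared space.
        self.vision_enc = nn.Linear(vision_dim, latent_dim)
        self.audio_enc = nn.Linear(audio_dim, latent_dim)
        self.touch_enc = nn.Linear(touch_dim, latent_dim)
        # Fusion combines the three modality embeddings into one latent state.
        self.fuse = nn.Sequential(
            nn.Linear(latent_dim * 3, latent_dim), nn.ReLU())
        # The "world" head predicts the next latent state; the "action" head
        # maps the fused perception directly to an action output.
        self.next_state = nn.Linear(latent_dim, latent_dim)
        self.action = nn.Linear(latent_dim, action_dim)

    def forward(self, vision, audio, touch):
        z = self.fuse(torch.cat([
            self.vision_enc(vision),
            self.audio_enc(audio),
            self.touch_enc(touch)], dim=-1))
        return self.next_state(z), self.action(z)

# Usage with random stand-in features for a batch of four observations.
model = ToyWorldModel()
v = torch.randn(4, 512)   # e.g. pooled video-frame features
a = torch.randn(4, 128)   # e.g. audio spectrogram features
t = torch.randn(4, 32)    # e.g. tactile sensor readings
pred_state, pred_action = model(v, a, t)
print(pred_state.shape, pred_action.shape)  # (4, 256) and (4, 8)
```

A production world model would presumably replace these linear encoders with large pretrained vision/audio backbones and model dynamics over time with a sequence architecture, but the core pattern, shared latent state feeding both prediction and action, is what "perception-action understanding" refers to here.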

Summary based on 11 sources

