
 Toward Lifelike Visual AI: Generating Expressive Faces and Versatile 3D Worlds

Fri Jul 25, 2025, 15.30 – 16.30

We present two advances in generative visual media. First, VASA is a framework for creating lifelike talking faces from a single image and an audio clip. Its core model, VASA-1, generates synchronized lip movements, expressive facial dynamics, and natural head motions using a diffusion-based model that operates in a disentangled face latent space. It generates 512×512 video in real time at up to 40 FPS, significantly outperforming prior methods.

Second, we introduce a novel 3D generation framework based on a unified Structured LATent (SLAT) representation. SLAT combines sparse 3D grids with dense multiview features from a vision foundation model, so a single latent can be decoded into radiance fields, 3D Gaussians, or meshes. Trained on 500K diverse assets, the model supports high-quality, flexible generation and local editing from text or images, setting a new standard for 3D content creation.

AI Summer School Talk Abstract