I build AI systems that see, hear, and reason about the 3D world.
I am a PhD student in the Computer Science Department at Johns Hopkins University, advised by Prof. Alan Yuille. My research focuses on AI systems with 3D spatial reasoning and multimodal understanding — aligning 3D/4D knowledge with language models and fusing audio-visual modalities for both generation and understanding. Ultimately, I strive to develop AI systems that can truly understand and engage with the 4D physical world.
Before JHU, I obtained my M.S. from the University of Southern California, where I worked with Prof. Laurent Itti. Prior to USC, I received my B.S. in Statistics from Renmin University of China, supervised by Prof. Hanfang Yang. I have also conducted research internships with the GenAI group at AMD and at Samsung R&D Institute.
- Jan 2026: XModBench accepted at ICLR 2026. Code release coming soon.
- Oct 2025: Heading to ICCV 2025 to present at the GEN4AVM Workshop.
- Sep 2025: SpatialReasoner accepted at NeurIPS 2025.
- Jun 2025: Invited talk at the BEAM Workshop at CVPR 2025. (slides)
- Feb 2025: Spatial457 accepted at CVPR 2025 as a highlight paper.
- Jan 2025: One paper accepted at ICLR 2025.
Full list on Google Scholar
- Built a diffusion-based video generation model conditioned on audio and image inputs for dynamic motion synthesis.
- Evaluated temporal alignment between the conditioning audio and the generated video sequences.
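For a flavor of the evaluation, here is a minimal sketch of one way to score audio-video temporal alignment: correlate the audio onset envelope with the video's frame-difference motion energy over small lags. The metric and all names here are illustrative, not the exact protocol used in the project.

```python
import numpy as np
import librosa

def motion_energy(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W, 3) uint8 video. Returns per-frame motion energy, shape (T-1,)."""
    diffs = np.diff(frames.astype(np.float32), axis=0)
    return np.abs(diffs).mean(axis=(1, 2, 3))

def alignment_score(frames: np.ndarray, audio: np.ndarray, sr: int, fps: float,
                    max_lag: int = 5) -> float:
    """Peak normalized cross-correlation between motion and onset envelopes."""
    motion = motion_energy(frames)
    # Resample the audio onset envelope to roughly one value per video frame.
    hop = int(sr / fps)
    onset = librosa.onset.onset_strength(y=audio, sr=sr, hop_length=hop)
    n = min(len(motion), len(onset))
    m = (motion[:n] - motion[:n].mean()) / (motion[:n].std() + 1e-8)
    o = (onset[:n] - onset[:n].mean()) / (onset[:n].std() + 1e-8)
    # Take the best correlation over small temporal shifts to tolerate minor offsets.
    scores = [np.mean(m[max(0, -k):n - max(0, k)] * o[max(0, k):n - max(0, -k)])
              for k in range(-max_lag, max_lag + 1)]
    return float(max(scores))
```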
- Proposed a method that combines language hints with object template matching for human-guided reinforcement learning, improving learning efficiency.
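As a rough illustration of the idea (not the actual implementation), the sketch below shapes the reward with an OpenCV template match whenever an object named in a human hint appears in the observation; the template table, threshold, and bonus value are hypothetical.

```python
import cv2
import numpy as np

# Hypothetical mapping from words in a human hint to object template images.
TEMPLATES = {"key": cv2.imread("templates/key.png"),
             "door": cv2.imread("templates/door.png")}

def shaping_bonus(obs_rgb: np.ndarray, hint: str,
                  threshold: float = 0.8, bonus: float = 0.1) -> float:
    """Return a small bonus if an object named in the hint is visible."""
    total = 0.0
    for word, template in TEMPLATES.items():
        if word in hint.lower() and template is not None:
            result = cv2.matchTemplate(obs_rgb, template, cv2.TM_CCOEFF_NORMED)
            if result.max() >= threshold:
                total += bonus
    return total

# Inside the RL loop: reward = env_reward + shaping_bonus(obs, hint)
```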
- Developed a solution for the ALFRED benchmark that integrates instance segmentation and depth estimation to ground object positions on a bird's-eye-view map and generate language-guided navigation paths.
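The grounding step can be pictured with a short sketch: under pinhole-camera assumptions, project the depth pixels inside an instance mask onto a top-down occupancy grid. The function name, camera parameters, and grid resolution below are illustrative.

```python
import numpy as np

def mask_to_bev(mask: np.ndarray, depth: np.ndarray, fov_deg: float = 60.0,
                grid_size: int = 100, cell_m: float = 0.25) -> np.ndarray:
    """mask: (H, W) bool, depth: (H, W) meters. Returns a (grid, grid) occupancy map."""
    h, w = depth.shape
    fx = (w / 2) / np.tan(np.radians(fov_deg / 2))  # focal length in pixels
    vs, us = np.nonzero(mask)
    z = depth[vs, us]                  # forward distance in the camera frame
    x = (us - w / 2) * z / fx          # lateral offset via the pinhole model
    # Discretize into BEV cells; the agent sits at the bottom-center of the grid.
    col = np.clip((x / cell_m + grid_size / 2).astype(int), 0, grid_size - 1)
    row = np.clip(grid_size - 1 - (z / cell_m).astype(int), 0, grid_size - 1)
    bev = np.zeros((grid_size, grid_size), dtype=np.uint8)
    bev[row, col] = 1
    return bev
```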
- Focused on multi-scale feature learning methods for improving semantic segmentation performance.
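One common instance of this idea is pyramid pooling (as in PSPNet): pool features at several scales, then upsample and concatenate them for the segmentation head. The PyTorch sketch below illustrates the general pattern rather than the specific models studied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Fuse global context at multiple scales with the original feature map."""
    def __init__(self, channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.bins = bins
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels // len(bins), kernel_size=1) for _ in bins)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [x]
        for bin_size, conv in zip(self.bins, self.convs):
            pooled = F.adaptive_avg_pool2d(x, bin_size)   # coarse context at this scale
            feats.append(F.interpolate(conv(pooled), size=(h, w),
                                       mode="bilinear", align_corners=False))
        return torch.cat(feats, dim=1)  # multi-scale features for the segmentation head
```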