I build AI systems that see, hear, and reason about the 3D world.
I am a PhD student in the Computer Science Department at Johns Hopkins University, advised by Prof. Alan Yuille. My research focuses on AI systems with 3D spatial reasoning and multimodal understanding, aligning 3D/4D knowledge with language models and fusing audio-visual modalities for both generation and understanding. Ultimately, I strive to develop AI systems that can truly understand and engage with the 4D physical world.
Before JHU, I obtained my M.S. from the University of Southern California, where I worked with Prof. Laurent Itti. Prior to USC, I received my B.S. in Statistics from Renmin University of China, supervised by Prof. Hanfang Yang. I have also conducted research internships with the GenAI group at AMD and at Samsung R&D Institute.
- Jan 2026: XModBench accepted at ICLR 2026. Code release coming soon.
- Oct 2025: Heading to ICCV 2025 🏖️ to present at the GEN4AVM Workshop.
- Sep 2025: SpatialReasoner accepted at NeurIPS 2025.
- Jun 2025: Invited talk at BEAM Workshop, CVPR 2025. (slides)
- Feb 2025: Spatial457 accepted at CVPR 2025 as a highlight paper.
- Jan 2025: One paper accepted at ICLR 2025.
Full list on Google Scholar
- Built a video generation diffusion model conditioned on audio and image inputs for dynamic motion synthesis.
- Evaluated temporal alignment between given audio and the generated video sequences.
- Designed and implemented a human-centric computer vision system that decomposes visual reasoning into shape, color, and texture components.
- Explored knowledge interaction between humans and interpretable CV systems using graph neural networks.
- Proposed a method combining language hints with object template matching for human-guided reinforcement learning, enhancing learning efficiency.
- Developed a solution for the ALFRED benchmark integrating instance segmentation and depth estimation to ground object positions on a bird's-eye-view map and generate language-guided navigation paths.
- Worked on multi-scale feature learning methods to improve semantic segmentation performance.