Junhao Cai
M.S., Shenzhen University (2025-Present)

I am a Master's student in the College of Mechatronics and Control Engineering at Shenzhen University. My research interests include text-to-3D generation, diffusion models, hypergraph neural networks (HGNNs), and LoRA fine-tuning. I focus on developing parameter-efficient fine-tuning strategies and hypergraph-based architectures to advance multimodal learning and high-fidelity 3D content generation.
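The core idea behind LoRA-style parameter-efficient fine-tuning can be shown with a minimal sketch (all dimensions, names, and initializations below are illustrative, not taken from any particular project of mine): a frozen weight matrix W is augmented with a trainable low-rank update BA, so only r·(d_in + d_out) parameters are trained instead of d_out·d_in.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 8, 8, 2               # r << d is the low-rank bottleneck
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection (zero init: adapter starts as a no-op)

def lora_forward(x, alpha=1.0):
    """y = W x + alpha * B A x; only A and B receive gradients during fine-tuning."""
    return W @ x + alpha * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted model initially matches the frozen one.
assert np.allclose(lora_forward(x), W @ x)
```

Here only 32 adapter parameters are trainable versus 64 in the full weight matrix; at realistic dimensions (e.g. d = 4096, r = 8) the saving is several orders of magnitude.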

I have been recognized with several distinctions, including being named an 🏆 Outstanding Graduate of Shenzhen University (2025, awarded to 333 of 5,208 graduates) and receiving the 🏆 Bachelor of Honor from the College of Mechatronics and Control Engineering (2025).


Education
  • Shenzhen University
    College of Mechatronics and Control Engineering
    M.S. Student
    Sep. 2025 - present
  • Shenzhen University
    College of Mechatronics and Control Engineering
    B.S. Student
    Sep. 2021 - Jul. 2025
Honors & Awards
  • 🏆 Outstanding Graduate, Shenzhen University
    2025
  • 🏆 Bachelor of Honor, College of Mechatronics and Control Engineering
    2025
  • 🥈 Second Prize, 15th Blue Bridge Cup Guangdong Regional Competition (Embedded Design & Development, University Group)
    2024
  • 🎖️ Outstanding Communist Youth League Member
    2023
  • 🥇 First Prize, Public Welfare Star (Academic Year 2023-2024)
    2024
  • 🏅 National Encouragement Scholarship (Academic Year 2023-2024)
    2024
  • 🥉 Third Prize, Academic Star (Academic Year 2023-2024)
    2024
  • 🏅 National Encouragement Scholarship (Academic Year 2022-2023)
    2023
  • 🥈 Second Prize, Academic Star (Academic Year 2022-2023)
    2023
  • 🥈 Second Prize, Academic Star (Academic Year 2021-2022)
    2022
Academic Service
  • Reviewer, International Conference on Computer Vision (ICCV) 2025
  • Reviewer, AAAI Conference on Artificial Intelligence (AAAI) 2026
  • Reviewer, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
News
2026
🎉 One paper has been accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR Findings) 2026!
Feb 23
2025
🎉 One paper has been accepted to CCHI 2026!
Nov 17
🔍 One paper is under review at the Conference on Computer Vision and Pattern Recognition (CVPR) 2026.
Nov 14
🔍 One paper is under review at the International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026).
Sep 17
🎉 One paper has been accepted to MICCAI 2025!
May 13
Selected Publications
Beyond Viewpoint Generalization: What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?

Boyang Cai, Qiwei Liang, Jiawei Li, Shihang Weng, Zhaoxin Zhang, Tao Lin, Xiangyu Chen, Wenjie Zhang, Jiaqi Mao, Weisheng Xu, Bin Yang, Jiaming Liang, Junhao Cai, Renjing Xu

arXiv preprint 2026

Does multi-view demonstration truly improve robot manipulation, or merely enhance cross-view robustness? This paper presents a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that, under both fixed and randomized backgrounds, multi-view demonstrations consistently improve single-view policy success and generalization. Motivated by the importance of multi-view data and its scarcity in large-scale robotic datasets, the paper further proposes RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs, and shows that the generated data consistently improves downstream policies in both simulation and real-world environments.

ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph

Junhao Cai*, Deyu Zeng*, Junhao Pang, Lini Li, Zongze Wu, Xiaopin Zhong (* equal contribution)

IEEE Conference on Computer Vision and Pattern Recognition (CVPR Findings) 2026

Current text-to-3D generation methods excel in natural scenes but struggle with industrial applications due to two critical limitations: domain adaptation challenges, where conventional LoRA fusion causes knowledge interference across categories, and geometric reasoning deficiencies, where pairwise consistency constraints fail to capture higher-order structural dependencies essential for precision manufacturing. We propose a novel framework named ForgeDreamer that addresses both challenges through two key innovations. First, we introduce a Multi-Expert LoRA Ensemble mechanism that consolidates multiple category-specific LoRA models into a unified representation, achieving superior cross-category generalization while eliminating knowledge interference. Second, building on enhanced semantic understanding, we develop a Cross-View Hypergraph Geometric Enhancement approach that captures structural dependencies spanning multiple viewpoints simultaneously. These components work synergistically: improved semantic understanding enables more effective geometric reasoning, while hypergraph modeling ensures manufacturing-level consistency. Extensive experiments on a custom industrial dataset demonstrate superior semantic generalization and enhanced geometric fidelity compared to state-of-the-art approaches. Our code and data are provided in the supplementary material attached in the appendix for review purposes.
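ForgeDreamer's actual Multi-Expert LoRA Ensemble mechanism is not reproduced here; purely as a point of reference, a naive baseline for consolidating category-specific adapters is a gated weighted merge of their low-rank weight deltas. All names, shapes, and the scalar gating below are hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n_experts = 16, 4, 3

# Hypothetical category-specific LoRA experts; each pair (B_i, A_i) defines a delta B_i A_i.
experts = [(rng.normal(size=(d, r)), rng.normal(size=(r, d))) for _ in range(n_experts)]

def merge_lora_deltas(experts, weights):
    """Naive baseline: mix the full-rank deltas B_i A_i with scalar gates."""
    deltas = [B @ A for B, A in experts]
    return sum(w * D for w, D in zip(weights, deltas))

gates = np.array([0.5, 0.3, 0.2])  # e.g. produced by a softmax router over categories
delta = merge_lora_deltas(experts, gates)
assert delta.shape == (d, d)
```

This kind of flat averaging is exactly the setting where cross-category knowledge interference arises, which is the failure mode the paper's ensemble mechanism is designed to avoid.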

Cross-Modal Brain Graph Transformer via Function-Structure Connectivity Network for Brain Disease Diagnosis

Jingxi Feng, Heming Xu, Junhao Cai*, Yujie Chang, Dong Zhang, Shaoyi Du, Juan Wang

Medical Image Computing and Computer Assisted Intervention (MICCAI) 2025

Multi-modal brain networks represent the complex connectivity between different brain regions from both functional and structural perspectives, which is of great significance for brain disease diagnosis. However, existing methods are limited to information fusion in the feature dimension, failing to fully exploit the complementary information between functional and structural connectivity networks. To address these issues, this paper proposes a cross-modal brain graph transformer (CBGT) method for brain disease diagnosis, which also provides an in-depth analysis of coupled function-structure connectivity networks. Specifically, CBGT consists of two main modules. The cross-modal Transformer module enhances the attention mechanism by utilizing structural connectivity features extracted through machine learning methods, capturing long-range dependencies in the cross-modal brain network. The cross-modal top-K pooling module combines information from both functional and structural connectivity networks to select significant regions of interest (ROIs) during the reconstruction of the pooled graph, aiming to retain as much effective information as possible. Experiments conducted on the ABIDE and ADNI datasets demonstrate that the proposed method outperforms state-of-the-art approaches. Interpretation analysis reveals that the proposed method can identify multi-modal biomarkers associated with brain diseases.
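As a rough sketch of what cross-modal top-K node selection can look like (CBGT's actual scoring is learned; the norm-based score, shapes, and variable names here are simplifying assumptions): each ROI receives a score combining both modalities, and only the k highest-scoring ROIs are kept in the pooled graph.

```python
import numpy as np

rng = np.random.default_rng(2)
n_roi, d = 10, 6
h_func   = rng.normal(size=(n_roi, d))  # functional node features, one row per ROI
h_struct = rng.normal(size=(n_roi, d))  # structural node features, aligned by ROI

def cross_modal_topk(h_f, h_s, k):
    """Score each ROI from both modalities, keep the k highest-scoring nodes."""
    score = np.linalg.norm(h_f, axis=1) + np.linalg.norm(h_s, axis=1)
    keep = np.argsort(score)[-k:]       # indices of the k largest scores
    return keep, h_f[keep], h_s[keep]

keep, hf_k, hs_k = cross_modal_topk(h_func, h_struct, k=4)
assert hf_k.shape == (4, d) and hs_k.shape == (4, d)
```

Because both modalities contribute to the score, an ROI that is salient in only one connectivity network can still survive the pooling step, which is the intuition behind combining functional and structural information before node selection.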
