Academic Homepage

Junhao Cai (蔡俊豪)

Generative AI research focused on text-to-3D, multimodal learning, and efficient fine-tuning.

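Master's student, Shenzhen University (2025-present)
Research Interests
Text-to-3D generation, diffusion models, multimodal learning, hypergraph neural networks, LoRA fine-tuning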

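I am currently a master's student at the College of Mechatronics and Control Engineering, Shenzhen University. My research interests include text-to-3D generation, diffusion models, hypergraph neural networks (HGNN), and LoRA fine-tuning. I focus on parameter-efficient fine-tuning strategies and hypergraph-based network architectures to advance multimodal learning and high-fidelity 3D content generation.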

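Academically, I have received several honors, including 🏆 Outstanding Graduate of Shenzhen University (2025; awarded to only 333 of 5,208 graduates) and 🏆 Honors Bachelor's Degree, College of Mechatronics and Control Engineering (2025).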

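Education
  • Shenzhen University
    College of Mechatronics and Control Engineering
    Master's student
    September 2025 - present
  • Shenzhen University
    College of Mechatronics and Control Engineering
    Bachelor's degree
    September 2021 - July 2025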
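Academic Service
  • Reviewer, International Conference on Computer Vision (ICCV 2025)
  • Reviewer, European Conference on Computer Vision (ECCV 2026)
  • Reviewer, AAAI Conference on Artificial Intelligence (AAAI 2026)
  • Reviewer, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)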
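Honors and Awards
  • 🏆 Graduate Academic Scholarship, Special Prize
    2026
  • 🏆 Outstanding Graduate of Shenzhen University
    2025
  • 🏆 Honors Bachelor's Degree, College of Mechatronics and Control Engineering
    2025
  • 🥈 Second Prize, Guangdong Division, 15th Lanqiao Cup (Embedded Design and Development, University Group)
    2024
  • 🎖️ Outstanding Communist Youth League Member
    2023
  • 🥇 Public Service Star, First Prize (2023-2024 academic year)
    2024
  • 🏅 National Encouragement Scholarship (2023-2024 academic year)
    2024
  • 🥉 Academic Star, Third Prize (2023-2024 academic year)
    2024
  • 🏅 National Encouragement Scholarship (2022-2023 academic year)
    2023
  • 🥈 Academic Star, Second Prize (2022-2023 academic year)
    2023
  • 🥈 Academic Star, Second Prize (2021-2022 academic year)
    2022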
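Research News
2026
🎉 One paper accepted to the International Joint Conference on Artificial Intelligence (IJCAI 2026)!
04-30
🎉 One paper accepted to the Findings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)!
02-23
2025
🎉 One paper accepted to the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2025)!
05-13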
Selected Publications
Cross-Modal Dynamic Hypergraph Computation via Functional-Structural Brain Network for Brain Disease Diagnosis

Jingxi Feng, Heming Xu, Rundong Xue, Junhao Cai, Xudong Chen, Dong Zhang, Shaoyi Du

International Joint Conference on Artificial Intelligence (IJCAI) 2026

Cross-modal brain networks offer complementary functional and structural views for understanding inter-regional connectivity and diagnosing brain diseases, yet existing methods often underuse their joint topology and high-order associations. This paper proposes Cross-Modal Dynamic Hypergraph Computation (CDHGC), a framework that learns topology-guided high-order correlations in functional-structural brain networks. CDHGC first generates and dynamically optimizes topology-aware hypergraphs to reveal latent cross-modal relationships, then performs cross-modal hypergraph convolution with attention-based message passing to produce joint representations. Experiments on ADNI and ABIDE show that CDHGC surpasses state-of-the-art methods, while interpretability analysis identifies multimodal biomarkers related to brain disorders.

Beyond Viewpoint Generalization: What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?

Boyang Cai, Qiwei Liang, Jiawei Li, Shihang Weng, Zhaoxin Zhang, Tao Lin, Xiangyu Chen, Wenjie Zhang, Jiaqi Mao, Weisheng Xu, Bin Yang, Jiaming Liang, Junhao Cai, Renjing Xu

arXiv preprint 2026

Does multi-view demonstration truly improve robot manipulation, or merely enhance cross-view robustness? This paper presents a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that, under both fixed and randomized backgrounds, multi-view demonstrations consistently improve single-view policy success and generalization. Motivated by the importance of multi-view data and its scarcity in large-scale robotic datasets, the paper further proposes RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs, and shows that the generated data consistently improves downstream policies in both simulation and real-world environments.

ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph

Junhao Cai*, Deyu Zeng*, Junhao Pang, Lini Li, Zongze Wu, Xiaopin Zhong (* equal contribution)

IEEE Conference on Computer Vision and Pattern Recognition (CVPR Findings) 2026

Current text-to-3D generation methods excel in natural scenes but struggle with industrial applications due to two critical limitations: domain adaptation challenges, where conventional LoRA fusion causes knowledge interference across categories, and geometric reasoning deficiencies, where pairwise consistency constraints fail to capture higher-order structural dependencies essential for precision manufacturing. We propose a novel framework named ForgeDreamer that addresses both challenges through two key innovations. First, we introduce a Multi-Expert LoRA Ensemble mechanism that consolidates multiple category-specific LoRA models into a unified representation, achieving superior cross-category generalization while eliminating knowledge interference. Second, building on enhanced semantic understanding, we develop a Cross-View Hypergraph Geometric Enhancement approach that captures structural dependencies spanning multiple viewpoints simultaneously. These components work synergistically: improved semantic understanding enables more effective geometric reasoning, while hypergraph modeling ensures manufacturing-level consistency. Extensive experiments on a custom industrial dataset demonstrate superior semantic generalization and enhanced geometric fidelity compared to state-of-the-art approaches. Our code and data are provided in the supplementary material attached in the appendix for review purposes.

Cross-Modal Brain Graph Transformer via Function-Structure Connectivity Network for Brain Disease Diagnosis

Jingxi Feng, Heming Xu, Junhao Cai, Yujie Chang, Dong Zhang, Shaoyi Du, Juan Wang

Medical Image Computing and Computer Assisted Intervention (MICCAI) 2025

Multi-modal brain networks represent the complex connectivity between different brain regions from both functional and structural perspectives, which is of great significance for brain disease diagnosis. However, existing methods are limited to information fusion in the feature dimension, failing to fully exploit the complementary information between functional and structural connectivity networks. To address these issues, this paper proposes a cross-modal brain graph transformer (CBGT) method for brain disease diagnosis, which also provides an in-depth analysis of coupled function-structure connectivity networks. Specifically, CBGT consists of two main modules: the cross-modal Transformer module enhances the attention mechanism by utilizing structural connectivity features extracted through machine learning methods, capturing long-range dependencies in the cross-modal brain network. The cross-modal topK pooling module combines information from both functional and structural connectivity networks to select significant regions of interest (ROIs) during the reconstruction of the pooled graph, aiming to retain as much effective information as possible. Experiments conducted on the ABIDE and ADNI datasets demonstrate that the proposed method outperforms state-of-the-art approaches. Interpretation analysis reveals that the proposed method can identify multi-modal biomarkers associated with brain diseases.
