Kairun Wen 「温凯润」

| CV | Email | Github |
| Google Scholar | HuggingFace |
| LinkedIn | RedNote | Youtube |

I am a second-year master's student in the School of Informatics at Xiamen University, advised by Prof. Xinghao Ding in the SmartDSP group. I am also collaborating with Dr. Zhiwen Fan and Prof. Atlas Wang from the VITA group at the University of Texas at Austin.

My current research primarily covers the following topics:

  • 3D Reconstruction: Few-Shot and Ultra-Efficient 3D Learning
  • 3D Perception: Semantic Understanding
  • Computational Photography: Image Restoration, Low-Light Enhancement
  • AI Agents (ongoing): Planning and Decision-Making, Reinforcement Learning

I am currently looking for Fall 2026 PhD positions and actively seeking collaborations! If you are interested in working together or have potential PhD opportunities, please feel free to reach out.
WeChat: kairun_wen    Email: wenkairun@gmail.com


  Recent News
  • [09/2024] Our NeurIPS'24 paper (LightGaussian) is selected as a spotlight presentation!

  Publications    (* denotes equal contribution)

InstantSplat: Sparse-view SfM-free Gaussian Splatting in Seconds
Kairun Wen*, Zhiwen Fan*, Wenyan Cong*, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, Yue Wang
Preprint

Project | Paper | Abstract | Bibtex | Video | HF Demo | Code [1300+⭐]

    While neural 3D reconstruction has advanced substantially, it typically requires densely captured multi-view data with carefully initialized poses (e.g., using COLMAP). However, this requirement limits its broader applicability, as Structure-from-Motion (SfM) is often unreliable in sparse-view scenarios where feature matches are limited, resulting in cumulative errors.
    In this paper, we introduce InstantSplat, a novel and lightning-fast neural reconstruction system that builds accurate 3D representations from as few as 2-3 images. InstantSplat adopts a self-supervised framework that bridges the gap between 2D images and 3D representations using Gaussian Bundle Adjustment (GauBA) and can be optimized in an end-to-end manner. InstantSplat integrates dense stereo priors and co-visibility relationships between frames to initialize pixel-aligned geometry, progressively expanding the scene while avoiding redundancy. Gaussian Bundle Adjustment is used to quickly adapt both the scene representation and the camera parameters by minimizing gradient-based photometric error. Overall, InstantSplat achieves large-scale 3D reconstruction in mere seconds by reducing the required number of input views, and is compatible with multiple 3D representations (3D-GS, Mip-Splatting). It achieves over a 20-fold acceleration in reconstruction and improves visual quality (SSIM) from 0.3755 to 0.7624 compared with COLMAP with 3D-GS.
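    The core of GauBA can be pictured as a single joint optimization loop over the scene and the cameras. Below is a minimal PyTorch-style sketch, assuming a differentiable render(gaussians, pose) function and tangent-space pose parameters; it illustrates the idea rather than the released implementation.

  import torch

  def gaussian_bundle_adjustment(gaussians, poses, images, render, steps=200):
      # Jointly refine Gaussian attributes and camera poses against photometric error.
      gaussians = {k: v.clone().requires_grad_(True) for k, v in gaussians.items()}
      poses = poses.clone().requires_grad_(True)  # e.g. per-view 6-DoF parameters
      opt = torch.optim.Adam(list(gaussians.values()) + [poses], lr=1e-3)
      for _ in range(steps):
          opt.zero_grad()
          loss = sum(torch.nn.functional.l1_loss(render(gaussians, poses[i]), img)
                     for i, img in enumerate(images))
          loss.backward()  # gradients flow to both the scene and the cameras
          opt.step()
      return gaussians, poses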

  @misc{fan2024instantsplat,
    title={InstantSplat: Sparse-view Gaussian Splatting in Seconds},
    author={Zhiwen Fan and Kairun Wen and Wenyan Cong and Kevin Wang and Jian Zhang and Xinghao Ding and Danfei Xu and Boris Ivanovic and Marco Pavone and Georgios Pavlakos and Zhangyang Wang and Yue Wang},
    year={2024},
    eprint={2403.20309},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
  }

JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
Yunlong Lin*, Zixu Lin*, Haoyu Chen*, Panwang Pan*, Chenxin Li, Sixiang Chen, Kairun Wen, Yeying Jin, Wenbo Li, Xinghao Ding
CVPR 2025

Project | Paper | Abstract | Bibtex | Code

    Vision-centric perception systems struggle with unpredictable and coupled weather degradations in the wild. Current solutions are often limited, as they either depend on specific degradation priors or suffer from significant domain gaps. To enable robust operation in real-world conditions, we propose JarvisIR, a VLM-powered agent that leverages the VLM as a controller to manage multiple expert restoration models. To further enhance system robustness, reduce hallucinations, and improve generalizability in real-world adverse weather, JarvisIR employs a novel two-stage framework consisting of supervised fine-tuning and human feedback alignment. Specifically, to address the lack of paired data in real-world scenarios, the human feedback alignment enables the VLM to be fine-tuned effectively on large-scale real-world data in an unsupervised manner. To support the training and evaluation of JarvisIR, we introduce CleanBench, a comprehensive dataset consisting of high-quality and large-scale instruction-response pairs, including 150K synthetic entries and 80K real entries. Extensive experiments demonstrate that JarvisIR exhibits superior decision-making and restoration capabilities. Compared with existing methods, it achieves a 50% improvement in the average of all perception metrics on CleanBench-Real.
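    The controller pattern itself is simple to sketch. The snippet below is a hypothetical illustration of a VLM routing expert restoration models; the expert names and the vlm.plan interface are placeholders, not the JarvisIR API.

  # Stand-in identity functions; in practice these would be restoration networks.
  EXPERTS = {
      "derain": lambda img: img,
      "dehaze": lambda img: img,
      "denoise": lambda img: img,
      "lowlight": lambda img: img,
  }

  def restore(image, vlm):
      # The VLM inspects the image and plans an ordered sequence of experts,
      # e.g. ["derain", "lowlight"] for a rainy night-time frame.
      plan = vlm.plan(image, tools=list(EXPERTS))
      for tool in plan:
          image = EXPERTS[tool](image)
      return image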

  @inproceedings{jarvisir2025,
    title={JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration},
    author={Lin, Yunlong and Lin, Zixu and Chen, Haoyu and Pan, Panwang and Li, Chenxin and Chen, Sixiang and Wen, Kairun and Jin, Yeying and Li, Wenbo and Ding, Xinghao},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2025}
  }

LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS
Zhiwen Fan*, Kevin Wang*, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang
NeurIPS 2024 (Spotlight)

Project | Paper | Abstract | Bibtex | Video | Code [600+⭐]

    Recent advancements in real-time neural rendering using point-based techniques have paved the way for the widespread adoption of 3D representations. However, foundational approaches like 3D Gaussian Splatting come with a substantial storage overhead, as the Structure-from-Motion (SfM) points grow to millions, often demanding gigabyte-level disk space for a single unbounded scene, posing significant scalability challenges and hindering splatting efficiency.
    To address this challenge, we introduce LightGaussian, a novel method designed to transform 3D Gaussians into a more efficient and compact format. Drawing inspiration from the concept of Network Pruning, LightGaussian identifies Gaussians that are insignificant in contributing to the scene reconstruction and adopts a pruning and recovery process, effectively reducing redundancy in Gaussian counts while preserving visual effects. Additionally, LightGaussian employs knowledge distillation and pseudo-view augmentation to transfer the spherical harmonics coefficients to a lower degree, distilling the knowledge into a more compact representation. LightGaussian also proposes a Gaussian Vector Quantization scheme based on each Gaussian's global significance to quantize all redundant attributes, yielding lower-bitwidth representations with minimal accuracy loss.
    In summary, LightGaussian achieves an average compression rate exceeding 15x while boosting the FPS from 144 to 237 on the representative 3D-GS framework, thereby supporting an efficient representation of complex scenes on the Mip-NeRF 360 and Tanks & Temples datasets. The proposed Gaussian pruning approach can also be adapted to other representations (e.g., Scaffold-GS), demonstrating its generalization capability.
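    The pruning step reduces to ranking Gaussians by a global significance score and keeping only the top fraction across every attribute tensor. A minimal sketch, assuming a precomputed per-Gaussian significance tensor (the exact scoring in LightGaussian differs in detail):

  import torch

  def prune_gaussians(attrs: dict, significance: torch.Tensor, keep_ratio=0.34):
      # Keep the top `keep_ratio` fraction of Gaussians by significance and
      # drop the rest from every attribute tensor (means, scales, SH, ...).
      k = int(keep_ratio * significance.numel())
      keep = torch.topk(significance, k).indices
      return {name: t[keep] for name, t in attrs.items()}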

  @misc{fan2023lightgaussian, 
    title={LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS}, 
    author={Zhiwen Fan and Kevin Wang and Kairun Wen and Zehao Zhu and Dejia Xu and Zhangyang Wang}, 
    year={2023},
    eprint={2311.17245},
    archivePrefix={arXiv},
    primaryClass={cs.CV} 
  }

Large Spatial Model: End-to-end Unposed Images to Semantic 3D
Zhiwen Fan*, Jian Zhang*, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, Boris Ivanovic, Marco Pavone, Yue Wang
NeurIPS 2024

Project | Paper | Abstract | Bibtex | Code | HF Demo

    Reconstructing and understanding 3D structures from a limited number of images is a classical problem in computer vision. Traditional approaches typically decompose this task into multiple subtasks, involving several stages of complex mappings between different data representations. For example, dense reconstruction using Structure-from-Motion (SfM) requires transforming images into key points, optimizing camera parameters, and estimating structures. Following this, accurate sparse reconstructions are necessary for further dense modeling, which is then input into task-specific neural networks. This multi-stage paradigm leads to significant processing times and engineering complexity.
    In this work, we introduce the Large Spatial Model (LSM), which directly processes unposed RGB images into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward pass and can synthesize versatile label maps at novel views through language interaction. Built on a general Transformer-based framework, LSM integrates global geometry via pixel-aligned point maps. To improve spatial attribute regression, we adopt local context aggregation with multi-scale fusion, enhancing the accuracy of fine local details. To address the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder parameterizes a set of semantic anisotropic Gaussians, allowing supervised end-to-end learning. Comprehensive experiments on various tasks demonstrate that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time.
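    Once the semantic feature field is distilled, an open-vocabulary query amounts to comparing per-Gaussian features with text embeddings. A hypothetical sketch of that query step, where model, embed_text, and the output keys are illustrative assumptions rather than the released interface:

  import torch

  def query_semantics(model, images, text_queries, embed_text):
      gaussians = model(images)            # one feed-forward pass, no poses needed
      feats = gaussians["semantic_feats"]  # (N, D) language-aligned Gaussian features
      text = embed_text(text_queries)      # (Q, D) CLIP-style text embeddings
      # Cosine similarity assigns each Gaussian a label from the open vocabulary.
      sims = torch.nn.functional.normalize(feats, dim=-1) @ \
             torch.nn.functional.normalize(text, dim=-1).T
      return gaussians, sims.argmax(dim=-1)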

  @misc{fan2024largespatialmodelendtoend,
    title={Large Spatial Model: End-to-end Unposed Images to Semantic 3D}, 
    author={Zhiwen Fan and Jian Zhang and Wenyan Cong and Peihao Wang and Renjie Li and Kairun Wen and Shijie Zhou and Achuta Kadambi and Zhangyang Wang and Danfei Xu and Boris Ivanovic and Marco Pavone and Yue Wang},
    year={2024},
    eprint={2410.18956},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.18956}, 
  }

Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors
Yunlong Lin*, Zhenqi Fu*, Kairun Wen, Tian Ye, Sixiang Chen, Ge Meng, Yingying Wang, Yue Huang, Xiaotong Tu, Xinghao Ding
AAAI 2024

Project | Paper | Abstract | Bibtex | Code

    Low-light image enhancement (LIE) aims at precisely and efficiently recovering an image degraded in poor illumination environments. Recent advanced LIE techniques rely on deep neural networks, which require large numbers of low/normal-light image pairs, network parameters, and computational resources. As a result, their practicality is limited. In this work, we devise a novel unsupervised LIE framework based on Diffusion Priors and LookUp Tables (DPLUT) to achieve efficient low-light image recovery. The proposed approach comprises two critical components: a light adjustment lookup table (LLUT) and a noise suppression lookup table (NLUT). LLUT is optimized with a set of unsupervised losses. It aims at predicting pixel-wise curve parameters for the dynamic range adjustment of a specific image. NLUT is designed to remove the noise amplified after brightening. As diffusion models are sensitive to noise, diffusion priors are introduced to achieve high-performance noise suppression. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in terms of visual quality and efficiency.
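    Pixel-wise curve adjustment of this kind can be sketched with the quadratic curve LE(x) = x + a·x·(1 - x) popularized by Zero-DCE; the actual LLUT parameterization in the paper may differ, so treat this as an assumption:

  import torch

  def adjust_light(img: torch.Tensor, alpha: torch.Tensor, iters: int = 8):
      # img and alpha are (B, 3, H, W) tensors; img in [0, 1], alpha in [-1, 1].
      # Each iteration brightens dark pixels while leaving values near 1 stable.
      for _ in range(iters):
          img = img + alpha * img * (1.0 - img)
      return img.clamp(0.0, 1.0)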

  @misc{lin2024unsupervisedlowlightimageenhancement,
    title={Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors}, 
    author={Yunlong Lin and Zhenqi Fu and Kairun Wen and Tian Ye and Sixiang Chen and Ge Meng and Yingying Wang and Yue Huang and Xiaotong Tu and Xinghao Ding},
    year={2024},
    eprint={2409.18899},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2409.18899}, 
  }

  Honors & Awards
National Scholarship (Top 0.2% Nationwide), 2022
National Scholarship (Top 0.2% Nationwide), 2020

  Reviewer Services
International Conference on Machine Learning (ICML), 2025
International Conference on Learning Representations (ICLR), 2025
Conference on Neural Information Processing Systems (NeurIPS), 2025, 2024

Website template from here and here