Kairun Wen 「温凯润」

| CV | Email | Github | Youtube |
| Google Scholar | HuggingFace |

I am a second-year master's student in the School of Informatics at Xiamen University, advised by Prof. Xinghao Ding in the SmartDSP group. I also collaborate with Dr. Zhiwen Fan and Prof. Atlas Wang from the VITA group at the University of Texas at Austin.

My current research focuses on 3D Perception, Computational Photography, and Multimodal AI, covering the following topics:

  • 3D Perception: 3D Reconstruction, NeRF, Gaussian Splatting
  • Computational Photography: Low-Light Enhancement, Image Restoration
  • Multimodal AI: Semantic Understanding

Email: kairunwen [AT] stu.xmu.edu.cn


  Recent News
  • [09/2024] Our NeurIPS'24 paper (LightGaussian) was selected as a spotlight presentation!

  Publications    ( * denotes equal contribution )

InstantSplat: Sparse-view SfM-free Gaussian Splatting in Seconds
Kairun Wen*, Zhiwen Fan*, Wenyan Cong*, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, Yue Wang
Preprint 2024

Project | Paper | Abstract | Bibtex | Code | Video | HF Demo

    While neural 3D reconstruction has advanced substantially, it typically requires densely captured multi-view data with carefully initialized poses (e.g., using COLMAP). However, this requirement limits its broader applicability, as Structure-from-Motion (SfM) is often unreliable in sparse-view scenarios where feature matches are limited, resulting in cumulative errors.
    In this paper, we introduce InstantSplat, a novel and lightning-fast neural reconstruction system that builds accurate 3D representations from as few as 2-3 images. InstantSplat adopts a self-supervised framework that bridges the gap between 2D images and 3D representations using Gaussian Bundle Adjustment (GauBA) and can be optimized in an end-to-end manner. InstantSplat integrates dense stereo priors and co-visibility relationships between frames to initialize pixel-aligned geometry, progressively expanding the scene while avoiding redundancy. Gaussian Bundle Adjustment then quickly adapts both the scene representation and the camera parameters by minimizing a gradient-based photometric error. Overall, InstantSplat achieves large-scale 3D reconstruction in mere seconds by reducing the required number of input views, and is compatible with multiple 3D representations (3D-GS, Mip-Splatting). It accelerates reconstruction by over 20 times and improves visual quality (SSIM) from 0.3755 to 0.7624 compared to COLMAP with 3D-GS.
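The core idea of GauBA — jointly refining scene and camera parameters against a photometric loss — can be illustrated with a deliberately tiny toy (a hypothetical 1D simplification, not the paper's renderer or optimizer): one Gaussian blob plays the scene, a horizontal shift plays the camera pose, and plain gradient descent minimizes the photometric error.

```python
import numpy as np

def render(mu, amp, shift, xs):
    """Render a 1D 'image': one Gaussian blob seen by a camera shifted by `shift`."""
    return amp * np.exp(-0.5 * (xs - (mu - shift)) ** 2)

xs = np.linspace(-5, 5, 200)
target = render(mu=1.0, amp=2.0, shift=0.5, xs=xs)   # observed image

def loss(m, a, s):
    """Photometric (mean-squared) error against the observation."""
    return np.mean((render(m, a, s, xs) - target) ** 2)

mu, amp, shift = 0.0, 1.0, 0.0   # poor initialization for both scene and pose
lr, eps = 0.05, 1e-5
for _ in range(2000):
    # central-difference gradients; a real system would backpropagate
    # through a differentiable rasterizer instead
    g_mu = (loss(mu + eps, amp, shift) - loss(mu - eps, amp, shift)) / (2 * eps)
    g_amp = (loss(mu, amp + eps, shift) - loss(mu, amp - eps, shift)) / (2 * eps)
    g_shift = (loss(mu, amp, shift + eps) - loss(mu, amp, shift - eps)) / (2 * eps)
    mu, amp, shift = mu - lr * g_mu, amp - lr * g_amp, shift - lr * g_shift

# only the difference (mu - shift) is identifiable in this toy, mirroring the
# scene/pose gauge ambiguity that bundle adjustment tolerates
print(loss(mu, amp, shift))
```

In the paper the same principle is applied to full Gaussian attribute sets and camera extrinsics, with analytic gradients from the rasterizer rather than finite differences.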

  @misc{fan2024instantsplat,
    title={InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds},
    author={Zhiwen Fan and Wenyan Cong and Kairun Wen and Kevin Wang and Jian Zhang and Xinghao Ding and Danfei Xu and Boris Ivanovic and Marco Pavone and Georgios Pavlakos and Zhangyang Wang and Yue Wang},
    year={2024},
    eprint={2403.20309},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
  }

Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors
Yunlong Lin*, Zhenqi Fu*, Kairun Wen, Tian Ye, Sixiang Chen, Ge Meng, Yingying Wang, Yue Huang, Xiaotong Tu, Xinghao Ding
Preprint 2024

Project | Paper | Abstract | Bibtex | Code

    Low-light image enhancement (LIE) aims to precisely and efficiently recover an image degraded in a poor illumination environment. Recent advanced LIE techniques rely on deep neural networks, which require large numbers of low/normal-light image pairs, many network parameters, and substantial computational resources; as a result, their practicality is limited. In this work, we devise a novel unsupervised LIE framework based on Diffusion Priors and LookUp Tables (DPLUT) to achieve efficient low-light image recovery. The proposed approach comprises two critical components: a light-adjustment lookup table (LLUT) and a noise-suppression lookup table (NLUT). LLUT is optimized with a set of unsupervised losses and predicts pixel-wise curve parameters for the dynamic-range adjustment of a specific image. NLUT is designed to remove the noise that is amplified after brightening. As diffusion models are sensitive to noise, diffusion priors are introduced to achieve high-performance noise suppression. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in terms of visual quality and efficiency.
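A minimal sketch of the LUT mechanism (a hand-crafted toy curve, not DPLUT's learned LLUT or any of its unsupervised losses): a small 1D table maps input intensities in [0, 1] to brighter outputs, applied per pixel by linear interpolation — the reason LUT-based enhancement is so cheap at inference time.

```python
import numpy as np

n_bins = 33                                  # LUT resolution
grid = np.linspace(0.0, 1.0, n_bins)         # input intensity sample points
lut = grid ** 0.45                           # stand-in "learned" brightening curve

# synthetic low-light image: intensities concentrated near zero
low_light = np.random.default_rng(0).uniform(0.0, 0.3, size=(64, 64))

# pixelwise LUT application via linear interpolation between table entries
enhanced = np.interp(low_light, grid, lut)

print(low_light.mean(), enhanced.mean())
```

A learned LLUT would replace the fixed gamma-like curve with table entries optimized per image (or per pixel, via predicted curve parameters), but the application step stays this one interpolation.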

  @misc{lin2024unsupervisedlowlightimageenhancement,
    title={Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors}, 
    author={Yunlong Lin and Zhenqi Fu and Kairun Wen and Tian Ye and Sixiang Chen and Ge Meng and Yingying Wang and Yue Huang and Xiaotong Tu and Xinghao Ding},
    year={2024},
    eprint={2409.18899},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2409.18899}, 
  }

LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS
Zhiwen Fan*, Kevin Wang*, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang
NeurIPS 2024 (Spotlight)

Project | Paper | Abstract | Bibtex | Code | Video

    Recent advancements in real-time neural rendering using point-based techniques have paved the way for the widespread adoption of 3D representations. However, foundational approaches like 3D Gaussian Splatting come with a substantial storage overhead, as the Structure-from-Motion (SfM) points grow into millions of Gaussians, often demanding gigabytes of disk space for a single unbounded scene; this poses significant scalability challenges and hinders splatting efficiency.
    To address this challenge, we introduce LightGaussian, a novel method designed to transform 3D Gaussians into a more efficient and compact format. Drawing inspiration from network pruning, LightGaussian identifies Gaussians that contribute little to the scene reconstruction and adopts a pruning-and-recovery process, effectively reducing redundancy in the Gaussian count while preserving visual quality. Additionally, LightGaussian employs knowledge distillation and pseudo-view augmentation to transfer the spherical harmonics coefficients to a lower degree, allowing the knowledge to be carried by a more compact representation. LightGaussian also proposes Gaussian Vector Quantization, based on each Gaussian's global significance, to quantize all redundant attributes, resulting in lower-bitwidth representations with minimal accuracy loss.
    In summary, LightGaussian achieves an average compression rate exceeding 15x while boosting the FPS from 144 to 237 on the representative 3D-GS framework, thereby supporting efficient representation of complex scenes on the Mip-NeRF 360 and Tanks & Temples datasets. The proposed Gaussian pruning approach can also be adapted to other representations (e.g., Scaffold-GS), demonstrating its generalization capability.
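The prune-then-quantize recipe can be sketched as follows. The significance score here is a crude stand-in (opacity times scale volume), not the paper's global-significance formula, and the 8-bit step is plain uniform quantization rather than the proposed vector quantization — the point is only the pipeline shape: score, drop the tail, compress what survives.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
opacity = rng.uniform(0.0, 1.0, n)
scales = rng.uniform(0.01, 0.5, (n, 3))          # per-axis Gaussian extents
sh = rng.normal(0.0, 1.0, (n, 12))               # stand-in SH coefficients

# crude contribution proxy: opaque, large Gaussians matter more
significance = opacity * scales.prod(axis=1)
keep = significance >= np.quantile(significance, 0.66)   # prune bottom 66%

# uniform 8-bit quantization of the surviving SH attributes
kept_sh = sh[keep]
lo, hi = kept_sh.min(), kept_sh.max()
codes = np.round((kept_sh - lo) / (hi - lo) * 255).astype(np.uint8)
dequant = codes.astype(np.float32) / 255 * (hi - lo) + lo

print(keep.sum(), np.abs(dequant - kept_sh).max())
```

Storage drops from 10,000 float64 rows to ~3,400 uint8 rows here; the paper's codebook-based vector quantization and recovery fine-tuning push the ratio further while guarding visual quality.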

  @misc{fan2023lightgaussian, 
    title={LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS}, 
    author={Zhiwen Fan and Kevin Wang and Kairun Wen and Zehao Zhu and Dejia Xu and Zhangyang Wang}, 
    year={2023},
    eprint={2311.17245},
    archivePrefix={arXiv},
    primaryClass={cs.CV} 
  }

Large Spatial Model: End-to-end Unposed Images to Semantic 3D
Zhiwen Fan*, Jian Zhang*, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, Boris Ivanovic, Marco Pavone, Yue Wang
NeurIPS 2024

Project | Paper | Abstract | Bibtex | Code | HF Demo

    Reconstructing and understanding 3D structures from a limited number of images is a classical problem in computer vision. Traditional approaches typically decompose this task into multiple subtasks, involving several stages of complex mappings between different data representations. For example, dense reconstruction using Structure-from-Motion (SfM) requires transforming images into key points, optimizing camera parameters, and estimating structures. Following this, accurate sparse reconstructions are necessary for further dense modeling, which is then input into task-specific neural networks. This multi-stage paradigm leads to significant processing times and engineering complexity.
    In this work, we introduce the Large Spatial Model (LSM), which directly processes unposed RGB images into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward pass and can synthesize versatile label maps by interacting through language at novel views. Built on a general Transformer-based framework, LSM integrates global geometry via pixel-aligned point maps. To improve spatial attribute regression, we adopt local context aggregation with multi-scale fusion, enhancing the accuracy of fine local details. To address the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder parameterizes a set of semantic anisotropic Gaussians, allowing supervised end-to-end learning. Comprehensive experiments on various tasks demonstrate that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time.
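The language-driven label-map step can be sketched with random stand-in embeddings (LSM actually renders per-pixel features from its 3D-consistent semantic field and embeds the queries with a pre-trained text encoder; every array below is a hypothetical placeholder): each pixel takes the label of the query whose embedding it best matches under cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                           # embedding dimension
queries = ["chair", "table", "floor"]            # open-vocabulary labels
text_emb = rng.normal(size=(len(queries), d))    # stand-in text embeddings
pix_emb = rng.normal(size=(32, 32, d))           # stand-in rendered pixel features

def normalize(x):
    """Unit-normalize along the last (embedding) axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# cosine similarity between every pixel feature and every query embedding
sim = normalize(pix_emb) @ normalize(text_emb).T   # shape (32, 32, 3)
label_map = sim.argmax(axis=-1)                    # per-pixel label index

print(label_map.shape)
```

Because the features live in a view-consistent 3D field, the same query yields consistent label maps at novel views — which is what lets a single feed-forward pass serve many downstream labeling tasks.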

  @misc{fan2024largespatialmodelendtoend,
    title={Large Spatial Model: End-to-end Unposed Images to Semantic 3D}, 
    author={Zhiwen Fan and Jian Zhang and Wenyan Cong and Peihao Wang and Renjie Li and Kairun Wen and Shijie Zhou and Achuta Kadambi and Zhangyang Wang and Danfei Xu and Boris Ivanovic and Marco Pavone and Yue Wang},
    year={2024},
    eprint={2410.18956},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.18956}, 
  }

  Honors & Awards
National Scholarship (Top 0.2% Nationwide), 2022
National Scholarship (Top 0.2% Nationwide), 2020

  Reviewer Services
International Conference on Learning Representations (ICLR), 2024
Conference on Neural Information Processing Systems (NeurIPS), 2024





Website template from here and here