MINR: Implicit Neural Representations with Masked Image Modelling

Sua Lee     Joonhun Lee     Myungjoo Kang
Seoul National University


Self-supervised learning methods like masked autoencoders (MAE) have shown significant promise in learning robust feature representations, particularly in image reconstruction-based pretraining tasks. However, their performance often depends strongly on the masking strategies used during training and can degrade when applied to out-of-distribution data. To address these limitations, we introduce the masked implicit neural representations (MINR) framework, which synergizes implicit neural representations with masked image modeling. MINR learns a continuous function to represent images, enabling more robust and generalizable reconstructions irrespective of masking strategies. Our experiments demonstrate that MINR outperforms MAE not only in in-domain scenarios but also in out-of-distribution settings, while reducing model complexity. The versatility of MINR extends to various self-supervised learning applications, confirming its utility as a robust and efficient alternative to existing frameworks.

Main Idea

The masked autoencoder (MAE) has often been highlighted for its versatility and success across various tasks. However, a notable limitation of MAE, evidenced by several studies, is its dependency on masking strategies such as mask size and area. This sensitivity arises because MAE fills each masked patch not only from adjacent patches but also from explicit information drawn from all visible patches, so the quality of the reconstruction hinges on which patches remain visible.
A schematic illustration of MINR
We introduce the masked implicit neural representations (MINR) framework, which combines implicit neural representations (INRs) with masked image modelling to address the limitations of MAE. The advantages of MINR include: i) leveraging INRs to learn a continuous function that is less affected by variations in the information available from visible patches, yielding performance improvements in both in-domain and out-of-distribution settings; ii) considerably fewer parameters, alleviating the reliance on heavy pretrained models; and iii) learning a continuous function rather than discrete representations, which provides greater flexibility in creating embeddings for various downstream tasks.
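The core mechanism can be illustrated with a minimal sketch: fit a small coordinate MLP (an INR) to only the visible pixels of a masked image, then query the same continuous function at the masked coordinates. Everything here (the toy image, network sizes, learning rate) is an illustrative assumption, not the architecture or training setup of the MINR paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 16x16 grayscale "image": a smooth gradient, values in [0, 1].
H = W = 16
ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
image = 0.25 * (xs + ys) + 0.5
coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (H*W, 2) pixel coordinates
values = image.ravel()[:, None]                       # (H*W, 1) pixel intensities

# Random mask: hide ~75% of pixels, train only on the visible ~25%.
visible = rng.random(H * W) > 0.75
Xv, Yv = coords[visible], values[visible]

# Two-layer MLP (2 -> 64 -> 1): the implicit neural representation.
D = 64
W1 = rng.normal(0.0, 1.0, (2, D)); b1 = np.zeros(D)
W2 = rng.normal(0.0, 0.1, (D, 1)); b2 = np.zeros(1)
lr = 0.05

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

# Full-batch gradient descent on the visible pixels only
# (gradients of half-MSE, derived by hand for this tiny network).
for _ in range(3000):
    h, pred = forward(Xv)
    err = pred - Yv
    gW2 = h.T @ err / len(Xv); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)   # backprop through tanh
    gW1 = Xv.T @ dh / len(Xv); gb1 = dh.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# Because the INR is a continuous function of coordinates, the same weights
# can be queried at every pixel -- including the masked ones.
_, full_pred = forward(coords)
mask_mse = float(np.mean((full_pred[~visible] - values[~visible]) ** 2))
```

The point of the sketch is the query step at the end: the reconstruction at masked locations comes from evaluating one continuous function, not from copying explicit features of particular visible patches, which is why the approach is less tied to a specific masking pattern.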


Qualitative results of mask reconstruction