Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?

CVPR 2026

Peter Yongho Kim1*, Juhyeon Park1*, Jungwoo Park1,
Jubin Choi1, Jungwoo Seo1, Jiook Cha1, Taesup Moon1†
1Seoul National University
*Equal contribution, †Corresponding author

Tokenizing fMRI Volumes via Off-the-shelf Autoencoders

Main Figure

We introduce TABLeT, a model that processes fMRI volumes tokenized with off-the-shelf 2D autoencoders.
Even without any medical-domain fine-tuning, the off-the-shelf autoencoder effectively tokenizes fMRI volumes.

Abstract

Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signals. Prior voxel-based models, although demonstrating excellent performance and interpretability, are constrained by prohibitive memory demands and can thus capture only limited temporal windows.

To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder with limited VRAM.

Across large-scale benchmarks including the UK-Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models in multiple tasks, while demonstrating substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method given the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves the model's performance for various downstream tasks. Our findings suggest a promising approach for scalable and interpretable spatiotemporal modeling of brain activity.

Memory Bottleneck in Voxel-based fMRI Modeling

Existing approaches to fMRI analysis fall into two categories, each with significant limitations. ROI-based methods parcellate the brain into regions of interest and compute functional connectivity matrices. While computationally efficient, they may lose fine-grained spatial information and their performance is dependent on the choice of ROI atlas.

Voxel-based methods (e.g., TFF, SwiFT) process raw 4D fMRI volumes directly, preserving spatial and temporal information. However, the massive scale of fMRI data imposes prohibitive memory demands, severely restricting the temporal "context length" that the model can process. This limits their ability to capture long-range temporal dynamics that unfold over tens of seconds.

TABLeT solves this issue by tokenizing fMRI volumes into a compact set of only 27 tokens (or even fewer!) per frame using an off-the-shelf 2D image autoencoder. This lets a simple Transformer process extended sequences with limited VRAM, reducing memory consumption by over 7x relative to prior voxel-based methods.
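The arithmetic behind the savings is simple. A minimal sketch, using only the token counts stated above (27 tokens per frame, clips of T frames); the quadratic attention-memory argument is a general property of self-attention, not a measurement from the paper:

```python
def seq_len(frames: int, tokens_per_frame: int) -> int:
    """Total token sequence length fed to the Transformer."""
    return frames * tokens_per_frame

# With 27 tokens per frame, even a long T=256 clip is a modest sequence:
n_tablet = seq_len(256, 27)  # 6912 tokens
n_short = seq_len(50, 27)    # 1350 tokens for a T=50 clip

# Self-attention memory grows quadratically with sequence length, so
# shrinking the per-frame token count by a factor k cuts the attention
# map's memory footprint by roughly k**2.
```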

Tokenization with a 2D Natural Image Autoencoder

TABLeT uses the encoder from a frozen Deep Compression Autoencoder (DCAE), pre-trained on natural images, to tokenize each 3D fMRI volume. Each volume is sliced along three axes (sagittal, coronal, axial), and each 2D slice is independently tokenized by the DCAE encoder. The resulting latent representations from all three axes are then aggregated according to their spatial positions, producing just 27 compact tokens per fMRI frame.
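To make the three-axis aggregation concrete, here is a toy sketch. The average-pooling "encoder" below is a stand-in for the frozen DCAE encoder (whose latents are learned features, not averages), and the tiny 6x6x6 volume and per-token scalar are simplifications for illustration only:

```python
import numpy as np

def toy_encode_slice(sl: np.ndarray, g: int = 3) -> np.ndarray:
    """Stand-in for the frozen DCAE encoder: average-pool a 2D slice
    down to a (g, g) latent grid."""
    h, w = sl.shape
    return sl.reshape(g, h // g, g, w // g).mean(axis=(1, 3))

def tokenize_volume(vol: np.ndarray, g: int = 3) -> np.ndarray:
    """Slice a cubic 3D volume along its three axes (sagittal, coronal,
    axial), encode each 2D slice, and pool the latents by spatial
    position into a (g, g, g) grid -> g**3 = 27 tokens."""
    grid = np.zeros((g, g, g))
    for axis in range(3):
        for i in range(vol.shape[axis]):
            sl = np.take(vol, i, axis=axis)       # one 2D slice
            lat = toy_encode_slice(sl, g)         # its (g, g) latent
            bin_i = i * g // vol.shape[axis]      # spatial bin along `axis`
            idx = [slice(None)] * 3
            idx[axis] = bin_i
            grid[tuple(idx)] += lat               # accumulate at its position
    # Average over the 3 axes and the slices falling into each bin.
    n_slices_per_bin = vol.shape[0] // g
    return (grid / (3 * n_slices_per_bin)).reshape(-1)

tokens = tokenize_volume(np.ones((6, 6, 6)))  # 27 tokens per "frame"
```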

TABLeT architecture overview

These tokens are fed into a lightweight Transformer encoder, enabling long-range spatiotemporal modeling with small memory and computation overhead. We name this method TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer).
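As an illustration, the downstream sequence model can be as plain as a stock Transformer encoder over the flattened (frames x 27) token sequence. The widths, depths, and the short T=8 clip below are placeholder values for the sketch, not the paper's configuration:

```python
import torch
import torch.nn as nn

T, TOKENS_PER_FRAME, DIM = 8, 27, 64  # placeholder sizes (paper uses T up to 256)

layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# One batch element: T frames, each contributing 27 tokens of width DIM.
frames = torch.randn(1, T * TOKENS_PER_FRAME, DIM)
out = encoder(frames)  # contextualized tokens, same shape as the input
```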

TABLeT tokenization process

Main Experimental Results

Performance comparison on classification and regression tasks across three large-scale datasets. The best results are shown in bold and the second best are underlined. TABLeT remains strong against baselines despite using only 27 tokens per frame.

UKB & ADHD-200

Method          |  UKB Sex           |  UKB Age            |  ADHD Diagnosis
                |  ACC   AUC   F1    |  MSE   MAE   ρ      |  ACC   AUC   F1
XGBoost         |  84.1  0.916 0.830 |  0.698 0.686 0.553  |  62.3  0.650 0.555
BrainNetCNN     |  91.7  0.969 0.912 |  0.597 0.618 0.647  |  59.2  0.640 0.545
BNT             |  92.4  0.980 0.919 |  0.540 0.588 0.685  |  63.6  0.677 0.624
meanMLP         |  87.7  0.949 0.919 |  0.672 0.662 0.586  |  56.8  0.617 0.532
Brain-JEPA      |  86.8  0.943 0.862 |  0.688 0.669 0.574  |  –     –     –
TFF (T=20)      |  98.3  0.998 0.982 |  0.440 0.525 0.760  |  63.3  0.700 0.608
SwiFT (T=20)    |  97.4  0.998 0.972 |  0.366 0.480 0.800  |  63.3  0.693 0.623
SwiFT (T=50)    |  98.1  0.999 0.980 |  0.364 0.477 0.802  |  63.9  0.701 0.627
TABLeT (T=256)  |  97.7  0.998 0.976 |  0.340 0.466 0.814  |  65.8  0.729 0.630

Since the ADHD-200 dataset contains fMRI scans with varying repetition times (TRs) and fewer than 160 frames, the default Brain-JEPA model was not applicable.

HCP

Method          |  Sex               |  Age                |  Intelligence
                |  ACC   AUC   F1    |  MSE   MAE   ρ      |  MSE   MAE   ρ
XGBoost         |  82.2  0.890 0.837 |  0.859 0.769 0.296  |  0.908 0.779 0.292
BrainNetCNN     |  86.3  0.937 0.866 |  0.847 0.749 0.372  |  0.967 0.788 0.286
BNT             |  86.3  0.935 0.872 |  0.794 0.719 0.444  |  0.920 0.778 0.318
meanMLP         |  84.5  0.915 0.855 |  0.846 0.751 0.370  |  0.887 0.767 0.340
Brain-JEPA      |  73.9  0.809 0.761 |  0.814 0.746 0.369  |  0.959 0.799 0.171
TFF (T=20)      |  88.1  0.937 0.892 |  0.888 0.779 0.246  |  0.898 0.767 0.312
SwiFT (T=20)    |  93.1  0.978 0.937 |  0.776 0.719 0.450  |  0.940 0.782 0.297
SwiFT (T=50)    |  92.2  0.972 0.929 |  0.764 0.699 0.460  |  0.865 0.758 0.354
TABLeT (T=256)  |  93.8  0.987 0.943 |  0.773 0.705 0.473  |  0.835 0.741 0.392

Memory and Computational Efficiency

We compare the memory and computational efficiency of TABLeT and SwiFT on a single GPU. SwiFT runs out of memory beyond T=50. At T=50, TABLeT is 7.33x more memory-efficient and trains 3.80x faster per epoch. Under a similar memory budget (~30 GB), the temporal window extends nearly tenfold, from T=40 for SwiFT to T=384 for TABLeT.

Peak memory allocation comparison between TABLeT and SwiFT

(a) Peak memory allocation

Training time per epoch comparison between TABLeT and SwiFT

(b) Training time per epoch

Is it Okay to Use Off-the-shelf Autoencoders?

A key finding of our work is that an off-the-shelf 2D DCAE pre-trained on natural images performs comparably to, or even better than, a 3D DCAE trained directly on fMRI data, in terms of both reconstruction quality (information preservation) and downstream task performance.

This suggests that a 2D DCAE, pre-trained on massive, diverse natural image datasets, learns highly robust and general-purpose low-level feature extractors that generalize well to medical images without any domain-specific fine-tuning. That is why we chose to use the off-the-shelf 2D DCAE for tokenization in TABLeT, instead of training a custom 3D DCAE on fMRI data.

Reconstruction comparison between 3D DCAE and 2D DCAE