Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?

CVPR 2026

Peter Yongho Kim1*, Juhyeon Park1*, Jungwoo Park1,
Jubin Choi1, Jungwoo Seo1, Jiook Cha1, Taesup Moon1†
1Seoul National University
*Equal contribution, †Corresponding author

Tokenizing fMRI Volumes via Off-the-shelf Autoencoders

Main Figure

We introduce TABLeT, a model that processes fMRI volumes tokenized with off-the-shelf 2D autoencoders.
Even without any medical-domain fine-tuning, the off-the-shelf autoencoder effectively tokenizes fMRI volumes.

Abstract

Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signals. Prior voxel-based models, although demonstrating excellent performance and interpretability, are constrained by prohibitive memory demands and can thus capture only limited temporal windows.

To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder with limited VRAM.

Across large-scale benchmarks including the UK-Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models in multiple tasks, while demonstrating substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method given the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves the model's performance for various downstream tasks. Our findings suggest a promising approach for scalable and interpretable spatiotemporal modeling of brain activity.

Memory Bottleneck in Voxel-based fMRI Modeling

Existing approaches to fMRI analysis fall into two categories, each with significant limitations. ROI-based methods parcellate the brain into regions of interest and compute functional connectivity matrices. While computationally efficient, they may lose fine-grained spatial information and their performance is dependent on the choice of ROI atlas.

Voxel-based methods (e.g., TFF, SwiFT) process raw 4D fMRI volumes directly, preserving spatial and temporal information. However, the massive scale of fMRI data imposes prohibitive memory demands, severely restricting the temporal "context length" that the model can process. This limits their ability to capture long-range temporal dynamics that unfold over tens of seconds.

TABLeT solves this issue by tokenizing fMRI volumes into a compact set of only 27 tokens (or even fewer!) per frame using an off-the-shelf 2D image autoencoder. This lets a simple Transformer process extended sequences with limited VRAM, reducing memory consumption by over 7x relative to prior voxel-based methods.
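The arithmetic behind the savings is simple. A minimal sketch, using only the token counts stated above (27 tokens per frame, clips of T frames); the quadratic attention-memory argument is a general property of self-attention, not a measurement from the paper:

```python
def seq_len(frames: int, tokens_per_frame: int) -> int:
    """Total token sequence length fed to the Transformer."""
    return frames * tokens_per_frame

# With 27 tokens per frame, even a long T=256 clip is a modest sequence:
n_tablet = seq_len(256, 27)  # 6912 tokens
n_short = seq_len(50, 27)    # 1350 tokens for a T=50 clip

# Self-attention memory grows quadratically with sequence length, so
# shrinking the per-frame token count by a factor k cuts the attention
# map's memory footprint by roughly k**2.
```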

Tokenization with a 2D Natural Image Autoencoder

TABLeT uses the encoder from a frozen Deep Compression Autoencoder (DCAE), pre-trained on natural images, to tokenize each 3D fMRI volume. Each volume is sliced along three axes (sagittal, coronal, axial), and each 2D slice is independently tokenized by the DCAE encoder. The resulting latent representations from all three axes are then aggregated according to their spatial positions, producing just 27 compact tokens per fMRI frame.
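To make the three-axis aggregation concrete, here is a toy sketch. The average-pooling "encoder" below is a stand-in for the frozen DCAE encoder (whose latents are learned features, not averages), and the tiny 6x6x6 volume and per-token scalar are simplifications for illustration only:

```python
import numpy as np

def toy_encode_slice(sl: np.ndarray, g: int = 3) -> np.ndarray:
    """Stand-in for the frozen DCAE encoder: average-pool a 2D slice
    down to a (g, g) latent grid."""
    h, w = sl.shape
    return sl.reshape(g, h // g, g, w // g).mean(axis=(1, 3))

def tokenize_volume(vol: np.ndarray, g: int = 3) -> np.ndarray:
    """Slice a cubic 3D volume along its three axes (sagittal, coronal,
    axial), encode each 2D slice, and pool the latents by spatial
    position into a (g, g, g) grid -> g**3 = 27 tokens."""
    grid = np.zeros((g, g, g))
    for axis in range(3):
        for i in range(vol.shape[axis]):
            sl = np.take(vol, i, axis=axis)       # one 2D slice
            lat = toy_encode_slice(sl, g)         # its (g, g) latent
            bin_i = i * g // vol.shape[axis]      # spatial bin along `axis`
            idx = [slice(None)] * 3
            idx[axis] = bin_i
            grid[tuple(idx)] += lat               # accumulate at its position
    # Average over the 3 axes and the slices falling into each bin.
    n_slices_per_bin = vol.shape[0] // g
    return (grid / (3 * n_slices_per_bin)).reshape(-1)

tokens = tokenize_volume(np.ones((6, 6, 6)))  # 27 tokens per "frame"
```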

TABLeT architecture overview

These tokens are fed into a lightweight Transformer encoder, enabling long-range spatiotemporal modeling with small memory and computation overhead. We name this method TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer).
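As an illustration, the downstream sequence model can be as plain as a stock Transformer encoder over the flattened (frames x 27) token sequence. The widths, depths, and the short T=8 clip below are placeholder values for the sketch, not the paper's configuration:

```python
import torch
import torch.nn as nn

T, TOKENS_PER_FRAME, DIM = 8, 27, 64  # placeholder sizes (paper uses T up to 256)

layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# One batch element: T frames, each contributing 27 tokens of width DIM.
frames = torch.randn(1, T * TOKENS_PER_FRAME, DIM)
out = encoder(frames)  # contextualized tokens, same shape as the input
```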

TABLeT tokenization process

Main Experimental Results

Performance comparison on classification and regression tasks across three large-scale datasets. The best results are shown in bold and the second best are underlined. TABLeT remains strong against baselines despite using only 27 tokens per frame.

UKB & ADHD-200

Method          |  UKB Sex           |  UKB Age            |  ADHD Diagnosis
                |  ACC   AUC   F1    |  MSE   MAE   ρ      |  ACC   AUC   F1
XGBoost         |  84.1  0.916 0.830 |  0.698 0.686 0.553  |  62.3  0.650 0.555
BrainNetCNN     |  91.7  0.969 0.912 |  0.597 0.618 0.647  |  59.2  0.640 0.545
BNT             |  92.4  0.980 0.919 |  0.540 0.588 0.685  |  63.6  0.677 0.624
meanMLP         |  87.7  0.949 0.919 |  0.672 0.662 0.586  |  56.8  0.617 0.532
Brain-JEPA      |  86.8  0.943 0.862 |  0.688 0.669 0.574  |  –     –     –
TFF (T=20)      |  98.3  0.998 0.982 |  0.440 0.525 0.760  |  63.3  0.700 0.608
SwiFT (T=20)    |  97.4  0.998 0.972 |  0.366 0.480 0.800  |  63.3  0.693 0.623
SwiFT (T=50)    |  98.1  0.999 0.980 |  0.364 0.477 0.802  |  63.9  0.701 0.627
TABLeT (T=256)  |  97.7  0.998 0.976 |  0.340 0.466 0.814  |  65.8  0.729 0.630

Since the ADHD-200 dataset contains fMRI scans with varying repetition times (TRs) and fewer than 160 frames, the default Brain-JEPA model was not applicable.

HCP

Method          |  Sex               |  Age                |  Intelligence
                |  ACC   AUC   F1    |  MSE   MAE   ρ      |  MSE   MAE   ρ
XGBoost         |  82.2  0.890 0.837 |  0.859 0.769 0.296  |  0.908 0.779 0.292
BrainNetCNN     |  86.3  0.937 0.866 |  0.847 0.749 0.372  |  0.967 0.788 0.286
BNT             |  86.3  0.935 0.872 |  0.794 0.719 0.444  |  0.920 0.778 0.318
meanMLP         |  84.5  0.915 0.855 |  0.846 0.751 0.370  |  0.887 0.767 0.340
Brain-JEPA      |  73.9  0.809 0.761 |  0.814 0.746 0.369  |  0.959 0.799 0.171
TFF (T=20)      |  88.1  0.937 0.892 |  0.888 0.779 0.246  |  0.898 0.767 0.312
SwiFT (T=20)    |  93.1  0.978 0.937 |  0.776 0.719 0.450  |  0.940 0.782 0.297
SwiFT (T=50)    |  92.2  0.972 0.929 |  0.764 0.699 0.460  |  0.865 0.758 0.354
TABLeT (T=256)  |  93.8  0.987 0.943 |  0.773 0.705 0.473  |  0.835 0.741 0.392

Memory and Computational Efficiency

We compare the memory and computational efficiency of TABLeT and SwiFT on a single GPU. SwiFT runs out of memory beyond T=50. At T=50, TABLeT is 7.33x more memory-efficient and trains 3.80x faster per epoch. Under a similar memory budget (~30 GB), the temporal window extends nearly tenfold, from T=40 for SwiFT to T=384 for TABLeT.

Peak memory allocation comparison between TABLeT and SwiFT

(a) Peak memory allocation

Training time per epoch comparison between TABLeT and SwiFT

(b) Training time per epoch

Is it Okay to Use Off-the-shelf Autoencoders?

A key finding of our work is that an off-the-shelf 2D DCAE pre-trained on natural images performs comparably to, or even better than, a 3D DCAE trained directly on fMRI data, in terms of both reconstruction quality (information preservation) and downstream task performance.

This suggests that a 2D DCAE, pre-trained on massive, diverse natural image datasets, learns highly robust and general-purpose low-level feature extractors that generalize well to medical images without any domain-specific fine-tuning. That is why we chose to use the off-the-shelf 2D DCAE for tokenization in TABLeT, instead of training a custom 3D DCAE on fMRI data.

Reconstruction comparison between 3D DCAE and 2D DCAE