We introduce TABLeT, a model that processes fMRI volumes tokenized with off-the-shelf 2D autoencoders.
Even without medical fine-tuning, the off-the-shelf autoencoder effectively tokenizes fMRI volumes.
Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signal. Prior voxel-based models, despite their strong performance and interpretability, are constrained by prohibitive memory demands and can therefore capture only limited temporal windows.
To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder with limited VRAM.
Across large-scale benchmarks including the UK Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models on multiple tasks while achieving substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method on the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves performance on various downstream tasks. Our findings suggest a promising direction for scalable and interpretable spatiotemporal modeling of brain activity.
Existing approaches to fMRI analysis fall into two categories, each with significant limitations. ROI-based methods parcellate the brain into regions of interest and compute functional connectivity matrices. While computationally efficient, they may lose fine-grained spatial information and their performance is dependent on the choice of ROI atlas.
Voxel-based methods (e.g., TFF, SwiFT) process raw 4D fMRI volumes directly, preserving spatial and temporal information. However, the massive scale of fMRI data imposes prohibitive memory demands, severely restricting the temporal "context length" that the model can process. This limits their ability to capture long-range temporal dynamics that unfold over tens of seconds.
TABLeT addresses this issue by tokenizing each fMRI frame into a compact set of only 27 tokens (or even fewer) using an off-the-shelf 2D image autoencoder, enabling a simple Transformer to process extended sequences with limited VRAM and reducing memory consumption by more than 7x relative to prior voxel-based methods.
TABLeT uses the encoder from a frozen Deep Compression Autoencoder (DCAE), pre-trained on natural images, to tokenize each 3D fMRI volume. Each volume is sliced along three axes (sagittal, coronal, axial), and each 2D slice is independently tokenized by the DCAE encoder. The resulting latent representations from all three axes are then aggregated according to their spatial positions, producing just 27 compact tokens per fMRI frame.
These tokens are fed into a lightweight Transformer encoder, enabling long-range spatiotemporal modeling with small memory and computation overhead. We name this method TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer).
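The slice-and-aggregate tokenization step can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the real tokenizer uses the frozen DCAE encoder, whereas `encode_slice` here is a stand-in average-pooling function, and the volume size (96³), spatial downsampling factor (32), and latent channel dimension (8) are assumed purely for illustration.

```python
import numpy as np

V = 96          # assumed cubic volume side length
F = 32          # assumed DCAE spatial downsampling factor
G = V // F      # latent grid side per axis (3)
C = 8           # assumed latent channel dimension

def encode_slice(slice_2d):
    """Stand-in for the frozen 2D DCAE encoder: (V, V) -> (G, G, C).
    Here we simply average-pool each F x F patch."""
    pooled = slice_2d.reshape(G, F, G, F).mean(axis=(1, 3))   # (G, G)
    return np.repeat(pooled[..., None], C, axis=-1)           # (G, G, C)

def tokenize_volume(vol):
    """Slice the volume along the three anatomical axes, encode each 2D
    slice, and aggregate latents into a G x G x G grid of tokens."""
    tokens = np.zeros((G, G, G, C))
    for axis in range(3):                    # sagittal / coronal / axial
        for i in range(V):
            sl = np.take(vol, i, axis=axis)  # (V, V) slice
            lat = encode_slice(sl)           # (G, G, C)
            # place the latent at the slice's coarse position along `axis`;
            # dividing by F averages the F slices falling into each bin,
            # and the three axis views are summed at matching positions
            idx = [slice(None)] * 3
            idx[axis] = i // F
            tokens[tuple(idx)] += lat / F
    return tokens.reshape(-1, C)             # (G^3, C) = (27, C) tokens

vol = np.random.rand(V, V, V)
print(tokenize_volume(vol).shape)            # (27, 8)
```

Under these assumptions, each frame collapses to a 3×3×3 grid of 27 tokens regardless of the number of slices, which is what keeps the Transformer's input sequence short.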
Performance comparison on classification and regression tasks across three large-scale datasets. The best results are shown in bold and the second best are underlined. TABLeT performs strongly against baselines despite using only 27 tokens per frame.
| Method | UKB Sex (ACC) | UKB Sex (AUC) | UKB Sex (F1) | UKB Age (MSE) | UKB Age (MAE) | UKB Age (ρ) | ADHD (ACC) | ADHD (AUC) | ADHD (F1) |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 84.1 | 0.916 | 0.830 | 0.698 | 0.686 | 0.553 | 62.3 | 0.650 | 0.555 |
| BrainNetCNN | 91.7 | 0.969 | 0.912 | 0.597 | 0.618 | 0.647 | 59.2 | 0.640 | 0.545 |
| BNT | 92.4 | 0.980 | 0.919 | 0.540 | 0.588 | 0.685 | 63.6 | 0.677 | 0.624 |
| meanMLP | 87.7 | 0.949 | 0.919 | 0.672 | 0.662 | 0.586 | 56.8 | 0.617 | 0.532 |
| Brain-JEPA† | 86.8 | 0.943 | 0.862 | 0.688 | 0.669 | 0.574 | – | – | – |
| TFF (T=20) | 98.3 | 0.998 | 0.982 | 0.440 | 0.525 | 0.760 | 63.3 | 0.700 | 0.608 |
| SwiFT (T=20) | 97.4 | 0.998 | 0.972 | 0.366 | 0.480 | 0.800 | 63.3 | 0.693 | 0.623 |
| SwiFT (T=50) | 98.1 | 0.999 | 0.980 | 0.364 | 0.477 | 0.802 | 63.9 | 0.701 | 0.627 |
| TABLeT (T=256) | 97.7 | 0.998 | 0.976 | 0.340 | 0.466 | 0.814 | 65.8 | 0.729 | 0.630 |
†Since the ADHD-200 dataset contains fMRI data with varying repetition time (TR) values and fewer than 160 frames, the default Brain-JEPA model was not applicable.
Performance comparison on the HCP dataset.

| Method | Sex (ACC) | Sex (AUC) | Sex (F1) | Age (MSE) | Age (MAE) | Age (ρ) | Intelligence (MSE) | Intelligence (MAE) | Intelligence (ρ) |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 82.2 | 0.890 | 0.837 | 0.859 | 0.769 | 0.296 | 0.908 | 0.779 | 0.292 |
| BrainNetCNN | 86.3 | 0.937 | 0.866 | 0.847 | 0.749 | 0.372 | 0.967 | 0.788 | 0.286 |
| BNT | 86.3 | 0.935 | 0.872 | 0.794 | 0.719 | 0.444 | 0.920 | 0.778 | 0.318 |
| meanMLP | 84.5 | 0.915 | 0.855 | 0.846 | 0.751 | 0.370 | 0.887 | 0.767 | 0.340 |
| Brain-JEPA | 73.9 | 0.809 | 0.761 | 0.814 | 0.746 | 0.369 | 0.959 | 0.799 | 0.171 |
| TFF (T=20) | 88.1 | 0.937 | 0.892 | 0.888 | 0.779 | 0.246 | 0.898 | 0.767 | 0.312 |
| SwiFT (T=20) | 93.1 | 0.978 | 0.937 | 0.776 | 0.719 | 0.450 | 0.940 | 0.782 | 0.297 |
| SwiFT (T=50) | 92.2 | 0.972 | 0.929 | 0.764 | 0.699 | 0.460 | 0.865 | 0.758 | 0.354 |
| TABLeT (T=256) | 93.8 | 0.987 | 0.943 | 0.773 | 0.705 | 0.473 | 0.835 | 0.741 | 0.392 |
We compare the memory and computational efficiency of TABLeT and SwiFT on a single GPU. SwiFT can only run up to T=50 before running out of memory. At T=50, TABLeT is 7.33x more memory efficient and 3.80x faster in training time. With a similar memory budget (~30GB), the temporal window can be extended nearly tenfold: from T=40 for SwiFT to T=384 for TABLeT.
Figure: (a) peak memory allocation and (b) training time per epoch for TABLeT vs. SwiFT.
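As a rough illustration of why long temporal windows remain tractable, the Transformer's input lengths implied by 27 tokens per frame can be tabulated. This is a back-of-envelope sketch of sequence lengths only; the memory and runtime figures above are measured, not derived from this arithmetic.

```python
# TABLeT represents each fMRI frame with a fixed 27 tokens, so the
# Transformer input length grows only linearly in the temporal window T.
TOKENS_PER_FRAME = 27

for T in (20, 50, 256, 384):
    seq_len = TOKENS_PER_FRAME * T
    print(f"T={T:>3}: {seq_len} tokens per sequence")
```

Even at T=384, the sequence stays around ten thousand tokens, well within the reach of a standard Transformer encoder on a single GPU.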
A key finding of our work is that an off-the-shelf 2D DCAE pre-trained on natural images performs comparably to, or even better than, a 3D DCAE trained directly on fMRI data, in terms of both reconstruction quality (information preservation) and downstream task performance.
This suggests that a 2D DCAE, pre-trained on massive and diverse natural image datasets, learns robust, general-purpose low-level feature extractors that transfer to medical images without any domain-specific fine-tuning. This motivated our choice of the off-the-shelf 2D DCAE for tokenization in TABLeT over training a custom 3D DCAE on fMRI data.