Audio-JEPA is an adaptation of the Joint-Embedding Predictive Architecture (JEPA) for self-supervised audio representation learning. Built upon the I-JEPA paradigm, it uses a Vision Transformer (ViT) backbone to predict latent representations of masked spectrogram patches.