VL-JEPA: Why Predicting Embeddings Beats Generating Tokens for Vision-Language AI
VL-JEPA achieves a 50% parameter reduction and 2.85x faster decoding by predicting embeddings instead of generating tokens, making it a compelling alternative to autoregressive vision-language models.