
VL-JEPA

Imagine you watch a two-hour lecture at home and come away with a question. Whom do you ask? Do you dump two hours of video plus text into GPT?

Hello Data Points,

This is the problem VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) is built to solve.

Let's read the research paper to see how.


๐™Ž๐™˜๐™š๐™ฃ๐™–๐™ง๐™ž๐™ค: You are watching a live video feed of a person in a kitchen. ๐™”๐™ค๐™ช ๐™–๐™จ๐™  ๐™ฉ๐™๐™š ๐˜ผ๐™„: "What is the person doing right now?"


~~~ ๐—ง๐—ต๐—ฒ ๐—ง๐—ฟ๐—ฎ๐—ฑ๐—ถ๐˜๐—ถ๐—ผ๐—ป๐—ฎ๐—น ๐—š๐—ฒ๐—ป๐—ฒ๐—ฟ๐—ฎ๐˜๐—ถ๐˜ƒ๐—ฒ ๐—ฉ๐—Ÿ๐—  ๐—ฃ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€ ~~~

  1. Autoregressive: Generative models treat vision-language tasks as next-token prediction; they think in "words." The step-by-step thought process:

  2. Input Encoding: The model takes the video frame and your text query. It converts them into a long sequence of tokens (visual tokens + text tokens).

  3. The Sequential Bottleneck: To answer, the model must start generating text. It predicts the first word: "The".

  4. Self-Attention Loop: To get the second word, it re-reads the video, the query, and the word "The" it just wrote, then predicts "person." This repeats until the end-of-sequence token.

  5. End: It repeats this loop 10–20 times to finish the sentence: "The person is chopping onions."

The Hidden Struggle:
• During this time, the model spends a lot of "brain power" (compute) on grammar, punctuation, and style, none of which changes the actual fact that onions are being chopped.
• The model is bound by Causal Attention. It cannot "jump" to the answer because it must maintain the statistical probability of the entire token sequence.
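To make the sequential bottleneck concrete, here is a minimal sketch of greedy autoregressive decoding. The `model` interface, the token tensors, and `eos_id` are illustrative assumptions, not the API of any particular VLM library; the point is only that every new word requires another full forward pass over the entire prefix.

```python
import torch

def generate_answer(model, visual_tokens, query_tokens, eos_id, max_new_tokens=20):
    # visual_tokens, query_tokens: LongTensor of token ids, shape (1, length)
    sequence = torch.cat([visual_tokens, query_tokens], dim=1)
    generated = []
    for _ in range(max_new_tokens):                # one token per loop iteration
        logits = model(sequence)                   # causal attention over the whole prefix
        next_id = logits[:, -1, :].argmax(dim=-1)  # greedy pick of the next "word"
        generated.append(next_id.item())
        if next_id.item() == eos_id:               # stop once the sentence is finished
            break
        sequence = torch.cat([sequence, next_id.unsqueeze(0)], dim=1)
    return generated  # token ids for e.g. "The person is chopping onions."
```

Notice that the loop body re-runs the full model once per token; with a 10–20 token answer, that is 10–20 forward passes to state a single fact.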


~~~ ๐—ง๐—ต๐—ฒ ๐—ฉ๐—Ÿ-๐—๐—˜๐—ฃ๐—” ๐—ฃ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€ (๐—๐—ผ๐—ถ๐—ป๐˜ ๐—˜๐—บ๐—ฏ๐—ฒ๐—ฑ๐—ฑ๐—ถ๐—ป๐—ด ๐—ฃ๐—ฟ๐—ฒ๐—ฑ๐—ถ๐—ฐ๐˜๐—ถ๐—ผ๐—ป) ~~~

  1. VL-JEPA treats the task as semantic mapping; it thinks in "concepts." The step-by-step thought process:

  2. Vision Compression (X-Encoder): It uses a frozen V-JEPA 2 backbone to compress the video into a high-level "semantic summary" vector.

  3. Non-Autoregressive Prediction: Instead of starting a sentence, the Predictor (Llama-3 layers with Bi-directional Attention) looks at the video summary and your question simultaneously.

  4. Direct Latent Output: It outputs a single 1,536-dimensional vector (S_Y) that represents the idea of "chopping onions."

Semantic Match:
• Retrieval/Classification: If it's a multiple-choice task, it simply finds the closest pre-saved vector for "chopping" in its library. No text generation is needed.
• Generation: If you need a sentence, a tiny Y-Decoder "reads" that single vector once and expands it into: "The person is chopping onions."
• The key difference: VL-JEPA uses Bi-directional Attention in the predictor, allowing visual tokens and query tokens to attend to each other in a single forward pass. There is no sequential dependency.
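For contrast, here is a minimal sketch of the retrieval path in VL-JEPA-style inference. The frozen X-Encoder, the bidirectional predictor, and the 1,536-dimensional latent follow the description above, but the function names, signatures, and label library are illustrative assumptions rather than the paper's actual code.

```python
import torch
import torch.nn.functional as F

def answer_by_retrieval(x_encoder, predictor, video, query_tokens, label_vectors, labels):
    # 1. The frozen vision backbone compresses the video into a semantic summary.
    with torch.no_grad():
        visual_summary = x_encoder(video)                   # e.g. (1, num_patches, dim)

    # 2. One bidirectional forward pass: the video summary and the query attend
    #    to each other; the output is a single latent vector S_Y.
    s_y = predictor(visual_summary, query_tokens)           # e.g. shape (1, 1536)

    # 3. Retrieval: pick the pre-computed label embedding closest to S_Y.
    sims = F.cosine_similarity(s_y, label_vectors, dim=-1)  # label_vectors: (num_labels, 1536)
    return labels[sims.argmax().item()]                     # e.g. "chopping onions"
```

For free-form answers, the same S_Y would instead be handed once to the Y-Decoder, which expands it into a sentence; the latent prediction itself never loops.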
