VL-JEPA
Imagine you watch a 2-hour lecture at home and you have a question. Whom do you ask? Do you dump 2 hours of video plus your text question into GPT?
Hello Data Points,
This problem is solved by VL-JEPA (Vision-Language Joint Embedding Predictive Architecture).
Let's read the research paper to find out how.
Scenario: You are watching a live video feed of a person in a kitchen. You ask the AI: "What is the person doing right now?"
~~~ The Traditional Generative VLM Process ~~~
- Autoregressive: Generative models treat vision-language tasks as next-token prediction. They think in "words." Step-by-Step Thought Process:
- Input Encoding: The model takes the video frame and your text query and converts them into a long sequence of tokens (visual tokens + text tokens).
- The Sequential Bottleneck: To answer, the model must start generating text. It predicts the first word: "The".
- Self-Attention Loop: To get the second word, it re-reads the video, the query, and the word "The" it just wrote. It predicts: "person". It repeats this until the sentence is finished.
- End: It repeats this loop 10-20 times to finish the sentence: "The person is chopping onions." (A toy version of this loop is sketched below.)
The Hidden Struggle:
• During this time, the model is spending a lot of "brain power" (compute) on grammar, punctuation, and style, things that don't change the actual fact that onions are being chopped.
• The model is bound by Causal Attention. It cannot "jump" to the answer because it must maintain the statistical probability of the entire token sequence.
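To make the bottleneck concrete, here is a minimal PyTorch sketch of a generic autoregressive decode loop. The model, vocabulary size, token ids, and stop token are all hypothetical stand-ins, not the actual VLM; the point is only that every new word costs one more forward pass over an ever-growing sequence.

```python
# Toy sketch of the autoregressive bottleneck (illustrative, not the real VLM).
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, EOS_ID = 1000, 256, 2  # hypothetical sizes

class ToyCausalVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        # Causal mask: each position may only attend to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        h = self.blocks(x, mask=mask)
        return self.lm_head(h[:, -1])          # logits for the *next* token only

model = ToyCausalVLM().eval()
visual_tokens = torch.randint(0, VOCAB_SIZE, (1, 64))  # stand-in for encoded video frames
query_tokens = torch.randint(0, VOCAB_SIZE, (1, 12))   # stand-in for the question
sequence = torch.cat([visual_tokens, query_tokens], dim=1)

generated = []
with torch.no_grad():
    for _ in range(20):                        # roughly the 10-20 steps mentioned above
        next_id = model(sequence).argmax(dim=-1, keepdim=True)
        generated.append(next_id.item())
        sequence = torch.cat([sequence, next_id], dim=1)  # sequence grows every step
        if next_id.item() == EOS_ID:
            break

print(f"{len(generated)} sequential forward passes for one short answer")
```

With random weights the loop simply runs all 20 steps, but the shape of the cost is the same as in a real model: the whole (growing) sequence is re-processed once per generated token.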
~~~ The VL-JEPA Process (Joint Embedding Prediction) ~~~
- VL-JEPA treats the task as semantic mapping. It thinks in "concepts." Step-by-Step Thought Process:
- Vision Compression (X-Encoder): It uses a frozen V-JEPA 2 backbone to compress the video into a high-level "semantic summary" vector.
- Non-Autoregressive Prediction: Instead of starting a sentence, the Predictor (Llama-3 layers with Bi-directional Attention) looks at the video summary and your question simultaneously.
- Direct Latent Output: It outputs a single 1,536-dimensional vector (S_Y) that represents the idea of "chopping onions." (A minimal sketch of this single-pass prediction follows.)
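Below is a minimal sketch of what such a single-pass latent prediction could look like, with stand-in modules: a frozen linear projection plays the role of the V-JEPA 2 X-Encoder, and a small bidirectional transformer plays the role of the Llama-3-based predictor. Only the 1,536-dimensional output comes from the description above; every other dimension and module is illustrative.

```python
# Minimal sketch of the non-autoregressive VL-JEPA-style path (stand-in modules).
import torch
import torch.nn as nn

D_LATENT = 1536  # size of S_Y, per the text above

class ToyVLJEPA(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the frozen V-JEPA 2 backbone (X-Encoder).
        self.x_encoder = nn.Linear(1024, D_LATENT)
        for p in self.x_encoder.parameters():
            p.requires_grad = False            # frozen vision backbone
        # Stand-in for the predictor: bidirectional attention (no causal mask).
        layer = nn.TransformerEncoderLayer(D_LATENT, nhead=8, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, num_layers=2)
        self.query_embed = nn.Linear(768, D_LATENT)  # stand-in text-query embedding

    def forward(self, video_feats, query_feats):
        vis = self.x_encoder(video_feats)      # (B, N_vis, 1536) semantic summary tokens
        txt = self.query_embed(query_feats)    # (B, N_txt, 1536)
        joint = torch.cat([vis, txt], dim=1)
        # Single forward pass; visual and query tokens attend to each other freely.
        h = self.predictor(joint)
        return h.mean(dim=1)                   # S_Y: one 1,536-d vector per clip

model = ToyVLJEPA().eval()
video_feats = torch.randn(1, 16, 1024)   # stand-in for patch features of a short clip
query_feats = torch.randn(1, 8, 768)     # stand-in for the tokenized question
with torch.no_grad():
    s_y = model(video_feats, query_feats)
print(s_y.shape)                          # torch.Size([1, 1536]): the answer as a concept
```

The key contrast with the loop above: there is exactly one forward pass, and the output is a vector rather than a stream of tokens.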
Semantic Match:
• Retrieval/Classification: If it's a multiple-choice task, it just finds the closest pre-saved vector for "chopping" in its library. No text generation is needed.
• Generation: If you need a sentence, a tiny Y-Decoder "reads" that single vector once and expands it into: "The person is chopping onions."
• View: VL-JEPA uses Bi-directional Attention in the predictor, allowing visual tokens and query tokens to attend to each other in a single forward pass. There is no sequential dependency.
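And a sketch of the semantic match step itself, assuming a hypothetical, randomly initialized label library; in a real system the label vectors would be pre-computed by a text encoder, and the Y-Decoder would expand S_Y into a sentence only when free-form text is needed.

```python
# Sketch of retrieval/classification by nearest label vector (made-up library).
import torch
import torch.nn.functional as F

labels = ["chopping onions", "washing dishes", "pouring coffee"]
library = F.normalize(torch.randn(len(labels), 1536), dim=-1)  # pre-saved label vectors
s_y = F.normalize(torch.randn(1, 1536), dim=-1)                # predicted answer vector

scores = s_y @ library.T                    # cosine similarities, shape (1, 3)
best = scores.argmax(dim=-1).item()
print(f"Closest concept: {labels[best]}")   # no token-by-token generation needed
```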