
VL-JEPA

Imagine you watch a two-hour lecture at home and come away with a question. Whom do you ask? Do you dump two hours of video plus text into GPT?

Hello Data Points,

This is the problem VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) is built to solve.

Let's read the research paper to see how.


๐™Ž๐™˜๐™š๐™ฃ๐™–๐™ง๐™ž๐™ค: You are watching a live video feed of a person in a kitchen. ๐™”๐™ค๐™ช ๐™–๐™จ๐™  ๐™ฉ๐™๐™š ๐˜ผ๐™„: "What is the person doing right now?"


~~~ ๐—ง๐—ต๐—ฒ ๐—ง๐—ฟ๐—ฎ๐—ฑ๐—ถ๐˜๐—ถ๐—ผ๐—ป๐—ฎ๐—น ๐—š๐—ฒ๐—ป๐—ฒ๐—ฟ๐—ฎ๐˜๐—ถ๐˜ƒ๐—ฒ ๐—ฉ๐—Ÿ๐—  ๐—ฃ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€ ~~~

  1. Autoregressive: Generative models treat vision-language tasks as next-token prediction; they think in "words." The step-by-step thought process:

  2. Input Encoding: The model takes the video frame and your text query. It converts them into a long sequence of tokens (visual tokens + text tokens).

  3. The Sequential Bottleneck: To answer, the model must start generating text. It predicts the first word: "The".

  4. Self-Attention Loop: To get the second word, it re-reads the video, the query, and the word "The" it just wrote, then predicts "person." This repeats until the end-of-sequence token.

  5. End: It repeats this loop 10–20 times to finish the sentence: "The person is chopping onions."

The Hidden Struggle:
• During this time, the model spends a lot of "brain power" (compute) on grammar, punctuation, and style, none of which changes the actual fact that onions are being chopped.
• The model is bound by Causal Attention. It cannot "jump" to the answer because it must maintain the statistical probability of the entire token sequence.
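To make the sequential bottleneck concrete, here is a minimal sketch of greedy autoregressive decoding. The `model` interface, the token tensors, and `eos_id` are illustrative assumptions, not the API of any particular VLM library; the point is only that every new word requires another full forward pass over the entire prefix.

```python
import torch

def generate_answer(model, visual_tokens, query_tokens, eos_id, max_new_tokens=20):
    # visual_tokens, query_tokens: LongTensor of token ids, shape (1, length)
    sequence = torch.cat([visual_tokens, query_tokens], dim=1)
    generated = []
    for _ in range(max_new_tokens):                # one token per loop iteration
        logits = model(sequence)                   # causal attention over the whole prefix
        next_id = logits[:, -1, :].argmax(dim=-1)  # greedy pick of the next "word"
        generated.append(next_id.item())
        if next_id.item() == eos_id:               # stop once the sentence is finished
            break
        sequence = torch.cat([sequence, next_id.unsqueeze(0)], dim=1)
    return generated  # token ids for e.g. "The person is chopping onions."
```

Notice that the loop body re-runs the full model once per token; with a 10–20 token answer, that is 10–20 forward passes to state a single fact.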


~~~ ๐—ง๐—ต๐—ฒ ๐—ฉ๐—Ÿ-๐—๐—˜๐—ฃ๐—” ๐—ฃ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€ (๐—๐—ผ๐—ถ๐—ป๐˜ ๐—˜๐—บ๐—ฏ๐—ฒ๐—ฑ๐—ฑ๐—ถ๐—ป๐—ด ๐—ฃ๐—ฟ๐—ฒ๐—ฑ๐—ถ๐—ฐ๐˜๐—ถ๐—ผ๐—ป) ~~~

  1. VL-JEPA treats the task as semantic mapping; it thinks in "concepts." The step-by-step thought process:

  2. Vision Compression (X-Encoder): It uses a frozen V-JEPA 2 backbone to compress the video into a high-level "semantic summary" vector.

  3. Non-Autoregressive Prediction: Instead of starting a sentence, the Predictor (Llama-3 layers with Bi-directional Attention) looks at the video summary and your question simultaneously.

  4. Direct Latent Output: It outputs a single 1,536-dimensional vector (S_Y) that represents the idea of "chopping onions."

Semantic Match:
• Retrieval/Classification: If it's a multiple-choice task, it simply finds the closest pre-saved vector for "chopping" in its library. No text generation is needed.
• Generation: If you need a sentence, a tiny Y-Decoder "reads" that single vector once and expands it into: "The person is chopping onions."
• The key difference: VL-JEPA uses Bi-directional Attention in the predictor, allowing visual tokens and query tokens to attend to each other in a single forward pass. There is no sequential dependency.
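For contrast, here is a minimal sketch of the retrieval path in VL-JEPA-style inference. The frozen X-Encoder, the bidirectional predictor, and the 1,536-dimensional latent follow the description above, but the function names, signatures, and label library are illustrative assumptions rather than the paper's actual code.

```python
import torch
import torch.nn.functional as F

def answer_by_retrieval(x_encoder, predictor, video, query_tokens, label_vectors, labels):
    # 1. The frozen vision backbone compresses the video into a semantic summary.
    with torch.no_grad():
        visual_summary = x_encoder(video)                   # e.g. (1, num_patches, dim)

    # 2. One bidirectional forward pass: the video summary and the query attend
    #    to each other; the output is a single latent vector S_Y.
    s_y = predictor(visual_summary, query_tokens)           # e.g. shape (1, 1536)

    # 3. Retrieval: pick the pre-computed label embedding closest to S_Y.
    sims = F.cosine_similarity(s_y, label_vectors, dim=-1)  # label_vectors: (num_labels, 1536)
    return labels[sims.argmax().item()]                     # e.g. "chopping onions"
```

For free-form answers, the same S_Y would instead be handed once to the Y-Decoder, which expands it into a sentence; the latent prediction itself never loops.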
