SEG TOKEN Usage #49

Open
Ruining0916 opened this issue Mar 5, 2025 · 7 comments

Comments

@Ruining0916

Hi Authors,

Thanks for your excellent work! I am a little confused about the [SEG] token design in the script llava_sam.py:

1. When the [SEG] token is invalid, why do you need to add the number 5?
2. If I understand correctly, you put 5 sampled video frames into the input prompt as tokens, and each data entry is supposed to generate 1 [SEG] token. However, I observed that the code here extracts the last 5 indices of the hidden states; why not 1? Additionally, for batch_size = 2 and frame_per_batch = [5, 5], seg_token_counts is [5, 5] instead of [1, 1] with the current model. Since self.seg_token_idx is a single integer, are these five [SEG] tokens the same?
3. I also observed many cases where, even though frame_per_batch = [5, 5], seg_token_counts is [3, 0]. How should I interpret only 3 [SEG] tokens being generated instead of 5, and how do you deal with the resulting alignment issue?

Thanks a lot for your clarification!

Thanks,
Ruining

@HarborYuan
Collaborator

Hi @Ruining0916 ,

Thanks for your interest in our work.

The 5 in the first question does not actually mean 5 frames. It means there are 5 instances in one set of image/video data; that is, 5 [SEG] tokens will generate 5 instance masks. The code you mentioned mainly implements the empty execution needed to support ZeRO-3 during training, so that different GPUs keep executing the same code.
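As an illustration of that empty-execution idea (a simplified sketch with hypothetical names, not the repository's actual code), the key point is that when a batch contains no [SEG] token, the model still produces a zero-weighted output that depends on the hidden states, so every ZeRO-3 rank executes the same operations:

```python
def gather_seg_states(hidden_states, seg_mask):
    """Collect hidden-state vectors at [SEG] positions.

    hidden_states: list of per-token vectors (lists of floats).
    seg_mask: list of bools, True where the token is [SEG].
    """
    selected = [h for h, m in zip(hidden_states, seg_mask) if m]
    if not selected:
        # Empty execution: no [SEG] token in this batch, so emit a
        # zero-weighted vector that still reads hidden_states. Every
        # GPU thus runs the same code path and no rank stalls waiting
        # for collectives under ZeRO-3.
        selected = [[0.0 * x for x in hidden_states[0]]]
    return selected
```

The dummy output contributes nothing numerically, but keeping all ranks on the same code path is what matters for sharded training.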

During training, the input texts look like this:

<image>
<user>
Please segment obj1.
<assistant>
It is [SEG].
<user>
Please segment obj2.
<assistant>
It is [SEG].
<user>
Please segment obj3.
<assistant>
It is [SEG].
...
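With a prompt like the above, the number of [SEG] tokens per batch entry equals the number of queried instances, which is why seg_token_counts can be [5, 5] rather than [1, 1]. A minimal sketch of that counting (the token id is a hypothetical placeholder, not the repository's actual value):

```python
SEG_TOKEN_IDX = 32000  # hypothetical vocabulary id for [SEG]

def seg_token_counts(input_ids_batch, seg_token_idx=SEG_TOKEN_IDX):
    # One count per batch entry: with the 3-object prompt above, each
    # entry contributes 3 [SEG] tokens (one per instance), not 1.
    return [sum(1 for t in ids if t == seg_token_idx)
            for ids in input_ids_batch]
```

All five [SEG] tokens share the same vocabulary id; they are distinguished only by their positions (and hence their hidden states) in the sequence.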

@Ruining0916
Author

Thanks for your clarification! I wonder if the [SEG] token is guaranteed to be generated in the VLM output; that is, will each image/video instance have a non-empty [SEG] token?

Thanks,
Ruining

@HarborYuan
Collaborator

Hi @Ruining0916

Can you explain further about your question? I do not understand your question.

@Ruining0916
Author

To restate my question more clearly: I’m wondering whether the [SEG] token is always generated by the VLM module; specifically, is every image or video instance guaranteed to have a non-empty [SEG] token? I am asking because I observed that seg_token_count can be 0 here.

@HarborYuan
Collaborator

Ah, I see your question. In some cases, the VLM will not generate the [SEG] token. In this case, we will consider that there is no such object and do not generate a mask.
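A hypothetical sketch of that policy (illustrative names only): the generated [SEG] tokens are paired with the queried objects in order, and any object for which the VLM emitted no [SEG] token is treated as absent and gets no predicted mask:

```python
def pair_masks_with_objects(seg_count, object_names):
    # Hypothetical helper: pair generated [SEG] tokens with queried
    # objects in order. Objects beyond seg_count produced no [SEG]
    # token, so they are treated as absent (value False = no mask).
    return {name: i < seg_count for i, name in enumerate(object_names)}
```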

@Ruining0916
Author

Thanks for your clarification! Is this caused by the object being absent from the frame according to the ground truth, or by the VLM's insufficient capability?

@HarborYuan
Collaborator

I think both may happen. In some cases, the VLM will output [SEG] even if the object is not in the frame. In other cases, the VLM may not generate [SEG] even though the object is in the frame.
