Following up on an announcement that LLaVA-NeXT had been merged into Hugging Face Transformers, someone asked for a “vision-language model which can distinguish the left side from the right side of the frame/picture”.
My response:
There was a recent preprint where, if I am not mistaken, they superimpose a grid onto the original image and label the intersections with coordinates. The model can then work out left versus right by itself! I don’t recall the reference, though.
Update: the paper in question is Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models. It doesn’t use a solid grid, though: it overlays a matrix of dots, each labeled with its coordinates.
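To give a feel for the idea, here is a minimal sketch of coordinate scaffolding using Pillow. The dot spacing, colors, and label format are my own assumptions for illustration, not the paper’s exact settings:

```python
from PIL import Image, ImageDraw

def add_coordinate_scaffold(image_path, rows=6, cols=6, out_path="scaffolded.png"):
    """Overlay a matrix of labeled dots on an image so a VLM can refer to
    positions like "(2,5)" instead of vague spatial terms.
    The 6x6 layout and label style are illustrative choices only."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    radius = max(2, min(w, h) // 200)
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            # Evenly spaced interior grid positions (no dots on the border).
            x = w * c / (cols + 1)
            y = h * r / (rows + 1)
            draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill="black")
            # Label each dot with its (row, column) coordinate.
            draw.text((x + radius + 2, y - radius - 2), f"({r},{c})", fill="black")
    img.save(out_path)
    return out_path
```

The scaffolded image is then passed to the model together with a prompt explaining the coordinate convention, so “left side of the picture” becomes “the columns with small column indices”.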