Following up on an announcement that LLaVA-NeXT had been merged into Hugging Face Transformers, someone asked for a “vision-language model which can distinguish the left side from the right side of the frame/picture”.
My response:
There was a recent preprint where, if I am not mistaken, they superimpose a grid onto the original image and label the intersections with coordinates. The model can then work out left versus right by itself! I don’t recall the reference, though.
Update: the paper in question is Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models. It doesn’t use a solid grid, though: it overlays a matrix of dots, each labeled with its coordinates.
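To give a feel for the idea, here is a minimal sketch of coordinate scaffolding using Pillow. The dot spacing, colors, and label format are my own assumptions for illustration, not the paper’s exact settings:

```python
from PIL import Image, ImageDraw

def add_coordinate_scaffold(image_path, rows=6, cols=6, out_path="scaffolded.png"):
    """Overlay a matrix of labeled dots on an image so a VLM can refer to
    positions like "(2,5)" instead of vague spatial terms.
    The 6x6 layout and label style are illustrative choices only."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    radius = max(2, min(w, h) // 200)
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            # Evenly spaced interior grid positions (no dots on the border).
            x = w * c / (cols + 1)
            y = h * r / (rows + 1)
            draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill="black")
            # Label each dot with its (row, column) coordinate.
            draw.text((x + radius + 2, y - radius - 2), f"({r},{c})", fill="black")
    img.save(out_path)
    return out_path
```

The scaffolded image is then passed to the model together with a prompt explaining the coordinate convention, so “left side of the picture” becomes “the columns with small column indices”.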