A poster on LinkedIn generalizes from his experience with the Gemini Advanced system to all vision models:
Vision models are weirdly prone to prompt injection - they are more likely to take (even contradictory) instructions from an image to follow them.
I couldn’t reproduce this:
Not reproducible with the Gemini Pro Vision model via Google Vertex AI. I had to amend the prompt to not include “Recipient”, because the request would otherwise be blocked, despite all filters being set to the lowest possible setting of “Block few”. The greatest problem regarding reliability at this point seems to be the Safety Settings 😄. What I did notice with GPT-4V: margins matter. A full-page pdf2img render of a magazine page yields different results than a tightly cropped version.
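To make the margins observation testable, one option is to send the same page to the model twice, once as the full render and once cropped to its content. The following is only a minimal sketch of the cropping step, assuming the page is already available as a grid of grayscale values (0 = black, 255 = white). The names `content_bbox` and `crop_to_content` and the `white_threshold` value are hypothetical, not part of any library or of the original test setup.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of grayscale pixels, 0 (black) to 255 (white)


def content_bbox(img: Image, white_threshold: int = 250) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of the non-white content.

    right and bottom are exclusive. Returns None for a blank or empty page.
    The threshold is an assumption; scanned pages may need a lower value.
    """
    rows = [y for y, row in enumerate(img) if any(p < white_threshold for p in row)]
    if not rows:
        return None
    cols = [x for x in range(len(img[0])) if any(row[x] < white_threshold for row in img)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop_to_content(img: Image, padding: int = 0, white_threshold: int = 250) -> Image:
    """Crop the page to its content, keeping `padding` pixels of margin."""
    bbox = content_bbox(img, white_threshold)
    if bbox is None:
        # Nothing to crop: return an unchanged copy of the page.
        return [row[:] for row in img]
    left, top, right, bottom = bbox
    left = max(0, left - padding)
    top = max(0, top - padding)
    right = min(len(img[0]), right + padding)
    bottom = min(len(img), bottom + padding)
    return [row[left:right] for row in img[top:bottom]]
```

Varying `padding` between zero and the full original margin would give a series of images with which to check how sensitive a given model's answer is to the amount of whitespace.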