Visual GPT

Visual GPT, the multi-modal interface that a German Microsoft manager likely referred to in earlier hints, has been released as source code on GitHub. And it’s a bit disappointing, at least when running on CPUs only rather than GPUs.

Details: Early reports are light on specifics, so here’s what to expect when following the “Advice for CPU users”:

the entire thing will occupy 47 GB on your hard disk
“python visual_chatgpt.py” starts a web server that provides a web UI
you can upload images, which seem to be turned into a textual description that is then fed to OpenAI GPT, which answers questions about it
image descriptions are basic, and recognition quality is mixed
no trace of OCR
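
Given the 47 GB footprint mentioned above, it may be worth checking free disk space before starting the download. The following is a minimal sketch using only the Python standard library; the function name has_enough_space and the decimal-gigabyte interpretation of "47 GB" are assumptions for illustration, not part of Visual GPT.

```python
import shutil

# Figure taken from the text above; interpreted here as decimal gigabytes (assumption).
REQUIRED_GB = 47


def has_enough_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the file system holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    if has_enough_space():
        print("Enough free disk space for the installation.")
    else:
        print(f"Less than {REQUIRED_GB} GB free; consider another disk.")
```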

Example dialog with GPT, managed by Visual GPT (output to the console, among other background info):

Human: provide a figure named image/a17d1dd2.png. The description is: a woman holding a baby fox in her arms. This information helps you to understand this image, but you should use tools to finish following tasks, rather than directly imagine from my description. If you understand, say “Received”.

AI: Received.

Human: image/a17d1dd2.png What’s her shirt color?

AI: The woman in the image is wearing a white shirt.

Screenshot: Visual GPT with the above dialog
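
The console output above suggests that the generated image caption is wrapped in a fixed instruction text before it reaches GPT. Here is a minimal sketch of that step, reconstructed purely from the logged dialog; the function names build_image_prompt and build_history are hypothetical, and this is not necessarily how Visual GPT implements it internally.

```python
def build_image_prompt(image_path: str, description: str) -> str:
    """Wrap an image caption in the instruction text seen in the console log."""
    return (
        f"provide a figure named {image_path}. "
        f"The description is: {description}. "
        "This information helps you to understand this image, "
        "but you should use tools to finish following tasks, "
        "rather than directly imagine from my description. "
        'If you understand, say "Received".'
    )


def build_history(image_path: str, description: str) -> list[tuple[str, str]]:
    """Return the opening turns of the dialog as (speaker, text) pairs."""
    return [
        ("Human", build_image_prompt(image_path, description)),
        # The logged reply from GPT; hard-coded here for illustration only.
        ("AI", "Received."),
    ]


if __name__ == "__main__":
    history = build_history(
        "image/a17d1dd2.png", "a woman holding a baby fox in her arms"
    )
    for speaker, text in history:
        print(f"{speaker}: {text}")
```

Under this reading, any later question about the image can only be answered from the caption or from further tool calls, which would explain why details that are absent from the caption come out unreliable.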

Better results with LLaVA-13B (see Heise newsticker), using the same picture of the woman with the fox:

? describe the image in great detail
! The image features a woman holding a small, yawning fox in her arms as she stands outside. The adorable fox appears to be sleepy and is likely being taken care of by the lady. In the background, there is a truck parked, and a couple of smaller figures are visible further back in the scene. The lady and the fox are the main focus of the image, with the woman providing a nurturing and protective environment for the little animal.
? what’s the color of her shirt?
! The woman holding the fox is wearing a blue shirt.

Also, it seems to work with cropped objects, which Visual GPT does not (see screenshot, although hallucination also becomes obvious in this one).

Screenshot: Picture of two buses, one cropped

Not useful for slide processing, though.

Image: Slide with an architecture diagram, in German

When given a photo of a sign with opening hours, it replies:

As a visual AI assistant, I am unable to provide specific times or details about the sign’s content, but I can confirm that the sign contains a list of times arranged in a row, written in German.

(Image credits: