OpenAI has updated the API cookbook to walk through the basics of multimodality and using GPT-4o via the API. However, the code sample there shows that GPT-4o does not (yet?) process video natively; instead it relies on frames extracted as images, sampled at a fixed rate of 0.5 Hz (i.e. one frame every two seconds; see the sketch after the list below). My question of whether there are established ways to improve on this (e.g. by relying on key frames in the MPEG stream) remained unanswered, perhaps highlighting that:
- for practical purposes, one should rely on the (alleged) native video processing capabilities of Google Gemini
- more research in that area is needed
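For reference, here is a minimal sketch of the fixed-rate approach described above, assuming OpenCV and the official `openai` Python client; the file name, the prompt, and the 0.5 Hz default are illustrative placeholders rather than the cookbook's exact values. The `idx % step` check is where a smarter strategy (e.g. key-frame selection) would plug in.

```python
import base64

import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai


def sample_frames(path: str, sampling_hz: float = 0.5) -> list[str]:
    """Grab frames at a fixed rate (default 0.5 Hz) and return them base64-encoded."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(int(round(fps / sampling_hz)), 1)  # number of frames between samples
    frames_b64, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:  # swap this test for key-frame detection to improve sampling
            ok_enc, buf = cv2.imencode(".jpg", frame)
            if ok_enc:
                frames_b64.append(base64.b64encode(buf).decode("utf-8"))
        idx += 1
    cap.release()
    return frames_b64


# Send the sampled frames as data-URL images alongside a text prompt.
client = OpenAI()
frames = sample_frames("video.mp4")  # placeholder path
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what happens in this video."},
                *[
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{f}"},
                    }
                    for f in frames
                ],
            ],
        }
    ],
)
print(response.choices[0].message.content)
```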