What Can GPT-4 Do for Zero-shot Visual Recognition?

Ching (Chingis)
4 min read · Dec 6, 2023

Since its launch in Nov ’22, ChatGPT has captivated the tech world, spurring a wave of investment in Generative AI. From the unveiling of the large multimodal model GPT-4 in Mar ’23 to the integration of GPT-4 with Vision into the ChatGPT platform in Sep ’23 🎯, it’s been a thrilling journey. To top it off, OpenAI hosted its first DevDay around ChatGPT’s first anniversary, releasing the GPT-4V API 👩‍💻. This major milestone opens avenues for extensive academic evaluations 🏫.

This post introduces and summarizes a study of GPT-4’s capabilities for zero-shot visual recognition.

Underlying Task


The authors assess GPT-4’s capabilities for visual recognition (classification) by evaluating its performance on 16 benchmarks spanning images, videos, and point clouds.

Videos and point clouds are converted into 2D images. For videos, the authors sample a few frames from each clip. For point clouds, they render views from multiple angles, following MVCNN.
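To make the conversion step concrete, here is a minimal sketch of how frames and viewpoints could be chosen. It is not the authors' code: the function names, the segment-centre sampling rule, and the ring of cameras at a 30-degree elevation are all illustrative assumptions, since the post does not give the exact frame count or camera setup.

```python
import math


def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly across a video (hypothetical helper)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, num_frames)
    # Split the video into equal segments and take the centre of each one.
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


def ring_camera_positions(
    num_views: int, elevation_deg: float = 30.0, radius: float = 1.0
) -> list[tuple[float, float, float]]:
    """Place cameras evenly on a ring around the origin (hypothetical helper).

    Assumption: every camera looks at an object centred at the origin,
    and the elevation angle is the same for all views.
    """
    elev = math.radians(elevation_deg)
    positions = []
    for k in range(num_views):
        azim = 2.0 * math.pi * k / num_views  # evenly spaced azimuth angles
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)
        positions.append((x, y, z))
    return positions


if __name__ == "__main__":
    print(sample_frame_indices(100, 4))
    for pos in ring_camera_positions(4):
        print(tuple(round(c, 3) for c in pos))
```

Each sampled frame or rendered view then becomes an ordinary 2D image, so videos and point clouds can be handled by the same image classification setup.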

