In the last few years, many computer vision tasks have gone from virtually impossible to quite easy. One task that is hard even for many humans is recognising emotions from a picture of the eyes alone. To see how easy this has become with recent AI developments, I tried out a few approaches (GitHub repository with implementations).
CLIP
The first approach I tried is to use CLIP. CLIP provides a mapping (referred to as an “embedding”) from text to high-dimensional vectors, and a mapping from images to vectors (of the same dimension as for text). If the text describes the image well, then their embedding vectors should point in similar directions.
This suggests a way to detect emotions: embed the image, embed each candidate label, and predict the emotion whose label embedding is closest to the image embedding. To test this, I used images and options from this test. A caveat, though: these images may well be in CLIP's training data!
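To make this concrete, here is a minimal sketch using the openai `clip` package; the label set, prompt template, and image path are illustrative placeholders, not the exact ones from the test.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder labels and image path; the real test gives 4 options per image.
labels = ["playful", "comforting", "irritated", "bored"]
image = preprocess(Image.open("eyes.jpg")).unsqueeze(0).to(device)
text = clip.tokenize([f"a photo of a person looking {l}" for l in labels]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity: normalise both embeddings, then take dot products.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).squeeze(0)

print(labels[similarity.argmax().item()])
```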
This approach does significantly better than chance: it is correct about 50% of the time on the sample of 22 images I tried (p-value of 0.0029 against random guessing among 4 options).
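A p-value like this can be obtained from a one-sided binomial test against the 25% chance rate. As a sketch (assuming that kind of test, and using 12 correct out of 22 as an illustrative count, which reproduces the quoted value):

```python
from scipy.stats import binomtest

# Null hypothesis: guessing uniformly among 4 options, i.e. 25% per image.
# 12 correct out of 22 is illustrative; it gives p ~= 0.0029.
result = binomtest(k=12, n=22, p=0.25, alternative="greater")
print(result.pvalue)
```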
OpenCLIP
One low-hanging-fruit improvement is to use a bigger CLIP model. This didn't end up helping, though: one of the largest models available in the OpenCLIP library had exactly the same accuracy on my sample as the smaller CLIP model.
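Swapping in a larger model via OpenCLIP is only a few lines; a sketch, with the model name and pretrained tag as illustrative choices (I'm not claiming these are the exact ones used here):

```python
import open_clip

# Model name and pretrained tag are illustrative;
# open_clip.list_pretrained() shows every available combination.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k"
)
tokenizer = open_clip.get_tokenizer("ViT-bigG-14")

# The rest of the pipeline is the same as before: preprocess the image,
# tokenize the candidate labels, embed both, and pick the closest label.
```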
Asking a language model
Another approach is to just ask a multimodal language model what emotion it thinks the image shows. I used the Anthropic API to automatically ask Claude 3 to guess what emotion each pair of eyes showed, and this gave a pretty big improvement.
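Roughly how such a request looks with the Anthropic Python SDK; the prompt wording, options, and file handling here are illustrative rather than the exact script used.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative options and image; the real script loops over all 22 images.
options = ["playful", "comforting", "irritated", "bored"]
with open("eyes.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=20,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Which emotion do these eyes show? "
                            f"Answer with exactly one of: {', '.join(options)}.",
                },
            ],
        }
    ],
)
print(message.content[0].text)
```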
This got up to 72% accuracy!
Interestingly, the best performance came from Claude Haiku, the smallest model. The larger Claude Opus scored only 16%, sometimes failing even to pick one of the multiple-choice options. Another interesting quirk: if you ask Haiku to describe the emotion without providing options, it almost invariably says "intensity", which is not totally wrong for a closeup of the eyes.
I would actually expect it to be correct more often. After all, isn't this just pattern recognition?