IMPS 2025
July 18, 2025
Traditional Approach
Expert annotation based on the Facial Action Coding System (FACS) - requires extensive training and manual coding of facial expressions.
CNNs learn to map facial images to emotion labels through supervised training
The goal: encode FACS expertise into learned features and network weights
CNN outputs FER scores - probability distributions across emotion categories
Dual interpretation: uncertainty in the model vs. strength/clarity of the expression
Systematic problems with CNN-based facial expression recognition:
The Blame Game
When CNN models fail, where does the responsibility lie?
CNN architecture → Training data quality → FACS methodology → Human annotation reliability
CNN approaches are locked into classic emotion categories from training datasets
What if we want to explore additional emotions like joy, contempt, or anticipation?
Adding new categories requires extensive data annotation + model retraining - very expensive
Let’s treat FER as an image classification problem
Vision Transformers are capable of zero-shot classification of images
We can specify any labels we want without re-training the model
OpenAI’s CLIP (Radford et al. 2021) model is a great example of this approach
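Conceptually, CLIP's zero-shot scores are a temperature-scaled softmax over the cosine similarities between the image embedding and each label's text embedding. A toy sketch in R (the embeddings here are made up for illustration; real CLIP vectors have 512+ dimensions):

```r
# Hypothetical 3-dimensional embeddings, NOT real CLIP output
image_emb <- c(0.2, 0.9, -0.3)
label_embs <- rbind(
  healthy    = c(0.1, 0.3, 0.8),   # "food I should eat"
  late_night = c(0.3, 0.8, -0.4)   # "food I eat at 2AM"
)

# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
sims <- apply(label_embs, 1, cosine, b = image_emb)

# Temperature-scaled softmax turns similarities into a probability distribution
softmax <- function(x, temp = 0.07) exp(x / temp) / sum(exp(x / temp))
probs <- softmax(sims)
round(probs, 2)
```

Because the labels are free text, swapping in new categories only changes `label_embs`; no retraining is involved.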
| Category | \(p\) |
|---|---|
| Food that I should eat in order to be a functioning, healthy 37-year-old | 0.08 |
| Food that I eat at 2AM because I have unresolved issues | 0.92 |
Images and text that describe similar concepts are mapped to nearby locations in the shared embedding space
Available on CRAN (v0.1.6) | GitHub
The image_scores() and video_scores() functions support different CLIP variants:
oai-base: openai/clip-vit-base-patch32 - Faster but less accurate - ~2GB RAM required - Good for quick analysis
oai-large: openai/clip-vit-large-patch14 - More accurate but slower - ~4GB RAM required - Better for precision work
eva-8B: BAAI/EVA-CLIP-8B-448 - Very large model with 4-bit quantization - ~8GB RAM (instead of ~32GB) - Highest accuracy available
jina-v2: jinaai/jina-clip-v2 - High accuracy with multilingual support - ~6GB RAM required - 89-language support
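A usage sketch for switching variants. The argument name `model` is assumed here, not confirmed; check `?image_scores` in your installed version for the exact parameter. Running this requires the package's Python backend and a model download:

```r
library(transforEmotion)

labels <- c("happiness", "contempt", "anticipation")

# Default, faster variant (assumed argument name; verify with ?image_scores)
scores_base <- image_scores("face.jpg", labels, model = "oai-base")

# Larger, more accurate variant -- needs roughly twice the RAM
scores_large <- image_scores("face.jpg", labels, model = "oai-large")
```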
# Install and load package
install.packages("transforEmotion")
library(transforEmotion)
# Setup Python environment (one-time only)
setup_miniconda()
# Analyze image emotions
image <- 'slides/images/food.jpg'
labels <- c("Food that I should eat in order to be a functioning, healthy 37-year-old",
"Food that I eat at 2AM because I have unresolved issues")
result <- image_scores(image, labels)

Scene from The Witcher TV series showing the character Yennefer - labeled as contempt in the MAFW dataset