IMPS 2025
July 18, 2025
Traditional Approach
Expert annotation based on FACS methodology - requires extensive training and manual coding of facial expressions.
CNNs learn to map facial images to emotion labels through supervised training
The goal: encode FACS expertise into learned features and network weights
CNN outputs FER scores - probability distributions across emotion categories
Dual interpretation: uncertainty in the model vs. strength/clarity of the expression
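To make this dual interpretation concrete, here is a minimal R sketch with made-up numbers (not output from any real model): the same flat distribution can reflect either model uncertainty or a weak/mixed expression.

# Classic basic-emotion categories used by most FER training sets
emotions  <- c("anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral")

# Hypothetical softmax outputs from a FER CNN (each sums to 1)
confident <- c(0.02, 0.01, 0.01, 0.90, 0.02, 0.02, 0.02)  # clear, strong expression
ambiguous <- c(0.18, 0.10, 0.12, 0.20, 0.15, 0.10, 0.15)  # model unsure OR expression weak/mixed
names(confident) <- names(ambiguous) <- emotions

# The scores alone cannot tell us which interpretation is correct
round(rbind(confident, ambiguous), 2)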
Systematic problems with CNN-based facial expression recognition:
The Blame Game
When CNN models fail, where does the responsibility lie?
CNN architecture → Training data quality → FACS methodology → Human annotation reliability
CNN approaches are locked into classic emotion categories from training datasets
What if we want to explore additional emotions like joy, contempt, or anticipation?
Adding new categories requires extensive data annotation + model retraining - very expensive
Let’s treat FER as an image classification problem
Vision Transformers are capable of zero-shot classification of images
We can specify any labels we want without re-training the model
OpenAI’s CLIP (Radford et al. 2021) model is a great example of this approach
Category | Score |
---|---|
Food that I should eat in order to be a functioning, healthy 37-year-old | 0.08 |
Food that I eat at 2AM because I have unresolved issues | 0.92 |
Images and text that describe similar concepts are mapped to nearby locations in the shared embedding space
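Conceptually, the scores in the table above arise from cosine similarities between these embeddings, passed through a temperature-scaled softmax. A toy base-R sketch with made-up embedding vectors (real CLIP embeddings have hundreds of dimensions):

# Toy 4-dimensional embeddings standing in for CLIP's image and text encoders
img_emb  <- c(0.9, -0.2, 0.1, 0.4)                      # image embedding
txt_embs <- rbind(healthy = c(-0.5, 0.8, 0.3, -0.1),    # "healthy 37-year-old" label
                  twoam   = c(0.8, -0.3, 0.2, 0.5))     # "2AM unresolved issues" label

cosine  <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Cosine similarity between the image and each candidate label
sims    <- apply(txt_embs, 1, function(txt) cosine(img_emb, txt))

# Temperature-scaled softmax turns similarities into label probabilities
softmax <- function(x, temp = 0.07) exp(x / temp) / sum(exp(x / temp))
round(softmax(sims), 2)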
Available on CRAN (v0.1.6) and on GitHub
The image_scores() and video_scores() functions support different CLIP variants:
- oai-base (openai/clip-vit-base-patch32): faster but less accurate; ~2GB RAM required; good for quick analysis
- oai-large (openai/clip-vit-large-patch14): more accurate but slower; ~4GB RAM required; better for precision work
- eva-8B (BAAI/EVA-CLIP-8B-448): very large model with 4-bit quantization; ~8GB RAM required (instead of ~32GB); highest accuracy available
- jina-v2 (jinaai/jina-clip-v2): high accuracy with multilingual support (89 languages); ~6GB RAM required

(An example of selecting a variant follows the basic usage code below.)
# Install and load package
install.packages("transforEmotion")
library(transforEmotion)
# Setup Python environment (one-time only)
setup_miniconda()
# Analyze image emotions
image <- 'slides/images/food.jpg'
labels <- c("Food that I should eat in order to be a functioning, healthy 37-year-old",
"Food that I eat at 2AM because I have unresolved issues")
result <- image_scores(image, labels)
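To use one of the larger CLIP variants listed above, the sketch below assumes the variant is selected through a model argument that accepts the shorthand names (oai-base, oai-large, eva-8B, jina-v2); check ?image_scores for the exact argument name in the installed version.

# Re-run the analysis with a larger CLIP variant
# NOTE: the 'model' argument name is an assumption - verify with ?image_scores
result_large <- image_scores(image, labels, model = "oai-large")

# Multilingual labels are possible with the jina-v2 variant (89 languages)
result_jina <- image_scores(image, labels, model = "jina-v2")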
Scene from The Witcher TV series showing the character Yennefer - labeled as contempt in the MAFW dataset
Parameters:
Video description:
A lady debunks the other's mind, shaking the head, then comes ahead. The lifted eyebrows and a slight pout.
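A hedged sketch of scoring a clip like this one with video_scores(); the file path is hypothetical and the argument names (video, classes, nframe_per_video) should be checked against ?video_scores.

# Candidate labels - zero-shot, so we can include "contempt" even though
# classic FER training sets rarely do
video_labels <- c("contempt", "anger", "disgust", "surprise", "neutral")

# NOTE: hypothetical path; argument names are assumptions - verify with ?video_scores
video_result <- video_scores(video = "slides/videos/yennefer.mp4",
                             classes = video_labels,
                             nframe_per_video = 20)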
Zero-shot Classification
Enables flexible emotion detection without retraining
dynEGA Analysis
Reveals hidden structure in emotion time series
DLO Parameters
Capture distinct emotional regulation strategies
Model validation on diverse datasets
Model fine-tuning for emotion-specific tasks
Improving parameter recovery methods
Questions?
Contact: atomasevic@ipb.ac.rs
Feature Hierarchy
Features learned in early layers detect edges and basic facial structures
Deeper layers combine features to detect more complex patterns like facial expressions
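As a generic illustration of this hierarchy (a sketch using the keras R package, not the specific FER model evaluated here):

library(keras)

# Generic FER-style CNN sketch: 48x48 grayscale faces -> 7 emotion probabilities
model <- keras_model_sequential() %>%
  # Early layers: small filters over raw pixels -> edges, corners, simple contours
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(48, 48, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  # Deeper layers: combine earlier features -> mouth corners, brow shapes, etc.
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  # Classification head: probability distribution over emotion categories
  layer_dense(units = 7, activation = "softmax")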
CLIP processes large batches (e.g., 32,768 pairs) simultaneously:
Positive pair (matched):
- Image: food wheel
- Text: "One sentence sums up how to eat healthy"

Negative pairs (contrastive):
- Same image paired with unrelated texts from the batch:
  - "Tesla announces new electric vehicle"
  - "Cat rescued from tree"
  - "Stock market reaches all-time high"
Maximize similarity between:
- Food image ↔ "healthy eating" text

Minimize similarity between:
- Food image ↔ "Tesla announces…" text
- "Healthy eating" text ↔ car image
Key Insight
The model learns by contrasting correct pairs against incorrect ones within each training batch
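A toy base-R sketch of this contrastive objective with random stand-in embeddings and a batch of 4 instead of 32,768: every image is compared to every text in the batch, and a symmetric cross-entropy pulls each image toward its own caption and pushes it away from the others.

set.seed(1)
batch <- 4; dims <- 8                     # tiny stand-ins for 32,768 pairs and 512+ dims
l2_normalize <- function(m) m / sqrt(rowSums(m^2))

img_emb <- l2_normalize(matrix(rnorm(batch * dims), batch, dims))  # image encoder output
txt_emb <- l2_normalize(matrix(rnorm(batch * dims), batch, dims))  # text encoder output

# All pairwise cosine similarities, scaled by a temperature parameter
logits <- (img_emb %*% t(txt_emb)) / 0.07

# Row-wise softmax cross-entropy: each image should match its own caption (the diagonal)
softmax_rows <- function(m) exp(m) / rowSums(exp(m))
loss_img_to_txt <- -mean(log(diag(softmax_rows(logits))))
loss_txt_to_img <- -mean(log(diag(softmax_rows(t(logits)))))

# Symmetric CLIP-style loss: average of the two directions
(loss <- (loss_img_to_txt + loss_txt_to_img) / 2)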
Black Box Pipeline
Key Question
Why would we abandon models grounded in expert knowledge?
Model | precision_negative | recall_negative | macro_f1 | weighted_f1 | accuracy |
---|---|---|---|---|---|
CNN-fer | 0.721 | 0.588 | 0.502 | 0.567 | 0.552 |
CNN-strong | 0.627 | 0.642 | 0.507 | 0.530 | 0.533 |
CNN-weak | NaN | 0.000 | 0.459 | 0.194 | 0.322 |
CLIP | 0.623 | 0.728 | 0.502 | 0.518 | 0.546 |
CLIP-NE | 0.609 | 0.914 | 0.490 | 0.509 | 0.609 |
MetaCLIP | 0.597 | 0.934 | 0.453 | 0.496 | 0.602 |
MetaCLIP-NE | 0.597 | 0.934 | 0.444 | 0.493 | 0.600 |
appleCLIP | 0.786 | 0.257 | 0.439 | 0.385 | 0.367 |
appleCLIP-NE | 0.767 | 0.346 | 0.474 | 0.439 | 0.415 |