Vanderbilt QMDA Colloquium
Scientific Computing Laboratory, Institute of Physics Belgrade, University of Belgrade
January 5, 2026
Three Bottlenecks
These bottlenecks prevent communication researchers from pursuing theory-driven emotion research at scale.
Traditional Approach
Expert annotation based on FACS methodology - requires certified coders and extensive manual work (Goodfellow et al. 2013)
CNNs learn to map facial images to emotion labels through supervised training on FACS-annotated data.
Goal: Encode FACS expertise into learned features and network weights
The Blame Game
When CNN models fail, where does responsibility lie?
CNN architecture → Training data quality → FACS methodology → Human annotation reliability
Key Insight
No task-specific training required - just provide your labels!
| Category | p |
|---|---|
| Food that I should eat to be healthy | ? |
| Food that I eat at 2AM because I have issues | ? |
What will CLIP predict?
| Category | p |
|---|---|
| Food that I should eat to be healthy | 0.08 |
| Food that I eat at 2AM because I have issues | 0.92 |
Tip
No training required for these specific labels!
Images and text describing similar concepts are mapped to nearby locations in shared embedding space
Trained on 400 million image-text pairs
CLIP processes large batches (e.g., 32,768 pairs) simultaneously:
Positive pair (matched):
Negative pairs (contrastive):
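The scoring step can be sketched in a few lines of R: zero-shot classification is a softmax over cosine similarities between the image embedding and each candidate label-text embedding. This is a toy illustration with random stand-in vectors, not real CLIP embeddings; the 0.07 temperature is the commonly cited CLIP default.

```r
# Toy sketch: softmax over cosine similarities (stand-in embeddings)
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

set.seed(1)
img_emb    <- rnorm(8)               # stand-in image embedding
label_embs <- replicate(4, rnorm(8)) # one column per candidate label text

sims  <- apply(label_embs, 2, cosine, a = img_emb)  # cosine per label
probs <- exp(sims / 0.07) / sum(exp(sims / 0.07))   # temperature-scaled softmax
round(probs, 3)                                      # sums to 1 across labels
```

The label with the highest probability is the zero-shot prediction.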
Trump Portraits - Zero-shot Emotion Classification
Same approach works for faces - dramatically different emotional profiles from official portraits
# Basic emotion labels
labels_basic <- c("angry", "happy", "sad", "fearful")
# Adjective forms (often better)
labels_adj <- c("an angry person", "a happy person",
"a sad person", "a fearful person")
# Context-specific
labels_context <- c("a politician showing anger",
"a politician expressing joy")
Key Insight
Label phrasing significantly affects classification accuracy. Experiment with your domain!
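A minimal way to compare the three label sets defined above is to score the same image with each one, using the package's `image_scores()` function. The file path here is a hypothetical placeholder:

```r
# "portrait.jpg" is a placeholder path; substitute your own image
img <- "portrait.jpg"

image_scores(img, labels_basic)    # bare adjectives
image_scores(img, labels_adj)      # "a/an ... person" phrasing
image_scores(img, labels_context)  # domain-specific phrasing
```

Comparing the resulting score tables side by side makes the effect of phrasing on the probability distribution directly visible.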
Key Feature
Focuses on complex scenes, not just individual faces
Trust
Fear
Anticipation
| Approach | Joy Score |
|---|---|
| Adjectives | 0.5304 |
| Person Descriptions | 0.6153 |
| Scene Descriptions | 0.8245 |
Adjectives: joyful, angry, fearful, sad, surprised, disgusted, trusting, anticipating
Person Descriptions: a joyful person, an angry person, a fearful person, a sad person, a surprised person, a disgusted person, a trusting person, an anticipating person
Scene Descriptions: a joyful scene, an angry scene, a fearful scene, a sad scene, a surprised scene, a disgusted scene, a trusting scene, an anticipating scene
| Model | Label Set | Hit@1 | Hit@2 |
|---|---|---|---|
| CLIP ViT-Base | Adjectives | 33.48% | 50.68% |
| CLIP ViT-Base | Person Descriptions | 31.45% | 47.96% |
| CLIP ViT-Base | Scene Descriptions | 31.22% | 47.29% |
| CLIP ViT-Large | Adjectives | 21.27% | 40.50% |
| CLIP ViT-Large | Person Descriptions | 27.83% | 48.19% |
| CLIP ViT-Large | Scene Descriptions | 33.03% | 46.38% |
Hit@1: the highest-scoring label is the correct one.
Hit@2: the correct label appears among the top 2 scores.
8-class setup on 442 test images.
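Given a matrix of per-class scores and the true labels, both metrics reduce to the rank of the true class. A short R sketch with toy data (`scores` and `truth` are illustrative names, not package objects):

```r
# Toy data: 5 images scored against 8 emotion classes
set.seed(42)
scores <- matrix(runif(5 * 8), nrow = 5, ncol = 8)
truth  <- sample(1:8, 5, replace = TRUE)  # true class index per image

# Rank of the true class within each image's descending score order
rank_of_truth <- sapply(seq_len(nrow(scores)), function(i)
  which(order(scores[i, ], decreasing = TRUE) == truth[i]))

hit1 <- mean(rank_of_truth <= 1)  # proportion ranked first
hit2 <- mean(rank_of_truth <= 2)  # proportion in top 2
```

Hit@2 is always at least as large as Hit@1, matching the pattern in the tables.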
For each batch of N image-text pairs:
| True Label | Correct Prompt | Wrong Prompts (Negatives) |
|---|---|---|
| Joy image | “person laughing out loud” | “person shouting angrily”, “person crying”, “person frozen in terror” |
| Sadness image | “person crying with tears” | “person laughing”, “person shouting”, “person celebrating” |
| Fear image | “person with wide frightened eyes” | “person laughing”, “person shouting”, “person crying” |
The model learns to maximize similarity for correct pairs and minimize it for incorrect ones.
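This symmetric contrastive (InfoNCE-style) objective can be sketched in R with random stand-in embeddings. The diagonal of the similarity matrix holds the matched pairs; cross-entropy is taken in both the image-to-text and text-to-image directions:

```r
# Toy sketch of the symmetric contrastive loss for N matched pairs
set.seed(7)
N <- 4; d <- 8
I  <- matrix(rnorm(N * d), nrow = N)   # stand-in image embeddings
Tx <- matrix(rnorm(N * d), nrow = N)   # stand-in text embeddings

l2 <- function(M) M / sqrt(rowSums(M^2))       # row-wise L2 normalization
logits <- (l2(I) %*% t(l2(Tx))) / 0.07         # cosine similarities / temperature

# Correct pair for row i is column i; average cross-entropy over rows
ce   <- function(L) -mean(log(exp(diag(L)) / rowSums(exp(L))))
loss <- (ce(logits) + ce(t(logits))) / 2       # image->text and text->image
```

Minimizing this loss pushes matched pairs up the diagonal and mismatched pairs down, which is exactly the behavior described above.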
| Model | Label Set | Hit@1 | Hit@2 |
|---|---|---|---|
| CLIP ViT-Base | Adjectives | 25.79% | 45.25% |
| CLIP ViT-Base | Person Descriptions | 22.85% | 47.06% |
| CLIP ViT-Base | Scene Descriptions | 23.30% | 43.67% |
| CLIP ViT-Large | Adjectives | 41.40% | 56.11% |
| CLIP ViT-Large | Person Descriptions | 42.53% | 59.05% |
| CLIP ViT-Large | Scene Descriptions | 44.12% | 61.99% |
Models fine-tuned with contrastive learning and a class-weighted loss; a small fine-tuning run on 2,489 training images (10 epochs).
All functions work on standard laptops without GPU.
# Install from CRAN
install.packages("transforEmotion")
# Load package (auto-configures Python)
library(transforEmotion)
# Optional: pre-warm environment
setup_modules()
Key Feature
Automatic Python environment management via uv - no manual setup required!
Default Model: DistilRoBERTa
Fast and efficient for text emotion detection (~2 GB RAM)
# Example: MLK's "I Have a Dream"
text <- "So even though we face the
difficulties of today and tomorrow,
I still have a dream. It is a dream
deeply rooted in the American dream."
# Define custom emotions
emotions <- c("anger", "fear", "joy",
"sadness", "optimism",
"hope", "surprise")
# Run analysis
results <- transformer_scores(
text = text,
classes = emotions
)
Results:
| Emotion | Score |
|---|---|
| Hope | 0.22 |
| Surprise | 0.18 |
| Optimism | 0.15 |
| Joy | 0.14 |
openai/clip-vit-base-patch32
openai/clip-vit-large-patch14
jinaai/jina-clip-v2
Multi-modal Affective Facial Expression in the Wild (Liu et al. 2022)
When analyzing video emotion time series:
Direct prediction
vad <- vad_scores(
input = text,
input_type = "text",
dimensions = c("valence",
"arousal",
"dominance")
)
Uses a transformer model for direct prediction over valence, arousal, and dominance label sets.
Lexicon mapping
Uses the NRC-VAD lexicon (Mohammad 2018) to map discrete emotion profiles into continuous valence, arousal, and dominance coordinates.
# Emo8 emotion scores from image analysis
emo8_scores <- data.frame(
joy = 0.65, trust = 0.15, fear = 0.05,
surprise = 0.10, sadness = 0.02,
disgust = 0.01, anger = 0.01, anticipation = 0.01
)
# Load NRC VAD lexicon
nrc_vad <- textdata::lexicon_nrc_vad()
# Map to VAD coordinates
vad_result <- map_discrete_to_vad(
emo8_scores,
vad_lexicon = nrc_vad
)
# Result: valence = 0.72, arousal = 0.58, dominance = 0.64
Maps a discrete Emo8 emotion profile into continuous VAD coordinates for cross-modal comparison.
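For intuition, one plausible hand-rolled version of this mapping is a score-weighted average of each emotion word's lexicon coordinates. This is an illustration, not necessarily the package's exact implementation, and it assumes the lexicon has `Word`, `Valence`, `Arousal`, and `Dominance` columns (as `textdata::lexicon_nrc_vad()` provides):

```r
# Illustrative mapping: VAD as a score-weighted average of lexicon entries
w   <- unlist(emo8_scores) / sum(unlist(emo8_scores))   # normalized weights
lex <- nrc_vad[match(names(w), tolower(nrc_vad$Word)), ] # look up each emotion word

vad <- c(valence   = sum(w * lex$Valence),
         arousal   = sum(w * lex$Arousal),
         dominance = sum(w * lex$Dominance))
```

High-weight emotions (here, joy) dominate the resulting coordinates, which is why a joyful profile maps to high valence.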
# Register a new model
add_vision_model(
name = "align-base",
model_id = "kakaobrain/align-base",
architecture = "align",
description = "KakaoBrain ALIGN base"
)
# Now use it
image_scores(img, labels, model = "align-base")
Extensibility
Add your own fine-tuned models or new architectures!
# Example text
text <- "He slammed the door and walked away without a word."
# Structured emotion classification
result <- rag(
text = text,
task = "emotion",
transformer = "Gemma3-4B",
output = "table",
labels_set = c("joy", "anger", "fear", "sadness")
)
# Returns:
# doc_id label confidence
# doc_1 anger 0.95
Practical Defaults
retriever = "vector", similarity_top_k = 10, temperature = 0 for reproducibility
Co-authors: Hudson Golino (UVA), Alexander P. Christensen (Vanderbilt)
Questions?