transforEmotion in Practice: Multimodal Emotion Analysis in R

Vanderbilt QMDA Colloquium

Aleksandar Tomašević

Scientific Computing Laboratory, Institute of Physics Belgrade, University of Belgrade

January 5, 2026

About Me

  • PhD in Sociology
  • Institute of Physics, University of Belgrade
  • Fulbright Visiting Scholar at University of Virginia
  • Research interests:
    • Emotional aspects of political communication
    • Computational methods for social sciences
    • Network dynamics of emotions

Face of Populism: Emotional Expressions of Populist Leaders

Research Project

  • Large-scale study of non-verbal political communication (Major and Tomašević 2025)
  • Analyzed emotional expressions of political leaders
  • 220 YouTube videos, global coverage
  • Based on CNNs (6 basic emotions + neutral)
  • Populist leaders show higher negative emotion scores and lower neutrality

Current Challenges in Computational Emotion Analysis

Three Bottlenecks

  1. Modal Fragmentation - Separate tools for text, image, and video analysis
  2. Taxonomic Rigidity - Locked into Ekman’s 6 basic emotions
  3. Reproducibility Barriers - Commercial API dependence

These bottlenecks prevent communication researchers from pursuing theory-driven emotion research at scale.

transforEmotion: Removing the Bottlenecks

# CRAN (stable)
install.packages("transforEmotion")

# GitHub (dev)
remotes::install_github("atomashevic/transforEmotion")

  • Unified multimodal processing (text, image, video)
  • Zero-shot classification with arbitrary emotion labels
  • Local inference - no APIs required
  • Standard laptop compatible (no GPU needed)

Talk Outline

Part I: Visual Emotion Recognition

  • Methods
    • CNNs vs transformers for emotion detection
  • CLIP
    • Zero-shot classification basics
  • Strategies
    • Labeling and fine-tuning

Part II: Tour of transforEmotion

  • Core Functions
    • Text, image, and video analysis
    • VAD mapping and RAG
  • Future Directions
    • Open questions and extensions

Part I: Visual Emotion Recognition

  • Traditional FER approaches
  • Transformer models with zero-shot classification

Traditional Facial Expression Recognition

FER Approaches

FACS Methodology

Traditional Approach

Expert annotation based on FACS methodology - requires certified coders and extensive manual work (Goodfellow et al. 2013)

CNN-Based Facial Expression Recognition

CNNs learn to map facial images to emotion labels through supervised training on FACS-annotated data.

Goal: Encode FACS expertise into learned features and network weights

Output Interpretation

  • Probability distribution across emotions
  • Dual meaning: Model uncertainty OR expression blend
  • Key limitation: Fixed label sets from training data
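To make the dual meaning concrete, here is a toy probability vector of the kind a CNN-based FER model emits over the 6 basic emotions plus neutral (values invented for illustration, not output from a real model):

```r
# Hypothetical softmax output from a CNN-based FER model
probs <- c(angry = 0.05, disgust = 0.02, fear = 0.08, happy = 0.55,
           sad = 0.04, surprise = 0.21, neutral = 0.05)

# The scores form a probability distribution: they sum to 1
sum(probs)

# Discrete label: take the most probable class
names(which.max(probs))  # "happy"

# The full distribution carries extra information: the non-trivial mass on
# "surprise" could reflect model uncertainty OR a happy-surprised blend
sort(probs, decreasing = TRUE)[1:2]
```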

Problems with CNN Approaches

  • Dataset bias - Limited diversity, lab conditions
  • Overfitting to benchmarks (FER2013, AffectNet)
  • Poor generalization to “in-the-wild” scenarios

The Blame Game

When CNN models fail, where does responsibility lie?

CNN architecture → Training data quality → FACS methodology → Human annotation reliability

The Major Bottleneck: Fixed Label Sets

The Problem

  • CNNs locked into classic categories
  • What if we want: joy, contempt, anticipation?
  • Adding new categories requires:
    • Extensive data annotation
    • Model retraining
    • Very expensive!

Alternative Approach: Zero-shot Classification

The Paradigm Shift

  • Treat FER as an image classification problem
  • Vision Transformers enable zero-shot classification
  • Specify any labels we want without re-training
  • OpenAI’s CLIP model exemplifies this approach (Radford et al. 2021)

Key Insight

No task-specific training required - just provide your labels!

OpenAI CLIP: Model Size in Context

  • ViT-Base: ~150M parameters
  • ViT-Large: ~430M parameters
  • Open weights and orders of magnitude smaller than frontier LLMs
  • Still strong zero-shot transfer for vision-language tasks

Zero-shot Classification Demo

Categories

Category                                       p
Food that I should eat to be healthy           ?
Food that I eat at 2AM because I have issues   ?

What will CLIP predict?

Zero-shot Classification with CLIP

Results

Category                                       p
Food that I should eat to be healthy           0.08
Food that I eat at 2AM because I have issues   0.92

Tip

No training required for these specific labels!

CLIP’s Joint Embedding Space

Images and text describing similar concepts are mapped to nearby locations in shared embedding space
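The geometry can be sketched with toy vectors: zero-shot classification reduces to cosine similarity between the image embedding and each candidate label's text embedding, converted to probabilities with a temperature-scaled softmax. The 3-D vectors below are invented for illustration (real CLIP embeddings have 512+ dimensions):

```r
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy 3-D embeddings standing in for CLIP's shared space
img_embedding <- c(0.9, 0.1, 0.2)               # an image of a smiling face
text_embeddings <- list(
  "a happy person" = c(0.8, 0.2, 0.1),          # close to the image
  "a sad person"   = c(-0.5, 0.7, 0.3)          # far from the image
)

# Similarity to each label, then softmax with a CLIP-style temperature
sims  <- sapply(text_embeddings, cosine, a = img_embedding)
probs <- exp(sims / 0.07) / sum(exp(sims / 0.07))

round(probs, 3)  # nearly all mass on "a happy person"
```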

How CLIP Learns: Contrastive Training

Training Process

  • Large batches (32,768 pairs)
  • Positive pairs: matched image-text
  • Negative pairs: mismatched from batch

Objective

  • Maximize similarity for correct pairs
  • Minimize similarity for incorrect pairs

Trained on 400 million image-text pairs

How Did We Label Those Image-Text Pairs?

The Training Process

CLIP processes large batches (e.g., 32,768 pairs) simultaneously:

Positive pair (matched):

  • Image: Food wheel
  • Text: “One sentence sums up how to eat healthy”

Negative pairs (contrastive):

  • Same image + unrelated texts from batch: “Tesla announces new electric vehicle”, “Cat rescued from tree”

CLIP for Political Emotion Analysis

Trump Portraits - Zero-shot Emotion Classification

Same approach works for faces - dramatically different emotional profiles from official portraits

Labeling Strategies for Zero-Shot FER

# Basic emotion labels
labels_basic <- c("angry", "happy", "sad", "fearful")

# Adjective forms (often better)
labels_adj <- c("an angry person", "a happy person",
                "a sad person", "a fearful person")

# Context-specific
labels_context <- c("a politician showing anger",
                    "a politician expressing joy")

Key Insight

Label phrasing significantly affects classification accuracy. Experiment with your domain!

Introducing the FindingEmo Dataset

Large-scale emotion recognition “in the wild”

  • 25,869 public images with annotations
  • Multi-person social scenes annotated at scene level
  • Plutchik’s Wheel (Emo8): joy, trust, fear, surprise, sadness, disgust, anger, anticipation
  • Annotated by 655 Prolific participants (Mertens et al. 2024)

Key Feature

Focuses on complex scenes, not just individual faces

Examples

Trust

Fear

Anticipation

Labeling Impact: Sports Celebration Example

CLIP Base Model Performance

Approach              Joy Score
Adjectives            0.5304
Person Descriptions   0.6153
Scene Descriptions    0.8245

Label Sets Used

Adjectives: joyful, angry, fearful, sad, surprised, disgusted, trusting, anticipating

Person Descriptions: a joyful person, an angry person, a fearful person, a sad person, a surprised person, a disgusted person, a trusting person, an anticipating person

Scene Descriptions: a joyful scene, an angry scene, a fearful scene, a sad scene, a surprised scene, a disgusted scene, a trusting scene, an anticipating scene

FindingEmo: Zero-shot Results by Labeling Approach

Model            Label Set             Hit@1    Hit@2
CLIP ViT-Base    Adjectives            33.48%   50.68%
                 Person Descriptions   31.45%   47.96%
                 Scene Descriptions    31.22%   47.29%
CLIP ViT-Large   Adjectives            21.27%   40.50%
                 Person Descriptions   27.83%   48.19%
                 Scene Descriptions    33.03%   46.38%

Hit@1: Highest score is correct.

Hit@2: Correct label appears in top 2 highest scores.

8-class setup on 442 test images.
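Both metrics can be computed directly from a matrix of zero-shot scores. A minimal sketch with toy data (the scores and labels below are invented, not taken from the evaluation):

```r
# Toy zero-shot output: rows = images, columns = candidate classes
scores <- rbind(
  c(joy = 0.40, trust = 0.30, fear = 0.05, surprise = 0.05,
    sadness = 0.05, disgust = 0.05, anger = 0.05, anticipation = 0.05),
  c(joy = 0.10, trust = 0.35, fear = 0.30, surprise = 0.05,
    sadness = 0.05, disgust = 0.05, anger = 0.05, anticipation = 0.05)
)
truth <- c("joy", "fear")  # annotated scene-level labels

# Fraction of images whose true label appears among the k highest scores
hit_at_k <- function(scores, truth, k) {
  hits <- vapply(seq_len(nrow(scores)), function(i) {
    top_k <- names(sort(scores[i, ], decreasing = TRUE))[1:k]
    truth[i] %in% top_k
  }, logical(1))
  mean(hits)
}

hit_at_k(scores, truth, 1)  # 0.5: "joy" is top-1 for image 1; "fear" is not for image 2
hit_at_k(scores, truth, 2)  # 1.0: "fear" appears in image 2's top 2
```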

Contrastive Fine-tuning: How It Works

For each batch of N image-text pairs:

  • Positive pairs: Matching image + emotion prompt → High similarity
  • Negative pairs: Mismatched combinations → Low similarity

Example Batch (N=4)

True Label      Correct Prompt                        Wrong Prompts (Negatives)
Joy image       “person laughing out loud”            “person shouting angrily”, “person crying”, “person frozen in terror”
Sadness image   “person crying with tears”            “person laughing”, “person shouting”, “person celebrating”
Fear image      “person with wide frightened eyes”    “person laughing”, “person shouting”, “person crying”

The model learns to maximize similarity for correct pairs and minimize it for incorrect ones.

Contrastive Fine-tuning: Technical Details

Key Components

  1. Symmetric InfoNCE Loss
    • \(\mathcal{L} = (\mathcal{L}_{\text{image} \to \text{text}} + \mathcal{L}_{\text{text} \to \text{image}}) / 2\)
    • Cross-entropy in both directions
    • Encourages bidirectional alignment
  2. Learnable Temperature
    • Controls similarity distribution sharpness
    • Initial: 0.07 (CLIP default)
    • Bounded: [0.01, 0.5] for stability
  3. Class-Weighted Loss
    • Handles imbalanced classes (Joy: 690 vs Surprise: 54)
    • Upsamples minority classes during training
  4. Contrastive Prompts
    • Designed to distinguish confusing pairs
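The symmetric loss is easy to sketch on a toy 2×2 similarity matrix (values invented; real training uses batch sizes in the thousands):

```r
# Toy cosine similarities: rows = images, columns = texts;
# the diagonal holds the matched (positive) pairs
sim <- matrix(c(0.9, 0.1,
                0.2, 0.8), nrow = 2, byrow = TRUE)
temperature <- 0.07
logits <- sim / temperature

# Cross-entropy of each row/column against its own index (the matched pair)
softmax  <- function(x) exp(x - max(x)) / sum(exp(x - max(x)))
ce       <- function(x, target) -log(softmax(x)[target])
loss_i2t <- mean(sapply(1:2, function(i) ce(logits[i, ], i)))  # image -> text
loss_t2i <- mean(sapply(1:2, function(j) ce(logits[, j], j)))  # text -> image

# Symmetric InfoNCE: average of both directions; near zero here because
# the diagonal already dominates each row and column
loss <- (loss_i2t + loss_t2i) / 2
```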

Fine-tuning Impact by Labeling Approach

Model            Label Set             Hit@1    Hit@2
CLIP ViT-Base    Adjectives            25.79%   45.25%
                 Person Descriptions   22.85%   47.06%
                 Scene Descriptions    23.30%   43.67%
CLIP ViT-Large   Adjectives            41.40%   56.11%
                 Person Descriptions   42.53%   59.05%
                 Scene Descriptions    44.12%   61.99%

Models fine-tuned with contrastive learning and class-weighted loss; a small run with 2,489 training images (10 epochs).

Part II: Tour of transforEmotion

  • Core Functions: text, image, and video analysis
  • New Features: VAD mapping and RAG integration
  • Practical Demos: Real-world examples

Core Functions Summary

  • Emotion scores for text
  • Emotion scores for images
  • Emotion scores for video
  • Valence-arousal mapping
  • RAG for structured interpretation

All functions work on standard laptops without GPU.

Getting Started

# Install from CRAN
install.packages("transforEmotion")

# Load package (auto-configures Python)
library(transforEmotion)

# Optional: pre-warm environment
setup_modules()

Key Feature

Automatic Python environment management via uv - no manual setup required!

Text Sentiment Analysis

Default Model: DistilRoBERTa
Fast and efficient text emotion detection (~2GB RAM)

# Example: MLK's "I Have a Dream"
text <- "So even though we face the 
difficulties of today and tomorrow, 
I still have a dream. It is a dream 
deeply rooted in the American dream."

# Define custom emotions
emotions <- c("anger", "fear", "joy", 
              "sadness", "optimism", 
              "hope", "surprise")

# Run analysis
results <- transformer_scores(
  text = text,
  classes = emotions
)

Results:

Emotion    Score
Hope       0.22
Surprise   0.18
Optimism   0.15
Joy        0.14

Image Emotion Analysis

CLIP Models

  • oai-base: openai/clip-vit-base-patch32
    • ~2GB RAM, faster inference
  • oai-large: openai/clip-vit-large-patch14
    • ~4GB RAM, more accurate

Multilingual

  • jina-v2: jinaai/jina-clip-v2
    • ~6GB RAM
    • 89-language support

# Define emotion categories
emotions <- c("anger", "fear", "disgust",
              "sadness", "optimism",
              "hope", "surprise")

# Analyze the image
result <- image_scores(
  image = "trump1.jpg",
  classes = emotions,
  model = "oai-base"
)

Video Analysis: MAFW Dataset

Multi-modal Affective Facial Expression in the Wild (Liu et al. 2022)

Video Analysis

Emotion Time Series
video_results <- video_scores(
  video = "path/to/video.mp4",
  classes = emotions,
  nframes = 100,
  model = "oai-large"
)

Time Series Perspective: Why Accuracy Isn’t Everything

Single-Frame Accuracy ≠ Signal Quality

When analyzing video emotion time series:

  • Classification noise becomes measurement error
  • Temporal structure provides redundant information
  • Continuous probability scores > discrete labels

Psychometric Solutions

  • Reliability estimation: Quantify frame-to-frame consistency
  • State-space models: Filter noise, extract latent emotion states (Tomašević, Golino, and Christensen 2024)
  • Change-point detection: Identify genuine emotional transitions
  • Aggregation: Window-based smoothing reduces error
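Window-based smoothing, for example, can be sketched in base R. The series below is simulated (a step change plus noise) and stands in for one emotion column of video_scores() output:

```r
set.seed(42)

# Simulated frame-level scores: a latent step change at frame 51 plus
# classification noise, clamped to the valid [0, 1] range
true_state <- c(rep(0.2, 50), rep(0.7, 50))
observed   <- pmin(pmax(true_state + rnorm(100, sd = 0.15), 0), 1)

# Centered 9-frame moving average; the ends are left NA
smoothed <- stats::filter(observed, rep(1/9, 9), sides = 2)

# Smoothing shrinks frame-to-frame measurement error around the latent states
mean(abs(observed - true_state))
mean(abs(smoothed - true_state), na.rm = TRUE)
```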

NEW: Valence-Arousal-Dominance Mapping

Two Approaches

Direct prediction

vad <- vad_scores(
  input = text,
  input_type = "text",
  dimensions = c("valence",
                 "arousal",
                 "dominance")
)

Uses a transformer model for direct prediction based on valence, arousal, and dominance label sets.

Lexicon mapping

vad_mapped <- map_discrete_to_vad(
  fer_results,
  vad_lexicon = nrc_vad
)

Uses the NRC-VAD lexicon (Mohammad 2018) to map discrete emotion profiles into continuous valence, arousal, and dominance coordinates.

Example: Emo8 → VAD Mapping

# Emo8 emotion scores from image analysis
emo8_scores <- data.frame(
  joy = 0.65, trust = 0.15, fear = 0.05,
  surprise = 0.10, sadness = 0.02,
  disgust = 0.01, anger = 0.01, anticipation = 0.01
)

# Load NRC VAD lexicon
nrc_vad <- textdata::lexicon_nrc_vad()

# Map to VAD coordinates
vad_result <- map_discrete_to_vad(
  emo8_scores,
  vad_lexicon = nrc_vad
)
# Result: valence = 0.72, arousal = 0.58, dominance = 0.64

Maps discrete Emo8 emotion profile into continuous VAD coordinates for cross-modal comparison.
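Conceptually, the lexicon mapping amounts to a score-weighted average of each emotion's VAD coordinates. A simplified sketch of that idea with a toy two-emotion lexicon (the coordinate values, and this exact weighting scheme, are illustrative assumptions, not the package's internals):

```r
# Toy VAD coordinates for two emotions (illustrative, not the NRC-VAD values)
lexicon <- data.frame(
  term    = c("joy", "fear"),
  valence = c(0.95, 0.10),
  arousal = c(0.60, 0.80)
)

# Discrete emotion scores from a zero-shot classifier
scores <- c(joy = 0.8, fear = 0.2)

# Score-weighted average of the lexicon coordinates
w <- scores[lexicon$term] / sum(scores)
valence <- sum(w * lexicon$valence)  # 0.8*0.95 + 0.2*0.10 = 0.78
arousal <- sum(w * lexicon$arousal)  # 0.8*0.60 + 0.2*0.80 = 0.64
```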

NEW: Extensible Model Registry

# Register a new model
add_vision_model(
  name = "align-base",
  model_id = "kakaobrain/align-base",
  architecture = "align",
  description = "KakaoBrain ALIGN base"
)

# Now use it
image_scores(image = img, classes = labels, model = "align-base")

Extensibility

Add your own fine-tuned models or new architectures!

NEW: Retrieval-Augmented Generation

What is RAG?

  • Retrieves relevant passages
  • Generates contextual response
  • Runs locally with TinyLLAMA
  • No API calls needed

rag_result <- rag(
  text = document,
  query = "What emotion is expressed?",
  retriever = "vector",
  transformer = "TinyLLAMA"
)

RAG with Structured Outputs

# Example text
text <- "He slammed the door and walked away without a word."

# Structured emotion classification
result <- rag(
  text = text,
  task = "emotion",
  transformer = "Gemma3-4B",
  output = "table",
  labels_set = c("joy", "anger", "fear", "sadness")
)

# Returns:
# doc_id  label  confidence
# doc_1   anger  0.95

Practical Defaults

retriever = "vector", similarity_top_k = 10, temperature = 0 for reproducibility

Open Questions & Future Work

Methods

  • Multilingual labels: How do non-English prompts perform?
  • Time series reliability: Measurement models for FER scores

Integration

  • Triangulation: Combining voice, text, and video signals
  • Temporal alignment: Synchronizing multi-stream data

Practical Considerations

  • When LLMs aren’t the answer: Simpler models may suffice
  • Bias and fairness: Demographic and cultural validation

Thank You!

Resources

Acknowledgments

Co-authors: Hudson Golino (UVA), Alexander P. Christensen (Vanderbilt)

Questions?

References

Goodfellow, Ian J, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, et al. 2013. “Challenges in representation learning: A report on three machine learning contests.” In International Conference on Neural Information Processing, 117–24. Springer.
Liu, Yuanyuan, Wei Dai, Chuanxu Feng, Wenbin Wang, Guanghao Yin, Jiabei Zeng, and Shiguang Shan. 2022. “MAFW: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild.” In Proceedings of the 30th ACM International Conference on Multimedia. ACM. https://doi.org/10.1145/3503161.3548190.
Major, Sara, and Aleksandar Tomašević. 2025. “The Face of Populism: Examining Differences in Facial Emotional Expressions of Political Leaders Using Machine Learning.” Journal of Computational Social Science 8 (3). https://doi.org/10.1007/s42001-025-00392-w.
Mertens, Laurent, Elahe Yargholi, Hans Op de Beeck, Jan Van den Stock, and Joost Vennekens. 2024. “FindingEmo: An image dataset for Emotion Recognition in the wild.” arXiv [Cs.CV]. https://doi.org/10.48550/arxiv.2402.01355.
Mohammad, Saif. 2018. “Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Iryna Gurevych and Yusuke Miyao, 174–84. Association for Computational Linguistics. https://doi.org/10.18653/v1/p18-1017.
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. “Learning Transferable Visual Models From Natural Language Supervision.” arXiv [Cs.CV]. http://arxiv.org/abs/2103.00020.
Tomašević, Aleksandar, Hudson Golino, and Alexander P Christensen. 2024. “Decoding emotion dynamics in videos using dynamic Exploratory Graph Analysis and zero-shot image classification: A simulation and tutorial using the transforEmotion R package.” PsyArXiv. https://doi.org/10.31234/osf.io/hf3g7.