transforEmotion in Practice: Multimodal Emotion Analysis in R

Vanderbilt QMDA Colloquium

Aleksandar Tomašević

Scientific Computing Laboratory, Institute of Physics Belgrade, University of Belgrade

January 5, 2026

About Me

  • PhD in Sociology
  • Institute of Physics, University of Belgrade
  • Fulbright Visiting Scholar at University of Virginia
  • Research interests:
    • Emotional aspects of political communication
    • Computational methods for social sciences
    • Network dynamics of emotions

Face of Populism: Emotional Expressions of Populist Leaders

Research Project

  • Large-scale study of non-verbal political communication (Major and Tomašević 2025)
  • Analyzed emotional expressions of political leaders
  • 220 YouTube videos, global coverage
  • Based on CNNs (6 basic emotions + neutral)
  • Populist leaders show higher negative emotion scores and lower neutrality

Current Challenges in Computational Emotion Analysis

Three Bottlenecks

  1. Modal Fragmentation - Separate tools for text, image, and video analysis
  2. Taxonomic Rigidity - Locked into Ekman’s 6 basic emotions
  3. Reproducibility Barriers - Commercial API dependence

These bottlenecks prevent communication researchers from pursuing theory-driven emotion research at scale.

transforEmotion: Removing the Bottlenecks

# CRAN (stable)
install.packages("transforEmotion")

# GitHub (dev)
remotes::install_github("atomashevic/transforEmotion")

  • Unified multimodal processing (text, image, video)
  • Zero-shot classification with arbitrary emotion labels
  • Local inference - no APIs required
  • Standard laptop compatible (no GPU needed)

Talk Outline

Part I: Visual Emotion Recognition

  • Methods
    • CNNs vs transformers for emotion detection
  • CLIP
    • Zero-shot classification basics
  • Strategies
    • Labeling and fine-tuning

Part II: Tour of transforEmotion

  • Core Functions
    • Text, image, and video analysis
    • VAD mapping and RAG
  • Future Directions
    • Open questions and extensions

Part I: Visual Emotion Recognition

  • Traditional FER approaches
  • Transformer models with zero-shot classification

Traditional Facial Expression Recognition

FER Approaches

FACS Methodology

Traditional Approach

Expert annotation based on FACS methodology - requires certified coders and extensive manual work (Goodfellow et al. 2013)

CNN-Based Facial Expression Recognition

CNNs learn to map facial images to emotion labels through supervised training on FACS-annotated data.

Goal: Encode FACS expertise into learned features and network weights

Output Interpretation

  • Probability distribution across emotions
  • Dual meaning: Model uncertainty OR expression blend
  • Key limitation: Fixed label sets from training data
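To make the dual meaning concrete, here is a toy probability vector of the kind a CNN-based FER model emits over the 6 basic emotions plus neutral (values invented for illustration, not output from a real model):

```r
# Hypothetical softmax output from a CNN-based FER model
probs <- c(angry = 0.05, disgust = 0.02, fear = 0.08, happy = 0.55,
           sad = 0.04, surprise = 0.21, neutral = 0.05)

# The scores form a probability distribution: they sum to 1
sum(probs)

# Discrete label: take the most probable class
names(which.max(probs))  # "happy"

# The full distribution carries extra information: the non-trivial mass on
# "surprise" could reflect model uncertainty OR a happy-surprised blend
sort(probs, decreasing = TRUE)[1:2]
```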

Problems with CNN Approaches

  • Dataset bias - Limited diversity, lab conditions
  • Overfitting to benchmarks (FER2013, AffectNet)
  • Poor generalization to “in-the-wild” scenarios

The Blame Game

When CNN models fail, where does responsibility lie?

CNN architecture → Training data quality → FACS methodology → Human annotation reliability

The Major Bottleneck: Fixed Label Sets

The Problem

  • CNNs locked into classic categories
  • What if we want: joy, contempt, anticipation?
  • Adding new categories requires:
    • Extensive data annotation
    • Model retraining
    • Very expensive!

Alternative Approach: Zero-shot Classification

The Paradigm Shift

  • Treat FER as an image classification problem
  • Vision Transformers enable zero-shot classification
  • Specify any labels we want without re-training
  • OpenAI’s CLIP model exemplifies this approach (Radford et al. 2021)

Key Insight

No task-specific training required - just provide your labels!

OpenAI CLIP: Model Size in Context

  • ViT-Base: ~150M parameters
  • ViT-Large: ~430M parameters
  • Open weights and orders of magnitude smaller than frontier LLMs
  • Still strong zero-shot transfer for vision-language tasks

Zero-shot Classification Demo

Categories

Category                                       p
Food that I should eat to be healthy           ?
Food that I eat at 2AM because I have issues   ?

What will CLIP predict?

Zero-shot Classification with CLIP

Results

Category                                       p
Food that I should eat to be healthy           0.08
Food that I eat at 2AM because I have issues   0.92

Tip

No training required for these specific labels!

CLIP’s Joint Embedding Space

Images and text describing similar concepts are mapped to nearby locations in shared embedding space
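The geometry can be sketched with toy vectors: zero-shot classification reduces to cosine similarity between the image embedding and each candidate label's text embedding, converted to probabilities with a temperature-scaled softmax. The 3-D vectors below are invented for illustration (real CLIP embeddings have 512+ dimensions):

```r
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy 3-D embeddings standing in for CLIP's shared space
img_embedding <- c(0.9, 0.1, 0.2)               # an image of a smiling face
text_embeddings <- list(
  "a happy person" = c(0.8, 0.2, 0.1),          # close to the image
  "a sad person"   = c(-0.5, 0.7, 0.3)          # far from the image
)

# Similarity to each label, then softmax with a CLIP-style temperature
sims  <- sapply(text_embeddings, cosine, a = img_embedding)
probs <- exp(sims / 0.07) / sum(exp(sims / 0.07))

round(probs, 3)  # nearly all mass on "a happy person"
```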

How CLIP Learns: Contrastive Training

Training Process

  • Large batches (32,768 pairs)
  • Positive pairs: matched image-text
  • Negative pairs: mismatched from batch

Objective

  • Maximize similarity for correct pairs
  • Minimize similarity for incorrect pairs

Trained on 400 million image-text pairs

How Did We Label Those Image-Text Pairs?

The Training Process

CLIP processes large batches (e.g., 32,768 pairs) simultaneously:

Positive pair (matched):

  • Image: Food wheel
  • Text: “One sentence sums up how to eat healthy”

Negative pairs (contrastive):

  • Same image + unrelated texts from batch: “Tesla announces new electric vehicle”, “Cat rescued from tree”

CLIP for Political Emotion Analysis

Trump Portraits - Zero-shot Emotion Classification

Same approach works for faces - dramatically different emotional profiles from official portraits

Labeling Strategies for Zero-Shot FER

# Basic emotion labels
labels_basic <- c("angry", "happy", "sad", "fearful")

# Adjective forms (often better)
labels_adj <- c("an angry person", "a happy person",
                "a sad person", "a fearful person")

# Context-specific
labels_context <- c("a politician showing anger",
                    "a politician expressing joy")

Key Insight

Label phrasing significantly affects classification accuracy. Experiment with your domain!

Introducing the FindingEmo Dataset

Large-scale emotion recognition “in the wild”

  • 25,869 public images with annotations
  • Multi-person social scenes annotated at scene level
  • Plutchik’s Wheel (Emo8): joy, trust, fear, surprise, sadness, disgust, anger, anticipation
  • Annotated by 655 Prolific participants (Mertens et al. 2024)

Key Feature

Focuses on complex scenes, not just individual faces

Examples

Trust

Fear

Anticipation

Labeling Impact: Sports Celebration Example

CLIP Base Model Performance

Approach              Joy Score
Adjectives            0.5304
Person Descriptions   0.6153
Scene Descriptions    0.8245

Label Sets Used

Adjectives: joyful, angry, fearful, sad, surprised, disgusted, trusting, anticipating

Person Descriptions: a joyful person, an angry person, a fearful person, a sad person, a surprised person, a disgusted person, a trusting person, an anticipating person

Scene Descriptions: a joyful scene, an angry scene, a fearful scene, a sad scene, a surprised scene, a disgusted scene, a trusting scene, an anticipating scene

FindingEmo: Zero-shot Results by Labeling Approach

Model            Label Set             Hit@1    Hit@2
CLIP ViT-Base    Adjectives            33.48%   50.68%
                 Person Descriptions   31.45%   47.96%
                 Scene Descriptions    31.22%   47.29%
CLIP ViT-Large   Adjectives            21.27%   40.50%
                 Person Descriptions   27.83%   48.19%
                 Scene Descriptions    33.03%   46.38%

Hit@1: Highest score is correct.

Hit@2: Correct label appears in top 2 highest scores.

8-class setup on 442 test images.
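Both metrics can be computed directly from a matrix of zero-shot scores. A minimal sketch with toy data (the scores and labels below are invented, not taken from the evaluation):

```r
# Toy zero-shot output: rows = images, columns = candidate classes
scores <- rbind(
  c(joy = 0.40, trust = 0.30, fear = 0.05, surprise = 0.05,
    sadness = 0.05, disgust = 0.05, anger = 0.05, anticipation = 0.05),
  c(joy = 0.10, trust = 0.35, fear = 0.30, surprise = 0.05,
    sadness = 0.05, disgust = 0.05, anger = 0.05, anticipation = 0.05)
)
truth <- c("joy", "fear")  # annotated scene-level labels

# Fraction of images whose true label appears among the k highest scores
hit_at_k <- function(scores, truth, k) {
  hits <- vapply(seq_len(nrow(scores)), function(i) {
    top_k <- names(sort(scores[i, ], decreasing = TRUE))[1:k]
    truth[i] %in% top_k
  }, logical(1))
  mean(hits)
}

hit_at_k(scores, truth, 1)  # 0.5: "joy" is top-1 for image 1; "fear" is not for image 2
hit_at_k(scores, truth, 2)  # 1.0: "fear" appears in image 2's top 2
```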

Contrastive Fine-tuning: How It Works

For each batch of N image-text pairs:

  • Positive pairs: Matching image + emotion prompt → High similarity
  • Negative pairs: Mismatched combinations → Low similarity

Example Batch (N=4)

True Label      Correct Prompt                        Wrong Prompts (Negatives)
Joy image       “person laughing out loud”            “person shouting angrily”, “person crying”, “person frozen in terror”
Sadness image   “person crying with tears”            “person laughing”, “person shouting”, “person celebrating”
Fear image      “person with wide frightened eyes”    “person laughing”, “person shouting”, “person crying”

The model learns to maximize similarity for correct pairs and minimize it for incorrect ones.

Contrastive Fine-tuning: Technical Details

Key Components

  1. Symmetric InfoNCE Loss
    • \(\mathcal{L} = (\mathcal{L}_{\text{image} \to \text{text}} + \mathcal{L}_{\text{text} \to \text{image}}) / 2\)
    • Cross-entropy in both directions
    • Encourages bidirectional alignment
  2. Learnable Temperature
    • Controls similarity distribution sharpness
    • Initial: 0.07 (CLIP default)
    • Bounded: [0.01, 0.5] for stability
  3. Class-Weighted Loss
    • Handles imbalanced classes (Joy: 690 vs Surprise: 54)
    • Upsamples minority classes during training
  4. Contrastive Prompts
    • Designed to distinguish confusing pairs
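The symmetric loss is easy to sketch on a toy 2×2 similarity matrix (values invented; real training uses batch sizes in the thousands):

```r
# Toy cosine similarities: rows = images, columns = texts;
# the diagonal holds the matched (positive) pairs
sim <- matrix(c(0.9, 0.1,
                0.2, 0.8), nrow = 2, byrow = TRUE)
temperature <- 0.07
logits <- sim / temperature

# Cross-entropy of each row/column against its own index (the matched pair)
softmax  <- function(x) exp(x - max(x)) / sum(exp(x - max(x)))
ce       <- function(x, target) -log(softmax(x)[target])
loss_i2t <- mean(sapply(1:2, function(i) ce(logits[i, ], i)))  # image -> text
loss_t2i <- mean(sapply(1:2, function(j) ce(logits[, j], j)))  # text -> image

# Symmetric InfoNCE: average of both directions; near zero here because
# the diagonal already dominates each row and column
loss <- (loss_i2t + loss_t2i) / 2
```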

Fine-tuning Impact by Labeling Approach

Model            Label Set             Hit@1    Hit@2
CLIP ViT-Base    Adjectives            25.79%   45.25%
                 Person Descriptions   22.85%   47.06%
                 Scene Descriptions    23.30%   43.67%
CLIP ViT-Large   Adjectives            41.40%   56.11%
                 Person Descriptions   42.53%   59.05%
                 Scene Descriptions    44.12%   61.99%

Models fine-tuned with contrastive learning and class-weighted loss; a small run with 2,489 training images (10 epochs).

Part II: Tour of transforEmotion

  • Core Functions: text, image, and video analysis
  • New Features: VAD mapping and RAG integration
  • Practical Demos: Real-world examples

Core Functions Summary

  • Emotion scores for text
  • Emotion scores for images
  • Emotion scores for video
  • Valence-arousal mapping
  • RAG for structured interpretation

All functions work on standard laptops without GPU.

Getting Started

# Install from CRAN
install.packages("transforEmotion")

# Load package (auto-configures Python)
library(transforEmotion)

# Optional: pre-warm environment
setup_modules()

Key Feature

Automatic Python environment management via uv - no manual setup required!

Text Sentiment Analysis

Default Model: DistilRoBERTa
Fast and efficient text emotion detection (~2GB RAM)

# Example: MLK's "I Have a Dream"
text <- "So even though we face the 
difficulties of today and tomorrow, 
I still have a dream. It is a dream 
deeply rooted in the American dream."

# Define custom emotions
emotions <- c("anger", "fear", "joy", 
              "sadness", "optimism", 
              "hope", "surprise")

# Run analysis
results <- transformer_scores(
  text = text,
  classes = emotions
)

Results:

Emotion    Score
Hope       0.22
Surprise   0.18
Optimism   0.15
Joy        0.14

Image Emotion Analysis

CLIP Models

  • oai-base: openai/clip-vit-base-patch32
    • ~2GB RAM, faster inference
  • oai-large: openai/clip-vit-large-patch14
    • ~4GB RAM, more accurate

Multilingual

  • jina-v2: jinaai/jina-clip-v2
    • ~6GB RAM
    • 89-language support

# Define emotion categories
emotions <- c("anger", "fear", "disgust",
              "sadness", "optimism",
              "hope", "surprise")

# Analyze the image
result <- image_scores(
  image = "trump1.jpg",
  classes = emotions,
  model = "oai-base"
)

Video Analysis: MAFW Dataset

Multi-modal Affective Facial Expression in the Wild (Liu et al. 2022)

Video Analysis

Emotion Time Series
video_results <- video_scores(
  video = "path/to/video.mp4",
  classes = emotions,
  nframes = 100,
  model = "oai-large"
)

Time Series Perspective: Why Accuracy Isn’t Everything

Single-Frame Accuracy ≠ Signal Quality

When analyzing video emotion time series:

  • Classification noise becomes measurement error
  • Temporal structure provides redundant information
  • Continuous probability scores > discrete labels

Psychometric Solutions

  • Reliability estimation: Quantify frame-to-frame consistency
  • State-space models: Filter noise, extract latent emotion states (Tomašević, Golino, and Christensen 2024)
  • Change-point detection: Identify genuine emotional transitions
  • Aggregation: Window-based smoothing reduces error
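Window-based smoothing, for example, can be sketched in base R. The series below is simulated (a step change plus noise) and stands in for one emotion column of video_scores() output:

```r
set.seed(42)

# Simulated frame-level scores: a latent step change at frame 51 plus
# classification noise, clamped to the valid [0, 1] range
true_state <- c(rep(0.2, 50), rep(0.7, 50))
observed   <- pmin(pmax(true_state + rnorm(100, sd = 0.15), 0), 1)

# Centered 9-frame moving average; the ends are left NA
smoothed <- stats::filter(observed, rep(1/9, 9), sides = 2)

# Smoothing shrinks frame-to-frame measurement error around the latent states
mean(abs(observed - true_state))
mean(abs(smoothed - true_state), na.rm = TRUE)
```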

NEW: Valence-Arousal-Dominance Mapping

Two Approaches

Direct prediction

vad <- vad_scores(
  input = text,
  input_type = "text",
  dimensions = c("valence",
                 "arousal",
                 "dominance")
)

Uses a transformer model for direct prediction based on valence, arousal, and dominance label sets.

Lexicon mapping

vad_mapped <- map_discrete_to_vad(
  fer_results,
  vad_lexicon = nrc_vad
)

Uses the NRC-VAD lexicon (Mohammad 2018) to map discrete emotion profiles into continuous valence, arousal, and dominance coordinates.

Example: Emo8 → VAD Mapping

# Emo8 emotion scores from image analysis
emo8_scores <- data.frame(
  joy = 0.65, trust = 0.15, fear = 0.05,
  surprise = 0.10, sadness = 0.02,
  disgust = 0.01, anger = 0.01, anticipation = 0.01
)

# Load NRC VAD lexicon
nrc_vad <- textdata::lexicon_nrc_vad()

# Map to VAD coordinates
vad_result <- map_discrete_to_vad(
  emo8_scores,
  vad_lexicon = nrc_vad
)
# Result: valence = 0.72, arousal = 0.58, dominance = 0.64

Maps discrete Emo8 emotion profile into continuous VAD coordinates for cross-modal comparison.
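Conceptually, the lexicon mapping amounts to a score-weighted average of each emotion's VAD coordinates. A simplified sketch of that idea with a toy two-emotion lexicon (the coordinate values, and this exact weighting scheme, are illustrative assumptions, not the package's internals):

```r
# Toy VAD coordinates for two emotions (illustrative, not the NRC-VAD values)
lexicon <- data.frame(
  term    = c("joy", "fear"),
  valence = c(0.95, 0.10),
  arousal = c(0.60, 0.80)
)

# Discrete emotion scores from a zero-shot classifier
scores <- c(joy = 0.8, fear = 0.2)

# Score-weighted average of the lexicon coordinates
w <- scores[lexicon$term] / sum(scores)
valence <- sum(w * lexicon$valence)  # 0.8*0.95 + 0.2*0.10 = 0.78
arousal <- sum(w * lexicon$arousal)  # 0.8*0.60 + 0.2*0.80 = 0.64
```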

NEW: Extensible Model Registry

# Register a new model
add_vision_model(
  name = "align-base",
  model_id = "kakaobrain/align-base",
  architecture = "align",
  description = "KakaoBrain ALIGN base"
)

# Now use it
image_scores(image = img, classes = labels, model = "align-base")

Extensibility

Add your own fine-tuned models or new architectures!

NEW: Retrieval-Augmented Generation

What is RAG?

  • Retrieves relevant passages
  • Generates contextual response
  • Runs locally with TinyLLAMA
  • No API calls needed

rag_result <- rag(
  text = document,
  query = "What emotion is expressed?",
  retriever = "vector",
  transformer = "TinyLLAMA"
)

RAG with Structured Outputs

# Example text
text <- "He slammed the door and walked away without a word."

# Structured emotion classification
result <- rag(
  text = text,
  task = "emotion",
  transformer = "Gemma3-4B",
  output = "table",
  labels_set = c("joy", "anger", "fear", "sadness")
)

# Returns:
# doc_id  label  confidence
# doc_1   anger  0.95

Practical Defaults

retriever = "vector", similarity_top_k = 10, temperature = 0 for reproducibility

Open Questions & Future Work

Methods

  • Multilingual labels: How do non-English prompts perform?
  • Time series reliability: Measurement models for FER scores

Integration

  • Triangulation: Combining voice, text, and video signals
  • Temporal alignment: Synchronizing multi-stream data

Practical Considerations

  • When LLMs aren’t the answer: Simpler models may suffice
  • Bias and fairness: Demographic and cultural validation

Thank You!

Resources

Acknowledgments

Co-authors: Hudson Golino (UVA), Alexander P. Christensen (Vanderbilt)

Questions?

References

Goodfellow, Ian J, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, et al. 2013. “Challenges in representation learning: A report on three machine learning contests.” In International Conference on Neural Information Processing, 117–24. Springer.
Liu, Yuanyuan, Wei Dai, Chuanxu Feng, Wenbin Wang, Guanghao Yin, Jiabei Zeng, and Shiguang Shan. 2022. “MAFW: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild.” In Proceedings of the 30th ACM International Conference on Multimedia. ACM. https://doi.org/10.1145/3503161.3548190.
Major, Sara, and Aleksandar Tomašević. 2025. “The Face of Populism: Examining Differences in Facial Emotional Expressions of Political Leaders Using Machine Learning.” Journal of Computational Social Science 8 (3). https://doi.org/10.1007/s42001-025-00392-w.
Mertens, Laurent, Elahe Yargholi, Hans Op de Beeck, Jan Van den Stock, and Joost Vennekens. 2024. “FindingEmo: An image dataset for Emotion Recognition in the wild.” arXiv [Cs.CV]. https://doi.org/10.48550/arxiv.2402.01355.
Mohammad, Saif. 2018. “Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Iryna Gurevych and Yusuke Miyao, 174–84. Association for Computational Linguistics. https://doi.org/10.18653/v1/p18-1017.
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. “Learning Transferable Visual Models From Natural Language Supervision.” arXiv [Cs.CV]. http://arxiv.org/abs/2103.00020.
Tomašević, Aleksandar, Hudson Golino, and Alexander P Christensen. 2024. “Decoding emotion dynamics in videos using dynamic Exploratory Graph Analysis and zero-shot image classification: A simulation and tutorial using the transforEmotion R package.” PsyArXiv. https://doi.org/10.31234/osf.io/hf3g7.