transforEmotion: Multimodal Emotion Analysis in R

IAAP Webinar: AI & Emotion in Digital Life

Aleksandar Tomašević

Scientific Computing Laboratory, Institute of Physics Belgrade, University of Belgrade

January 22, 2026

About Me

  • PhD in Sociology
  • Institute of Physics, University of Belgrade
  • Fulbright Visiting Scholar at University of Virginia
  • Research focus: Emotional aspects of political communication

Why Not Just Use LLMs?

LLMs Excel At

  • Complex reasoning & context
  • Few items with nuanced interpretation
  • Generating explanations
  • Unstructured tasks

But For Emotion Analysis…

  • Scale: Thousands/millions of items
  • Cost: API calls add up quickly
  • Reproducibility: Same input = same output
  • Privacy: Data stays local
  • Speed: Batch processing efficiency

Key Insight

Specialized models outperform LLMs for well-defined classification tasks at scale

transforEmotion: Unified Emotion Analysis

# Install from CRAN
install.packages("transforEmotion")

Key Features

  • Unified multimodal interface: text, image, and video
  • Zero-shot classification - any emotion labels
  • Local inference - no APIs
  • No GPU needed - runs on laptops

Open in Google Colab

Text Sentiment Analysis

Model: BART Large MNLI (~2GB RAM)

# MLK's "I Have a Dream"
text <- "So even though we face the
difficulties of today and tomorrow,
I still have a dream. It is a dream
deeply rooted in the American dream."

# Custom emotion labels
emotions <- c("anger", "fear", "joy",
              "sadness", "optimism",
              "hope", "surprise")

# Analyze with BART
results <- transformer_scores(
  text = text,
  classes = emotions,
  model = "bart-large-mnli"
)

Zero-shot

No training needed - define any labels you want!

Try it in Google Colab
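A minimal sketch of inspecting the output, assuming `transformer_scores()` returns a list with one named score vector per input text (check `?transformer_scores` for the exact structure):

# Sort the scores for the first (and only) input text to see the
# dominant emotions first (assumes a list of named numeric vectors)
sort(results[[1]], decreasing = TRUE)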

CLIP: Connecting Vision and Language

What is CLIP?

  • Contrastive Language-Image Pre-training
  • Developed by OpenAI (2021)
  • Learns to match images with text descriptions
  • Enables zero-shot image classification

Why CLIP for Emotions?

  • No emotion-specific training needed
  • Works with any emotion labels you define
  • Captures visual context, not just faces

Let’s see CLIP in action with political portraits…

CLIP for Political Emotion Analysis

Trump Portraits - Zero-shot Emotion Classification

The same zero-shot approach works on faces: the official portraits yield dramatically different emotional profiles

CLIP’s Joint Embedding Space

Images and text describing similar concepts are mapped to nearby locations in a shared embedding space

How CLIP Learns: Contrastive Training

Learning from the Web

  • Scrape image + surrounding text
  • Example: Article headline → food image
  • 400 million image-text pairs (Radford et al. 2021)

Training Objective

  • Match: “A balanced diet” ↔︎ food image
  • Contrast: “A balanced diet” ↔︎ cat image

This massive scale enables generalization to new concepts - including emotions
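The training objective above can be sketched in a few lines of base R. This is a toy illustration of the symmetric contrastive (InfoNCE) loss with random embeddings, not CLIP's actual implementation, which uses learned encoders and a learnable temperature:

# Toy sketch of CLIP's symmetric contrastive loss (base R).
# Row i of img_emb and row i of txt_emb form a matched image-text pair.
set.seed(1)
n <- 4; d <- 8
img_emb <- matrix(rnorm(n * d), n, d)
txt_emb <- matrix(rnorm(n * d), n, d)

l2_normalize <- function(m) m / sqrt(rowSums(m^2))

# Cosine similarity of every image with every text, scaled by temperature
logits <- l2_normalize(img_emb) %*% t(l2_normalize(txt_emb)) / 0.07

# Cross-entropy where the correct "class" for row i is column i
cross_entropy <- function(logits) {
  log_probs <- logits - log(rowSums(exp(logits)))
  -mean(diag(log_probs))
}

# Symmetric loss: match images to texts and texts to images
loss <- (cross_entropy(logits) + cross_entropy(t(logits))) / 2

Minimizing this loss pulls matched pairs together in the joint embedding space and pushes mismatched pairs apart.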

Image Emotion Analysis with transforEmotion

Available Models

  • oai-base: ~2GB RAM, faster
  • oai-large: ~4GB RAM, more accurate
  • jina-v2: ~6GB RAM, 89 languages

# Define emotions
emotions <- c("anger", "fear",
              "disgust", "sadness",
              "joy", "surprise")

# Analyze image
result <- image_scores(
  image = "portrait.jpg",
  classes = emotions,
  model = "oai-base"
)

Key Insight

Label phrasing significantly affects accuracy. Try “a happy person” vs “happy” vs “a joyful scene”!
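One way to test this on your own images (hypothetical file name; requires the package's Python backend to be set up):

# Compare three phrasings of the same three emotions on one image
phrasings <- list(
  adjective = c("happy", "sad", "angry"),
  person    = c("a happy person", "a sad person", "an angry person"),
  scene     = c("a happy scene", "a sad scene", "an angry scene")
)

scores <- lapply(phrasings, function(labels)
  image_scores(image = "portrait.jpg", classes = labels, model = "oai-base")
)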

Video Emotion Analysis

# Analyze video emotions
video_results <- video_scores(
  video = "speech.mp4",
  classes = c("anger", "fear",
              "joy", "sadness",
              "surprise", "neutral"),
  nframes = 100
)

# Returns emotion scores
# for each sampled frame

Time Series Output

Track emotional dynamics across video frames - great for speeches, debates, or social media content
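Assuming the returned object has one row of scores per sampled frame and one column per emotion (check `?video_scores` for the exact format), the trajectories can be plotted with base R:

# Plot each emotion's score across the sampled frames
emotion_cols <- c("anger", "fear", "joy", "sadness", "surprise", "neutral")
matplot(video_results[, emotion_cols], type = "l", lty = 1,
        xlab = "Frame", ylab = "Score")
legend("topright", legend = emotion_cols, lty = 1,
       col = seq_along(emotion_cols))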

FindingEmo: Benchmarking in the Wild

Large-scale emotion dataset

  • 25,869 public images with annotations
  • Multi-person social scenes
  • Emo8 labels: joy, trust, fear, surprise, sadness, disgust, anger, anticipation
  • Annotated by 655 participants (Mertens et al. 2024)

Labeling Impact: Sports Celebration

CLIP Base Model - Joy Scores

Labeling Approach            Joy Score
Adjectives (“joyful”)          0.53
Person (“a joyful person”)     0.62
Scene (“a joyful scene”)       0.82

Label Engineering Matters

Scene descriptions capture context better for social images

FindingEmo Benchmark Results

Model                  Label Set            Hit@1   Hit@2
CLIP ViT-Base          Adjectives           33.5%   50.7%
CLIP ViT-Large         Scene Descriptions   33.0%   46.4%
Fine-tuned ViT-Large   Scene Descriptions   44.1%   62.0%

Hit@1: the highest-scoring label is the correct one. Hit@2: the correct label is among the top 2.

8-class setup on 442 test images. Fine-tuning with just 2,489 images improves Hit@1 by 11 percentage points.
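The Hit@k metric itself is easy to compute from a named score vector; a minimal base-R helper (not part of transforEmotion):

# Hit@k: is the true label among the k highest-scoring labels?
hit_at_k <- function(scores, true_label, k) {
  true_label %in% names(sort(scores, decreasing = TRUE))[seq_len(k)]
}

s <- c(joy = 0.5, trust = 0.3, fear = 0.2)
hit_at_k(s, "trust", k = 1)  # FALSE: "joy" has the top score
hit_at_k(s, "trust", k = 2)  # TRUE: "trust" is in the top 2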

Fine-tuned Models Available

Contrastive Fine-tuning

  • Class-weighted loss for imbalanced data
  • Custom prompts for emotion disambiguation
  • Models available via registry

# Use fine-tuned model
result <- image_scores(
  image = "scene.jpg",
  classes = emo8_labels,
  model = "findingemo-large"
)

Improvement

Metric   Before   After
Hit@1    33%      44%
Hit@2    46%      62%

A small fine-tuning set (2,489 images) yields substantial gains

When to Use transforEmotion

Best For

  • Large-scale analysis (thousands of items)
  • Custom emotion taxonomies
  • Local/offline processing needs
  • Reproducible research pipelines
  • Time series emotion tracking

Consider LLMs When

  • Complex contextual understanding needed
  • Few items to analyze
  • Nuanced interpretation required
  • Structured reasoning matters

Practical Advice

Start with transforEmotion for scale, validate edge cases with LLMs if needed.

Thank You!

Resources

Acknowledgments

Co-authors: Hudson Golino (UVA), Alexander P. Christensen (Vanderbilt)

Questions?

References

Mertens, Laurent, Elahe Yargholi, Hans Op de Beeck, Jan Van den Stock, and Joost Vennekens. 2024. “FindingEmo: An Image Dataset for Emotion Recognition in the Wild.” arXiv [cs.CV]. https://doi.org/10.48550/arxiv.2402.01355.
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. “Learning Transferable Visual Models From Natural Language Supervision.” arXiv [cs.CV]. http://arxiv.org/abs/2103.00020.