transforEmotion: Multimodal Emotion Analysis in R

IAAP Webinar: AI & Emotion in Digital Life

Aleksandar Tomašević

Scientific Computing Laboratory, Institute of Physics Belgrade, University of Belgrade

January 22, 2026

About Me

  • PhD in Sociology
  • Institute of Physics, University of Belgrade
  • Fulbright Visiting Scholar at University of Virginia
  • Research focus: Emotional aspects of political communication

Why Not Just Use LLMs?

LLMs Excel At

  • Complex reasoning & context
  • Few items with nuanced interpretation
  • Generating explanations
  • Unstructured tasks

But For Emotion Analysis…

  • Scale: Thousands/millions of items
  • Cost: API calls add up quickly
  • Reproducibility: Same input = same output
  • Privacy: Data stays local
  • Speed: Batch processing efficiency

Key Insight

Specialized models outperform LLMs for well-defined classification tasks at scale

transforEmotion: Unified Emotion Analysis

# Install from CRAN
install.packages("transforEmotion")

Key Features

  • Unified multimodal interface: text, image, and video
  • Zero-shot classification - any emotion labels
  • Local inference - no APIs
  • No GPU needed - runs on laptops

Open in Google Colab

Text Sentiment Analysis

Model: BART Large MNLI (~2GB RAM)

# MLK's "I Have a Dream"
text <- "So even though we face the
difficulties of today and tomorrow,
I still have a dream. It is a dream
deeply rooted in the American dream."

# Custom emotion labels
emotions <- c("anger", "fear", "joy",
              "sadness", "optimism",
              "hope", "surprise")

# Analyze with BART
results <- transformer_scores(
  text = text,
  classes = emotions,
  model = "bart-large-mnli"
)

Zero-shot

No training needed - define any labels you want!

Try it in Google Colab
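A minimal sketch of inspecting the output, assuming `transformer_scores()` returns a list with one named score vector per input text (check `?transformer_scores` for the exact structure):

# Sort the scores for the first (and only) input text to see the
# dominant emotions first (assumes a list of named numeric vectors)
sort(results[[1]], decreasing = TRUE)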

CLIP: Connecting Vision and Language

What is CLIP?

  • Contrastive Language-Image Pre-training
  • Developed by OpenAI (2021)
  • Learns to match images with text descriptions
  • Enables zero-shot image classification

Why CLIP for Emotions?

  • No emotion-specific training needed
  • Works with any emotion labels you define
  • Captures visual context, not just faces

Let’s see CLIP in action with political portraits…

CLIP for Political Emotion Analysis

Trump Portraits - Zero-shot Emotion Classification

The same zero-shot approach works on faces: the official portraits yield dramatically different emotional profiles

CLIP’s Joint Embedding Space

Images and text describing similar concepts are mapped to nearby locations in a shared embedding space

How CLIP Learns: Contrastive Training

Learning from the Web

  • Scrape image + surrounding text
  • Example: Article headline → food image
  • 400 million image-text pairs (Radford et al. 2021)

Training Objective

  • Match: “A balanced diet” ↔︎ food image
  • Contrast: “A balanced diet” ↔︎ cat image

This massive scale enables generalization to new concepts - including emotions
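The training objective above can be sketched in a few lines of base R. This is a toy illustration of the symmetric contrastive (InfoNCE) loss with random embeddings, not CLIP's actual implementation, which uses learned encoders and a learnable temperature:

# Toy sketch of CLIP's symmetric contrastive loss (base R).
# Row i of img_emb and row i of txt_emb form a matched image-text pair.
set.seed(1)
n <- 4; d <- 8
img_emb <- matrix(rnorm(n * d), n, d)
txt_emb <- matrix(rnorm(n * d), n, d)

l2_normalize <- function(m) m / sqrt(rowSums(m^2))

# Cosine similarity of every image with every text, scaled by temperature
logits <- l2_normalize(img_emb) %*% t(l2_normalize(txt_emb)) / 0.07

# Cross-entropy where the correct "class" for row i is column i
cross_entropy <- function(logits) {
  log_probs <- logits - log(rowSums(exp(logits)))
  -mean(diag(log_probs))
}

# Symmetric loss: match images to texts and texts to images
loss <- (cross_entropy(logits) + cross_entropy(t(logits))) / 2

Minimizing this loss pulls matched pairs together in the joint embedding space and pushes mismatched pairs apart.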

Image Emotion Analysis with transforEmotion

Available Models

  • oai-base: ~2GB RAM, faster
  • oai-large: ~4GB RAM, more accurate
  • jina-v2: ~6GB RAM, 89 languages

# Define emotions
emotions <- c("anger", "fear",
              "disgust", "sadness",
              "joy", "surprise")

# Analyze image
result <- image_scores(
  image = "portrait.jpg",
  classes = emotions,
  model = "oai-base"
)

Key Insight

Label phrasing significantly affects accuracy. Try “a happy person” vs “happy” vs “a joyful scene”!
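One way to test this on your own images (hypothetical file name; requires the package's Python backend to be set up):

# Compare three phrasings of the same three emotions on one image
phrasings <- list(
  adjective = c("happy", "sad", "angry"),
  person    = c("a happy person", "a sad person", "an angry person"),
  scene     = c("a happy scene", "a sad scene", "an angry scene")
)

scores <- lapply(phrasings, function(labels)
  image_scores(image = "portrait.jpg", classes = labels, model = "oai-base")
)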

Video Emotion Analysis

# Analyze video emotions
video_results <- video_scores(
  video = "speech.mp4",
  classes = c("anger", "fear",
              "joy", "sadness",
              "surprise", "neutral"),
  nframes = 100
)

# Returns emotion scores
# for each sampled frame

Time Series Output

Track emotional dynamics across video frames - great for speeches, debates, or social media content
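Assuming the returned object has one row of scores per sampled frame and one column per emotion (check `?video_scores` for the exact format), the trajectories can be plotted with base R:

# Plot each emotion's score across the sampled frames
emotion_cols <- c("anger", "fear", "joy", "sadness", "surprise", "neutral")
matplot(video_results[, emotion_cols], type = "l", lty = 1,
        xlab = "Frame", ylab = "Score")
legend("topright", legend = emotion_cols, lty = 1,
       col = seq_along(emotion_cols))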

FindingEmo: Benchmarking in the Wild

Large-scale emotion dataset

  • 25,869 public images with annotations
  • Multi-person social scenes
  • Emo8 labels: joy, trust, fear, surprise, sadness, disgust, anger, anticipation
  • Annotated by 655 participants (Mertens et al. 2024)

Labeling Impact: Sports Celebration

CLIP Base Model - Joy Scores

Labeling Approach            Joy Score
Adjectives (“joyful”)          0.53
Person (“a joyful person”)     0.62
Scene (“a joyful scene”)       0.82

Label Engineering Matters

Scene descriptions capture context better for social images

FindingEmo Benchmark Results

Model                  Label Set            Hit@1   Hit@2
CLIP ViT-Base          Adjectives           33.5%   50.7%
CLIP ViT-Large         Scene Descriptions   33.0%   46.4%
Fine-tuned ViT-Large   Scene Descriptions   44.1%   62.0%

Hit@1: the highest-scoring label is the correct one. Hit@2: the correct label is among the top 2.

8-class setup on 442 test images. Fine-tuning with just 2,489 images improves Hit@1 by 11 percentage points.
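The Hit@k metric itself is easy to compute from a named score vector; a minimal base-R helper (not part of transforEmotion):

# Hit@k: is the true label among the k highest-scoring labels?
hit_at_k <- function(scores, true_label, k) {
  true_label %in% names(sort(scores, decreasing = TRUE))[seq_len(k)]
}

s <- c(joy = 0.5, trust = 0.3, fear = 0.2)
hit_at_k(s, "trust", k = 1)  # FALSE: "joy" has the top score
hit_at_k(s, "trust", k = 2)  # TRUE: "trust" is in the top 2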

Fine-tuned Models Available

Contrastive Fine-tuning

  • Class-weighted loss for imbalanced data
  • Custom prompts for emotion disambiguation
  • Models available via registry

# Use fine-tuned model
result <- image_scores(
  image = "scene.jpg",
  classes = emo8_labels,
  model = "findingemo-large"
)

Improvement

Metric   Before   After
Hit@1    33%      44%
Hit@2    46%      62%

A small fine-tuning set (2,489 images) yields substantial gains

When to Use transforEmotion

Best For

  • Large-scale analysis (thousands of items)
  • Custom emotion taxonomies
  • Local/offline processing needs
  • Reproducible research pipelines
  • Time series emotion tracking

Consider LLMs When

  • Complex contextual understanding needed
  • Few items to analyze
  • Nuanced interpretation required
  • Structured reasoning matters

Practical Advice

Start with transforEmotion for scale, validate edge cases with LLMs if needed.

Thank You!

Resources

Acknowledgments

Co-authors: Hudson Golino (UVA), Alexander P. Christensen (Vanderbilt)

Questions?

References

Mertens, Laurent, Elahe Yargholi, Hans Op de Beeck, Jan Van den Stock, and Joost Vennekens. 2024. “FindingEmo: An Image Dataset for Emotion Recognition in the Wild.” arXiv [cs.CV]. https://doi.org/10.48550/arxiv.2402.01355.
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. “Learning Transferable Visual Models From Natural Language Supervision.” arXiv [cs.CV]. http://arxiv.org/abs/2103.00020.