IMPS 2025
July 18, 2025
Traditional Approach
Expert annotation based on FACS methodology - requires extensive training and manual coding of facial expressions.
CNNs learn to map facial images to emotion labels through supervised training
The goal: encode FACS expertise into learned features and network weights
CNN outputs FER scores - probability distributions across emotion categories
Dual interpretation: uncertainty in the model vs. strength/clarity of the expression
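To make this dual interpretation concrete, here is a minimal R sketch with made-up numbers (not output from any real model): the same flat distribution can reflect either model uncertainty or a weak/mixed expression.

# Classic basic-emotion categories used by most FER training sets
emotions  <- c("anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral")

# Hypothetical softmax outputs from a FER CNN (each sums to 1)
confident <- c(0.02, 0.01, 0.01, 0.90, 0.02, 0.02, 0.02)  # clear, strong expression
ambiguous <- c(0.18, 0.10, 0.12, 0.20, 0.15, 0.10, 0.15)  # model unsure OR expression weak/mixed
names(confident) <- names(ambiguous) <- emotions

# The scores alone cannot tell us which interpretation is correct
round(rbind(confident, ambiguous), 2)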
Systematic problems with CNN-based facial expression recognition:
The Blame Game
When CNN models fail, where does the responsibility lie?
CNN architecture → Training data quality → FACS methodology → Human annotation reliability
CNN approaches are locked into classic emotion categories from training datasets
What if we want to explore additional emotions like joy, contempt, or anticipation?
Adding new categories requires extensive data annotation + model retraining - very expensive
Let’s treat FER as an image classification problem
Vision Transformers are capable of zero-shot classification of images
We can specify any labels we want without re-training the model
OpenAI’s CLIP (Radford et al. 2021) model is a great example of this approach
Category | Score |
---|---|
Food that I should eat in order to be a functioning, healthy 37-year-old | 0.08 |
Food that I eat at 2AM because I have unresolved issues | 0.92 |
Images and text that describe similar concepts are mapped to nearby locations in the shared embedding space
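Conceptually, the scores in the table above arise from cosine similarities between these embeddings, passed through a temperature-scaled softmax. A toy base-R sketch with made-up embedding vectors (real CLIP embeddings have hundreds of dimensions):

# Toy 4-dimensional embeddings standing in for CLIP's image and text encoders
img_emb  <- c(0.9, -0.2, 0.1, 0.4)                      # image embedding
txt_embs <- rbind(healthy = c(-0.5, 0.8, 0.3, -0.1),    # "healthy 37-year-old" label
                  twoam   = c(0.8, -0.3, 0.2, 0.5))     # "2AM unresolved issues" label

cosine  <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Cosine similarity between the image and each candidate label
sims    <- apply(txt_embs, 1, function(txt) cosine(img_emb, txt))

# Temperature-scaled softmax turns similarities into label probabilities
softmax <- function(x, temp = 0.07) exp(x / temp) / sum(exp(x / temp))
round(softmax(sims), 2)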
Available on CRAN (v0.1.6) and on GitHub
The image_scores() and video_scores() functions support different CLIP variants:
- oai-base (openai/clip-vit-base-patch32): faster but less accurate; ~2GB RAM required; good for quick analysis
- oai-large (openai/clip-vit-large-patch14): more accurate but slower; ~4GB RAM required; better for precision work
- eva-8B (BAAI/EVA-CLIP-8B-448): very large model with 4-bit quantization; ~8GB RAM required (instead of ~32GB); highest accuracy available
- jina-v2 (jinaai/jina-clip-v2): high accuracy with multilingual support (89 languages); ~6GB RAM required

(An example of selecting a variant follows the basic usage code below.)
# Install and load package
install.packages("transforEmotion")
library(transforEmotion)
# Setup Python environment (one-time only)
setup_miniconda()
# Analyze image emotions
image <- 'slides/images/food.jpg'
labels <- c("Food that I should eat in order to be a functioning, healthy 37-year-old",
"Food that I eat at 2AM because I have unresolved issues")
result <- image_scores(image, labels)
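To use one of the larger CLIP variants listed above, the sketch below assumes the variant is selected through a model argument that accepts the shorthand names (oai-base, oai-large, eva-8B, jina-v2); check ?image_scores for the exact argument name in the installed version.

# Re-run the analysis with a larger CLIP variant
# NOTE: the 'model' argument name is an assumption - verify with ?image_scores
result_large <- image_scores(image, labels, model = "oai-large")

# Multilingual labels are possible with the jina-v2 variant (89 languages)
result_jina <- image_scores(image, labels, model = "jina-v2")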
Scene from The Witcher TV series showing the character Yennefer - labeled as contempt in the MAFW dataset
Parameters:
Video description:
A lady debunks the other's mind, shaking the head, then comes ahead. The lifted eyebrows and a slight pout.
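A hedged sketch of scoring a clip like this one with video_scores(); the file path is hypothetical and the argument names (video, classes, nframe_per_video) should be checked against ?video_scores.

# Candidate labels - zero-shot, so we can include "contempt" even though
# classic FER training sets rarely do
video_labels <- c("contempt", "anger", "disgust", "surprise", "neutral")

# NOTE: hypothetical path; argument names are assumptions - verify with ?video_scores
video_result <- video_scores(video = "slides/videos/yennefer.mp4",
                             classes = video_labels,
                             nframe_per_video = 20)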
Zero-shot Classification
Enables flexible emotion detection without retraining
dynEGA Analysis
Reveals hidden structure in emotion time series
DLO Parameters
Capture distinct emotional regulation strategies
Model validation on diverse datasets
Model fine-tuning for emotion-specific tasks
Improving parameter recovery methods
Questions?
Contact: atomasevic@ipb.ac.rs
Feature Hierarchy
Features learned in early layers detect edges and basic facial structures
Deeper layers combine features to detect more complex patterns like facial expressions
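As a generic illustration of this hierarchy (a sketch using the keras R package, not the specific FER model evaluated here):

library(keras)

# Generic FER-style CNN sketch: 48x48 grayscale faces -> 7 emotion probabilities
model <- keras_model_sequential() %>%
  # Early layers: small filters over raw pixels -> edges, corners, simple contours
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(48, 48, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  # Deeper layers: combine earlier features -> mouth corners, brow shapes, etc.
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  # Classification head: probability distribution over emotion categories
  layer_dense(units = 7, activation = "softmax")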
CLIP processes large batches (e.g., 32,768 pairs) simultaneously:
Positive pair (matched):
- Image: food wheel
- Text: "One sentence sums up how to eat healthy"

Negative pairs (contrastive):
- Same image paired with unrelated texts from the batch:
  - "Tesla announces new electric vehicle"
  - "Cat rescued from tree"
  - "Stock market reaches all-time high"
Maximize similarity between:
- Food image ↔ "healthy eating" text

Minimize similarity between:
- Food image ↔ "Tesla announces…" text
- "Healthy eating" text ↔ car image
Key Insight
The model learns by contrasting correct pairs against incorrect ones within each training batch
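A toy base-R sketch of this contrastive objective with random stand-in embeddings and a batch of 4 instead of 32,768: every image is compared to every text in the batch, and a symmetric cross-entropy pulls each image toward its own caption and pushes it away from the others.

set.seed(1)
batch <- 4; dims <- 8                     # tiny stand-ins for 32,768 pairs and 512+ dims
l2_normalize <- function(m) m / sqrt(rowSums(m^2))

img_emb <- l2_normalize(matrix(rnorm(batch * dims), batch, dims))  # image encoder output
txt_emb <- l2_normalize(matrix(rnorm(batch * dims), batch, dims))  # text encoder output

# All pairwise cosine similarities, scaled by a temperature parameter
logits <- (img_emb %*% t(txt_emb)) / 0.07

# Row-wise softmax cross-entropy: each image should match its own caption (the diagonal)
softmax_rows <- function(m) exp(m) / rowSums(exp(m))
loss_img_to_txt <- -mean(log(diag(softmax_rows(logits))))
loss_txt_to_img <- -mean(log(diag(softmax_rows(t(logits)))))

# Symmetric CLIP-style loss: average of the two directions
(loss <- (loss_img_to_txt + loss_txt_to_img) / 2)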
Black Box Pipeline
Key Question
Why would we abandon models grounded in expert knowledge?
Model | precision_negative | recall_negative | macro_f1 | weighted_f1 | accuracy |
---|---|---|---|---|---|
CNN-fer | 0.721 | 0.588 | 0.502 | 0.567 | 0.552 |
CNN-strong | 0.627 | 0.642 | 0.507 | 0.530 | 0.533 |
CNN-weak | NaN | 0.000 | 0.459 | 0.194 | 0.322 |
CLIP | 0.623 | 0.728 | 0.502 | 0.518 | 0.546 |
CLIP-NE | 0.609 | 0.914 | 0.490 | 0.509 | 0.609 |
MetaCLIP | 0.597 | 0.934 | 0.453 | 0.496 | 0.602 |
MetaCLIP-NE | 0.597 | 0.934 | 0.444 | 0.493 | 0.600 |
appleCLIP | 0.786 | 0.257 | 0.439 | 0.385 | 0.367 |
appleCLIP-NE | 0.767 | 0.346 | 0.474 | 0.439 | 0.415 |