Transformers vs CNN Architectures

Understanding the differences between Transformers and CNN architectures is crucial for developing effective emotion recognition models. While CNNs are traditionally favored for image-related tasks, Vision Transformers have recently been adapted to computer vision and offer promising results thanks to their self-attention mechanisms. We explore both architectures and evaluate their performance in the context of our project.

In the realm of machine learning, particularly within the scope of image processing and facial expression recognition, two primary model architectures have emerged as frontrunners: Convolutional Neural Networks (CNNs) and Vision Transformers. Each architecture brings its unique strengths to the table, shaping the way we implement emotion recognition models.

Convolutional Neural Networks (CNNs)

ResNet-18 Architecture

CNNs have long been the backbone of image processing tasks, renowned for their efficiency in handling spatial data. At their core, CNNs apply convolutional layers to extract local features, pooling layers to reduce spatial dimensionality, and fully connected layers to produce predictions. This hierarchical approach enables CNNs to capture spatial hierarchies in images, making them particularly adept at recognizing facial features and expressions. Their robustness and relative simplicity have made them a go-to choice for tasks requiring the analysis of visual data.

  • Excellent at capturing spatial hierarchies and patterns within images.
  • Efficient in terms of computational resources, especially for image-related tasks.
  • Well-suited for tasks with a strong emphasis on texture, shape, and local feature recognition.
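
To make these building blocks concrete, the sketch below stacks convolution, pooling, and fully connected layers in PyTorch. It is an illustrative toy model, not the ResNet-18 backbone shown above, and the seven-class output is an assumption for the example.

    import torch
    import torch.nn as nn

    class TinyEmotionCNN(nn.Module):
        """Toy CNN: convolution + pooling blocks followed by a fully connected classifier."""
        def __init__(self, num_classes: int = 7):  # seven emotion classes is an assumption
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                              # 224 -> 112
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                              # 112 -> 56
            )
            self.classifier = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                      # global average pooling
                nn.Flatten(),
                nn.Linear(64, num_classes),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(x))

    # Usage: a batch of two 224x224 RGB face crops -> per-class scores.
    logits = TinyEmotionCNN()(torch.randn(2, 3, 224, 224))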

Vision Transformers

Vision Transformer Architecture

The Transformer architecture was originally designed for natural language processing (NLP); Vision Transformers adapt it to images and have recently made significant strides in image recognition. Unlike CNNs, Vision Transformers do not process data in an inherently local, hierarchical manner. Instead, they rely on self-attention mechanisms to weigh the importance of different parts of the input data, irrespective of their position. This allows Vision Transformers to capture global dependencies and relationships within the data, offering a new perspective on tasks traditionally dominated by CNNs.

  • Capable of capturing long-range dependencies within the data, offering a broader understanding of context and relationships.
  • Highly flexible and adaptable to various types of input data, beyond just images.
  • Can match or exceed CNN performance on tasks requiring an understanding of complex patterns and global features, particularly when sufficient training data is available.
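
As a sketch of how such a model can be instantiated in practice, the snippet below loads a pretrained Vision Transformer from torchvision and swaps its classification head. The vit_b_16 variant and the seven-class head are assumptions for illustration, not our exact configuration.

    import torch
    from torchvision import models

    # vit_b_16 splits a 224x224 image into 16x16 patches and applies
    # self-attention across all patches at every layer.
    vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

    # Replace the ImageNet classification head with an emotion head
    # (seven classes here is an illustrative assumption).
    vit.heads.head = torch.nn.Linear(vit.heads.head.in_features, 7)

    logits = vit(torch.randn(1, 3, 224, 224))  # -> shape (1, 7)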

Implications for Emotion Recognition

When it comes to emotion recognition, both architectures offer compelling advantages. CNNs, with their adeptness at detecting spatial patterns and features, excel at identifying the subtle nuances of facial expressions. However, their emphasis on local features can limit their ability to grasp the broader context of a face or more abstract emotional cues.

On the other hand, Vision Transformers, with their ability to understand global dependencies, offer a novel approach to recognizing emotions. They can potentially capture more complex emotional expressions that span across the entire face, rather than just local features. This makes them an exciting alternative for advancing emotion recognition technologies, especially in applications where understanding the nuanced expressions of children is crucial.

In our project, we aim to harness the strengths of both architectures, exploring how each can contribute to creating a more nuanced and accurate emotion recognition model. By comparing their performance across diverse datasets, including those focused exclusively on children's faces, we seek to uncover insights that will drive the future of emotion recognition technology forward.

Comparison of Transformers and CNN architectures in emotion recognition, featuring ResNet-18 and Vision Transformer models.

Transfer Learning

By leveraging the knowledge gained from pretrained models, we can significantly improve the performance of our emotion recognition system. This section outlines our approach to transfer learning, including the models and datasets we used.

Training on Embeddings

Our primary transfer learning approach focused on embeddings. Using pretrained models to process each image, we extracted and saved the output of the final feature layer (just before the classification head) as an embedding, distilling hundreds of pretrained features into a compact vector suitable for further classification tasks.
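
A minimal sketch of this extraction step is shown below, assuming a torchvision ResNet-18 backbone and standard 224x224 preprocessing (both assumptions for illustration):

    import torch
    from torchvision import models

    # Extract a 512-dimensional embedding from a pretrained ResNet-18
    # by replacing its final fully connected layer with an identity.
    backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    backbone.fc = torch.nn.Identity()   # keep the features, drop the ImageNet classifier
    backbone.eval()

    with torch.no_grad():
        image = torch.randn(1, 3, 224, 224)   # placeholder for a preprocessed face image
        embedding = backbone(image)           # shape (1, 512)

    # Embeddings like this can be saved (e.g. with torch.save) and reused
    # as inputs to a lightweight downstream classifier.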

We explored two methods:

  1. Full Layer Training: Fine-tuning all layers of a pretrained model to harness comprehensive embeddings for subsequent classifiers.
  2. Partial Layer Training: Selectively freezing the initial layers of a pretrained model. These earlier layers typically capture universal features such as textures and contours, while the later layers, left unfrozen, are refined to discern more complex, task-specific patterns. This lets us concentrate on the deep features most relevant to recognizing the nuances of facial expressions for emotion detection.

This tailored approach allowed us to leverage pretrained model strengths, concentrating the learning on the most relevant features for detecting children's facial expressions and thereby boosting performance and efficiency.
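
The sketch below illustrates the partial-freezing variant, again assuming a ResNet-18 backbone; the exact split point (everything before layer4 frozen) and the seven-class head are illustrative assumptions rather than our exact configuration.

    import torch
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = torch.nn.Linear(model.fc.in_features, 7)   # new emotion head (class count assumed)

    # Freeze the early stages, which capture generic textures and contours;
    # leave the deepest stage and the new head trainable.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(("layer4", "fc"))

    # Only the trainable parameters are handed to the optimizer.
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )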

Training Loss for Transfer Learning
Training Accuracy for Transfer Learning

Top: Training loss comparison. Bottom: Training accuracy comparison.

Data Augmentation Techniques

To prepare our models for the variability encountered in real-world scenarios, we employed a series of data augmentation techniques. These methods enrich our dataset with images that simulate a range of environmental conditions and potential occlusions that one might encounter in practical applications.

Utilizing OpenCV, we introduced occlusions by overlaying black squares over the eye regions in images, mimicking situations where parts of the face are obscured. We also varied the brightness of images within a specific range to replicate different lighting conditions. These augmentations not only increased the robustness of our models but also effectively doubled our dataset size, thereby enhancing the models' ability to generalize to new, unseen data.
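
A minimal sketch of these two augmentations with OpenCV follows; the eye-box coordinates (which would in practice come from a face or landmark detector) and the brightness range are illustrative assumptions.

    import cv2
    import numpy as np

    def occlude_eyes(image, eye_boxes):
        """Overlay filled black squares on the given (x, y, w, h) eye regions."""
        out = image.copy()
        for (x, y, w, h) in eye_boxes:
            cv2.rectangle(out, (x, y), (x + w, y + h), color=(0, 0, 0), thickness=-1)
        return out

    def adjust_brightness(image, factor):
        """Scale pixel intensities by `factor` (e.g. 0.6 darker, 1.4 brighter)."""
        return cv2.convertScaleAbs(image, alpha=factor, beta=0)

    # Example: pair each original image with an augmented copy.
    img = cv2.imread("face.jpg")                                         # placeholder path
    occluded = occlude_eyes(img, [(60, 80, 40, 25), (130, 80, 40, 25)])  # assumed eye boxes
    brighter = adjust_brightness(img, np.random.uniform(0.6, 1.4))       # assumed range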

Data Augmentation Techniques