Natural language processing (NLP) techniques have made it possible for machines to understand and make use of human communication. Sentiment analysis on text datasets is one of the most successful applications of machine learning. Alongside NLP, audio processing and computer vision have evolved into sophisticated areas of data science. Feature engineering becomes richer when we combine textual information, facial gestures, and audio signal properties.

1. Challenges faced by using only text

Models trained only on text often fail to detect the user's emotion. They cannot tell whether a sentence was said in anger or in happiness, and humor and sarcasm are frequently misclassified. Humans understand emotion from both the tone of voice and the words being said. Facial expressions and vocal tone carry a lot of information that is missed when sentiment analysis is done on text alone.

E.g. Sentence: Ram is really doing great.

Text-based emotion detection models would classify this sentence as positive/happy, but it could just as well have been said sarcastically. With an audio model, the same sentence could be classified as Positive/Happy, Negative/Angry, or Neutral depending on the audio features of the recording. Audio features therefore open up a new dimension for detecting the emotion behind a sentence.

2. How to overcome these challenges using audio?

Different audio features carry key information about sentiment and emotion. The most common audio features used in emotion analysis are MFCC (mel-frequency cepstral coefficients), pitch, spectral centroid, spectral flux, and beat histogram. These features help capture the intention behind the way a particular sentence is said.
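To make one of these features concrete, here is a minimal sketch of the spectral centroid, the magnitude-weighted mean frequency of a short audio frame. It uses only the Python standard library and a naive DFT, so it suits short illustrative frames only; in practice an audio library would normally be used. The function names, the sample rate, and the synthetic sine-wave "recordings" are assumptions made for illustration.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (slow, but fine for a short frame)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (in Hz) of one frame."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    n = len(frame)
    freqs = [k * sample_rate / n for k in range(len(mags))]
    return sum(f * m for f, m in zip(freqs, mags)) / total


def sine(freq_hz, sample_rate, n):
    """Synthetic test tone standing in for a real recording (assumption)."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


SAMPLE_RATE = 8000  # Hz, hypothetical
low_centroid = spectral_centroid(sine(250, SAMPLE_RATE, 256), SAMPLE_RATE)
high_centroid = spectral_centroid(sine(2000, SAMPLE_RATE, 256), SAMPLE_RATE)

# The higher-pitched tone has the higher spectral centroid.
print(round(low_centroid), round(high_centroid))
```

A higher centroid roughly corresponds to a "brighter" sound, which is one reason this feature appears alongside pitch and MFCC in emotion models.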

For example: 

Sentence – “Kids are talking by the door”

Text Sentiment – POSITIVE with confidence(0.934)

The same sentence can carry different emotions depending on its audio features:

Kids are talking by the door – Happy

Kids are talking by the door – Angry

Kids are talking by the door – Neutral

Kids are talking by the door – Disgust


Hence, it is crucial to consider audio features when building emotion detection models.

3. How to overcome these challenges using facial recognition?

Visual data is full of rich cues. According to a survey, around 80% of the data on the internet is visual. Visual features include facial expressions, which are of paramount importance in capturing sentiments and emotions. The most predictive cue in visual data is a smile.

Let’s try to understand this with a simple example.

While using chat platforms like WhatsApp, Facebook Messenger, etc., we often fail to recognize the true sentiment of the other person, but this rarely happens on a video call. Face-to-face communication is widely preferred over text-based communication because facial expressions carry a lot of information about the person's current sentiment. All this information can be represented as a numeric vector and used to train emotion detection models.


As we can see, a lot of information about sentiment is hidden in a person's face, especially in the smile.

4. The multimodal solution is the future

Multimodal sentiment analysis adds a new dimension to traditional text-based sentiment analysis: it goes beyond the analysis of text and includes other modalities such as audio and visual data. It can combine multiple models based on audio and facial features along with the text. A vast amount of social media data is available in the form of images and videos, and building models on such data enhances emotion detection capabilities. The three kinds of features used extensively in multimodal solutions are:

  1. Textual features
  2. Audio features
  3. Visual features

We can combine these features, i.e. fuse them, at the feature level, at the decision level, or in a hybrid way.

Feature-level fusion

This is sometimes called early fusion. In this technique, we gather features from all the modalities, join them into a single vector, and pass that vector to the classification algorithm.
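As a rough sketch of early fusion, the snippet below concatenates per-modality feature vectors into one vector and hands it to a classifier. The nearest-centroid classifier is only a toy stand-in for whatever classification algorithm is actually used, and the feature values, vector sizes, and function names are hypothetical.

```python
def fuse_features(text_vec, audio_vec, visual_vec):
    """Early fusion: concatenate the per-modality features into one vector."""
    return list(text_vec) + list(audio_vec) + list(visual_vec)


def nearest_centroid(sample, centroids):
    """Toy classifier: return the label whose centroid is closest to the sample."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(sample, centroids[label]))


# Hypothetical class centroids in the fused feature space.
# Order: 2 text features, 2 audio features, 1 visual (smile) feature.
centroids = {
    "happy": fuse_features([0.9, 0.1], [0.8, 0.3], [0.9]),
    "angry": fuse_features([0.9, 0.1], [0.2, 0.9], [0.1]),
}

# Same positive-looking text, but harsh audio and almost no smile.
sample = fuse_features([0.9, 0.1], [0.3, 0.8], [0.2])
fused_prediction = nearest_centroid(sample, centroids)
print(fused_prediction)  # angry
```

Because the classifier sees all modalities at once, the audio and visual features can override a text signal that looks positive on its own.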

Decision-level fusion

This is sometimes called late fusion. In this technique, we feed the data from each modality independently into its own classification algorithm and obtain the final sentiment classification by fusing the individual results into a single decision vector.
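A minimal sketch of late fusion follows, assuming each modality's classifier has already produced class probabilities. Weighted averaging is just one possible fusion rule (voting or a learned meta-classifier are alternatives); the labels, probabilities, and function name are hypothetical.

```python
LABELS = ["happy", "angry", "neutral"]


def late_fusion(modality_probs, weights=None):
    """Fuse per-modality class probabilities into one decision vector
    using a weighted average (equal weights by default)."""
    names = list(modality_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    return [
        sum(weights[name] * modality_probs[name][i] for name in names) / total_weight
        for i in range(len(LABELS))
    ]


# Hypothetical outputs of three independent classifiers for the same utterance.
modality_probs = {
    "text":   [0.80, 0.05, 0.15],
    "audio":  [0.10, 0.75, 0.15],
    "visual": [0.15, 0.70, 0.15],
}

decision_vector = late_fusion(modality_probs)
late_prediction = LABELS[decision_vector.index(max(decision_vector))]
print(late_prediction)  # angry
```

The `weights` argument lets a more reliable modality count for more, which is one common reason to prefer late fusion when the modalities differ in quality.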

Hybrid-level fusion

This is a combination of decision-level and feature-level fusion. In this technique, we exploit information from both methods during the classification process.
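One simple way to picture hybrid fusion, offered as an illustrative assumption rather than a standard recipe, is to blend the probabilities from a model trained on the joined feature vector with the decision vector obtained from the per-modality classifiers. The probabilities, the blending factor `alpha`, and the function name below are hypothetical.

```python
LABELS = ["happy", "angry", "neutral"]


def hybrid_fusion(feature_level_probs, decision_level_probs, alpha=0.5):
    """Blend the output of a feature-level (early) fusion model with a
    decision-level (late) fusion vector. alpha weights the early-fusion side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return [
        alpha * early + (1.0 - alpha) * late
        for early, late in zip(feature_level_probs, decision_level_probs)
    ]


# Hypothetical probabilities for the same utterance from the two approaches.
feature_level_probs = [0.30, 0.60, 0.10]   # classifier on the joined feature vector
decision_level_probs = [0.35, 0.50, 0.15]  # fused per-modality classifier outputs

hybrid_vector = hybrid_fusion(feature_level_probs, decision_level_probs)
hybrid_prediction = LABELS[hybrid_vector.index(max(hybrid_vector))]
print(hybrid_prediction)  # angry
```

Other hybrid designs exist, for example fusing two modalities at the feature level and adding the third at the decision level.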


5. Use cases of multimodal solutions

They can be applied in the development of different forms of recommender systems, for example to analyze product reviews, recommend products or services, or analyze movie reviews. They can be used extensively in the development of virtual assistants through the application of NLP and ML techniques. In the healthcare domain, they can be used to detect stress, anxiety, and depression. And the list of applications goes on.

Such is the power of multimodal solutions. They are truly the future of customer service enhancement and will play a crucial role in customer retention and service delivery.

