EXTRACTING MFCC AND GTCC FEATURES FOR EMOTION RECOGNITION FROM AUDIO SPEECH SIGNALS

INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS (IJRCAR), www.ijrcar.com, Vol. 2, Issue 8, Pg. 46-63, August 2014, ISSN 2320-7345

Minu Babu 1, Dr. Arun Kumar M.N 2, Mrs. Susanna M. Santhosh 3

1 M.Tech Scholar, Department of Computer Science and Engineering, Federal Institute of Science and Technology (FISAT), Mahatma Gandhi University, Kottayam, Kerala, minubabu4@gmail.com
2 Associate Professor, Department of Computer Science and Engineering, Federal Institute of Science and Technology (FISAT), Mahatma Gandhi University, Kottayam, Kerala, akmar_mn11@rediffmail.com
3 Assistant Professor, Department of Computer Science and Engineering, Mar Baselios Institute of Technology and Science (MBITS), Mahatma Gandhi University, Kottayam, Kerala, susannasanthosh@gmail.com

Abstract

Emotion recognition from speech has attracted increasing interest in recent years given its broad field of applications. The recognition system developed here uses the Mel Frequency Cepstral Coefficient (MFCC) and the Gammatone Cepstral Coefficient (GTCC) as feature vectors for recognizing emotions in a speech signal. MFCC is the most commonly used feature vector for classification, but MFCC-based systems usually do not perform well under noisy conditions because the extracted features are distorted by noise, causing mismatched likelihood calculation. We therefore introduce a novel speaker feature, the gammatone cepstral coefficient (GTCC), based on an auditory periphery model, and show that this feature captures speaker characteristics and performs substantially better than conventional speaker features under noisy conditions. An important finding of the study is that GTCC features outperform conventional MFCC features under noisy conditions. These features are then used to train the classifier, a cascade feed-forward back-propagation neural network. The database consists of 240 speech samples; 180 samples are used for training the system and the remaining 60 samples are used for testing it. This study compares MFCC and GTCC for recognizing emotion from speech. The error rates of the system corresponding to MFCC and GTCC are 0.009703 and 0.0090822 respectively.

Keywords: Automatic speech recognition, Pre-processing, Feature extraction, Classification, Mel-frequency cepstral coefficient, Gammatone cepstral coefficient.

1. Introduction

Speech is a complex signal which contains information about the message, the speaker, the language and the emotions. Speech is one of the most natural forms of communication between human beings. Emotion, on the other hand, is an individual mental state that arises spontaneously rather than through conscious effort. Humans also express their emotions via written and spoken language. Enabling systems to interpret user utterances for a more intuitive human-machine interaction therefore also requires understanding the transmitted emotional aspects. The actual user emotion may help a system track the user's behaviour by adapting to his inner mental state. Recognition of emotions is generally within the scope of research on human-machine interaction.
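As a rough, high-level illustration of the system summarized in the abstract, the following sketch extracts frame-averaged MFCCs per utterance and trains a small feed-forward network on a 180/60 split. It assumes librosa and scikit-learn; the MLPClassifier is only a stand-in for the cascade feed-forward back-propagation network used in the paper, the file names and labels are hypothetical placeholders, and GTCC extraction (which additionally requires a gammatone filterbank) is omitted:

```python
# Minimal end-to-end sketch of the described system (not the authors' code).
# Assumptions: librosa for MFCC extraction, scikit-learn's MLPClassifier as a
# stand-in classifier, and hypothetical file paths / labels for the 240 samples.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def utterance_features(path, n_mfcc=13):
    """Return one fixed-length feature vector per utterance (frame-averaged MFCCs)."""
    signal, sr = librosa.load(path, sr=None)                      # keep native rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    return mfcc.mean(axis=1)

# Hypothetical database: paths and emotion labels for 240 utterances.
paths = ["utt_%03d.wav" % i for i in range(240)]          # placeholder file names
labels = ["anger", "happiness"] * 120                     # placeholder emotion tags

X = np.array([utterance_features(p) for p in paths])
X_train, y_train = X[:180], labels[:180]    # 180 samples for training, as in the paper
X_test, y_test = X[180:], labels[180:]      # remaining 60 samples for testing

net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```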
Among other modalities, such as facial expressions, speech is one of the most promising and established modalities for emotion recognition. Several emotional hints are carried within the speech signal. The database for a speech emotion recognition system consists of emotional speech samples. Classifiers are used to differentiate emotions such as anger, happiness, sadness, surprise, fear, the neutral state, etc., and classification performance depends on the extracted features. The features extracted from these speech samples include the energy, the pitch, linear prediction cepstrum coefficients and mel frequency cepstrum coefficients. The general layout of a speech emotion recognition system is shown in Figure 1.1.

Figure 1.1: General Layout of a Speech Emotion Recognition System

Like typical pattern recognition systems, a speech emotion recognition system comprises the following main stages: speech input, feature extraction, feature selection, classification and emotion output. Since even humans cannot easily classify natural emotions, it is difficult to expect machines to offer higher classification accuracy.

Affective computing is a field of research that deals with recognizing, interpreting and processing emotions or other affective phenomena. It plays an increasingly important role in assistive technologies. With the help of affective computing, computers are no longer indifferent logical machines; they may be capable of understanding a user's feelings, needs and wants and of giving feedback in a manner that is much easier for users to accept. Emotion recognition is an essential component of affective computing. In daily communication, identifying emotion in speech is a key to deciphering the underlying intention of the speaker. Computers with the ability to recognize different emotional states could help people who have difficulties in understanding and identifying emotions.

Many studies have been conducted in an attempt to automatically determine emotional states in speech. Some of them used acoustic features such as Mel frequency cepstral coefficients (MFCCs) and the fundamental frequency to detect emotional cues, while other studies employed prosodic features in speech to achieve higher classification accuracy. Various classifiers have been applied to emotion recognition, including Hidden Markov Models (HMM), the Naïve Bayes classifier and decision tree classifiers.

Deep neural network techniques have recently yielded impressive performance gains across a wide variety of competitive tasks and challenges. For example, a number of the world's leading industrial speech recognition groups have reported significant recognition performance gains through deep network techniques, and the large-scale image recognition challenge has recently been won, by a wide margin, through the use of deep neural networks. The Emotion Recognition in the Wild challenge is based on an extended form of the Acted Facial Expressions in the Wild database, in which short video clips extracted from feature-length movies have been annotated for different emotions. A core aspect of that approach is the use of a deep convolutional neural network for frame-based facial expression classification. To train this model, additional data was used, composed of images of faces with expressions labelled as one of seven basic emotions (angry, disgust, fear, happy, sad, surprise and neutral). The use of this additional data seems to have made a big difference in performance by allowing high-capacity models to be trained without overfitting to the relatively small challenge training data.
Importantly, a direct measure of per-frame errors on the challenge data does not yield performance superior to the challenge baseline; however, the strategy of using the challenge training data to learn how to aggregate the per-frame predictions was able to boost performance substantially. These efforts led to a number of contributions and insights which may be more broadly applicable. First, the approach of mining large-scale imagery from image search to train deep neural networks appears to have helped avoid overfitting in the facial expression model. Perhaps counter-intuitively, the convolutional network models learned using only the additional static-frame training data were able to yield higher validation set performance when the labelled video data from the challenge was used only to learn the aggregation model and the static frames of the challenge training set were not used to train the underlying convolutional network. This effect is believed to be explained in part by the fact that many video frames in isolation are not representative of the clip's emotional tag, so their inclusion in the training set for the static-frame deep neural network classifier adds noise and further exacerbates overfitting.

The problem of overfitting had both direct consequences on per-model performance on the validation set and indirect consequences on the ability to combine model predictions. Analysis of simple model averaging showed that no combination of models could yield performance superior to an SVM applied to the outputs of the audio-video models. Efforts to create both SVM and Multi-Layer Perceptron (MLP) aggregator models led to similar observations: those models quickly overfit the training data, and no setting of hyper-parameters could be found that would yield increased validation set performance. This is because the activity recognition and bag-of-mouth models severely overfit the challenge training set, and the SVM and MLP aggregation techniques overfit the data in such a way that no traditional hyper-parameter tuning could yield validation set performance gains. These observations led to a novel technique of aggregating the per-model and per-class predictions via random search over simple weighted averages. The resulting aggregation technique is of extremely low complexity, and the underlying prediction is highly constrained, using simple weighted combinations of complex deep network models, each of which did reasonably well at the task. As this yielded a marked increase in performance on both the challenge validation and test sets, it suggests that, given models that overfit the training data, it may be better practice to search a moderate space of simple combination models than to follow more traditional approaches such as searching the smaller space of SVM hyper-parameters, or even a moderately sized space of traditional MLP hyper-parameters including the number of hidden layers and the number of units per layer.
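The aggregation idea just described (random search over simple weighted averages of per-model class predictions) can be sketched as follows; the array shapes, weight sampling and accuracy criterion are illustrative assumptions, not the challenge entrants' actual code:

```python
# Sketch of random-search weighted averaging: sample random convex combinations of
# per-model class-probability predictions and keep the one that scores best on a
# validation set.
import numpy as np

rng = np.random.default_rng(0)

def random_search_weights(model_probs, y_val, n_trials=10000):
    """model_probs: list of (n_samples, n_classes) arrays; y_val: true class indices."""
    stacked = np.stack(model_probs)             # (n_models, n_samples, n_classes)
    best_w, best_acc = None, -1.0
    for _ in range(n_trials):
        w = rng.random(stacked.shape[0])
        w /= w.sum()                             # random convex combination of models
        avg = np.tensordot(w, stacked, axes=1)   # weighted average of class probabilities
        acc = np.mean(avg.argmax(axis=1) == y_val)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Toy usage with three hypothetical models on a 5-sample, 3-class validation set.
probs = [rng.dirichlet(np.ones(3), size=5) for _ in range(3)]
y_val = np.array([0, 2, 1, 1, 0])
weights, acc = random_search_weights(probs, y_val)
print(weights, acc)
```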
Over the last seventy years, much research has been done on speech recognition: processing human speech and converting it into the sequence of words it refers to. Although much progress has been made in speech recognition performance, we are still far from a natural interaction between human and machine, because the machine does not understand the human's emotional state. This new research field has introduced emotion awareness into speech recognition. Researchers believe that emotion-sensitive speech recognition can be useful for extracting meaning from speech and can improve the performance of speech recognition systems.

The detected emotions are used in man-machine interfaces to recognize errors in the man-machine interaction from a negative user emotion. If a user seems annoyed after a system reaction, error-recovery strategies are started. On the other hand, a joyful user encourages a system to train user models without supervision. First- or higher-order user preferences can be trained to constrain the potential intention sphere for erroneously recognized instances such as speech or gesture input; to do so, a system needs an online reference value such as a positive user reaction. Furthermore, the system can proactively provide help for a seemingly irritated user. Control or induction of user emotions is another field of application that requires knowledge of the actual emotion: for example, in high-risk tasks it seems useful to calm down a nervous person, not to distract her by shortening dialogues, or to keep a tired user awake. Other general applications of an emotion recognition system are:

- Dialog systems, for detecting angry users
- Tutoring systems, for detecting a student's interest or certainty
- Lie detection
- Social interaction systems, for detecting frustration, disappointment, surprise, etc.

2. Literature Survey

Several studies have been conducted in the field of emotion recognition systems. The literature survey for this project covers earlier work on emotion recognition from speech and describes the various feature vectors and classifiers used for it.

2.1 Feature Extraction and Selection

A speech signal is composed of a large number of parameters that indicate its emotional content; changes in these parameters indicate changes in the emotions. Proper choice of feature vectors is therefore one of the most important tasks in speech recognition. Feature vectors can be classified as long-time and short-time feature vectors: the long-time ones are estimated over the entire length of the utterance, while the short-time ones are determined over windows of usually less than 100 ms. The long-time approach identifies emotions more efficiently. Commonly used features [1] are energy and related features (the energy is the basic and most important feature of a speech signal; statistics of the energy over the whole speech sample, such as the mean value, maximum value, variance and variation range, can be obtained), pitch and related features (the pitch frequency can be calculated in each speech frame) and qualitative features (the emotional content of an utterance is strongly related to its voice quality).
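As a rough illustration of the energy- and pitch-related features just described, here is a minimal sketch; the frame length, hop size and the autocorrelation-based pitch estimator are assumptions made for illustration, not details taken from the paper:

```python
# Short-time energy statistics (mean, max, variance, range) and a crude per-frame
# pitch estimate from the autocorrelation peak.
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def energy_statistics(x, frame_len=400, hop=160):
    frames = frame_signal(x, frame_len, hop)
    energy = np.sum(frames ** 2, axis=1)               # short-time energy per frame
    return {"mean": energy.mean(), "max": energy.max(),
            "variance": energy.var(), "range": energy.max() - energy.min()}

def frame_pitch(frame, sr, fmin=50.0, fmax=400.0):
    """Rough pitch estimate: lag of the autocorrelation peak within [fmin, fmax]."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

# Toy usage on a synthetic 200 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t)
print(energy_statistics(x))
print(frame_pitch(x[:400], sr))   # expected to be close to 200 Hz
```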
The acoustic parameters related to speech quality are voice level (signal amplitude, energy and duration), voice pitch, phrase, word and feature boundaries, and temporal structures; widely used spectral features include Linear Prediction Cepstrum Coefficients (LPCC), Mel-Frequency Cepstrum Coefficients (MFCC) and Perceptual Linear Predictive (PLP) coefficients.

In 2013, Dipti D. Joshi and Prof. M. B. Zalte [1] of Mumbai University published a report on the various feature vectors and classifiers for emotion recognition from speech; the classification of feature vectors into long-time and short-time vectors summarized above follows their report. The commonly used features they list are:

1) Energy and related features. The energy is the basic and most important feature in a speech signal. Statistics of the energy over the whole speech sample are obtained by calculating the energy, such as the mean value, maximum value, variance, variation range and contour of the energy.

2) Pitch and related features. The value of the pitch frequency can be calculated in each speech frame.

3) Qualitative features. The emotional content of an utterance is strongly related to its voice quality. The acoustic parameters related to speech quality are voice level (signal amplitude, energy and duration), voice pitch, phrase, word and feature boundaries, and temporal structures.

4) Linear Prediction Cepstrum Coefficients (LPCC). LPCC embodies the characteristics of a particular channel of speech. Linear Predictive analysis is based on the assumption that the shape of the vocal tract governs the nature of the sound being produced, so these feature coefficients can be used to identify the emotions contained in speech.

5) Mel-Frequency Cepstrum Coefficients (MFCC). MFCC is based on the characteristics of the human ear's hearing: it uses a nonlinear frequency scale to simulate the human auditory system. The mel frequency scale is the most widely used speech feature, with simple calculation, good discriminative ability, noise robustness and other advantages. (A step-by-step computational sketch is given after this list.)

6) Wavelet-based features. A speech signal is a non-stationary signal, with sharp transitions, drifts and trends, which is hard to analyse. A time-frequency representation of such signals can be obtained using wavelets; for speaker emotional state identification, the Discrete Wavelet Transform offers the best solution.
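To make the MFCC item above concrete, the following is a step-by-step sketch of the computation: warp the power spectrum onto the nonlinear mel scale with a triangular filterbank, take logarithms, then decorrelate with a DCT. The frame length, filter count and coefficient count are illustrative assumptions, not values taken from the paper:

```python
# MFCC computation sketch: mel filterbank on the power spectrum, log, DCT.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced uniformly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13):
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2   # power spectrum
    energies = mel_filterbank(n_filters, n_fft, sr) @ spectrum       # mel-band energies
    return dct(np.log(energies + 1e-10), norm="ortho")[:n_coeffs]    # cepstral coefficients

# Toy usage: MFCCs of one 25 ms frame of a synthetic tone at 16 kHz.
sr = 16000
frame = np.sin(2 * np.pi * 300 * np.arange(400) / sr)
print(mfcc_frame(frame, sr))
```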
2.2 Classifier Selection

The selection of a classifier depends on the geometry of the input feature vector; some classifiers are more efficient with certain types of class distributions. The classifiers commonly used are Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), Support Vector Machines (SVM), Artificial Neural Networks (ANN), K-Nearest Neighbours (KNN) and Decision Trees.

1) Hidden Markov Model (HMM). The Hidden Markov Model (HMM) [2] is a popular statistical tool for modelling a wide range of time-series data. In the context of natural language processing (NLP), HMMs have been applied with great success to problems such as part-of-speech tagging and noun-phrase chunking. HMMs have been used widely for speech emotion recognition thanks to their dynamic time-warping capability, that is, their ability to estimate the similarity between two temporal sequences which may vary in time or speed. The HMM is a powerful statistical tool for modelling generative sequences that can be characterised by an underlying process generating an observable sequence. HMMs have found application in many areas of signal processing, in particular speech processing, but have also been applied with success to low-level NLP tasks such as part-of-speech tagging, phrase chunking and extracting target information from documents. Andrei Markov gave his name to the mathematical theory of Markov processes in the early twentieth century, but it was Baum and his colleagues who developed the theory of HMMs in the 1960s. However, the classification ability of the HMM is not always satisfactory.

2) Gaussian Mixture Model (GMM). GMMs are suitable for developing an emotion recognition model when a large number of feature vectors is available. Gaussian Mixture Models (GMMs) [3] are among the most statistically mature methods for clustering and density estimation; the GMM and the HMM are the most widely used classifiers for speech emotion recognition. A Gaussian Mixture Model is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract-related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm. GMMs are often used in biometric systems, most notably in speaker recognition systems, because of their capability of representing a large class of sample distributions. One of the powerful attributes of the GMM is its ability to form smooth approximations to arbitrarily shaped densities. The classical uni-modal Gaussian model represents feature distributions by a position (mean vector) and an elliptic shape (covariance matrix), while a vector quantizer (VQ) or nearest-neighbour model represents a distribution by a discrete set of characteristic templates. A GMM acts as a hybrid between these two models by using a discrete set of Gaussian functions, each with its own mean and covariance matrix, to allow better modelling capability. The GMM not only provides a smooth overall distribution fit; its components also clearly detail the multi-modal nature of the density. The use of a GMM for representing feature distributions in a biometric system may also be motivated by the intuitive notion that the individual component densities model some underlying set of hidden classes. For example, in speaker recognition it is reasonable to assume that the acoustic space of spectral features corresponds to a speaker's broad phonetic events, such as vowels, nasals or fricatives; these acoustic classes reflect some general speaker-dependent vocal tract configurations that are useful for characterizing speaker identity.
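To make the GMM description above concrete before turning to neural networks, here is a minimal sketch (not from the paper) that fits one mixture of Gaussians per emotion class with the EM algorithm, using scikit-learn's GaussianMixture on synthetic two-dimensional features, and classifies a new vector by the highest log-likelihood:

```python
# One GMM per class, parameters estimated by EM; classification by maximum likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Synthetic 2-D feature vectors for two hypothetical emotion classes.
train = {
    "anger":   rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    "sadness": rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2)),
}

# Means, covariances and mixture weights of each class model are estimated by EM.
models = {label: GaussianMixture(n_components=4).fit(feats)
          for label, feats in train.items()}

def classify(x):
    """Pick the class whose mixture assigns the highest log-likelihood to x."""
    scores = {label: gmm.score(x.reshape(1, -1)) for label, gmm in models.items()}
    return max(scores, key=scores.get)

print(classify(np.array([2.8, 3.1])))   # expected: "sadness"
```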
3) Artificial Neural Network (ANN). Another common classifier, used for many pattern recognition applications, is the artificial neural network (ANN) [4]. There is a large number of different types of networks, but they are all characterized by the same components: a set of nodes, and connections between nodes. The nodes can be seen as computational units: they receive inputs and process them to produce an output. This processing might be very simple (such as summing the inputs) or quite complex (a node might contain another network). The connections determine the information flow between nodes; they can be unidirectional, when the information flows in only one direction, or bidirectional, when the information flows in either direction. The interaction of nodes through the connections leads to a global behaviour of the network which cannot be observed in its individual elements. This global behaviour is said to be emergent: the abilities of the network supersede those of its elements, making networks a very powerful tool. Networks are used to model a wide range of phenomena in physics, computer science, biochemistry, mathematics, sociology, economics, telecommunications and many other areas. This is because many