Extraction of User Preference for Video Stimuli Using EEG-Based User Responses

Jinyoung Moon, Youngrae Kim, Hyungjik Lee, Changseok Bae, and Wan Chul Yoon

ETRI Journal, Volume 35, Number 6, December 2013. http://dx.doi.org/10.4218/etrij.13.0113.0194

Owing to the large number of video programs available, a method for accessing preferred videos efficiently through personalized video summaries and clips is needed. The automatic recognition of user states when viewing a video is essential for extracting meaningful video segments. Although there have been many studies on emotion recognition using various user responses, electroencephalogram (EEG)-based research on preference recognition for videos is at a very early stage. This paper proposes classification models based on linear and nonlinear classifiers using EEG features of band power (BP) values and asymmetry scores for four preference classes. As a result, the quadratic-discriminant-analysis-based model using BP features achieves a classification accuracy of 97.39% (±0.73%), and the models based on the other nonlinear classifiers using the BP features achieve accuracies of over 96%, which is superior to previous work that addressed only binary preference classification. These results show that the proposed approach is accurate enough to be employed for personalized video segmentation.

Keywords: Preference, video, EEG, classification, feature selection, brain-computer interface.

Manuscript received Feb. 28, 2013; revised Aug. 20, 2013; accepted Sept. 16, 2013. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2010-0028631, 13SS1110).

Jinyoung Moon (phone: +82 42 860 6712, jymoon@etri.re.kr) is with the Software Research Laboratory, ETRI, Daejeon, Rep. of Korea, and is also affiliated with KAIST, Daejeon, Rep. of Korea. Youngrae Kim (youngrae@etri.re.kr), Hyungjik Lee (leehj@etri.re.kr), and Changseok Bae (csbae@etri.re.kr) are with the Software Research Laboratory, ETRI, Daejeon, Rep. of Korea. Wan Chul Yoon (corresponding author, wcyoon@kaist.ac.kr) is with the Department of Industrial and Systems Engineering, KAIST, Daejeon, Rep. of Korea.

I. Introduction

There is a flood of video content from the Internet and from the many television channels available through terrestrial, cable, and satellite systems. The average length of video viewing per person exceeds twenty hours per week in some European countries and in the United States of America [1]. There is an increasing tendency for viewers to re-watch impressive sections of videos, recall memorable information, or share interesting parts with their acquaintances. Therefore, people need video clips or summaries of the video segments that are meaningful to them in order to access viewed videos efficiently. It is cumbersome for the average viewer to extract significant video segments or annotate them manually with a video editing tool. Therefore, the automatic recognition of user states during video viewing is essential for automatically generating personalized clips and summaries.

There have been some studies on the automatic recognition of user states, such as interest and attention, by analyzing user responses to viewed videos. To measure the degree of user interest in viewed videos, such studies used the fact that users show their current feelings through facial expressions, body movements, and various peripheral responses [2]-[5]. Joho and others [2] proposed facial expression models based on three pronounced levels of facial expressions and their change rate.
Although facial expressions can be obtained unobtrusively from an observing camera, it is hard for people to interpret facial expressions accurately, and people generally maintain neutral facial expressions. Peng and others [3] proposed an interest meter for measuring the interest score of a user for home videos by analyzing a captured video containing the user's upper body. Their framework recognized facial expressions for its emotion model and detected head motions, blinks, and saccades for its attention model. Because people do not always give significant facial expressions while watching home videos, they added an analysis of eye and head movements. However, although the interest scores were relatively high in the funny/attentive video segments during their experiment, the highs and lows of the interest score could not segregate the entire video into funny/attentive and other segments.

Money and Agius [4], [5] presented an analysis framework for processing peripheral signals, including electrodermal response, respiration amplitude, respiration rate, blood volume pulse, and heart rate. Peripheral signals reflect changes in user states more delicately than facial expressions and body movements do. The framework clarified the characteristics of each peripheral signal for each genre of film and suggested a rank value representing the significance of the user response. However, the rank values were useful only for summarizing videos in the comedy and horror genres, which arouse noticeably significant user responses.

An electroencephalogram (EEG) is a recorded signal of the electrical activity of the brain influenced by the central nervous system. This type of signal is recorded from the scalp and differs from a peripheral signal, which is influenced by the autonomic nervous system. Because the brain has been recognized as the center of cognitive activities in cognitive science and psychology [6], there have been many studies on analyzing user states through EEG signals [7]-[20]. For emotion recognition, these studies provided subjects with emotion-inducing visual or auditory stimuli, that is, images, sounds, music excerpts, and video clips [7]-[18]. They classified the collected EEG signal segments into selected primitive emotion types, which are distributed over the valence-arousal space of the two-dimensional emotion model [21]. The EEG-based methods for emotion recognition achieved similar or better classification accuracy compared with methods based on peripheral signals [16]-[18]. For attention measurement, studies classified EEG signals into concentrated and non-concentrated states by having subjects perform concentration and non-concentration tasks [19] or hold a focused and an unfocused gaze [20].

Extracting meaningful video segments requires a method for measuring the preference toward videos, under the assumption that preferred video segments are more significant to viewers than other segments. However, the proposed methods for analyzing user states regarding emotion and attention [7]-[20] are not sufficient for measuring user preference toward videos because induced positive-valence and high-arousal emotions do not always create high preference.
For example, many people are attracted to sad music or sad movies, which evoke negative-valence emotions. In addition, horror movies are popular with some people but not with others. Some people prefer negative emotions in music or videos for various reasons, and all people have their own optimal arousal levels at which they feel comfortable [22], [23].

Compared with the many studies on EEG-based emotion recognition, research on EEG-based recognition of user preference is at a very early stage. Aurup [24] targeted product preference induced by product images; product preference that induces purchases is closely related to strong pleasantness, corresponding to positive-valence and high-arousal emotions. Hadjidimitriou and Hadjileontiadis [25] performed an EEG-based preference experiment using music as audio stimuli. Koelstra and others [17] first achieved an average accuracy of 57.9% for like/dislike binary classification using two-minute clips of music videos as stimuli and EEG samples collected from five subjects. In a subsequent publication, Yazdani and others [18] achieved an average accuracy of 70.25% using the same stimuli and samples, which was better than their previous result. However, their method was designed only for binary preference, and an accuracy of 70.25% is not satisfactory for personalized video segmentation in video applications.

Therefore, we focus on an approach that assesses not the emotion aroused but the preference for video stimuli by analyzing EEG signals collected during video viewing. We propose classification models based on linear and nonlinear classifiers using band power (BP) and asymmetry score (AS) features, targeted at the following four preference classes: most preferred, preferred, less preferred, and least preferred. All models based on nonlinear classifiers using the BP features of all the frequency bands achieve over 96% accuracy, which is an outstanding performance compared with previous work [17], [18]. Additionally, the accuracy is maintained in models using only 43% to 70% of all BP features, reduced by a filter method for feature selection. The results show that the proposed approach is sufficient to be employed for personalized video segmentation with high accuracy and classification power.

II. EEG-Based Methods for Extracting Preference

The proposed methods of this study follow the typical procedure for EEG signal analysis, as shown in Fig. 1.

Fig. 1. Typical procedure for classifying user states using EEG signals: EEG data acquisition produces raw EEG, EEG data preprocessing produces preprocessed EEG, feature extraction produces a feature vector (FV), feature reduction produces a reduced FV, and classification outputs a class.
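To make the flow in Fig. 1 concrete, the following is a minimal Python sketch of the pipeline. The function names and placeholder bodies are illustrative assumptions, not the authors' implementation; each stage is detailed in the subsections that follow.

```python
import numpy as np

def preprocess(raw_eeg: np.ndarray, fs: int = 128) -> np.ndarray:
    # Placeholder for baseline removal and band-pass filtering (Section II.2).
    return raw_eeg - raw_eeg.mean(axis=1, keepdims=True)

def extract_features(eeg: np.ndarray, fs: int = 128) -> np.ndarray:
    # Placeholder for band power (BP) and asymmetry score (AS) features (Section II.3).
    return np.log(np.var(eeg, axis=1) + 1e-12)

def reduce_features(fv: np.ndarray) -> np.ndarray:
    # Placeholder for filter-based feature selection.
    return fv

def classify(fv: np.ndarray, model) -> int:
    # Placeholder for one of the trained classifiers (Section II.4).
    return int(model.predict(fv.reshape(1, -1))[0])

def predict_preference(raw_eeg: np.ndarray, model, fs: int = 128) -> int:
    """Raw EEG -> preprocessed EEG -> feature vector -> reduced FV -> class."""
    return classify(reduce_features(extract_features(preprocess(raw_eeg, fs), fs)), model)
```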
1. EEG Data Acquisition

EEG data is collected from fifteen healthy right-handed subjects (eight males and seven females) whose ages range from 23 to 43. All of the subjects are graduate students or workers at a research institute.

The video stimuli are designed to establish ground truth labels on video datasets for measuring user preferences. Television music shows are selected as the source of the video dataset because they can be definitively segmented with a song as the unit, and the segmented songs can be clearly ranked by each viewer according to his or her own preference. Twenty-one Korean pop songs, each around three minutes long, are extracted from music show programs of terrestrial television networks at full HD resolution (1920 × 1080). The video dataset contains songs of different genres, including ballad, dance, pop rock, hip hop, electro pop, and swing.

The experimental protocol of this study is divided into pre-inspection and experimental stages, as shown in Fig. 2.

Fig. 2. Experimental procedure consisting of a pre-inspection stage for personalized video stimuli and an experimental stage for EEG data and preference labels. Pre-inspection stage: provide the list of songs and a downscaled version of the videos, receive each subject's responses naming the 3 best and 3 worst songs, select the 2 best and 2 worst songs from the responses, and add 2 other songs excluded from the responses, yielding the 6-song video stimuli. Experimental stage: wear the EEG headset, check whether the EEG signals are good, view the entire personalized video stimuli (recorded EEG data), and assign preference classes to the stimuli (preference labels).

The pre-inspection stage is necessary to generate personalized visual stimuli reflecting subject preference for the video dataset. First, subjects receive an explanation of the goal of the pre-inspection and the experiment. Second, each subject completes a questionnaire to select the three best and three worst songs out of the twenty-one videos. The list of the twenty-one songs, including their titles and singers, and a downscaled version (480 × 320) of the twenty-one videos are provided to the subjects; subjects who are unfamiliar with the latest Korean pop songs can use the downscaled version to preview all videos. Third, the video stimuli include the two best and two worst songs chosen from the questionnaire responses, giving a well-distributed preference level among the stimuli. Fourth, two other songs are added, selected randomly from the remaining fifteen songs excluded from the questionnaire responses. Finally, the personalized video stimuli consist of six songs out of the twenty-one songs of the video dataset; a minimal sketch of this selection step is given below.

In the experimental stage, only the subject and two operators of the experiment are allowed to access the laboratory. While wearing an EEG headset, the subject receives explanations of the goal of the experiment, the EEG device, the experimental procedure, the time required, and the personalized stimuli. The purpose of the explanations is to reduce the subject's anxiety about an experiment using an unfamiliar EEG device. In addition, the subject is instructed to concentrate on viewing the video stimuli in a nearly static position. After checking whether the EEG signals from all electrodes of the EEG headset are sufficient, the experimenter has the subject watch a 33-second introduction of a music show and then six video segments corresponding to six Korean pop songs, as shown in Fig. 3. As there are no intermissions or neutral videos between the video segments, it takes about 20 minutes for each subject to finish the experiment. The EEG data of the subject is gathered from the EEG headset while the subject views the personalized stimuli.

Fig. 3. Experimental environment. While a subject wearing the EEG headset views the personalized video stimuli, raw EEG data from the EEG headset is accumulated through wireless transmission to a USB receiver.
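The following is a minimal sketch, in Python, of how the six-song stimuli described above could be assembled from the questionnaire responses. The function name, song identifiers, and seeded random generator are illustrative assumptions, not part of the paper.

```python
import random

def build_personalized_stimuli(all_songs, best_three, worst_three, seed=None):
    """Assemble the six-song stimuli: the 2 best and 2 worst songs from the
    questionnaire responses plus 2 songs drawn at random from the remaining
    fifteen songs."""
    rng = random.Random(seed)
    best_two, worst_two = best_three[:2], worst_three[:2]
    remaining = [s for s in all_songs if s not in set(best_three) | set(worst_three)]
    return best_two + worst_two + rng.sample(remaining, 2)

# Example with 21 placeholder song identifiers.
songs = [f"song{i:02d}" for i in range(1, 22)]
stimuli = build_personalized_stimuli(
    songs, ["song03", "song07", "song11"], ["song02", "song18", "song20"], seed=0)
```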
As shown in Fig. 4, the EEG device collects the subject's EEG signals from the following fourteen channels of wet electrodes placed on the scalp: AF3, AF4, F7, F8, F3, F4, FC5, FC6, T7, T8, P7, P8, O1, and O2. EEG signals are recorded by the EEG device at a sampling rate of 128 Hz. The bandwidth of the recorded EEG signals is from 0.2 Hz to 45 Hz, obtained using a 0.16-Hz high-pass filter, an 83-Hz low-pass filter, and 50-Hz and 60-Hz notch filters. The notch filters are employed to remove power line artifacts. The environmental artifacts caused by the amplifier and by aliasing are removed by the hardware [27].

Fig. 4. Placement of the fourteen electrodes used, according to the international 10-20 system, an internationally recognized method for describing the locations of electrodes on the scalp for EEG experiments or applications [26].

To employ the proposed approach when extracting video segments meaningful to users, this study subdivides the binary preference into two-level positive and two-level negative preferences. One of the ground truth labels (most preferred, preferred, less preferred, or least preferred) is assigned to each video stimulus by each subject immediately after viewing all of the video stimuli. The preference for a video is influenced by both the video itself and the context of the videos presented together with it, as well as the sequential order among them, in the general environment of video viewing. Accordingly, the subject can accurately assign a preference label to each song according to the degree of preference, not locally but globally, after viewing all stimuli. There is no heavy cognitive load on the subject when assigning preference labels because the subject is already familiar with the four songs chosen from the questionnaire responses among the personalized stimuli. Furthermore, the total number of video stimuli does not exceed the capacity of the subject's working memory [28].

2. EEG Data Preprocessing

EEG signal preprocessing prepares the recorded EEG data for feature extraction and further analysis. As shown in Fig. 5, the 33-second introduction of the music show and the one-second start and end of each video segment are excluded from the recorded EEG data.

Fig. 5. Preprocessing EEG data. The 33-second introduction of the music show and the one-second start and end of each of the six video segments along the time axis are excluded. Each video segment is assigned to one preference class, that is, most preferred, preferred, less preferred, or least preferred.

The preference classes of the remaining EEG data are assigned according to the results of the questionnaire completed by the subject. In baseline removal, the mean of the EEG signal from each channel is subtracted from that channel's signal so that all EEG signal values across the channels are distributed around zero. The biological artifacts resulting from blinks and eyeball movements are avoided by excluding artifact-influenced frequency bands in the feature extraction.
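A minimal sketch of the segment trimming and baseline-removal steps described above, assuming each recording is held as a channels × samples NumPy array sampled at 128 Hz; the function names are illustrative.

```python
import numpy as np

def trim_segment(eeg: np.ndarray, fs: int = 128) -> np.ndarray:
    """Drop the first and last second of a video segment (eeg: channels x samples)."""
    return eeg[:, fs:-fs]

def remove_baseline(eeg: np.ndarray) -> np.ndarray:
    """Subtract each channel's mean so that all channel values are centered around zero."""
    return eeg - eeg.mean(axis=1, keepdims=True)

# Usage: centered = remove_baseline(trim_segment(segment_eeg))
```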
To prevent artifacts caused by muscle and eyeball movements, this study instructs subjects to view the video stimuli in front of the television in a nearly static position. Because it is impossible to prevent all biological artifacts from live subjects, additional effort is necessary to minimize the influence of the biological artifacts after the EEG recording stage. The influence of biological artifacts can be reduced by automatic or manual artifact removal or by the rejection of artifact-related frequency bands. Artifacts from the heartbeat appear at around 1.2 Hz, artifacts from blinks and eyeball movements appear below 4 Hz, and artifacts from muscle movements are most dominant above 50 Hz [7], [11]. This study rejects the frequency bands affected by such artifacts because time-consuming artifact removal is inappropriate for real-time signal analysis. The EEG values of each channel are filtered by a Butterworth filter with a passband between 4 Hz and 40 Hz, which is an artifact-resilient frequency range. The Butterworth filter is employed because it is designed to have as flat a frequency response as possible in the passband [29].

3. Feature Extraction

One of the most common approaches for investigating EEG data is to analyze the activated and inactivated power values, or their changes, in significant frequency bands, namely delta (<4 Hz), theta (4 Hz to 7 Hz), alpha (8 Hz to 13 Hz), beta (14 Hz to 30 Hz), and gamma (>30 Hz) [30]. The frequency range of each band varies slightly across the EEG literature. When the EEG data is separated according to these frequency ranges, the bands provide more information on the neural activities of healthy people or signs of psychological disorders. In this study, four frequency bands in the range of 4 Hz to 40 Hz are selected as follows: theta (4 Hz to 7 Hz), alpha (8 Hz to 13 Hz), beta (14 Hz to 30 Hz), and gamma (31 Hz to 40 Hz).

The total power values of each frequency band and each channel are typically extracted as features in the frequency domain by using a discrete Fourier transform (DFT). The DFT coefficient X(k) of a one-second non-overlapping EEG segment, whose length is N, is obtained by (1); the commonly used fast Fourier transform algorithm is adopted. The power spectrum is calculated as the square of the absolute value of X(k) [31].

X(k) = \sum_{n=0}^{N-1} x(n) \exp\!\left(-j \frac{2\pi}{N} k n\right), \quad k = 0, 1, \ldots, N-1.   (1)

For the four frequency bands of each channel, the total power values are obtained as the sum of the power values within the range of each frequency band. The total power of each band is transformed to a natural log scale because the power values are usually positively skewed [32].

In addition, there are features derived from the asymmetry given by the spectral difference between the power values of EEG electrodes that constitute a symmetrical pair. There have been many studies on the relationship between hemispheric asymmetry and emotion since 1979 [33]. In [32], these features were extracted using (2), wherein P_R and P_L are the right and left EEG power spectra of a symmetric pair, respectively.

\text{Asymmetry score} = \ln(P_R) - \ln(P_L).   (2)

For the four frequency bands of each symmetric channel pair, the ASs are also extracted as features. All extracted features are listed in Table 1.

Table 1. Extracted features and number of features.

Feature                  No. of features
BP                       56 = 14 channels × 4 bands
AS                       28 = 7 channel pairs × 4 bands
BPAS (both BP and AS)    84 = (14 channels + 7 pairs) × 4 bands
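To illustrate the filtering and feature-extraction steps, the following sketch band-pass filters a recording and computes the 56 log band-power features and 28 asymmetry scores for a one-second segment, assuming NumPy and SciPy. The Butterworth filter order, the left/right electrode pairing, and the small constant added before the logarithm are assumptions not specified in the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128  # sampling rate (Hz)
BANDS = {"theta": (4, 7), "alpha": (8, 13), "beta": (14, 30), "gamma": (31, 40)}
CHANNELS = ["AF3", "AF4", "F7", "F8", "F3", "F4", "FC5", "FC6",
            "T7", "T8", "P7", "P8", "O1", "O2"]
# Assumed standard left/right pairing of the fourteen electrodes (7 pairs).
PAIRS = [("AF3", "AF4"), ("F7", "F8"), ("F3", "F4"), ("FC5", "FC6"),
         ("T7", "T8"), ("P7", "P8"), ("O1", "O2")]

def bandpass_4_40(eeg: np.ndarray, fs: int = FS, order: int = 4) -> np.ndarray:
    """Butterworth band-pass filter with a 4 Hz to 40 Hz passband
    (eeg: channels x samples; the filter order is an assumption)."""
    b, a = butter(order, [4.0 / (fs / 2), 40.0 / (fs / 2)], btype="band")
    return filtfilt(b, a, eeg, axis=1)

def log_band_powers(segment: np.ndarray, fs: int = FS) -> np.ndarray:
    """Natural-log total power per channel and band for a one-second,
    non-overlapping segment, computed from the FFT power spectrum (eq. (1))."""
    freqs = np.fft.rfftfreq(segment.shape[1], d=1.0 / fs)
    power = np.abs(np.fft.rfft(segment, axis=1)) ** 2
    features = []
    for ch in range(segment.shape[0]):
        for lo, hi in BANDS.values():
            band = (freqs >= lo) & (freqs <= hi)
            features.append(np.log(power[ch, band].sum() + 1e-12))
    return np.asarray(features)  # 14 channels x 4 bands = 56 BP features

def asymmetry_scores(segment: np.ndarray, fs: int = FS) -> np.ndarray:
    """Asymmetry score = ln(P_R) - ln(P_L) per band for each symmetric pair (eq. (2))."""
    bp = log_band_powers(segment, fs).reshape(len(CHANNELS), len(BANDS))
    idx = {ch: i for i, ch in enumerate(CHANNELS)}
    return np.concatenate([bp[idx[r]] - bp[idx[l]] for l, r in PAIRS])  # 28 AS features
```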
Signal smoothing of the extracted feature values is aimed at mitigating fluctuations of the power values and reducing the remaining noise. An autoregressive moving average filter with a five-second window is employed to smooth the power values. Finally, all smoothed features are normalized with the standard score before classification. Normalization by the standard score generates new feature values, for each frequency band of each channel, with a mean of 0 and a standard deviation of 1. The normalization is required to reduce the dependency on features with high values or a large margin.

4. Classification

This study employs five classifiers for recognizing the four preference classes (most preferred, preferred, less preferred, and least preferred) and compares the classification accuracies of the classifiers used. The classifiers include three nonlinear classifiers, that is, k nearest neighbors, a support vector machine (SVM) with a radial basis function (RBF) kernel, and quadratic discriminant analysis (QDA), and two linear classifiers, that is, an SVM with a linear kernel and linear discriminant analysis (LDA).

The k-nearest-neighbor (k-NN) algorithm is an intuitive method that assigns a test instance to the dominant class among its k nearest neighbors in Euclidean distance using a majority vote [11], [14], [31], [34]. This study selects a value of 4 for k, a small positive integer, considering the maximum number of BPAS features and the curse of dimensionality in k-NN.

An SVM is one of the most popular supervised learning techniques for classification and regression [8]-[11], [13], [15], [27], [31], [34], [35]. An SVM performs classification with an N-dimensional separating hyperplane constructed to minimize the classification errors on the training set and maximize its margins. To compare the performance of linear and nonlinear classifiers, this study employs two SVMs, one with a linear kernel and one with an RBF kernel. An SVM with a linear kernel requires the value of the cost parameter, and an SVM with an RBF kernel needs both the cost and γ parameters. This study uses the default values for the parameters provided by the employed SVM library, which are cost = 1 and γ = 1/(data dimension), without tuning the parameters by cross-validation to determine the best combination of cost and γ, because such tuning is a time-consuming process inadequate for real-time analysis.

LDA is a method that finds a linear combination of features to express a categorical dependent variable, such as a class, in contrast to the numerical dependent variable of a linear regression analysis. Because of its low computational requirements, LDA has commonly been used in systems using EEG signals [10], [14], [31]. QDA [10], [11], [31] is a more general version of LDA that uses a quadratic decision surface and does not assume that all classes share a common covariance matrix. In both
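A minimal sketch of the classifier comparison described in this subsection, assuming scikit-learn. The cross-validation setup, the feature matrix X (smoothed, standardized feature vectors), and the label vector y are placeholders rather than the authors' evaluation protocol; the SVM parameters follow the defaults named in the text (cost = 1, γ = 1/(data dimension)), and k = 4 for k-NN.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def compare_classifiers(X: np.ndarray, y: np.ndarray, folds: int = 10) -> dict:
    """Return the mean cross-validated accuracy of the five classifiers.
    X: feature vectors (e.g., 56 BP or 84 BPAS features per one-second segment);
    y: one of the four preference classes per segment."""
    models = {
        "4-NN": KNeighborsClassifier(n_neighbors=4),
        "SVM (linear)": SVC(kernel="linear", C=1.0),
        "SVM (RBF)": SVC(kernel="rbf", C=1.0, gamma=1.0 / X.shape[1]),
        "LDA": LinearDiscriminantAnalysis(),
        "QDA": QuadraticDiscriminantAnalysis(),
    }
    return {name: cross_val_score(model, X, y, cv=folds).mean()
            for name, model in models.items()}
```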