Why We Watch the News: A Dataset for Exploring Sentiment in Broadcast Video News


ACM International Conference on Multimodal Interaction 2014


Joseph G. Ellis, Brendan Jou, and Shih-Fu Chang
Digital Video and Multimedia Lab
Columbia University


[paper]    [poster]



We present a multimodal sentiment study performed on a novel collection of videos mined from broadcast and cable television news programs. To the best of our knowledge, this is the first dataset released for studying sentiment in the domain of broadcast video news. We describe our algorithm for the processing and creation of person-specific segments from news video, yielding 929 sentence-length videos that are annotated via Amazon Mechanical Turk. The spoken transcript and the video content itself are each annotated for their expression of positive, negative or neutral sentiment.

Based on these gathered user annotations, we demonstrate for news video the importance of taking multimodal information into account for sentiment prediction, and in particular challenge previous text-based approaches that rely solely on available transcripts. We show that as much as 21.54% of users' sentiment annotations for transcripts differ from their respective sentiment annotations when the video clip itself is presented. We present audio and visual classification baselines over a three-way sentiment prediction of positive, negative and neutral, and examine how person-dependent versus person-independent classification influences performance. Finally, we release the dataset to the greater research community.




Although text-based sentiment analysis has been studied in great detail recently, video-based sentiment analysis is still somewhat in its infancy. Much opinion mining analysis is done in domains that have heavily polarized lexicons and obvious sentiment polarity. For example, a very popular domain for sentiment analysis is movie and product reviews, where the available text is heavily polarized and there is little room for ambiguity. Statements like "I absolutely loved this movie" or "the acting was terrible" have very clear and polarized sentiment that can be attributed to them.

However, in more complicated domains, such as news video transcripts or news articles, the sentiment attached to a statement can be much less obvious. For example, take a statement that has been relevant in the news in the past year: "Russian troops have entered into Crimea". This statement by itself is not polarized as positive or negative and is in fact quite neutral. However, if it were stated by a U.S. politician it would probably have very negative connotations, and if stated by a Russian politician it could have a very positive sentiment associated with it. Therefore, in more complicated domains such as news, the text content is often not sufficient to determine the sentiment of a particular statement. For some ambiguous statements it is important to take into account the way the words are spoken (audio) and the gestures and facial expressions (visual) that accompany the sentence in order to determine the sentiment of the statement more accurately.




The specific contributions of this work are as follows: (1) the release of a video dataset in the novel domain of video news, annotated for multimodal sentiment; (2) a study demonstrating the importance of the audio and visual components of a statement in determining sentiment in news video; (3) baseline audio and visual classifiers and experiments for the dataset; and (4) experiments demonstrating improved performance with person-centric audio and visual sentiment classification models compared to global models.



News Rover Sentiment Dataset


Mining News Speaker Segments: Each speaker segment is mined by first performing audio speaker diarization, which segments the audio portion of a news program into speech segments. These speech segments are further refined using the closed caption transcript. We then automatically mine the locations where names of speakers appear on screen, and apply the names to the speech segments. We link the names from similar speech segments together to obtain the named speech segments. Finally, we cut the speech segments into single-sentence videos, which are the basic components of our study.
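The naming and linking steps described above can be illustrated with a small sketch. It is not the authors' implementation: the SpeechSegment and NameAppearance types, the cluster field (standing in for "similar speech segments"), and the rule that a name labels the segment playing when it appears on screen are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SpeechSegment:
    start: float  # segment start, in seconds
    end: float  # segment end, in seconds
    cluster: int  # hypothetical diarization cluster id (assumption)
    name: Optional[str] = None


@dataclass
class NameAppearance:
    time: float  # second at which an on-screen name was detected (assumption)
    name: str


def assign_names(segments: List[SpeechSegment], names: List[NameAppearance]) -> None:
    """Label segments from on-screen names, then share labels within a cluster."""
    # Step 1: a name detected while a segment is playing labels that segment.
    for seg in segments:
        for app in names:
            if seg.start <= app.time <= seg.end:
                seg.name = app.name
                break
    # Step 2: remember the first name found for each cluster.
    cluster_names: Dict[int, str] = {}
    for seg in segments:
        if seg.name is not None:
            cluster_names.setdefault(seg.cluster, seg.name)
    # Step 3: unlabeled segments inherit their cluster's name, if one exists.
    for seg in segments:
        if seg.name is None:
            seg.name = cluster_names.get(seg.cluster)
```

In this sketch, a segment whose cluster never coincides with an on-screen name stays unnamed (name remains None).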




Gathering Sentiment Annotations: We gathered annotations for each of the videos via Amazon Mechanical Turk. We first presented the spoken sentence to the user in text form, and had them annotate the sentiment of the sentence. Then we had the turkers watch the video and reannotate the sentiment. We found that in 21.54% of the annotations we collected, users changed their sentiment annotation between reading the text and watching the video. A screenshot of the Human Intelligence Task that we created can be seen below.
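As a rough illustration of how a disagreement rate like the one reported above can be computed, the sketch below counts the fraction of (text label, video label) pairs that differ. The pair representation and the function name changed_fraction are assumptions for illustration, and the toy labels are not taken from the dataset.

```python
from typing import List, Tuple

LABELS = {"positive", "negative", "neutral"}


def changed_fraction(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (text_label, video_label) pairs whose two labels differ."""
    if not pairs:
        raise ValueError("no annotations given")
    for text_label, video_label in pairs:
        if text_label not in LABELS or video_label not in LABELS:
            raise ValueError(f"unknown label in pair: {(text_label, video_label)}")
    changed = sum(1 for text_label, video_label in pairs if text_label != video_label)
    return changed / len(pairs)


# Toy annotations: each pair is one worker's label for the transcript,
# followed by the same worker's label after watching the clip.
toy_pairs = [
    ("neutral", "neutral"),
    ("neutral", "negative"),
    ("positive", "positive"),
    ("negative", "negative"),
]
toy_rate = changed_fraction(toy_pairs)
```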



Annotation Statistics: The videos used for this dataset were recorded and processed between August 13, 2013 and December 25, 2013, and are taken from a large variety of American news programs and channels. A breakdown of the dataset by person, with their occupation and number of videos within the study, can be seen below. We limit the length of the videos used in the study to between 4 and 15 seconds. This is done because it can be difficult to decipher sentiment in very short videos with little speech, and videos longer than 15 seconds could contain multiple statements with opposing sentiment. In total we collected 929 sentence-length videos with annotations.
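The length constraint can be written as a simple filter. This is only a sketch: treating both bounds as inclusive and representing each clip as a (clip_id, duration) pair are assumptions, since the text does not specify either.

```python
from typing import List, Tuple

MIN_SECONDS = 4.0  # lower bound from the text; inclusiveness is an assumption
MAX_SECONDS = 15.0  # upper bound from the text; inclusiveness is an assumption


def filter_by_duration(clips: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Keep (clip_id, duration_seconds) pairs whose duration is within bounds."""
    return [
        (clip_id, seconds)
        for clip_id, seconds in clips
        if MIN_SECONDS <= seconds <= MAX_SECONDS
    ]
```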




Feature Extraction: We perform feature extraction from the audio and visual components of each sentence-length video. We extract audio features using the openSMILE audio emotion feature extraction toolkit. We use a template matching approach over the mouth region of consecutive subsampled frames in a video to detect which, if any, of the people visible on screen are speaking. From the detected speakers we extract LBP-based bag-of-words (BoW) features from each of the frames. For a more detailed explanation of our feature extraction pipeline, please see the published paper. The entire video processing pipeline from start to finish can be seen below.
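To make the LBP (local binary pattern) part of the visual features concrete, here is a minimal sketch of the basic 8-neighbour LBP code and a normalized histogram over a grayscale patch. It is not the paper's pipeline: the neighbour ordering, the use of plain nested lists for the image, and stopping at a histogram (without the bag-of-words codebook step) are simplifying assumptions.

```python
from typing import List, Sequence

# Offsets of the 8 neighbours, clockwise starting from the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img: Sequence[Sequence[int]], r: int, c: int) -> int:
    """8-bit LBP code of the interior pixel (r, c): one bit per neighbour."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        # A neighbour at least as bright as the center sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img: Sequence[Sequence[int]]) -> List[float]:
    """Normalized 256-bin histogram of LBP codes over all interior pixels."""
    rows, cols = len(img), len(img[0])
    if rows < 3 or cols < 3:
        raise ValueError("image must be at least 3x3")
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            hist[lbp_code(img, r, c)] += 1
    total = (rows - 2) * (cols - 2)
    return [count / total for count in hist]
```

A histogram like this could be computed per face region and then quantized against a learned codebook to obtain bag-of-words features; that step is omitted here.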




Experiments and Results

We conduct experiments on the News Rover Sentiment dataset presented above. We trained both global and person-specific sentiment models using linear SVMs and attempted to predict the sentiment within each video. We chose 4-fold cross-validation as our evaluation protocol, and report in this section the average accuracy across the 4 folds. For the original dataset we tested all 929 videos with the extracted audio features, but we were only able to automatically detect speaking faces in 455 videos. Therefore, we tested the visual features using only these 455 videos. To provide a consistent dataset on which audio and video can be tested and compared against each other, we also went through each of the 929 videos and identified the "clean" videos in the dataset. The clean videos are those that have the correct name applied and in which the speaker can be seen speaking. The clean dataset is composed of 650 videos.
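The evaluation protocol (average accuracy over 4 folds) can be sketched as follows. This is an illustration under stated assumptions, not the released code: the strided fold assignment and the helper names are hypothetical, and majority_trainer is a trivial stand-in for the linear SVM so that the example stays dependency-free.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, sentiment label)
Model = Callable[[Sequence[float]], str]


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split range(n) into k folds by striding (an assumed split scheme)."""
    return [list(range(i, n, k)) for i in range(k)]


def majority_trainer(train: List[Sample]) -> Model:
    """Stand-in for SVM training: always predicts the most common train label."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: most_common


def cross_validate(samples: List[Sample], train_fn: Callable[[List[Sample]], Model], k: int = 4) -> float:
    """Average held-out accuracy across k folds."""
    if len(samples) < k:
        raise ValueError("need at least k samples")
    accuracies = []
    for fold in k_fold_indices(len(samples), k):
        held_out = set(fold)
        train = [s for i, s in enumerate(samples) if i not in held_out]
        model = train_fn(train)
        correct = sum(1 for i in fold if model(samples[i][0]) == samples[i][1])
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k
```

Under this sketch, a global model would call cross_validate on all samples, while a person-specific model would call it separately on each person's subset; swapping majority_trainer for a function that fits a linear SVM would bring it closer to the setup described above.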


Original News Rover Sentiment Dataset Results:



"Clean" News Rover Sentiment Dataset Results:




Data and Code

We provide links to each of the News Rover dataset videos, along with the features that we extracted from each video. We also provide the clean and original versions of the dataset with aggregated annotations for each video. In addition to the data, we provide code to replicate the results shown above. For access to the dataset, please fill out the following information. You will receive an email with a file link.

Full Name:


[Thanks to Yong Jae Lee for the webpage template]