
Conference Paper: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis

Title: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis
Authors: Sun, Zhongkai; Sarma, Prathusha K.; Sethares, William A.; Liang, Yingyu
Issue Date: 2020
Citation: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020, p. 8992-8999
Abstract: Multimodal language analysis often considers relationships between features based on text and those based on acoustical and visual properties. Text features typically outperform non-text features in sentiment analysis or emotion recognition tasks in part because the text features are derived from advanced language models or word embeddings trained on massive data sources while audio and video features are human-engineered and comparatively underdeveloped. Given that the text, audio, and video are describing the same utterance in different ways, we hypothesize that the multimodal sentiment analysis and emotion recognition can be improved by learning (hidden) correlations between features extracted from the outer product of text and audio (we call this text-based audio) and analogous text-based video. This paper proposes a novel model, the Interaction Canonical Correlation Network (ICCN), to learn such multimodal embeddings. ICCN learns correlations between all three modes via deep canonical correlation analysis (DCCA) and the proposed embeddings are then tested on several benchmark datasets and against other state-of-the-art multimodal embedding algorithms. Empirical results and ablation studies confirm the effectiveness of ICCN in capturing useful information from all three views.
Persistent Identifier: http://hdl.handle.net/10722/341311
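The fusion described in the abstract can be sketched in a few lines: the outer product of per-utterance text features with audio (or video) features yields a flattened "text-based audio" (or "text-based video") vector, and canonical correlation analysis then measures how strongly the two fused views align. A minimal numpy illustration, using plain linear CCA as a stand-in for the paper's deep CCA network and random vectors in place of real embeddings (all names and dimensions here are illustrative, not the authors' implementation):

```python
import numpy as np

def text_based_fusion(text_feats, other_feats):
    # Outer product per utterance, flattened:
    # (n, d_text) x (n, d_other) -> (n, d_text * d_other)
    return np.einsum('ni,nj->nij', text_feats, other_feats).reshape(len(text_feats), -1)

def inv_sqrt(S):
    # Symmetric inverse square root via eigendecomposition
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def linear_cca_corrs(X, Y, k=2, reg=1e-4):
    # Top-k canonical correlations between views X and Y.
    # Linear CCA only; the paper's DCCA places deep networks in
    # front of this same correlation objective.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = len(X)
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, in descending order.
    return np.linalg.svd(T, compute_uv=False)[:k]

# Toy demo: random features standing in for real text/audio/video embeddings
rng = np.random.default_rng(0)
n, d_text, d_audio, d_video = 200, 6, 4, 5
text = rng.normal(size=(n, d_text))
audio = rng.normal(size=(n, d_audio))
video = rng.normal(size=(n, d_video))

text_based_audio = text_based_fusion(text, audio)   # shape (200, 24)
text_based_video = text_based_fusion(text, video)   # shape (200, 30)
corrs = linear_cca_corrs(text_based_audio, text_based_video, k=2)
```

Because both fused views share the same text factor, the top canonical correlations are nonzero even for random features; in the full ICCN the DCCA loss maximizes these correlations while training the feature networks end to end.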


DC Field: Value
dc.contributor.author: Sun, Zhongkai
dc.contributor.author: Sarma, Prathusha K.
dc.contributor.author: Sethares, William A.
dc.contributor.author: Liang, Yingyu
dc.date.accessioned: 2024-03-13T08:41:49Z
dc.date.available: 2024-03-13T08:41:49Z
dc.date.issued: 2020
dc.identifier.citation: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020, p. 8992-8999
dc.identifier.uri: http://hdl.handle.net/10722/341311
dc.language: eng
dc.relation.ispartof: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
dc.title: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis
dc.type: Conference_Paper
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.scopus: eid_2-s2.0-85106643433
dc.identifier.spage: 8992
dc.identifier.epage: 8999
