Does the inclusion of other modalities enhance the performance of speech emotion recognition systems?

Liu, Junchen; James, Jesin; Nathwani, Karan

Conference Item. Nineteenth Australasian International Conference on Speech Science and Technology, Melbourne, Australia, 03 Dec 2024 - 05 Dec 2024. Published in Proceedings of the Nineteenth Australasian International Conference on Speech Science and Technology, Australasian Speech Science and Technology Association, pp. 32-36, 01 Dec 2024. ISSN 2207-1296.

Date issued: 2024-12-01. Date available: 2025-01-09.
Handle: https://hdl.handle.net/2292/71017

Abstract
The pursuit of natural human-computer interaction has driven the advancement of emotion recognition technology. Speech emotion recognition (SER) has gained widespread attention due to its broad applicability. Recently, some researchers have been interested in developing multi-modal emotion recognition (MER) systems that integrate speech with text and video modalities to enhance robustness and accuracy. We analyse the performance of these systems using the IEMOCAP and RAVDESS datasets, highlighting the impact of different modality combinations on emotion recognition accuracy. This paper aims to guide future research in optimising MER by leveraging the complementary advantages of various modalities.

Copyright: ASSTA. Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. Previously published items are made available in accordance with the copyright policy of the publisher. Rights: https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm