Abstract:
<jats:title>Abstract</jats:title>
<jats:p><jats:bold>Background/introduction: </jats:bold>We propose a large emotion-labelled dataset, consisting of tweets posted over 12 years and labelled by representative emotion hashtags, for training a specific emotion detection model. The dataset is available at https://github.com/EmotionDetection/6H-AP_emotion_labelled_tweets. Predicting human emotion has been, and remains, a major challenge in many research fields, such as psychology, neuroscience, and computer science. Tweets are considered a suitable source for collecting big data, with emotion hashtags serving as reliable emotion annotations. However, little is known about the data collection criteria for applying emotion hashtags (i.e., the type and position of the hashtags). <jats:bold>Methods: </jats:bold>To clarify these criteria, we collected over five million tweets and divided them into six datasets. Five traditional machine learning (ML) algorithms, each trained on the six datasets, were evaluated on both the internal test sets of the six datasets (30 analyses) and an external test set (30 analyses). <jats:bold>Results:</jats:bold> We propose the emotion-labelled dataset (n = 1,478,116; representative emotion hashtags in any position) that achieved the highest F1 score. Furthermore, we compared the model trained on the proposed dataset with a model trained on a small dataset. We find that the large dataset improved model performance more in deep learning (18 analyses) than in traditional ML algorithms (30 analyses). <jats:bold>Conclusions: </jats:bold>Finally, we share the proposed dataset with other researchers to contribute to future studies of specific emotion detection models, and we provide reliable baseline results for this dataset.</jats:p>