A Sentiment Analysis Model of Spanish Tweets. Case Study: Colombia 2014 Presidential Election

Authors:
Cerón-Guzmán, Jhon Adrián
Resource type:
Publication date:
2016
Institution:
Universidad Nacional de Colombia
Repository:
Universidad Nacional de Colombia
Language:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/56482
Online access:
https://repositorio.unal.edu.co/handle/unal/56482
http://bdigital.unal.edu.co/52257/
Keywords:
0 Computer science, information and general works
38 Commerce, communications and transportation
46 Spanish and Portuguese languages / Specific languages
Social media
Twitter
Spanish tweets
Spammer detection
Lexical normalization
Sentiment analysis
Voting intention inference
Politics
Presidential election
Colombia
Rights
openAccess
License
Attribution-NonCommercial 4.0 International
Description
Summary: Abstract. What people say on social media has become a rich source of information for understanding social behavior. Sentiment analysis of Twitter data has been widely used to capture trends in public opinion about important events such as political elections. However, current research on social media analysis in political domains faces two major problems: the sentiment analysis methods implemented are often too simple, and most studies assume that all users and their tweets are trustworthy. This thesis addresses these problems in order to achieve more reliable measurements of public opinion, using the 2014 Colombian presidential election as a case study. First, research on social spammer detection on Twitter was carried out, following machine learning approaches to distinguish spammer accounts from non-spammer ones. Because of the brevity of tweets and the widespread use of mobile devices, Twitter is also a rich source of noisy data containing many non-standard word forms. Since sentiment analysis exploits large amounts of user-generated text, its performance may drop significantly if these lexical variation phenomena are not dealt with. For that reason, a lexical normalization system for Spanish tweets was developed to improve the quality of natural language analysis, using finite-state transducers and statistical language modeling. Lastly, a sentiment analysis system for Spanish tweets was developed by implementing a supervised classification approach. The system was applied to the Colombian election to infer voting intention. Experimental results highlight the importance of denoising Twitter data to achieve more reliable public opinion measurements. The results also show the potential of social media analysis to infer vote share, obtaining the lowest mean absolute error and correctly ranking the highest-polling candidates in the first-round election. However, despite its importance, this method cannot be put forward as a substitute for traditional polling.
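The summary evaluates vote-share inference by mean absolute error and by candidate ranking, but it does not say how tweets were aggregated into shares. The following is a minimal sketch under one simple assumption: each candidate's share is their fraction of all positive-sentiment tweets. The function names, candidate labels, and numbers are hypothetical and chosen only for illustration; they are not data or code from the thesis.

```python
def vote_share(positive_counts):
    """Convert per-candidate counts of positive tweets into percentage shares."""
    total = sum(positive_counts.values())
    if total == 0:
        raise ValueError("no positive tweets to aggregate")
    return {name: 100.0 * n / total for name, n in positive_counts.items()}


def mean_absolute_error(predicted, actual):
    """Average absolute gap, in percentage points, over the candidates in `actual`."""
    return sum(abs(predicted[c] - actual[c]) for c in actual) / len(actual)


# Hypothetical inputs for illustration only; not results from the thesis.
positive_counts = {"candidate_a": 500, "candidate_b": 300, "candidate_c": 200}
official_result = {"candidate_a": 45.0, "candidate_b": 35.0, "candidate_c": 20.0}

predicted = vote_share(positive_counts)
mae = mean_absolute_error(predicted, official_result)

# Rank candidates from highest to lowest predicted share.
ranking = sorted(predicted, key=predicted.get, reverse=True)

print(predicted)
print(f"MAE: {mae:.2f} percentage points")
print(ranking)
```

Under this assumption, removing spammer accounts and normalizing noisy tweets before classification would change the values in `positive_counts`, which is where the denoising steps described in the summary would affect the final error.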