Automatic authorship analysis using Deep neural networks

Authorship analysis studies the characteristics that distinguish how different people write. Writing style can be extracted in several ways, such as bag-of-words strategies or handcrafted features. However, with the growth of the Internet, we have witnessed an increase in the...


Authors:
Sierra Loaiza, Sebastian Ernesto
Resource type:
Master's thesis
Publication date:
2018
Institution:
Universidad Nacional de Colombia
Repository:
Universidad Nacional de Colombia
Language:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/76747
Online access:
https://repositorio.unal.edu.co/handle/unal/76747
http://bdigital.unal.edu.co/73495/
Keywords:
Machine learning
Supervised Learning
Representation Learning
Automatic Authorship Analysis
Authorship Attribution
Author Profiling
Multimodal Author Profiling
Rights
openAccess
License
Attribution-NonCommercial 4.0 International
id UNACIONAL2_260825b80962f5df92f9d838df4431b2
oai_identifier_str oai:repositorio.unal.edu.co:unal/76747
network_acronym_str UNACIONAL2
network_name_str Universidad Nacional de Colombia
repository_id_str
dc.title.spa.fl_str_mv Automatic authorship analysis using Deep neural networks
dc.creator.fl_str_mv Sierra Loaiza, Sebastian Ernesto
dc.contributor.author.spa.fl_str_mv Sierra Loaiza, Sebastian Ernesto
dc.contributor.spa.fl_str_mv González Osorio, Fabio Augusto
dc.subject.proposal.spa.fl_str_mv Machine learning
Supervised Learning
Representation Learning
Automatic Authorship Analysis
Authorship Attribution
Author Profiling
Multimodal Author Profiling
description Authorship analysis studies the characteristics that distinguish how different people write. Writing style can be extracted in several ways, such as bag-of-words strategies or handcrafted features. However, with the growth of the Internet, we have witnessed an increase in the amount of user-generated data on social networks like Facebook or Twitter. There is a growing need for automatic methods capable of analyzing the style of a document for tasks such as determining the age of the author, determining the gender of the author, or determining the authorship of the document given a set of possible authors. These tasks are known as author profiling and authorship attribution. Although capturing the style of an author can be challenging, in this thesis we explore representation learning strategies in order to take advantage of the large amount of data generated by social media. We learn representations of the text inputs that capture patterns specific to a single author (authorship attribution) or to a social group of authors (author profiling). The proposed methods were compared on several publicly available datasets of social media data. Both author profiling and authorship attribution are addressed using representation learning techniques such as convolutional neural networks and gated multimodal units. Our unimodal author profiling approach was submitted to the profiling shared task of the laboratory on digital forensics and stylometry (PAN). For authorship attribution, we proposed a convolutional neural network using character n-grams as input. We found that our approach outperformed standard attribution-based methods as well as word-based convolutional neural networks.
For the author profiling task, we proposed one convolutional neural network for unimodal author profiling and adapted a gated multimodal unit for multimodal author profiling. User-generated content is multimodal: the social group of an author can be determined not only from their written texts but also from the images they share on social networks. Gated multimodal units outperformed the standard information fusion strategies, early fusion and late fusion.
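The gated fusion mentioned in the description can be illustrated with a minimal sketch. It assumes the standard bimodal gated multimodal unit formulation (a tanh projection per modality, mixed by a sigmoid gate computed from the concatenated inputs); the record does not say how the thesis adapted the unit, so this is not the author's implementation. The function names, dimensions, and random weights below are all hypothetical stand-ins for a trained model.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gated_multimodal_unit(x_text, x_image, w_text, w_image, w_gate):
    """Fuse text and image features with a learned per-dimension gate.

    h_text  = tanh(W_text  @ x_text)
    h_image = tanh(W_image @ x_image)
    z       = sigmoid(W_gate @ [x_text; x_image])
    h       = z * h_text + (1 - z) * h_image
    """
    h_text = [math.tanh(v) for v in matvec(w_text, x_text)]
    h_image = [math.tanh(v) for v in matvec(w_image, x_image)]
    # The gate sees both modalities (list concatenation) and decides,
    # per hidden dimension, how much each one contributes.
    z = [sigmoid(v) for v in matvec(w_gate, x_text + x_image)]
    fused = [g * t + (1.0 - g) * i for g, t, i in zip(z, h_text, h_image)]
    return fused, z


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for trained weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Assumed toy dimensions; real text and image features would be much larger.
rng = random.Random(0)
text_dim, image_dim, hidden_dim = 4, 3, 2
x_text = [rng.uniform(-1, 1) for _ in range(text_dim)]
x_image = [rng.uniform(-1, 1) for _ in range(image_dim)]
w_text = random_matrix(hidden_dim, text_dim, rng)
w_image = random_matrix(hidden_dim, image_dim, rng)
w_gate = random_matrix(hidden_dim, text_dim + image_dim, rng)

fused, gate = gated_multimodal_unit(x_text, x_image, w_text, w_image, w_gate)
print("gate:", gate)
print("fused:", fused)
```

For comparison, early fusion would concatenate the text and image features before a single classifier, and late fusion would combine the predictions of separate per-modality classifiers. The gate lets the weighting between modalities vary per input instead.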
publishDate 2018
dc.date.issued.spa.fl_str_mv 2018-08-31
dc.date.accessioned.spa.fl_str_mv 2020-03-30T06:28:02Z
dc.date.available.spa.fl_str_mv 2020-03-30T06:28:02Z
dc.type.spa.fl_str_mv Master's thesis (Trabajo de grado - Maestría)
dc.type.driver.spa.fl_str_mv info:eu-repo/semantics/masterThesis
dc.type.version.spa.fl_str_mv info:eu-repo/semantics/acceptedVersion
dc.type.content.spa.fl_str_mv Text
dc.type.redcol.spa.fl_str_mv http://purl.org/redcol/resource_type/TM
status_str acceptedVersion
dc.identifier.uri.none.fl_str_mv https://repositorio.unal.edu.co/handle/unal/76747
dc.identifier.eprints.spa.fl_str_mv http://bdigital.unal.edu.co/73495/
dc.language.iso.spa.fl_str_mv spa
dc.relation.spa.fl_str_mv http://www.ingenieria.unal.edu.co/mindlab/
dc.relation.ispartof.spa.fl_str_mv Universidad Nacional de Colombia Sede Bogotá Facultad de Ingeniería Departamento de Ingeniería de Sistemas e Industrial Ingeniería de Sistemas
Ingeniería de Sistemas
dc.relation.haspart.spa.fl_str_mv 0 Generalidades / Computer science, information and general works
62 Ingeniería y operaciones afines / Engineering
dc.relation.references.spa.fl_str_mv Sierra Loaiza, Sebastian Ernesto (2018) Automatic authorship analysis using Deep neural networks. Master's thesis, Universidad Nacional de Colombia - Sede Bogotá.
dc.rights.spa.fl_str_mv All rights reserved - Universidad Nacional de Colombia
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.license.spa.fl_str_mv Attribution-NonCommercial 4.0 International
dc.rights.uri.spa.fl_str_mv http://creativecommons.org/licenses/by-nc/4.0/
dc.rights.accessrights.spa.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.mimetype.spa.fl_str_mv application/pdf
institution Universidad Nacional de Colombia
bitstream.url.fl_str_mv https://repositorio.unal.edu.co/bitstream/unal/76747/1/SebastianSierraLoaiza.2019.pdf
https://repositorio.unal.edu.co/bitstream/unal/76747/2/SebastianSierraLoaiza.2019.pdf.jpg
bitstream.checksum.fl_str_mv a3018b7db3fcc9c64f8ca0b05546038f
a1e38afe0c92ba62db3a77f2eb880981
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositorio Institucional Universidad Nacional de Colombia
repository.mail.fl_str_mv repositorio_nal@unal.edu.co
_version_ 1806886287385296896