Multimodal representation learning with neural networks


Full description

Authors:
Arevalo Ovalle, John Edilson
Resource type:
Doctoral thesis
Publication date:
2018
Institution:
Universidad Nacional de Colombia
Repository:
Universidad Nacional de Colombia
Language:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/63866
Online access:
https://repositorio.unal.edu.co/handle/unal/63866
http://bdigital.unal.edu.co/64463/
Keywords:
0 Generalidades / Computer science, information and general works
37 Educación / Education
6 Tecnología (ciencias aplicadas) / Technology
62 Ingeniería y operaciones afines / Engineering
Multimodal-learning
Representation-learning
Information-fusion
GMU
Rights
openAccess
License
Atribución-NoComercial 4.0 Internacional
id UNACIONAL2_8deae92050544ac92309b0e640b4e596
oai_identifier_str oai:repositorio.unal.edu.co:unal/63866
network_acronym_str UNACIONAL2
network_name_str Universidad Nacional de Colombia
repository_id_str
dc.title.spa.fl_str_mv Multimodal representation learning with neural networks
title Multimodal representation learning with neural networks
spellingShingle Multimodal representation learning with neural networks
0 Generalidades / Computer science, information and general works
37 Educación / Education
6 Tecnología (ciencias aplicadas) / Technology
62 Ingeniería y operaciones afines / Engineering
Multimodal-learning
Representation-learning
Information-fusion
GMU
title_short Multimodal representation learning with neural networks
title_full Multimodal representation learning with neural networks
title_fullStr Multimodal representation learning with neural networks
title_full_unstemmed Multimodal representation learning with neural networks
title_sort Multimodal representation learning with neural networks
dc.creator.fl_str_mv Arevalo Ovalle, John Edilson
dc.contributor.author.spa.fl_str_mv Arevalo Ovalle, John Edilson
dc.contributor.spa.fl_str_mv Gonzalez, Fabio A
Solorio, Thamar
dc.subject.ddc.spa.fl_str_mv 0 Generalidades / Computer science, information and general works
37 Educación / Education
6 Tecnología (ciencias aplicadas) / Technology
62 Ingeniería y operaciones afines / Engineering
topic 0 Generalidades / Computer science, information and general works
37 Educación / Education
6 Tecnología (ciencias aplicadas) / Technology
62 Ingeniería y operaciones afines / Engineering
Multimodal-learning
Representation-learning
Information-fusion
GMU
dc.subject.proposal.spa.fl_str_mv Multimodal-learning
Representation-learning
Information-fusion
GMU
description Abstract: Representation learning methods have received a lot of attention from researchers and practitioners because of their successful application to complex problems in areas such as computer vision, speech recognition, and text processing [1]. Many of these promising results are due to the development of methods that automatically learn the representation of complex objects directly from large amounts of sample data [2]. These efforts have concentrated on data involving a single type of information (images, text, speech, etc.), despite data being naturally multimodal. Multimodality refers to the fact that the same real-world concept can be described by different views or data types. Multimodal automatic analysis faces three main challenges: feature learning and extraction, modeling of the relationships between data modalities, and scalability to large multimodal collections [3, 4]. This research considers the problem of leveraging multiple sources of information, or data modalities, in neural networks. It defines a novel model called the gated multimodal unit (GMU), designed as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities. The GMU uses multiplicative gates to learn how much each modality influences the activation of the unit. It can be used as a building block for different kinds of neural networks and can be seen as a form of intermediate fusion. The model was evaluated on four supervised learning tasks in conjunction with fully connected and convolutional neural networks. We compared the GMU with other early and late fusion methods, outperforming their classification scores on the evaluated datasets. Strategies for understanding how the model assigns importance to each input were also explored. By measuring the correlation between gate activations and predictions, we were able to associate modalities with classes, and found that some classes were more strongly correlated with a particular modality. Interesting findings in genre prediction show, for instance, that the model associates visual information with animation movies, while textual information is more strongly associated with drama or romance movies. During the development of this project, three new benchmark datasets were built and publicly released: the BCDR-F03 dataset, which contains 736 mammography images and serves as a benchmark for mass lesion classification; the MM-IMDb dataset, which contains around 27,000 movie plots and posters along with 50 metadata annotations and motivates new research in multimodal analysis; and the Goodreads dataset, a collection of 1,000 books that encourages research on success prediction based on book content. To facilitate reproducibility, the source code implementing the proposed methods has also been released.
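The gating mechanism described in the abstract can be illustrated with a minimal NumPy sketch of the bimodal case: each modality is passed through a tanh transform, and a sigmoid gate computed from both inputs decides, per hidden dimension, how much each modality contributes. The weight names (W_v, W_t, W_z) and dimensions below are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def gmu_bimodal(x_v, x_t, W_v, W_t, W_z):
    """Bimodal gated multimodal unit (illustrative sketch).

    h_v = tanh(W_v @ x_v)            per-modality feature transform (visual)
    h_t = tanh(W_t @ x_t)            per-modality feature transform (textual)
    z   = sigmoid(W_z @ [x_v; x_t])  gate weighting the two modalities
    h   = z * h_v + (1 - z) * h_t    gated intermediate representation
    """
    h_v = np.tanh(W_v @ x_v)
    h_t = np.tanh(W_t @ x_t)
    # Gate is computed from the concatenation of both raw inputs.
    z = 1.0 / (1.0 + np.exp(-(W_z @ np.concatenate([x_v, x_t]))))
    # Convex combination per hidden dimension: z -> 1 favors the visual
    # branch, z -> 0 favors the textual branch.
    return z * h_v + (1.0 - z) * h_t
```

Because the output is a per-dimension convex combination of two tanh activations, every component of h lies in [-1, 1]; inspecting the gate values z after training is what allows modality importance to be associated with individual predictions.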
publishDate 2018
dc.date.issued.spa.fl_str_mv 2018
dc.date.accessioned.spa.fl_str_mv 2019-07-02T22:14:06Z
dc.date.available.spa.fl_str_mv 2019-07-02T22:14:06Z
dc.type.spa.fl_str_mv Trabajo de grado - Doctorado
dc.type.driver.spa.fl_str_mv info:eu-repo/semantics/doctoralThesis
dc.type.version.spa.fl_str_mv info:eu-repo/semantics/acceptedVersion
dc.type.coar.spa.fl_str_mv http://purl.org/coar/resource_type/c_db06
dc.type.content.spa.fl_str_mv Text
dc.type.redcol.spa.fl_str_mv http://purl.org/redcol/resource_type/TD
format http://purl.org/coar/resource_type/c_db06
status_str acceptedVersion
dc.identifier.uri.none.fl_str_mv https://repositorio.unal.edu.co/handle/unal/63866
dc.identifier.eprints.spa.fl_str_mv http://bdigital.unal.edu.co/64463/
url https://repositorio.unal.edu.co/handle/unal/63866
http://bdigital.unal.edu.co/64463/
dc.language.iso.spa.fl_str_mv spa
language spa
dc.relation.ispartof.spa.fl_str_mv Universidad Nacional de Colombia Sede Bogotá Facultad de Ingeniería Departamento de Ingeniería de Sistemas e Industrial Ingeniería de Sistemas
Ingeniería de Sistemas
dc.relation.references.spa.fl_str_mv Arevalo Ovalle, John Edilson (2018) Multimodal representation learning with neural networks. Doctorado thesis, Universidad Nacional de Colombia - Sede Bogotá.
dc.rights.spa.fl_str_mv Derechos reservados - Universidad Nacional de Colombia
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.license.spa.fl_str_mv Atribución-NoComercial 4.0 Internacional
dc.rights.uri.spa.fl_str_mv http://creativecommons.org/licenses/by-nc/4.0/
dc.rights.accessrights.spa.fl_str_mv info:eu-repo/semantics/openAccess
rights_invalid_str_mv Atribución-NoComercial 4.0 Internacional
Derechos reservados - Universidad Nacional de Colombia
http://creativecommons.org/licenses/by-nc/4.0/
http://purl.org/coar/access_right/c_abf2
eu_rights_str_mv openAccess
dc.format.mimetype.spa.fl_str_mv application/pdf
institution Universidad Nacional de Colombia
bitstream.url.fl_str_mv https://repositorio.unal.edu.co/bitstream/unal/63866/1/multimodal-representation-learning.pdf
https://repositorio.unal.edu.co/bitstream/unal/63866/2/multimodal-representation-learning.pdf.jpg
bitstream.checksum.fl_str_mv 98fc2e44524863dc71bbd09d8eee81ab
99fd27d1fe57f23ffa1b9fc074181fec
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositorio Institucional Universidad Nacional de Colombia
repository.mail.fl_str_mv repositorio_nal@unal.edu.co
_version_ 1814089796970086400