Recognizing Multimodal Entailment

Welcome

How information is created, shared and consumed has changed rapidly in recent decades, in part thanks to new social platforms and technologies on the web. With ever-larger amounts of unstructured and limited labels, organizing and reconciling information from different sources and modalities is a central challenge in machine learning.

This cutting-edge tutorial aims to introduce the multimodal entailment task, which can be useful for detecting semantic alignments when a single modality alone does not suffice for a whole content understanding. Starting with a brief overview of natural language processing, computer vision, structured data and neural graph learning, we lay the foundations for the multimodal sections to follow. We then discuss recent multimodal learning literature covering visual, audio and language streams, and explore case studies focusing on tasks which require fine-grained understanding of visual and linguistic semantics question answering, veracity and hatred classification. Finally, we introduce a new dataset for recognizing multimodal entailment, exploring it in a hands-on collaborative section.

Overall, this tutorial gives an overview of multimodal learning, introduces a multimodal entailment dataset, and encourages future research in the topic.

Venue

The Recognizing Multimodal Entailment tutorial will be held live virtually at ACL-IJCNLP 2021: The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing on August 1st 2021 from 13:00 to 17:00 (UTC).

This and the full list of ACL-IJCNLP 2021 tutorials can be found on the conference's program page, and this tutorial session is held for event attendees on the Underline platform.

Outline

Section	Subsection	min
Introduction	The landscape of online content	5
Introduction	A case for multimodal entailment inferences	5
Natural Language Processing	From word embeddings to contextualized representations	15
	Fine-tuning pretrained models on downstream tasks	5
	The textual entailment problem	5
Structured Data	Semi-structured and tabular text	5
Structured Data	Knowledge graphs	5
Neural Graph Learning	Leveraging structured signals with Neural Structured Learning	10
Computer Vision	Foundations of Computer Vision	20
Break	-	10
Multimodal Learning	Attention Bottlenecks for Multimodal Fusion: state-of-the-art audio-visual classifications	15
	Self-Supervised Multimodal Versatile Networks: visual, audio and language streams	25
	Case studies: cross-modal fine-grained reasoning	15
Break	-	10
Multimodal entailment	Multimodal models for entailment inferences	20
Multimodal entailment	Multimodal entailment dataset	10
Final considerations	Closing notes	5
Final considerations	Q&A	30
Total	–	210

Slides

Dataset

Colab notebook for Google Research recognizing-multimodal-entailment dataset.

A Tensorflow Keras baseline model authored by Sayak Paul using pre-trained ResNet50V2 and BERT-base encoders is available on this well documented Keras.io page, with the accompanying repository.

Illustrative example

Example of multimodal entailment where texts or images alone would not suffice for semantic understanding or pairwise classifications.

Reading list

Presenters

Afsaneh Shirazi,
Google

Arjun Gopalan,
Google Research

Arsha Nagrani,
Google Research

Cesar Ilharco,
Google

Christina Liu,
Google Research

Gabriel Barcik,
Google Research

Jannis Bulian,
Google Research

Jared Frank,
Google

Lucas Smaira,
DeepMind

Qin Cao,
Google Research

Ricardo Marino,
Google

Roma Patel,
Brown University

Organizers

Afsaneh Shirazi, Alex Ku, Arjun Gopalan, Arsha Nagrani, Blaž Bratanič, Cesar Ilharco, Chris Bregler‎, Christina Liu, Felipe Ferreira, Gabriel Barcik, Gabriel Ilharco, Georg Osang, Jannis Bulian, Jared Frank, Lucas Smaira, Qin Cao, Ricardo Marino, Thomas Leung and Vaiva Imbrasaite.

Acknowledgements

We would like to thank Abby Schantz, Abe Ittycheriah, Aliaksei Severyn, Allan Heydon, Aly Grealish, Andrey Vlasov, Arkaitz Zubiaga, Ashwin Kakarla, Chen Sun, Clayton Williams, Cong Yu, Cordelia Schmid, Da-Cheng Juan, Dan Finnie, Dani Valevski, Daniel Rocha, David Chiang, David Price, David Sklar, Devi Krishna, Elena Kochkina, Elizabeth Hamon Reid, Enrique Alfonseca, Françoise Beaufays, Isabel Kraus-Liang, Isabelle Augenstein, Iulia Turc, Jacob Eisenstein, Jialu Liu, John Cantwell, John Palowitch, Jordan Boyd-Graber, Kenton Lee, Lei Shi, Luís Valente, Maria Voitovich, Mehmet Aktuna, Min Zhang, Mogan Brown, Mohammad Khan, Mor Naaman, Natalia P, Nidhi Hebbar, Pandu Nayak, Pete Aykroyd, Rahul Sukthankar, Richa Dixit, Sara Goetz, Sayak Paul, Sol Rosenberg, Steve Pucci, Tania Bedrax-Weiss, Tim Dettmers, Tobias Kaufmann, Tom Boulos, Tu Tsao, Vladimir Chtchetkine, Yair Kurzion, Yifan Xu and Zach Hynes.

Welcome

Venue

Outline

Slides

Dataset

Illustrative example

Reading list

Natural Language Processing

Textual Entailment

Structured Data

Neural Graph Learning

Computer Vision

Multimodal Learning

Presenters

Organizers

Acknowledgements