How information is created, shared and consumed has changed rapidly in recent decades, in part thanks to new social platforms and technologies on the web. With ever-larger amounts of unstructured data and limited labels, organizing and reconciling information from different sources and modalities is a central challenge in machine learning.

This cutting-edge tutorial aims to introduce the multimodal entailment task, which can be useful for detecting semantic alignments when a single modality alone does not suffice to understand the content as a whole. Starting with a brief overview of natural language processing, computer vision, structured data and neural graph learning, we lay the foundations for the multimodal sections to follow. We then discuss recent multimodal learning literature covering visual, audio and language streams, and explore case studies focusing on tasks that require fine-grained understanding of visual and linguistic semantics: question answering, veracity classification and hatred classification. Finally, we introduce a new dataset for recognizing multimodal entailment, exploring it in a hands-on collaborative section.

Overall, this tutorial gives an overview of multimodal learning, introduces a multimodal entailment dataset, and encourages future research on the topic.


The Recognizing Multimodal Entailment tutorial will be held live virtually at ACL-IJCNLP 2021: The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing on August 1st 2021 from 13:00 to 17:00 (UTC).

This tutorial and the full list of ACL-IJCNLP 2021 tutorials can be found on the conference's program page. The tutorial session is held for event attendees on the Underline platform.


Section | Subsection | min
 | The landscape of online content | 5
 | A case for multimodal entailment inferences | 5
 | From word embeddings to contextualized representations | 15
 | Fine-tuning pretrained models on downstream tasks | 5
 | The textual entailment problem | 5
Structured Data | Semi-structured and tabular text | 5
Structured Data | Knowledge graphs | 5
Neural Graph Learning | Leveraging structured signals with Neural Structured Learning | 10
Computer Vision | Foundations of Computer Vision | 20
Break | - | 10
Multimodal Learning | Attention Bottlenecks for Multimodal Fusion: state-of-the-art audio-visual classifications | 15
Multimodal Learning | Self-Supervised Multimodal Versatile Networks: visual, audio and language streams | 25
Multimodal Learning | Case studies: cross-modal fine-grained reasoning | 15
Break | - | 10
Multimodal entailment | Multimodal models for entailment inferences | 20
Multimodal entailment | Multimodal entailment dataset | 10
Final considerations | Closing notes | 5
Final considerations | Q&A | 30
Total | | 210



Colab notebook for Google Research recognizing-multimodal-entailment dataset.

A TensorFlow Keras baseline model authored by Sayak Paul, using pre-trained ResNet50V2 and BERT-base encoders, is available on this well-documented Keras.io page, with the accompanying repository.
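
To make the input and output shape of such a model concrete, here is a minimal sketch of late fusion, one common way to combine a text encoder and an image encoder. It is not the Keras.io implementation: the embeddings are random stand-ins for encoder outputs, the classifier is untrained, and the label names and helper functions (`fuse`, `classify`) are hypothetical choices made for illustration only.

```python
import math
import random

# Hypothetical label set for illustration; the dataset's actual label names may differ.
LABELS = ["entailment", "contradiction", "neutral"]


def fuse(text_1, image_1, text_2, image_2):
    """Late fusion: concatenate the four embedding vectors into one feature vector."""
    return text_1 + image_1 + text_2 + image_2


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias):
    """Single linear layer followed by softmax over the label set."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Stand-in embeddings: a real model would obtain these from pretrained
# text and image encoders applied to each (text, image) pair.
rng = random.Random(0)
dim = 4
text_1, image_1, text_2, image_2 = (
    [rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)
)

features = fuse(text_1, image_1, text_2, image_2)

# Untrained, randomly initialised classifier weights (assumption for the sketch).
weights = [[rng.uniform(-0.1, 0.1) for _ in features] for _ in LABELS]
bias = [0.0] * len(LABELS)

probs = classify(features, weights, bias)
prediction = LABELS[probs.index(max(probs))]
print(prediction, probs)
```

The point of the sketch is the data flow: each example consists of two (text, image) pairs, every input is embedded separately, and the classifier sees all four embeddings at once, so it can use cross-modal evidence that neither the texts nor the images provide on their own.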

Illustrative example

Example of multimodal entailment where texts or images alone would not suffice for semantic understanding or pairwise classifications.



Afsaneh Shirazi
Arjun Gopalan, Google Research
Arsha Nagrani, Google Research
Cesar Ilharco
Christina Liu, Google Research
Gabriel Barcik, Google Research
Jannis Bulian, Google Research
Jared Frank
Lucas Smaira
Qin Cao, Google Research
Ricardo Marino
Roma Patel, Brown University


Afsaneh Shirazi, Alex Ku, Arjun Gopalan, Arsha Nagrani, Blaž Bratanič, Cesar Ilharco, Chris Bregler, Christina Liu, Felipe Ferreira, Gabriel Barcik, Gabriel Ilharco, Georg Osang, Jannis Bulian, Jared Frank, Lucas Smaira, Qin Cao, Ricardo Marino, Thomas Leung and Vaiva Imbrasaite.


We would like to thank Abby Schantz, Abe Ittycheriah, Aliaksei Severyn, Allan Heydon, Aly Grealish, Andrey Vlasov, Arkaitz Zubiaga, Ashwin Kakarla, Chen Sun, Clayton Williams, Cong Yu, Cordelia Schmid, Da-Cheng Juan, Dan Finnie, Dani Valevski, Daniel Rocha, David Chiang, David Price, David Sklar, Devi Krishna, Elena Kochkina, Elizabeth Hamon Reid, Enrique Alfonseca, Françoise Beaufays, Isabel Kraus-Liang, Isabelle Augenstein, Iulia Turc, Jacob Eisenstein, Jialu Liu, John Cantwell, John Palowitch, Jordan Boyd-Graber, Kenton Lee, Lei Shi, Luís Valente, Maria Voitovich, Mehmet Aktuna, Min Zhang, Mogan Brown, Mohammad Khan, Mor Naaman, Natalia P, Nidhi Hebbar, Pandu Nayak, Pete Aykroyd, Rahul Sukthankar, Richa Dixit, Sara Goetz, Sayak Paul, Sol Rosenberg, Steve Pucci, Tania Bedrax-Weiss, Tim Dettmers, Tobias Kaufmann, Tom Boulos, Tu Tsao, Vladimir Chtchetkine, Yair Kurzion, Yifan Xu and Zach Hynes.