Understanding Multimodal Contrastive Learning Through Pointwise Mutual Information

Uesaka, Toshimitsu; Suzuki, Taiji; Takida, Yuhta; Lai, Chieh-Hsin; Murata, Naoki; Mitsufuji, Yuki

Computer Science > Machine Learning

arXiv:2404.19228 (cs)

[Submitted on 30 Apr 2024]

Abstract: Multimodal representation learning, which integrates different modalities such as text, vision, and audio, is important for real-world applications. The symmetric InfoNCE loss proposed in CLIP is a key concept in multimodal representation learning. In this work, we provide a theoretical understanding of the symmetric InfoNCE loss through the lens of pointwise mutual information and show that, under mild assumptions, encoders that achieve the optimal similarity in pretraining provide a good representation for downstream classification tasks. Based on our theoretical results, we also propose a new similarity metric for multimodal contrastive learning that uses a nonlinear kernel to enrich its capability. To verify the effectiveness of the proposed method, we pretrain multimodal representation models on the Conceptual Captions datasets and evaluate zero-shot classification and linear classification on common benchmark datasets.
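
For readers unfamiliar with the two quantities the abstract names, the following is a minimal NumPy sketch of a CLIP-style symmetric InfoNCE loss and of pointwise mutual information on a discrete joint distribution. It is not the authors' implementation: the function names, the cosine-similarity form, and the temperature value 0.07 are assumptions chosen for illustration, and the kernel-based similarity proposed in the paper is not shown.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    # Hypothetical helper: rows of image_emb and text_emb are paired samples.
    # L2-normalize so that the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)
    # Matched pairs sit on the diagonal; contrast them against each row
    # (image -> text) and each column (text -> image), then average.
    loss_i2t = -np.diag(log_softmax(logits, axis=1)).mean()
    loss_t2i = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (loss_i2t + loss_t2i)

def pointwise_mutual_information(joint):
    # PMI(x, y) = log p(x, y) / (p(x) p(y)) for a strictly positive joint table.
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return np.log(joint / (px * py))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    texts = images + 0.1 * rng.normal(size=(8, 16))  # roughly aligned pairs
    print("symmetric InfoNCE:", symmetric_info_nce(images, texts))
    print("PMI table:\n", pointwise_mutual_information([[0.4, 0.1], [0.1, 0.4]]))
```

In this sketch the loss is small when each sample is most similar to its own pair and large when pairs are mismatched, and the PMI is zero wherever the two variables are independent.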

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2404.19228 [cs.LG]
	(or arXiv:2404.19228v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2404.19228

Submission history

From: Toshimitsu Uesaka
[v1] Tue, 30 Apr 2024 03:15:04 UTC (304 KB)
