Globetrotter: Unsupervised Multilingual Translation from Visual Alignment

1 Columbia University
2 UC Berkeley
Teaser figure
While each language represents a bicycle with a different word, the underlying visual representation remains consistent. A bicycle has a similar appearance in the UK, France, Japan and India. We leverage this natural property to learn models of machine translation across multiple languages without paired training corpora.

Multi-language machine translation without parallel corpora is challenging because there is no explicit supervision between languages. Existing unsupervised methods typically rely on topological properties of the language representations. We introduce a framework that instead uses the visual modality to align multiple languages, using images as the bridge between them. We estimate the cross-modal alignment between language and images, and use this estimate to guide the learning of cross-lingual representations. Our language representations are trained jointly in one model with a single stage. Experiments with fifty-two languages show that our method outperforms baselines on unsupervised word-level and sentence-level translation using retrieval.
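As a rough illustration of what translation using retrieval means in an aligned embedding space, the sketch below picks, for each source sentence, the nearest target-language sentence by cosine similarity. The function names `retrieve_translations` and `recall_at_1` are hypothetical, and the embeddings are assumed to come from some already trained encoder; this is not the released evaluation code.

```python
import numpy as np


def retrieve_translations(source_emb, target_emb):
    """Return, for each source sentence, the index of the closest target sentence.

    source_emb: (n_source, dim) sentence embeddings in one language.
    target_emb: (n_target, dim) sentence embeddings in another language.
    Both are assumed to live in the same aligned embedding space.
    """
    # Normalize rows so that the dot product equals cosine similarity.
    src = source_emb / np.linalg.norm(source_emb, axis=1, keepdims=True)
    tgt = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    similarity = src @ tgt.T  # (n_source, n_target)
    return similarity.argmax(axis=1)


def recall_at_1(predicted, gold):
    """Fraction of source sentences whose top retrieved candidate is correct."""
    predicted = np.asarray(predicted)
    gold = np.asarray(gold)
    return float((predicted == gold).mean())
```

A test set in which every language describes the same images makes the gold index for each source sentence known, which is what a metric like `recall_at_1` needs.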


    title={Globetrotter: Unsupervised Multilingual Translation from Visual Alignment},
    author={Sur\'is, D\'idac and Epstein, Dave and Vondrick, Carl},
    journal={arXiv preprint arXiv:2012.04631},


Our model learns an aligned embedding space for language translation by leveraging a transitive relation through vision. The cross-sentence similarity β_ij between sentences i and j is estimated by following a path through an image collection. Our approach learns this path by using both cross-modal (text-image) and visual similarity metrics. We do not use any paired data across languages.
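The transitive relation can be sketched in a few lines. The code below is a minimal illustration, not the model's actual formulation: it assumes that text and image embeddings already share a space, turns text-image similarities into a soft assignment of each sentence to images, and sums over every path sentence i → image k → image l → sentence j. The function name `transitive_similarity`, the softmax, and the `temperature` parameter are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def transitive_similarity(text_a, text_b, images, temperature=0.1):
    """Estimate sentence-to-sentence similarity through an image collection.

    text_a: (n_a, dim) sentence embeddings in language A.
    text_b: (n_b, dim) sentence embeddings in language B.
    images: (k, dim) image embeddings in the same space (an assumption).
    Returns an (n_a, n_b) matrix of estimated similarities in [0, 1].
    """
    text_a, text_b, images = (l2_normalize(m) for m in (text_a, text_b, images))

    # Cross-modal (text-image) similarity: a soft assignment of each
    # sentence to the images in the collection.
    alpha_a = softmax(text_a @ images.T / temperature)  # (n_a, k)
    alpha_b = softmax(text_b @ images.T / temperature)  # (n_b, k)

    # Visual (image-image) similarity, clipped to be non-negative.
    visual = np.clip(images @ images.T, 0.0, 1.0)  # (k, k)

    # Sum over all paths: sentence i -> image k -> image l -> sentence j.
    return alpha_a @ visual @ alpha_b.T
```

No sentence in language A is ever compared directly with a sentence in language B here; the only link between the two is the image collection, which is the property the approach relies on.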

Code, data and pretrained models

For this project we collected a dataset of image descriptions in 52 languages. In the training set, each language describes a different set of images, so no image is shared across languages. In the testing set, the same images are described in all the languages. We release our code, dataset and pretrained models. Please check our GitHub project for more information.

This research is based on work partially supported by the DARPA GAILA program under Contract No. HR00111990058, the DARPA KAIROS program under PTE Federal Award No. FA8750-19-2-1004, NSF CRII Award #1850069, and an Amazon Research Gift. We thank NVIDIA for GPU donations. The webpage template was inspired by this project page.