Multi-language machine translation without parallel corpora is challenging because there is no explicit supervision between languages. Existing unsupervised methods typically rely on topological properties of the language representations. We introduce a framework that instead uses the visual modality to align multiple languages, with images as the bridge between them. We estimate the cross-modal alignment between language and images, and use this estimate to guide the learning of cross-lingual representations. Our language representations are trained jointly in a single model and a single training stage. Experiments with fifty-two languages show that our method outperforms baselines on retrieval-based word-level and sentence-level unsupervised translation.
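To make the bridging idea concrete, the sketch below shows one way cross-modal similarity to a shared pool of images could be turned into a cross-lingual training signal. This is a minimal illustration under our own assumptions (the function names, the softmax-over-images alignment estimate, and the KL-based objective are not the paper's exact formulation).

```python
# Minimal sketch (not the exact training code) of using images as a bridge
# between languages: similarity to a shared batch of images gives an estimate
# of cross-lingual alignment, which then supervises a cross-lingual objective.
# All tensor names, shapes, and the loss form are illustrative assumptions.

import torch
import torch.nn.functional as F

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return a @ b.t()

def image_bridged_alignment(text_a, text_b, images, temperature=0.1):
    """Estimate how well sentences in language A align with sentences in
    language B by comparing their similarity profiles over the same images."""
    sim_a = F.softmax(cosine_sim(text_a, images) / temperature, dim=-1)  # (Na, Ni)
    sim_b = F.softmax(cosine_sim(text_b, images) / temperature, dim=-1)  # (Nb, Ni)
    # Sentences that attend to the same images receive a high alignment score.
    return sim_a @ sim_b.t()  # (Na, Nb)

def cross_lingual_loss(text_a, text_b, images):
    """Use the image-bridged alignment estimate as a soft target for a
    cross-lingual contrastive-style objective."""
    target = image_bridged_alignment(text_a, text_b, images)
    target = target / target.sum(dim=-1, keepdim=True)          # row-normalize
    logits = cosine_sim(text_a, text_b) / 0.1
    return F.kl_div(F.log_softmax(logits, dim=-1), target, reduction="batchmean")

# Toy usage with random embeddings in a shared 512-d space.
text_a = torch.randn(8, 512)   # sentence embeddings, language A
text_b = torch.randn(8, 512)   # sentence embeddings, language B
images = torch.randn(32, 512)  # image embeddings (the bridge)
loss = cross_lingual_loss(text_a, text_b, images)
```

In this toy setup the images act as a pivot: no sentence pairs across languages are ever labeled, and the only supervision comes from how each sentence relates to the shared visual embeddings.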
This research is based on work partially supported by the DARPA GAILA program under Contract No. HR00111990058, the DARPA KAIROS program under PTE Federal Award No. FA8750-19-2-1004, NSF CRII Award #1850069, and an Amazon Research Gift. We thank NVidia for GPU donations.