CS Colloquium: “A Cooking (Training) Guide for Neural Cross-Language Retrieval Models” (Eugene Yang, Johns Hopkins University)
Abstract
Most search engines retrieve information in the same language as the user's query, operating under the assumption that users can read content only in the language in which they write their queries. While this assumption largely held in the past, advances in machine translation systems and large language models now enable highly accurate and efficient translation between numerous languages, empowering users to access information beyond their native tongues. However, translating every document before serving it to users is usually cost-prohibitive because of the size of the search corpus. This underscores the need for end-to-end cross-language retrieval models capable of directly retrieving relevant content in multiple languages based on user queries. In this talk, I will present a guide for training neural cross-language retrieval models under various levels of resource availability.
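For readers unfamiliar with the end-to-end setup the abstract describes, the following is a minimal sketch (not the speaker's method) of cross-language dense retrieval: a query and documents in different languages are embedded into one shared vector space and ranked by similarity, with no document translation. It assumes the open-source sentence-transformers library and one of its public multilingual models; the model choice and example texts are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# A multilingual bi-encoder; any multilingual retrieval model would do.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "effects of climate change on agriculture"  # English query
docs = [
    "El cambio climático reduce el rendimiento de los cultivos.",  # Spanish, relevant
    "气候变化影响农业产量。",                                        # Chinese, relevant
    "La recette traditionnelle de la ratatouille.",                 # French, off-topic
]

# Encode query and documents; in practice, document vectors are
# precomputed offline and stored in an index.
q_emb = model.encode(query, convert_to_tensor=True)
d_embs = model.encode(docs, convert_to_tensor=True)

# Cosine similarity ranks the relevant documents above the off-topic
# one, regardless of document language.
scores = util.cos_sim(q_emb, d_embs)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```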
Speaker’s Biography
Eugene Yang is a research scientist at the Human Language Technology Center of Excellence (HLTCOE) at Johns Hopkins University. His recent work focuses on multilingual and cross-language information retrieval and their applications. Eugene has also co-organized the TREC NeuCLIR track since 2022. Before joining the HLTCOE, Eugene received his Ph.D. from Georgetown University, where he worked on high-recall retrieval for electronic discovery with Ophir Frieder and David D. Lewis.