2006-08-10 New Directions in Multilingual Information Access

I really liked the invited talk and took fairly detailed notes, so please give it a read. It is a real change from what you usually see in the research world: a focus on the practical real world, and in some ways a bit pessimistic.

There are about 25-30 people in attendance.

Opening remarks:

In 2002 there was a workshop; so what have we done since then, and what should we be doing now? In 2002 the statements were:

  • Doug Oard: CLIR is a solved problem, the next step is to transition into practice
  • Mayfield and McNamee: five-year research objectives (almost nothing accomplished)

Why has the field gone in a different direction? And why has so little progress been made?

Invited talk: From R&D to Practice – Challenges to Multilingual Information Access in the Real World

David A. Evans, Clairvoyance Corporation

Overview of a presentation in 2002 on the state of CLIR / Multilingual Information Access, with a comparison to the present. English is the predominant language, but there has been great growth in the presence of other languages on the internet. There is a big gap between the number of people on the internet and the usage of languages in the real world, so there is opportunity for growth in those countries with low internet adoption.

The definition of CLIR in 2002 was

  1. Query translation
  2. Document retrieval
  3. Document translation.

A complete system would do 1-3, a minimal system 1-2. In the document translation phase perhaps that is done through summarization / fact extraction (so less than full document translation.) In 2002 they built a map of commercial and research systems and plotted where they were with respect to 1+2+3, etc. As of 2006, very little has changed (a few new systems have popped up, but generally very little change.)

At Clairvoyance they have had requests for CLIR patent search in Japan. So they started developing a system in 2003; in 2005 they had an alpha release tested with many companies. It is a very complex system, even though they focused on the minimal system they needed. In the Japanese market customers are very important, so you can’t afford to say “no”, but can you afford to say “yes” to them? They had difficulty meeting customer requests.

Interesting research challenge in one of the customer requests was how to map affect / sentiment analysis in one language to another? The scales and spaces do not translate simply across.

The problem is that customers want a solution, while they viewed it as a point (search) function, a component, not an end solution in itself. CLIR queries were observed to be much larger than monolingual queries, 3-4 times up to 10x. Getting appropriate translation resources was very difficult. Using a customer’s own internal domain material is hard, but customers expect that to be done easily.

Overall, the market is still “not there yet.” The quality and scope of machine translation is still a factor. Demand for CLIR itself is low, and to be successful it has to be fashioned around customers’ solutions. So we should re-think solutions, be very patient, or abandon hope.

Looking forward, look to entity identification and extraction, interpreting data in tables, text mining (particularly affect analysis), international collaboration (real time chat), meta-data mapping (e.g., semantic web), and portal organization (clustering, classification, searching.) Find a problem that needs to be solved, and build solutions for that. Consider targeting under-represented languages, and finally keep the faith!

A nice, interesting talk, and I liked the focus on the practical and real. It is a completely different space than what you would think from looking at the research community.


Doug Oard: he understood the initial choice of a query translation architecture instead of a document translation architecture, but since query processing needed to be done quickly, why not switch to document translation? Answer: many user requirements were dynamic, so they couldn’t always pre-translate, and taking on, say, 4GB of document translation looked daunting.

Second question: what kind of translation is it? Dictionary based with a hand-engineered system, or statistical, or what? Answer: three components: dictionary based, context-vector based. They would like to include transliteration in the next iteration for names and out-of-vocabulary items.

A comment: don’t want to say that you are translating documents because in the legal domain, if you say you translate the document you are opening yourself to being liable for mis-translation.

DAE comment: there are copyright issues, and in the Japanese market in particular you have to know the customer. From very early on they learned that anything they show the user reflects on them, the company. If you do query expansion, every term you show the customer has to be understandable, and a customer will not accept a term he doesn’t understand. They also don’t want to show the translations (glosses) to the user. So if you can do source document translation for indexing, sometimes it is a very good way to go, but they didn’t go that way.

Another problem is that you have long standing customers. Now when they want to upgrade with CLIR, they want to pay incremental prices, not completely new system prices, even if the new CLIR system is 10x as complex.

There is a big mismatch between the R&D community and the customers. Application space is also very tough, you really need to sell solutions and there are real practical problems in integration and interfacing there. In the end they decided not to market the CLIR system they developed.

Combining Evidence from Homologous Datasets

Ao Feng and James Allan, University of Massachusetts

Question: do multiple machine translation systems help when doing CLIR? If you can only use foreign language documents, can using multiple MT systems help? If you can then use the original language, does that help at all? They have multiple data sets (same data translated with different systems, or from different sources.) They have multiple weighted queries that are transformations of an original query that are combined and submitted to a search engine (Indri) that has indexed the different data sources.

They use Mandarin data from TDT-5 (35 topics) with 3 machine-translated versions of the data. They also used the original Mandarin data by translating it with Google, and combining all the data together performed best. Most interestingly, when they combined the 3 MT systems with a manual translation of the Mandarin, it performed worse than the 3 MT systems with the Google-translated Mandarin data!
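As a rough sketch of how evidence from several translated collections might be combined: the authors actually combine weighted query transformations in Indri, so the simple CombSUM score fusion below is my illustrative stand-in, not their exact method.

```python
from collections import defaultdict

def combsum(result_lists):
    """Merge ranked lists from several indexes (e.g., one per MT system)
    by summing each document's retrieval score (CombSUM fusion)."""
    fused = defaultdict(float)
    for results in result_lists:
        for doc_id, score in results:
            fused[doc_id] += score
    # Sort by fused score, highest first
    return sorted(fused.items(), key=lambda kv: -kv[1])

# The same (hypothetical) query run against three differently translated indexes
lists = [
    [("d1", 0.9), ("d2", 0.5)],
    [("d2", 0.8), ("d3", 0.4)],
    [("d1", 0.3), ("d3", 0.6)],
]
print(combsum(lists))  # d2 ranks first (0.5 + 0.8), then d1, then d3
```

A document retrieved by several of the translated collections accumulates evidence, which is the intuition behind the "combining all the data performs best" result.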

Translation Disambiguation in Web-based Translation Extraction for English-Chinese CLIR

Chengye Lu, Yue Xu, Shlomo Geva
Queensland University of Technology

Given some Chinese articles with some English articles as support, the goal is to use a web search engine to identify these kinds of documents and then build translations of the words, since usually the English term is somewhere in the article (perhaps given in parentheses.) They collect the search engine snippets, not the complete documents. They select high-frequency words or the longest words as the translation. They then have to disambiguate the candidates: they use results returned by a search engine, picking the terms with the most results. It wasn’t clear to me how this works though; it looks like he was using multiple terms at once, (A, B, C) against (A1, A2, A3, B1, B2, B3, C1, C2, C3), and combinations of the translations at the same time. Why three terms at once?
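A minimal sketch of hit-count-based disambiguation of the kind described: pick the combination of translation candidates whose joint web query returns the most results. The search engine is stubbed out with canned counts here, and all terms and numbers are invented for illustration.

```python
from itertools import product

def hit_count(query_terms):
    """Stand-in for a web search engine call returning the number of
    results for a conjunctive query; stubbed with canned counts."""
    canned = {
        ("bank", "loan"): 9000, ("bank", "shore"): 150,
        ("riverbank", "loan"): 40, ("riverbank", "shore"): 2000,
    }
    return canned.get(tuple(query_terms), 0)

def disambiguate(candidate_sets):
    """Choose one translation candidate per source term by maximizing
    the joint web hit count over all combinations."""
    return max(product(*candidate_sets), key=hit_count)

# Two ambiguous source terms, two candidate translations each
print(disambiguate([["bank", "riverbank"], ["loan", "shore"]]))
```

Querying combinations rather than single terms lets the candidates disambiguate each other, which may be why the talk used several terms at once.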

Question by Doug Oard: Are you aware of anyone who has used a similar approach? (I think his students did some work like this a few years back, or at least they have built bilingual dictionaries using web data in a similar type approach.)

Question: How do you do the segmentation? Answer: they used a statistical approach using mutual information with a maximum of 20 characters, filtering out words in a dictionary and low frequency words.
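A toy sketch of mutual-information-based segmentation along the lines of that answer. This is my simplification: pointwise MI between adjacent characters, splitting at weak boundaries; the actual system also filters against a dictionary, drops low-frequency words, and caps words at 20 characters.

```python
import math
from collections import Counter

def train_pmi(corpus):
    """Estimate pointwise mutual information for adjacent character pairs."""
    chars, pairs = Counter(), Counter()
    for text in corpus:
        chars.update(text)
        pairs.update(zip(text, text[1:]))
    n_c, n_p = sum(chars.values()), sum(pairs.values())
    return {
        p: math.log((c / n_p) / ((chars[p[0]] / n_c) * (chars[p[1]] / n_c)))
        for p, c in pairs.items()
    }

def segment(text, pmi, threshold=0.0):
    """Insert a word boundary wherever adjacent characters attract weakly."""
    words, current = [], text[0]
    for a, b in zip(text, text[1:]):
        if pmi.get((a, b), float("-inf")) < threshold:
            words.append(current)
            current = b
        else:
            current += b
    words.append(current)
    return words

# Tiny artificial corpus: "AB" behaves like a word
pmi = train_pmi(["ABAB", "ABAB"])
print(segment("ABAB", pmi, threshold=0.5))  # ['AB', 'AB']
```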

An answer to Doug’s question: a few years ago David A. Evans did some work on scoring passages that is kind of similar. (I think that comment was from him; it might have been Fredric Gey.)

Named Entity Processing for Cross-lingual and Multilingual IR applications

César de Pablo-Sánchez, José Luis, Paloma Martínez
U. Carlos III de Madrid

Named entities are a cheap first step that can help with multilingual processing. They want to look at methods that scale to web data and are applicable to multiple languages. They use simple rule-based methods, bootstrap using weakly supervised methods, and draw on reusable multilingual resources like Wikipedia.

They have some initial results from WebCLEF05. Their system made it easy to produce a tagger for a new language in one to three days. They used both linguistic features and web specific (tag) features.

Questions: Fredric Gey noted that he didn’t see German. They didn’t include German because it would be a bit more difficult.

Real World Understanding for Multilingual Statistical Tables

Fredric Gey, UC Data Archive & Technical Assistance

In Question Answering, list questions are some of the hardest things for people to deal with. Some of these can be answered by tables. Often the tables have both English and foreign labels, and can be mined for translation equivalents. There is a large organization at sdmx.org that specifies a standard for table data extraction.

Questions and comments: Doug Oard suggests looking at the OCR community, and DARPA might start something up in this area.
David A. Evans: Tables are very important, they have had customer solutions for 2 years doing this because customers want it. Tables are everywhere. This will be commercially very important in a quiet way. Another issue is mapping spreadsheets into databases. The problems are semantic, not syntactic.

Multilingual Summarization at DUC and MSE

Chin-Yew Lin, Microsoft Research Asia

Overview of DUC and MSE. I didn’t take notes since I know this pretty well.

Questions and comments:

Doug Oard: IR and summarization are similar. Should people in the IR community abandon mean average precision and move to summarization-style evaluations to foster more innovation and interactivity?
A question about Basic Elements: Is the basic idea to compare syntactic structure? Answer: there is a bit of confusion because there are some semantic tools in there as well. The idea is to move beyond syntax only and look at paraphrasing techniques also.

Identification of Document Language in Hard Contexts

Joaquim Ferreira da Silva, Gabriel Pereira Lopes
Universidade Nova de Lisboa

Language identification when the whole document is in one language is more or less a solved problem. It is more difficult when languages are mixed inside a single document, and when language runs are very short. So they look at characterizing documents using character n-grams. What are good sequences to use? For example, the sequence o# (# is a blank here) is discriminative across languages, but d#f is not. They reduce over 300,000 character sequences down to about 18 by applying principal component analysis over similar documents to find the important character sequences. They got about 100% precision at the document level, but down to 96% when looking at runs with a 23-character minimum size. The discriminative sequences can change depending on the context.
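A toy version of scoring text with a few discriminative character sequences: the sequences and weights below are invented for illustration, whereas the paper derives its ~18 sequences automatically via principal component analysis.

```python
# Invented per-language weights for a few sequences; '#' marks a word
# boundary, as in the talk's o# example.
PROFILES = {
    "en": {"the#": 2.0, "o#": 0.2, "ção": -1.0},
    "pt": {"ção": 2.0, "o#": 1.5, "the#": -1.0},
}

def identify(text, profiles=PROFILES):
    """Pick the language whose weighted sequence counts score highest."""
    padded = "#" + text.replace(" ", "#") + "#"
    scores = {
        lang: sum(padded.count(seq) * w for seq, w in seqs.items())
        for lang, seqs in profiles.items()
    }
    return max(scores, key=scores.get)

print(identify("the cat sat on the mat"))   # en
print(identify("tradução e informação"))    # pt
```

Because the scoring is per-span rather than per-document, the same machinery could run over short runs inside a mixed-language document, which is the hard case the paper targets.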

Questions and Comments

David A. Evans: there has been a lot of work in the patents area looking at words and what language they are from. People haven’t looked at principal components he doesn’t think, but that is an interesting approach for document level.

Integrated Content Presentation for Multilingual and Multimedia Information Access

Gareth J. F. Jones, Vincent Wade
Dublin City University, Trinity College Dublin

Casting IR as an adaptive hypermedia model. AH usually has a strong user model, and uses lots of metadata to present the information to the user. AH also uses a diverse range of content, mixing text, graphics, video, etc. One area might be cross-language image search. He suggests also looking at foreign language documents, and linking chunks of a document to similar in-language document chunks. This is very similar to the work I did using Arabic language documents to select English sentences for summarization.

Studying the Use of Interactive Multilingual Information Retrieval

Daqing He, Douglas W. Oard, Lynne Plettenberg
University of Maryland

The parable of AltaVista and Google. AltaVista used disjunctive queries (ORs); then Google came along with some advantages: a larger index, conjunctive queries, the PageRank algorithm. AltaVista had better mean average precision, but Google put them out of business.

We have to focus on user studies to see what the actual process is for usage. They developed a system to allow users to interactively work with the results. Sometimes the users in the process do the expected CLIR thing (query formulation, query translation, search, reformulate query) but also sometimes they do translation re-selection without reformulating the query. As part of GALE this led to a system that does live translation of video feeds (via IBM) and lets people do CLIR over that, watch the feeds slightly delayed with subtitled machine translations, etc.

Questions and Comments

Question: What are the strategies that you have seen? Answer: We thought people would want to see duplicates removed, but they don’t want that. Users want duplicates, and want to be able to see them all.

Question: With the interactions with the users, how many iterations are the users willing to go through? Answer: when the user has more control, they do fewer interactions, but finer-grained ones.

Who are the users? 5 librarians (professionals and students.) This is really a discussion question though: how happy are the users? Do they really want interactive systems? They are intentionally focused on information professionals just so that they can avoid some of the problems with usage. Users really like to have their past history easily accessible. It is very expensive to implement all of these things and do user studies.

The Remarkable Search Topic-Finding Task to Show Success Stories of Cross-Language Information Retrieval

Masashi Inoue, National Institute of Informatics

The problem is that people do not know how they should use CLIR. How can we implement a task for designing successful search topics for CLIR? When you want to sell new technologies or products, it is important to have “success stories” to promote your service. How can we design CLIR topics for this?

I didn’t take such great notes on Masashi’s talk because I’ve seen these slides as well. There was a very good reception from the audience though. David A. Evans in particular commented that it is a very interesting and good idea that he likes, but it is risky. It is great to show users what they are missing, but you have to be careful about it, and it looks like an expensive and difficult task.

The Future of Multilingual Summarization: Beyond Sentence Extraction

David Kirk Evans, National Institute of Informatics

No notes for this one. I wonder why. I got a nice comment by David A. Evans who thinks that this is a very nice area to go into. Businesses are interested in social networks and sentiment analysis. I was a little disappointed that I didn’t get more questions or comments though, because in general that is a bad sign. In cases of few questions it means either the audience did not understand, and you lost them (poor presentation on your part); the audience thought the ideas were poor and did not want to confront you, or just didn’t think it was worth saying anything about (poor choice of research topic or approach!); or the audience wasn’t listening (perhaps there isn’t much interest overlap.) Of course, one or two nice comments is always better than some very critical comments and questions. 🙂

New Directions in CLEF

Carol Peters

An overview of what has happened up until 2002.
They have grown from 13 to 40 to over 110 groups from 1997 to 2002 to 2006. The collection has grown, with 12 languages, news documents in comparable corpora, image collection, some speech… Overall there has been a stimulation of research, improved document retrieval results, created a large set of data for experimentation, built a strong community.

What hasn’t been done? Where are the systems? We’ve forgotten the users.
What new tasks are needed for more advanced information requirements? How can we reduce the gap between the research and application communities? What are we doing wrong? Who are the users? What are the use cases?


While you might not be aware of systems, there was a comment from industry: they were asked to build an Arabic system, so they read all the papers and used the data to build one. It saved them tons of time. So maybe the results are just not as visible.

Another comment from David A. Evans: we should not be so worried about evaluating our communities only by the systems that are produced. The marketplace is fickle, and sometimes that is beyond our control. The community has broken new ground that we would not have gotten from just focusing on monolingual areas.

The Way Ahead of Multilingual Information Access Evaluation at NTCIR

Noriko Kando, National Institute of Informatics

IR, QA and summarization are a family of access methods. We need to look at user-oriented evaluations more as well. We should try to move toward interactive and exploratory search. Also look at catering to advanced and novice users: there are differences. For cross-lingual: look at different perspectives.

Questions and Comments

David A. Evans: some of the most interesting and challenging tasks are being addressed in CLEF and NTCIR but not TREC. None of the tasks are realistic; they are determined by the government and special interest groups. CLEF has medical data, NTCIR has patents. Transforming data into charts is another great problem (NTCIR’s MuST task.) The next big gap is on the side of cognitive models: we have observational models, but we’re poor on explanatory models. Perhaps we need to look at cognitive science for protocol analysis.
Fredric Gey: The patent task at NTCIR has a ridiculous amount of data and really interesting topics: given a patent, is there prior art? That’s a totally different task from what others are doing, and totally useful.
Doug Oard: Speaking up for TREC: of course any cross-lingual stuff in there is accidental and unintentional. There is a lot of creativity in all three sessions. TREC has a legal track, which is focused on a community, but so are patents. There are big upsides to having three evaluations around the world that coordinate, so they all focus on different things. That’s a big benefit. Comment from David A. Evans: TREC is taking up innovative tracks now, but it has taken some time to get there.

A Data Curation Approach to Support In-depth Multilingual Evaluation Studies

Maristella Agosti, Giorgio Maria Di Nunzio, Nicola Ferro
University of Padova, Italy

We don’t know enough about data curation and should focus on it more. I didn’t take good notes on this one because I spaced out for a bit.

Designing Multilingual Information Access to Tate Online

Paul Clough, Jennifer Marlow, Mark Sanderson

A nice user study of what users want in a CLIR site. People were more skilled in reading foreign languages than writing them. Many of the people would like access in other languages, and are more interested in browsing than in search. For cultural heritage users, we need to think about speed of execution, availability of resources (wrappers around Babelfish, will it go down, etc.?), copyright issues, cost and quality. They are working on a multilingual version of Tate that has wrappers around Babelfish, and they will start testing that soon.

The Tate collection Online might have multilingual features soon.

Implementing MLIA in an existing DL system

Martin Braschler, Nicola Ferro, Julie Verleyen

They want to include MLIA functions in a system that communicates in an AJAX fashion with a server. They looked at doing query translation and data record translation at the central index. They ran an experiment over 150,000 records from the British Library, taking 100 queries and manually translating them into German to test monolingual performance. They need better stemming / decompounding. They have performed some initial evaluation of the document collection translation approach.

Questions and comments

Do you have any evaluation results? Since this is a feasibility study, they only have qualitative results so far. Another comment: since Z39.50 is a protocol that allows for interaction and boolean operations on the result set, you might have trouble with that complexity.

What is the Future of Multi-lingual Information Access?

Isabelle Moulinier, Frank Schilder

Thomson is a large professional information management firm with 40,000 employees in 45 countries, primarily in the legal and regulatory sectors, learning, financial services, and scientific / health care. About 10 years ago roughly 75% of revenue was print based; now about 42% is in online services and search (Westlaw!)

They are mostly about search, and they apply what they have learned in CLEF and NTCIR. They focus on boolean search. They are not really in CLIR though: customers are not asking for it yet. When you try to use it, whenever there is a problem the customer calls, so you have to be very careful about what you show to people and what you claim to do. Cross-lingual is expensive (cost, resources, can it scale?). Can it be profitable?

The problem of cross-linguality is not just translation, but it varies from country to country. The legal systems are different – some countries are based on common law, some are based on regulations. There are very different approaches to how you search under those two different paradigms. How do concepts map across country lines? Since users are librarians who are used to having lots of control, how do you handle things across language?

There is a lot of monolingual demand though, so you can focus on improving that. Move from words to concepts during translation. Go beyond document retrieval: better understand how users interact with a CL system, look at cultural points of view, controversy detection, summarization, question answering. Would like to present a multilingual news summary with different opinions from different news coverage (different languages.)


There are 4 panelists. Tetsuya Sakai from Toshiba on “Possible Future Directions in Multilingual Information Access”. Pay more attention to information presentation, interaction, etc. Interesting slides on how what we are evaluating does not really exist. (Query in Japanese, get Chinese documents back – but in reality the Japanese person can’t read those!) A very neat idea of querying, and a sequence of queries to get to the information that user wants, ended by “ok, give me a summary of the information that we’ve now found.”

Mark Sanderson: In CLEF there is a focus on mean average precision, but does that really reflect real-world performance and what users want to do? TREC is planning on moving to a several-thousand-query track. What is the correct measure to use? It only takes one bad query for the user to decide there is a problem. There is an emphasis on pseudo-relevance feedback, but does that really help in the real world? What about offering up different types of search results depending on the interpretation of the query? Finally, culture is an interesting application area, particularly for academics. English is becoming the minority; sooner or later English speakers might need more help.

David A. Evans: A healthy tension has emerged between science and commercial applications. What we need for good science isn’t necessarily what we need for good applications: the community should not lose sight of what that is. In the real world, many people don’t know what is needed for good science (and don’t care.) As a community we have allowed ourselves to play the numbers game, to improve by x% at some significance level, but we should really look at what we need to do to evaluate real performance. We need to have shared resources. Certainly test collections are necessary, but we need resources for replication. Need good lexical and NLP resources for each language (emphasis mine.) Standard statistical evaluation packages are important with good instructions on how to use them. We need to have good models for explaining why things are the way they are, so good explanatory theories.

What do we need for good applications? We need to be able to use reliable resources and techniques, and we need clear roles in solving problems. Applications don’t exist without a user community and a focus, a user need to solve problems, so we need to define those problems and be realistic about it. Some bridge problems that may be useful involve messages, blogs, fragments and excerpts (because these can be useful in the solution and commercial spaces.) CLEF and NTCIR do a good job with patents and medical information, but what about advertisements, contracts, insurance documents? Contracts are especially important; analyzing them can show a survey of practice.

We need to be wary of user studies. It is very easy to do very bad science with user studies. Doing human studies is tough, especially when the tasks are ill defined or have many degrees of freedom. We need to be much clearer on what it means to conduct human experiments, and we need to be driven by the cognitive models. These will not be simple models, and it will be tough, but it is necessary.

About pseudo-relevance feedback: there is a lot of data from the RIA workshop that looked into pseudo-relevance feedback; in general it does work and improves things, but there is a tension between what the system does well and what the user expects the system to do well. In their system they use pseudo-relevance feedback, but they often have to hide that information from the user. We know it works (reliably in most cases), but there is a challenge between that and what the user will accept.
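For reference, the family of technique being discussed can be sketched as a classic Rocchio-style expansion: treat the top-ranked documents as relevant and mix their terms into the query. This is my illustration of the general idea, not any particular system's implementation.

```python
from collections import Counter

def rocchio_expand(query_terms, top_docs, alpha=1.0, beta=0.75, n_terms=5):
    """Rocchio-style pseudo-relevance feedback: assume the top-ranked
    documents are relevant and mix their term frequencies into the
    original query weights, keeping the n_terms heaviest terms."""
    weights = Counter({t: alpha for t in query_terms})
    for doc in top_docs:
        for term, count in Counter(doc.split()).items():
            weights[term] += beta * count / len(top_docs)
    return [t for t, _ in weights.most_common(n_terms)]

# Pretend these are the top two retrieved documents for "retrieval"
docs = ["cross language retrieval systems",
        "retrieval of multilingual documents"]
print(rocchio_expand(["retrieval"], docs))  # "retrieval" stays on top
```

The expanded terms are exactly the kind of system-internal information Evans says often has to be hidden from the user, since each shown term would have to be justifiable.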

Doug Oard: has slides, “4 good ideas from today”. At SIGIR 1996 there was the first workshop (called something different.) There were two things that we agreed on: 1. the cross-language document retrieval problem was the correct problem to work on, 2. it should be called cross-language and not cross-lingual. We’ve regretted both since then. Now we think that CLIR is pretty much solved, but maybe it isn’t the best problem to focus on. We don’t know much about what the rest of the story is.

  1. Death to average precision. Ranked lists are a type of multi-document summary, let’s try some summarization measures.
  2. Study fully integrated systems. Users are control freaks. Study the process and not just the results. We are in enormous danger of doing bad science with user studies, but we have to at least jump in and try doing it. Should think before taking a step, but we do have to start working on it.
  3. Study cross-cultural retrieval: CLIR requires matching meaning. MT requires re-expressing meaning. Communication requires negotiating meaning. We should look at cross-cultural communication.
  4. Eat your own dog food: let’s make a searchable archive of CLEF or NTCIR records. Help develop an archival practice. Illustrate the use of our technology in a multinational enterprise.






