2006-07-18 Invited Keynote Tuesday morning
Argmax Search in Natural Language Processing
Daniel Marcu
The goal is to convince us that search is hard, and that by not
worrying about it (just throwing in an argmax without thinking much
about it) we create problems for ourselves.
For example, with ISI’s syntax-based MT system, the big problem they
have is search errors, because the search space is very complex. When
they moved to training on long sentences, they started to see many
search errors. A very interesting high-level talk about the problems
still open in search, with many good examples, encouraging us as a
field to spend more time on search, compared to modeling, where we
already spend a lot of time.
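To make the notion of a search error concrete, here is a minimal
sketch (my own toy, nothing to do with ISI’s system): a model scores
output sequences, and a beam that is too narrow returns a
lower-scoring hypothesis than exhaustive argmax, even though the
model itself prefers something else.

```python
# Toy illustration (not from the talk): compare exact argmax over all
# output sequences with a narrow beam search under the same model.
import itertools

VOCAB = ["a", "b", "c"]
LENGTH = 4

def model_score(seq):
    # Hypothetical non-local scoring: small local preference for "a",
    # plus a big bonus that only fires when the sequence ends in "c c".
    score = sum({"a": 0.3, "b": 0.2, "c": 0.1}[tok] for tok in seq)
    if seq[-2:] == ("c", "c"):
        score += 2.0
    return score

def exact_argmax():
    return max(itertools.product(VOCAB, repeat=LENGTH), key=model_score)

def beam_search(beam_size):
    beams = [()]
    for _ in range(LENGTH):
        candidates = [b + (tok,) for b in beams for tok in VOCAB]
        candidates.sort(key=model_score, reverse=True)  # rank prefixes
        beams = candidates[:beam_size]
    return beams[0]

best = exact_argmax()
approx = beam_search(beam_size=2)
print("exact argmax:", best, model_score(best))      # ('a', 'a', 'c', 'c') 2.8
print("beam (k=2):  ", approx, model_score(approx))  # ('a', 'a', 'a', 'a') 1.2
```

The beam result scores worse under the model than the true argmax:
the model is fine, the search is what failed, which is exactly the
kind of error the talk is warning about.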
—
Summarization I session
Bayesian Query-Focused Summarization
Hal Daume III, Daniel Marcu (ISI, University of Southern California)
Extractive summarization in a mostly-unsupervised setting. Assume a
large document collection, a query, and relevance judgments for the
query over the documents (a few representative examples). Very
interesting talk; they compared against a variety of extraction
methods. Basically it works like a relevance feedback method from IR,
applied over sentences.
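My reading of it is roughly a relevance-feedback loop applied to
sentences rather than documents. A minimal sketch of that reading
(not the paper’s Bayesian model; the tokenization, weights, and
cutoff are made up):

```python
# Sketch of "relevance feedback over sentences" (my reading of the
# talk, not the paper's Bayesian model): expand the query with frequent
# terms from the few judged-relevant documents, then rank sentences by
# overlap with the expanded query.
from collections import Counter

def tokens(text):
    return text.lower().split()

def expand_query(query, relevant_docs, top_k=10):
    counts = Counter()
    for doc in relevant_docs:
        counts.update(tokens(doc))
    expansion = [term for term, _ in counts.most_common(top_k)]
    return tokens(query) + expansion

def rank_sentences(sentences, expanded_query):
    qset = set(expanded_query)
    scored = [(len(set(tokens(s)) & qset), s) for s in sentences]
    return [s for _, s in sorted(scored, key=lambda x: x[0], reverse=True)]
```

A summary would then just take the top-ranked sentences up to the
length budget.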
—
Extractive Summarization using Inter- and Intra- Event Relevance
Wenjie Li, Mingli Wu, Qin Lu, Wei Xu, Chunfa Yuan
Their events are verbs and action nouns appearing between two named
entities. They also look at the associated named entities (person,
organization, location, date). They build a graph of named entities
and event terms, then run PageRank over the graph to create
summaries.
Intra-event relevance is defined as direct relevance between two
nodes via named entities. Inter-event relevance is links on the graph
that are not direct? I’m not sure; this wasn’t very clear in the
presentation.
They use WordNet similarity for “semantic relevance” and also a
topic-specific relevance from the document set. Basically it looks
like just a count of how many times the events occur together (the
number of named entities that they share). They also cluster named
entities based on descriptive text to see how they are related.
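As far as I can tell the core machinery is PageRank over a
co-occurrence graph of event terms and named entities, with sentences
selected according to the scores of the nodes they contain. A rough
reconstruction of that step (my own toy graph and a plain power
iteration, not their code):

```python
# My reconstruction of the ranking step: nodes are event terms and
# named entities, edges come from co-occurrence, and PageRank scores
# the nodes. The toy graph below is made up.
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping node -> list of neighbour nodes."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            incoming = sum(rank[m] / len(graph[m])
                           for m in nodes if n in graph[m])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

toy_graph = {
    "attack":   ["Baghdad", "troops"],
    "withdraw": ["troops", "Pentagon"],
    "Baghdad":  ["attack"],
    "troops":   ["attack", "withdraw"],
    "Pentagon": ["withdraw"],
}
scores = pagerank(toy_graph)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```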
Basically I don’t think they are doing anything new here; I’ve seen
all the things they are doing before. They put a lot of different
approaches together, perhaps, but I don’t see any big new ideas.
They use some DUC data for evaluation, with ROUGE scores for
different features of their system, but I don’t think the differences
they present are statistically significant, from what I remember of
DUC scores.
There was a long pause before any questions, but then one was asked
about how PageRank was used. I don’t think that is an interesting
question, because Drago (at least; there are others) has been using
PageRank for sentence ranking for a few years. Of course, some more
details could be given on their usage and adoption of it.
—
Models for Sentence Compression: A Comparison across Domains, Training
Requirements and Evaluation Measures
James Clarke, Mirella Lapata
(I met James Clarke last night and had a few drinks with him. A
really nice guy!)
This paper concentrates on word deletion. They had 3 annotators
remove tokens from a broadcast news transcript. They analyze their
corpus and an existing one (Ziff-Davis, by Knight and Marcu I think,
created by looking at abstracts and finding the corresponding full
sentences in the news article).
They train a decision tree model to learn operations over the parse
tree, and compare it to a word-based model.
60 unpaid annotators rated compressed sentences on a single 1-5
scale.
They looked at two automatic evaluation metrics to see whether they
correlate with the human scores. The more complicated metric (they
say F-measure, but should probably specify F-measure over what)
correlated better than simple string accuracy. The decision tree
method didn’t perform as well as the word-based model.
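For reference, my own simplified take on the two metrics: simple
string accuracy (one minus word-level edit distance over reference
length) and an F-measure over retained tokens. Since the talk didn’t
say what the F-measure is over, the token-based version below is just
one plausible instantiation, not necessarily theirs.

```python
# Simplified versions of the two evaluation metrics discussed (my own
# instantiations, not necessarily the paper's exact definitions).
def edit_distance(a, b):
    # Standard word-level Levenshtein distance.
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def simple_string_accuracy(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))

def token_f1(reference, hypothesis):
    ref, hyp = set(reference.split()), set(hypothesis.split())
    if not hyp or not ref & hyp:
        return 0.0
    p, r = len(ref & hyp) / len(hyp), len(ref & hyp) / len(ref)
    return 2 * p * r / (p + r)
```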
I really liked this presentation as well, and will make an effort to
see his poster.
Kathy commented that he missed out on some previous work (namely
Hongyan Jing, and Bonnie Dorr on headline generation). This paper is
more statistically oriented, but didn’t seem to include the document
context to determine what is important. The poster describes work
using integer programming to pull in more linguistic rules.
Another comment on using document-level context.
A question about trying to retain information that is important to
someone (so, query-focused compression). A question from Inderjeet
Mani (I think) on the quality of the internet-based unpaid
annotators.
—
A Bottom-up Approach to Sentence Ordering for Multi-document
Summarization
Danushka Bollegala, Naoki Okazaki, Mitsuru Ishizuka
(The University of Tokyo)
They propose an automatic evaluation metric for sentence ordering
(average continuity). They do a hierarchical ordering based on
pairwise comparisons of all sentences (or blocks) with the others.
They have a variety of heuristics for sentence ordering that go into
the evaluation functions, and they learn the combination of all the
features from a set of manual sentence orderings. The features are
chronological order; topical relevance using cosine similarity; a
precedence criterion (compare the sentences that come before the
extracted sentence in its source document with the information in the
extracted set); and a succession criterion, which is the same but for
following sentences. They create training instances with the four
features over manually ordered summaries.
Average continuity takes n-grams of the proposed ordering and looks
at how many of them appear in the reference ordering.
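My guess at the computation (not necessarily their exact formula):
for each n, the fraction of n-grams of the proposed ordering that
also occur as contiguous runs in the reference ordering, combined
across n with a geometric mean.

```python
# Sketch of an "average continuity" style metric (my guess at the
# computation, not necessarily the authors' exact formula).
import math

def continuity_precision(proposed, reference, n):
    ref_ngrams = {tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1)}
    prop_ngrams = [tuple(proposed[i:i + n])
                   for i in range(len(proposed) - n + 1)]
    if not prop_ngrams:
        return 0.0
    return sum(g in ref_ngrams for g in prop_ngrams) / len(prop_ngrams)

def average_continuity(proposed, reference, max_n=4, alpha=0.01):
    precisions = [continuity_precision(proposed, reference, n)
                  for n in range(2, max_n + 1)]
    return math.exp(sum(math.log(p + alpha) for p in precisions)
                    / len(precisions))

print(average_continuity([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # ~1.0
print(average_continuity([3, 1, 5, 2, 4], [1, 2, 3, 4, 5]))  # ~0.01
```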
I didn’t really think this is groundbreaking; it just puts together
a few features for sentence ordering. Still, it is a nice piece of
engineering that pulls together the different sentence ordering
approaches.
—
Went to lunch with Kathy McKeown and Min-yen Kan.
—
Skipped the first session so I could check email and brush my teeth.
—
Direct Word Sense Matching for Lexical Substitution
Ido Dagan, Oren Glickman, Alfio Gliozzo, Efrat Marmorshtein, Carlo
Strapparava
The main point of this talk was that WSD does not have to be defined
as picking out the WordNet synset number for each word we are
interested in. It can be reduced to a binary categorization task:
does the sense of the target word match the sense of the source word?
Moving to that formulation makes things clearer for actual use in
applications (a less complicated task than what people have been
solving unnecessarily) and opens the field up to new methods from
classification.
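A minimal sketch of the reframed task as I understood it (the
features and decision rule below are made up for illustration, not
the authors’ method): classify whether the target word in its context
carries the same sense as the source word in its context, instead of
predicting a synset ID.

```python
# Binary sense-matching setup (illustrative features only, not the
# authors' method): does the target occurrence carry the same sense as
# the source occurrence?
def features(source_word, source_context, target_word, target_context):
    src = set(source_context.lower().split())
    tgt = set(target_context.lower().split())
    overlap = len(src & tgt)
    return {
        "same_lemma": float(source_word == target_word),
        "context_overlap": float(overlap),
        "context_jaccard": overlap / max(1, len(src | tgt)),
    }

def senses_match(feats, threshold=0.2):
    # Stand-in decision rule; in practice this would be any trained
    # binary classifier (logistic regression, SVM, ...).
    return feats["context_jaccard"] >= threshold
```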
—
Segment-based Hidden Markov Models for Information Extraction
Zhenmei Gu, Nick Cercone
I’m having a hard time understanding what the main thrust of this
talk is, 11 slides in.
—
Information Retrieval I Session
An Iterative Implicit Feedback Approach to Personalized Search
Yuanhua Lv, Le Sun, Junlin Zhang, Jian-Yun Nie, Wan Chen, Wei Zhang
They use click-through information to gather terms for query
expansion, which is then used to re-rank results for the user. They
use a HITS-like algorithm, with “search results” and “query terms”
playing the roles of “authorities” and “hubs”. The interesting thing
is that the re-ranking and query expansion are performed at the same
time (upon convergence, the top terms are taken for query expansion
and the top results are re-ranked).
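A rough sketch of how that mutual reinforcement could work (my
reconstruction of the idea, not their exact algorithm): documents
play the role of authorities and candidate expansion terms play the
role of hubs, reinforcing each other through term-in-document links.

```python
# HITS-style mutual reinforcement between result documents and
# candidate expansion terms (my reconstruction, not their algorithm).
import math

def hits_rerank(doc_terms, iterations=20):
    """doc_terms: dict doc_id -> set of terms it contains."""
    terms = {t for ts in doc_terms.values() for t in ts}
    auth = {d: 1.0 for d in doc_terms}   # document ("authority") scores
    hub = {t: 1.0 for t in terms}        # term ("hub") scores
    for _ in range(iterations):
        auth = {d: sum(hub[t] for t in doc_terms[d]) for d in doc_terms}
        hub = {t: sum(auth[d] for d in doc_terms if t in doc_terms[d])
               for t in terms}
        # L2-normalize so the scores converge instead of blowing up.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {d: v / a_norm for d, v in auth.items()}
        hub = {t: v / h_norm for t, v in hub.items()}
    reranked = sorted(auth, key=auth.get, reverse=True)
    expansion_terms = sorted(hub, key=hub.get, reverse=True)
    return reranked, expansion_terms
```

On convergence the top terms would feed query expansion and the
document scores would drive the re-ranking, which matches my reading
of how they couple the two.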
Out-performs Google in both English and Chinese on precision at 5, 10,
20, and 30 documents.
A question on which documents are used to determine the terms picked
for query expansion. The accusation is that by using terms from
unclicked documents to decide which terms to select, they are not
using a good discrimination set.
Atsushi Fujii from NII asked whether they had tried pseudo-relevance
feedback, which should have similar performance. There is some
question about whether they used a fair baseline.
—
The Effect of Translation Quality in MT-Based Cross-Language
Information Retrieval
Jiang Zhu, Haifeng Wang (Toshiba R&D China)
The idea is to use translated queries from an MT system that has been
artificially degraded by reducing the size of its knowledge base
(rules / dictionary), and then try to correlate translation quality
with retrieval quality.
They are correlated, so better MT means better IR results. Degrading
the rule base leads to syntax errors and word sense errors. Degrading
the dictionary causes only word sense errors, to which the IR systems
are more sensitive.
Comments: much CLIR does not even use MT, since it isn’t really
required, and often systems just throw in all senses of a word. A
question: did they try to determine which categories of words had the
most impact (named entities, nouns, verbs, etc.)? They didn’t look at
that; they focused on translation quality and search effectiveness. A
similar question on syntax rules, but they only randomly selected
rules to drop.
—
A Comparison of Document, Sentence, and Term Event Spaces
Catherine Blake
The conclusion is that document IDF and inverse sentence frequency
(and inverse term frequency) are correlated. IDF values appear to be
very stable. The language used in abstracts is different from the
language of the full text (for this data, anyway).
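A toy version of the quantities being compared (my own sketch, not
the paper’s setup): inverse document frequency and inverse sentence
frequency per term, which is the kind of pair the paper finds to be
highly correlated.

```python
# Toy IDF vs. inverse sentence frequency computation (my own sketch,
# not the paper's setup): every sentence is treated as a mini-document.
import math
from collections import defaultdict

def inverse_frequencies(documents):
    """documents: list of documents, each a list of sentence strings."""
    doc_count = defaultdict(int)
    sent_count = defaultdict(int)
    n_docs = len(documents)
    n_sents = sum(len(doc) for doc in documents)
    for doc in documents:
        doc_terms = set()
        for sentence in doc:
            terms = set(sentence.lower().split())
            doc_terms |= terms
            for t in terms:
                sent_count[t] += 1
        for t in doc_terms:
            doc_count[t] += 1
    idf = {t: math.log(n_docs / doc_count[t]) for t in doc_count}
    isf = {t: math.log(n_sents / sent_count[t]) for t in sent_count}
    return idf, isf
```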
A kind of interesting investigation of IDF across different event
spaces. There were some questions from (my notes abruptly end here.)