Notes from Friday’s sessions on the last day of COLING/ACL.
Friday 2006-07-21 Notes from COLING/ACL
Language, gender and sexuality: Do bodies always matter?
Sally McConnell-Ginet, Cornell University (Invited talk)
Co-author of book “Language and Gender” – sounds interesting.
Gave an example of Larry Summers, the Harvard president, and his talk
about women and science. There is a difference between overt and
covert messages, and we have to be aware of both of those, as well as
the ways that our messages can be interpreted. The majority of the
talk was about gender as a performance vs. gender as what we do, and
various ways that linguistics can be a resource for that performance.
Interesting talk about identity and how we realize our identities, and
the frameworks around us for socializing them.
—
Question Answering Session
Answer Extraction, Semantic Clustering, and Extraction Summarization
for Clinical Question Answering
Dina Demner-Fushman, Jimmy Lin (University of Maryland)
They present a system that answers the question “What is the best
treatment type for X?” and provides bullets listing treatments, along
with context to show support and links to the source articles where
the answers came from. The pipeline: extract answers, do semantic
clustering to organize the answers into a hierarchy, then extract a
supporting summary for each semantic category.
They have a template to search MEDLINE for drug-therapy type
questions. They then tag the returned data with MetaMap (a named-entity /
semantic tagger for the medical domain) to find interventions (drug
treatments), cluster the interventions, and extract a sentence to
show support. Their extractive summarizer uses supervised machine
learning to detect outcome statements, using a bunch of classifiers
(rule based, naive Bayes, n-gram, position, semantics, length) that
are combined linearly to pick the best “outcome statement”.
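To keep the combination idea straight in my own head, here is a minimal sketch of a linearly combined sentence scorer of the kind they describe. The component scorers and weights below are crude invented stand-ins, not the authors’ actual classifiers or tuned weights.

```python
# Minimal sketch: combine several per-sentence scores with fixed weights
# and pick the highest-scoring sentence as the "outcome statement".
# All scorers and weights here are hypothetical placeholders.

def position_score(idx, total):
    # Later sentences in an abstract are more likely to state the outcome.
    return idx / max(total - 1, 1)

def length_score(sentence):
    # Prefer sentences close to a "typical" outcome-statement length.
    return min(len(sentence.split()) / 25.0, 1.0)

def cue_phrase_score(sentence):
    # Crude stand-in for the rule-based / semantic classifiers.
    cues = ("significantly", "improved", "effective", "reduced", "outcome")
    return sum(cue in sentence.lower() for cue in cues) / len(cues)

WEIGHTS = {"position": 0.3, "length": 0.2, "cues": 0.5}  # made-up weights

def best_outcome_statement(sentences):
    def combined(idx_sent):
        idx, sent = idx_sent
        return (WEIGHTS["position"] * position_score(idx, len(sentences))
                + WEIGHTS["length"] * length_score(sent)
                + WEIGHTS["cues"] * cue_phrase_score(sent))
    return max(enumerate(sentences), key=combined)[1]
```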
Seems like a reasonable approach; I would be interested in seeing how
this work compares to Min-Yen Kan’s Centrifuser system, or the work
that Noemie has done. I’m not familiar enough with work in this area
to evaluate it well, but it doesn’t look
like anything amazing to me. I think evaluation in this
scenario is tough, because what are they going to compare to? They
compare to a graded 1-10 scale of how effective each treatment
was (but I’m not sure if this comes from MEDLINE or from elsewhere for
similar questions). They are using ROUGE to evaluate their summaries,
but what are they using as the reference summary?
So far their results have been color coded rather than hard numbers
with significance testing. They do say that the answers from
their clustering system outperform the plain IR results at a
statistically significant level, but I don’t know what
their model summaries are.
Kathy asks how they deal with contradictions across articles.
They do not deal with contradictions. Question from someone else
about which version of ROUGE was used and what the reference summaries were (me
too!): they used ROUGE 1.5 with ROUGE-1 precision, and the reference
summaries are the abstracts of the papers used to generate the
extractive summary (a very strange and poor evaluation
approach, I think).
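For my own reference, ROUGE-1 precision (the variant they report) is just clipped unigram overlap divided by the length of the candidate summary. A toy version, ignoring the stemming and stopword options of the actual ROUGE 1.5 toolkit:

```python
from collections import Counter

def rouge1_precision(candidate, reference):
    """Unigram precision of a candidate summary against one reference:
    clipped overlapping unigram count / number of unigrams in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, ref[tok]) for tok, count in cand.items())
    return overlap / max(sum(cand.values()), 1)
```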
Another question about comparatives and negation (“this is not the
best treatment”): they use SemRep, which deals with negations. It isn’t
quite the same as Kathy’s question, but it is related. In the
data there do not seem to be many comparative statements, so it
hasn’t been much of a problem.
A question asking them to explain more about the features used in the classifier for
sentence extraction. It looks to me like they have basically some
pretty standard features. The rule-based classifier was created by a
registered nurse who looked at about 1000 good articles and tried to
identify what the bottom-line sentence would be, along with indicators to
find it. What is the naive Bayes classifier trained on? The n-gram and
semantic classifiers too… They have a paper on that (JAMIA 2005), it looks
like, so we can go there for more information. Another question on
the extractor: have you compared the output of the meta-classifier
against expert opinions on extracts? Yes, it is fairly accurate (85-92%),
but they didn’t say how they evaluated that. There is
other work that looks at the relations in the abstracts, and they
didn’t want to duplicate that work.
—
Exploring Correlation of Dependency Relation Paths for Answer
Extraction
Dan Shen, Dietrich Klakow (Saarland University; presented by Dan Shen)
They are looking at syntactic matching between the question parse tree and
candidate answer sub-trees. They want to avoid strict parse tree
matching because the answer can be realized in different forms. They
use the Minipar dependency parser. They introduce a path
correlation measure to check the correlation between the question
parse and the candidate answer parse. They use a dynamic time warping
algorithm to align the two paths. They need to know what the
correlation is between types of relations, so they compute a
correlation measure over the TREC 02 and 03 data using mutual
information.
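My rough mental model of the path alignment: dynamic time warping over two sequences of dependency relations, where the per-step cost comes from a relation-pair correlation table. The table and relation labels below are made-up stand-ins; in the paper the correlations are estimated from TREC 02/03 paths via mutual information.

```python
# Hypothetical correlation table between dependency relation types.
REL_CORR = {("subj", "subj"): 1.0, ("subj", "whn"): 0.6, ("obj", "obj"): 1.0}

def rel_similarity(r1, r2):
    return REL_CORR.get((r1, r2), REL_CORR.get((r2, r1), 0.1))

def dtw_path_score(q_path, a_path):
    """Dynamic time warping distance between a question relation path
    and a candidate answer relation path (lower = better match)."""
    n, m = len(q_path), len(a_path)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 1.0 - rel_similarity(q_path[i - 1], a_path[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]

# e.g. dtw_path_score(["whn", "subj"], ["subj"]) gives a fairly small distance
```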
They also have a phrase similarity metric that works over noun
phrases. It takes some morphological variation (via stemming) into
account and some format variations. Semantic similarity is done as
per Moldovan and Novischi 2002.
TREC 1999-2003 for training data; test data is TREC 2004. They built the
training set by grabbing relevant documents for each question and
matching one keyword to find the answer. They used this for training?
Isn’t there some better gold standard? I am not clear on their
training data and what they are trying to train at all. They use
kernel trees to do the syntactic matching, but say that they don’t
learn the similarity features? They present a bunch of different
methods, but spent too much time early on explaining their system, and
I’m not clear what they are comparing against.
This talk is going a bit long, but there were some mike problems
earlier on (picking up interference from another mike somewhere?) that
delayed it. I don’t think there will be time for questions though.
She might not finish. For a 20 minute talk, she has 32 slides…
They found that overall their correlation-based metric outperformed
competing methods, and performed well on “difficult questions” where NE
recognizers might not help.
There is time for at least one question. The correlation was used in
both answer extraction and answer ranking. How good a job does
this correlation do at capturing additional candidates versus wrong
candidates? (Where are the false positives and negatives?) Using exact
matching degraded performance by 6%, and they lose 8% if they remove the
Maximum Entropy model.
Seems like an interesting paper, but I think the presentation should
have been more polished. There were real timing issues, and I’m not
sure the impact of the work came across. I think the main contribution
here is the correlation measure and not relying on strict matching only.
Again, I don’t have a large background in this area. I’m
interested in the similarity measures though, since I think handling
the different forms an answer can take is important.
—
Question Answering with Lexical Chains Propagating Verb Arguments
Adrian Novischi, Dan Moldovan (Language Computer Corp.)
Dan Moldovan (session chair) is giving the next paper, so he was
introduced by someone else.
The main point is that you should propagate syntactic structures along
your lexical chains. They needed to extend WordNet by matching to
VerbNet patterns. Not all VerbNet classes have syntactic patterns, so
they had to learn them. They didn’t present the results of that learning,
but there is probably an earlier paper about it somewhere.
They presented a method to propagate the structure along through verbs.
See Moldovan and Novischi 2002. Different relations modify the
syntactic structure in some way, and they have a table showing
the different relation classes and what changes. They have some weights for
how to order the relations, which were set by trial and error. I’m
not really clear on what the ordering is used for. They gave examples for
different relation types, including an example of how this can
be used to map an answer to a question.
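A small sketch of what I take a lexical chain between two verbs to be: a path of WordNet relations (hypernym, entailment, cause, …) connecting a synset of one verb to a synset of the other. The paper’s contribution is propagating argument structure along such a chain; the sketch below (using NLTK’s WordNet interface, my choice, not theirs) only finds the chain itself.

```python
from collections import deque
from nltk.corpus import wordnet as wn

# Relations to follow when building a chain between verb synsets.
RELATIONS = {
    "hypernym": lambda s: s.hypernyms(),
    "hyponym": lambda s: s.hyponyms(),
    "entailment": lambda s: s.entailments(),
    "cause": lambda s: s.causes(),
}

def lexical_chain(verb1, verb2, max_depth=3):
    """Breadth-first search for a short chain of WordNet relations
    linking a synset of verb1 to a synset of verb2."""
    targets = set(wn.synsets(verb2, pos=wn.VERB))
    queue = deque((s, []) for s in wn.synsets(verb1, pos=wn.VERB))
    seen = set()
    while queue:
        synset, path = queue.popleft()
        if synset in targets:
            return path
        if synset in seen or len(path) >= max_depth:
            continue
        seen.add(synset)
        for name, rel in RELATIONS.items():
            for nxt in rel(synset):
                queue.append((nxt, path + [(name, nxt.name())]))
    return None

# e.g. lexical_chain("buy", "pay") may return something like
# [("entailment", "pay.v.01")]
```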
They have some numbers on how many arguments can be propagated across
106 concept chains. By adding this syntactic structure propagation
they improved the performance of their Q&A system.
eXtended WordNet is available from University of Texas at Dallas. It
has word senses disambiguated in the glosses and they have been
transformed into logic forms.
In general, I am not sure what this talk is really about. It is about
lexical chains (over WordNet concepts instead of words) and how
you can transform them by following the relations in WordNet/VerbNet (along
with some learned relations) to help with Q&A by transforming
answer parses to match question parses. You can do some sophisticated
reasoning with this approach. It seemed like it was a bit all
over the place though. I would have preferred that the presentation
focus on the algorithm and how it can be used with Q&A. I
didn’t see enough on the learning of the 20,000 syntactic patterns added
to VerbNet.
Questions: How do you handle constructions like
“considered hitting the ball”? That would be up to the semantic
parser to figure out. In cases of entailment the same thematic roles
are retained – what about in cases like “buy and sell”, do the
thematic roles stay the same? Possibly they flip in that example.
—
Methods of Using Textual Entailment in Open-Domain Question Answering
Sanda Harabagiu, Andrew Hickl (LCC)
Focused on using textual entailment in a Q&A system. They have a
sophisticated system for parsing questions and answer paragraphs.
They extract features, then a classifier makes a yes/no decision
on whether the question entails the answer, with some confidence.
Sanda has really been going through the slides quickly. They do
detect negative polarity. Their system is built upon a large QUAB
(Question Answer Bank – a big database of question-answer pairs).
Generally they are looking for an entailment relationship between a
new question and a question they have already seen in their
database. They use a Maximum Entropy classifier to predict whether two
questions can be related by entailment, with a variety of
features.
They trained the classifier over 10,000 alignment chunk pairs labeled positive or
negative. MaxEnt did better than a hill climber on the same training
set. They extracted 200,000 positive sentence pairs from the AQUAINT corpus:
a headline was taken to entail the first
sentence of its article, filtered to require at least
one entity in common between the two. Negative examples were built from
sentences containing contrast markers (“even though”, “in contrast”, “but”, etc.):
the two clauses around the marker were taken as not entailing
each other. Performance improved significantly over their 10,000-pair
development corpus.
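A toy sketch of the training-data harvesting as I followed it: headlines paired with first sentences (sharing at least one entity) as positive entailment examples, and clauses joined by contrast markers as negatives. The extract_entities function below is a crude placeholder for whatever entity tagger they actually used.

```python
CONTRAST_MARKERS = ("even though", "in contrast", "but", "although")

def extract_entities(text):
    # Placeholder: pretend capitalized tokens are named entities.
    return {tok for tok in text.split() if tok[:1].isupper()}

def positive_pair(headline, first_sentence):
    """Headline entails the first sentence, kept only if they share an entity."""
    if extract_entities(headline) & extract_entities(first_sentence):
        return (first_sentence, headline, True)
    return None

def negative_pairs(sentence):
    """Split on a contrast marker; treat the two clauses as non-entailing."""
    pairs = []
    for marker in CONTRAST_MARKERS:
        if marker in sentence:
            left, _, right = sentence.partition(marker)
            if left.strip() and right.strip():
                pairs.append((left.strip(), right.strip(), False))
    return pairs
```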
Their system also uses paraphrase alignment. They use web data to get
clusters of potential paraphrases, following Barzilay and Lee 2003.
She’s really been whipping through the slides; there is a lot of
content here, but it is getting hard to follow. She explained the
features used in the entailment classifier. Their experiments use
textual entailment in three ways: to filter the output answers, to
rank candidate passages, and to select automatically generated question-
answer pairs from the QUAB database. They evaluated it over 600 randomly
selected questions from TREC. Also using the PASCAL Recognizing Textual
Entailment 2006 data for evaluation? Their textual entailment system
is about 75% accurate on that PASCAL data. They can improve
Q&A accuracy from .30 to .56 using entailment for known answer
types. It also improves with unknown answer types, but not as much.
Lots of interesting things in here, but there was too much in the
presentation, I think. I’ll have to read the paper. In general
though, it looks like textual entailment was a big help… The talk
went long, maybe 10 minutes? Since it is just before lunch though, it
isn’t a problem.
Questions: It wasn’t clear how the actual entailment was set up – the
question was used as the hypothesis.
—
Machine Translation Session (Dekai Wu chairing)
Scalable Inference and Training of Context-Rich Syntactic Translation
Models
Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve
DeNeefe, Wei Wang, Ignacio Thayer
(Columbia University, USC/ISI, Language Weaver, Google)
I thought Michel’s talk was pretty good. I’m sure he’ll get some
tough questions though. I liked the rule analysis, and that he didn’t
spend too much time talking about BLEU scores.
Dan Melamed asked a question about non-contiguous phrases and what
was meant by that. Another question about the power of the formalism:
is it the same as ITG or not? Question by Dekai Wu about the time
complexity.
—
Modelling lexical redundancy for machine translation
David Talbot, Miles Osborne (School of Informatics, University of Edinburgh)
There is redundancy in the lexicon used for translation that is
unimportant for translation purposes. It makes the model more complex
(e.g., seaweed -> nori, konbu, wakame, etc. in Japanese), which makes the
learning problem more difficult. The idea is to remove such distinctions in the
training corpus to improve the distributions for learning, reduce data
sparsity, etc.
I had trouble following the math used to cluster together terms that
should be collapsed. The model prior is a Markov random field. This
is used somehow to indicate how likely it is that words should be assigned to the
same cluster (it has features over the words, for example two words that
differ only in that one starts with a c and the other with a g). They use an EM
algorithm to tune the parameters, using bilingual occurrence
information to guide the E step. That looks pretty interesting to
me. David gave an overview of other related work, and there is some
other related work in morphology that might do more of that.
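My rough reading of the clustering prior, as a sketch: pairwise spelling features over source-language words give a potential for putting two words in the same cluster, which the bilingual co-occurrence evidence would then reweight during EM. The features and weights below are illustrative only, not the ones in the paper.

```python
def pair_features(w1, w2):
    """Spelling-based features suggesting two source words should collapse."""
    feats = {}
    feats["same_stem4"] = float(w1[:4] == w2[:4])
    feats["len_diff_le2"] = float(abs(len(w1) - len(w2)) <= 2)
    # e.g. words differing only in an initial c vs. g, as in their Welsh example
    feats["c_g_alternation"] = float(
        w1[1:] == w2[1:] and {w1[:1], w2[:1]} == {"c", "g"}
    )
    return feats

WEIGHTS = {"same_stem4": 1.5, "len_diff_le2": 0.3, "c_g_alternation": 2.0}

def same_cluster_potential(w1, w2):
    """Unnormalized MRF potential for assigning w1 and w2 to the same cluster."""
    return sum(WEIGHTS[f] * v for f, v in pair_features(w1, w2).items())
```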
Their experiments use Czech, Welsh, and French. Phrase-based decoder
with GIZA++ for alignment. They showed some improvement in BLEU scores (2-4
points) and examples where their system helped. They also looked at vocabulary
sizes, which are reduced by maybe a quarter.
Questions: one comment that BLEU should be thrown away, and that the
commenter would like to see 10 sentences before and after. Did you look at
distributional features in the source language? No, not yet, but he has
considered it. He thought of using monolingual features in the MRF prior,
but didn’t implement it. He initially talked about how more refined word
senses could be tackled here, but the examples were all morphological. The
features in the MRF don’t capture thesaurus information, but he had an
example of secret, secretive, and private translated into Welsh where
some collapsing was done.
I thought this was an interesting presentation and will have to read
the paper.
—
Moved over to the generation session.
Learning to Predict Case Markers in Japanese
Hisami Suzuki, Kristina Toutanova (Microsoft Research)
In one of the examples she gives as incorrect output from her MT
system, I would probably have made the same mistake… First
presented results on predicting case markers in a monolingual task.
The system makes errors with ha / ga, similar to how humans make the
mistake…
The second part did prediction in a bilingual setting. Given a
source English sentence and a Japanese translation of the sentence
missing its case markers, they have a dependency parse of the English
from the NLPWIN parser (Quirk et al. 2005, from Microsoft Research), then
align with GIZA++, and derive syntactic information on the Japanese
side from the English parse projected via the alignment links, plus POS
tags on the Japanese side.
They have a variety of features for this task: a 2-word left and
right context window, POS tags and dependency tree info, the
aligned English words, and source syntactic features. This improves
over the monolingual Japanese-only model, and in each case using syntactic
features helped.
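A sketch of the kind of feature vector I imagine for a single case-marker slot: local Japanese context, projected POS/dependency information, and the aligned English words. The feature names and function signature are mine, not the authors’.

```python
def case_marker_features(jp_tokens, jp_pos, slot, aligned_en_words, head_pos=None):
    """Features for predicting the case marker at position `slot`."""
    feats = {}
    for offset in (-2, -1, 1, 2):                     # 2-word context window
        i = slot + offset
        if 0 <= i < len(jp_tokens):
            feats[f"w{offset}={jp_tokens[i]}"] = 1.0
            feats[f"p{offset}={jp_pos[i]}"] = 1.0
    for en in aligned_en_words:                       # aligned English words
        feats[f"en={en.lower()}"] = 1.0
    if head_pos is not None:                          # projected dependency info
        feats[f"head_pos={head_pos}"] = 1.0
    return feats
```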
Questions: Japanese case markers often do not have English
equivalents, so what do they align to in English using GIZA++? They usually
align to null, or ha sometimes goes to the copula (is).
—
Improving QA Accuracy by Question Inversion
John Prager, Pablo Duboue, Jennifer Chu-Carroll (IBM T.J. Watson)
What other questions should you ask a QA system when you don’t know if
an answer is correct? By asking other questions they can find
information that can be used to invalidate bad answers (e.g., if you
know the birth and death year of a person, all things relating to
their achievements must be bounded by those years.)
They take a question Q1 and generate more questions (inverted questions)
Q2 to find answers A2 that can be used to invalidate candidate answers A1
to Q1. Looks like I got this description a bit wrong, but it
should become clearer as the presentation goes on.
Interesting – they use the generated questions to get answers that they
then use to validate (or invalidate) the answers they already had.
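A toy version of the inversion idea using their birth/death example: if a candidate answer implies an achievement date, ask the inverted questions “when was X born” and “when did X die” and reject answers outside that span. The ask() callable stands in for the underlying QA engine; this is my illustration, not their actual question-generation machinery.

```python
def validate_achievement_year(person, candidate_year, ask):
    """Use answers to inverted questions to invalidate an inconsistent candidate.

    `ask` is a hypothetical QA callable: ask("When was X born?") -> int or None.
    """
    birth = ask(f"When was {person} born?")
    death = ask(f"When did {person} die?")
    if birth is not None and candidate_year < birth:
        return False          # invalidated: dated before the person was born
    if death is not None and candidate_year > death:
        return False          # invalidated: dated after the person died
    return True               # consistent with the inverted answers
```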
—
Learning to Say It Well: Reranking Realizations by Predicted Synthesis
Quality
Crystal Nakatsu, Michael White (Department of Linguistics, Ohio State
University)
This is synthesis in the text-to-speech sense. The job is to choose
text realizations for the synthesizer that are predicted to sound
good. They rated a variety of different sentence realizations, and
used an SVM to rank the realizations. Their system improved output quality
by a statistically significant amount, raising average ratings from “ok”
to “good”.
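A minimal sketch of the reranking setup as I understood it, using an SVM regressor (scikit-learn assumed) in place of whatever ranker they actually trained; the features below are invented placeholders, and the real system presumably uses much richer ones tied to the synthesizer.

```python
import numpy as np
from sklearn.svm import SVR

def featurize(realization):
    # Placeholder features: word count and average word length.
    words = realization.split()
    return [len(words), sum(len(w) for w in words) / max(len(words), 1)]

def train_reranker(rated_realizations):
    """Train on (realization text, human rating of the synthesized audio) pairs."""
    X = np.array([featurize(r) for r, _ in rated_realizations])
    y = np.array([rating for _, rating in rated_realizations])
    return SVR(kernel="rbf").fit(X, y)

def pick_best(model, candidate_realizations):
    """Return the realization predicted to sound best when synthesized."""
    scores = model.predict(np.array([featurize(r) for r in candidate_realizations]))
    return candidate_realizations[int(np.argmax(scores))]
```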
Questions: There was a question about where effort should go to
improve synthesis. The answer is that you should work on both
synthesis and generation. Did the reviewers hit a fatigue effect where
they got used to the synthesizer? They spread the effort out over days, and
started from different ends of the lists. How was inter-annotator
agreement, and did the annotators really judge only the synthesis and
not the sentence structure? Agreement was around .66. They were more
interested in the extremes of what was really good and really bad, so
they were not too worried about that.