Rough notes from Thursday’s presentations at COLING/ACL.
Exploiting Syntactic Patterns as Clues in Zero-Anaphora Resolution
Ryu Iida, Kentaro Inui, Yuji Matsumoto
A nominee for the best Asian paper award. A fairly interesting
approach to anaphoric identification for “missing” zero-anaphora. I
can’t really say that I know much about anaphora resolution research,
but they have a fairly sophisticated approach that uses dependency
parse trees (retaining only relevant sub-trees and reducing them to
part-of-speech information to avoid data sparsity).
—
Challenges in NLP: Some New Perspectives from the East
A very interesting panel I think. Three points:
* Are Asian languages special?
* What are some salient linguistic differences in Asian languages, and
what are the implications for NLP?
* Can the availability of detailed linguistic information (e.g.,
morphology) help ameliorate problems with the scarcity of large
annotated corpora?
(1 Jun’ichi Tsujii)
For Asian languages, there is some separation between sentence grammar
and discourse grammar. Asian languages are discourse-oriented
languages. Topic markers (wa) and particles that are very sensitive
to discourse (sae, dake, mo, koso…) have a real impact on discourse
understanding. Similarly for Chinese: there does not seem to be a
packaging device in English that corresponds to Chinese topic marking.
Zero-anaphora (pronouns that are simply dropped and have to be
recovered) is also closely related to discourse. Context is very
important for understanding underspecified meaning (僕はうなぎだ,
literally “I am an eel”, i.e. “I’ll have the eel”).
(2 Benjamin K. Tsou, City University of Hong Kong)
Writing systems in Asian languages are very varied: alphabetic,
ideographic, and mixtures of the two. What are some differences? The
entropy of European writing systems is about 4 bits per character,
but for Asian languages it can go up to 9.
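A quick aside on what that entropy figure means: it is the per-character Shannon entropy of the symbol distribution. A toy sketch, with made-up sample strings rather than real corpora, so the numbers won’t match the 4-vs-9 figures:

```python
from collections import Counter
from math import log2

def char_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of a text sample."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy samples; an alphabetic script reuses a few dozen symbols,
# while an ideographic script spreads mass over thousands.
english = "the quick brown fox jumps over the lazy dog"
chinese = "我愛北京天安門天安門上太陽昇"

print(char_entropy(english))
print(char_entropy(chinese))
```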
Structurally, when you have two nouns together, what is their
relationship (for MT etc.)? “The love of God” vs. “the love of
money” (subjective and objective readings vs. objective only). There
is a cultural basis for how to interpret these things as well.
Intrinsic differences between the writing systems show up in the
differences in their entropies. Different relationships hold between
constituents: sometimes they are marked, sometimes you need culture
to disambiguate.
(3 Pushpak Bhattacharyya)
Annotated corpora are a scarce resource. Only about 10 languages out
of roughly 7,000 have large annotated corpora. English has relatively
little morphology compared to other languages, so can we handle rich
morphology? Exploiting morphological richness can really help. We
should build high-quality morphological analyzers.
—
After the panel on Asian NLP I decided to go shopping for new shoes.
Despite my best efforts, my shoes stank from Wednesday’s incident
(thanks for owning up to that Kris!) and I didn’t think I could take
the smell much longer myself. So I missed out on the morning sessions
in exchange for a bout of shopping.
—
Discourse session (Daniel Marcu chair)
Proximity in Context: an empirically grounded computational model of
proximity for processing topological spatial expressions
John Kelleher, Geert-Jan Kruijff, Fintan Costello
This paper so far wins my “Best author list for creating a D&D
party from”.
They want to be able to talk to robots about the space they are in and
their surroundings. This means they need to understand spatial
references. They have looked at previous (human) research on
proximity and determined that proximity is a smooth function that is
inversely related to distance and depends on the absence or presence
of “distractor” objects and the size of the reference object. They use
functions to approximate proximity and merge the fields of the
distractors to determine overall proximity. They used human
experiments to verify their model. Presence of a distractor has a
statistically significant influence. Their relative proximity
equation predicted results better than absolute proximity alone
(without the normalization by the distractor). They had a funny
video with a robot that answered questions about where things are
(cows, pigs, and donkeys.)
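My rough paraphrase of the model, with made-up functional forms (the names and the fall-off function are mine, not theirs): proximity to the reference decays smoothly with distance, and the relative version discounts it by the proximity of the closest distractor.

```python
def absolute_proximity(dist: float, ref_size: float = 1.0) -> float:
    """Proximity falls off smoothly and inversely with distance,
    scaled by the size of the reference object (a made-up form)."""
    return ref_size / (ref_size + dist)

def relative_proximity(dist_to_ref: float, distractor_dists: list) -> float:
    """Discount proximity to the reference by proximity to the
    closest distractor, as I understood the talk."""
    p_ref = absolute_proximity(dist_to_ref)
    p_distractor = max((absolute_proximity(d) for d in distractor_dists),
                       default=0.0)
    return p_ref - p_distractor
```

With a distractor sitting close by, the same physical distance reads as less “near”, which matches their human-experiment finding.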
They have another paper on how the descriptions are chosen to describe
the scene – that would have been more interesting for me I think. The
talk was very understandable though, and well-presented I thought.
—
Machine Learning of Temporal Relations
Inderjeet Mani, Marc Verhagen, Ben Wellner, Chong Min Lee, James
Pustejovsky
Inderjeet Mani presenting, joint work with MITRE, Georgetown
University, and Brandeis University.
There is an annotation scheme, TimeML, to annotate time-based
expressions (TIMEX3) and relations between them. TLINKs are used to
express temporal links using Allen’s interval relations. MITRE has
some tools for annotating text (Tango? probably available for
research use). There is the TimeBank corpus, and they have an Opinion
corpus with about 100,000 documents!? I need to look into this.
Inter-annotator agreement is pretty low on TLINKs (.55 F on links and
labels). So they are looking at: given a human-determined link, can
they learn the correct link type?
They have a tool for enriching link chains with all link types
between all entities (called closure). So a small amount of markup is
done, then a full graph is created that allows for temporal reasoning.
Learning over the full graphs improves results greatly.
They had a baseline of human-created rules, and rules mined from the
web using ISI’s VerbOcean. Their approach of learning with a maximum
entropy system over the closed data is the best.
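The closure idea is easy to sketch. Their tool saturates the graph over all of Allen’s relations; this toy version (function name mine) handles only transitivity of BEFORE, but shows how sparse human markup expands into a fuller training graph:

```python
def temporal_closure(before_pairs):
    """Saturate a set of (a, b) 'a BEFORE b' links under
    transitivity: a<b and b<c imply a<c. The real closure tool
    covers all of Allen's interval relations, not just BEFORE."""
    closed = set(before_pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed

# Two annotated links yield a third, inferred one.
links = {("wake", "eat"), ("eat", "work")}
print(temporal_closure(links))
```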
—
You Can’t Beat Frequency (Unless You Use Linguistic Knowledge) — A
Qualitative Evaluation of Association Measures for Collocation and
Term Extraction
Joachim Wermter, Udo Hahn
Looking at collocation and technical term identification. He wants to
see how to determine whether one statistical method can be said to be
better than another. They only looked at candidates with frequency
higher than 10, to avoid low-frequency data. In English they looked
at trigram noun phrases; in German, at PP-verb combinations.
They compared against the t-test for collocations and terms. In each
case their linguistically motivated approach looked at how often the
candidates could be modified, or how much variation they showed. For
their comparison they computed their terms, then split the n-best
lists in half and looked to see how much the other statistical tests
would change the ranking with respect to true negatives and true
positives.
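For reference, the t-test baseline here is the standard collocation t-score, which asks how far the observed bigram frequency sits from what independence would predict. A minimal sketch with hypothetical counts (the numbers are invented):

```python
from math import sqrt

def t_score(bigram_count: int, w1_count: int, w2_count: int, n: int) -> float:
    """Collocation t-score: observed bigram probability vs. the
    expected probability P(w1)P(w2) under independence, scaled by
    the standard error of the observed rate."""
    observed = bigram_count / n
    expected = (w1_count / n) * (w2_count / n)
    return (observed - expected) / sqrt(observed / n)

# Hypothetical counts for a collocation in a 1M-token corpus.
print(t_score(500, 900, 700, 1_000_000))
```

Frequency and t-score rank candidates almost identically at these counts, which is essentially the paper’s point.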
His result tables and graphs were a little hard to follow because he
went by them very quickly. I think the scatter plots were easier to
understand than the tables, and on the tables I’m not sure how
significant the differences (in percentages) were. They seemed pretty
close.
The overall conclusion is that frequency counting does about as well
as the t-test for identifying these entities. The linguistically
motivated measures presented here do perform better than frequency
counting (and the t-test), though.
Question from Pascale: the linguistically motivated measure also
seemed to be statistical, though. She doesn’t agree with the
conclusion (citing Frank Smadja’s collocation work), saying that
nobody proposed using just the t-score.
Another question: does he do stop-listing? Yes, they do take out
stop words for terms.
Another question: what exactly do they show in this study? What
wouldn’t I learn from just looking at a precision-recall graph or
table from X’s study (some other study cited, but I didn’t catch it)?
—
Weakly Supervised Named Entity Transliteration and Discovery from
Multilingual Comparable Corpora
Alexandre Klementiev, Dan Roth (UIUC Cognitive Computation Group)
They use an English NE tagger to tag text; in a comparable corpus,
they use the English NE tags to identify counterparts in another
language (Russian) and move the tags across.
They use temporal alignment as a supervision signal to learn a
transliteration model: identify an NE in the source language, ask
transliteration model M for the best candidates in the target
language, then select the candidate that has a good temporal match to
the candidate NE and add it to the training data for the model.
Re-train model M and repeat until it stops changing, or some
stability has been reached.
The transliteration model takes a linear discriminative approach: a
perceptron M(Es, Et) gives a score for a transliteration pair.
Features are substring pairs of letters in the source and target.
The feature set (sets of pairs) grows as they see more examples. The
model is initialized with 20 NE pairs. They also look at how many (or
few) pairs you can seed the model with: 5 pairs worked as well as 20,
but took a lot longer to converge.
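My rough reading of the feature space and scorer; the position-pairing heuristic below is a guess on my part, and the function names are mine, not the paper’s:

```python
def substring_pair_features(src: str, tgt: str, max_len: int = 2) -> set:
    """Pair substrings of the source name with substrings at roughly
    corresponding positions in the target name (length-ratio
    alignment is my simplification of the paper's feature space)."""
    feats = set()
    ratio = len(tgt) / max(len(src), 1)
    for i in range(len(src)):
        j = int(i * ratio)
        for n in range(1, max_len + 1):
            s, t = src[i:i + n], tgt[j:j + n]
            if s and t:
                feats.add((s, t))
    return feats

def score(weights: dict, src: str, tgt: str) -> float:
    """Perceptron score M(Es, Et): sum of learned feature weights
    over the active substring-pair features."""
    return sum(weights.get(f, 0.0) for f in substring_pair_features(src, tgt))
```

The bootstrap loop would then rank candidate target strings by `score`, keep the ones whose time series also match, and update the weights on their features.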
For their temporal similarity model they group word variants into
equivalence classes for Russian. English isn’t as morphologically
diverse, so they just use exact strings. They use the DFT, then do
Euclidean distance in Fourier space.
They use about 5 years of short BBC articles with loose Russian
translations.
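The temporal signal is simple to sketch: turn each name into a per-period frequency profile, transform, and compare. The counts below are invented, and a real system would use an FFT rather than this naive DFT:

```python
from cmath import exp, pi

def dft(xs):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(xs)
    return [sum(x * exp(-2j * pi * k * t / n) for t, x in enumerate(xs))
            for k in range(n)]

def temporal_similarity(counts_a, counts_b) -> float:
    """Euclidean distance between the Fourier spectra of two
    per-time-period mention-count profiles (smaller = more similar)."""
    fa, fb = dft(counts_a), dft(counts_b)
    return sum(abs(a - b) ** 2 for a, b in zip(fa, fb)) ** 0.5

# Hypothetical weekly mention counts: an English NE and two Russian
# candidates, the first of which spikes in the same weeks.
ne_en     = [0, 5, 9, 2, 0, 0, 1, 0]
cand_good = [0, 4, 8, 3, 0, 0, 1, 0]
cand_bad  = [3, 0, 0, 0, 6, 7, 0, 1]
print(temporal_similarity(ne_en, cand_good))
print(temporal_similarity(ne_en, cand_bad))
```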
Using the temporal feature along with the learned transliteration
model they get pretty high accuracy, up to 70% on a top-5 list for an
NE term, and about 66% accuracy on multi-word NEs.
I really liked this presentation and think it is a very novel look at
transliteration. Would it work with Chinese? It is similar to
Pascale’s early work on time warping with comparable corpora to learn
a bilingual dictionary, but the feature set is novel to me: on-the-fly
learning of character transliteration pairs.
I asked a question about application to Chinese and Japanese, and got
about the answer that I expected.
—
A Composite Kernel to Extract Relations between Entities with both
Flat and Structured Features
Min Zhang, Jie Zhang, Jian Su, Guodong Zhou
They elect to use a kernel method for a learning system (SVM?) because
it can easily take hierarchical features as input. Their contribution
is that they designed a composite kernel that combines flat features
and syntactic features (hierarchical) for classification.
So they did a lot of work choosing which sub-trees from the parse
information to feed to their machine learning system, and ways to
prune the data to make it more tractable. They trained on ACE data
(’03 and ’04). They report that a specific type of tree kernel
(mixing both features and syntax trees with a polynomial kernel) has
the best performance, beating state-of-the-art systems.
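The combination trick itself is simple, since a weighted sum of valid kernels is still a valid kernel. A toy sketch (all names mine; my "tree kernel" just counts shared node labels, where the paper uses a proper convolution tree kernel over shared subtrees):

```python
def poly_kernel(x, y, degree: int = 2) -> float:
    """Polynomial kernel over flat (numeric) feature vectors."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** degree

def tree_kernel(t1, t2) -> int:
    """Toy structural kernel: count shared node labels between trees
    given as nested tuples (label, child, ...)."""
    def labels(t):
        if isinstance(t, tuple):
            out = [t[0]]
            for child in t[1:]:
                out += labels(child)
            return out
        return [t]
    l2 = labels(t2)
    return sum(l2.count(l) for l in labels(t1))

def composite_kernel(x1, x2, alpha: float = 0.5) -> float:
    """Weighted combination of the flat and structured kernels; a
    convex sum of kernels is itself a kernel, so an SVM can use it."""
    (flat1, tree1), (flat2, tree2) = x1, x2
    return alpha * poly_kernel(flat1, flat2) + (1 - alpha) * tree_kernel(tree1, tree2)
```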
I guess they have a nice result, but I didn’t like the talk much. The
speaker focused a lot on details that I don’t think are important.
What I took away from this is that it is possible to use a tree kernel
to mix in numeric feature values with syntax type hierarchical
features and achieve good performance. This seems like it should have
been a poster paper to me. (Of course, I don’t work in this area so
my opinion probably isn’t worth much…)