Notes from the 2006-09-12 to 13 Information Processing Society of Japan meeting. The Information Processing Society of Japan Special Interest Group on Natural Language Processing holds bi-monthly meetings all around Japan. Two months ago, the meeting was in Hakodate. This time, the meeting was in Shinjuku, very close to where I live, so I decided it would be a good chance to attend and see what research is going on in the field in Japan.
It was really interesting. All but two of the presentations were in Japanese, which was a very nice chance to get up to speed on technical Japanese, and to see how presentations here go. It was pretty tiring too, though. I also had a chance to talk with some of the members of the 情報爆発世界ニュース group that I’m involved with.
Below are some very superficial comments about the papers that I saw in Tuesday’s session.
The first presentation was about concepts that cannot be found in
dictionaries or wikis. How do these concepts operate like technical
terms, and how are they related? It looks like the input is books:
they use scanners and OCR over books. He builds a matrix of technical
terms over the publications. Given a term, you can then pick out the
contexts that it can appear in, and find related terms.
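To make the matrix idea concrete, here is a minimal sketch of how I imagine it could work. It is not the presenter's implementation: the toy publications, the term list, and the choice of cosine similarity over raw counts are all my own assumptions.

```python
import math

# Toy stand-ins for OCR'd book text (hypothetical data).
publications = [
    "parser grammar treebank parser",
    "grammar treebank",
    "kernel scheduler kernel",
]
terms = ["parser", "grammar", "treebank", "kernel"]

def build_term_vectors(publications, terms):
    """Map each term to a list of counts, one entry per publication."""
    tokenized = [pub.lower().split() for pub in publications]
    return {term: [tokens.count(term) for tokens in tokenized] for term in terms}

def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_terms(term, vectors):
    """Rank the other terms by how similar their publication profile is."""
    scores = [
        (other, cosine(vectors[term], vec))
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

vectors = build_term_vectors(publications, terms)
ranking = related_terms("grammar", vectors)
```

In this toy data, "grammar" and "treebank" appear in exactly the same publications, so "treebank" ranks first, while "kernel" shares no publication with "grammar" and ranks last.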
—
The second presentation was by Nobuyuki Shimizu of New York University,
whom I met at SIGIR: “Knowledge Frame Extraction From Navigational Route
Instruction”
When giving directions to a robot or similar agent, how do you plan the
path? There are three actions: turn left/right, go straight, and enter a
door. The system takes those directions and fills a frame with them.
—
Shimomura-san from Okayama University.
He is trying to determine the agent, theme, and goal of sentences, so
the problem is to assign semantic roles to the nouns in the sentence.
They have already studied many of these roles, and today’s
presentation was on “Reason” and “Opponent”. They use EDR to get
information on nouns, and have some interesting rules that use EDR
categories and particles to classify nouns
(actor, agent, theme, goal, etc.)
—
University of Tokyo – Hiroko Ishida-san – “Study of sensory classification
as co-occurring terms with imitative words”
This is about retrieving information from medical databases using
everyday Japanese, and the giongo / gitaigo (onomatopoeic and mimetic
words) used to describe sounds or feelings in organs.
They used goo ヘルスケア as a resource, as well as a giongo/gitaigo
dictionary, and a medical dictionary. They hand-classified about 2300
terms from goo into about 8 categories on what part of the body they
relate to.
—
I had lunch with Yoshioka-sensei, Mori-sensei, and others.
—
Yoshioka’s presentation: Classification of Anchor Text for Web
Information Applications
He looked at the anchor text in 100 gigabytes of web pages (from the
NTCIR web test collection) and how inter-site and intra-site link anchor
text differs. That is about 100,000,000 links. There is some interesting
analysis of the link content. He wants to classify link text into 8
categories.
—
Hirao Kazuki – “Web Search Result Clustering Based on Structure of
Compound Nouns” – Okayama University
The idea here is to use the compound nouns in Japanese documents to
cluster them. Also, since compound noun composition is fairly easy to
understand, we can use that to build a hierarchy of clusters. 複合名詞
is “compound noun” in Japanese. You can break them up into a
supplemental concept and a main concept (補足部 and 主要部). That can be used
to build a concept hierarchy. They build their clusters, and then have
a nice way of labeling the clusters. In particular, if the cluster
was generated by a keyword that is a sub-concept that they all share,
then they make a label like “XX Recipes”.
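A rough sketch of how such a hierarchy might be assembled, as I understood the idea rather than the authors' actual method. I assume the compound nouns have already been split into (supplemental part, main part) pairs by some morphological analysis step, and I use English stand-ins for the Japanese terms; the document IDs and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: for each search result, the compound nouns it contains,
# already split into (supplemental part, main part).
doc_compounds = {
    "d1": [("chicken", "recipes")],
    "d2": [("chicken", "recipes"), ("pasta", "recipes")],
    "d3": [("pasta", "recipes")],
    "d4": [("digital", "cameras")],
}

def build_hierarchy(doc_compounds):
    """Group documents first by main part, then by supplemental part."""
    hierarchy = defaultdict(lambda: defaultdict(set))
    for doc_id, compounds in doc_compounds.items():
        for supplemental, main in compounds:
            hierarchy[main][supplemental].add(doc_id)
    return hierarchy

def cluster_labels(hierarchy, main):
    """Label each sub-cluster under a main concept, e.g. 'chicken recipes'."""
    return sorted(f"{supplemental} {main}" for supplemental in hierarchy[main])

hierarchy = build_hierarchy(doc_compounds)
labels = cluster_labels(hierarchy, "recipes")
```

The top level of the hierarchy is the main concept ("recipes"), and each sub-cluster gets a readable label for free by joining the supplemental part back onto it.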
—
Hiroyuki Sakai – “Estimation of Impact Contained in Articles about
each Company in Financial Areas” – Toyohashi University
This also looks like a very interesting and relevant (to my work)
paper. He is talking about identifying documents and sentences that
will have an “impact” on business. I’m having a really hard time
understanding this presentation (the presenter speaks quickly) but it
looks like they calculate the stock price the day before and the day
after the date of the news article, and look at that to calculate the
impact. In general, I think this is a very good approach (and for
longer periods of time, like over a week, look at the week-long behavior
of the stock price and the centroid of articles over that week).
They extract some terms that have a big impact (via entropy) on the
company.
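Here is a minimal sketch of the before/after price comparison as I understood it. The relative-change formula, the handling of non-trading days, and the prices themselves are my assumptions, not details from the talk, and the entropy-based term extraction is not covered.

```python
from bisect import bisect_left, bisect_right
from datetime import date

def price_impact(prices, article_date):
    """Relative change in closing price from the last trading day before
    the article to the first trading day after it.

    prices: dict mapping datetime.date -> closing price.
    Returns None if there is no trading day on one side of the article.
    """
    days = sorted(prices)
    before_idx = bisect_left(days, article_date) - 1
    after_idx = bisect_right(days, article_date)
    if before_idx < 0 or after_idx >= len(days):
        return None
    before = prices[days[before_idx]]
    after = prices[days[after_idx]]
    return (after - before) / before

# Made-up closing prices for illustration only.
prices = {
    date(2006, 9, 11): 100.0,
    date(2006, 9, 12): 104.0,
    date(2006, 9, 13): 110.0,
}
impact = price_impact(prices, date(2006, 9, 12))
```

With these made-up prices the article dated September 12 gets an impact of about 0.1, a 10% rise from the close the day before to the close the day after.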
—
Daisuke Matsuzaki, Toshinori Watanabe, Hisashi Koga, Nuo Zhang –
“Document Relation Analysis by Data Compression” – University of
Electro-Communications
I wasn’t particularly interested in this paper, and I’ve seen work
before on using compression to do document categorization.
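For readers who haven't seen the compression idea before, here is a small sketch using the normalized compression distance with zlib. This is the general technique from earlier work that I had in mind, not necessarily what this paper does; the compressor choice and the sample texts are my own assumptions.

```python
import zlib

def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def ncd(x, y):
    """Normalized compression distance: small when x and y share structure,
    because the compressor can reuse patterns from x while encoding y."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy documents (hypothetical); repetition just makes them long enough
# for the compressor to find patterns.
doc_a = "the stock price of the company rose sharply after the news article " * 20
doc_b = "compound nouns in japanese documents can be split into two parts " * 20

same = ncd(doc_a, doc_a)
different = ncd(doc_a, doc_b)
```

A document compared with itself gets a much smaller distance than two documents on unrelated topics, and that difference is what a compression-based categorizer exploits.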
—
Tetsuya Sakai – “Controlling the Penalty on Late Arrival of Relevant
Documents in Information Retrieval Evaluation with Graded Relevance” –
Knowledge Media Laboratory, Toshiba R&D Center.
I’ve read one of his papers on Q-measure before, and this work seems quite
similar to it.
—
Hiroka Kimura, Toshinori Watanabe, Hisashi Koga, Nuo Zhang – “A New
Document Retrieval Method Using LZ78 Compression Function”
They did some summarization experiments using artificial documents.
They had a few strategies: centroid summarization, most similar
summarization, and most unique summarization. I didn’t really follow
what the documents were like and what they evaluated against. They
also used Japanese news documents to do real summarization with human
models to compare against. I didn’t understand how the summaries were
scored either. People liked the centroid and most novel types
of summarization most. It looked like it was more of a human scoring
evaluation than anything else.
This particular paper was the focus of a lot of questions. The main
thrust of the questioning was “how is this different from a
vector-space model for summarization?” The
interesting part of the answer is that this approach really works on
binary data, so application to text is a first step, but conceivably
these approaches can be applied to pictures or video as well.
Certainly I agree that is true, but I’m not sure if there is a strong
argument to be made for how to interpret the results over those kinds
of media.
—
Junichi Fukumoto, Tsuneaki Kato, Fumito Masui, Tatsunori Mori, Noriko
Kando – “An Automatic Evaluation of Question Answering using Basic
Elements” – Ritsumeikan University, University of Tokyo, Mie
University, Yokohama National University, National Institute of
Informatics
An overview of the BE evaluation method for Question Answering.
It will be used in the QAC at NTCIR this year. They have also created
a Japanese version of a BE creator that runs over CaboCha output.