Notes from Wednesday 2007-03-21 Natural Language Processing Meeting in Japan

Information Extraction, Text Mining

Information Extraction / Text Mining room is almost completely full.

D3-3 小説テキストを対象とした人物情報の抽出と体系化 (“Extraction and organization of character information from short stories”)
○馬場こづえ, 藤井敦 (筑波大)
D3-4 統計的手法を利用した伝染病検索システムの構築に向けて (“Towards construction of a statistical search system for infectious diseases”)
○竹内孔一, 岡田和也 (岡山大), 川添愛, コリアー・ナイジェル (NII)
D3-5 米国特許データベースからの引用文献情報の抽出 (“Extracting literature references from Western Patent Databases”)
○小栗佑実子, 難波英嗣 (広島市立大)
D3-6 開発プロジェクトリスク管理のための議事録発言の分析 (“Analysis of spoken meeting records for development project management”)
○齋藤悠, 立石健二, 久寿居大 (NEC)
D3-7 コールセンターにおける会話マイニング (“Call Center Conversation Mining”)
○那須川哲哉, 宅間大介, 竹内広宜, 荻野紫穂 (日本IBM)
D3-8 意見性判定手法の評価と精度向上 (“Improvement in precision of opinionated text identification”)
○高橋大和, 廣嶋伸章, 古瀬蔵, 片岡良治 (NTT)
D3-9 言語情報と映像情報の統合による作業教示映像の自動要約 (“Automatic summarization of pictures used for teaching by unifying text and image information”)
○柴田知秀 (東大), 黒橋禎夫 (京大)

D3-3 小説テキストを対象とした人物情報の抽出と体系化

“Extraction and organization of character information from short stories”, ○馬場こづえ, 藤井敦 (筑波大)

They analyze short story text and extract the characters that appear in the stories. They also build a relationship map between the extracted entities. They extract five kinds of information about each character (sex, age, job, …). They compute a score for each character based on its appearances in the text, whether or not it has lines in a particular scene, and specific words that signal a character appearing or leaving. They rely on parsed information about names to determine whether people are characters or not, but for foreign names they use Katakana strings in conjunction with the rules and their score metric to determine whether a string is a name. They also have a mechanism for determining when two strings are co-referent (e.g., Sherlock Holmes and Holmes), based on the full name occurring close to a shortened name or a noun that refers to it. On their entity map, characters that have more contact with each other are placed closer together.
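
To make the co-reference idea concrete, here is a minimal sketch of how a shortened name could be linked to a nearby full name. This is not the authors' implementation: the function name `merge_coreferent_names`, the `max_gap` threshold, and the assumption that names split into space-separated tokens are all mine, for illustration only (Japanese text would need a different tokenization).

```python
def merge_coreferent_names(mentions, max_gap=200):
    """Map each name to a canonical (fuller) name seen nearby.

    mentions: list of (char_offset, name) pairs in text order.
    A name is linked to an earlier, fuller name when its tokens are a
    proper subset of the fuller name's tokens and the fuller name (or an
    alias of it) was last seen within max_gap characters.
    """
    canonical = {}   # name -> canonical name
    last_seen = {}   # canonical name -> offset of its latest mention
    for offset, name in mentions:
        tokens = set(name.split())
        best = None
        for full, pos in last_seen.items():
            close_enough = offset - pos <= max_gap
            if tokens < set(full.split()) and close_enough:
                # Prefer the most recently mentioned candidate.
                if best is None or pos > last_seen[best]:
                    best = full
        target = best if best is not None else name
        canonical[name] = target
        last_seen[target] = offset
    return canonical


mentions = [(0, "Sherlock Holmes"), (50, "Holmes"), (120, "Watson"), (130, "Holmes")]
print(merge_coreferent_names(mentions))
```

One limitation of this sketch is that a short name seen before its full form stays unlinked; a real system would need a second pass or the noun-reference handling described in the talk.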

This is actually quite interesting to me, because I’m working now on summarization that goes in this direction of identifying important people for a story, etc.

D3-4 統計的手法を利用した伝染病検索システムの構築に向けて

“Towards construction of a statistical search system for infectious diseases”, ○竹内孔一, 岡田和也 (岡山大), 川添愛, コリアー・ナイジェル (NII)

Infectious diseases start out as local events, and reporting on them will be first in the local language, not English. We also don’t know at first whether something is an infectious disease or not, so we have to check many sources to see if it is widely reported, which is difficult to do by hand. So the point is to build an automatic system that can check these things. Their system is called BioCaster. They have crawled the web for data, annotated it, created a domain ontology, and are using the annotated data as training data. They also check the language and translate it into the appropriate language for the target audience, and they have a summarization component as well.

Their system currently runs in Japanese, but they have plans to develop it for other Asian languages as well. They’ve developed an ontology and have a statistical method that uses it to tag data with semantic information. The tags have to be assigned based on context (you can’t always assume that a person can be tagged as a condition descriptor unless there is some description of a sickness, condition, etc.). They use CaboCha for some parse information. They have run an experiment looking at precision and recall (and F) over tags with different feature sets, using CRFs (conditional random fields) and SVMs (support vector machines). It looks like CRFs perform a bit better than SVMs on their data.
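
For reference, per-tag precision, recall, and F can be computed from token-level gold and predicted tags roughly as follows. This is a generic sketch, not the evaluation code from the talk; the tag names `DISEASE`, `LOCATION`, and `O` are placeholders I made up, not the BioCaster tag set.

```python
from collections import Counter


def per_tag_prf(gold, predicted):
    """Return {tag: (precision, recall, f1)} for two equal-length tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it was not the gold tag
            fn[g] += 1  # gold tag g was missed
    scores = {}
    for tag in set(gold) | set(predicted):
        p_den = tp[tag] + fp[tag]
        r_den = tp[tag] + fn[tag]
        precision = tp[tag] / p_den if p_den else 0.0
        recall = tp[tag] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Hypothetical tags, for illustration only.
gold = ["DISEASE", "O", "LOCATION", "O"]
pred = ["DISEASE", "DISEASE", "LOCATION", "O"]
for tag, (p, r, f) in sorted(per_tag_prf(gold, pred).items()):
    print(f"{tag}: P={p:.2f} R={r:.2f} F={f:.2f}")
```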

D3-5 米国特許データベースからの引用文献情報の抽出

“Extracting literature references from Western Patent Databases”, ○小栗佑実子, 難波英嗣 (広島市立大)

They are doing some work based on the NTCIR-4, -5, and -6 patent tasks. They are building a system that does cross-patent search to see if they can invalidate a patent (I think; two of the main words in there I don’t know). There is some difference in the vocabulary used in academic paper (論文) databases and patent databases. They analyze the reference links between the documents in the databases, then federate the two databases for search. PRESRI is the academic reference database. The presented work has to do with relationships from patent data to academic data. They have to do some parsing of the text to pick up the references, and they use indicator phrases (手がかり語) to find them. They have a bootstrapping process for picking up more indicator phrases. It looks like they use SVMs to pick up the actual references. They have a CRF-based tool for tagging, built from training data. They performed two evaluations: one where indicator phrases and SVMs picked up additional keywords, and one without, as the baseline. They used CRF++ as the learning method for tagging (I suppose for both). The tags are AUTHOR, TITLE, SOURCE, PAGE, VOL, YEAR, etc. They have run this evaluation on Western and National (Japanese) data, and have precision and recall results for each of the tag types. They are at about 85% precision and 82% recall (in their conclusions).
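
As a rough illustration of what training data for a CRF++ tagger looks like: CRF++ reads one token per line with whitespace-separated columns, the label in the last column, and a blank line between sequences. The helper below (`to_crfpp_lines` is my own name) writes that format. The example reference is made up, and the real feature columns used in the talk are unknown to me, so this sketch uses only a token column and a label column.

```python
def to_crfpp_lines(sequences):
    """Format tagged token sequences as CRF++ training text.

    sequences: list of sequences, each a list of (token, tag) pairs.
    Each token becomes one "token<TAB>tag" line; an empty line ends
    each sequence.
    """
    lines = []
    for sequence in sequences:
        for token, tag in sequence:
            lines.append(f"{token}\t{tag}")
        lines.append("")  # blank line marks the end of a sequence
    return "\n".join(lines) + "\n"


# A made-up reference, tagged with some of the tag names from the talk.
reference = [
    ("Smith", "AUTHOR"),
    ("Patent", "TITLE"),
    ("Retrieval", "TITLE"),
    ("Journal", "SOURCE"),
    ("1999", "YEAR"),
]
print(to_crfpp_lines([reference]), end="")
```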

D3-6 開発プロジェクトリスク管理のための議事録発言の分析

“Analysis of spoken meeting records for development project management”, ○齋藤悠, 立石健二, 久寿居大 (NEC)

They are talking about software development projects and how there are many different kinds of risks associated with them (bugs, product delivery, etc.). Their research looks at the records (minutes) of meetings (written, I believe) and tries to determine which items are risks to the project. There are two types of things that they extract: items that the project manager should keep an eye out for, and things that are difficult to manage. They have 6 categories of things that should be managed (schedule, cost, etc.). They have some rules (they look like trigger terms to me) that are used for identifying the topics and action points. They have two evaluations, where they look at the accuracy of their project risk identification tool and of the action points.
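
If the rules really are trigger terms, the matching step could look something like the sketch below. Everything here is my guess at the general shape: only the category names "schedule" and "cost" come from the talk, while the English trigger terms, the `TRIGGER_TERMS` table, and the function name are invented for illustration.

```python
# Hypothetical trigger terms; the actual rules were not shown in detail.
TRIGGER_TERMS = {
    "schedule": ["delay", "behind schedule", "deadline"],
    "cost": ["budget", "overrun"],
}


def flag_risk_statements(statements, trigger_terms=TRIGGER_TERMS):
    """Return (statement, categories) for statements containing a trigger term."""
    flagged = []
    for statement in statements:
        lowered = statement.lower()
        categories = sorted(
            category
            for category, terms in trigger_terms.items()
            if any(term in lowered for term in terms)
        )
        if categories:
            flagged.append((statement, categories))
    return flagged


minutes = [
    "The deadline for module A may slip.",
    "Lunch was good.",
    "Testing is over budget and behind schedule.",
]
for statement, categories in flag_risk_statements(minutes):
    print(categories, statement)
```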

D3-7 コールセンターにおける会話マイニング

“Call Center Conversation Mining”, ○那須川哲哉, 宅間大介, 竹内広宜, 荻野紫穂 (日本IBM)

The call center logs are generally quite short, up to a few hundred characters. They have a system for building tables and looking at outliers based on the categorization of the problem in the help call. This sort of analysis can lead to better and quicker reaction to developing problems. Voice recognition has now advanced enough that they can use automatic transcription of the support calls. With transcripts, though, there is a lot more irrelevant data than in the simpler, shorter tickets written up by the call center operators. Nasukawa-san gave a demo of the Takumi system over a car rental conversation and showed how it extracted the car model, purpose of the call, etc. They did some experiments with intentionally adding noise to the recognition data to see how effective search over noisy data can be. Even with up to 50% noise, they are still able to mine the same kind of categorical information to some degree. They ran the evaluation with noise levels from 0% to 100%.
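
The noise experiment could be simulated along these lines. This is only a sketch under my own assumptions: the talk did not say how noise was injected, so here a fraction of tokens is simply replaced with random words, and the transcript, vocabulary, and keyword set are all made up.

```python
import random


def add_noise(tokens, noise_rate, vocabulary, seed=0):
    """Replace each token with a random vocabulary word with probability noise_rate."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [
        rng.choice(vocabulary) if rng.random() < noise_rate else token
        for token in tokens
    ]


def surviving_keywords(tokens, keywords):
    """Return the keywords that are still present in the (possibly noisy) tokens."""
    return sorted(set(tokens) & set(keywords))


# Made-up transcript and keyword set, for illustration only.
transcript = "i would like a reservation for a compact car on friday".split()
noise_words = ["uh", "zebra", "window", "purple"]
keywords = {"reservation", "compact", "friday"}

for rate in (0.0, 0.25, 0.5, 0.75, 1.0):
    noisy = add_noise(transcript, rate, noise_words)
    print(rate, surviving_keywords(noisy, keywords))
```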

D3-8 意見性判定手法の評価と精度向上

“Improvement in precision of opinionated text identification”, ○高橋大和, 廣嶋伸章, 古瀬蔵, 片岡良治 (NTT)

They are looking to determine if a sentence expresses an opinion or not. They do binary classification with SVMs. They are then doing search over blog data for opinionated sentences. They have a 23k sentence training corpus (over what kind of data?)

Looks like they have some sort of system online as well: オピニオン Reader (“Opinion Reader”) for movies (映画).

D3-9 言語情報と映像情報の統合による作業教示映像の自動要約

“Automatic summarization of pictures used for teaching by unifying text and image information”, ○柴田知秀 (東大), 黒橋禎夫 (京大)

The images come from cooking, D.I.Y., and similar instructional material. It looks like they have a search system for finding images with related instructional content. They do some sort of summarization relating to the image. It looks like this is for video work. They do segmentation based on the topic, and extract important terms for the picture summary.

FuguTabetai Blog