{"id":190,"date":"2007-03-21T05:18:00","date_gmt":"2007-03-20T20:18:00","guid":{"rendered":"https:\/\/fugutabetai.com\/blog\/2007\/03\/21\/notes-from-wednesday-2007-03-21-natural-language-processing-meeting-in-japan\/"},"modified":"2007-03-21T05:18:00","modified_gmt":"2007-03-20T20:18:00","slug":"notes-from-wednesday-2007-03-21-natural-language-processing-meeting-in-japan","status":"publish","type":"post","link":"https:\/\/fugutabetai.com\/blog\/2007\/03\/21\/notes-from-wednesday-2007-03-21-natural-language-processing-meeting-in-japan\/","title":{"rendered":"Notes from Wednesday 2007-03-21 Natural Language Processing Meeting in Japan"},"content":{"rendered":"<h2>Information Extraction, Text Mining<\/h2>\n<p>The Information Extraction \/ Text Mining room is almost completely full.  <\/p>\n<ul>\n<li>D3-3  \t\u5c0f\u8aac\u30c6\u30ad\u30b9\u30c8\u3092\u5bfe\u8c61\u3068\u3057\u305f\u4eba\u7269\u60c5\u5831\u306e\u62bd\u51fa\u3068\u4f53\u7cfb\u5316 (&#8220;Extraction and organization of character information from short stories&#8221;)<br \/>\n\t\u25cb\u99ac\u5834\u3053\u3065\u3048, \u85e4\u4e95\u6566 (\u7b51\u6ce2\u5927)<\/li>\n<li>D3-4 \t\u7d71\u8a08\u7684\u624b\u6cd5\u3092\u5229\u7528\u3057\u305f\u4f1d\u67d3\u75c5\u691c\u7d22\u30b7\u30b9\u30c6\u30e0\u306e\u69cb\u7bc9\u306b\u5411\u3051\u3066 (&#8220;Towards construction of a statistical search system for infectious diseases&#8221;)<br \/>\n\t\u25cb\u7af9\u5185\u5b54\u4e00, \u5ca1\u7530\u548c\u4e5f (\u5ca1\u5c71\u5927), \u5ddd\u6dfb\u611b, \u30b3\u30ea\u30a2\u30fc\u30fb\u30ca\u30a4\u30b8\u30a7\u30eb (NII)<\/li>\n<li>D3-5 \t\u7c73\u56fd\u7279\u8a31\u30c7\u30fc\u30bf\u30d9\u30fc\u30b9\u304b\u3089\u306e\u5f15\u7528\u6587\u732e\u60c5\u5831\u306e\u62bd\u51fa (&#8220;Extracting literature references from US Patent Databases&#8221;)<br \/>\n\t\u25cb\u5c0f\u6817\u4f51\u5b9f\u5b50, \u96e3\u6ce2\u82f1\u55e3 (\u5e83\u5cf6\u5e02\u7acb\u5927)<\/li>\n<li>D3-6 
\t\u958b\u767a\u30d7\u30ed\u30b8\u30a7\u30af\u30c8\u30ea\u30b9\u30af\u7ba1\u7406\u306e\u305f\u3081\u306e\u8b70\u4e8b\u9332\u767a\u8a00\u306e\u5206\u6790 (&#8220;Analysis of spoken meeting records for development project management&#8221;)<br \/>\n\t\u25cb\u9f4b\u85e4\u60a0, \u7acb\u77f3\u5065\u4e8c, \u4e45\u5bff\u5c45\u5927 (NEC)<\/li>\n<li>D3-7 \t\u30b3\u30fc\u30eb\u30bb\u30f3\u30bf\u30fc\u306b\u304a\u3051\u308b\u4f1a\u8a71\u30de\u30a4\u30cb\u30f3\u30b0 (&#8220;Call Center Conversation Mining&#8221;)<br \/>\n\t\u25cb\u90a3\u9808\u5ddd\u54f2\u54c9, \u5b85\u9593\u5927\u4ecb, \u7af9\u5185\u5e83\u5b9c, \u837b\u91ce\u7d2b\u7a42 (\u65e5\u672cIBM)<\/li>\n<li>D3-8 \t\u610f\u898b\u6027\u5224\u5b9a\u624b\u6cd5\u306e\u8a55\u4fa1\u3068\u7cbe\u5ea6\u5411\u4e0a (&#8220;Improvement in precision of opinionated text identification&#8221;)<br \/>\n\t\u25cb\u9ad8\u6a4b\u5927\u548c, \u5ee3\u5d8b\u4f38\u7ae0, \u53e4\u702c\u8535, \u7247\u5ca1\u826f\u6cbb (NTT)<\/li>\n<li>D3-9 \t\u8a00\u8a9e\u60c5\u5831\u3068\u6620\u50cf\u60c5\u5831\u306e\u7d71\u5408\u306b\u3088\u308b\u4f5c\u696d\u6559\u793a\u6620\u50cf\u306e\u81ea\u52d5\u8981\u7d04 (&#8220;Automatic summarization of pictures used for teaching by unifying text and image information&#8221;)<br \/>\n\t\u25cb\u67f4\u7530\u77e5\u79c0 (\u6771\u5927), \u9ed2\u6a4b\u798e\u592b (\u4eac\u5927)<\/li>\n<\/ul>\n<p><!-- readmore --><\/p>\n<h3>D3-3  \t\u5c0f\u8aac\u30c6\u30ad\u30b9\u30c8\u3092\u5bfe\u8c61\u3068\u3057\u305f\u4eba\u7269\u60c5\u5831\u306e\u62bd\u51fa\u3068\u4f53\u7cfb\u5316<\/h3>\n<p>&#8220;Extraction and organization of character information from short stories&#8221;, \u25cb\u99ac\u5834\u3053\u3065\u3048, \u85e4\u4e95\u6566 (\u7b51\u6ce2\u5927)<\/p>\n<p>They analyze short story text and extract characters that appear in the stories.  They also build a relationship map between the extracted entities.  
They extract five kinds of information about each character (sex, age, job, &#8230;).  They compute a score for each character based on their appearances in the text, whether or not they have lines in a particular scene, and specific words that signal a character appearing or leaving.  They rely on parser output about names to decide whether a person is a character, but for foreign names they use katakana together with the rules and their score metric to decide whether a string is a name.  They also have a mechanism for determining when two strings are co-referent (e.g., Sherlock Holmes and Holmes), based on the names occurring close together and on a shortened name or noun that refers back to the character.  On their entity map, characters that have more contact with each other are placed closer together.  <\/p>\n<p>This is actually quite interesting to me, because I&#8217;m now working on summarization that goes in this direction, identifying the important people in a story, etc.  <\/p>\n<h3>D3-4 \t\u7d71\u8a08\u7684\u624b\u6cd5\u3092\u5229\u7528\u3057\u305f\u4f1d\u67d3\u75c5\u691c\u7d22\u30b7\u30b9\u30c6\u30e0\u306e\u69cb\u7bc9\u306b\u5411\u3051\u3066<\/h3>\n<p>&#8220;Towards construction of a statistical search system for infectious diseases&#8221;, \u25cb\u7af9\u5185\u5b54\u4e00, \u5ca1\u7530\u548c\u4e5f (\u5ca1\u5c71\u5927), \u5ddd\u6dfb\u611b, \u30b3\u30ea\u30a2\u30fc\u30fb\u30ca\u30a4\u30b8\u30a7\u30eb (NII)<\/p>\n<p>Infectious diseases start out as local events, and the first reports on them are in the local language, not English.  At first we also don&#8217;t know whether something is an infectious disease, so we have to check many sources to see whether it is widely reported, which is difficult to do by hand.  The point is to build an automatic system that can do this checking.  Their system is called BioCaster.  They have crawled the web for data, annotated it, created a domain ontology, and are using the annotated data as training data.  
They also check the language and translate it into the appropriate language for the target audience, and they have a summarization component as well.  <\/p>\n<p>Their system currently runs in Japanese, but they plan to develop it for other Asian languages as well.  They&#8217;ve developed an ontology and have a statistical method that uses it to tag data with semantic information.  The tags have to be assigned based on context (for example, a person can&#8217;t be tagged as a condition descriptor unless the context includes some description of a sickness, condition, etc.)  They use CaboCha for some semantic parse information.  They have run an experiment measuring precision and recall (and F) over tags with different feature sets, using CRFs (conditional random fields) and SVMs (support vector machines).  It looks like CRFs perform a bit better than SVMs on their data.  <\/p>\n<h3>D3-5 \t\u7c73\u56fd\u7279\u8a31\u30c7\u30fc\u30bf\u30d9\u30fc\u30b9\u304b\u3089\u306e\u5f15\u7528\u6587\u732e\u60c5\u5831\u306e\u62bd\u51fa<\/h3>\n<p>&#8220;Extracting literature references from US Patent Databases&#8221;, \u25cb\u5c0f\u6817\u4f51\u5b9f\u5b50, \u96e3\u6ce2\u82f1\u55e3 (\u5e83\u5cf6\u5e02\u7acb\u5927)<\/p>\n<p>Their work builds on the NTCIR-4, 5, and 6 patent tasks.  They are building a system that does cross-patent search to see whether a patent can be invalidated (I think; I don&#8217;t know two of the main words there.)  There is some difference in the vocabulary used in academic paper (\u8ad6\u6587) databases and patent databases.  They analyze the reference links between the documents in the databases, then federate the two databases for search.  PRESRI is the academic reference database.  The presented work has to do with relationships from patent data to academic data.  They have to do some parsing of the text to pick up the references.  They use indicator phrases (\u624b\u304c\u304b\u308a\u8a9e) to pick up the references.  
They have a bootstrapping process for picking up more indicator phrases.  It looks like they use SVMs to pick up the actual references.  They have a CRF-based tagging tool built from training data.  They performed two evaluations: one with the indicator phrases and the additional keywords picked up by the SVMs, and one without them as the baseline.  They used CRF++ as the learning method for tagging (I suppose for both.)  The tags are AUTHOR, TITLE, SOURCE, PAGE, VOL, YEAR, etc.  They ran this evaluation on Western and national (Japanese) data, and have precision and recall results for each of the tag types.  They are at about 85% precision and 82% recall (in their conclusions.)  <\/p>\n<h3>D3-6 \t\u958b\u767a\u30d7\u30ed\u30b8\u30a7\u30af\u30c8\u30ea\u30b9\u30af\u7ba1\u7406\u306e\u305f\u3081\u306e\u8b70\u4e8b\u9332\u767a\u8a00\u306e\u5206\u6790<\/h3>\n<p>&#8220;Analysis of spoken meeting records for development project management&#8221;, \u25cb\u9f4b\u85e4\u60a0, \u7acb\u77f3\u5065\u4e8c, \u4e45\u5bff\u5c45\u5927 (NEC)<\/p>\n<p>They are talking about software development projects and the many different kinds of risks associated with them (bugs, product delivery, etc.)  Their research looks at the records (minutes) of meetings (written, I believe) and tries to determine which items are risks to the project.  They extract two types of things: items that the project manager should keep an eye out for, and things that are difficult to manage.  They have 6 categories of things that should be managed (schedule, cost, etc.)  They have some rules (they look like trigger terms to me) that are used for identifying the topics and action points.  They have two evaluations, looking at the accuracy of their project risk identification tool and of the action points.  
<\/p>\n<h3>D3-7 \t\u30b3\u30fc\u30eb\u30bb\u30f3\u30bf\u30fc\u306b\u304a\u3051\u308b\u4f1a\u8a71\u30de\u30a4\u30cb\u30f3\u30b0<\/h3>\n<p>&#8220;Call Center Conversation Mining&#8221;, \u25cb\u90a3\u9808\u5ddd\u54f2\u54c9, \u5b85\u9593\u5927\u4ecb, \u7af9\u5185\u5e83\u5b9c, \u837b\u91ce\u7d2b\u7a42 (\u65e5\u672cIBM)<\/p>\n<p>The call center logs are generally quite short, up to a few hundred characters.  They have a system for building tables and looking at outliers based on how the problem in each help call is categorized.  This sort of analysis can lead to better and quicker reactions to developing problems.  Voice recognition has now advanced enough that they can use automatic transcriptions of the support calls.  The transcripts, though, contain a lot more irrelevant data than the simpler, shorter tickets written up by the call center operators.  Nasukawa-san gave a demo of the Takumi system on a car rental conversation, showing how it extracted the car model, purpose of the call, etc.  They ran some experiments that intentionally added noise to the recognition data to see how effective search over noisy data can be.  Even with up to 50% noise, they are still able to mine the same kind of categorical information to some degree.  The evaluation covered noise levels from 0% to 100%.  <\/p>\n<h3>D3-8 \t\u610f\u898b\u6027\u5224\u5b9a\u624b\u6cd5\u306e\u8a55\u4fa1\u3068\u7cbe\u5ea6\u5411\u4e0a<\/h3>\n<p>&#8220;Improvement in precision of opinionated text identification&#8221;, \u25cb\u9ad8\u6a4b\u5927\u548c, \u5ee3\u5d8b\u4f38\u7ae0, \u53e4\u702c\u8535, \u7247\u5ca1\u826f\u6cbb (NTT)<\/p>\n<p>They want to determine whether a sentence expresses an opinion.  They do binary classification with SVMs, then search over blog data for opinionated sentences.  They have a 23k-sentence training corpus (over what kind of data?)  
<\/p>\n<p>It looks like they have some sort of system online as well:  <a href=\"http:\/\/opinion.labs.goo.ne.jp\/cgi-bin\/index.cgi\">\u30aa\u30d4\u30cb\u30aa\u30f3 Reader for \u6620\u753b <\/a>.<\/p>\n<h3>D3-9 \t\u8a00\u8a9e\u60c5\u5831\u3068\u6620\u50cf\u60c5\u5831\u306e\u7d71\u5408\u306b\u3088\u308b\u4f5c\u696d\u6559\u793a\u6620\u50cf\u306e\u81ea\u52d5\u8981\u7d04<\/h3>\n<p>&#8220;Automatic summarization of pictures used for teaching by unifying text and image information&#8221;, \u25cb\u67f4\u7530\u77e5\u79c0 (\u6771\u5927), \u9ed2\u6a4b\u798e\u592b (\u4eac\u5927)<\/p>\n<p>The material is instructional imagery from cooking, D.I.Y., and similar how-to content.  It looks like they have a search system for finding images with related instructional content.  They do some sort of summarization relating to the image.  It looks like this is for video work.  They segment by topic and extract important terms for the picture summary.  <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Information Extraction, Text Mining The Information Extraction \/ Text Mining room is almost completely full. 
D3-3 \u5c0f\u8aac\u30c6\u30ad\u30b9\u30c8\u3092\u5bfe\u8c61\u3068\u3057\u305f\u4eba\u7269\u60c5\u5831\u306e\u62bd\u51fa\u3068\u4f53\u7cfb\u5316 (&#8220;Extraction and organization of character information from short stories&#8221;) \u25cb\u99ac\u5834\u3053\u3065\u3048, \u85e4\u4e95\u6566 (\u7b51\u6ce2\u5927) D3-4 \u7d71\u8a08\u7684\u624b\u6cd5\u3092\u5229\u7528\u3057\u305f\u4f1d\u67d3\u75c5\u691c\u7d22\u30b7\u30b9\u30c6\u30e0\u306e\u69cb\u7bc9\u306b\u5411\u3051\u3066 (&#8220;Towards construction of a statistical search system for infectious diseases&#8221;) \u25cb\u7af9\u5185\u5b54\u4e00, \u5ca1\u7530\u548c\u4e5f (\u5ca1\u5c71\u5927), \u5ddd\u6dfb\u611b, \u30b3\u30ea\u30a2\u30fc\u30fb\u30ca\u30a4\u30b8\u30a7\u30eb (NII) D3-5 \u7c73\u56fd\u7279\u8a31\u30c7\u30fc\u30bf\u30d9\u30fc\u30b9\u304b\u3089\u306e\u5f15\u7528\u6587\u732e\u60c5\u5831\u306e\u62bd\u51fa (&#8220;Extracting literature references from US Patent Databases&#8221;) [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[10],"tags":[],"_links":{"self":[{"href":"https:\/\/fugutabetai.com\/blog\/wp-json\/wp\/v2\/posts\/190"}],"collection":[{"href":"https:\/\/fugutabetai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fugutabetai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fugutabetai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fugutabetai.com\/blog\/wp-json\/wp\/v2\/comments?post=190"}],"version-history":[{"count":0,"href":"https:\/\/fugutabetai.com\/blog\/wp-json\/wp\/v2\/posts\/190\/revisions"}],"wp:attachment":[{"href":"https:\/\/fugutabetai.com\/blog\/wp-json\/wp\/v2\/media?parent=190"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fugutabetai.com\/blog\/wp-json\/wp\/v2\/categories?post=190"},{"taxonomy":"post_tag","embeddable":true,"href":
"https:\/\/fugutabetai.com\/blog\/wp-json\/wp\/v2\/tags?post=190"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}