April 23, 2008

The Balanced Corpus of Contemporary Written Japanese and Japanese Blog Data

Today I went to a brief introductory talk about plans to release a corpus of Japanese blog data for research use. The presentation was at the National Institute of Informatics, with a panel of Professor Toukura and Professor Oyama from NII, MAEKAWA Kikuo from The National Institute for Japanese Language, and a representative from Yahoo! Japan's blog division (I didn't catch his name, sorry).

There were a lot of people there, about 30 or so all told. The presentation wasn't too detailed about what exactly will be released, but the current plan is to make the data available to researchers in July of 2008. The data consists of post entries from the Yahoo! Blog service whose authors have agreed to allow their data to be collected and used in this way. Post comments are not included, and the corpus may have proper nouns anonymized and other steps taken to protect the privacy of the people in the data. It is really nice to see people thinking about putting together this kind of data for research use. I haven't found a URL for the project or I would post it - the contact section of the handout says to email Professor Toukura, Professor Oyama, or Mr. Maekawa, but I suspect there will be information on the main NII homepage about the data release when the time comes.

In addition, Mr. MAEKAWA spoke a bit about the Balanced Corpus of Contemporary Written Japanese, which looks very interesting. The project to build the corpus runs from 2006 to 2010, so they are only about two years in right now, but it is shaping up to be something like a Brown Corpus for Japanese. It contains three sub-corpora: published material from 2001 to 2005 (magazines, newspapers, and books), material from 1986 to 2005 from library sources (mostly books, it looks like), and a mixed-domain sub-corpus with web data, white papers, textbooks, records from Diet meetings, best-selling novels, and so on.

This post isn't really all that content-bearing, but there was one very useful resource that Mr. MAEKAWA mentioned in his talk: the demo of the KOTONOHA Corpus of Modern Japanese Search system (the actual entrance is a button click away from the description page). This is exactly what Alex was asking about in one of his posts: a Japanese KWIC (Key Word In Context) search.

I don't know how long that demo will be available, but it is totally great for language learners, or for anyone who doesn't know colloquial usage. I poked around at it a bit and put in a few terms, but didn't come up with anything too interesting. I liked めんど as a search term because there were lots of hits, some showing it used as the longer めんどう and others as the shorter めんど, often with a くさい not too far behind...
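
For anyone who hasn't seen a KWIC display before, here is a minimal sketch of the idea in Python. This is only my own toy illustration, not how the KOTONOHA demo works: the `kwic` function, its `width` parameter, and the sample sentences are all made up for the example. It does a plain substring search and keeps a few characters of context on each side of every hit.

```python
def kwic(sentences, keyword, width=8):
    """Return (left, keyword, right) context triples for every hit.

    Plain substring matching only: no tokenization or morphological
    analysis, which a real corpus search would presumably do.
    """
    if not keyword:
        return []  # an empty keyword would match everywhere
    rows = []
    for sentence in sentences:
        start = sentence.find(keyword)
        while start != -1:
            end = start + len(keyword)
            left = sentence[max(0, start - width):start]
            right = sentence[end:end + width]
            rows.append((left, keyword, right))
            # keep looking for later hits in the same sentence
            start = sentence.find(keyword, end)
    return rows


# Hypothetical sample sentences, invented for this example.
samples = [
    "宿題をやるのはめんどくさい。",
    "めんどうな手続きが多い。",
    "今日は晴れです。",
]

for left, hit, right in kwic(samples, "めんど", width=4):
    # pad the left context with full-width spaces so the keyword lines up
    print(left.rjust(4, "\u3000"), hit, right)
```

Lining the keyword up in a column is the whole point of the format: you can scan down the page and see at a glance what tends to come right before and after the word.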

Anyway, that demo search could be a useful tool for non-native Japanese speakers. I'll add it to my toolkit of places to check when I'm mystified.

Now if only someone would make a Geinojin info site that would tell me *why* that person is famous and should be a guest on some panel, that would be great. (I currently use Wikipedia for that, but I would be happier with something that just says X: comedian, Y: famous lawyer, etc.)


Comments


Re: The Balanced Corpus of Contemporary Written Japanese and Japanese Blog Data
Next can you work on getting a publicly available KWIC search engine development team assembled? Not just a corpus search, but something that could crawl an already existing search engine's results (blogsearch.google.com) and compile a nifty little list of living sentences...

Remember that brief discussion:
http://victorymanual.com/2008/01/19/attention-japanese-learners-who-can-also-program/
Posted 12 years, 4 months ago by Alex
Re: The Balanced Corpus of Contemporary Written Japanese and Japanese Blog Data
Never mind, I spoke too soon! I just realized that the Kotonoha engine has a Yahoo settings function below it! Doh! I'm going to be playing around with that for the next few hours straight.
Posted 12 years, 4 months ago by Alex
Re: The Balanced Corpus of Contemporary Written Japanese and Japanese Blog Data
Unfortunately the Yahoo source selection does not index their search content as far as I understand things. They should be accessing only certain Yahoo! properties (blog, auction, etc.) which might not be live.

Still, I think it is a really useful tool. I would love to take you up on the programmer's KWIC challenge - I honestly think it should only be a day or two's worth of work - but I've got too much on my plate as it is! :)
Posted 12 years, 4 months ago by Fugu
