November 29, 2006

International Conference on Asian Digital Libraries (ICADL2006) in Kyoto day 3

Session 8: Semistructured Data XML

"Kikori-KS: An Effective and Efficient Keyword Search System for Digital Libraries in XML", Toshiyuki Shimizu (Kyoto University Japan), Norimasa Terada (Nagoya University), Masatoshi Yoshikawa (Kyoto University)

Many DL systems are starting to use XML documents, and we would like to be able to search in ways that take advantage of XML's structure. They want to target users who are not familiar with XML, so they use keyword search. Their main contributions are a user-friendly "FetchHighlight" user interface and a storage system that is well suited to XML. They store the XML documents in a relational database. They explain four types of XML elements that can be returned, whether you rank documents or elements, and so on. Their FetchBrowse method aggregates relevant elements by document and ranks them in document order.

With XML you have to handle a huge number of document fragments: 16,000,000 fragments for 17,000 documents. They store them in a database using the XRel schema (a relational schema designed for XML documents) and translate keyword queries into SQL queries. Their IR system uses a vector space model with tf*idf-style term weights.
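
As a rough illustration (my own sketch, not their implementation), tf*idf-style ranking over XML element fragments might look like the following Python; treating every element as a fragment and the smoothed idf are my assumptions:

    import math
    from collections import Counter
    from xml.etree import ElementTree

    def element_fragments(xml_text):
        # Treat every element (with all text beneath it) as a retrievable fragment.
        root = ElementTree.fromstring(xml_text)
        for elem in root.iter():
            terms = " ".join(elem.itertext()).lower().split()
            if terms:
                yield elem.tag, terms

    def rank_fragments(fragments, query_terms):
        # Rank fragments by a simple tf*idf dot product (vector space model).
        n = len(fragments)
        df = Counter()
        for _, terms in fragments:
            df.update(set(terms))
        ranked = []
        for tag, terms in fragments:
            tf = Counter(terms)
            # Smoothed idf, so terms occurring in every fragment still count a little.
            score = sum(tf[t] * math.log(1 + n / df[t]) for t in query_terms if df[t])
            ranked.append((score, tag))
        return sorted(ranked, reverse=True)

    doc = ("<article><title>XML keyword search</title>"
           "<sec>keyword search for digital libraries</sec></article>")
    print(rank_fragments(list(element_fragments(doc)), ["keyword", "search"]))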

They used a 700MB dataset from INEX with 40 queries.

"Supporting Efficient Grouping and Summary Information for Semistructured Digital Libraries", Minsoo Lee, Sookyung Song, Yunmi Kim (Ewha Woman's University, Korea), Hyoseop Shin (Konkuk University, Korea).

When documents are in XML, you need grouping capabilities to give users an overview or summary of the documents in the collection. XQuery is a language for querying XML documents, but it does not specifically support a group-by syntax, so it is quite difficult to write queries that properly do grouping. Their work focuses on extending XQuery with a group-by clause. They give many examples of how the group-by clause makes the query easier to understand compared to writing the same query without it.
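
To make the grouping idea concrete, here is my own small Python sketch of the semantics such a group-by clause would capture declaratively (the book records are invented); in XQuery 1.0 you would have to simulate this with distinct-values() and nested FLWOR expressions:

    from collections import defaultdict

    # Hypothetical records extracted from XML, e.g. <book year="..."><price/></book>.
    books = [
        {"year": 2004, "title": "A", "price": 30},
        {"year": 2004, "title": "B", "price": 50},
        {"year": 2005, "title": "C", "price": 40},
    ]

    # "Group books by year, return per-group count and average price."
    groups = defaultdict(list)
    for book in books:
        groups[book["year"]].append(book)

    for year, items in sorted(groups.items()):
        avg_price = sum(b["price"] for b in items) / len(items)
        print(year, len(items), avg_price)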

They did not really talk about summarization at all.

Session 9a: Information Organization

"Functional Composition of Web Databases", Masao Mori, Tetsuya Nakatoh, Sachio Hirokawa (Kyuushyuu University, Japan).

When searching for content across multiple web databases, you normally need multiple browser windows open to search the different systems. So they propose a system that merges results from multiple data sources. Based on user settings, it automatically generates CGI programs on a server that connect to the various web databases and send the merged responses back to the user. The user specifies the settings (which databases to use, connections to make between them, etc.) using a fairly complicated-looking scripting language via a web-form interface. I can't help but think that this is a pretty complicated way to do federated search, but it does look quite extensible and general.
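
My own rough sketch of the basic fan-out-and-merge idea (the endpoint URLs and the JSON response format are assumptions, and their actual system generates CGI programs rather than doing it all in one process):

    import json
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    # Hypothetical endpoints; a real system needs a query adapter per database.
    SOURCES = [
        "https://db-a.example.org/search?q=",
        "https://db-b.example.org/search?q=",
    ]

    def query_source(base_url, keywords):
        # Assume each source returns a JSON list of result records.
        with urlopen(base_url + keywords) as resp:
            return json.load(resp)

    def federated_search(keywords):
        # Fan the query out to every source in parallel, then concatenate.
        with ThreadPoolExecutor() as pool:
            result_lists = pool.map(query_source, SOURCES, [keywords] * len(SOURCES))
            return [record for results in result_lists for record in results]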

"Integration of Wikipedia and a Geography Digital Library", Ee-Peng Lim, Zhe Wang, Darwin Sadeli, Yuanyuan Li, Chew-Hung Chang, Kalyani Chatterjea, Dion Hooe-Lian Goh, Yin-Leng Theng, Jun Zhang, Aixin Sun (Nangyan Technical University, Singapore)

G-Portal is a geography digital library with metadata for geographical web resources. The reason to integrate the two is that Wikipedia does not have digital library features, while from G-Portal's point of view Wikipedia is a good resource for geography education. They created an automatic login to G-Portal from Wikipedia so users can navigate from Wikipedia to G-Portal content, and they modified G-Portal to show selected information based on the displayed Wikipedia page (their "metadata-centric display"). They also add G-Portal content links into Wikipedia by using a reverse proxy, so they do not have to actually change anything in the wiki. Based on preliminary studies with students, they have had a good response. They are now looking at how to update their metadata content when Wikipedia changes.
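
The reverse-proxy trick amounts to rewriting pages on their way from Wikipedia to the user. A crude sketch of what the rewriting step might look like (the function name, the lookup callback, and the panel placement are all my assumptions):

    def inject_gportal_links(html, lookup_resources):
        # lookup_resources(html) is assumed to return (title, url) pairs
        # for G-Portal resources relevant to the proxied Wikipedia page.
        items = "".join(
            '<li><a href="%s">%s</a></li>' % (url, title)
            for title, url in lookup_resources(html)
        )
        panel = '<div id="gportal-links"><ul>%s</ul></div>' % items
        # Splice the panel in just before the closing body tag, so the
        # wiki itself never has to change.
        return html.replace("</body>", panel + "</body>")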

"Impact of Document Structure to Hierarchical Summarization", Fu Lee Wang (City University of Hong Kong), Christopher C. Yang (Chinese University of Hong Kong).

This is single-document summarization: sentence scores are computed at each node of the hierarchical structure based on the scores of the sentences under that node.

For multi-document summarization, they looked at two sets of news story documents and plotted the distribution of news stories on a timeline. They treat the set of multiple documents as one large document that should be fit into a hierarchical structure and then summarized. They have three ways to build the trees:

  • A balanced tree built over the documents as arranged on a timeline.
  • A tree whose child nodes cover equal, non-overlapping time intervals (this tree is unbalanced).
  • A tree structured by event topics.

For the summarization, they generate a summary for each range block; if the summary is too large, they partition the block into its children.
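
As I understood the generation step, it is recursive: take the best sentences for a block, and if the block is too large, split the budget across its children. A rough Python sketch of that logic (the node structure and the scoring function are my assumptions, not details from the paper):

    def all_sentences(node):
        # Collect every sentence in the subtree rooted at this node.
        sentences = list(node["sentences"])
        for child in node["children"]:
            sentences.extend(all_sentences(child))
        return sentences

    def summarize(node, budget, score):
        # node:   {"sentences": [...], "children": [...]}
        # budget: maximum number of sentences in this node's summary
        # score:  function rating a sentence (e.g., tf*idf based)
        sentences = all_sentences(node)
        if len(sentences) <= budget or not node["children"]:
            # The block fits (or is a leaf): keep the top-scoring sentences.
            return sorted(sentences, key=score, reverse=True)[:budget]
        # Otherwise partition the block into its children and give each
        # child a share of the sentence budget.
        per_child = max(1, budget // len(node["children"]))
        summary = []
        for child in node["children"]:
            summary.extend(summarize(child, per_child, score))
        return summary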

They look at the intersection of the sentences selected under the three tree types, but I don't know what that really shows or means. They also conducted intrinsic and extrinsic evaluations of the summaries. They had human summaries created at a 5% compression ratio, and they report the "precision" of the summaries at that ratio, but I don't know how they are computing it. They show that summarization by event performs better than the other two. They also performed an extrinsic evaluation using a Q&A task.

According to that evaluation, the degree to which the tree spreads out does not affect the summarization results. Also, summarization by event again outperformed the other two methods (though not at a compression ratio around 20%, where they are about the same). So they conclude that hierarchical summarization is useful at high compression ratios.

I asked for a clarification, and indeed the topics are assigned manually. I asked him later whether he had thought about looking at the DUC data, but he said that because of the manual event categorization required, they did not look at that data set. We talked briefly about current automatic topic clustering, but he didn't seem too interested.

Session 10: Keynote and Invited Talks

"Indexing All the World's Books: Future Directions and Challenges for Google Book Search", Daniel Clancy (Google, USA).

People are lazy, and if information is readily available they will go there to get it instead of somewhere slightly harder to access. He took a poll: how many of you have been in a library in the past year? (In this audience, nearly everyone; in normal audiences it can easily fall to 30-50%.) The internet makes it very easy to find some information, but there is still a vast amount of information that is not accessible on the internet.

They have two initiatives: the partner program (about 15% of books, those that are in print) and the library program (the other 85%, mostly books that are out of print and/or out of copyright). Of most library content, about 15% is in print, 65% is of unclear copyright status, and 20% is public domain (books from before 1923). 92% of the world's books are neither generating revenue for the copyright holder nor easily accessible to potential readers.

He gives a demo of Google Books. His first search, "Kyoto History", prompted him to talk about ranking problems. On the web, because of its interconnectivity, Google can use link information to improve result ranking. That link structure doesn't exist within books, so they have to look at other signals to improve rankings.

They've designed the interface as a real browsing experience. There is a reference page that tries to give you information about the book: references (books that refer to this book), related books, and references from scholarly works. This example was a book from the publisher (partner) program.

For books whose rights status they don't know, they show a short "snippet" view.

For full-view books you get the whole thing. For example, the Diary of George Washington is public domain, so the entire book is viewable.

About finding books: the long tail. There is a really large amount of diversity. They are moving towards blending book results into normal web search results, with the result linking you right into the page where the information is. People are comfortable navigating a book, whereas some people have a hard time getting oriented on web pages, since every page's layout is different.

Their process for scanning: there are logistics for moving books, doing the scanning, and storage. When Google looked at this market, nobody had developed a scanner that fit their needs for really large-scale scanning. For 30 million books, they needed to build their own cost-effective technology. He can't release the details of the technology they are using, though. It sounds like they are doing some sort of grid-type thing with lower-quality components, using software to detect and fix problems, with lots of redundancy, and so on. As the software improves over time, you can always go back and re-process. There are many problems: poorly printed books, tightly bound books, etc. They also needed to make sure that they would not damage the books. They have extensive processing to correct for yellowed pages and the like, plus problems with getting the right page numbers, OCR and math, spelling correction for old, old books with olde English, intentional spelling errors, incorrect metadata for books, mixed languages, and layout order for pages; they are also working on support for CJK languages.

How do you create a rich link structure that relates all these books and information outside of books? Books are not always individual units; they have relationships to other books, references, criticism and review, and so on. They would like to have a similar kind of link structure, so perhaps they will allow people to build up links in books.

Discussion notes:

  • Role of Library in the Future
  • Access for everyone everywhere
  • Problems with current institutional subscription models
  • Publishing and User Generated Content
  • How can Google help support research?
  • Role of private companies
An interesting question related the Great Library of Alexandria to Google: what happens if Google burns down? For the library partners, Google actually gives them a copy of the data as it is created, so Google will not be the only one holding the data.

"One Billion Children and Digital Libraries: With Your Help, What the $100 Laptop and Its Sunlight Readable Display Might Enable", Mary Lou Jepson (One Laptop Per Child, USA)

She has a demo of the cute little laptop. It is small and has two cute "ears", for wireless I guess? A single laptop is cheaper than a textbook in the third world. They want to provide entire libraries to kids.

The display is the most costly and power-intensive part of the laptop. Mary Lou Jepsen's job was to reduce the cost of the display. The laptop is being produced by Quanta, the largest laptop maker in the world (about 20 million units a year). It has a book mode (the screen flips around) and in general looks pretty cool. Average power consumption is 2 watts for the OLPC, compared to 45 watts for normal laptops (wow!). So they go to great lengths to turn off components that are not in use. They have a nice trick where they put a memory buffer on the display, and if nothing on the screen is changing, they let the buffer drive the refresh and turn off the CPU. It has 512MB of flash, though no hard drive.

Interesting comment: this isn't a product, it is a global humanitarian cause. Ownership is key: the laptop doesn't get broken when the child owns it. The mesh network works: you have to get backhaul connectivity to the school, but that is it; the distributed laptops spread the connectivity across the area. They have a 640x480 video camera for $1.50! It uses an AMD Geode GX2-500. There are 3 USB ports hidden under the rabbit ears, an SD card slot, mic and stereo, and the video camera. Display mode 1 is 1200x900 greyscale, sunlight readable; mode 2 is 1024x768 color. The display draws about 1 watt with the backlight on (versus about 7 watts for a normal display), and roughly 0.2 watts with the backlight off. There are innovative changes in the LCD: the pixel layout is changed to use fewer of the expensive color gels. In black-and-white mode it is 200dpi at 6 bits per pixel (about the same as a book). You get color when the backlight is on, and room light increases the effective resolution.

They are also making a $100 disk-farm server that stores lots of data, a $10 DVD player (the parts are $8, but the licensing fees are $10! They are trying to get the licensing fee forgiven for that project), and a $100 projector. The projector uses the same screen as the laptop! That reduces the lamp cost because it doesn't need to be so powerful: a $1 lamp and 30 watts of power consumption. The mechanical design has no moving parts. The laptop is droppable, with a shock-mounted LCD and a replaceable bumper. It is dirt/moisture resistant, has a handle, and can take a strap. They get about 0.5-1 kilometer of wireless range with the rabbit ears (2-3x more than without!). Input devices: a game controller, a sealed keyboard, and a dual-mode touchpad (the middle is a finger pad, or you can write with a stylus across the whole thing). They are designing different types of chargers: car battery, human power, etc. They had a crank, but ergonomically that didn't work well, so they've moved to a foot pedal and a string-puller thing.

All of the software they ship on there is GPL.

Session 11: ICADL2006 / DBWeb2006 Joint Panel

"Next-Generation Search", Daniel Clancy (Google), Masaru Kitsuregawa (University of Tokyo), Zhou Lizhu (Tsinghua University of China), Wei-Ying Ma (Microsoft Research Asia, China), Hai Zhuge (Chinese Academy of Sciences)

Kitsuregawa: trends. Nobody knows the future, and on the web things move very fast (the Friendster-to-MySpace migration, the acute increase in MySpace's value, YouTube). There is a transition from Search to Service. Currently the value of search results is not high enough; people will not pay for search, so it is advertising-supported. Businesses must respond within 1 second, but universities do not have to be restricted to that limit. He is looking at this in the Information Explosion project. He gives a demo of temporal evolution search, showing how information on sites changes over time, and how the linking patterns between them change.

He also started to speak about the Information Grand Voyage Project. The main theme is From Search to Service.

Dr. Wei-Ying Ma, Microsoft Research Asia. Improving search: can we understand content (web pages) or people's queries better? There are bottom-up approaches (IR, DB, ML, IE, DM, Web 2.0) and top-down approaches (the semantic web). One research direction his group has been working on is moving from web pages to web objects (entities), for example Microsoft Libra. The idea is that for the research community there are many important objects (conferences, authors, papers, journals, interest groups, etc.), and you can search for the different types of objects. Input a search topic ("association rules") and get a ranked list of papers, or a ranked list of important people in the field, or important conferences / journals / interest groups. This is an example of moving search engines from the page level to the object level.

They have another project (Guanxi) for doing general object search over the internet. It looks like they try to model important types of objects (at least people, for sure) and extract information about them. They are moving from doing just relevance ranking to providing intelligence in the search results, and also moving to personalized search. Also, can we integrate more information from the deep/hidden web (databases)? Searching is moving from individual to social as well (MySpace and other social networking sites, and so on).

System and infrastructure: process-centric to data-centric, tightly coupled to loosely coupled. They are trying to re-architect their search engine so that it is built in modularized layers. They are focusing on building an infrastructure, along with IDE tools, to make development of these systems easier. WebStudio is another layer in this search-framework architecture.

Global advertising was a $500 billion market in 2005, versus $200 billion for packaged software, so search is important. They are looking to focus on global internet economies.

Hai Zhuge, "Completeness of Query Operations on Resource Space". People have not paid enough attention to semantics because of a lack of computational power. How do you normalize the organization of resources in a large-scale decentralized network? He introduces a resource space model that can be used to locate objects in the space, with normal forms over it, and talks about the different operations over the space.

Content is the key, but content is not just text, images, or keywords, and it is not static; it needs to be dynamically clustered. He gave a demo of the digital cave project, with lots of video and scrolling: a combination of text, images, and sounds.

I have a very hard time seeing how the theory he has mostly been talking about connects to the search aspects in practical terms; the resource space model is a bit vague and unclear to me right now.

Zhou Lizhu, "The Next Generation Search Engines: Personal Search Engines". Search will become a daily activity, but information needs vary drastically from person to person, so we need personal search engines. Search results should be easy to navigate via clustering, viewable as a graph, and indexed by semantic terms. You should be able to incorporate technology from other fields (AI, NLP, DB, multimedia) to build the search engines easily, so we need a systems-building framework for personal search engines. They have the example of the C-Book search engine for searching for books, and a platform for search engine construction, SESQ, for building vertical search engines. Users supply a schema, some seed URLs, and some external rules, and the system does the rest.
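
From that description, the user's input to SESQ sounds declarative; I imagine it looking something like this sketch (the field names are my invention, not SESQ's actual syntax):

    # Hypothetical specification for a C-Book-style vertical search engine.
    spec = {
        "schema": {
            "Book": ["title", "author", "publisher", "year", "price"],
            "Author": ["name", "affiliation"],
        },
        "seed_urls": ["https://books.example.com/catalog/"],
        # External rules, e.g. patterns for recognizing object fields on a page.
        "rules": [("Book.price", r"\$\d+(\.\d{2})?")],
    }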

Dr. Daniel Clancy. The reason we care about different directions in the future of search is so that we can do research that has an impact; the areas we publish in should grow and have an impact. Looking at the history of how things evolved: users like it simple. Usability study after usability study shows that people like things to be simple. The other thing is: one revolution at a time, please. The web revolution happened and users built this great link structure, and then Google came along and took advantage of it. They did not say "make good hyperlinks and tags and then I can make a great search"; they used what was already there. Likewise, search and ads were not coupled together: they first got search right and then got ads right. As a researcher I love the semantic web, but as a user I am cautious about the possibilities.

How does it work today? Users give you a query, and it works well for simple things. When you are looking for complex information, it has more trouble (e.g., "I will be in Kyoto for three hours and will be on the western side of town; which temple should I go to?"). The future will be a transition to search as a dialogue. Personalization is important, but in reality personal context carries very little information compared to the query itself. Also, people do not like remembering lots of different places to go; they like to go to one place and let the system figure out which vertical application they are actually interested in. Another area of big interest is user-generated content: how can we take advantage of it and search within that content space, given that it is a very different type of content?

Comments and discussion. Semantic high-level search is very interesting, but there is a problem in that users like simplicity. Should queries be more natural language in the future?

Daniel Clancy comments that natural language doesn't seem to be up to the level needed for this challenge yet. Users type in short queries partly because that is what we have trained them to do, but maybe they can be re-trained, and it will happen at some point in the future. It is a good area for research.

Dr. Wei-Ying Ma: they initially thought that Ask Jeeves was their main competitor, since they thought natural language search would be the main battleground. But NLP still does not seem to be up to the task.

Professor Kitsuregawa: when you look at the long tail, there are people who make natural language queries. The desire is there, so we should try to pursue it in the future. One way to reduce the ambiguity in NLP is interaction, so going to dialogue first might be a better idea. Right now they are only trying to apply NLP to sentiment analysis.

Dr. Lizhu agrees with this. There is still no formal theory for representing the semantics of natural language, which is a big problem.

Kitsuregawa: the main problem is disambiguation, so maybe through dialogue we can perform the disambiguation better.

Makiko Miwa: perhaps it is possible to identify the user's goal, or at least the genre and update date of the content. For example, there are people who monitor the same information every day, and other people who don't want to read personal homepages but just want to look at official sites. So what about processing users' context information?

Professor Lizhu: some context can be gained from log analysis, but it still looks difficult to use this information to determine the user's goals. There is research on this in Europe that includes more user-context information and interaction.

Daniel Clancy: when people go to a site like Microsoft's Libra, they are giving you some context. But it is difficult to understand a user, because what they want really depends on, and changes with, the circumstances.

