International Conference on Asian Digital Libraries (ICADL2006) in Kyoto, day 3

Session 8: Semistructured Data / XML

“Kikori-KS: An Effective and Efficient Keyword Search System for
Digital Libraries in XML”,
Toshiyuki Shimizu (Kyoto University, Japan), Norimasa Terada (Nagoya
University), Masatoshi Yoshikawa (Kyoto University)


Many DL systems are starting to use XML documents, and we would like
to search in a way that takes advantage of XML’s structure. They want
to target users who are not familiar with XML, so they use keyword
search. Their main contributions are a user-friendly “FetchHighlight”
user interface and a storage system that is well suited to XML. They
store the XML documents in a relational database. They describe four
types of returned XML elements, depending on whether you rank
documents or elements, etc. Their FetchBrowse method aggregates
relevant elements by document and ranks them in document order.
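Here is a minimal sketch of how I picture a FetchBrowse-style
presentation working (my own reconstruction in Python, not their code;
the tuple layout is my assumption):

    from collections import defaultdict

    def fetch_browse(scored_elements):
        # scored_elements: (doc_id, position, score) tuples, where
        # position is the element's order within its document.
        by_doc = defaultdict(list)
        for doc_id, position, score in scored_elements:
            by_doc[doc_id].append((position, score))
        # Rank documents by their best-scoring element...
        ranked = sorted(by_doc, key=lambda d: max(s for _, s in by_doc[d]),
                        reverse=True)
        # ...but list each document's elements in document order.
        return [(d, sorted(by_doc[d])) for d in ranked]

    hits = [("doc1", 3, 0.4), ("doc2", 1, 0.9), ("doc2", 5, 0.2)]
    print(fetch_browse(hits))  # doc2 first; its elements in order 1, 5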

With XML you have to handle a huge number of document fragments:
16,000,000 fragments for 17,000 documents. They store them in a
relational database using the XRel schema (designed specifically for
XML documents). They translate keyword queries into SQL queries.
Their IR system uses a vector space model with tf*idf-style term
weights.
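As a rough illustration of the keyword-to-SQL step (assuming a much
simplified XRel-like layout; the table and column names here are
hypothetical, and real XRel also keeps path and region tables):

    def keyword_to_sql(keywords):
        # Assumed table: text_node(doc_id, path_id, start_pos, end_pos, value)
        # Rank documents by how many text nodes match any keyword.
        where = " OR ".join("t.value LIKE ?" for _ in keywords)
        params = ["%" + kw + "%" for kw in keywords]
        sql = ("SELECT t.doc_id, COUNT(*) AS hits "
               "FROM text_node t "
               "WHERE " + where + " "
               "GROUP BY t.doc_id ORDER BY hits DESC")
        return sql, params

    sql, params = keyword_to_sql(["digital", "library"])

The actual ranking would then apply the tf*idf weights rather than raw
hit counts.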

They used a 700MB dataset from INEX with 40 queries.

“Supporting Efficient Grouping and Summary Information for
Semistructured Digital Libraries”, Minsoo Lee, Sookyung Song, Yunmi
Kim (Ewha Womans University, Korea), Hyoseop Shin (Konkuk University,
Korea)

When documents are in XML, you need grouping capabilities to give
users an overview or summary of the documents in the collection.
XQuery is a language for querying XML documents, but it does not have
a dedicated group-by syntax, so it is quite difficult to write queries
that do grouping properly. Their work focuses on extending XQuery with
a group-by clause. They give many examples of how the group-by clause
makes a query easier to understand compared to writing the equivalent
without it.
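To get a feel for what the clause buys you, here is the kind of
grouping in question, written over a toy document in Python (my
example; 2006-era XQuery would need nested FLWOR expressions over
distinct-values() to say the same thing):

    import xml.etree.ElementTree as ET
    from collections import defaultdict

    doc = ET.fromstring(
        "<library>"
        "<book><subject>XML</subject><title>Kikori-KS</title></book>"
        "<book><subject>XML</subject><title>XRel</title></book>"
        "<book><subject>IR</subject><title>Vector Models</title></book>"
        "</library>")

    # Group books by subject -- the operation a group-by clause
    # would express directly in a single XQuery clause.
    groups = defaultdict(list)
    for book in doc.findall("book"):
        groups[book.findtext("subject")].append(book.findtext("title"))

    for subject, titles in sorted(groups.items()):
        print(subject, titles)  # IR ['Vector Models'], XML ['Kikori-KS', 'XRel']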

They did not really talk about summarization at all.


Session 9a: Information Organization

“Functional Composition of Web Databases”, Masao Mori, Tetsuya
Nakatoh, Sachio Hirokawa (Kyushu University, Japan)

When searching for content in multiple web databases, you need
multiple browser windows open to search the different systems. So they
propose a system that merges results from multiple data sources. They
automatically generate CGI programs on a server, based on user
settings, to connect to the various web databases and send responses
back to the user. The user specifies the settings (which databases to
use, connections to make between them, etc.) using a fairly
complicated-looking scripting language via a web-form interface. I
can’t help but think that this is a pretty complicated way to do
federated search, but it does look quite extensible and general.
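The core of what such a generated mediator does is presumably
something like this (a toy sketch; the endpoint URLs and JSON response
format are made up, and their system generates CGI programs from the
user’s settings rather than hard-coding sources):

    import json
    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Hypothetical endpoints, each assumed to return a JSON list of hits.
    SOURCES = {
        "db_a": "http://example.org/db_a/search?",
        "db_b": "http://example.org/db_b/search?",
    }

    def query_source(name, base_url, query):
        with urlopen(base_url + urlencode({"q": query})) as resp:
            return [dict(hit, source=name) for hit in json.load(resp)]

    def federated_search(query):
        # Query every source in parallel and merge the result lists.
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(query_source, n, u, query)
                       for n, u in SOURCES.items()]
            return [hit for f in futures for hit in f.result()]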

“Integration of Wikipedia and a Geography Digital Library”,
Ee-Peng Lim, Zhe Wang, Darwin Sadeli, Yuanyuan Li, Chew-Hung Chang,
Kalyani Chatterjea, Dion Hoe-Lian Goh, Yin-Leng Theng, Jun Zhang,
Aixin Sun (Nanyang Technological University, Singapore)

G-Portal is a geography digital library with metadata for geographical
web resources. The reason to integrate the two is that Wikipedia does
not have digital library features, while from G-Portal’s point of view
Wikipedia is a good resource for geography education. They created an
automatic login to G-Portal from Wikipedia so users can navigate from
Wikipedia to G-Portal content, and they modified G-Portal to show
selected information based on the displayed Wikipedia content (their
“metadata-centric display”). They also add G-Portal content links into
Wikipedia pages by using a reverse proxy, so they do not have to
actually change anything in the wiki. Based on preliminary studies
with students they have had a good response. They are now looking at
how they can update their metadata content when Wikipedia changes.
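The reverse-proxy trick amounts to sitting between the user and the
wiki and rewriting pages on the way through. A toy sketch of the idea
(the URLs and the injection rule are my placeholders, not theirs):

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen

    UPSTREAM = "http://en.wikipedia.org"  # the unmodified wiki
    LINK = ('<p><a href="http://gportal.example.org/find?page={p}">'
            'Related G-Portal resources</a></p>')

    class InjectingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            # Fetch the page from the wiki, untouched.
            with urlopen(UPSTREAM + self.path) as resp:
                html = resp.read().decode("utf-8", errors="replace")
            # Rewrite on the way through: inject a G-Portal link.
            html = html.replace("</body>", LINK.format(p=self.path) + "</body>")
            body = html.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), InjectingProxy).serve_forever()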

“Impact of Document Structure on Hierarchical Summarization”,
Fu Lee Wang (City University of Hong Kong), Christopher C. Yang
(Chinese University of Hong Kong)

This is single-document summarization. They compute scores at each
node in the hierarchical structure based on the scores of the
sentences under that node.
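As I understood it, the node score aggregates the sentence scores
beneath it, roughly like this sketch (the plain sum is my guess at the
aggregation; the paper presumably weights by depth in the structure):

    def node_score(node):
        # A node either holds leaf 'sentences' (scores) or child 'nodes'.
        if "sentences" in node:
            return sum(node["sentences"])
        return sum(node_score(child) for child in node["nodes"])

    doc = {"nodes": [
        {"sentences": [0.9, 0.2]},               # section 1
        {"nodes": [{"sentences": [0.5]},         # section 2.1
                   {"sentences": [0.7, 0.1]}]},  # section 2.2
    ]}
    print(node_score(doc))  # 2.4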

For multiple-document summarization, they looked at two sets of news
story documents (from CNN.com) and plotted the distribution of news
stories on a timeline. They treat the set of multiple documents as one
large document that should be fit into a hierarchical structure and
then summarized. They have three ways to build the trees:

  • a balanced tree built over the documents as arranged on a timeline;
  • an unbalanced tree whose child nodes cover equal, non-overlapping time intervals;
  • a tree structured by event topics.

For the summarization they generate a summary for each range block; if
the summary is too large, they partition the block into its children
(a sketch of the balanced variant follows below).
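Here is how I picture the balanced variant plus the recursive
partitioning (a hedged sketch; the fan-out, the length budget, and the
summarizer stub are all my assumptions):

    def build_tree(docs, fanout=2):
        # docs are assumed already sorted along the timeline.
        if len(docs) <= fanout:
            return {"docs": docs}
        step = -(-len(docs) // fanout)  # ceiling division
        return {"children": [build_tree(docs[i:i + step], fanout)
                             for i in range(0, len(docs), step)]}

    def collect(node):
        if "docs" in node:
            return node["docs"]
        return [d for c in node["children"] for d in collect(c)]

    def summarize(node, budget, summarizer):
        # Summarize the node's whole range; if the summary is too
        # large, partition the block into its children and recurse.
        summary = summarizer(collect(node))
        if len(summary) <= budget or "children" not in node:
            return [summary]
        return [s for c in node["children"]
                for s in summarize(c, budget, summarizer)]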

They look at the intersection of sentences between the summaries from
the three tree types, but I don’t know what that really shows or
means. They also conducted intrinsic and extrinsic evaluations of the
summaries. They had human summaries created at a 5% compression ratio,
and they report “precision” of the summaries at 5% compression, but I
don’t know how they are computing that. They show that “summarization
by event” performs better than the other two. They also performed an
extrinsic evaluation using a Q&A task.

According to that evaluation, the degree of tree fan-out does not
affect the results of summarization. Also, summarization by event
again outperformed the other two methods (though not at compression
ratios around 20%, where they are about the same). So they conclude
that hierarchical summarization is useful at high compression ratios.

I asked for clarification, and indeed the event topics are assigned
manually. I asked him later whether he had thought about using the DUC
data, but he said that because of the manual event categorization
required, they did not look at that data set. We talked briefly about
current automatic topic clustering, but he didn’t seem too interested.


Session 10: Keynote and Invited Talks

“Indexing All the World’s Books: Future Directions and Challenges for
Google Book Search”,
Daniel Clancy (Google, USA).

People are lazy, and if information is readily available they will go
there to get it instead of somewhere else that is slightly harder to
access. He had a poll question: how many of you have been in a library
in the past year? (In this audience, nearly everyone; in normal
audiences it can easily fall to 30%-50%!) The internet makes it very
easy to find some information, but there is still a vast amount of
information that is not accessible on the internet.

They have two initiatives: the partner program (about 15%, books that
are in print) and the library program (about 85%, mostly books that
are out of print and/or out of copyright). Of most library content,
about 15% is in print, 65% has unclear copyright status, and 20% is
public domain (books from before 1923). 92% of the world’s books are
neither generating revenue for the copyright holder nor easily
accessible to potential readers.

He gave a demo of Google Books. His first search, “Kyoto History”,
prompted him to talk about ranking problems. On the web, because of
the interconnectivity, Google is able to use that linking information
to improve result ranking. That link structure doesn’t exist within
books, so they have to look at other ways to improve rankings.

They’ve designed the interface as a real browsing experience. There is
a reference page that tries to give you information about the book,
with references (books that refer to this book), related books, and
references from scholarly works. This example was a partner-program
(publisher) book.

For books whose rights status they don’t know, they show a short
“snippet” view.

For full-view books you get the whole thing; for example, the Diary of
George Washington is available in full as a public domain book.

About finding books: the long tail. There is a really large amount of
diversity. They are moving towards blending book results into the
normal web search results. The result links you right into the page
where the information is. People are comfortable with parsing a book,
while some people have a hard time getting oriented on web pages since
their layouts are all different.

On their process for scanning: there are logistics for moving books,
doing the scanning, and storage. When Google looked at this market,
nobody had developed a scanner that fit their needs for truly
large-scale scanning. For 30 million books, they needed to make their
own cost-effective technology. He can’t release details of the
technology they are using, though. It sounds like they are doing some
sort of grid-type approach with lower-quality components, using
software to detect and fix problems, lots of redundancy, etc. As the
software improves over time, you can always go back and fix that
stuff. There are many problems with poor-quality books, tightly bound
books, etc. They also needed to make sure that they would not damage
the books. They have extensive processing to clean up yellowed pages,
etc. Other problems: getting the right page number, OCR and math,
spelling correction for very old books with olde English, intentional
spelling errors, incorrect metadata for books, mixed languages, layout
order for pages, and ongoing work on support for CJK languages.

How do you create a rich link structure that relates all these books
and information outside of books? Books are not always individual
units; they have relationships to other books, references, criticism
and reviews, and so on. They would like to have a similar kind of link
structure, so perhaps they will allow people to build up links into
books.

Discussion notes:

  • Role of Library in the Future
  • Access for everyone everywhere
  • Problems with current institutional subscription models
  • Publishing and User Generated Content
  • How can Google help support research?
  • Role of private companies

An interesting question related the Great Library of Alexandria to
Google: what happens if Google burns down? For the library partners,
Google actually gives them a copy of the data as it is created, so
Google will not be the only ones with the data.

“One Billion Children and Digital Libraries: With Your Help, What the
$100 Laptop and Its Sunlight Readable Display Might Enable”,

Mary Lou Jepsen (One Laptop Per Child, USA)

She had a demo of the cute little laptop. It is small and has two cute
“ears” for wireless, I guess. A single laptop is cheaper than a
textbook in the third world. They want to provide entire libraries to
kids.

The display is the most costly and power-intensive part of the laptop.
Mary Lou Jepsen’s job was to reduce the cost of the display. The
laptop is being produced by Quanta, the largest laptop maker in the
world (about 20 million a year). It has a book mode (the screen flips
around) and in general looks pretty cool. Average power consumption is
2 watts for the OLPC, compared to 45 watts for normal laptops (wow!).
So they go to great lengths to turn off components that are not in
use. They have a nice trick where they put a memory buffer on the
display; if nothing is changing on screen, they let the buffer do the
refresh and turn off the CPU. It has 512MB of flash, though no hard
drive.

Interesting comment: this isn’t a product, it is a global humanitarian
cause. Ownership is key: the laptop does not get broken when the child
owns it. The mesh network works: you have to get backhaul connectivity
to the school, but that is it; the distributed laptops then spread
connectivity across the area. They have a 640×480 video camera for
$1.50! It uses an AMD Geode GX2-500. There are 3 USB ports hidden
under the rabbit ears, an SD card slot, mic and stereo, and the video
camera. Mode 1 is 1200×900 greyscale, sunlight readable; mode 2 is
1024×768 color. Power is about 1 watt with the backlight on (vs. about
7 watts for a normal display), and ~0.2 watts with the backlight off.
Innovative changes in the LCD: the pixel layout is changed to use
fewer of the expensive color filters. It does 200dpi at 6 bits per
pixel (same as a book) in b&w mode; you get color when the backlight
is on, and room light increases the effective resolution.

They are also making a $100 disk-farm server that stores lots of data,
a $10 DVD player (parts are $8, but licensing fees are $10! They are
trying to get the licensing fee forgiven for that project), and a $100
projector. The projector uses the same screen as the laptop! That
reduces cost because the lamp doesn’t need to be so expensive: a $1
lamp and 30 watts of power consumption. The mechanical design has no
moving parts. It is droppable, with a shock-mounted LCD and a
replaceable bumper. It is dirt/moisture resistant, has a handle, and
can take a strap. They get about 0.5-1 kilometer of range with the
rabbit ears (2-3x more than without!). Input devices: game controller,
sealed keyboard, dual-mode touchpad (the middle is a finger pad, or
you can write with a stylus across the whole thing). They are
designing different types of chargers: car battery, human power, etc.
They had a crank, but ergonomically that didn’t work well, so they’ve
moved to a foot pedal and a string-puller thing.

All of the software they ship on there is GPL.


Session 11: ICADL2006 / DBWeb2006 Joint Panel

“Next-Generation Search”, Daniel Clancy (Google), Masaru
Kitsuregawa (University of Tokyo), Zhou Lizhu (Tsinghua University,
China), Wei-Ying Ma (Microsoft Research Asia, China), Hai Zhuge
(Chinese Academy of Sciences)

Kitsuregawa: Trends. Nobody knows the future, and on the web things
move very fast (the Friendster-to-MySpace migration, the acute
increase in MySpace’s value, YouTube). Transition from Search to
Service: currently the value of search results is not high enough;
people will not pay for them, so search is currently
advertising-supported. Businesses must respond within 1 second, but
universities do not have to be restricted to that limit. He is looking
at this in the Information Explosion project. He gave a demo of
temporal evolution search, showing how information on sites changes
over time, and how the linking patterns between them change.

He also started to speak about the Information Grand Voyage Project.
The main theme is From Search to Service.

Dr. Wei-Ying Ma, Microsoft Research Asia. Improving search: can we
understand content (web pages) or people’s queries better? There are
bottom-up approaches (IR, DB, ML, IE, DM, Web 2.0) and top-down
approaches (the semantic web). One research direction his group has
been working on is moving from web pages to web objects (entities),
for example in Microsoft Libra. The idea is that for the research
community there are many important objects (conferences, authors,
papers, journals, interest groups, etc.), and you can search for the
different types of objects. Input a search topic (e.g., “association
rules”) and get a ranked list of papers, or a ranked list of important
people in the field, or of important conferences / journals / interest
groups. This is an example of moving search engines from the page
level to the object level.
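As a toy illustration of the page-to-object shift (my own example, not
Libra’s actual method): if paper-level search already gives you scored
papers, object-level results fall out by aggregating those scores over
the objects each paper links to:

    from collections import defaultdict

    # Papers matching a query, with relevance scores and linked objects.
    papers = [
        {"title": "Fast Mining of Association Rules", "score": 0.9,
         "authors": ["A. Smith", "B. Chen"], "venue": "KDD"},
        {"title": "Association Rules at Scale", "score": 0.7,
         "authors": ["B. Chen"], "venue": "VLDB"},
    ]

    def rank_objects(papers, link_field):
        # Rank linked objects (authors, venues, ...) by summing the
        # scores of the papers they are connected to.
        scores = defaultdict(float)
        for p in papers:
            links = p[link_field]
            for obj in (links if isinstance(links, list) else [links]):
                scores[obj] += p["score"]
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(rank_objects(papers, "authors"))  # B. Chen first (0.9 + 0.7)
    print(rank_objects(papers, "venue"))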

They have another project (Guanxi) for doing general object search on
the internet. It looks like they try to model important types of
objects (at least people, for sure) and extract information about
them. They are moving from doing just relevance ranking to providing
intelligence in the search results, and also moving to personalized
search. Also, can we integrate more information from the deep/hidden
web (databases)? Searching is moving from individual to social as well
(MySpace, networking sites, and so on).

Systems and infrastructure: process-centric to data-centric, tightly
coupled to loosely coupled. They are trying to re-architect their
search engine so that it is built in modularized layers. They are
focusing on building an infrastructure, along with IDE tools, to make
development of these systems easier. WebStudio is another layer in
this search-framework architecture.

Global advertising was $500 billion in 2005, vs. $200 billion for
packaged software, so search is important. They are looking to focus
on global internet economies.

Hai Zhuge: “Completeness of Query Operations on Resource Space”.
People have not paid enough attention to semantics because of a lack
of computational power. How do you normalize the organization of
resources in a large-scale decentralized network? He introduces a
resource space model that can be used to locate objects in the space,
and normal forms over it. He talks about different operations over the
space.

Content is the key, but content is not just text, images, and
keywords, and it is not static; it needs to be dynamically clustered.
He gave a demo of the digital cave project, with lots of video and
scrolling: a combination of text, images, and sounds.

I have a very hard time seeing how the theory he has mostly been
talking about connects to the search aspects in practical terms; the
resource space model is a bit vague and unclear to me right now.

Zhou Lizhu: “The Next Generation Search Engines: Personal Search
Engines”. Search will become a daily activity, but information needs
vary drastically from person to person, so we need personal search
engines. Search results should be easily navigable by clustering,
viewable as a graph, and indexed by semantic terms. You should be able
to include technology from other fields (AI, NLP, DB, multimedia) to
build the search engines easily. We need to develop a systems-building
framework for personal search engines. They have an example, the
C-Book search engine for searching for books, and a platform for
search engine construction: SESQ, for building vertical search
engines. Users supply a schema, some seed URLs, and some external
rules, and the system does the rest.

Dr. Daniel Clancy. The reason we care about different directions in
the future of search is so that we can do research that has an impact;
the areas we publish in should grow and have an impact. Looking at the
history of how things evolved: users like it simple. After many
usability studies and assessments, people like things to be simple.
The other thing is: one revolution at a time, please. The web
revolution happened and users built this great link structure, and
then Google came along and took advantage of that. They did not say
“make good hyperlinks and tags and then I can make a great search”;
they used what was there. Search and ads were not coupled together:
they first got search right and then got ads right. As a researcher I
love the semantic web, but as a user I am cautious about the
possibilities.

How does it work today? Users give you a query, and it works well for
simple things. When you are looking for complex information, it has
more trouble (e.g., “I will be in Kyoto for three hours and will be on
the western side of town; which temple should I go to?”). The future
will be a transition from search to search as a dialogue.
Personalization is important, but in reality, knowing the personal
context provides very little information compared to the query itself.
Also, people do not like remembering lots of different places to go;
people like to go to one place and let the system figure out what
vertical application they are actually interested in. Another area of
big interest is user-generated content: how can we take advantage of
it and search within that content space, since it is a very different
type of content?


Comments and discussion.

Semantic high-level search is very interesting, but there is a problem
in that users like simplicity. Should queries be more natural language
in the future?

Daniel Clancy comments that natural language doesn’t seem to be up to
the level needed for this challenge yet. Users type in short queries
partly because that is what we have trained them to do, but maybe they
can be re-trained, and it will happen at some point in the future. It
is a good area for research.

Dr. Wei-Ying Ma: they initially thought that Ask Jeeves was their main
competitor, since they thought that natural language search would be
the main area. But NLP still does not seem to be up to the task.

Professor Kitsuregawa: When you look at the long tail, there are
people that make natural language queries. The desire is there, so we
should try to pursue that in the future. One way to reduce the
ambiguity in NLP is by using interaction, so going to dialogue first
might be a better idea. Right now they are only trying to apply NLP
to sentiment analysis.

Dr. Zhou agrees with this. There is still no formal theory to
represent the semantics of natural language, which is a big problem.

Kitsuregawa: the main problem is disambiguation, so maybe through
dialogue we can perform the disambiguation better.

Makiko Miwa: Perhaps it is possible to identify the user’s goal, or at
least the genre and the date of update of the content. For example,
some people monitor the same information every day, while other people
don’t want to read personal homepages but just want to look at
official sites. So what about processing users’ context information?

Professor Zhou: some context can be gained from log analysis, but it
still looks difficult to use this information to determine the user’s
goals. There is research on this in Europe that includes more user
context information and interaction.

Daniel Clancy: when people go to scholar.google.com or Microsoft’s
Libra, they are giving you some context. It is difficult to understand
a user because what they want really depends on, and changes with, the
circumstances.

