August 8, 2006

2006-08-08 SIGIR notes

Social Networks, Incentives, and Search
Jon Kleinberg, Cornell University

An introduction to social networks, and some parallels to information retrieval.

Via mathematical models of the web, we are looking for "Kepler's Laws of Motion for the Web" (Mike Steuerwalt NSF KDI 1998). Looking at social networks and decentralized search.
LiveJornal friendships [Liben-Nowell et al. 2005] looks like an interesting paper.
Very interesting talk overall about social networks, and characterizing how the links act. Why do they act that way? Geography seems to play more of a role than one would expect on the "global" internet.
A bit of recent work on adding incentives to search (offer money, send out search, intermediaries take a cut, push on query, etc.) and a Nash equilibrium for it.


Session I: Question Answering

Probabilistic Model for Definitional Question Answering
Kyoung-Soo Han, Young-In Song, Hae-Chang Rim
Korea University

They develop a probabilistic model that models the Probability of the (Topic, Definition | Sentence) (~= P(T|S)P(S|T)).

They use top ranked documents from corpus as evidence for the topic language model along with external definition sources, and top ranked documents from the web.
The definition language model is a unigram model that differs by definition type, so they have a component that is for the domain, and a component that is general for all definitional language.

Did experimentation with 2003 and 2004 TREC answer set over AQUAINT corpus. Used Lin's automatic method for evaluation (POURPRE Lin 05). They tune the parameters that control the mixing of the three language models (documents, external definitions, web documents) and show that they can complement each other. Their system performs comparably to top-level systems in TREC.

For topic modeling external definitions are the most important, but can be complemented by top ranked documents and wed documents.

The definition model changes a lot depending on the domain and target types.

Change to different session: Machine Learning

Identifying Comparative Sentences in Text Documents
Nitin Jindal, Bing Liu
University of Illinois at Chicago

They are interested in opinions in evaluative text (like customer reviews or something.) There are two types of evaluations, direct opinions, and comparisons.
Their tasks are to extract product features, and opinions that people hold. Then they use that to produce a summary that is graphical: positive / negative distribution on a bar graph of the extracted features. They also propose to mine comparative relations like "X is better or worse than Y with regard to Z" and use that relation to produce a summary.

The challenge here is that not all sentences that contain POS tagged adjectives are comparatives (JJR, RBR, JJS, RBS) and some sentences are comparatives but do not use explicit comparison words. So they make it into a classification problem. There are two observations that help: 1. there are strong linguistic patterns that indicate a comparison (... better than ... ) and 2. a comparative sentence almost always contains a keyword that indicates that the sentence might be a comparison.

They take a keyword strategy, with about 83 keywords that have very high recall and reasonable precision. They use POS tags (JJR, JJS, RBR, RBS) instead of lexical keywords (so those are 4 keywords) and exceptions: more, less, most, least. Other indicative words, and some key phrases. Step 1: extract all sentences containing at least one keyword, 2. improve precision using naive bayes classifier to classify into comparison or not with linguistic patterns as features.

They have some complicated way to learn pattern sequence rules that capture the comparison sentences. They use the rules in the NB classifier as features.

There is a lot of interesting work here, and I should read some of their previous work to come up to speed for the opinion work at NTCIR. I really liked this talk, I plan to go back and read the paper.

Back in the Question Answering session.

Predicting the Quality of Answers with Non-Textual Features
J. Jeon, W. B. Croft, (University of Massachusetts Amherst), J.H. Lee (Soon-sil University), S. Park (Dunksung Women's University)

By non-textual information they mean something like view count, click count, number of times recommended, etc. They are looking at sites like community answer sites, in Korean but similar to Yahoo answers maybe.

Given a question answer pair, they extract features, feed them into a quality predictor (based on maximum entropy) and output a quality score. They use 13 non-textual features from the service (click count, print count, copy count, questioner's self-evaluation score, answer's activity score, answer length, answer's acceptance ratio, recommendation count, sponsor's answer, editor's recommendation, ...)

They convert non-monotonic features (like question length, user activity -- e.g., very active could be spammers, so it isn't a monotonically increasing indicator) into a monotonic type feature. They use kernel density estimation to do that. I need to see the details of that though. Once they have their quality predictor, they ran some IR experiments to see if this quality prediction could improve IR search results for good answers.


Session V: Web 2

Finding near-duplicate web pages: a large-scale evaluation of algorithms
Monika Henzinger
Google and Ecole Federale de Lausanne (EPFL)

If the difference is invisible to the user, or is only minor (text differing only by time stamps, or a "execution time" line or something) then the web page is a near-duplicate. Another definition is that if two pages are entry pages to the same site (happens with porn sites mostly.)

She looks at evaluating two previously proposed algorithms for the duplicate detection task. Prepossessing strips HTML, and then converts text into 64 bit number tokens. There is a proof that the Jacard measure of similarity over the number sequences can be approximated by whether the two documents share the same minimum when you hash through a random 1-1 function. So you can compute multiple random functions (say 80 some) and then store the mins of all of them, and estimate the similarity by looking at the Jacard overlap of the mins. This makes it a tractable problem. The other algorithm looks at using a hyperplane through the terms to see if they lie on different sides of a random hyperplane. Do that check for multiple hyperplanes, and check to see how many times they are on the same side of the hyperplane. So you can store a bit vector over multiple random hyperplanes. Both of these approaches can be made to work in linear time with number of documents (down from n^2.)

In a hand evaluation of about 1900 random samples, determined if the results were correct, non-correct, or undecided. Looked at the distribution of number of token differences and number of near-duplicates per-page. The algorithms do not have much overlap, so you can run the two in parallel. She runs algorithm B, then filters with algorithm C.

These approaches work well for near duplicates on different sites, but poorly when looking at pages from the same sites. The problem there is boiler plate text that is the same on the site.


Structure-driver crawler generation by example
Altigan Silva, Marcio Vidal, Edleno Moura, Joao Cavalcanti
Federal University of Amazonas

This is something about focused crawling, but I don't really know much about this area. I didn't really understand the problem, and didn't follow what he was talking about.


Building Implicit Links from Content for Forum Search
Gu Xu, Wei-Ying Ma
Microsoft Research Asia

They have a model for user browsing that is based on topic models. A user is more likely to stay withing their topic or jump to a related topic than to jump to a completely unrelated topic.

They have models of content, and build implicit links between pages based on the content. Users jump from pages over their implicit links based on the probability of the transition based on these language models. It is like adding another link based on the content to the google-type link analysis algorithms.


I skipped the next talk (improving PageRank with some damping functions) so I could re-charge my laptop before meeting with Inoue-san. Inoue-san and I met after the last session and went over slides. We went to dinner on "the Ave" at a fairly cheap Korean place. They had nice dul-sott bi bim bab. I tried to get to bed early, but since the hotel has Comedy Central I got sucked into watching the Daily Show and the Colbert Report...


Provide your email address when commenting and Gravatar will provide general portable avatars, and if you haven't signed up with them, a cute procedural avatar with their implementation of Shamus Young's Wavatars.

Comments have now been turned off for this post