August 9, 2006

2006-08-09 SIGIR Notes

Wednesday's keynote:

Information Access in the Extended Boeing Enterprise
Radha Radhakrishnan

Overview of Boeing's information technology and information distribution structure.

Session VI: Clustering

Document Clustering with Prior Knowledge
Xiang Ji, Wei Xu, Shenhou Zhu, Yihong Gong
NEC Labs America

Document clustering for summarization: a reference to Neto 2000?

Their application is to build a knowledge base from conversational text in email. They have some prior knowledge about cluster size and cluster membership, which I think comes from human judgements on the web for the mailing list. This isn't really clear, but hopefully he will have an example. Their approach models the set of documents as an undirected graph where similarity links exist between some (not all) pairs of documents. They integrate their prior knowledge as constraints over the (dis)similarity matrix, which turns clustering into a constraint satisfaction problem layered on top of the clustering problem. They use eigenvector decomposition to project the documents into the eigenspace spanned by K eigenvectors, then use k-means to find the document clusters in that eigenspace. Incorporating this prior knowledge outperforms traditional clustering approaches. I'm still not sure where they get this "paired" information for some documents, though.
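To make the pipeline concrete for myself, here is a minimal sketch of the general idea (constraints on a similarity matrix, eigenvector projection, then k-means). It is not the authors' formulation: encoding must-link and cannot-link pairs by overwriting similarities with 1 and 0, the symmetric normalization, and the function name `constrained_spectral_clusters` are all my own assumptions, and it relies on NumPy.

```python
import numpy as np

def constrained_spectral_clusters(sim, must_link, cannot_link, k, n_iter=20):
    """Cluster items from a similarity matrix adjusted by pairwise constraints."""
    w = np.array(sim, dtype=float)
    # Assumption: prior knowledge is encoded by overwriting similarities.
    for i, j in must_link:
        w[i, j] = w[j, i] = 1.0
    for i, j in cannot_link:
        w[i, j] = w[j, i] = 0.0

    # Symmetric normalization D^(-1/2) W D^(-1/2) of the similarity graph.
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    norm = w * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; keep the k largest.
    _, vecs = np.linalg.eigh(norm)
    emb = vecs[:, -k:]
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    # k-means in the eigenspace, with deterministic farthest-point seeding.
    centers = [emb[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq_dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq_dist, axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = emb[labels == c].mean(axis=0)
    return labels.tolist()
```

Calling it with a 6x6 similarity matrix that has two obvious blocks, plus one must-link and one cannot-link pair, should return the same label for each block. Which label number each block gets is arbitrary.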


Text Clustering with Extended User Feedback
Yifen Huang, Tom Mitchell
Carnegie Mellon University

Different user criteria can result in different clustering results. They want to look at how to model user criteria by modeling a general topic, or a specific topic that a user is interested in. They model the words in a document using Naive Bayes, with a binary variable for each word that decides whether it is generated by the specific or the general topic model, and then have models that predict the word based on the cluster. They can extend the model by adding more features (like people's contact info for email).

What kind of feedback can people give to the clustering algorithm? For email, they allow users to give feedback by removing an entire cluster, which adjusts the initial posterior probabilities for documents being in the cluster. The user can also give feedback on each email (document), which adjusts the posterior probability of the specific feature set given the document. The other thing they can do is give feedback on key persons or keywords.

They evaluate on an email corpus and three different newsgroup corpora (3 similar groups, 3 related groups, 3 different groups). Using the full set of feedback, they achieve over 20% improvement in accuracy.


Near-Duplicate Detection by Instance-Level Constrained Clustering
Hui Yang, Jamie Callan
Carnegie Mellon University

They are focusing on semantic similarity. Her application is in the governmental domain, where by law they have to read all comments submitted by people, so they want to identify near duplicates to avoid lots of work reading the same comments. Some groups make this even worse by setting up forms. She looked at the different editing styles that occurred in a corpus, and they are trying to detect when a document is an edit of another. They use some domain knowledge about the documents to determine different types of links (must-links, cannot-links) based on things like the mailer relay, shared footer text, etc. They also have some set of constraints for the clustering problem.

They identify master copies by doing a SHA1 hash on documents, sorting them, and then taking the document with the earliest timestamp and at least 5 duplicates.
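The master-copy step is simple enough to sketch. This is my reading of it, not their code: I assume each document is a hypothetical `(doc_id, timestamp, text)` tuple, that only byte-identical texts share a hash, and that "at least 5 duplicates" means at least five copies besides the master.

```python
import hashlib
from collections import defaultdict

def find_master_copies(docs, min_duplicates=5):
    """Return ids of the earliest copy in each group of exact duplicates."""
    groups = defaultdict(list)
    for doc_id, timestamp, text in docs:
        # Identical texts produce identical SHA-1 digests.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        groups[digest].append((timestamp, doc_id))

    masters = []
    for digest in sorted(groups):
        # Sorting by (timestamp, doc_id) puts the earliest copy first.
        members = sorted(groups[digest])
        # Assumption: the master itself does not count as a duplicate.
        if len(members) - 1 >= min_duplicates:
            masters.append(members[0][1])
    return masters
```

Exact hashing only catches verbatim copies, so this would just pick the seeds. The edited near duplicates are what the constrained clustering is for.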


Unifying User-based and Item-based Collaborative Filtering by Similarity Fusion
Jun Wang, Arjen P. de Vries, Marcel J.T. Reinders
Delft University of Technology, The Netherlands

They cast recommendation as a prediction problem. Create a user-item matrix of ratings; for the items we have no rating for, can we predict what our rating would be (based on the ratings of similar users)? They propose fusing user-based and item-based ratings at the same time, using a probabilistic model that interpolates the user and item parts.
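A rough sketch of the interpolation idea, heavily simplified: the paper's model is probabilistic, whereas this just linearly mixes a user-based and an item-based weighted average. The cosine weighting, the mixing parameter `lam`, and all function names are my own assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[key] * b[key] for key in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def weighted_mean(pairs):
    """Mean of ratings weighted by similarity; None if no weight is available."""
    total = sum(weight for weight, _ in pairs)
    if total == 0:
        return None
    return sum(weight * rating for weight, rating in pairs) / total

def predict(ratings, user, item, lam=0.5):
    """Interpolate user-based and item-based predictions for one missing rating."""
    # Transpose the user -> {item: rating} table into item -> {user: rating}.
    by_item = {}
    for u, row in ratings.items():
        for i, r in row.items():
            by_item.setdefault(i, {})[u] = r

    # User part: ratings of this item by other users, weighted by user similarity.
    user_part = weighted_mean(
        [(cosine(ratings[user], row), row[item])
         for u, row in ratings.items() if u != user and item in row])
    # Item part: this user's ratings of other items, weighted by item similarity.
    item_part = weighted_mean(
        [(cosine(by_item.get(item, {}), by_item[i]), r)
         for i, r in ratings[user].items() if i != item])

    if user_part is None:
        return item_part
    if item_part is None:
        return user_part
    return lam * user_part + (1 - lam) * item_part
```

With `lam=1.0` this reduces to plain user-based filtering and with `lam=0.0` to plain item-based filtering, which is the sense in which the two are "unified" here.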


Personalized Recommendation Driven by Information Flow
Xiaodan Song (University of Washington), Belle L. Tseng (NEC Labs America),
Ching-Yung Lin, Ming-Ting Sun (University of Washington)

They take the model that some people are early adopters, some are late adopters, and so on. People in these classes are likely to have the same usage habits. Information adoption is modeled as a random walk, and users are ranked by state probabilities.
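The notes do not say how the state probabilities are computed, so the following is only a guess at the flavor: treat users as states of a Markov chain, find its stationary distribution by power iteration, and rank users by it. The transition matrix, the use of the stationary distribution, and the function names are assumptions of mine.

```python
def stationary_distribution(transition, n_iter=200):
    """Power iteration on a row-stochastic matrix (rows sum to 1)."""
    n = len(transition)
    probs = [1.0 / n] * n
    for _ in range(n_iter):
        # One random-walk step: new mass at j is the mass flowing in from every i.
        probs = [sum(probs[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    return probs

def rank_users(users, transition):
    """Sort users by the probability that the walk sits at their state."""
    probs = stationary_distribution(transition)
    return sorted(zip(users, probs), key=lambda pair: pair[1], reverse=True)
```

Power iteration only settles on a single answer when the chain is irreducible and aperiodic, so a real adoption graph would probably need some smoothing first.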


Analysis of a Low-dimensional Linear Model under Recommendation Attacks
Sheng Zhang, Yi Ouyang, James Ford, Fillia Makedon
Dartmouth College

The paper is about a model that resists "shilling attacks," in which people on recommendation sites try to artificially raise the scores of items in recommendations.

A Compositional Context Sensitive Multi-Document Summarizer
Ani Nenkova

An analysis of created summaries and term frequency. Basically, it looks at the question of whether frequency is a good feature to use in a summarizer.


Information Graphics: An Untapped Resource for Digital Libraries
Sandra Carberry (1), Stephanie Elzer (2), Seniz Demir (1)
University of Delaware, Millersville University

They would like to be able to retrieve information graphics from a digital library. They have done a corpus study and created a Bayesian system that tries to recognize the message conveyed by bar charts.

Their study looks at how much of the graphic's message is conveyed by the article text. They classify each article's text into one of four categories: fully conveys, mostly conveys, conveys little, or conveys none of the graphic's message. Many articles did not convey the message of the graphic, so they argue that the author splits information between the two modes, and that we should therefore index the graphic for search and retrieval.

They have a system that does OCR / recognition over a .gif. They have an analysis module with a few types of rules, and a set of words that suggest a certain category of message. They use these signals as input to a Bayesian network that predicts the message. They used 110 simple bar charts in a cross-validation experiment.


News to Go: Hierarchical Text Summarization for Mobile Devices
Jahna Otterbacher (University of Cyprus)
Dragomir Radev, Imer Kareem (University of Michigan)

A study of different summarization techniques for reading news on mobile devices. The approaches: a hierarchical approach in which users can drill down to other sentences, the top 20% of highly rated sentences, first sentences, and full documents. The evaluation was done as a task asking multiple-choice questions about important events in the documents. I liked the task-based evaluation a lot.

