ICADL 2007 - LNCS 4822
   

Automatic Classification of Web Search Results: Product Review vs. Non-review Documents

Tun Thura Thet, Jin-Cheon Na, and Christopher S.G. Khoo

Wee Kim Wee School of Communication and Information, Nanyang Technological University, 31 Nanyang Link, 637718 Singapore
ut0001et@ntu.edu.sg
tjcna@ntu.edu.sg
assgkhoo@ntu.edu.sg

Abstract. This study seeks to develop an automatic method to identify product review documents on the Web using the snippets (summary information that includes the URL, title, and summary text) returned by a Web search engine. The aim is to allow the user to extend topical search with genre-based filtering or categorization. First, we applied a common machine learning technique, SVM (Support Vector Machine), to investigate which features of the snippets are useful for classification. The best results were obtained using just the title and URL (domain and folder names) of the snippets as phrase terms (n-grams). We then developed a heuristic approach that utilizes domain knowledge constructed semi-automatically, and found that it performs comparably well, with only a small drop in accuracy. A hybrid approach that combines the machine learning and heuristic approaches performs slightly better than the machine learning approach alone.

Keywords: Product Review Documents, Genre Classification, Snippets, Web Search Results
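
As an illustration of the feature set the abstract describes (title and URL domain and folder names as phrase terms), the following is a minimal sketch only, not the authors' implementation. The function names, the field prefixes, the choice of unigrams plus bigrams, and the example URL are all assumptions made for illustration; the abstract does not specify tokenization or n-gram length.

```python
import re
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, max_n=2):
    """Return contiguous n-grams (space-joined) for n = 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def snippet_features(title, url, max_n=2):
    """Build a set of phrase terms from a snippet's title and URL.

    Hypothetical helper: each term is prefixed with its field
    (title, domain, folder) so that the same word appearing in
    different fields yields different features.
    """
    parsed = urlparse(url)
    # Folder names: everything in the path before the last slash,
    # so a trailing file name such as "index.html" is excluded.
    folder_part = parsed.path.rsplit("/", 1)[0]
    fields = {
        "title": tokenize(title),
        "domain": tokenize(parsed.netloc),
        "folder": tokenize(folder_part),
    }
    features = set()
    for field, tokens in fields.items():
        for gram in ngrams(tokens, max_n):
            features.add(f"{field}:{gram}")
    return features


features = snippet_features(
    "Canon PowerShot Review",
    "http://reviews.example.com/cameras/canon/index.html",
)
```

A set like `features` could then be mapped to a sparse binary vector and passed to an SVM trainer, or matched against a list of review-indicating cue terms in a heuristic classifier; both of those steps are outside this sketch.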

LNCS 4822, p. 65 ff.

© Springer-Verlag Berlin Heidelberg 2007