INCREMENTAL DOCUMENT CLUSTERING BASED ON A COMBINED MODEL WITH
INCREMENTALDBSCAN
Nguyễn Hoàng Tú Anh, Bùi Thị Danh, Nguyễn Anh Thy
Faculty of Information Technology, University of Science – VNU HCMC
nhtanh@fit.hcmuns.edu.vn, tuanaivnn@yahoo.com, popstarsongngu@yahoo.com

Summary
The growth of large databases that change frequently over time poses a major question for researchers in data clustering: how can changes in the cluster structure be managed while still keeping execution time acceptable? The incremental clustering problem arose in response. When the database is a collection of documents, we have the incremental document clustering problem. In this paper, we propose a method that combines a model representing documents as a graph with the incremental clustering algorithm IncrementalDBSCAN as one approach to the incremental document clustering problem. The graph model fully represents the structure of each document as well as of the entire collection, and the graph structure is updated when a new document is added. IncrementalDBSCAN, meanwhile, is an efficient clustering algorithm on data sets that change frequently. The similarity between two documents is computed from the similarity between their feature vectors and information about the phrases they share. Several improvements have been applied to limit the weaknesses of the IncrementalDBSCAN algorithm. The results obtained show the effectiveness of the method.

Keywords: text mining, graph-based model, incremental document clustering, phrase-based indexing, document similarity.

GRAPH-BASED DOCUMENT REPRESENTATION AND
INCREMENTALDBSCAN - A COMBINATION APPROACH FOR
INCREMENTAL DOCUMENT CLUSTERING
Nguyen Hoang Tu Anh, Bui Thi Danh, Nguyen Anh Thy
Faculty of Information Technology, University of Science – VNU HCMC
nhtanh@fit.hcmuns.edu.vn, tuanaivnn@yahoo.com, popstarsongngu@yahoo.com

Abstract
The growth of large, frequently updated databases raises the question of how to manage changes in cluster structure while keeping processing time acceptable. This question motivates the incremental clustering problem. When the data set is a document collection, the problem becomes incremental document clustering. In this paper, we propose an approach that combines a graph-based document representation with IncrementalDBSCAN, an incremental clustering algorithm, to solve the incremental document clustering problem. The graph-based model fully represents the structure of each document as well as of the whole collection, and the graph is updated whenever a new document is added. IncrementalDBSCAN, in turn, is an effective clustering algorithm for data sets that change frequently. The similarity between two documents is measured from the similarity of their feature vectors together with information about the phrases they share. We also propose several improvements that mitigate shortcomings of IncrementalDBSCAN. Our experimental results illustrate the effectiveness of the proposed method.

Keywords: text mining, graph-based model, incremental document clustering, phrase-based indexing, document similarity.
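
The abstract states that document similarity combines feature-vector similarity with information about shared phrases, but it does not give the formula. The following is a minimal sketch of one way such a blend might look, not the authors' measure. The use of term-frequency cosine similarity, word bigrams as a stand-in for phrases obtained from the document graph, Jaccard overlap, and the weight `alpha` are all assumptions made for illustration, as are the function names.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def phrases(tokens, n=2):
    """Contiguous n-word phrases (assumed stand-in for graph-derived phrases)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def phrase_overlap(tokens_a, tokens_b, n=2):
    """Jaccard overlap of the phrases shared by two token lists."""
    pa, pb = phrases(tokens_a, n), phrases(tokens_b, n)
    union = pa | pb
    if not union:
        return 0.0
    return len(pa & pb) / len(union)


def combined_similarity(doc_a, doc_b, alpha=0.5):
    """Blend of vector and phrase similarity; alpha is an assumed weight."""
    ta, tb = doc_a.lower().split(), doc_b.lower().split()
    vector_part = cosine_similarity(ta, tb)
    phrase_part = phrase_overlap(ta, tb)
    return alpha * vector_part + (1 - alpha) * phrase_part
```

Under these assumptions, two documents with the same words in a different order score high on the vector term but low on the phrase term, which is the kind of distinction a phrase-aware measure is meant to capture.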