DATA CLASSIFICATION WITH THE ARCX4-LSSVM ALGORITHM
Phạm Nguyên Khang, Đỗ Thanh Nghị, Trần Cao Đệ
Faculty of Information Technology, University of Can Tho
{pnkhang,dtnghi,tcde}@cit.ctu.edu.vn

Summary
In this paper we present a new learning algorithm, Arcx4 Least-Squares Support Vector Machine (Arcx4-LSSVM), for classifying very large datasets on a personal computer. We extend the learning algorithm of Suykens with a Tikhonov regularization term and the Sherman-Morrison-Woodbury formula so that it can handle data with very high dimensionality. We then combine it with Breiman's Arcx4 method to build the Arcx4-LSSVM algorithm, which can classify datasets that are massive in both the number of examples and the number of dimensions. Experimental results on datasets including Adult, KDDCup 1999 and Forest Covertype from UCI, Reuters-21578 and RCV1-binary show that Arcx4-LSSVM trains very quickly and achieves the highest accuracy in most cases when compared with other support vector machine algorithms such as LibSVM, CB-SVM and SVM-perf.

Keywords: support vector machines, Arcx4, large-scale data classification.

LARGE SCALE DATA CLASSIFICATION USING ARCX4-LSSVM
Pham Nguyen Khang, Do Thanh Nghi, Tran Cao De
Faculty of Information Technology, University of Can Tho
{pnkhang,dtnghi,tcde}@cit.ctu.edu.vn

Abstract
The Arcx4 Least-Squares Support Vector Machine (Arcx4-LSSVM) algorithm aims to classify large datasets on standard personal computers (PCs). We extend the LS-SVM proposed by Suykens and Vandewalle in several ways so that it can classify large datasets efficiently. By adding a Tikhonov regularization term and using the Sherman-Morrison-Woodbury formula, we develop a column-incremental LS-SVM that processes datasets with a small number of data points but very high dimensionality. Finally, by applying Arcx4 to the incremental LS-SVM algorithm, we obtain a classification algorithm for massive, very-high-dimensional datasets. Numerical test results on the UCI, RCV1-binary, Reuters-21578, Forest Covertype and KDDCup 1999 datasets show that our algorithm is often significantly faster and/or more accurate than the state-of-the-art algorithms LibSVM, SVM-perf and CB-SVM.

Key words: least-squares support vector machine, Arcx4, massive classification.
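
The Arcx4 step mentioned in the abstract can be illustrated with a minimal sketch of Breiman's resampling rule: example i is drawn with probability proportional to 1 + m_i^4, where m_i counts how often the classifiers built so far have misclassified it, and the final prediction is an unweighted majority vote. The function names below are illustrative assumptions, and the base learner `fit_stump` is a hypothetical one-dimensional threshold classifier standing in for the LS-SVM, which is not reproduced here.

```python
import random
from collections import Counter


def arcx4_probabilities(miscounts):
    """Arcx4 sampling distribution: p_i is proportional to 1 + m_i**4."""
    weights = [1 + m ** 4 for m in miscounts]
    total = sum(weights)
    return [w / total for w in weights]


def arcx4_train(xs, ys, fit, rounds=10, seed=0):
    """Train `rounds` base classifiers on Arcx4-resampled training sets.

    `fit(sample_xs, sample_ys)` must return a function mapping x to a label.
    """
    rng = random.Random(seed)
    n = len(xs)
    miscounts = [0] * n  # m_i: errors made so far on example i
    models = []
    for _ in range(rounds):
        probs = arcx4_probabilities(miscounts)
        # Draw n indices with replacement according to the current distribution.
        idx = rng.choices(range(n), weights=probs, k=n)
        model = fit([xs[i] for i in idx], [ys[i] for i in idx])
        models.append(model)
        # Run the full training set through the new classifier and count errors.
        for i in range(n):
            if model(xs[i]) != ys[i]:
                miscounts[i] += 1
    return models


def arcx4_predict(models, x):
    """Unweighted majority vote over the base classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def fit_stump(sample_xs, sample_ys):
    """Hypothetical stand-in base learner (NOT the paper's LS-SVM):
    the 1-D threshold and sign with the fewest errors on the sample."""
    best = None
    for t in sorted(set(sample_xs)):
        for sign in (1, -1):
            errors = sum(
                1 for x, y in zip(sample_xs, sample_ys)
                if (sign if x >= t else -sign) != y
            )
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: sign if x >= t else -sign
```

Calling `arcx4_train(xs, ys, fit_stump, rounds=5)` on a list of numbers `xs` with labels `ys` in {-1, 1} returns five stumps, and `arcx4_predict(models, x)` combines them by voting. Any other base learner with the same `fit` interface could be substituted.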