Support Vector Machine

A Support Vector Machine (SVM) is most commonly used to split a single input set of documents into two distinct subsets. For example, SVM can classify documents as privileged or non-privileged, or as record or non-record. The algorithm learns to distinguish between the two categories from a training set that contains labeled examples of both. Internally, SVM represents each document as a point in a high-dimensional space and then finds the hyperplane that separates the two categories with the widest margin. In the illustrated example, the documents are represented as points in a two-dimensional space, and the SVM algorithm finds the linear separator that divides the plot into two parts corresponding to the two opposing classes. This separating line (i.e., the model) is recorded and used to classify new documents: each new document is mapped into the same space and assigned a class according to which side of the line it falls on.

There are many ways to reduce a document to a vector representation suitable for classification. For example, the number of times particular terms, characters, or substrings appear can be counted; sentence lengths or the amount of white space can also be considered.
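The process described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular product: the two-term vocabulary, the sample documents, the function names, and the training settings (learning rate, regularization strength, number of epochs) are all assumptions made for the example. Training uses plain subgradient descent on the regularized hinge loss, which is one simple way to fit a linear SVM.

```python
from collections import Counter

# Hypothetical two-term vocabulary, so each document becomes a 2-D point.
VOCABULARY = ("legal", "lunch")

def vectorize(text, vocabulary=VOCABULARY):
    """Reduce a document to a vector of term counts."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]

def train_linear_svm(points, labels, lr=0.1, lam=0.01, epochs=200):
    """Fit w and b by subgradient descent on the L2-regularized hinge loss.

    Labels must be +1 or -1. The hyperparameter defaults are illustrative.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wi - lr * lam * wi for wi in w]
            # Hinge-loss step: only points inside the margin move the separator.
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(text, w, b):
    """Return +1 or -1 depending on which side of the separator the document falls."""
    score = sum(wi * xi for wi, xi in zip(w, vectorize(text))) + b
    return 1 if score >= 0 else -1

# Made-up training set: +1 = privileged, -1 = non-privileged.
training_docs = [
    ("legal advice on the legal claim", 1),
    ("please review the legal memo", 1),
    ("lunch is at noon", -1),
    ("team lunch menu and lunch order", -1),
]
points = [vectorize(text) for text, _ in training_docs]
labels = [label for _, label in training_docs]

w, b = train_linear_svm(points, labels)  # (w, b) is the recorded model
predictions = [classify(text, w, b) for text, _ in training_docs]
```

Once trained, the pair `(w, b)` plays the role of the recorded separating line: calling `classify("legal hold for a legal matter", w, b)` maps the new document into the same two-dimensional space and reports the side it lands on. A real system would use a much larger vocabulary and, most likely, an established SVM library rather than a hand-written trainer.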

Figure: Supervised Learning