Comparative Study of Classification Algorithms: Holdouts as Accuracy Estimation

This study aims to measure and compare the performance of five machine learning based text classification algorithms, namely decision rules, decision tree, k-nearest neighbor (k-NN), naïve Bayes, and Support Vector Machine (SVM), on multi-class text documents. The algorithms were compared on effectiveness, i.e. the ability to classify documents into the correct categories, using the holdout (percentage split) method. Effectiveness was measured by precision, recall, F-measure, and accuracy. The experimental results show that for the naïve Bayes algorithm, the larger the percentage of training documents, the higher the accuracy of the resulting model. Naïve Bayes reached its highest accuracy at a 90/10 percentage split, SVM at 80/20, and decision tree at 70/30. The results also show that naïve Bayes had the highest effectiveness values among the five algorithms tested, as well as the fastest model building time, 0.02 seconds. The decision tree algorithm classified the text documents with higher accuracy than SVM, but its model building time was slower. In terms of model building time, k-NN was the fastest, but its accuracy was inferior.

e-ISSN: 2477-8079. This article has been accepted for publication in Cogito Smart Journal but has not yet been fully edited. Some content may change prior to final publication.

1. INTRODUCTION
An information retrieval system aims to obtain relevant information from a large collection of information. As the number of digital text documents spread over the internet continues to grow every day, it triggers the need for a system that can organize the documents and make it easy for users to get the right and useful information. A number of algorithms and tools have been developed and implemented to retrieve information from large repositories.
Data mining provides a solution to handle the rapid growth of data. Using data mining techniques, documents are grouped into classes in order to simplify the process of retrieving information from a large set of data [1]. In data mining, there are two main approaches to grouping documents, namely classification and clustering. The classification method groups documents into fixed categories based on the documents' predefined labels. The clustering method, on the other hand, groups documents based on their similarity.
Document classification is defined as grouping documents into one or more categories based on predefined labels. Document classification that starts with a learning process to determine the category of a document is called supervised learning. This research investigated text documents. References [2] and [3] define text classification as a relation between two sets: a set of documents D = {d1, d2, ⋯, dm} and a set of categories C = {c1, c2, ⋯, cn}, where di is the i-th document to be classified, cj is the j-th predefined category for a document, m is the number of documents to be classified, and n is the total number of predefined categories in C. Text classification is the process of defining a Boolean value for each pair (di, cj) ∈ D × C, where D is the set of documents and C is the set of predefined categories. Classification approximates the classifier function (also called rule, hypothesis, or model) Φ : D × C → {T, F}. The value T (true) assigned to a pair (di, cj) indicates that document di belongs to category cj; otherwise, the value F indicates that document di is not a member of category cj.
A document is a sequence of words [4]. In information retrieval, a document is stored as a set of words, also called the vocabulary or feature set [5]. The Vector Space Model is employed as the document representation model. A document is an array of words in the form of a binary vector, with a value of 1 when a word is present in the document and 0 when it is absent. Each document is included in the |V|-dimensional vector space, where |V| is the size of the vocabulary V. A collection of documents, called a dataset, is represented as an m × n matrix, where m is the number of documents and n is the number of words. Matrix element aij denotes the occurrence of word j in document i, represented as a binary value.
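The binary Vector Space Model described above can be sketched in a few lines; the two toy documents and their vocabulary are invented for illustration:

```python
# Sketch of the binary Vector Space Model: each document becomes a
# 0/1 vector over the vocabulary V (the feature set).

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Build the vocabulary V from all documents (sorted for a stable order).
vocab = sorted({word for doc in docs for word in doc.split()})

# m x n binary matrix: matrix[i][j] = 1 if word j occurs in document i, else 0.
matrix = [[1 if word in doc.split() else 0 for word in vocab] for doc in docs]

print(vocab)   # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)  # [[1, 0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 1]]
```

Each row of `matrix` is one document vector in the |V|-dimensional space; a real pipeline would build the vocabulary after stop-word removal and stemming.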
There are two main approaches to classifying documents: the rule-based approach and the machine learning approach. In the rule-based approach, also called knowledge engineering, the rules that define the categories of documents are assigned manually by an expert. The documents are then grouped into the categories that have been defined [2]. Using this method, a rule-based classifier is able to produce an effective classification with good accuracy. However, its dependency on an expert to assign the rules manually is the main drawback: when the categories change, the expert who defined the rules must be involved again. Overall, this method is costly and time-consuming when classifying large numbers of documents [6]. This research aims to examine and compare text document classification algorithms, specifically machine learning based classification algorithms.

Machine Learning based Classification
To overcome the weaknesses of the rule-based classifier, a machine learning based approach is applied to perform classification. This method is also called the inductive process or learner, in which document classification runs automatically using text labels that have been defined beforehand (predefined classes). Machine learning based classifiers learn the characteristics of a set of documents that has already been classified into category c. Using these characteristics, the inductive process derives the characteristics that a new document must have to be included in a category. The inductive process is thus a way of building classifiers automatically from a set of pre-classified documents. This method can handle large document datasets and reduces labor cost, while its accuracy is comparable to rules produced by a supervisor.

A. Decision Rules
Decision rules use a DNF (disjunctive normal form) rule to build a classifier for category c. A DNF rule is a conditional rule consisting of disjunctive-conjunctive clauses. The rule describes the requirements for a document to be classified into the defined category: a document belongs to the category 'if and only if' it meets one of the criteria in the DNF clauses. The rules in the DNF clauses represent the category's profile. Each single rule comprises a category name and a 'dictionary' (the list of words included in that category). A collection of rules is the union of single rules using the logical operator "OR". Decision rules choose the rules whose scope is able to classify all the documents in the training set. The rule set can be simplified using heuristics without affecting the accuracy of the resulting classifier.
Sebastiani in [2] explains that DNF rules are built in a bottom-up fashion, as follows:
1. Each training document d becomes a clause w1, …, wn → c, where w1, …, wn are the words contained in d, and c is the category when d satisfies the criteria of c; otherwise it is c̄.
2. Rule generalization. The rules are simplified by removing premises from clauses or merging clauses. The compactness of the rules is maximized while at the same time not affecting the 'scope' property of the classifier.
3. Pruning. The DNF rules resulting from step 1 may contain more than one DNF clause able to classify documents in the same category (overfit). Pruning 'cuts' the unused clauses from the rule.
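The clause-matching idea can be sketched as follows; the category names and word clauses are invented for illustration, and real rules would be induced from training documents rather than written by hand:

```python
# Minimal sketch of DNF-rule classification: a category's rule is a
# disjunction of conjunctive clauses, and a document matches the
# category iff all words of at least one clause occur in it.

rules = {
    "sports": [{"match", "score"}, {"team", "coach"}],   # hypothetical clauses
    "finance": [{"stock", "market"}],
}

def classify(document, rules):
    words = set(document.split())
    # Assign every category for which some clause is fully contained
    # in the document's word set (the disjunction of conjunctions).
    return [cat for cat, clauses in rules.items()
            if any(clause <= words for clause in clauses)]

print(classify("the team and its coach won", rules))  # -> ['sports']
```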

B. Decision Tree
Decision tree decomposes the data space into a hierarchical structure called a tree. In the textual data context, the data space means the presence or absence of words in a document. A decision tree classifier is a tree comprising:
a. Internal nodes. Each internal node stores an attribute, i.e. a collection of words, which will be compared with the words contained in a document.
b. Edges. The branches that come out of an internal node are the terms/conditions representing one attribute value.
c. Leaves. A leaf node is a category or class of documents.
A decision tree classifies a document by recursively testing the term weights of the internal node labels contained in the document vector d⃗, until the document reaches a leaf node. The label of that leaf node becomes the document's class. Decision tree classifiers are built in a top-down fashion [2]:
1. Starting from the root node, a document is tested for whether it has the same label as the node's (category c or c̄).
2. If the document does not fit, select the k-th term tk and divide the documents into classes that have the same value for tk. Create a separate sub-tree for each of those classes.
3. Repeat step 2 in each sub-tree until a leaf node is formed. The leaf node contains the documents in category c.
The tree structure of the decision tree algorithm is easy to understand and interpret, and documents are classified based on their logical structure. On the other hand, this algorithm requires a long time to perform classification; when a misclassification occurs at a higher level, it affects the levels below, and the possibility of overfitting is high.
Sebastiani [2] explains that, to reduce overfitting, several nodes can be trimmed (pruning) by withholding some of the training documents from tree construction. These held-out documents determine whether a leaf node will be pruned. The next step compares the two class distributions: if the class distribution of the training documents used to construct the decision tree differs from the class distribution of the training documents retained for pruning, then the nodes are overfit to the training documents and can be pruned.
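The structure described above (internal nodes test word presence, edges carry the test outcome, leaves hold categories) can be sketched with a hand-built toy tree; the words and category names are invented for illustration, and a real tree would be induced top-down from training documents:

```python
# Sketch of classifying with a (hypothetical, hand-built) decision tree.
# A node is either a leaf (a category string) or a tuple
# (word, subtree-if-present, subtree-if-absent).

tree = ("ball",                       # root: does the document contain "ball"?
        ("goal", "sports", "games"),  # present -> test "goal" next
        "other")                      # absent  -> leaf category

def classify(document, node):
    if isinstance(node, str):         # leaf node: its label is the class
        return node
    word, present, absent = node
    words = set(document.split())
    # Follow the edge matching the test outcome, recursively.
    return classify(document, present if word in words else absent)

print(classify("the ball crossed the goal line", tree))  # -> sports
```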

C. k-Nearest Neighbor
In the machine learning field, the k-nearest neighbor (k-NN) algorithm belongs to the lazy learner group. Lazy learners, also called example-based classifiers [2] or proximity-based classifiers [7], perform the classification task by assigning to a test document the existing category labels of the most similar training documents.
k-NN starts by searching for, or determining, the k nearest neighbors of the document to be classified. The input parameter k indicates the number of neighboring documents to be considered in calculating the classification function f(d) of document d. A document is compared with its neighbor classes to calculate their similarity. Document d becomes a member of category c if there are k training documents similar to d in category c. The k-NN classification function is defined over sim(d, di), a measure of the relationship between testing document d and training document di, where Nk(d) is the set of k training documents that maximize the function sim(d, di).
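A minimal sketch of this classifier, assuming cosine similarity on term-frequency vectors and majority voting among the k neighbors (the toy training documents and categories are invented):

```python
from collections import Counter
from math import sqrt

# Toy training set: (document, category) pairs.
train = [
    ("win match score", "sports"),
    ("team match goal", "sports"),
    ("stock price fall", "finance"),
    ("market stock rise", "finance"),
]

def cosine(a, b):
    """Cosine similarity between two documents' term-frequency vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def knn(test_doc, train, k=3):
    # N_k(d): the k training documents most similar to the test document.
    neighbors = sorted(train, key=lambda td: cosine(test_doc, td[0]), reverse=True)[:k]
    # Majority vote among the k nearest neighbors decides the category.
    return Counter(cat for _, cat in neighbors).most_common(1)[0][0]

print(knn("the match score", train))  # -> sports
```

As a lazy learner, no model is built up front: all work happens at classification time, which matches the fast "model building" but weaker accuracy reported for k-NN in the experiments.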

D. Naïve Bayes
Naïve Bayes is a probabilistic classifier that uses a mixture model, a model that combines term probabilities with categories, to predict the probability of a document's category [7]. This approach defines classification as the probability that document d, represented as the term vector d⃗ = ⟨w1, …, w|V|⟩, belongs to category cj. The document probability is calculated using the following equation:

P(cj | d⃗) = P(cj) P(d⃗ | cj) / P(d⃗),

where P(d⃗) is the probability of document d⃗ (randomly chosen) and P(cj) is the probability of a document being classified in category cj. The document vector d⃗ may be large; therefore, naïve Bayes applies the word independence assumption. According to this assumption, two different document vector coordinates are disjoint [2]. In other words, the probability of a term in a document does not depend on the others, so the presence of a word has no effect on the others; hence 'naïve'.
With the word independence assumption, the naïve Bayes probabilistic classifier is expressed in the following equation:

P(d⃗ | cj) = ∏k P(wk | cj).

There are two commonly used naïve Bayes variants, namely the Multivariate Bernoulli model and the Multinomial model.
a. Multivariate Bernoulli model. This model uses term occurrence in the document as the document feature. Term occurrence is represented as a binary value, 1 denoting presence and 0 absence of the term in the document. Term occurrence frequency is not taken into account in the document classification model.
b. Multinomial model. As opposed to the multivariate model, this model considers term occurrence frequency. A document is defined as a 'bag of words', along with the term frequency of each word. The classification model is built from these occurrence frequencies in the document. The multinomial model performs better than the other naïve Bayes variants [8, 9].
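A minimal sketch of the multinomial variant, assuming Laplace (add-one) smoothing for unseen terms and log-probabilities for numerical stability; the toy training data are invented for illustration:

```python
from collections import Counter, defaultdict
from math import log

train = [
    ("win match score", "sports"),
    ("team match goal", "sports"),
    ("stock price fall", "finance"),
]

def fit(train):
    cat_docs = Counter(cat for _, cat in train)     # document count per category
    word_counts = defaultdict(Counter)              # per-category term frequencies
    for doc, cat in train:
        word_counts[cat].update(doc.split())
    vocab = {w for doc, _ in train for w in doc.split()}
    return cat_docs, word_counts, vocab

def predict(doc, cat_docs, word_counts, vocab):
    n = sum(cat_docs.values())
    best, best_lp = None, float("-inf")
    for cat in cat_docs:
        # log P(c) + sum_k log P(w_k | c), with add-one smoothing.
        lp = log(cat_docs[cat] / n)
        total = sum(word_counts[cat].values())
        for w in doc.split():
            lp += log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = cat, lp
    return best

model = fit(train)
print(predict("match score today", *model))  # -> sports
```

The "bag of words" view shows up in `word_counts`, which keeps term frequencies per category rather than just presence/absence as the Bernoulli variant would.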

E. Support Vector Machine
Similar to regression-based classification, SVM represents documents as vectors. This approach aims to find a boundary, called the decision surface or decision hyperplane, that separates two groups of vectors/classes. The system is trained using positive and negative samples from each category, and then calculates the boundary between those categories. Documents are classified by first calculating their vectors and then partitioning the vector space to determine where each document vector is located. The best decision hyperplane is selected from a set of decision hyperplanes h1, h2, …, hn in the |V|-dimensional vector space that separate the positive and negative training documents. The best decision hyperplane is the one with the widest margin [2, 7]. In Fig. 1, the box symbols are the support vectors, i.e. the documents whose distances against the candidate decision hyperplanes are computed to determine the best hyperplane. The best hyperplane is the one whose normal distance against the nearest training documents is the widest; it thus becomes the maximum possible separation barrier.
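Once a decision hyperplane w·x + b = 0 has been found, classifying a document reduces to checking which side of the hyperplane its vector lies on. The weight vector and bias below are invented for illustration; in SVM they would be learned by maximizing the margin over the training documents:

```python
# Sketch of classifying with a linear decision hyperplane w·x + b = 0:
# vectors on the positive side belong to the category, the rest do not.

w = [1.0, -2.0, 0.5]   # hypothetical weight: one entry per vocabulary term
b = -0.25              # hypothetical bias

def side(x, w, b):
    # Sign of w·x + b decides the class of document vector x.
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "positive" if score > 0 else "negative"

print(side([1, 0, 1], w, b))  # document containing terms 1 and 3 -> positive
```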

Classifier Evaluation
An experimental approach was applied as the document classifier evaluation method, to measure the effectiveness of the classifiers [2, 6]. Classifier effectiveness describes a classifier's ability to classify a document into the right category. The three most often used effectiveness measures applied in this study are precision, recall, and accuracy, based on probability estimation. Table 1 shows the contingency table used to measure the probability estimation for category ci.
Determining precision, recall, and accuracy begins with deciding whether the classification of a document was a true positive (TP), false positive (FP), true negative (TN), or false negative (FN). TP means the document is correctly classified as relating to a category. FP means the document is incorrectly marked as relating to the category. FN describes a document that is not marked as related to a category but should be. TN means a document that should not be, and is not, marked as being in a particular category. Precision and recall are defined as

π = TP / (TP + FP) and ρ = TP / (TP + FN).

Combining precision and recall may provide a better analysis of classifier performance. This is called the F-measure:

F = (β² + 1) π ρ / (β² π + ρ),

where π denotes precision, ρ recall, and β a positive parameter that represents the goal of the evaluation task. β is given a value of 1 if precision and recall are considered equally important, β = 0 when precision is more important than recall, and, conversely, β approaches infinity if recall is more important than precision. Another parameter commonly used to measure classifier performance is accuracy. Accuracy (Â) is measured by the following formula:

Â = (TP + TN) / (TP + FP + TN + FN).

Holdout, random subsampling, cross validation (k-fold), and bootstrap are common techniques for assessing classifier accuracy [10]. The holdout method partitions the full set of data into two sets, namely a training set and a test set. It is common to hold out two-thirds of the data for training (the learning phase), while the remaining one-third is used for testing [10, 11]. Each set must be chosen independently and randomly.

WEKA
WEKA, which stands for Waikato Environment for Knowledge Analysis, is software for data mining tasks that consists of machine learning algorithms written in Java. WEKA provides tools to support data mining tasks including data preprocessing, classification, clustering, association rules, attribute selection, and visualization.
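The holdout split and the effectiveness measures defined in the Classifier Evaluation section can be sketched as follows; the toy actual/predicted labels for a single category are invented for illustration:

```python
import random

def holdout(data, train_fraction=2/3, seed=42):
    """Partition data into an independent, random training and test set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]      # (training set, test set)

def effectiveness(actual, predicted, beta=1.0):
    """Precision, recall, F-measure, and accuracy from TP/FP/FN/TN counts."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

actual    = [True, True, True, False, False, False]   # category membership
predicted = [True, True, False, True, False, False]   # classifier output
print(effectiveness(actual, predicted))  # all four measures ≈ 0.667 here
```

With β = 1 the F-measure reduces to the harmonic mean of precision and recall, matching the "equally important" case described above.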

2. RESEARCH METHOD
The steps that compose the methodology used in this research to compare the performance of five text classification algorithms are shown in Fig. 2. The research was conducted in four main steps: data collection, data preprocessing, experimentation, and result analysis. Collecting the text documents needed for the experiment is the first step of the methodology. The data were downloaded from http://weka.wikispaces.com/Datasets. The text documents then passed through the preprocessing step, in which documents are filtered and the data are transformed into ARFF format, the format accepted by WEKA. The first step in preprocessing is removing stop words such as numbers, prepositions (e.g. in, under, before), determiners (e.g. a, an, another, the), and conjunctions (e.g. for, but, or, so, yet). The next step is grouping words that share the same morphological root, called stemming. A summary of the datasets used is shown in Table II.

Table II. Summary of the datasets
Dataset  Documents  Attributes
D1    2463    2001
D2    3204   13196
D3    3075   12433
D4    1003    3183
D5     918    3013
D6    1050    3239
D7     913    3101
D8    1504    2887
D9    1657    3759
D10    414    6430
D11    313    5805
D12    336    7903
D13    204    5833
D14    927   10129
D15    878    7455
D16    690    8262
D17   1560    8461

The third step of the methodology is conducting the experiments. The datasets were tested using WEKA's classifiers as shown in Table III.

3. RESULT AND DISCUSSION
The algorithms were compared based on their accuracy, precision, recall, F-measure, and classifier model building time. As shown in Fig. 3 and Fig. 4, among the five algorithms, Naïve Bayes, Decision Tree, and SVM have high effectiveness and accuracy rates; the Naïve Bayes classifier is the highest, with 0.815, 0.802, and 0.786 for precision, recall, and F-measure respectively.
Directly proportional to the evaluation of precision, recall, and F-measure, Table III shows that the naïve Bayes classifier has the highest accuracy rate among the five classifiers. The average accuracy of naïve Bayes is 80.33%. Decision Tree and SVM follow Naïve Bayes. Another measure obtained from the experiment is the amount of time taken to build the classifier models (see Table IV). It shows that the average time required by the k-NN classifiers is the smallest (fastest), 0.01 seconds. In contrast, the decision tree classifiers take a long time to build a text classifier model; the average time to build the model is 101.3 seconds. In Table V we relate the classifier effectiveness values to the amount of time taken to build the classifier models. Both decision rules and k-NN have poor classification performance. Compared to k-NN, decision rules is the lowest in terms of precision and F-measure; yet, its accuracy is higher than k-NN's. SVM reaches high effectiveness performance (73.3%) in an average time of 3.67 seconds for building a classification model. In terms of time, decision tree requires a huge amount of time to build a classification model; however, it classifies the documents well. Overall, the results of the experiment indicate that the Naïve Bayes algorithm is superior among the five algorithms, assessed from the aspects of effectiveness and time: it requires a small amount of time to build a model with high accuracy and effectiveness.

4. CONCLUSION
This study compared the performance of five machine learning based classification algorithms, namely decision rules, decision tree, k-NN, naïve Bayes, and SVM. The comparison was based on time and four classifier effectiveness measurements: precision, recall, F-measure, and accuracy. The following conclusions were drawn:
1. Decision rules and k-NN performance is lacking, since their effectiveness and accuracy values are lower than those of the other algorithms.
2. The algorithms that can build classifiers with a high effectiveness rate are Naïve Bayes, decision tree, and SVM.
a. SVM is able to classify the documents well with a small model building time.
b. Decision tree has an equally good performance in classifying multi-class text documents, with average precision, recall, and F-measure values above 0.7, as well as an accuracy rate of around 75%. Yet, it has a drawback in the time needed to build the classifier models.
c. The experiment results show that Naïve Bayes has the highest effectiveness values, while spending only a small amount of time to build the classifier models.
For the naïve Bayes algorithm, the greater the percentage of training documents, the higher the resulting model accuracy: naïve Bayes achieved its highest accuracy at a percentage split of 90/10, while SVM peaked at 80/20 and decision tree at 70/30.