Optimizing Text Classification Using Techniques AdaBoost Ensemble with Decision Tree Algorithm

Marnis Nasution; Ibnu Rasyid Munthe; Fitri Aini Nasution; Sarjon Defit

doi:10.31154/cogito.v11i1.741.39-51

Authors

Marnis Nasution Universitas Labuhanbatu
Ibnu Rasyid Munthe Universitas Labuhanbatu
Fitri Aini Nasution Universitas Labuhanbatu
Sarjon Defit Universitas Putra Indonesia YPTK

Keywords:

Text Classification, AdaBoost, Decision Trees, Machine Learning, NLP

Abstract

This study presents an optimized text classification framework that combines the AdaBoost ensemble technique with Decision Tree algorithms (ID3, C4.5, CART) to address the challenges of small datasets (n=795 Indonesian-language reviews). Using five-fold stratified cross-validation (random seed=42), we implemented a preprocessing pipeline comprising case normalization, language-specific stemming, and TF-IDF feature extraction. The ensemble model used 50 AdaBoost iterations with a learning rate of 1.0 and was evaluated with multiple performance metrics while accounting for class imbalance effects. The C4.5+AdaBoost configuration achieved 96.72% accuracy (±0.88), a 10.6 percentage point improvement over the base C4.5 algorithm. The ensemble approach particularly improved minority class identification, raising the F1-score for positive sentiment by 0.28 while maintaining strong neutral sentiment detection (F1-score 0.99±0.00). Comparative analysis showed consistent gains across all Decision Tree variants, with accuracy improvements of 18.6% for ID3, 10.6% for C4.5, and 14.2% for CART, alongside reduced performance variance (62-73% decrease). While these findings support AdaBoost's effectiveness for improving Decision Tree stability in small-scale text classification, the study acknowledges limitations regarding sample size and language specificity. The research contributes practical methodologies for sentiment analysis applications and emphasizes the need for validation on larger, more diverse datasets. Future work should explore comparative benchmarking against transformer architectures, advanced feature representation techniques, and multilingual generalization testing. This work provides a reproducible framework for developing robust, ensemble-based text classification systems in resource-constrained scenarios.
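To make the boosting step in the abstract concrete, the following is a minimal, dependency-free sketch of discrete AdaBoost over one-word decision stumps. It is not the authors' implementation. Several things here are assumptions made for illustration: the toy reviews are invented, the labels are binary (+1/-1) rather than the study's sentiment classes, and word-presence stumps stand in for TF-IDF features and full ID3/C4.5/CART trees. Only the 50 iterations and the learning rate of 1.0 are taken from the abstract. In practice a library such as scikit-learn would typically be used instead.

```python
import math

# Invented toy reviews (hypothetical data): +1 = positive, -1 = negative.
DOCS = [
    ("barang bagus sekali", 1),
    ("pelayanan bagus cepat", 1),
    ("kualitas bagus murah", 1),
    ("pengiriman cepat puas", 1),
    ("barang rusak kecewa", -1),
    ("pelayanan buruk lambat", -1),
    ("kualitas buruk mahal", -1),
    ("pengiriman lambat kecewa", -1),
]

def tokenize(text):
    """Case-normalize and split on whitespace (no stemming in this sketch)."""
    return set(text.lower().split())

def stump_predict(stump, tokens):
    """A stump is (word, polarity): predict polarity if the word occurs, else the opposite."""
    word, polarity = stump
    return polarity if word in tokens else -polarity

def fit_adaboost(samples, n_rounds=50, learning_rate=1.0):
    """Discrete AdaBoost: reweight samples so later stumps focus on earlier mistakes."""
    n = len(samples)
    weights = [1.0 / n] * n
    vocab = sorted(set().union(*(tokens for tokens, _ in samples)))
    candidates = [(word, polarity) for word in vocab for polarity in (1, -1)]
    ensemble = []  # list of (alpha, stump)
    for _ in range(n_rounds):
        # Weak learner: the stump with the lowest weighted training error.
        best, best_err = None, float("inf")
        for stump in candidates:
            err = sum(w for w, (tokens, y) in zip(weights, samples)
                      if stump_predict(stump, tokens) != y)
            if err < best_err:
                best, best_err = stump, err
        if best_err >= 0.5:
            break  # no stump is better than chance under the current weights
        clamped = max(best_err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = learning_rate * 0.5 * math.log((1.0 - clamped) / clamped)
        ensemble.append((alpha, best))
        if best_err == 0.0:
            break  # a perfect stump leaves nothing to reweight
        # Increase the weight of misclassified samples, decrease the rest, then normalize.
        weights = [w * math.exp(-alpha * y * stump_predict(best, tokens))
                   for w, (tokens, y) in zip(weights, samples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, text):
    """Sign of the alpha-weighted vote over all stumps."""
    tokens = tokenize(text)
    score = sum(alpha * stump_predict(stump, tokens) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

SAMPLES = [(tokenize(text), label) for text, label in DOCS]
model = fit_adaboost(SAMPLES, n_rounds=50, learning_rate=1.0)
train_accuracy = sum(predict(model, text) == label for text, label in DOCS) / len(DOCS)
print(f"stumps: {len(model)}, training accuracy: {train_accuracy:.2f}")
```

No single word separates the toy classes, so the first stump misclassifies one review; the reweighting step is what lets later stumps correct it. The printed figure is training accuracy on invented data and says nothing about the cross-validated results reported above.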
