Enhancing Machine Learning Model Performance in Addressing Class Imbalance

Authors

  • Lucky Lhaura Van FC University of Lancang Kuning
  • M. Khairul Anam University of Samudra
  • Muhammad Bambang Firdaus University of Mulawarman
  • Yogi Yunefri University of Lancang Kuning
  • Nadya Alinda Rahmi University of Putra Indonesia YPTK Padang

DOI:

https://doi.org/10.31154/cogito.v10i1.626.478-490

Keywords:

Machine Learning, Support Vector Machine, SMOTE, Undersampling, Oversampling

Abstract

This research investigates methods for handling class imbalance in machine learning models, with a focus on the Support Vector Machine (SVM) algorithm. We apply oversampling (SMOTE) and undersampling techniques to an imbalanced dataset and evaluate the performance of SVM under each method. Experiments are conducted using data from Twitter social media regarding the 2024 general elections. The findings indicate that incorporating SMOTE improves the performance of SVM models, particularly the polynomial-kernel SVM variant. Undersampling, by contrast, shows limited impact on SVM model performance. This study provides insights for researchers and practitioners in choosing an appropriate strategy for handling class imbalance in machine learning models.
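The two resampling strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline, which worked on Twitter text and SVM classifiers; it is a standard-library toy on synthetic 2-D points, and the function names `smote_sketch` and `random_undersample` as well as the class sizes (90 vs. 10) are assumptions made for illustration. In practice one would more likely use an existing implementation such as `SMOTE` and `RandomUnderSampler` from the imbalanced-learn package, applied to the training split only.

```python
import random


def smote_sketch(minority, n_new, k=5, rng=None):
    """Create n_new synthetic minority points (simplified SMOTE).

    Each synthetic point lies on the line segment between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    """
    if rng is None:
        rng = random.Random(0)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def random_undersample(majority, n_keep, rng=None):
    """Keep a random subset of n_keep majority points (no replacement)."""
    if rng is None:
        rng = random.Random(0)
    return rng.sample(majority, n_keep)


# Toy imbalanced data (hypothetical): 90 majority vs. 10 minority points.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]

# Oversampling: grow the minority class up to the majority size.
oversampled_minority = minority + smote_sketch(
    minority, n_new=len(majority) - len(minority), rng=rng
)

# Undersampling: shrink the majority class down to the minority size.
undersampled_majority = random_undersample(
    majority, n_keep=len(minority), rng=rng
)

print(len(majority), len(oversampled_minority))   # 90 90
print(len(undersampled_majority), len(minority))  # 10 10
```

The sketch also hints at why the two strategies can behave differently: oversampling keeps every original observation and adds interpolated ones, whereas undersampling here discards 80 of the 90 majority points, so a classifier trained afterwards sees far less data.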

Published

2024-06-30

How to Cite

Van FC, L. L., Anam, M. K., Firdaus, M. B., Yunefri, Y., & Rahmi, N. A. (2024). Enhancing Machine Learning Model Performance in Addressing Class Imbalance. CogITo Smart Journal, 10(1), 57–69. https://doi.org/10.31154/cogito.v10i1.626.478-490