Journal of information and communication convergence engineering 2024; 22(4): 280-287
Published online December 31, 2024
https://doi.org/10.56977/jicce.2024.22.4.280
© Korea Institute of Information and Communication Engineering
Correspondence to: Dongsup Jin (E-mail: dsjin@ulsan.ac.kr)
Department of IT Convergence, University of Ulsan, 44610, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
This study proposes artificial intelligence (AI) technology for the automatic classification of Korean scientific and technical papers, aiming to achieve high accuracy even with a small amount of labeled data. Unlike existing BERT-based Korean document classification models, which rely on supervised learning with a large amount of accurately labeled data, the proposed structure uses large language models (LLMs) and retrieval-augmented generation (RAG). Experiments show that the proposed method achieves higher accuracy than existing technologies in all cases tested, across various amounts of labeled data. Furthermore, a qualitative comparison between the manually generated labels, which are regarded as the correct answers, and the labels produced by the LLM confirmed that the LLM responses were more accurate. Although limited to Korean scientific documents, the findings of this study provide evidence that a system using LLMs and RAG for document classification can easily be extended to other domains with diverse document datasets, owing to its effectiveness even with limited labels.
Keywords: BERT, Document classification, Large language model (LLM), Retrieval-augmented generation (RAG), Vector database (DB)
Jaehan Jeong1 and Dongsup Jin1*, Member, KIICE
1Department of IT Convergence, University of Ulsan, 44610, Korea