
Regular paper

Journal of information and communication convergence engineering 2024; 22(4): 280-287

Published online December 31, 2024

https://doi.org/10.56977/jicce.2024.22.4.280

© Korea Institute of Information and Communication Engineering

Automatic Classification of Scientific and Technical Papers Using Large Language Models and Retrieval-Augmented Generation

Jaehan Jeong1 and Dongsup Jin1*, Member, KIICE

1Department of IT Convergence, University of Ulsan, 44610, Korea

Correspondence to: Dongsup Jin (E-mail: dsjin@ulsan.ac.kr)
Department of IT Convergence, University of Ulsan, 44610, Republic of Korea

Received: June 24, 2024; Revised: October 14, 2024; Accepted: October 18, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This study proposes artificial intelligence (AI) technology for the automatic classification of Korean scientific and technical papers, aiming to achieve high accuracy even with a small amount of labeled data. Unlike existing BERT-based Korean document classification models, which rely on supervised learning over large amounts of accurately labeled data, this study proposes a structure that utilizes large language models (LLMs) and retrieval-augmented generation (RAG). Experiments demonstrate that the proposed method achieves higher accuracy than existing techniques in all cases, across varying amounts of labeled data. Furthermore, a qualitative comparison between the manually generated labels, which are regarded as the correct answers, and the labels produced by the LLM confirmed that the LLM responses were more accurate. Although limited to Korean scientific documents, the findings provide evidence that a document classification system combining an LLM with RAG can easily be extended to other domains with diverse document datasets, owing to its effectiveness even with limited labels.

Keywords: BERT, Document classification, Large language model (LLM), Retrieval-augmented generation (RAG), Vector database (DB)
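
To make the RAG-based workflow described in the abstract concrete, the sketch below shows one way such a pipeline could be assembled: the available labeled abstracts are embedded into a vector index, the examples most similar to a new paper are retrieved, and those examples are placed in the prompt of an LLM that returns the category. This is a minimal illustrative sketch only; the embedding model, prompt wording, retrieval depth k, and the call_llm placeholder are assumptions and do not reflect the authors' published implementation.

```python
# Illustrative sketch of LLM + RAG document classification (not the paper's code).
# Assumptions: a generic multilingual embedding model via sentence-transformers,
# cosine-similarity retrieval over a small labeled set, and a placeholder
# call_llm() standing in for whichever chat LLM is used.

import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical embedding model choice; any sentence-level encoder would do.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")


def build_index(labeled_docs):
    """labeled_docs: list of (abstract_text, category) pairs."""
    texts = [text for text, _ in labeled_docs]
    vecs = embedder.encode(texts, normalize_embeddings=True)
    return vecs, labeled_docs


def retrieve(query_text, vecs, labeled_docs, k=5):
    """Return the k labeled examples most similar to the query abstract."""
    q = embedder.encode([query_text], normalize_embeddings=True)[0]
    scores = vecs @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(-scores)[:k]
    return [labeled_docs[i] for i in top]


def call_llm(prompt: str) -> str:
    """Placeholder for the chat LLM; the paper does not fix a specific API here."""
    raise NotImplementedError


def classify(query_text, vecs, labeled_docs, categories, k=5):
    """Build a few-shot prompt from retrieved examples and ask the LLM for a category."""
    examples = retrieve(query_text, vecs, labeled_docs, k)
    shots = "\n".join(f"Abstract: {t}\nCategory: {c}" for t, c in examples)
    prompt = (
        "Classify the following Korean scientific abstract into one of these "
        f"categories: {', '.join(categories)}.\n\n"
        f"Labeled examples:\n{shots}\n\n"
        f"Abstract: {query_text}\nCategory:"
    )
    return call_llm(prompt).strip()
```

Under these assumptions, only the small labeled set is embedded up front, and each new paper costs one retrieval plus one LLM call, which is why the approach can remain effective when labeled data are scarce.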


