
Regular paper


Journal of information and communication convergence engineering 2024; 22(4): 280-287

Published online December 31, 2024

https://doi.org/10.56977/jicce.2024.22.4.280

© Korea Institute of Information and Communication Engineering

Automatic Classification of Scientific and Technical Papers Using Large Language Models and Retrieval-Augmented Generation

Jaehan Jeong1 and Dongsup Jin1*, Member, KIICE

1Department of IT Convergence, University of Ulsan, 44610, Korea

Correspondence to : Dongsup Jin (E-mail:dsjin@ulsan.ac.kr)
Department of IT Convergence, University of Ulsan, 44610, Republic of Korea

Received: June 24, 2024; Revised: October 14, 2024; Accepted: October 18, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

This study proposes an artificial intelligence (AI) approach for the automatic classification of Korean scientific and technical papers that aims to achieve high accuracy even with a small amount of labeled data. Unlike existing BERT-based Korean document classification models, which rely on supervised learning with a large amount of accurately labeled data, the proposed structure utilizes large language models (LLMs) and retrieval-augmented generation (RAG). Experiments show that the proposed method achieves higher accuracy than the existing technology in all tested cases with various amounts of labeled data. Furthermore, a qualitative comparison between manually generated labels, which are regarded as the correct answers, and labels produced by the LLM confirmed that the LLM responses were more accurate in the examined cases. Although limited to Korean scientific documents, the findings provide evidence that a system utilizing an LLM and RAG for document classification can be extended to other domains with diverse document datasets, owing to its effectiveness even with limited labels.

Keywords: BERT, Document classification, Large language model (LLM), Retrieval-augmented generation (RAG), Vector database (DB)

More than five million papers are published annually in the field of science and technology. Therefore, technology for classifying them automatically for users’ convenience has been actively researched. Domestic papers in Korea are classified into 356 categories based on the National Science and Technology Classification, and various classification technologies have been proposed for this purpose.

Automatic classification technology has rapidly advanced with the development of artificial intelligence (AI), with the core being the advancement of document embedding technology. Before the development of deep learning, keyword extraction-based embedding techniques using the term frequency-inverse document frequency (TF-IDF) method were used. Subsequently, keyword extraction has progressed with the development of probability-based latent Dirichlet allocation (LDA) techniques. Documents embedded in this manner have led to the development of supervised learning-based document classification technologies using classification algorithms such as support vector machine (SVM) and random forest.
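As a minimal sketch of the TF-IDF weighting mentioned above, the following Python code computes one sparse weight vector per tokenized document. The function name `tf_idf`, the toy documents, and the unsmoothed logarithmic IDF variant are assumptions chosen for illustration; they are not taken from the systems discussed in this paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            # Term frequency times (unsmoothed) inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus (hypothetical); real systems tokenize full abstracts.
docs = [
    "glycerol carbonate catalyst".split(),
    "seaweed farm water quality".split(),
    "catalyst for polymer processing".split(),
]
weights = tf_idf(docs)
```

In this toy corpus, "glycerol" occurs in one document and "catalyst" in two, so "glycerol" receives the larger weight in the first document. Vectors of this kind are what classifiers such as SVM or random forest take as input.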

With the rise of deep learning, document embedding technology has advanced rapidly, especially after the proposal of the transformer, and more sophisticated document embedding models have become possible with large language models (LLMs) such as BERT and GPT [1-5]. Technologies utilizing deep learning embed each document with pre-trained models and graph neural networks, add inference structures, and fine-tune them [6-8]. However, even fine-tuning requires a significant amount of labeled data for supervised learning. Such data are not easily obtainable, especially in science and technology, where new fields constantly emerge and many papers are generated.

In this study, we aim to overcome the limitations of existing technologies by utilizing an LLM-based model. Additionally, we extend the existing document classification task, which primarily focuses on major categories, to encompass 356 detailed classifications, thereby tackling a more challenging document classification problem. Using appropriate prompt engineering for the LLM, we experimentally examined whether comparable classification performance can be achieved with fewer labeled data than supervised learning-based models using the existing BERT. Additionally, we propose a system structure that is easy to implement and utilize by integrating the LLM with retrieval-augmented generation (RAG) and a vector database (DB) capable of storing the embedded states of the documents.

A. Supervised Learning Approaches in Document Classification

Supervised learning has been widely adopted in document classification tasks, leveraging labeled data to train models that can predict document categories. Traditional methods such as naïve Bayes and SVMs have been used in this area. Naïve Bayes treats document classification probabilistically by making decisions based on word occurrences, whereas SVMs seek hyperplanes that optimally separate document categories in the feature space. Despite their effectiveness, these methods rely heavily on manually engineered features such as TF-IDF, which can limit their performance on complex or large-scale datasets. In recent years, neural networks have become the dominant approach for document classification. Convolutional neural networks (CNNs) have demonstrated strong performance in text classification by capturing local patterns in documents, whereas recurrent neural networks (RNNs) have been employed to model sequential dependencies in text. More recently, transformer-based models such as BERT have achieved state-of-the-art results by pre-training on vast amounts of text data and fine-tuning on specific tasks, allowing the model to capture deep contextual and semantic relationships in documents.

A significant challenge in supervised learning is the requirement for large amounts of labeled data, which can be expensive and time-consuming to obtain. Transfer learning has been used effectively to address this issue. In this approach, models are pre-trained on large, general corpora and then fine-tuned on smaller, task-specific datasets, significantly improving performance, especially when labeled data are scarce. Additionally, semi-supervised learning techniques have been explored to further reduce the dependency on labeled data while maintaining high performance. In summary, supervised learning has played a crucial role in advancing document classification, with approaches evolving from traditional machine learning algorithms to advanced neural and transformer-based models. Although these advancements have significantly improved accuracy, they are still constrained by the need for large labeled datasets and domain-specific fine-tuning. Moreover, as domain complexity increases, these models struggle with data sparsity and the requirement for constant manual intervention.

To overcome these limitations, this study proposes a document classification system that utilizes an LLM and RAG. Before introducing the proposed system, the key concepts of LLMs and RAG are presented in the following subsections.

B. Advancement of Pre-trained Language Models and RAG

A transformer uses the self-attention mechanism to model sequence data and analyze the relationships between words within a sentence. The transformer fundamentally adopts an encoder-decoder architecture, with the encoder and decoder each consisting of multiple stacked layers. Each encoder layer is composed of a self-attention layer and a fully connected layer. The self-attention layer is designed to refer to the meanings of words at other positions when interpreting the word at each position in the input sentence. Multiple self-attention layers are used simultaneously to implement multi-head attention. The encoding vectors generated by these self-attention layers are further encoded through a fully connected layer. This structure allows the transformer to process sequence data effectively and can be applied to various natural language processing (NLP) tasks, including machine translation.
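The self-attention step described above can be sketched as scaled dot-product attention, following the formulation in [1]. The code below is a simplified single-head illustration in plain Python: it takes the query, key, and value vectors as given, whereas a real transformer layer first derives them from the input through learned projection matrices. The function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this position's query with every key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal weight, so the output is the mean value.
result = self_attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]],
                        [[0.0, 2.0], [2.0, 0.0]])
```

Multi-head attention runs several such computations in parallel on different projections and concatenates the results before the fully connected layer.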

Recently, new models that enhance performance by utilizing the transformer architecture have emerged in the field of NLP, with Google's BERT serving as a prominent example. BERT uses only the encoder structure of the transformer and was among the largest pre-trained language models at the time of its release, before the later generative pre-trained transformer (GPT) models grew far larger.

BERT pre-trains on two major self-supervised learning tasks. The first involves randomly masking words in a sentence and predicting the masked words and the second involves determining whether two sentences are sequentially connected. Typically, such self-supervised learning is conducted with large volumes of text, resulting in embeddings from BERT that show high accuracy when used for transfer learning compared with traditional methods.
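The first pre-training task can be sketched as follows. This is a simplified illustration of how masked inputs and prediction targets could be prepared: the `[MASK]` token and the 15% masking rate follow the original BERT description [2], but the actual procedure also replaces some selected tokens with random or unchanged tokens, which is omitted here. The function name and the seeded random generator are assumptions for reproducibility.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens; return the masked sequence and the targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token  # the model must predict this token
        else:
            masked.append(token)
    return masked, targets


tokens = "the catalyst produces high purity glycerol carbonate".split()
masked, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

During pre-training, the loss is computed only at the positions stored in `targets`, so no manual labels are needed.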

Since the release of ChatGPT, generative models based on LLMs have garnered significant attention in the field of artificial intelligence [9]. These models are widely used, replacing traditional chatbot services that provide results by learning from language data such as text.

LLMs are built on machine learning and deep learning technologies and require large datasets and substantial computational power for training. As foundational models for NLP and natural language generation (NLG) tasks, LLMs are pre-trained on vast amounts of data to learn the complexity and interconnectedness of language. They are adapted to downstream tasks either through fine-tuning or through prompt-based techniques such as in-context learning and zero/one/few-shot learning. Because fine-tuning all parameters of these models is difficult, methods such as P-tuning, prefix tuning, prompt tuning, and low-rank adaptation (LoRA) are used to train only a small number of parameters [10].
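To illustrate why such methods are cheaper than full fine-tuning, the sketch below follows the low-rank idea of LoRA [10]: a frozen weight matrix W of size d x k is augmented with a trainable product B A, where B is d x r and A is r x k with a small rank r. The helper names and the omission of LoRA's scaling factor are simplifications made here for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_param_counts(d, k, r):
    """Trainable parameters: full fine-tuning (d*k) vs. LoRA (r*(d+k))."""
    return d * k, r * (d + k)


def lora_forward(W, A, B, x):
    """Compute W x + B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


# A 768 x 768 layer with rank 8 (sizes chosen only as an example).
full, lora = lora_param_counts(768, 768, 8)

# With B initialized to zero, the adapted layer starts equal to the frozen one.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
y_start = lora_forward(W, A, [[0.0], [0.0]], [3.0, 4.0])
y_tuned = lora_forward(W, A, [[1.0], [2.0]], [3.0, 4.0])
```

For the example sizes, the trainable parameter count drops from 589,824 to 12,288, roughly 2% of the original layer.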

Prominent language models include OpenAI’s GPT series and Google’s BERT (Bidirectional Encoder Representations from Transformers). Recently, open-source LLMs such as Mistral and Llama have also been gaining ground, offering high-quality performance with relatively few parameters.

Retrieval-augmented generation (RAG) is a pattern that uses pre-trained LLMs and proprietary data to generate responses. LLMs, trained on publicly available large-scale data, generally provide appropriate answers to queries on widely known knowledge. However, they often struggle to provide accurate responses to domain-specific or specialized queries and are prone to generating arbitrary answers, known as hallucinations. RAG addresses this issue by searching for reference materials that may contain correct answers and generating responses based on this information. In particular, RAG has been reported to enhance the accuracy of responses in domain-specific applications [11].

C. Prompt Engineering

‘Prompt engineering’ is a technique aimed at providing appropriate prompts so that an AI model produces high-quality output. Because AI models generate responses based on the input (prompt) they receive, their performance is significantly influenced by the construction of the prompt. Prompt engineering employs various techniques to maximize the potential of AI models, and methods based on few-shot learning have been used with models such as GPT. For an AI model that has been pre-trained on large datasets to extract information effectively, the provided prompts must be clear and specific, effectively conveying the required information. The main elements of prompt engineering are as follows [12]:

1. Clarity of Prompt: The prompt must be clear and specific. Ambiguous prompts can confuse the AI model and lead to inaccurate responses.

2. Reflection on User Requirements: The prompt should reflect user requirements. It is important to understand clearly what the user wants to incorporate into the prompt.

3. Understanding Model Characteristics and Limitations: It is essential to understand the characteristics and limitations of AI models and consider them when writing prompts. For instance, if the model lacks knowledge about a specific area, it is necessary to provide additional information or make the prompt more specific.

Prompt engineering is a critical component in harnessing the capabilities of AI models, ensuring that the prompts are well-structured to guide the model towards generating precise and relevant responses.

Although these advancements in document classification have significantly improved accuracy using supervised learning and BERT, they are still constrained by the need for large labeled datasets and domain-specific fine-tuning. To address these limitations, we propose a more scalable approach utilizing LLMs and RAG, which are introduced in this section.

A. Data Set

To demonstrate the effectiveness of the proposed LLM + RAG system in handling complex classification problems, we utilized the datasets listed in Tables 1 and 2. From the full-text dataset of 481,578 domestic Korean papers provided by KISTI, we used only the Korean and English abstracts. We also used the research field classification data (Standard Classification Code for Science and Technology, Label) available for 30,000 papers. These datasets allow us to evaluate the system performance in scenarios with varying amounts of labeled data, which is a critical factor in overcoming the challenges faced by traditional models.

Table 1. Article Dataset (481,578)

| Property | Corpus |
| --- | --- |
| Doc_id | Paper id |
| Abstract_ko | Abstract (Korean) |
| Abstract_en | Abstract (English) |


Table 2. Label Dataset (30,000)

| Property | Corpus |
| --- | --- |
| Id | Article id |
| Title_ko | Paper Title in Korean |
| Title_en | Paper Title in English |
| Code1 | Standard Classification Code for Science and Technology |
| Code2 | Standard Classification Code for Science and Technology |
| Code3 | Standard Classification Code for Science and Technology |


We matched the paper IDs in the Korean full-text dataset with the IDs in the research field classification data and retrieved the corresponding abstracts. Among the 30,000 labeled records, there were 6,047 instances in which all three classification codes were present, 8,915 instances with only two codes, and 11,291 instances with only one code, indicating fewer data points as the number of classification codes increased. For training purposes, 90% of the collected data were used, and a random 10% was extracted for testing and performance evaluation.
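The ID matching and the 90/10 split described above can be sketched as follows. The field names follow Tables 1 and 2, but the function names, the in-memory list-of-dicts layout, the fixed random seed, and the toy records are assumptions made for illustration rather than the actual preprocessing code.

```python
import random


def join_records(articles, labels):
    """Attach the Korean abstract to each labeled paper with a matching ID."""
    abstracts = {a["Doc_id"]: a["Abstract_ko"] for a in articles}
    joined = []
    for row in labels:
        if row["Id"] in abstracts:
            # Keep only the classification codes that are actually present.
            codes = [row[key] for key in ("Code1", "Code2", "Code3")
                     if row.get(key)]
            joined.append({"id": row["Id"],
                           "abstract": abstracts[row["Id"]],
                           "codes": codes})
    return joined


def train_test_split(records, test_ratio=0.1, seed=42):
    """Shuffle a copy of the records and hold out a random test portion."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data (hypothetical): only IDs 50..99 appear in both datasets.
articles = [{"Doc_id": i, "Abstract_ko": f"abstract {i}"} for i in range(100)]
labels = [{"Id": i, "Code1": "EC01", "Code2": "EC03" if i % 2 else "",
           "Code3": ""} for i in range(50, 150)]
records = join_records(articles, labels)
train, test = train_test_split(records)
```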

B. Proposed System Architecture for Document Classification Using RAG

Although supervised learning approaches have significantly advanced document classification, they have notable limitations, particularly their reliance on large amounts of labeled data. Collecting and labeling these datasets is often time-consuming, expensive, and, in some cases, impractical, especially for specialized or evolving domains. In addition, the performance of traditional supervised methods is tightly coupled with the quality and quantity of labeled data, which can limit their applicability to real-world scenarios where labeled data are sparse. To address these limitations, we propose a document classification method that leverages LLMs in conjunction with RAG. LLMs, pre-trained on vast amounts of general text, can generalize across tasks with minimal fine-tuning, thereby reducing the dependence on extensively labeled datasets. By integrating RAG, our approach can further enhance the classification by retrieving the relevant document context from a vectorized DB, thereby enriching the model’s input with domain-specific information. This combination not only mitigates the need for large-scale labeled data but also enables more accurate and context-aware classification, especially in dynamic or under-resourced domains. Through this method, we aim to overcome the shortcomings of traditional supervised learning and provide a more scalable and efficient solution for document classification tasks.

Fig. 1 illustrates the structure of the document classification system that utilizes an LLM and RAG, including the role of the LLM. In the proposed system, the user inputs the document title and abstract as a query for document classification. The title and abstract are passed to the embedding model, and the resulting vector is used to search for similar documents in the vector DB. In the vector DB, numerous previously classified documents are individually embedded using the embedding model, and the original information of these documents is stored together with their embedded vectors. In RAG, documents whose embedding vectors are similar to that of the user query are retrieved and used as input to the LLM along with the user query. Based on this input, the LLM provides classification codes for the document entered by the user. The LLM in Fig. 1 was used to answer the questions and was implemented using the GPT-3.5 API.
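The flow in Fig. 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: `embed` is a bag-of-words stand-in for the real embedding model, the vector DB is a plain list, and `llm` is any callable that receives the query and the retrieved documents (the placeholder `nearest_label_llm` simply returns the codes of the most similar document instead of calling an actual LLM API). None of these names come from the implemented system.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in (assumption) for the embedding model: word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def add_document(store, title, abstract, codes):
    """Embed a labeled document and keep its text, codes, and vector."""
    text = f"{title} {abstract}"
    store.append({"text": text, "codes": codes, "vector": embed(text)})


def retrieve(store, query, k=2):
    """Return the k stored documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda d: cosine(query_vec, d["vector"]),
                    reverse=True)
    return ranked[:k]


def classify(store, title, abstract, llm, k=2):
    """Retrieve similar labeled documents and pass them to the LLM."""
    query = f"{title} {abstract}"
    return llm(query, retrieve(store, query, k))


def nearest_label_llm(query, neighbors):
    """Placeholder for the LLM call: reuse the top neighbor's codes."""
    return neighbors[0]["codes"]


store = []
add_document(store, "Glycerol carbonate catalyst",
             "producing glycerol carbonate from urea", ["EC01"])
add_document(store, "Laver farm water quality",
             "phytoplankton and seaweed cultivation", ["LB13"])
predicted = classify(store, "High-purity glycerol carbonate",
                     "distilling glycerol carbonate", nearest_label_llm)
```

Because the labeled documents live in the store rather than in model weights, adding newly labeled papers requires only another call to `add_document`, with no retraining.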

Fig. 1. Document classification system architecture using RAG

Fig. 2 illustrates the structure for storing document data in the vector DB, as introduced in [13]. Abstracts from the source data are split into tokens, and the tokenized sentences are vectorized and stored in a FAISS vector DB. In general, a vector DB is optimized by indexing the data to speed up search. This optimization is based on clustering the vectors: quantization is performed per cluster to assign indices, as illustrated in Fig. 2. Searches are then conducted around these indices, offering faster performance than conventional similarity-based searches. To reduce complexity, the LLM, KoAlpaca, which was trained on Korean, was quantized to 4 bits using Llama.
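The cluster-based indexing idea can be sketched as a small inverted-file index in plain Python. The sketch assumes that the cluster centroids are already available (for example, from k-means) and searches only the `nprobe` clusters closest to the query. The function names are hypothetical, and the exact index type and parameters used with FAISS in this system are not specified here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index(vectors, centroids):
    """Assign every vector ID to the list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists


def search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Search only the nprobe clusters closest to the query."""
    probed = sorted(range(len(centroids)),
                    key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probed for vec_id in lists[c]]
    return sorted(candidates,
                  key=lambda vec_id: sq_dist(query, vectors[vec_id]))[:k]


vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.5], [10.0, 10.5]]  # assumed to come from clustering
lists = build_index(vectors, centroids)
hits = search([9.0, 10.0], vectors, centroids, lists, k=1, nprobe=1)
```

With `nprobe=1`, only two of the four stored vectors are compared with the query. The speed-up comes at the cost of possibly missing a true neighbor that lies in an unprobed cluster, which is why such searches are approximate.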

Fig. 2. The process of storing data in a Vector DB

C. Prompt Design

The first step in prompt design is to define the role of the LLM. According to the principle of clarity discussed in the previous section, it should be made clear that the LLM is to be used for automatic document classification. To clarify the classification method, examples are provided in the prompt. The example in Fig. 3 presents a scientific document with its title, abstract, and all three classification codes as labels, and the LLM is prompted to find similar documents and provide their classification codes in response. The LLM thus recognizes its role as finding similarly labeled documents and providing classification codes in response to the user.

Fig. 3. Prompt Design

The task of the proposed system is completed by providing a document corresponding to the input prompt, searching for similar documents based on similarities in the vector DB, and returning the classification codes of the retrieved documents.
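A prompt of this kind could be assembled as in the following sketch. The role text, the field layout, and the function names are hypothetical and do not reproduce the prompt in Fig. 3. The regular expression assumes that classification codes consist of two uppercase letters followed by two digits, as in the examples shown later in this paper (e.g., EC01).

```python
import re

# Hypothetical role statement; the actual wording is shown in Fig. 3.
ROLE = ("You classify scientific papers. Using the labeled examples, "
        "answer with up to three classification codes.")

# Assumed code format: two uppercase letters followed by two digits.
CODE_PATTERN = re.compile(r"\b[A-Z]{2}\d{2}\b")


def build_prompt(title, abstract, examples):
    """Combine the role, retrieved labeled examples, and the query."""
    parts = [ROLE, "", "Labeled examples:"]
    for ex in examples:
        parts.append(f"Title: {ex['title']}")
        parts.append(f"Abstract: {ex['abstract']}")
        parts.append(f"Codes: {', '.join(ex['codes'])}")
        parts.append("")
    parts += ["Document to classify:", f"Title: {title}",
              f"Abstract: {abstract}", "Codes:"]
    return "\n".join(parts)


def parse_codes(response, max_codes=3):
    """Extract distinct classification codes in order of appearance."""
    codes = []
    for code in CODE_PATTERN.findall(response):
        if code not in codes:
            codes.append(code)
    return codes[:max_codes]


examples = [{"title": "Glycerol carbonate catalyst",
             "abstract": "Producing glycerol carbonate from urea.",
             "codes": ["EC01"]}]
prompt = build_prompt("High-purity glycerol carbonate",
                      "Extracting and distilling glycerol carbonate.",
                      examples)
codes = parse_codes("EC01 (Chemical Processes), EC03 (Polymer Processing "
                    "Technology), and EB02 (Ceramic Materials)")
```

Parsing the free-text response into codes keeps the downstream evaluation independent of how verbosely the LLM phrases its answer.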

However, it should be noted that the more accurately labeled documents are stored in the vector DB, the more label data the system effectively uses. Therefore, for a fair comparative analysis with the BERT-based classification model, the number of labeled data used by BERT and the number of documents with classification codes stored in the vector DB were set to be equal.

A. Experiment Results

Tables 3 and 4 report the document classification accuracy of two models, KoSciDeBERTa and KoAlpaca + RAG, on label datasets of different sizes (3k, 10k, and 30k). The results summarize the accuracies of the first, second, and third classification codes for each model. Before the introduction of LLMs, various attempts were made to use BERT in studies related to document classification. In this study, we compare an existing BERT-based document classification model with the LLM-based system. The BERT-based model was trained on the Korean scientific documents provided by KISTI [14].

Table 3. Document Classification Accuracy (%) of KoSciDeBERTa and KoAlpaca+RAG Models

| Label data | Model | 1st Code | 2nd Code | 3rd Code |
| --- | --- | --- | --- | --- |
| Data 3k | KoSciDeBERTa | 48.8 | 12.04 | 7.3 |
| Data 10k | KoSciDeBERTa | 61.3 | 22.7 | 15.08 |
| Data 30k | KoSciDeBERTa | 66.68 | 35.06 | 19.05 |
| Data 3k | KoAlpaca+RAG | 71.6 | 32.2 | 12.7 |
| Data 10k | KoAlpaca+RAG | 81.3 | 33.7 | 26.27 |
| Data 30k | KoAlpaca+RAG | 84.4 | 40.26 | 27.48 |


Table 4. Accuracy (%) of KoAlpaca+RAG for the Three Classification Codes According to the Size of the Label Dataset

| Label data | 1st Code | 2nd Code | 3rd Code |
| --- | --- | --- | --- |
| Data 3k | 71.6 | 32.2 | 12.7 |
| Data 10k | 81.3 | 33.7 | 26.27 |
| Data 30k | 84.4 | 40.26 | 27.48 |
| Data 30k + Pseudo 10k | 97.7 | 48.2 | 30.6 |


In addition to the abstracts of papers as inputs for BERT, keywords extracted using Keyword BERT were also embedded and combined with the results of abstract embeddings for input into the BERT model for paper classification. The tokenizer used was developed by SKT T-Brain specifically for BERT models [15].

As shown in Tables 3 and 4, KoAlpaca+RAG consistently outperformed the BERT-based model, particularly when the dataset size was small. This result supports our hypothesis that the integration of RAG reduces the dependency on large labeled datasets, addressing one of the key limitations of previous approaches. Fig. 4 highlights the similarity distribution among documents based on the LLM's classification, further demonstrating the model's ability to assign accurate labels even in cases where traditional methods struggle. This qualitative analysis, combined with the quantitative improvements presented in Tables 3 and 4, reinforces the superiority of our approach.

Fig. 4. Similarity distribution of documents for three types of classification codes

From a model perspective, KoAlpaca+RAG consistently showed higher accuracy than KoSciDeBERTa across all dataset sizes. In particular, for the smallest dataset size (3k), KoAlpaca + RAG demonstrated significantly higher performance, with a first-code accuracy of 71.6% compared with KoSciDeBERTa’s 48.8%. The second- and third-code accuracies also consistently favored the KoAlpaca + RAG model. From the perspective of dataset size, both models showed increased accuracy as the dataset size increased, indicating that more labeled data positively influenced performance. Notably, the KoAlpaca + RAG model exhibited pronounced performance improvements with increasing dataset size.
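For readers who wish to reproduce per-code accuracies of this kind, one possible scoring rule is sketched below. It is an assumption made for illustration, not necessarily the rule used to produce Tables 3 and 4: the accuracy at a position is the percentage of test documents that have a ground-truth code at that position and whose predicted code at the same position matches it exactly. The function name and the toy labels are hypothetical.

```python
def positional_accuracy(gold, pred, n_positions=3):
    """Accuracy (%) per code position, under an exact-match assumption."""
    accuracies = []
    for pos in range(n_positions):
        # Only documents that have a ground-truth code at this position count.
        pairs = [(g[pos], p[pos] if pos < len(p) else None)
                 for g, p in zip(gold, pred) if pos < len(g)]
        correct = sum(1 for g_code, p_code in pairs if g_code == p_code)
        accuracies.append(100.0 * correct / len(pairs) if pairs else 0.0)
    return accuracies


# Toy labels and predictions (hypothetical).
gold = [["EC01", "EF06", "EA05"], ["LB13", "LB01"], ["ND08"]]
pred = [["EC01", "EC03", "EB02"], ["LB14", "LB01"], ["ND08"]]
accuracy = positional_accuracy(gold, pred)
```

Under this rule, fewer documents contribute to the second and third positions, so those estimates rest on less data than the first-code accuracy.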

Generally, document classification labels can be somewhat inaccurate, especially for the second and third codes. Nevertheless, KoAlpaca + RAG achieved high accuracy with relatively fewer label data than BERT, particularly for these codes. Fig. 4 illustrates the similarity distribution between documents and queries for the first, second, and third classification codes selected by the LLM. The results indicate that the LLM consistently selects the classification code with the highest similarity as the first code. For the second and third codes, the similarity distributions formed distinct clusters. Although some overlap exists because a single document can belong to multiple categories, the LLM demonstrates appropriate classification code selection overall. The following examples illustrate that model-generated labels can be more accurate than manually assigned labels.

Illustrative examples:

Example 1)

Query:

Title: Catalyst and Method for Producing High-Purity Glycerol Carbonate

Abstract:

The present invention relates to a method for producing high-purity glycerol carbonate by extracting and distilling low-purity glycerol carbonate, which is produced by reacting glycerol and urea, using alkylene glycol.

Label (Ground Truth):

EC01 (Chemical Processes), EF06 (Renewable Energy), EA05 (Robotics/Automated Machinery)

LLM response:

EC01 (Chemical Processes), EC03 (Polymer Processing Technology), and EB02 (Ceramic Materials)

Given the title and abstract, the only appropriate label appears to be EC01 (Chemical Processes). The other labels provided by both the ground truth and the LLM response do not appear to be relevant based on the information provided. Therefore, the label should be revised to include only EC01, which was provided by the LLM.

Example 2)

Query:

Title: Marine Environments and Production of Laver Farm at Aphae-do Based on Water Quality and Phytoplankton Community

Abstract:

To understand the marine environmental characteristics of the seaweed cultivation area located in the southwestern waters of Aphaedo, Shinan-gun, three field surveys were conducted from October 2013 to April 2014, coinciding with the growth stages of the seaweed, specifically during the leaf development, middle leaf, and mature leaf stages. The Aphae seaweed farm is located in shallow coastal waters, where physical disturbances such as wind and tidal mixing cause significant resuspension of surface sediments. The resuspension of surface sediments not only supplies nutrients but also obstructs light penetration into the seawater because of the high total suspended solids, which in turn are believed to hinder the growth of phytoplankton competing with seaweed for nutrients during the winter seaweed growth period. (Omitted for brevity.)

Label (Ground Truth):

LB13 (Aquaculture), LB01 (Crop Science)

LLM response:

LB14 (Fisheries Resources/Fishery Environment), EH06 (Marine Environment), ND08 (Marine Science)

In this example, the LLM response was more comprehensive and contextually appropriate. It captures the essence of the study’s focus on the marine environment and scientific analysis, without including less relevant categories such as terrestrial crop science. The manual labels are detailed but include a less relevant category (LB01) and miss the broader marine environmental context captured by the LLM response. Therefore, for this specific document, the LLM-generated labels were superior.

Fig. 4 highlights the similarity distribution among documents based on the classification performed by the LLM, further demonstrating the model's ability to assign accurate labels even in cases where traditional methods struggle. This qualitative analysis, combined with the quantitative improvements presented in Tables 3 and 4, reinforces the superiority of our approach.

B. Limitations and Further work

In this study, we verified that using RAG for scientific paper classification can achieve higher accuracy than traditional BERT-based supervised learning methods, especially when utilizing a small amount of labeled data. Experiments demonstrated that the proposed system structure outperformed the BERT-based classification system for every label dataset size and classification code tested. However, one limitation of this study is that it was restricted to scientific and technical papers. As previously mentioned, the domain-specific characteristics of RAG still exist in this system. Nevertheless, based on the experimental results, we anticipate that if diverse data from various fields are secured, the proposed simple structure can easily expand the scope of document classification.

The second limitation is the relatively lower accuracy of the second and third classification codes compared with the first classification code. This is attributed to the insufficient labeled data currently available. However, if obtaining accurate classification codes is difficult, moving away from supervised learning-based methodologies could be an alternative. As shown in the prompt, the proposed system still uses an approach similar to supervised learning to approximate the given correct answers. For the second and third classification codes, an unsupervised learning approach that reveals new classification codes based on similarity could be employed instead of these supervised learning-based methods. Evaluating the system’s performance indirectly by assessing user convenience could also be a viable approach.

This study proposed a system architecture utilizing an LLM and RAG to overcome the limitations of existing BERT-based document classification methods in scientific document classification tasks and compared the performance of the two document classification systems. The experimental results showed that the LLM with RAG achieved higher accuracy across all dataset sizes. Particularly noteworthy was its high accuracy for the first classification code, indicating that the LLM with RAG enhanced the semantic understanding of documents, enabling precise classification. As the dataset size increased, the model performance improved, confirming the positive impact of larger labeled datasets. The high accuracy of the first classification code underscores the capability of the model to make effective use of the labeled documents.

This study demonstrated that LLMs combined with RAG provide a robust solution for document classification tasks, particularly in domains with limited labeled data. Future work will focus on extending this approach to other scientific fields and further optimizing prompt engineering to enhance the classification accuracy across more complex datasets.

This result was supported by “Regional Innovation Strategy (RIS)” through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE) (2021RIS-003).

  1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of 31st Conference on Neural Information Processing Systems, Long Beach, USA, pp. 1-11, 2017.
  2. J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv: 1810.04805, Oct. 2018. DOI: 10.48550/arXiv.1810.04805.
  3. A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving Language Understanding by Generative Pre-Training, 2018. [Online]. Available: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
  4. J. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The Long-Document Transformer,” arXiv preprint arXiv: 2004.05150, Apr. 2020. DOI: 10.48550/arXiv.2004.05150.
  5. P. He, J. Gao, and W. Chen, “DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing,” arXiv preprint arXiv: 2111.09543, Nov. 2021. DOI: 10.48550/arXiv.2111.09543.
  6. Y. Adi, O. Keren, and B. Crammer, “DocBERT: BERT for Document Classification,” arXiv preprint arXiv:1904.08398, Apr. 2019. DOI: 10.48550/arXiv.1904.08398.
  7. N. Chalkidis, I. Androutsopoulos, and D. Gall, “Effectively Leveraging BERT for Legal Document Classification,” in Proceedings of the Natural Legal Language Processing Workshop 2021, Punta Cana, DO, 2021. DOI: 10.18653/v1/2021.nllp-1.22.
  8. W. Yao, D. Ding, H. Huang, and Z. Yuan, “Scientific Paper Classification by Fusing BERT and GCN,” in Proceedings of the 2023 International Conference on Intelligent Education and Intelligent Research (IEIR), Wuhan, CN, 2023. DOI: 10.1109/IEIR59294.2023.
  9. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, and G. Sastry, “Language Models are Few-Shot Learners,” in Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, CA, 2020.
  10. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv preprint arXiv:2106.09685, Jun. 2021. DOI: 10.48550/arXiv.2106.09685.
  11. C. Jeong, “Generative AI service implementation using LLM application architecture: based on RAG model and LangChain framework,” Journal of Intelligence and Information Systems, vol. 29, no. 4, Dec. 2023. DOI: 10.13088/jiis.2023.29.4.129.
  12. S. Bsharat, A. Myrzakhan, and Z. Shen, “Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4,” arXiv preprint arXiv:2312.16171, Dec. 2023. DOI: 10.48550/arXiv.2312.16171.
  13. E. Wallace, “How Vector Databases Can Enhance GenAI,” RTInsights, 2023. [Online]. Available: https://www.rtinsights.com/how-vector-databases-enhance-genai/.
  14. KISTI-AI, KorSciBERT [Internat], Available: https://github.com/KISTI-AI/KorSciBERT.
  15. SKT Brain, KoBERT. [Internet], Available: https://github.com/SKTBrain/KoBERT.

Jaehan Jeong

Jaehan Jeong has been studying industrial design with a focus on IT convergence at the University of Ulsan since 2017. His research interests include natural language processing, machine learning, and human-computer interaction.


Dongsup Jin

Dongsup Jin received his undergraduate and Ph.D. degrees in electrical and information engineering from Seoul National University, Seoul, Korea, in 2006 and 2013, respectively. He has worked as a researcher in signal processing and AI fields at Samsung Electronics and LG AI Research, accumulating experience in data science and AI-related projects. Since 2023, he has been serving as an assistant professor in the Department of IT Convergence at the University of Ulsan. His research interests include data mining, machine learning, and artificial intelligence utilizing graphs.



Keywords: BERT, Document classification, Large language model (LLM), Retrieval-augmented generation (RAG), Vector database (DB)

I. INTRODUCTION

More than five million papers are published in science and technology each year. Technology for classifying them automatically has therefore been actively researched for users’ convenience. Domestic papers in Korea are classified into 356 categories under the National Science and Technology Classification, and various classification technologies have been proposed for this purpose.

Automatic classification technology has rapidly advanced with the development of artificial intelligence (AI), with the core being the advancement of document embedding technology. Before the development of deep learning, keyword extraction-based embedding techniques using the term frequency-inverse document frequency (TF-IDF) method were used. Subsequently, keyword extraction has progressed with the development of probability-based latent Dirichlet allocation (LDA) techniques. Documents embedded in this manner have led to the development of supervised learning-based document classification technologies using classification algorithms such as support vector machine (SVM) and random forest.
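As a minimal sketch of the TF-IDF embedding described above, the following Python code weights each term by its in-document frequency multiplied by log(N / df) and compares documents by cosine similarity. The whitespace tokenizer, the three toy documents, and the function names are assumptions made for illustration; the classifier stage (for example, an SVM) is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document (tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus, not taken from the KISTI data.
docs = [
    "catalyst reaction glycerol carbonate",
    "seaweed farm water quality phytoplankton",
    "glycerol urea reaction distillation",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[2])
sim_unrelated = cosine(vecs[0], vecs[1])
```

In this toy corpus the two chemistry documents share weighted terms, whereas the chemistry and seaweed documents share none, so only the first pair has a nonzero similarity.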

With the rise of deep learning, document embedding technology has rapidly advanced, especially after the proposal of the transformer, and more sophisticated document embedding models became possible with large language models (LLMs), such as BERT and GPT [1-5]. Technologies utilizing deep learning embed each document with pre-trained models and graph neural networks, add inference structures, and fine-tune them [6-8]. However, even fine-tuning requires a significant amount of labeled data for supervised learning, which is not easily obtainable, especially in science and technology, where new fields constantly emerge and many papers are generated.

In this study, we aim to overcome the limitations of existing technologies by utilizing an LLM-based model. Additionally, we extend the existing document classification task, which primarily focuses on major categories, to encompass 356 detailed classifications, thereby tackling a more challenging document classification problem. We experimentally examined whether comparable classification performance can be achieved with less labeled data by applying appropriate prompt engineering to the LLM and comparing the results with those of supervised learning-based models using the existing BERT. Additionally, we propose a system structure that is easy to implement and use, integrating the LLM with retrieval-augmented generation (RAG) and a vector database (DB) capable of storing the embedded states of the documents.

II. RELATED WORKS

A. Supervised Learning Approaches in Document Classification

Supervised learning has been widely adopted in document classification tasks, leveraging labeled data to train models that can predict document categories. Traditional methods such as naïve Bayes and SVMs have been used in this area. Naïve Bayes treats document classification probabilistically by making decisions based on word occurrences, whereas SVMs seek hyperplanes that optimally separate document categories in the feature space. Despite their effectiveness, these methods rely heavily on manually engineered features such as TF-IDF, which can limit their performance on complex or large-scale datasets. In recent years, neural networks have become the dominant approach for document classification. Convolutional neural networks (CNNs) have demonstrated strong performance in text classification by capturing local patterns in documents, whereas recurrent neural networks (RNNs) have been employed to model sequential dependencies in text. More recently, transformer-based models such as BERT have achieved state-of-the-art results by pre-training on vast amounts of text data and fine-tuning on specific tasks, allowing the model to capture deep contextual and semantic relationships in documents.

A significant challenge in supervised learning is the requirement for large amounts of labeled data, which can be expensive and time-consuming to obtain. Transfer learning has been used effectively to address this issue. In this approach, models are pre-trained on large, general corpora and then fine-tuned on smaller, task-specific datasets, significantly improving performance, especially when labeled data are scarce. Additionally, semi-supervised learning techniques have been explored to further reduce the dependency on labeled data while maintaining high performance. In summary, supervised learning has played a crucial role in advancing document classification, with approaches evolving from traditional machine learning algorithms to advanced neural and transformer-based models. Although these advancements have significantly improved accuracy, they are still constrained by the need for large labeled datasets and domain-specific fine-tuning. Moreover, as domain complexity increases, these models struggle with data sparsity and the requirement for constant manual intervention.

To overcome these limitations, this study proposes a document classification system that utilizes an LLM and RAG. Before introducing the proposed system, the key concepts of LLMs and RAG are presented in the following subsections.

B. Advancement of Pre-trained Language Models and RAG

A transformer uses the self-attention mechanism to model sequence data and analyze the relationships between words within a sentence. The transformer fundamentally adopts an encoder-decoder architecture, with the encoder and the decoder each consisting of multiple stacked layers. Each encoder layer is composed of a self-attention layer and a fully connected layer. The self-attention layer is designed to refer to the meanings of words in other positions when interpreting the word at each position in the input sentence. Multiple self-attention layers are used simultaneously to implement multi-head attention. The encoding vectors generated by these self-attention layers are further encoded through a fully connected layer. This structure allows the transformer to process sequence data effectively, and it can be applied to various natural language processing (NLP) tasks, including machine translation.
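The self-attention step can be illustrated with a small sketch of single-head scaled dot-product attention. As a simplifying assumption, the learned query, key, and value projection matrices are replaced by the identity (Q = K = V = x), and the three two-dimensional token vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with identity projections (Q = K = V = x)."""
    d = len(x[0])
    weights, outputs = [], []
    for query in x:
        # Scaled dot product of this position's query with every key.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in x
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * value[i] for w, value in zip(row, x))
            for i in range(d)
        ])
    return outputs, weights


# Hypothetical token embeddings for a three-token sentence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to one and indicates how strongly a position attends to every other position; multi-head attention would run several such heads with different learned projections and concatenate their outputs.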

Recently, new models that enhance performance by utilizing the transformer architecture have emerged in the field of NLP, with Google's BERT serving as a prominent example. BERT is a large model that uses only the encoder structure of the transformer and was one of the largest language models before larger generative pre-trained transformer (GPT) models were developed.

BERT is pre-trained on two major self-supervised learning tasks. The first involves randomly masking words in a sentence and predicting the masked words, and the second involves determining whether two sentences are sequentially connected. Such self-supervised learning is typically conducted on large volumes of text, and the resulting BERT embeddings show higher accuracy in transfer learning than traditional methods.
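The first pre-training task can be sketched as follows. This is a simplified illustration: every selected token is replaced by a `[MASK]` symbol, the 15% masking rate follows the original BERT paper, and the function name, seed, and example sentence are assumptions.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens and record the originals as prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[position] = token  # the model must recover this word
        else:
            masked.append(token)
    return masked, targets


sentence = "the model predicts each hidden word from its surrounding context".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

During pre-training, the loss is computed only at the positions stored in `targets`, so the model learns to infer a word from its context on both sides.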

Since the release of ChatGPT, generative models based on LLMs have garnered significant attention in the field of artificial intelligence [9]. These models are widely used, replacing traditional chatbot services that provide results by learning from language data such as text.

LLMs are built on machine learning and deep learning technologies and require large datasets and substantial computational power for training. As foundational models for natural language processing (NLP) and natural language generation (NLG) tasks, LLMs are pre-trained on vast amounts of data to learn the complexity and interconnectedness of language. They are adapted to downstream tasks through fine-tuning or through prompting techniques such as in-context learning and zero/one/few-shot learning. Because fine-tuning all parameters of these models is difficult, methods such as P-tuning, prefix tuning, prompt tuning, and low-rank adaptation (LoRA) are used to train only a subset of the parameters [10].
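The saving offered by LoRA can be made concrete with a short calculation. LoRA keeps a pre-trained weight matrix frozen and trains two low-rank factors whose product is added to it, so the number of trainable parameters drops from d_out × d_in to r × (d_out + d_in). The layer size of 4096 × 4096 and the rank of 8 below are illustrative assumptions, not values from this study.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. a LoRA update of one matrix."""
    full = d_out * d_in            # every entry of the weight matrix W
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
ratio = lora / full
```

Under these assumptions, the low-rank update trains 65,536 parameters instead of 16,777,216, which is about 0.39% of the full matrix.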

Prominent LLMs include OpenAI’s GPT series and Google’s BERT (Bidirectional Encoder Representations from Transformers). Recently, open-source LLMs such as those from Mistral AI and Llama have also been gaining ground, offering high-quality performance with relatively few parameters.

Retrieval-augmented generation (RAG) is a pattern that uses pre-trained LLMs and proprietary data to generate responses. LLMs, trained on publicly available large-scale data, generally provide appropriate answers to queries on widely known knowledge. However, they often struggle to provide accurate responses to domain-specific or specialized queries and are prone to generating arbitrary answers, known as hallucinations. RAG addresses this issue by searching for reference materials that may contain correct answers and generating responses based on this information. In particular, RAG has been shown to enhance the accuracy of responses in domain-specific applications [11].

C. Prompt Engineering

Prompt engineering is a technique aimed at providing appropriate prompts so that an AI model produces high-quality output. Because AI models generate responses based on the input (prompt) they receive, their performance is significantly influenced by how the prompt is constructed. Prompt engineering employs various techniques to maximize the potential of AI models, and methods based on few-shot learning have been used with models such as GPT. For an AI model that has been pre-trained on large datasets to extract information effectively, the provided prompts must be clear and specific, effectively conveying the required information. The main elements of prompt engineering are as follows [12]:

1. Clarity of Prompt: The prompt must be clear and specific. Ambiguous prompts can confuse the AI model and lead to inaccurate responses.

2. Reflection on User Requirements: The prompt should reflect user requirements. It is important to understand clearly what the user wants and to incorporate it into the prompt.

3. Understanding Model Characteristics and Limitations: It is essential to understand the characteristics and limitations of AI models and consider them when writing prompts. For instance, if the model lacks knowledge about a specific area, it is necessary to provide additional information or make the prompt more specific.

Prompt engineering is a critical component in harnessing the capabilities of AI models, ensuring that the prompts are well-structured to guide the model towards generating precise and relevant responses.

III. DATA SET AND PROPOSED ARCHITECTURE

Although these advancements in document classification have significantly improved accuracy using supervised learning and BERT, they are still constrained by the need for large labeled datasets and domain-specific fine-tuning. To address these limitations, we propose a more scalable approach utilizing LLMs and RAG, which are introduced in this section.

A. Data Set

To demonstrate the effectiveness of the proposed LLM + RAG system in handling complex classification problems, we used the datasets listed in Tables 1 and 2. We used only the Korean and English abstracts from the full-text dataset of 481,578 domestic papers in Korea provided by KISTI, together with classification data on research fields (Standard Classification Code for Science and Technology, Label) for 30,000 papers. These datasets allow us to evaluate the system performance in scenarios with varying amounts of labeled data, which is a critical factor in overcoming the challenges faced by traditional models.

Table 1. Article Dataset (481,578).

Property | Corpus
Doc_id | Paper id
Abstract_ko | Abstract (Korean)
Abstract_en | Abstract (English)


Table 2. Label Dataset (30,000).

Property | Corpus
Id | Article id
Title_ko | Paper Title in Korean
Title_en | Paper Title in English
Code1 | Standard Classification Code for Science and Technology
Code2 | Standard Classification Code for Science and Technology
Code3 | Standard Classification Code for Science and Technology


We matched the paper IDs in the Korean full-text dataset against the IDs in the research field classification data and retrieved the corresponding abstracts. Among the 30,000 labeled records, 6,047 had all three classification codes, 8,915 had only two codes, and 11,291 had only one code, so fewer records were available as the number of classification codes increased. For training, 90% of the collected data were used, and a random 10% was held out for testing and performance evaluation.
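The matching and splitting steps can be sketched as follows. This is a minimal illustration under assumed data structures: abstracts and labels are held in dictionaries keyed by paper ID, and the function name, seed, and toy records are hypothetical.

```python
import random


def join_and_split(abstracts, labels, test_ratio=0.1, seed=42):
    """Match label rows to abstracts by paper ID, then split into train and test."""
    joined = [
        {"id": paper_id, "abstract": abstracts[paper_id], "codes": codes}
        for paper_id, codes in labels.items()
        if paper_id in abstracts and codes  # keep only labeled papers with an abstract
    ]
    rng = random.Random(seed)
    rng.shuffle(joined)
    n_test = round(len(joined) * test_ratio)
    return joined[n_test:], joined[:n_test]


# Hypothetical toy records standing in for the article and label datasets.
abstracts = {f"p{i}": f"abstract text {i}" for i in range(20)}
labels = {f"p{i}": ["EC01"] for i in range(10)}
labels["unmatched"] = ["EA05"]  # has a label but no abstract, so it is dropped

train, test = join_and_split(abstracts, labels)
```

Fixing the random seed keeps the held-out 10% identical across runs, which makes accuracy figures for different label-set sizes comparable.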

B. Proposed System Architecture for Document Classification Using RAG

Although supervised learning approaches have significantly advanced document classification, they have notable limitations, particularly their reliance on large amounts of labeled data. Collecting and labeling these datasets is often time-consuming, expensive, and, in some cases, impractical, especially for specialized or evolving domains. In addition, the performance of traditional supervised methods is tightly coupled with the quality and quantity of labeled data, which can limit their applicability to real-world scenarios where labeled data are sparse. To address these limitations, we propose a document classification method that leverages LLMs in conjunction with RAG. LLMs, pre-trained on vast amounts of general text, can generalize across tasks with minimal fine-tuning, thereby reducing the dependence on extensively labeled datasets. By integrating RAG, our approach can further enhance the classification by retrieving the relevant document context from a vectorized DB, thereby enriching the model’s input with domain-specific information. This combination not only mitigates the need for large-scale labeled data but also enables more accurate and context-aware classification, especially in dynamic or under-resourced domains. Through this method, we aim to overcome the shortcomings of traditional supervised learning and provide a more scalable and efficient solution for document classification tasks.

Fig. 1 illustrates the structure of a document classification system that utilizes an LLM and RAG. In the proposed system, the user inputs the document title and abstract as the query for document classification. The title and abstract are passed to the embedding model, and the resulting embedding is used to search for similar documents in the vector DB. In the vector DB, numerous previously classified documents are individually embedded using the embedding model, and the original information of these documents is stored together with their embedded vectors. In RAG, documents whose embedding vectors are similar to that of the user query are retrieved and used as input to the LLM along with the user query. Based on this, the LLM provides classification codes for the document entered by the user. The LLM in Fig. 1 was used to answer the questions and was implemented using the GPT-3.5 API.

Figure 1. Document classification system architecture using RAG
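The retrieval half of the architecture in Fig. 1 can be sketched in a few lines. This is a toy illustration rather than the implementation used in the study: the embedding model is replaced by a bag-of-words count vector, the vector DB by an in-memory list, and the stored documents are hypothetical. The classification codes are borrowed from the examples in Section IV.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for the embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors (missing terms count as 0)."""
    dot = sum(value * b[term] for term, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory stand-in for the vector DB of previously classified documents."""

    def __init__(self):
        self.items = []

    def add(self, text, codes):
        # Store the embedding next to the original text and its labels.
        self.items.append((embed(text), text, codes))

    def search(self, query, k=2):
        # Rank stored documents by similarity to the embedded query.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [(text, codes) for _, text, codes in ranked[:k]]


store = VectorStore()
store.add("glycerol carbonate production by catalytic reaction", ["EC01"])
store.add("laver farm water quality and phytoplankton", ["LB13"])
store.add("polymer extrusion process control", ["EC03"])

hits = store.search("method for producing high-purity glycerol carbonate", k=1)
```

The retrieved documents and their codes would then be placed in the LLM prompt together with the user query, so the model answers with labeled neighbors as context.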

Fig. 2 illustrates the structure for storing document data in the vector DB introduced in [13]. Abstracts from the source data are separated into tokens, and the tokenized sentences are vectorized and stored in a FAISS vector DB. In general, a vector DB is optimized by indexing the data to speed up search. This optimization is based on clustering the vectors: quantization is performed on individual clusters to assign indices, as illustrated in Fig. 2. Searches are then conducted around these indices, which is faster than a conventional exhaustive similarity search. To reduce complexity, KoAlpaca, an LLM trained on Korean, was quantized to 4 bits using Llama.

Figure 2. The process of storing data in a Vector DB
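The cluster-based indexing idea can be illustrated with a small inverted-file sketch. This is a pure-Python approximation of the concept, not the FAISS API: vectors are assigned to their nearest centroid, and a query is compared only against vectors in the closest `nprobe` clusters. The class name, the starting centroids, and the two-dimensional vectors are assumptions.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, centroids, iters=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for v in vectors:
            nearest = min(range(len(centroids)), key=lambda c: dist2(v, centroids[c]))
            buckets[nearest].append(v)
        # Move each centroid to the mean of its bucket; keep it if the bucket is empty.
        centroids = [
            [sum(col) / len(bucket) for col in zip(*bucket)] if bucket else centroid
            for bucket, centroid in zip(buckets, centroids)
        ]
    return centroids


class IVFIndex:
    """Toy inverted-file index: one list of vectors per cluster."""

    def __init__(self, vectors, initial_centroids):
        self.centroids = kmeans(vectors, initial_centroids)
        self.lists = {c: [] for c in range(len(self.centroids))}
        for i, v in enumerate(vectors):
            self.lists[self._nearest(v)].append((i, v))

    def _nearest(self, v):
        return min(range(len(self.centroids)), key=lambda c: dist2(v, self.centroids[c]))

    def search(self, query, nprobe=1):
        """Return the ID of the closest vector within the nprobe nearest clusters."""
        order = sorted(range(len(self.centroids)), key=lambda c: dist2(query, self.centroids[c]))
        candidates = [item for c in order[:nprobe] for item in self.lists[c]]
        if not candidates:
            return None
        return min(candidates, key=lambda item: dist2(query, item[1]))[0]


vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
index = IVFIndex(vectors, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
nearest_far = index.search([9.5, 10.2])
nearest_near = index.search([0.2, 0.9])
```

With `nprobe=1`, each query is compared against three vectors instead of six; raising `nprobe` trades speed for a lower risk of missing a neighbor that falls in an adjacent cluster.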

C. Prompt Design

The first step in prompt design is to define the role of the LLM. According to the principle of clarity discussed in the previous section, the prompt should make clear that the LLM is to be used for automatic document classification. To clarify the classification method, examples are provided. The example in Fig. 3 presents a scientific document with its title, abstract, and all three classification codes as labels, and the LLM is prompted to find similar documents and provide their classification codes in response. The LLM thus recognizes its role as finding similarly labeled documents and providing classification codes in response to the user.

Figure 3. Prompt Design
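A prompt of this kind could be assembled as in the sketch below. The wording of the prompt is a hypothetical illustration and not the text shown in Fig. 3; the function names are assumptions, and the code pattern of two capital letters followed by two digits is inferred from codes such as EC01 that appear in the examples in Section IV.

```python
import re


def build_prompt(title, abstract, retrieved, n_codes=3):
    """Assemble a classification prompt: role, labeled examples, then the query."""
    lines = [
        "You are a classifier for scientific and technical papers.",
        f"Return up to {n_codes} classification codes, most relevant first.",
        "",
        "Similar labeled papers:",
    ]
    for i, doc in enumerate(retrieved, start=1):
        lines.append(f"{i}. Title: {doc['title']}")
        lines.append(f"   Codes: {', '.join(doc['codes'])}")
    lines += ["", "Paper to classify:", f"Title: {title}", f"Abstract: {abstract}", "Codes:"]
    return "\n".join(lines)


def parse_codes(response):
    """Extract codes such as EC01 from free-text model output, keeping their order."""
    seen = []
    for code in re.findall(r"\b[A-Z]{2}\d{2}\b", response):
        if code not in seen:
            seen.append(code)
    return seen


# Hypothetical retrieved neighbor and query.
retrieved = [{"title": "Glycerol carbonate synthesis", "codes": ["EC01", "EC03"]}]
prompt = build_prompt(
    "Catalyst and Method for Producing High-Purity Glycerol Carbonate",
    "A method for producing high-purity glycerol carbonate.",
    retrieved,
)
codes = parse_codes(
    "EC01 (Chemical Processes), EC03 (Polymer Processing Technology), and EB02 (Ceramic Materials)"
)
```

Stating the role first and listing labeled neighbors before the query follows the clarity principle: the model sees what is expected and in what format before it reads the paper to classify.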

The proposed system completes its task by receiving a document through the input prompt, searching the vector DB for similar documents, and returning the classification codes of the retrieved documents.

However, it should be noted that the more accurately labeled documents are stored in the vector DB, the larger the amount of label data the system effectively uses. Therefore, for a comparative analysis with the BERT-based classification model, the amount of label data used by BERT and the number of documents with classification codes stored in the vector DB were set to be equal.

IV. RESULTS AND DISCUSSION

A. Experiment Results

Tables 3 and 4 report the document classification accuracy of two models, KoSciDeBERTa and KoAlpaca+RAG, on label datasets of different sizes (3k, 10k, and 30k). The results summarize the accuracies of the first, second, and third classification codes for each model. Before the introduction of LLMs, various attempts were made to use BERT in studies related to document classification. In this study, we compare an existing BERT-based document classification model with the LLM. The BERT-based model was trained on the Korean scientific documents provided by KISTI [14].

Table 3. Document Classification Accuracy (%) of KoSciDeBERTa and KoAlpaca+RAG Models.

Label data | Model | 1st Code | 2nd Code | 3rd Code
Data 3k | KoSciDeBERTa | 48.8 | 12.04 | 7.3
Data 10k | KoSciDeBERTa | 61.3 | 22.7 | 15.08
Data 30k | KoSciDeBERTa | 66.68 | 35.06 | 19.05
Data 3k | KoAlpaca+RAG | 71.6 | 32.2 | 12.7
Data 10k | KoAlpaca+RAG | 81.3 | 33.7 | 26.27
Data 30k | KoAlpaca+RAG | 84.4 | 40.26 | 27.48


Table 4. Performance Comparison of the Three Classification Codes According to the Size of the Label Dataset, KoAlpaca+RAG.

Label data | 1st Code | 2nd Code | 3rd Code
Data 3k | 71.6 | 32.2 | 12.7
Data 10k | 81.3 | 33.7 | 26.27
Data 30k | 84.4 | 40.26 | 27.48
Data 30k + Pseudo 10k | 97.7 | 48.2 | 30.6


In addition to the abstracts of papers as inputs for BERT, keywords extracted using Keyword BERT were also embedded and combined with the results of abstract embeddings for input into the BERT model for paper classification. The tokenizer used was developed by SKT T-Brain specifically for BERT models [15].

As shown in Tables 3 and 4, KoAlpaca+RAG consistently outperformed the BERT-based model, particularly when the dataset size was small. This result supports our hypothesis that the integration of RAG reduces the dependency on large labeled datasets, addressing one of the key limitations of previous approaches. Fig. 4 highlights the similarity distribution among documents based on the LLM's classification, further demonstrating the model's ability to assign accurate labels even in cases where traditional methods struggle. This qualitative analysis, combined with the quantitative improvements presented in Tables 3 and 4, reinforces the advantage of our approach.

Figure 4. Similarity distribution of documents for three types of classification codes

From a model perspective, KoAlpaca+RAG consistently showed higher accuracy than KoSciDeBERTa across all dataset sizes. In particular, for the smallest dataset size (3k), KoAlpaca+RAG demonstrated significantly higher performance, with a first-code accuracy of 71.6% compared with KoSciDeBERTa’s 48.8%. The second- and third-code accuracies also consistently favored the KoAlpaca+RAG model. From the perspective of dataset size, both models showed increased accuracy as the dataset size increased, indicating that more data positively influenced model training. Notably, the KoAlpaca+RAG model exhibited pronounced performance improvements with increasing dataset size.
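The size of the gap can be computed directly from Table 3. The sketch below copies the accuracies as read from the table (the separation of the flattened columns is our reading of it) and reports the gain of the LLM + RAG model over KoSciDeBERTa in percentage points; the dictionary layout and function name are assumptions.

```python
# Accuracy (%) for the 1st, 2nd, and 3rd codes, as read from Table 3.
TABLE3 = {
    "3k": {"KoSciDeBERTa": (48.8, 12.04, 7.3), "LLM+RAG": (71.6, 32.2, 12.7)},
    "10k": {"KoSciDeBERTa": (61.3, 22.7, 15.08), "LLM+RAG": (81.3, 33.7, 26.27)},
    "30k": {"KoSciDeBERTa": (66.68, 35.06, 19.05), "LLM+RAG": (84.4, 40.26, 27.48)},
}


def accuracy_gains(table):
    """Percentage-point gain of LLM+RAG over the BERT baseline, per code position."""
    return {
        size: tuple(
            round(rag - bert, 2)
            for bert, rag in zip(row["KoSciDeBERTa"], row["LLM+RAG"])
        )
        for size, row in table.items()
    }


gains = accuracy_gains(TABLE3)
first_code_gains = [gains[size][0] for size in ("3k", "10k", "30k")]
```

The first-code gain is 22.8 points with 3k labels, 20.0 points with 10k, and 17.72 points with 30k, which is consistent with the observation that the advantage is largest when labeled data are scarce.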

Generally, document classification labels can be somewhat inaccurate, especially for the second and third codes, although KoAlpaca+RAG has been shown to achieve high accuracy with relatively less label data than BERT, particularly for these codes. Fig. 4 illustrates the similarity distribution between documents and queries for the first, second, and third classification codes selected by the LLM. The results indicate that the LLM consistently selects the classification code with the highest similarity as the first code. For the second and third codes, the similarity distributions formed distinct clusters. Although some overlap exists because a single document can belong to multiple categories, the LLM demonstrates appropriate classification code selection overall. The following examples illustrate that model-generated labels can be more accurate than manually assigned labels.

Supporting examples:

Example 1)

Query:

Title: Catalyst and Method for Producing High-Purity Glycerol Carbonate

Abstract:

The present invention relates to a method for producing high-purity glycerol carbonate by extracting and distilling low-purity glycerol carbonate, which is produced by reacting glycerol and urea, using alkylene glycol.

Label (Ground Truth):

EC01 (Chemical Processes), EF06 (Renewable Energy), EA05 (Robotics/Automated Machinery)

LLM response:

EC01 (Chemical Processes), EC03 (Polymer Processing Technology), and EB02 (Ceramic Materials)

Given the title and abstract, the only appropriate label appears to be EC01 (Chemical Processes). The other labels, in both the ground truth and the LLM response, do not appear to be relevant to the information provided. Therefore, the label should be revised to include only EC01, which was provided by the LLM.

Example 2)

Query:

Title: Marine Environments and Production of Laver Farm at Aphae-do Based on Water Quality and Phytoplankton Community

Abstract:

To understand the marine environmental characteristics of the seaweed cultivation area located in the southwestern waters of Aphaedo, Shinan-gun, three field surveys were conducted from October 2013 to April 2014, coinciding with the growth stages of the seaweed, specifically during the leaf development, middle leaf, and mature leaf stages. The Aphae seaweed farm is located in shallow coastal waters, where physical disturbances such as wind and tidal mixing cause significant resuspension of surface sediments. The resuspension of surface sediments not only supplies nutrients but also obstructs light penetration into the seawater because of the high total suspended solids, which in turn are believed to hinder the growth of phytoplankton competing with seaweed for nutrients during the winter seaweed growth period. (Omitted for brevity.)

Label (Ground Truth):

LB13 (Aquaculture), LB01 (Crop Science)

LLM response:

LB14 (Fisheries Resources/Fishery Environment), EH06 (Marine Environment), ND08 (Marine Science)

In this example, the LLM response was more comprehensive and contextually appropriate. It captures the essence of the study’s focus on the marine environment and scientific analysis, without including less relevant categories such as terrestrial crop science. The manual labels are detailed but include a less relevant category (LB01) and miss the broader marine environmental context captured by the LLM response. Therefore, for this specific document, the LLM-generated labels were superior.


B. Limitations and Further work

In this study, we verified that using RAG for scientific paper classification can achieve higher accuracy than traditional BERT-based supervised learning methods, especially when utilizing a small amount of labeled data. Experiments demonstrated that the proposed system structure outperformed the BERT-based baseline at every label dataset size and for every classification code tested. However, one limitation of this study is that it was restricted to scientific and technical papers. As previously mentioned, the domain-specific characteristics of RAG still exist in this system. Nevertheless, based on the experimental results, we anticipate that if diverse data from various fields are secured, the proposed simple structure can easily expand the scope of document classification.

The second limitation is the relatively lower accuracy of the second and third classification codes compared with the first classification code. This is because of the insufficient labeled data currently available. However, if obtaining accurate classification codes is difficult, moving away from supervised learning-based methodologies could be an alternative. As shown in the prompt, the proposed system still uses an approach similar to supervised learning to approximate the given correct answers. For the second and third classification codes, an unsupervised learning approach that reveals new classification codes based on similarity can be employed instead of these supervised learning-based methods. Evaluating the system’s performance indirectly by assessing user convenience could also be a viable approach.

V. CONCLUSIONS

This study proposed a system architecture utilizing an LLM and RAG to overcome the limitations of existing BERT-based document classification methods in scientific document classification tasks, and compared the performance of the two document classification systems. The experimental results showed that the LLM with RAG achieved higher accuracy across all dataset sizes. Particularly noteworthy was its high accuracy for the first classification code, indicating that the LLM with RAG enhanced the semantic understanding of documents, enabling precise classification. As the dataset size increased, the model performance improved, confirming the positive impact of larger datasets. The high accuracy of the first classification code underscores the capability of the model to learn effectively.

This study demonstrated that LLMs combined with RAG provide a robust solution for document classification tasks, particularly in domains with limited labeled data. Future work will focus on extending this approach to other scientific fields and further optimizing prompt engineering to enhance the classification accuracy across more complex datasets.

ACKNOWLEDGEMENTS

This result was supported by “Regional Innovation Strategy (RIS)” through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE) (2021RIS-003).


Table 1 . Article Dataset (481,578).

PropertyCorpus
Doc_idPaper id
Abstract_koAbstract (Korean)
Abstract_enAbstract (English)

Table 2 . Label Dataset (30,000).

PropertyCorpus
IdArticle id
Title_koPaper Title in Korean
Title_enPaper Title in English
Code1Standard Classification Code for Science and Technology
Code2Standard Classification Code for Science and Technology
Code3Standard Classification Code for Science and Technology

Table 3. Document Classification Accuracy (%) of KoSciDeBERTa and KoAlphaka+RAG Models.

| Label data | Model | 1st Code | 2nd Code | 3rd Code |
| --- | --- | --- | --- | --- |
| Data 3k | KoSciDeBERTa | 48.8 | 12.04 | 7.3 |
| Data 10k | KoSciDeBERTa | 61.3 | 22.7 | 15.08 |
| Data 30k | KoSciDeBERTa | 66.68 | 35.06 | 19.05 |
| Data 3k | KoAlphaka+RAG | 71.6 | 32.2 | 12.7 |
| Data 10k | KoAlphaka+RAG | 81.3 | 33.7 | 26.27 |
| Data 30k | KoAlphaka+RAG | 84.4 | 40.26 | 27.48 |
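
As a quick check of the comparison in Table 3, the following Python sketch computes the percentage-point gain of KoAlphaka+RAG over KoSciDeBERTa for each label dataset size and classification code. The numbers are read from Table 3; the split of the 3k KoSciDeBERTa row into 48.8, 12.04, and 7.3 is an assumption based on the pattern of the other rows, and the names `BASELINE`, `RAG`, and `gains` are illustrative only.

```python
# Accuracy (%) per (1st, 2nd, 3rd) classification code, as read from Table 3.
# Assumption: the 3k KoSciDeBERTa row is split as 48.8 / 12.04 / 7.3.
BASELINE = {
    "3k": (48.8, 12.04, 7.3),
    "10k": (61.3, 22.7, 15.08),
    "30k": (66.68, 35.06, 19.05),
}
RAG = {
    "3k": (71.6, 32.2, 12.7),
    "10k": (81.3, 33.7, 26.27),
    "30k": (84.4, 40.26, 27.48),
}


def gains(baseline, rag):
    """Percentage-point difference (RAG minus baseline) per size and code."""
    return {
        size: tuple(round(r - b, 2) for b, r in zip(baseline[size], rag[size]))
        for size in baseline
    }


GAINS = gains(BASELINE, RAG)
for size, per_code in GAINS.items():
    print(size, per_code)
```

Under these readings, every difference is positive, and the gain on the first classification code is largest at the smallest label dataset size (22.8 percentage points with 3k labels).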

Table 4. Performance comparison of KoAlphaka+RAG on the three classification codes (accuracy, %) according to the size of the label dataset.

| Label data | 1st Code | 2nd Code | 3rd Code |
| --- | --- | --- | --- |
| Data 3k | 71.6 | 32.2 | 12.7 |
| Data 10k | 81.3 | 33.7 | 26.27 |
| Data 30k | 84.4 | 40.26 | 27.48 |
| Data 30k + Pseudo 10k | 97.7 | 48.2 | 30.6 |

References

  1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of 31st Conference on Neural Information Processing Systems, Long Beach, USA, pp. 1-11, 2017.
  2. J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv: 1810.04805, Oct. 2018. DOI: 10.48550/arXiv.1810.04805.
  3. A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving Language Understanding by Generative Pre-Training, 2018. [Online]. Available: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
  4. J. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The Long-Document Transformer,” arXiv preprint arXiv: 2004.05150, Apr. 2020. DOI: 10.48550/arXiv.2004.05150.
  5. P. He, J. Gao, and W. Chen, “DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing,” arXiv preprint arXiv: 2111.09543, Nov. 2021. DOI: 10.48550/arXiv.2111.09543.
  6. Y. Adi, O. Keren, and B. Crammer, “DocBERT: BERT for Document Classification,” arXiv preprint arXiv:1904.08398, Apr. 2019. DOI: 10.48550/arXiv.1904.08398.
  7. N. Chalkidis, I. Androutsopoulos, and D. Gall, “Effectively Leveraging BERT for Legal Document Classification,” in Proceeding of the Natural Legal Language Processing Workshop 2021, Punta Cana, DO, 2021. DOI: 10.18653/v1/2021.nllp-1.22.
    CrossRef
  8. W. Yao, D. Ding, H. Huang, and Z. Yuan, “Scientific Paper Classification by Fusing BERT and GCN,” in Proceedig of the 2023 International Conference on Intelligen Education and Intelligen Research (IEIR), Wuhan, CN, 2023. DOI: 10.1109/IEIR59294.2023.
    CrossRef
  9. T. Brown, B. Mann, N.Ryder, M.Subbiah, J.Kaplan, P. Dhariwal, A.Neelakantan, P.Shyam, and G.Sastry, “Language Models are Few-Shot Learners,” in Proceeding of the 34th Conference on Neural Information Processing System(NeurIPS 2020), Vancouver, CA, 2020.
  10. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv preprint arXiv:2016.09685, Jun. 2021. DOI: 10.48550/arXiv.2106.09685.
  11. C. Jeong, “Generative AI service implementation using LLM application architecture: based on RAG model and LangChain framework,” Journal of Intelligence and Information Systems, vol. 19, no. 4, Dec. 2023. DOI: 10.13088/jiis.2023.29.4.129.
  12. S. Bsharat, A. Myrzakhan, and Z. Shen, “Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4,” arXiv preprint arXiv:2312.16171, Dec. 2023. DOI: 10.48550/arXiv.2312.16171.
  13. E. Wallace, “How Vector Databases Can Enhance GenAI,” RTInsights, 2023. [Online]. Available: https://www.rtinsights.com/how-vector-databases-enhance-genai/.
  14. KISTI-AI, KorSciBERT [Internat], Available: https://github.com/KISTI-AI/KorSciBERT.
  15. SKT Brain, KoBERT. [Internet], Available: https://github.com/SKTBrain/KoBERT.
Journal of Information and Communication Convergence Engineering (J. Inf. Commun. Converg. Eng.)

eISSN 2234-8883
pISSN 2234-8255