
Regular paper


Journal of information and communication convergence engineering 2024; 22(1): 33-43

Published online March 31, 2024

https://doi.org/10.56977/jicce.2024.22.1.33

© Korea Institute of Information and Communication Engineering

Construction of Text Summarization Corpus in Economics Domain and Baseline Models

Sawittree Jumpathong 1*, Akkharawoot Takhom 2*, Prachya Boonkwan 1, Vipas Sutantayawalee3, Peerachet Porkaew1,5,6, Sitthaa Phaholphinyo1, Charun Phrombut1, Khemarath Choke-mangmi4, Saran Yamasathien4, Nattachai Tretasayuth4, Kasidis Kanwatchara4, Atiwat Aiemleuk4, and Thepchai Supnithi1

1Language and Semantic Technology Laboratory, National Electronics and Computer Technology Center, 12120, Thailand
2Faculty of Engineering, Thammasat School of Engineering, Thammasat University, 12120, Thailand
3Promes Co., Ltd., Backyard Group, 10120, Thailand
4PTT Digital Solutions Company Limited, 10900, Thailand
5Institute of Computing Technology, Chinese Academy of Sciences, 100190, China
6University of Chinese Academy of Sciences, 101408, China

Correspondence to : Sawittree Jumpathong (E-mail: sawittree.jumpathong@nectec.or.th)
Language and Semantic Technology Laboratory, National Electronics and Computer Technology Center, 12120, Thailand
Akkharawoot Takhom (E-mail: takkhara@tu.ac.th)
Faculty of Engineering, Thammasat School of Engineering, Thammasat University, 12120, Thailand

Received: April 1, 2023; Revised: October 27, 2023; Accepted: November 4, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Automated text summarization (ATS) systems rely on language resources as datasets. However, creating these datasets is a complex and labor-intensive task that requires linguists to annotate the data extensively. Consequently, publicly accessible datasets for ATS, particularly in languages such as Thai, are scarcer than those for more widely used languages. The primary objective of ATS is to condense large volumes of text into shorter summaries, thereby reducing the time required to extract information from extensive textual data. This study introduces ThEconSum, an ATS resource specifically designed for the Thai language and built from economy-related data. An evaluation of this research revealed the significant remaining tasks and limitations of Thai ATS.

Keywords Abstractive Text Summarization, Natural Language Processing, Transformer Model, Public Dataset, Economics Domain

Automatic text summarization (ATS) shortens lengthy text to produce a summary containing the most important information within the original content. ATS plays an important role in addressing the issue of information overload and can be characterized based on its operations: homogeneity (single-document vs. multi-sourced) and rewriting approaches (extractive vs. abstractive). State-of-the-art techniques for ATS include eigen-decomposition [1], neural methods [2], and large language models [3]. However, all of these techniques require a large amount of training data to learn to identify the gist of the original content and rewrite it in an understandable format. In less-privileged languages, resources for ATS are scarce.

Compared with popular languages, publicly accessible datasets for ATS in the Thai language are relatively scarce. Thai orthography poses challenges for language preprocessing owing to its complex nature: it lacks explicit markers for word and sentence boundaries, making annotation a challenging task for linguists. TR-TPBS [4], ThaiSum [4], Thai-CrossSum [5], and XL-Sum [6] are datasets created for Thai ATS based on different and unattested guidelines. For Thai, ATS can be performed either extractively [7] or abstractively [8]. Although these datasets are publicly available, they were not designed to evaluate different rewriting methods, and no annotation guidelines for their construction were published.

The dataset must be prepared prior to developing the ATS, in which three steps are applied: source observation, data collection, and annotation. The goal of this study is to construct an ATS system for economic news articles in Thai language trained on public datasets. The dataset used in this study was carefully curated from the following reliable news sources: Bangkok Turakij (https://www.bangkokbiznews.com), Prachachat Turakij (https://www.prachachat.net), and Bangkok Today (https://bangkok-today.com).

This paper describes the construction of a Thai language resource called ThEconSum, which is a corpus of ATS for Thai economics news articles. This dataset presents several language difficulties in Thai summarization. This study makes three contributions. First, a standard ATS dataset was distributed free of charge, and guidelines for data construction were elaborated. Second, we developed basic models utilizing powerful language models such as BERT and mT5. Finally, this dataset serves as a valuable resource for condensing knowledge for large-scale ATS applications in the economic domain.

The remainder of this paper is organized as follows. In Section 2, we explain the remarkable linguistic behaviors of Thai related to ATS and public ATS datasets. In Section 3, we elaborate on the construction of our dataset and annotation guidelines. In Section 4, we benchmark the baselines for the Thai ATS using our dataset and report the results. Finally, we conclude the paper in Section 6.

A. Linguistic Behaviors of Thai

Thai belongs to the analytic language family and is characterized by subject-verb-object (SVO) word order and head-initial phrase structure. Grammatical functions such as number, tense, and voice are expressed with adjectives, adverbs, function words, and syntactic structures rather than an inflection or declension system. Because more words are needed to communicate the same notions, a Thai sentence is often longer than its equivalent translation in a European language. Furthermore, in Thai orthography, word and sentence boundaries are not explicitly marked and must be inferred from context.

This poses several challenges for ATS. First, data preparation must include word tokenization and sentence-boundary identification. Second, because Thai lacks an inflection system, compound words are formed by simple juxtaposition, making key phrases difficult to learn. Finally, Thai sentences tend to be long and contain many function words and complex syntactic structures spanning long ranges. These linguistic characteristics must be carefully considered when designing and developing ATS models for the Thai language.
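As a concrete illustration of the word-tokenization challenge, the following sketch implements greedy longest-matching segmentation, a classic dictionary-based baseline for unsegmented scripts. The lexicon and the romanized input are toy stand-ins of our own (real Thai tokenizers rely on far larger dictionaries and statistical models):

```python
def longest_match_tokenize(text, lexicon):
    """Greedy left-to-right longest-match segmentation of unspaced text."""
    tokens, i = [], 0
    max_len = max(len(w) for w in lexicon)
    while i < len(text):
        # Try the longest candidate substring first, shrinking until a hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:  # no dictionary entry matched: emit a single-character token
            tokens.append(text[i])
            i += 1
    return tokens

lexicon = {"bank", "banknote", "note", "economy"}
# "banknote" is preferred over "bank" + "note" because it is longer.
print(longest_match_tokenize("banknoteeconomy", lexicon))
# -> ['banknote', 'economy']
```

Greedy longest matching is only a baseline; it fails when the locally longest word is not the globally correct segmentation, which is one reason Thai tokenization remains a research problem.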

B. Datasets for ATS

Compared with widely spoken languages, public datasets for Thai ATS are relatively limited, primarily because of the challenges involved in preprocessing Thai. In Thai orthography, words are written consecutively without delimiters, and spaces only loosely mark sentence boundaries, which causes orthographical ambiguity and makes linguistic annotation a time-consuming task.

However, four datasets are currently available that specifically cater to Thai ATS:

  • TR-TPBS [4] was compiled using information from ThaiRath and Thai PBS news websites. More than 333,000 articles and summary pairings produced by journalists are included in this collection.

  • ThaiSum [4] is an extensive dataset designed specifically for Thai ATS. It was sourced from reliable sources, such as The Standard, Prachatai, Thai PBS, and Thai Rath. This dataset comprises a collection of more than 350,000 article-summary pairs that were meticulously compiled by journalists. On average, the article texts spanned 530 words, whereas the corresponding summaries condensed the information into a concise form of approximately 37 words. ThaiSum is a valuable resource for training and evaluating Thai summarization models.

  • ThaiCrossSum Corpora [5] is an extensive dataset specifically designed for cross-lingual summarization of Thai text. It consists of 310,926 TH2EN (Thai-to-English) and 310,926 TH2ZH (Thai-to-Chinese) article-summary pairs.

  • XL-Sum [6] stands as the largest dataset accessible for multilingual ATS. It consists of a collection of 1.35 million article-summary pairs spanning 44 different languages.

For Thai, XL-Sum contains 8,268 article-summary pairs. Table 1 lists the statistics of the aforementioned datasets. ThaiSum is the largest dataset for Thai news ATS, with approximately 530 words per article and 37 words per summary. The article parts of XL-Sum and our dataset are longer (967 and 766 words, respectively). Although the summary parts of ThaiSum and XL-Sum are comparable in length, the summaries in our dataset are substantially longer. Note that TR-TPBS and Thai-CrossSum were not included in the table because TR-TPBS is already included in ThaiSum, and the problem setting of Thai-CrossSum (cross-lingual summarization) is incompatible with ours.

Table 1 . Comparison of data statistics

Dataset      Size      Avg. Article length (words)   Avg. Summary length (words)   Annotator
ThaiSum      350,000   530                           37                            Journalist
XL-Sum*      8,268     967                           53                            Journalist
This work    3,189     766                           334                           Linguist

* (only Thai)



C. Related Works on ATS

ATS methods can be classified into two main categories: extractive and abstractive. In the extractive approach, a summary is generated by identifying and selecting the most relevant and significant sentences from the original input text. These sentences are then concatenated to form a summary without further generation or rearrangement; the objective is to present the essential information in condensed form without modifying the original structure or content of the sentences. The abstractive approach, by contrast, communicates the most significant information from the input text by constructing the summary through rephrasing and stylistic changes.

ATS has been widely studied and applied in various fields and is becoming increasingly crucial because of the massive volume of textual material [9]. ANN-based language models are employed to produce a summary by word-by-word prediction given an already generated context [10].

This study focuses on an abstractive approach to single-document summarization (SDS), in which a summary is produced from a single text document while preserving the relevant information in the original. Abstractive ATS requires natural language processing (NLP) techniques to capture the important gist of the original text and rewrite it as shorter, more understandable text [12,13].

D. Thai Text Summarization

Thai ATS has recently gained attention in the NLP community, and most studies have used extractive methods. Ketui and Theeramunkong [7] proposed a comprehensive strategy for summarizing multiple documents using an extractive approach. This strategy comprises three main components. The first is unit segmentation, a preprocessing technique that breaks the text into meaningful units through word segmentation and part-of-speech (POS) labeling. The second is unit-graph formulation, which creates a graph representation in which each unit is a weighted node and the relationships between nodes are weighted links. This graph structure captures the interconnections and importance of the units in the text. The third focuses on unit selection for summarization, identifying the nodes and links most relevant to the summarization task and thereby extracting crucial information from the original documents. Following this approach, Thai texts can be summarized efficiently and effectively.

Nuanplord and Sodanil [14] suggested an ontology-based strategy for summarizing Thai health news. Yongkiatpanich and Wichadakul [15] suggested an extractive text-summarization strategy that integrated ontology- and graph-based methodologies. They employed a graph representation technique that integrated the Unified Medical Language System (UMLS), a comprehensive knowledge ontology from the National Library of Medicine (NLM). This graph-based approach was combined with Word Mover's Distance, a distance function commonly used to quantify the similarity between text documents. By leveraging UMLS and Word Mover's Distance, the researchers aimed to enhance the effectiveness of their summarization method and improve the accuracy of extracting important information from medical texts. To select the main sentences, they retrieved summaries using Google PageRank, a popular graph-based technique.

For the abstractive method, Chaowalit and Sornil [16] proposed an ATS system for customer reviews. They evaluated their method using genuine reviews of 50 products chosen randomly from a well-known cosmetics website. Jumpathong et al. [8] conducted a deep learning-based performance study on Thai abstractive news summarization. They examined the order in which words from the original document were used to construct a summary and found that text length affected model accuracy.

The construction of ThEconSum consists of three steps: data collection, summarization-guideline building, and manual summarization. (1) Data Curation from News Sources: We gathered news articles from reliable sources, such as Bangkok Turakij, Prachachat Turakij, and Bangkok Today. (2) Annotation Guideline Construction: We created guidelines for systematically and consistently constructing a summary for each news article. The output summaries are created using both extractive and abstractive approaches: the extractive method selects the most salient text chunks from the original content without alteration, whereas the abstractive method recreates the relevant information as a shorter summary through paraphrasing and stylistic changes. (3) Manual Summarization: Four Thai language specialists were hired to create the text summary corpus. We provided them with the summarization guideline and an online summarizing tool, an online corpus editor that can be used from anywhere, and they built the summaries according to the guideline.

Fig. 1. Process of corpus construction

A. Data Collection

We curated news articles for summarization based on the following criteria. First, we collected only articles related to economics, gathered exclusively from reliable sources such as Bangkok Turakij, Prachachat Turakij, and Bangkok Today. To narrow the keyword search, we selected economics articles related to PTT Digital Solutions Company Limited because the company is involved in many economic sectors and some of the authors have relevant business insights. Furthermore, the selected news articles varied in length from 400 to 1,000 words per document. Finally, we selected articles posted between June 2021 and June 2022.

After curating the data, they were imported into a corpus annotation system. The dataset was preprocessed to remove embedded HTML tags and JavaScript code, and to determine paragraph boundaries. Paragraphs play an important role in creating summaries, as explained in the following section.

B. Summarizing Guideline

We hired four Thai language specialists (bachelor's graduates in Thai studies) to conduct the online summarization work. The guidelines for constructing a summary for each news article, which are applicable to any language, are summarized in the following five steps:

Step 1 (Analyze): Analyze the article into four consecutive parts: title, leading paragraph (most newsworthy information), organizing bridge (important details), and supporting details. These four parts are inspired by the theory of news structure known as the Inverted Pyramid of Journalism.

Step 2 (Divide): Divide the supporting details into topics with respect to the leading paragraph. Mark any additional topics that are not mentioned in the leading paragraph as such.

Step 3 (Rank): Rank each sentence in the leading paragraph based on its relevance to the title. Each topic is then ranked according to its relevance to the leading paragraph. Additional topics are assigned the lowest rank. This step helps to identify the content hierarchy of the article.

Step 4 (Extract): Given the preferred word-count range, select the most relevant sentences from the leading paragraph. If the word count allows, also select additional supporting details relevant to the previously extracted sentences. This step yields the extractive summary of the article.

Step 5 (Rewrite): Combine the previously extracted sentences using syntactic processes, such as the formation of compound and complex sentences and sentence paraphrasing. This step yields the abstractive summary of the article.

Our summarization guidelines have a shallow learning curve for language specialists. Besides learning to use the corpus annotation system, they were able to master the guidelines within several days and produce high-quality news summaries using both extractive and abstractive rewriting methods.

In our setting, we set the summary word count to 35-45% of the original content. This range was determined in a preliminary experiment in which summaries of different lengths were crafted from the same articles, and summaries within this range were judged to be of the highest quality.

For ease of use, our corpus annotation system was equipped with specialized facilities for summary crafting. The left side of the display shows the original content, whereas the right side shows extractive and abstractive summaries in the two divided panes. Each pane has a word counter that flashes green when the summary is in the preferred range of word counts (35-45%).

C. Annotators

As described above, four Thai language specialists were hired to build the text summarization corpus. We provided them with the summarization guidelines and an online summarizing and checking system, and they constructed the summaries accordingly.

Fig. 2. Corpus annotation tool

D. Summarizing Software and Output Format

For corpus construction, we offered an online tool called the news summarizing and checking system. The system serves as an online corpus editor that allows users to work from any location. It was designed to assist linguists in constructing the text summarization corpus and displays the essential details for summarizing, namely the original text and the word frequencies within it.

E. Data Statistics

The ThEconSum dataset contains 3,189 documents with both extractive and abstractive summaries. The average word count of the documents was approximately 766, and the average word counts of the extractive and abstractive summaries were 340 and 334, respectively. The summaries are thus approximately 44% of the original content in length.

F. Exploratory Data Analysis

Fig. 3 shows the word clouds of the most frequent words in the dataset. The most frequent words include “million baht,” “public company limited,” “expected,” and “in year.” This signifies the predominant emphasis of economic news on a company’s metrics and future predictions.

Fig. 3. Word cloud of the dataset

Fig. 4 illustrates the length distribution of the content and summaries, both of which exhibit symmetrical distributions around their respective means. The content length spanned 400-1200 words, with a mean and median of approximately 770 words. By contrast, the summaries were between 200 and 500 words, with a mean of 340 words. Additionally, based on the mean and median values, extractive summaries were slightly shorter than abstractive summaries.

Fig. 4. Distribution of text lengths in the dataset

ATS comprises three steps. First, preprocessing: the text is processed with stop-word removal, named-entity recognition (NER), sentence segmentation, and word tokenization; named entities are usually replaced by special tags to improve system accuracy. Second, summarization: the structured text is analyzed, salient information is identified, and a summary is generated. Finally, post-processing: language-specific challenges, such as anaphora resolution and named-entity replacement, are addressed to form the final textual summary.
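The three-stage pipeline can be sketched as follows. This is a minimal illustration with placeholder stages: whitespace splitting stands in for a Thai tokenizer, and `summarize_fn` stands in for any trained model. Only the overall structure, with named-entity masking in preprocessing and restoration in post-processing, reflects the description above:

```python
def preprocess(text, named_entities):
    """Mask named entities with special tags, then tokenize
    (whitespace is a stand-in for a real Thai tokenizer)."""
    for idx, ne in enumerate(named_entities):
        text = text.replace(ne, f"<NE{idx}>")
    return text.split()

def postprocess(tokens, named_entities):
    """Restore the masked named entities and join the final summary."""
    out = " ".join(tokens)
    for idx, ne in enumerate(named_entities):
        out = out.replace(f"<NE{idx}>", ne)
    return out

def summarize(text, named_entities, summarize_fn):
    tokens = preprocess(text, named_entities)
    summary_tokens = summarize_fn(tokens)  # salient-content selection
    return postprocess(summary_tokens, named_entities)

# Toy run: the "model" simply keeps the first five tokens.
print(summarize("PTT posted record profit this quarter analysts said",
                ["PTT"], lambda toks: toks[:5]))
# -> PTT posted record profit this
```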

A. Baseline Models

We conducted two sets of experiments on the dataset. In both, the text was not tokenized into sentences. We believe the resulting accuracy reflects the consistency of the summarization, as suggested by the guidelines.

In the first experiment, we trained a transformer model from scratch on our dataset and evaluated its accuracy. We used the default transformer configuration: six encoder layers, six decoder layers, and sinusoidal position embeddings, with a maximum document length of 2,048 subword tokens. The word embeddings had 512 dimensions. We used the parameter initialization method of Huang et al. (2020), and the embedding layers of the encoder and decoder were shared. The model has 120 million parameters.

Fig. 5. Architecture of an abstractive text summarization

In the second experiment, we used pre-trained language models for Thai and trained the decoder on our dataset. We compared two pre-trained language models, mT5 and HoogBERTa, as described below.

  • mT5-small: mT5 is a cross-lingual pre-trained model for multiple tasks, including machine translation and ATS. It was pre-trained with a data collection of 101 languages from Common Crawl. Due to time and resource constraints, we chose mT5-small, a smaller version with 300 million parameters, as the baseline model.

  • The Pre-trained LM + Transformer: In Fig. 6, we utilize word embeddings from HoogBERTa, a Thai RoBERTa-based pre-trained model. Because HoogBERTa accepts at most 512 subword tokens, we divided longer documents into non-overlapping 500-subword chunks and extracted the features of each chunk independently. The final document-level representation is formed by combining the chunk-level representations. The model architecture uses sinusoidal position encoding with six encoder and six decoder layers. To accelerate convergence, we froze the parameters of HoogBERTa and trained only the decoder. To ensure compatibility with the pre-trained embedding size, the encoder word embeddings were set to 768 dimensions. The model has a total of 120 million parameters.

    Fig. 6. The encoder of the model, ‘Pre-trained LM + Transformer’
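The non-overlapping chunking scheme described above can be sketched as follows. The function name and the toy input are our own; the 500-subword chunk size and the 512-token encoder limit are from the paper:

```python
def chunk_subwords(subword_ids, chunk_size=500):
    """Split a subword sequence into consecutive, non-overlapping chunks
    so each fits within the encoder's 512-token window."""
    return [subword_ids[i:i + chunk_size]
            for i in range(0, len(subword_ids), chunk_size)]

ids = list(range(1234))          # a document of 1,234 subword tokens
chunks = chunk_subwords(ids)
print([len(c) for c in chunks])  # -> [500, 500, 234]
```

Each chunk would then be encoded independently and the chunk-level features combined into a document-level representation.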

B. Experimental Setup

The ThEconSum dataset used in these experiments was published on AI for THAI [18]. The dataset consists of 3,189 entries, where each document is paired with extractive and abstractive summaries. We followed the standard 80-10-10 train-test-validation split. To improve the meaningfulness of the subwords, we first tokenized the articles and summaries into words using the Longan library available on AI for THAI [18]. The average word counts of the documents and summaries were 766 and 340, respectively.
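The 80-10-10 split can be sketched as follows; the shuffling and the seed are our own illustrative choices, not details from the paper:

```python
import random

def train_test_valid_split(items, seed=42):
    """Shuffle and split a dataset into 80% train, 10% test,
    and the remaining ~10% validation."""
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle
    n = len(items)
    n_train, n_test = int(n * 0.8), int(n * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_test],
            items[n_train + n_test:])

train, test, valid = train_test_valid_split(range(3189))
print(len(train), len(test), len(valid))  # -> 2551 318 320
```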

The AdamW optimizer [19] was used to train the decoders of each model. We set the learning rate to 1e-3 and iterated the training for 60 epochs with a batch size of two. Validation was performed every 40 iterations. When generating the text, we set the specific hyperparameters as follows:

  • Non-repetition n-gram size = 2

  • Width of beam search = 2

  • Repetitive penalty = 1.2

  • Length penalty = 0.6
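The decoding settings listed above can be written as keyword arguments in the style of the Hugging Face `generate()` API. The parameter names below are the library's conventions, not the paper's; the values are the paper's settings, and the commented-out call is hypothetical:

```python
# Decoding hyperparameters from the experimental setup.
generation_kwargs = dict(
    no_repeat_ngram_size=2,   # non-repetition n-gram size
    num_beams=2,              # width of beam search
    repetition_penalty=1.2,   # repetitive penalty
    length_penalty=0.6,       # length penalty
)

# Hypothetical usage with a seq2seq model:
# summary_ids = model.generate(input_ids, **generation_kwargs)
print(generation_kwargs)
```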

C. Evaluation Metrics

Automatic Evaluation: Following the literature, the standard metric for evaluating ATS is the Recall-Oriented Understudy for Gisting Evaluation score (ROUGE) [20]. ROUGE-n compares the n-gram occurrences in a given predicted summary with those in the reference summaries. Popular metrics include ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (longest common subsequence). These metrics were also used in the experiments.
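To make the metrics concrete, here is a simplified, recall-only sketch of ROUGE-n and ROUGE-L over pre-tokenized word lists. Production evaluations typically use an established ROUGE package rather than this illustration:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Recall-oriented ROUGE-n: overlapping n-grams / reference n-grams."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram overlap
    total = sum(ref.values())
    return overlap / total if total else 0.0

def rouge_l(candidate, reference):
    """ROUGE-L recall: longest common subsequence / reference length."""
    m, n_ = len(candidate), len(reference)
    dp = [[0] * (n_ + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n_):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n_] / n_ if n_ else 0.0

ref = "the bank raised its growth forecast".split()
cand = "the bank raised forecast".split()
print(rouge_n(cand, ref, 1))  # 4 of 6 reference unigrams covered
print(rouge_l(cand, ref))     # LCS length 4 over 6 reference words
```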

Manual Evaluation: Thai language specialists also conducted a manual evaluation for a qualitative assessment of the produced summaries. The following criteria were considered:

  • Readability: When evaluating readability, it is important to ensure that the summary maintains a coherent rhetorical structure without any gaps or dangling anaphoras.

  • Grammaticality: The produced sentence should be grammatical.

  • Structure and Coherence: The summary should have relevant, cohesive phrases with a clear organization of the structure.

  • Content coverage: Every issue discussed in the original documents should be included in the summary, together with essential details.

  • Non-redundancy: Redundant text and duplications should not be included in the summary.

Each assessment criterion is scored on a scale of 1 (very poor) to 10 (exceptional).
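The aggregation behind the mean and standard-deviation figures reported later can be sketched as follows; the ratings shown are hypothetical examples, not the paper's raw data:

```python
import statistics

def summarize_scores(scores_by_criterion):
    """Aggregate per-criterion 1-10 ratings into (mean, sample S.D.)
    pairs, as reported in the manual-evaluation tables."""
    return {crit: (round(statistics.mean(s), 1),
                   round(statistics.stdev(s), 3))
            for crit, s in scores_by_criterion.items()}

# Hypothetical ratings from four evaluators.
print(summarize_scores({"Readability": [4, 5, 6, 7],
                        "Grammaticality": [6, 6, 7, 7]}))
```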

A. Results of Automatic Evaluation

Table 2 presents a comparison of the ROUGE scores of the ATS-TH-01 (mT5-small), ATS-TH-02 (vanilla transformer), and ATS-TH-03 (Transformer + HoogBERTa) models. ATS-TH-03 significantly outperformed the other models. When we replaced the mT5 model with a vanilla transformer, we observed an improvement in ROUGE-1 accuracy, suggesting that the basic transformer is more efficient at learning keywords than mT5. Accuracy increased further when we used HoogBERTa as the pre-trained encoder, and we observed a similar trend for the ROUGE-2 and ROUGE-L metrics. This is because HoogBERTa is better trained on the Thai language.

Table 2 . Results on the economic news dataset

Model Code   Abstractive Model         R-1      R-2      R-L
ATS-TH-01    mT5-small                 0.3090   0.1724   0.2952
ATS-TH-02    Transformer               0.5975   0.2326   0.4813
ATS-TH-03    Transformer + HoogBERTa   0.6228   0.3889   0.5623


B. Manual Evaluation of Summaries

The results in Table 3 indicate that ATS-TH-03 (Transformer + HoogBERTa) outperformed the other models in the qualitative assessment. Owing to its pre-trained encoder, it produced more understandable and grammatical summaries than the standard transformer. ATS-TH-03 also performed better than ATS-TH-01 (mT5-small), which implies that multilingual models are likely to underperform monolingual ones when sufficient in-language training data are available.

Table 3 . Assessment of the quality of the system-generated summaries

Criteria                      ATS-TH-01        ATS-TH-02        ATS-TH-03
                              Avg.    S.D.     Avg.    S.D.     Avg.    S.D.
1. Readability                4.4     1.506    2.6     1.838    6.1     2.183
2. Grammaticality             5.8     1.135    3.7     2.003    6.6     1.647
3. Structure and Coherence    4.5     1.354    3.0     2.160    6.1     2.132
4. Content coverage           4.1     1.729    2.7     1.767    5.6     2.271
5. Non-redundancy             6.9     1.101    2.8     1.932    4.8     2.658


As suggested by the structural coherence and content coverage criteria, ATS-TH-03 (Transformer + HoogBERTa) outperformed the other models. When analyzing the produced summaries, we discovered that they preserved more meaningful relevant keywords and keyphrases. This improvement can be attributed to the multihead attention mechanism and syntactic information embedded in the pre-trained model.

However, for the non-redundancy criterion, ATS-TH-01 (mT5-small) outperformed the other models. We assume that this is related to our hyperparameter settings (repetition penalty = 1.2; length penalty = 0.6), which discourage excessive repetition of terms. Despite its higher non-redundancy score, however, its outputs were not human-readable, as illustrated in the following sections.

A. Linguistic Analysis

ATS-TH-01 (mT5-small) was unable to capture the main points of the news articles. Although its generated summaries may be grammatically correct and contain relevant keywords, they tend to be disorganized and lack coherence. Compared with the other models, ATS-TH-01 produced fewer repetitions and duplications.

ATS-TH-02 (vanilla Transformer) exhibits the lowest readability. This model randomly selects text fragments from a source document and combines them into an output, resulting in an incoherent summary. Relevant keywords and keyphrases were rarely generated, whereas repetitions were frequently found.

The ATS-TH-03 (Transformer + HoogBERTa) demonstrated superior performance. It exhibits the highest level of readability and effectively covers relevant keywords. Additionally, there were significantly fewer repetitions and duplicates.

Furthermore, we investigated the correlation between document length and the ROUGE scores of the ATS-TH-03 model output. We categorized the test data into three length ranges and evaluated their individual ROUGE scores (Table 4). The summaries of news articles between 801 and 1,000 words achieved the highest ROUGE-1, ROUGE-2, and ROUGE-L scores.

Table 4 . Effects of document lengths and ROUGE Scores of the ATS-TH-03 Model

Document Length (words)   Number of Documents   R-1      R-2      R-L
400-600                   103                   0.6102   0.3880   0.5639
601-800                   125                   0.6171   0.3726   0.5462
801-1000                  92                    0.6444   0.4122   0.5823


B. Implications

Field generalization: Because the model was trained on a specific domain, it has the potential to be utilized for ATS in the economic domain. Our experimental results demonstrate that a general-domain pre-trained model can be fine-tuned with a relatively small domain-specific dataset and still yield promising quantitative and qualitative results.

Limitations: First, the stylistics of the training data directly affect the accuracy of the ATS. Because the model was trained on news articles written in formal Thai, it recognized the structure of the written language. When applied to social media data, such as Facebook and Twitter, the accuracy may decrease significantly because of informal spoken language. Second, the maximum input and output lengths significantly affect the accuracy of the ATS. For all the models, we set the maximum input length to 1,024 tokens and the maximum output length to 512 tokens. Varying these numbers may result in shorter or longer summaries; however, this does not guarantee an increase in quality gain.

This paper describes the development of a language resource for ATS, called ThEconSum, in which we address several linguistic challenges in Thai. This study makes three major contributions. First, we present a standardized dataset for Thai ATS, accompanied by detailed construction guidelines. Second, we introduce three baseline models based on language models. Finally, the dataset can be effectively employed for knowledge distillation, especially to train large-scale ATS models designed for economics; for instance, it can be used to create a domain-specific ATS system via a teacher-student distillation setup.

Our future work is threefold. First, we will investigate the subjectivity of summarization tasks and design more precise data-construction rules. Second, we will incorporate semantic roles similar to the 5W1H questions (who, what, when, where, why, and how) into cross-domain ATS models to enhance their accuracy and coverage. Finally, we aim to select the core essence of articles by leveraging weak signals identified through graph-based semantic interpretation.

This study was conducted with collaborative support from Promes Co., Ltd. (Backyard Group) as part of the Joint Research Project funded by the National Science and Technology Development Agency (NSTDA), Thailand. Additionally, we extend our gratitude to PTT Digital Solutions Co. Ltd., Thailand for generously providing public datasets for Thai ATS. The Program Management Unit for National Competitiveness Enhancement, under the Office of the National Higher Education Science Research and Innovation Policy Council in Thailand, provided financial support for data collection and construction.

  1. S. Deo and D. Banik, “Text summarization using TextRank and LexRank through latent semantic analysis,” in Proceedings of the International Conference on Information Technology (OCIT), Odisha, India, pp. 113-118, 2022. DOI: 10.1109/OCIT56763.2022.00031.
  2. K. Kaikhah, “Automatic text summarization with neural networks,” in Proceedings of the 2nd International IEEE Conference on Intelligent Systems, Varna, Bulgaria, pp. 40-44, 2004. DOI: 10.1109/IS.2004.1344614.
  3. Z. Yang, Y. Dong, J. Deng, B. Sha, and T. Xu, “Research on automatic news text summarization technology based on GPT2 model,” in Proceedings of the 3rd International Conference on Artificial Intelligence and Advanced Manufacture, Manchester, United Kingdom, pp. 418-423, 2021. DOI: 10.1145/3495018.3495091.
  4. N. Chumpolsathien, “Using knowledge distillation from keyword extraction to improve the informativeness of neural cross-lingual summarization,” Master’s thesis, Beijing Institute of Technology, 2020.
  5. J. Zhu, Q. Wang, Y. Wang, Y. Zhou, J. Zhang, S. Wang, and C. Zong, “NCLS: Neural cross-lingual summarization,” arXiv preprint arXiv:1909.00156, Aug. 2019. DOI: 10.48550/arXiv.1909.00156.
  6. T. Hasan et al., “XL-Sum: Large-scale multilingual abstractive summarization for 44 languages,” arXiv preprint arXiv:2106.13822, Jun. 2021. DOI: 10.48550/arXiv.2106.13822.
  7. N. Ketui, T. Theeramunkong, and C. Onsuwan, “An EDU-based approach for Thai multi-document summarization and its application,” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 14, no. 1, pp. 1-26, Jan. 2015. DOI: 10.1145/2641567.
  8. S. Jumpathong, T. Theeramunkong, T. Supnithi, and M. Okumura, “A performance analysis of deep-learning-based Thai news abstractive summarization: Word positions and document length,” in Proceedings of the 7th International Conference on Business and Industrial Research (ICBIR), Bangkok, Thailand, pp. 279-284, 2022. DOI: 10.1109/ICBIR54589.2022.9786413.
  9. W. S. El-Kassas, C. R. Salama, A. A. Rafea, and H. K. Mohamed, “Automatic text summarization: A comprehensive survey,” Expert Systems with Applications, vol. 165, p. 113679, Mar. 2021. DOI: 10.1016/j.eswa.2020.113679.
  10. P. Wang, B. Xu, J. Xu, G. Tian, C. L. Liu, and H. Hao, “Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification,” Neurocomputing, vol. 174, pp. 806-814, 2016. DOI: 10.1016/j.neucom.2015.09.096.
  11. M. Joshi, H. Wang, and S. McClean, “Dense semantic graph and its application in single document summarisation,” in Emerging Ideas on Information Filtering and Retrieval, Springer, pp. 55-67, Oct. 2017. DOI: 10.1007/978-3-319-68392-8_4.
  12. R. Z. Al-Abdallah and A. T. Al-Taani, “Arabic single-document text summarization using particle swarm optimization algorithm,” Procedia Computer Science, vol. 117, pp. 30-37, 2017. DOI: 10.1016/j.procs.2017.10.091.
  13. K. Krishnakumari and E. Sivasankar, “Scalable aspect-based summarization in the Hadoop environment,” in Big Data Analytics, Springer, pp. 439-449, Oct. 2017. DOI: 10.1007/978-981-10-6620-7_42.
  14. P. Nuanplord and M. Sodanil, “Health news summarization using semantic ontology,” in Proceedings of the 3rd International Conference on Next Generation Computing, Chiang Mai, Thailand, 2017.
  15. C. Yongkiatpanich and D. Wichadakul, “Extractive text summarization using ontology and graph-based method,” in Proceedings of the 4th International Conference on Computer and Communication Systems (ICCCS), pp. 105-110, 2019. DOI: 10.1109/CCOMS.2019.8821755.
  16. O. Chaowalit and O. Sornil, “Abstractive Thai opinion summarization,” Advanced Materials Research, vol. 971-973, pp. 2273-2280, 2014. DOI: 10.4028/www.scientific.net/AMR.971-973.2273.
  17. L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mT5: A massively multilingual pre-trained text-to-text transformer,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483-498, Jun. 2021. DOI: 10.18653/v1/2021.naacl-main.41.
  18. “Economic news dataset,” Thailand’s National Electronics and Computer Technology Center, Thailand, [Internet], Available: https://aiforthai.in.th/.
  19. I. Loshchilov and F. Hutter, “Fixing weight decay regularization in Adam,” in 6th International Conference on Learning Representations, pp. 1-14, 2018. [Internet], Available: https://openreview.net/pdf?id=rk6qdGgCZ.
  20. C. Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), pp. 25-26, 2004.
  21. S. Jumpathong, A. Takhom, P. Boonkwan, V. Sutantayawalee, P. Porkaew, S. Phaholphinyo, C. Phrombut, T. Supnithi, K. Choke-Mangmi, S. Yamasathien, N. Tretasayuth, K. Kanwatchara, and A. Aiemleuk, “ThEconSum: an economics-domained dataset for Thai text summarization and baseline models,” in Proceedings of the 2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Chiang Mai, Thailand, pp. 1-6, 2022. DOI: 10.1109/iSAI-NLP56921.2022.9960271.

Sawittree Jumpathong

She received the B.Sc. degree in Computer Science from Naresuan University in 2008 and M.Eng. degree in Information and Communication Technology for Embedded Systems from SIIT, Thammasat University, in 2022. Since 2008, she has been working with the Language and Semantic Technology Lab at NECTEC in Thailand.


Akkharawoot Takhom

He earned his B.S. degree in Management of Information Technology from Mae Fah Luang University in 2009 and his M.Eng. degree in Information and Communication Technology for Embedded Systems from SIIT in 2013. Subsequently, he completed his Ph.D. in Knowledge Science from JAIST in 2018, followed by another Ph.D. in Engineering and Technology from SIIT in 2019. In 2021, he worked as a researcher at the Language and Semantic Technology Lab at NECTEC. Since 2023, he has been serving as a lecturer in the Faculty of Engineering at the Thammasat School of Engineering, Thammasat University, Thailand.


Prachya Boonkwan

He received B.Eng. and M.Eng. degrees in Computer Engineering from Kasetsart University in 2002 and 2005, respectively. He received a Ph.D. degree in Informatics from the University of Edinburgh, UK, in 2014. Since 2005, he has been with Language and Semantic Technology Lab at NECTEC in Thailand.


Vipas Sutantayawalee

He is an experienced research assistant with a demonstrated history of working in the research industry, skilled in Python, natural language generation, computer science, and natural language processing. He received a Master’s degree focused on Natural Language Processing from the University of Southern California in 2012. Before that, he received a Bachelor’s degree in Computer Engineering from King Mongkut’s Institute of Technology Ladkrabang in 2007.


Peerachet Porkaew

He received a B.Eng. in Computer Engineering from Chiang Mai University in 2007 and an M.S. in Computer Science from the University of Chinese Academy of Sciences in 2018. Since 2008, he has been an active member of NECTEC's Language and Semantic Technology Lab. He is now pursuing a PhD at the Institute of Computing Technology, University of Chinese Academy of Sciences.


Sitthaa Phaholphinyo

He received his B.A. in Linguistics from Thammasat University in 1997 and M.A. in Language and Communication from National Institute of Development Administration in 2007. Since 2000, he has been working with the Language and Semantic Technology Lab at NECTEC in Thailand.


Charun Phrombut

He received the B.Eng. degree in Computer Engineering from Suranaree University of Technology in 2011. Since 2019, he has been working with the Language and Semantic Technology Lab at NECTEC in Thailand.


Khemarath Choke-mangmi

He received a B.S. degree in Physics from Chulalongkorn University and an MBA degree from the National Institute of Development Administration. He has served PTT Digital Solutions as the Head of the Digital Innovation and Technology Center, has been an Advisory Sub-committee member of NECTEC for two consecutive terms, and has served as the Vice President for Research & Innovation of the Thai IoT Association. Recently, he became a member of the King Mongkut’s Institute of Technology Ladkrabang Advancement Advisory Board.


Saran Yamasathien

He received a B.B.A. degree in Business Information Systems from Assumption University in 2009 and an M.Sc. degree in Software Engineering from Chulalongkorn University in 2014. Since 2015, he has been working with the Digital Innovation and Technology Center at PTT Digital Solution in Bangkok, Thailand.


Nattachai Tretasayuth

He earned both a Bachelor of Engineering and a Master of Engineering degree in Computer Engineering from Chulalongkorn University in 2013 and 2018, respectively. Since 2018, he has been employed at PTT Digital Solution's Digital Innovation and Technology Center in Thailand.


Kasidis Kanwatchara

He received his B.Eng. and M.Eng. degrees in Computer Engineering from Chulalongkorn University in 2019 and 2022, respectively. Since 2019, he has been working with Digital Innovation and Technology Center at PTT Digital Solution in Thailand.


Atiwat Aiemleuk

He received a Bachelor’s degree of Science in Information Technology from Burapha University in 2017. Since 2017, he has been working with Digital Innovation and Technology Center at PTT Digital Solution in Thailand.


Thepchai Supnithi

He received a B.S. degree in Mathematics from Chulalongkorn University in 1992. He received M.S. and Ph.D. degrees in Engineering from Osaka University, Japan, in 1997 and 2001, respectively. Since 2001, he has been working with Artificial Intelligence Research Group at NECTEC in Thailand.



Keywords: Abstractive Text Summarization, Natural Language Processing, Transformer Model, Public Dataset, Economics Domain

I. INTRODUCTION

Automatic text summarization (ATS) shortens lengthy text to produce a summary containing the most important information in the original content. ATS plays an important role in addressing information overload and can be characterized by its operations: homogeneity (single-document vs. multi-sourced) and rewriting approach (extractive vs. abstractive). State-of-the-art techniques for ATS include eigen-decomposition [1], neural methods [2], and large language models [3]. However, all of these techniques require a large amount of training data to learn to identify the gist of the original content and rewrite it in an understandable form. In low-resource languages, such resources for ATS remain scarce.

Compared to popular languages, publicly accessible datasets for ATS in the Thai language are relatively scarce. Thai orthography poses challenges for language preprocessing owing to its complex nature: it lacks explicit markers for word and sentence boundaries, making annotation a challenging task for linguists. TR-TPBS [4], ThaiSum [4], Thai-CrossSum [5], and XL-Sum [6] are datasets created for Thai ATS based on differing and unattested guidelines. ATS techniques aim to shorten a given text and extract relevant information from large amounts of textual data. For Thai, ATS can be performed either extractively [7] or abstractively [8]. Although these datasets are publicly available, they are not designed to evaluate different rewriting methods, and no annotation guidelines accompanied their construction.

The dataset must be prepared before developing the ATS system, which involves three steps: source observation, data collection, and annotation. The goal of this study is to construct an ATS system for economic news articles in the Thai language trained on public datasets. The dataset used in this study was carefully curated from the following reliable news sources: Bangkok Turakij (https://www.bangkokbiznews.com), Prachachat Turakij (https://www.prachachat.net), and Bangkok Today (https://bangkok-today.com).

This paper describes the construction of a Thai language resource called ThEconSum, an ATS corpus of Thai economics news articles that exhibits several linguistic difficulties of Thai summarization. This study makes three contributions. First, a standard ATS dataset is distributed free of charge, together with elaborated guidelines for data construction. Second, we develop baseline models utilizing powerful language models such as BERT and mT5. Finally, the dataset serves as a valuable resource for condensing knowledge for large-scale ATS applications in the economics domain.

The remainder of this paper is organized as follows. Section 2 explains the notable linguistic behaviors of Thai related to ATS and reviews public ATS datasets. Section 3 elaborates on the construction of our dataset and its annotation guidelines. Section 4 benchmarks baselines for Thai ATS using our dataset and reports the results. Finally, Section 5 concludes the paper.

II. RELATED WORK

A. Linguistic Behaviors of Thai

Thai is an analytic language characterized by subject-verb-object (SVO) word order and head-initial structure. Adjectives, adverbs, function words, and syntactic constructions express grammatical features such as number, tense, and voice without any inflection or declension system. Because more words are needed to communicate the same notions under this semantic minimalism, a Thai sentence is often longer than an equivalent translation in a European language. Furthermore, in Thai orthography, word and sentence boundaries are implicit and must be inferred from context.

This poses several challenges for ATS. First, data preparation must include word tokenization and sentence-boundary identification. Second, because Thai lacks an inflection system, compound words can arise without morphological marking, making key phrases difficult to learn. Finally, Thai sentences tend to be long and contain a significant number of function words and complex syntactic structures spanning long ranges. These linguistic characteristics must be carefully considered when designing and developing ATS models for the Thai language.
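
Because Thai is written without word delimiters, the tokenization step mentioned above is usually dictionary-driven. The following is a minimal sketch of greedy longest-matching segmentation, a common baseline for Thai word segmentation; the mini-dictionary and the Latin-script example are hypothetical stand-ins for an actual Thai lexicon.

```python
def longest_match_tokenize(text, dictionary, max_len=20):
    """Greedy longest matching: at each position, take the longest
    dictionary word that matches; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens

# Toy illustration: Latin words written without spaces stand in for Thai.
lexicon = {"economic", "news", "summary"}
print(longest_match_tokenize("economicnewssummary", lexicon))
# → ['economic', 'news', 'summary']
```

Real systems (e.g., those used for Thai corpora) combine such dictionary matching with statistical or neural disambiguation, since greedy matching alone mis-segments ambiguous strings.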

B. Datasets for ATS

Compared to widely spoken languages, public datasets for Thai ATS are relatively limited, primarily because of the challenges involved in preprocessing Thai. In Thai orthography, words are written consecutively without delimiters, and sentences are only loosely delimited by spaces, which causes orthographic ambiguity and makes linguistic annotation a time-consuming task.

However, four datasets specifically catering to Thai ATS are currently available:

  • TR-TPBS [4] was compiled from the ThaiRath and Thai PBS news websites. It contains more than 333,000 article-summary pairs produced by journalists.

  • ThaiSum [4] is an extensive dataset designed specifically for Thai ATS. It was sourced from reliable sources, such as The Standard, Prachatai, Thai PBS, and Thai Rath. This dataset comprises a collection of more than 350,000 article-summary pairs that were meticulously compiled by journalists. On average, the article texts spanned 530 words, whereas the corresponding summaries condensed the information into a concise form of approximately 37 words. ThaiSum is a valuable resource for training and evaluating Thai summarization models.

  • Thai-CrossSum Corpora [5] is an extensive dataset specifically designed for cross-lingual summarization of Thai text. It consists of 310,926 TH2EN (Thai-to-English) and 310,926 TH2ZH (Thai-to-Chinese) article-summary pairs.

  • XL-Sum [6] stands as the largest dataset accessible for multilingual ATS. It consists of a collection of 1.35 million article-summary pairs spanning 44 different languages.

For Thai, XL-Sum contains 8,268 article-summary pairs. Table 1 lists the statistics of the aforementioned datasets. ThaiSum is the largest dataset for Thai news ATS, with approximately 530 words per article and 37 words per summary. The article parts of XL-Sum and our dataset are longer (967 and 766 words, respectively). Although the summary parts of ThaiSum and XL-Sum are comparable in length, the summaries in our dataset are substantially longer. Note that TR-TPBS and Thai-CrossSum were not included in the table because TR-TPBS is already included in ThaiSum, and the issue statement part of Thai-CrossSum is incompatible with our dataset.

Table 1. Comparison of data statistics.

Dataset   | Size    | Avg. article length (words) | Avg. summary length (words) | Annotator
ThaiSum   | 350,000 | 530                         | 37                          | Journalist
XL-Sum*   | 8,268   | 967                         | 53                          | Journalist
This work | 3,189   | 766                         | 334                         | Linguist

* (only Thai).



C. Related Works on ATS

ATS methods can be classified into two main categories: extractive and abstractive. In the extractive approach, a summary is generated by identifying and selecting the most relevant and significant sentences from the original input text. These sentences are then concatenated to form the summary without any further generation or rearrangement; the objective is to present the essential information in condensed form without modifying the original structure or content of the sentences. In the abstractive approach, by contrast, the most significant information in the input text is communicated by reconstructing the summary through rephrasing and stylistic choices.
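
The extractive approach described above can be sketched as a simple frequency-based sentence scorer. This is a toy illustration of the idea, not the method used by any of the cited systems; the example sentences are hypothetical.

```python
from collections import Counter

def extractive_summary(sentences, k=2):
    """Score each sentence by the average corpus frequency of its words,
    then keep the top-k sentences in their original order."""
    words = [s.lower().split() for s in sentences]
    freq = Counter(w for ws in words for w in ws)
    scores = [sum(freq[w] for w in ws) / max(len(ws), 1) for ws in words]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # restore document order

docs = ["the bank raised rates", "the bank cut costs", "weather was nice today"]
print(extractive_summary(docs))
# → ['the bank raised rates', 'the bank cut costs']
```

Note that the selected sentences are emitted verbatim and in document order, matching the definition above: no generation or rearrangement takes place.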

ATS has been widely studied and applied in various fields and is becoming increasingly crucial because of the massive volume of textual material [9]. ANN-based language models are employed to produce a summary by word-by-word prediction given an already generated context [10].

This study focuses on an abstractive approach to single-document summarization (SDS), in which a summary is produced from a single text document while preserving the relevant information in the original text. An abstractive ATS approach requires natural language processing (NLP) techniques to capture the important gists of the original text and then rewrite them as shorter, more understandable text [12,13].

D. Thai Text Summarization

Thai ATS has recently gained attention in the NLP community, and most studies have used the extractive method. Ketui et al. [7] proposed a comprehensive strategy for summarizing multiple documents using an extractive approach, comprising three main components. The first is unit segmentation, a preprocessing technique that breaks text down into meaningful units through word segmentation and part-of-speech (POS) tagging. The second is unit-graph formulation, which creates a graph representation in which each unit is a node with a specific weight and the relationships between nodes are weighted links; this graph structure captures the interconnections and relative importance of the units in the text. The third is unit selection for summarization, which identifies the nodes and links most relevant to the summarization task, thereby allowing the extraction of crucial information from the original documents. Nuanplord and Sodanil [14] suggested an ontology-based strategy for summarizing Thai health news. Yongkiatpanich and Wichadakul [15] suggested an extractive text summarization strategy that integrated ontology and graph-based methodologies. They employed a graph representation that integrated the Unified Medical Language System (UMLS), a comprehensive knowledge ontology from the National Library of Medicine (NLM), combined with Word Mover’s Distance, a distance function commonly used to quantify the similarity between text documents. By leveraging UMLS and Word Mover’s Distance, they aimed to enhance the effectiveness of their summarization method and improve the accuracy of extracting important information from medical texts. To select the main sentences, they retrieved summaries using Google PageRank, a popular graph-based technique.
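
The graph-based selection described above (PageRank over a graph of text units) can be sketched with a small power-iteration implementation. The similarity measure here is plain word overlap, a simplified stand-in for the weighted links and Word Mover’s Distance used in the cited works; the sentences are hypothetical.

```python
def textrank(sim, d=0.85, iters=50):
    """Power-iteration PageRank over a sentence-similarity matrix."""
    n = len(sim)
    rank = [1.0 / n] * n
    out = [sum(row) or 1.0 for row in sim]  # out-weight of each node
    for _ in range(iters):
        rank = [(1 - d) / n +
                d * sum(sim[j][i] / out[j] * rank[j] for j in range(n))
                for i in range(n)]
    return rank

def overlap(a, b):
    """Crude similarity: number of shared words."""
    return len(set(a.split()) & set(b.split()))

sents = ["the economy grew fast", "the economy slowed", "cats sleep a lot"]
sim = [[0 if i == j else overlap(a, b) for j, b in enumerate(sents)]
       for i, a in enumerate(sents)]
scores = textrank(sim)
print(max(range(len(sents)), key=lambda i: scores[i]))  # → 0
```

The two economy sentences reinforce each other through their shared vocabulary and therefore outrank the unrelated sentence, which is exactly the intuition behind unit-graph summarization.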

Using the abstractive method, Chaowalit and Sornil [16] proposed an ATS system for customer reviews. They evaluated their method on genuine reviews of 50 products chosen randomly from a well-known cosmetics website. Jumpathong et al. [8] conducted a performance study of deep-learning-based Thai news abstractive summarization. They examined how the positions of words in the original document affect the constructed summary and found that text length also affects model accuracy.

III. THECONSUM DATASET

The construction of ThEconSum consists of three steps: data collection, construction of summarization guidelines, and summarization. (1) Data curation from news sources: We gathered news articles from reliable sources, such as Bangkok Turakij, Prachachat Turakij, and Bangkok Today. (2) Annotation guideline construction: We created guidelines for systematically and consistently constructing a summary for each news article. The output summary is created using both extractive and abstractive approaches; the extractive method selects the most salient text chunks from the original content without alteration, whereas the abstractive method recreates the relevant information as a shorter summary through paraphrasing and stylistic rewriting. (3) Manual summarization: Four Thai language specialists were hired to create the text summarization corpus. We provided them with the summarization guidelines and an online summarizing tool, a web-based corpus editor accessible from anywhere, and they built the summaries accordingly.

Figure 1. Process of corpus construction

A. Data Collection

We curated news articles for summarization according to the following criteria. First, only articles related to economics were collected, exclusively from reliable sources such as Bangkok Turakij, Prachachat Turakij, and Bangkok Today. To narrow the keyword search, we selected economics articles related to PTT Digital Solutions Company Limited because the company is involved in many economic sectors and some of the authors have relevant business insights. Furthermore, the selected news articles varied in length from 400 to 1,000 words per document. Finally, we selected economic news articles posted between June 2021 and June 2022.

After curation, the data were imported into a corpus annotation system. The dataset was preprocessed to remove embedded HTML tags and JavaScript code and to determine paragraph boundaries. Paragraphs play an important role in creating summaries, as explained in the following section.
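
This preprocessing can be sketched with Python's standard-library HTML parser, which strips tags, drops script/style content, and keeps paragraph breaks. This is a minimal illustration under our own assumptions, not the annotation system's actual code; the HTML snippet is hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags, drop <script>/<style> content, keep paragraph breaks."""
    def __init__(self):
        super().__init__()
        self.parts, self.skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "p":
            self.parts.append("\n")  # paragraph boundary marker
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)
    def text(self):
        paragraphs = "".join(self.parts).split("\n")
        return [p.strip() for p in paragraphs if p.strip()]

p = TextExtractor()
p.feed("<p>First paragraph.</p><script>var x=1;</script><p>Second one.</p>")
print(p.text())  # → ['First paragraph.', 'Second one.']
```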

B. Summarizing Guideline

We hired four Thai language specialists (bachelor’s graduates in Thai studies) to conduct the online summarization work. The guidelines for constructing a summary for each news article, intended to be applicable across languages, are summarized in the following five steps:

Step 1 (Analyze): The article is analyzed into four consecutive parts: title, leading paragraph (the most newsworthy information), organizing bridge (important details), and supporting details. These four parts are inspired by the theory of news structure known as the Inverted Pyramid of Journalism.

Step 2 (Divide): The supporting details are divided into topics with respect to the leading paragraph. Any additional topics not mentioned in the leading paragraph are marked as such.

Step 3 (Rank): Rank each sentence in the leading paragraph based on its relevance to the title. Each topic is then ranked according to its relevance to the leading paragraph. Additional topics are assigned the lowest rank. This step helps to identify the content hierarchy of the article.

Step 4 (Extract): Given the preferred range of word counts, select the most relevant sentences from the leading paragraph. If the word count allows, additionally select supporting details relevant to the previously extracted sentences. This step yields the extractive summary of the article.

Step 5 (Rewrite): Combine the previously extracted sentences using syntactic processes such as compound- and complex-sentence formation and paraphrasing. This step yields the abstractive summary of the article.

Our summarization guidelines have a shallow learning curve for language specialists. Besides learning to use the corpus annotation system, they were able to master the guidelines within several days and provide high-quality news summaries using both extractive and abstractive rewriting methods.

In our setting, we set the word-count threshold to the range of 35-45% of the original content. This range was determined in a preliminary experiment in which summaries of different lengths were crafted from the same articles, and it was judged to yield the highest-quality summaries.

For ease of use, our corpus annotation system was equipped with specialized facilities for summary crafting. The left side of the display shows the original content, whereas the right side shows extractive and abstractive summaries in the two divided panes. Each pane has a word counter that flashes green when the summary is in the preferred range of word counts (35-45%).
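
The word-counter check performed by the annotation tool can be sketched as follows. This is a hypothetical helper written for illustration, not the actual system's code; only the 35-45% range comes from the text above.

```python
def compression_status(original_words: int, summary_words: int,
                       lo: float = 0.35, hi: float = 0.45) -> str:
    """Return 'ok' (counter flashes green) when the summary falls in the
    preferred 35-45% range of the original word count."""
    ratio = summary_words / original_words
    if ratio < lo:
        return "too short"
    if ratio > hi:
        return "too long"
    return "ok"

# 334 summary words against a 766-word article: 334/766 ≈ 0.436
print(compression_status(766, 334))  # → ok
```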

C. Annotators

We hired four Thai language specialists to build the text summarization corpus. We provided them with the summarization guidelines and an online summarizing and checking system, and they built the summaries accordingly.

Figure 2. Corpus annotation tool

D. Summarizing Software and Output Format

For corpus construction, we offered an online tool called the news summarizing and checking system. This system serves as an online corpus editor and allows users to access and work from any location. It was designed to assist linguists in constructing the text summarization corpus and displays the essential details for summarizing, namely the original text and its word frequencies.

E. Data Statistics

The ThEconSum dataset contains 3,189 documents, each with both an extractive and an abstractive summary. The average word count of the documents is approximately 766, and the average word counts of the extractive and abstractive summaries are 340 and 334, respectively. On average, a summary is approximately 44% of the length of the original content.

F. Exploratory Data Analysis

Fig. 3 shows the word clouds of the most frequent words in the dataset. The most frequent words include “million baht,” “public company limited,” “expected,” and “in year.” This signifies the predominant emphasis of economic news on a company’s metrics and future predictions.

Figure 3. Word cloud of the dataset

Fig. 4 illustrates the length distribution of the content and summaries, both of which exhibit symmetrical distributions around their respective means. The content length spanned 400-1200 words, with a mean and median of approximately 770 words. By contrast, the summaries were between 200 and 500 words, with a mean of 340 words. Additionally, based on the mean and median values, extractive summaries were slightly shorter than abstractive summaries.

Figure 4. Skewed distribution of text lengths in the dataset

IV. EXPERIMENT AND BENCHMARK: ABSTRACTIVE ATS MODEL

ATS comprises three steps. First, preprocessing: the text is processed with stop-word removal, named-entity recognition (NER), sentence segmentation, and word tokenization; named entities are usually replaced by special tags to improve system accuracy. Second, summarization: the structured text is analyzed, salient information is identified, and a summary is generated. Finally, post-processing: language-specific issues such as anaphora resolution and named-entity replacement are addressed to form the final textual summary.
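
The three steps above can be sketched as a minimal pipeline. The entity masking, the placeholder summarizer, and the tag restoration here are all hypothetical stand-ins for the real components; only the overall preprocess/summarize/post-process structure comes from the text.

```python
def preprocess(text, entities):
    """Replace known named entities with special tags (e.g., <NE0>)."""
    mapping = {}
    for i, ent in enumerate(entities):
        tag = f"<NE{i}>"
        mapping[tag] = ent
        text = text.replace(ent, tag)
    return text, mapping

def summarize(text):
    # Placeholder summarizer: keep only the first sentence.
    return text.split(". ")[0] + "."

def postprocess(summary, mapping):
    """Restore named entities from their placeholder tags."""
    for tag, ent in mapping.items():
        summary = summary.replace(tag, ent)
    return summary

doc = "PTT reported record revenue. Analysts expect growth next year."
masked, ne_map = preprocess(doc, ["PTT"])
print(postprocess(summarize(masked), ne_map))
# → PTT reported record revenue.
```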

A. Baseline Models

The dataset was divided into two sets of experiments. In both experiments, the text was not tokenized into sentences. We believe that the resulting accuracy reflects the consistency of the summaries, as suggested by the annotation guidelines.

In the first experiment, we trained a transformer model from scratch on our dataset and evaluated its accuracy. We used the default transformer configuration: six encoder layers, six decoder layers, and sinusoidal position embeddings, with a maximum document length of 2,048 subword tokens. The word embeddings had 512 dimensions, and the embedding layers of the encoder and decoder were shared. We used the parameter initialization method of Huang et al. (2020). The model has 120 million parameters.
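This configuration can be sketched with PyTorch's built-in modules. The skeleton below is illustrative only: the vocabulary size (on which the total parameter count also depends) is an assumed value, not taken from the paper.

```python
import math
import torch
import torch.nn as nn

# As described: 6 encoder + 6 decoder layers, d_model = 512,
# sinusoidal position embeddings, inputs up to 2048 subword tokens.
D_MODEL, MAX_LEN, VOCAB = 512, 2048, 32000     # VOCAB is an assumed size

def sinusoidal_positions(max_len, d_model):
    """Standard sinusoidal position-embedding table."""
    pos = torch.arange(max_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

embed = nn.Embedding(VOCAB, D_MODEL)           # shared by encoder and decoder
model = nn.Transformer(d_model=D_MODEL, num_encoder_layers=6,
                       num_decoder_layers=6, batch_first=True)
pe = sinusoidal_positions(MAX_LEN, D_MODEL)

src = torch.randint(0, VOCAB, (1, 16))         # toy source and target
tgt = torch.randint(0, VOCAB, (1, 8))
out = model(embed(src) + pe[:16], embed(tgt) + pe[:8])
print(out.shape)   # torch.Size([1, 8, 512])
```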

Figure 5. Architecture of an abstractive text summarization

In the second experiment, we used pre-trained language models for Thai and trained the decoder using our dataset. We compared two pre-trained language models, mT5 and HoogBERTa, as described below.

  • mT5-small: mT5 is a cross-lingual pre-trained model for multiple tasks, including machine translation and ATS. It was pre-trained on a collection of data covering 101 languages from Common Crawl. Owing to time and resource constraints, we chose mT5-small, a smaller version with 300 million parameters, as the baseline model.

  • Pre-trained LM + Transformer: As shown in Fig. 6, we utilize word embeddings from HoogBERTa, a Thai RoBERTa-based pre-trained model. Because HoogBERTa accepts at most 512 subword tokens, we divided longer documents into non-overlapping 500-subword pieces and extracted the features of each chunk independently. The final document-level representation is formed by concatenating the chunk-level features. The model architecture uses sinusoidal position encoding with six encoder and six decoder layers. To accelerate convergence, we froze the parameters of HoogBERTa and trained only the decoder. To match the pre-trained embedding size, the word embeddings of the encoder were set to 768 dimensions. The model has a total of 120 million parameters.

    Figure 6. The encoder of the model, ‘Pre-trained LM + Transformer’
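The chunking scheme can be sketched as follows; the 768-dimensional encoder here is a stand-in for HoogBERTa, not the actual model:

```python
def chunk_and_encode(token_ids, encode_fn, chunk_size=500):
    """Split a long document into non-overlapping 500-subword pieces,
    encode each piece independently, then concatenate the chunk-level
    features into one document-level representation."""
    chunks = [token_ids[i:i + chunk_size]
              for i in range(0, len(token_ids), chunk_size)]
    features = [encode_fn(chunk) for chunk in chunks]
    return [vector for chunk_feats in features for vector in chunk_feats]

# Stand-in encoder: maps each subword to a 768-dim vector (HoogBERTa's size).
fake_encoder = lambda ids: [[0.0] * 768 for _ in ids]

doc = list(range(1250))                      # a 1,250-subword document
rep = chunk_and_encode(doc, fake_encoder)    # chunks of 500, 500, and 250
print(len(rep), len(rep[0]))                 # 1250 768
```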

B. Experimental Setup

The ThEconSum dataset used in these experiments was published on AI for THAI [18]. The dataset consists of 3,189 entries, each document accompanied by extractive and abstractive summaries. We followed the standard 80-10-10 train-test-validation split. To improve the meaningfulness of the subwords, we first tokenized the articles and summaries into words using the Longan library, available on AI for THAI [18]. The average word counts of the documents and summaries were 766 and 340, respectively.
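An 80-10-10 split of the 3,189 entries can be reproduced as follows; the random seed and the exact rounding are our assumptions, not taken from the paper:

```python
import random

def split_80_10_10(items, seed=42):
    """Shuffle and split into train/validation/test with an 80-10-10 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

docs = [f"doc_{i}" for i in range(3189)]     # ThEconSum has 3,189 entries
train, val, test = split_80_10_10(docs)
print(len(train), len(val), len(test))       # 2551 318 320
```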

The AdamW optimizer [19] was used to train the decoder of each model. We set the learning rate to 1e-3 and trained for 60 epochs with a batch size of two. Validation was performed every 40 iterations. For text generation, we set the following hyperparameters:

  • Non-repetition n-gram size = 2

  • Width of beam search = 2

  • Repetitive penalty = 1.2

  • Length penalty = 0.6
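Expressed as Hugging Face-style `generate()` keyword arguments, the settings above would look like this; the `max_length` cap and the call itself are illustrative, not taken from the paper's code:

```python
# Decoding hyperparameters from the list above.
generation_kwargs = dict(
    no_repeat_ngram_size=2,    # non-repetition n-gram size
    num_beams=2,               # beam search width
    repetition_penalty=1.2,
    length_penalty=0.6,
    max_length=512,            # assumed output cap
)

# Usage sketch (model and input_ids are placeholders):
# summary_ids = model.generate(input_ids, **generation_kwargs)
```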

C. Evaluation Metrics

Automatic Evaluation: Following the literature, the standard metric for evaluating ATS is the Recall-Oriented Understudy for Gisting Evaluation score (ROUGE) [20]. ROUGE-n compares the n-gram occurrences in a given predicted summary with those in the reference summaries. Popular metrics include ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (longest common subsequence). These metrics were also used in the experiments.
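A minimal recall-oriented ROUGE-n, computed as clipped n-gram overlap divided by the reference's n-gram count, can be sketched as:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Recall-oriented ROUGE-n: clipped n-gram overlap divided by the
    number of n-grams in the reference summary."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())     # clipped counts
    return overlap / max(sum(ref.values()), 1)

ref = "profits rose ten percent this year".split()
hyp = "profits rose this year".split()
print(round(rouge_n(hyp, ref, n=1), 2))   # 0.67
print(round(rouge_n(hyp, ref, n=2), 2))   # 0.4
```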

Manual Evaluation: Thai language specialists also conducted a manual evaluation for a qualitative assessment of the produced summaries. The following criteria were considered:

  • Readability: When evaluating readability, it is important to ensure that the summary maintains a coherent rhetorical structure without any gaps or dangling anaphoras.

  • Grammaticality: The produced sentence should be grammatical.

  • Structure and Coherence: The summary should have relevant, cohesive phrases with a clear organization of the structure.

  • Content coverage: Every issue discussed in the original documents should be included in the summary, together with essential details.

  • Non-redundancy: Redundant text and duplications should not be included in the summary.

Each assessment criterion is scored on a scale of 1 (very poor) to 10 (exceptional).

V. RESULTS

A. Results of Automatic Evaluation

Table 2 presents a comparison of the ROUGE scores of the ATS-TH-01 (mT5-small), ATS-TH-02 (vanilla transformer), and ATS-TH-03 (Transformer + HoogBERTa) models. ATS-TH-03 significantly outperformed the other models. When we replaced the mT5 model with a vanilla transformer, we observed an improvement in ROUGE-1 accuracy, which suggests that the basic transformer is more efficient at learning keywords than mT5. Furthermore, accuracy increased when we replaced the pre-trained encoder with HoogBERTa, and we observed a similar trend for the ROUGE-2 and ROUGE-L metrics. This is because HoogBERTa is better trained on the Thai language.

Table 2. Results on the economic news dataset.

Model Code   Abstractive Model         R-1      R-2      R-L
ATS-TH-01    mT5-small                 0.3090   0.1724   0.2952
ATS-TH-02    Transformer               0.5975   0.2326   0.4813
ATS-TH-03    Transformer + HoogBERTa   0.6228   0.3889   0.5623


B. Results of Manual Evaluation

The results in Table 3 indicate that the ATS-TH-03 model (Transformer + HoogBERTa) outperformed the other models in terms of the qualitative criteria. Owing to its pre-trained encoder, it produces more understandable and grammatical summaries than the standard transformer. ATS-TH-03 also performs better than ATS-TH-01 (mT5-small), which implies that multilingual models are likely to underperform monolingual models when sufficient in-language training data are available.

Table 3. Assessment of the quality of the system-generated summaries.

   Criteria                   ATS-TH-01       ATS-TH-02       ATS-TH-03
                              Avg.   S.D.     Avg.   S.D.     Avg.   S.D.
1  Readability                4.4    1.506    2.6    1.838    6.1    2.183
2  Grammaticality             5.8    1.135    3.7    2.003    6.6    1.647
3  Structure and Coherence    4.5    1.354    3.0    2.160    6.1    2.132
4  Content coverage           4.1    1.729    2.7    1.767    5.6    2.271
5  Non-redundancy             6.9    1.101    2.8    1.932    4.8    2.658


As suggested by the structural coherence and content coverage criteria, ATS-TH-03 (Transformer + HoogBERTa) outperformed the other models. When analyzing the produced summaries, we discovered that they preserved more meaningful relevant keywords and keyphrases. This improvement can be attributed to the multihead attention mechanism and syntactic information embedded in the pre-trained model.

However, for the non-redundancy criterion, ATS-TH-01 (mT5-small) outperformed the other models. We assume that this is related to our hyperparameter settings (repetition penalty = 1.2; length penalty = 0.6), which discourage the excessive use of repetitive terms. However, despite its high non-redundancy score, the resulting outputs did not exhibit human-readable characteristics, as illustrated in the following sections.

VI. DISCUSSION

A. Linguistic Analysis

ATS-TH-01 (mT5-small) was unable to capture the main points of news articles. While the generated summaries may be grammatically correct and related to keywords, they tend to be disorganized and lack coherence. Compared to the other models, ATS-TH-01 produced fewer repetitions and duplications.

ATS-TH-02 (vanilla Transformer) exhibits the lowest readability. This model randomly selects text fragments from a source document and combines them into an output, resulting in an incoherent summary. Relevant keywords and keyphrases were rarely generated, whereas repetitions were frequently found.

The ATS-TH-03 (Transformer + HoogBERTa) demonstrated superior performance. It exhibits the highest level of readability and effectively covers relevant keywords. Additionally, there were significantly fewer repetitions and duplicates.

Furthermore, we investigated the correlation between document length and the ROUGE scores of the ATS-TH-03 model output. We categorized the test data into three ranges based on length and evaluated the ROUGE scores of each range (Table 4). The summaries of news articles between 801 and 1,000 words achieved the highest ROUGE-1, ROUGE-2, and ROUGE-L scores.

Table 4. Effects of document length on the ROUGE scores of the ATS-TH-03 model.

Document Length (words)   Number of Documents   R-1      R-2      R-L
400-600                   103                   0.6102   0.3880   0.5639
601-800                   125                   0.6171   0.3726   0.5462
801-1000                  92                    0.6444   0.4122   0.5823


B. Implications

Field generalization: Because the model was trained on a specific domain, it can be utilized for ATS in the economic domain. Our experimental results demonstrate that a general-domain pre-trained model can be fine-tuned with a relatively small domain-specific dataset and still yield promising quantitative and qualitative results.

Limitations: First, the style of the training data directly affects the accuracy of the ATS. Because the model was trained on news articles written in formal Thai, it learned the structure of the written language; when applied to social media data, such as Facebook and Twitter, accuracy may decrease significantly because of informal spoken language. Second, the maximum input and output lengths significantly affect the accuracy of the ATS. For all models, we set the maximum input length to 1,024 tokens and the maximum output length to 512 tokens. Varying these numbers may result in shorter or longer summaries; however, this does not guarantee a quality gain.

VII. CONCLUSION

This paper describes the development of a language resource for ATS, called ThEconSum, in which we address several linguistic challenges in Thai. The study makes three major contributions. First, we present a standardized dataset for Thai ATS, accompanied by detailed annotation guidelines. Second, we introduce three baseline models based on language models. Finally, the dataset can be employed for knowledge distillation, particularly to train large-scale ATS models designed for the economics domain using a teacher-student setup.

Our future work is as follows. First, we will investigate the subjectivity of summarization tasks and design more precise data-construction rules. Second, we will incorporate semantic roles similar to the 5W1H questions (who, what, when, where, why, and how) into cross-domain ATS models to enhance their accuracy and coverage. Finally, we aim to select the core essence of documents by leveraging the weak signals identified through graph-based semantic interpretation.

ACKNOWLEDGEMENTS

This study was conducted with collaborative support from Promes Co., Ltd. (Backyard Group) as part of the Joint Research Project funded by the National Science and Technology Development Agency (NSTDA), Thailand. Additionally, we extend our gratitude to PTT Digital Solutions Co. Ltd., Thailand for generously providing public datasets for Thai ATS. The Program Management Unit for National Competitiveness Enhancement, under the Office of the National Higher Education Science Research and Innovation Policy Council in Thailand, provided financial support for data collection and construction.

Figure 1. Process of corpus construction

Figure 2. Corpus annotation tool

Table 1. Comparison of data statistics.

Dataset     Size      Avg. Article Length (words)   Avg. Summary Length (words)   Annotator
ThaiSum     350,000   530                           37                            Journalist
XL-Sum*     8,268     967                           53                            Journalist
This work   3,189     766                           334                           Linguist

* (only Thai).



References

  1. S. Deo and D. Banik, “Text summarization using textrank and lexrank through latent semantic analysis,” in Proceedings of the International Conference on Information Technology (OCIT), Odisha, India, pp. 113-118, 2022. DOI: 10.1109/OCIT56763.2022.00031.
  2. K. Kaikhah, “Automatic text summarization with neural networks,” in Proceedings of the 2nd International IEEE Conference on Intelligent Systems, Varna, Bulgaria, pp. 40-44, 2004. DOI: 10.1109/IS.2004.1344614.
  3. Z. Yang, Y. Dong, J. Deng, B. Sha, and T. Xu, “Research on automatic news text summarization technology based on GPT2 model,” in Proceedings of the 3rd International Conference on Artificial Intelligence and Advanced Manufacture, Manchester, United Kingdom, pp. 418-423, 2021. DOI: 10.1145/3495018.3495091.
  4. N. Chumpolsathien, “Using knowledge distillation from keyword extraction to improve the informativeness of neural cross-lingual summarization,” Master’s thesis, Beijing Institute of Technology, 2020.
  5. J. Zhu, Q. Wang, Y. Wang, Y. Zhou, J. Zhang, S. Wang, and C. Zong, “NCLS: Neural cross-lingual summarization,” arXiv preprint arXiv:1909.00156, Aug. 2019. DOI: 10.48550/arXiv.1909.00156.
  6. T. Hasan, et al., “XL-Sum: Large-scale multilingual abstractive summarization for 44 languages,” arXiv preprint arXiv:2106.13822, Jun. 2021. DOI: 10.48550/arXiv.2106.13822.
  7. N. Ketui, T. Theeramunkong, and C. Onsuwan, “An EDU-based approach for Thai multi-document summarization and its application,” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 14, no. 1, pp. 1-26, Jan. 2015. DOI: 10.1145/2641567.
  8. S. Jumpathong, T. Theeramunkong, T. Supnithi, and M. Okumura, “A performance analysis of deep-learning-based Thai news abstractive summarization: Word positions and document length,” in Proceedings of the 7th International Conference on Business and Industrial Research (ICBIR), Bangkok, Thailand, pp. 279-284, 2022. DOI: 10.1109/ICBIR54589.2022.9786413.
  9. W. S. El-Kassas, C. R. Salama, A. A. Rafea, and H. K. Mohamed, “Automatic text summarization: A comprehensive survey,” Expert Systems with Applications, vol. 165, p. 113679, Mar. 2021. DOI: 10.1016/j.eswa.2020.113679.
  10. P. Wang, B. Xu, J. Xu, G. Tian, C. L. Liu, and H. Hao, “Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification,” Neurocomputing, vol. 174, pp. 806-814, 2016. DOI: 10.1016/j.neucom.2015.09.096.
  11. M. Joshi, H. Wang, and S. McClean, “Dense semantic graph and its application in single document summarisation,” in Emerging Ideas on Information Filtering and Retrieval, Springer, pp. 55-67, Oct. 2017. DOI: 10.1007/978-3-319-68392-8_4.
  12. R. Z. Al-Abdallah and A. T. Al-Taani, “Arabic single-document text summarization using particle swarm optimization algorithm,” Procedia Computer Science, vol. 117, pp. 30-37, 2017. DOI: 10.1016/j.procs.2017.10.091.
  13. K. Krishnakumari and E. Sivasankar, “Scalable aspect-based summarization in the Hadoop environment,” in Big Data Analytics, Springer, pp. 439-449, Oct. 2017. DOI: 10.1007/978-981-10-6620-7_42.
  14. P. Nuanplord and M. Sodanil, “Health news summarization using semantic ontology,” in Proceedings of the 3rd International Conference on Next Generation Computing, Chiang Mai, Thailand, 2017.
  15. C. Yongkiatpanich and D. Wichadakul, “Extractive text summarization using ontology and graph-based method,” in Proceedings of the 4th International Conference on Computer and Communication Systems (ICCCS), pp. 105-110, 2019. DOI: 10.1109/CCOMS.2019.8821755.
  16. O. Chaowalit and O. Sornil, “Abstractive Thai opinion summarization,” Advanced Materials Research, vol. 971-973, pp. 2273-2280, 2014. DOI: 10.4028/www.scientific.net/AMR.971-973.2273.
  17. L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mT5: A massively multilingual pre-trained text-to-text transformer,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483-498, Jun. 2021. DOI: 10.18653/v1/2021.naacl-main.41.
  18. “Economic news dataset,” Thailand’s National Electronics and Computer Technology Center, Thailand, [Internet], Available: https://aiforthai.in.th/.
  19. I. Loshchilov and F. Hutter, “Fixing weight decay regularization in Adam,” in 6th International Conference on Learning Representations, pp. 1-14, 2018. [Internet], Available: https://openreview.net/pdf?id=rk6qdGgCZ.
  20. C. Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), no. 1, pp. 25-26, 2004.
  21. S. Jumpathong, A. Takhom, P. Boonkwan, V. Sutantayawalee, P. Porkaew, S. Phaholphinyo, C. Phrombut, T. Supnithi, K. Choke-Mangmi, S. Yamasathien, N. Tretasayuth, K. Kanwatchara, and A. Aiemleuk, “ThEconSum: an economics-domained dataset for Thai text summarization and baseline models,” in 2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Chiang Mai, Thailand, pp. 1-6, 2022. DOI: 10.1109/iSAI-NLP56921.2022.9960271.
Journal of Information and Communication Convergence Engineering (J. Inf. Commun. Converg. Eng.)
eISSN 2234-8883
pISSN 2234-8255