Journal of information and communication convergence engineering 2022; 20(2): 113-124
Published online June 30, 2022
https://doi.org/10.6109/jicce.2022.20.2.113
© Korea Institute of Information and Communication Engineering
Muditha Tissera1* and Ruvan Weerasinghe2
1Department of Software Engineering, University of Kelaniya, Kelaniya 11600, Sri Lanka
2School of Computing, University of Colombo, Colombo 00100, Sri Lanka
Correspondence to: Muditha Tissera (E-mail: mudithat@kln.ac.lk, Tel: +94-1129-12709)
Department of Software Engineering, University of Kelaniya, Kelaniya 11600, Sri Lanka.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
News in the form of web data generates increasingly large amounts of information as unstructured text. Because the ability to understand the meaning of news is limited to humans, this growth causes information overload and hinders the effective use of the knowledge embedded in such texts. Automatic Knowledge Extraction (AKE) has therefore become an integral part of the Semantic Web and Natural Language Processing (NLP). Although recent literature shows that AKE has progressed, the results still fall short of expectations. This study proposes a method to automatically extract surface knowledge from English news into a machine-interpretable semantic format (the triple). The proposed technique was designed using the grammatical structure of the sentence, and 11 original rules were discovered. An initial experiment extracted triples from a Sri Lankan news corpus, of which 83.5% were meaningful. The experiment was extended to the British Broadcasting Corporation (BBC) news dataset to demonstrate its generality, yielding a higher meaningful triple extraction rate of 92.6%. These results were validated using the inter-rater agreement method, which confirmed their high reliability.
Keywords: Automatic Knowledge Extraction, Relation extraction, Natural Language Processing, Semantic Web, Triples Extraction
Open domain, also known as “domain-independent” or “unconstrained domain,” refers to unstructured text from news articles, magazines, the World Wide Web (WWW), email, blogs, and social media comments, where the content is not limited to a single domain. These are among the most voluminous information sources available today. The knowledge and information facts embedded in these sources are presented as natural language text, which is unstructured and mostly in heterogeneous formats; thus, only humans can read and understand them. However, humans have limited cognitive processing power, and this never-ending information generation leads to the problem of information overload. Hence, these knowledge sources are not used effectively.
The main objective of this study is to automatically extract surface knowledge from open-domain news sources and convert it into a structured format that machines can interpret. We propose an approach based on the grammatical structure of a sentence to extract triples with a remarkably high rate of meaningful knowledge extraction. The extracted surface knowledge, in the form of triples, was validated using an inter-rater agreement method, which showed high reliability.
Others have already attempted to solve the aforementioned problem through automatic knowledge extraction (AKE) from unstructured text, organizing the results in structured knowledge bases that allow machines to reason over knowledge in a useful way. These attempts include extracting different semantic components such as keywords, key phrases, entities, and relations.
The work “Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information” introduces a novel algorithm for keyword extraction [1]. This study is unique because it does not require a corpus of similar documents. Another interesting domain-dependent keyword extraction approach is found in [2], which extracts biological information from full-text scientific articles rather than limiting the effort to the abstract, as in many similar approaches. A survey analyzing the use of graph-based methods for keyword extraction highlighted that graph-based methods outperform supervised and unsupervised methods in terms of complexity and computational resources [3]. Testing the hypothesis that “keywords are more likely to be found among influential nodes of a graph of words rather than among its nodes high on eigenvector-related centrality measure,” Tixier, Malliaros, and Vazirgiannis attempted to extract keywords from documents [4]. Hasan, Sanyal, Chaki, and Ali [5] conducted an empirical study of keyword extraction and concluded that the Support Vector Machine (SVM) and Conditional Random Field (CRF) methods yielded the best results. A survey of automatic keyword extraction for text summarization was conducted in [6]; its authors introduced a hybrid extraction technique, a codependent algorithm for keyword extraction and text summarization.
Several studies have focused on automatic keyphrase extraction. Liu et al. [7] proposed an unsupervised approach that uses a clustering mechanism to find exemplar terms and then uses those terms to find keyphrases. Ouyang, Li, and Zhang [8] participated in the SemEval-2 task “Automatic Keyphrase Extraction from Scientific Articles” (task 5). In their approach, they identified the core words of the articles as the most essential words in the document and expanded them into proper keyphrases using a “word expansion” approach. The competition and its results are detailed in [9]. Key2Vec is a phrase embedding method used for ranking keyphrases extracted from scientific articles; according to the experimental results, it produced state-of-the-art results on benchmark datasets [10]. Rabby and Azad [11] proposed a rooted-tree-based, domain-independent, automatic keyphrase extraction method based on nominal statistical knowledge. Typically, unsupervised systems have poor accuracy and require a large corpus. To address these drawbacks, Bennani-Smires et al. introduced a novel unsupervised method called “EmbedRank,” which leverages sentence embeddings to extract keyphrases from a single document [12].
Named Entity Recognition (NER) is another important task in knowledge extraction research. KNOWITALL [13] extracted a large collection of named entities from web data and was later enhanced to improve its recall and extraction rate [14]. Ritter et al. [15] showed that the performance of standard NLP tools degrades severely on tweets; they therefore proposed rebuilding the NLP pipeline, starting with POS tagging, chunking, and NER, using a distantly supervised approach.
Semantic binary relation extraction is another area of AKE. Mintz et al. [16] conducted useful research based on the concept of distant supervision. Their methodology does not require labeled corpora, so no domain dependency exists; instead, Freebase was used for distant supervision. Nguyen and Verspoor investigated the performance gained by integrating character-based word representations into a standard Convolutional Neural Network (CNN)-based relation extraction model [17]. SemEval-2018 Task 7 was designed to identify and classify instances of semantic relations between concepts into six discrete categories. An analysis of the approaches of its 32 participants revealed that the most popular methods were CNNs and Long Short-Term Memory (LSTM) networks, with word-embedding-based features calculated from domain-specific corpora [18]. Triple extraction can be considered a form of relation extraction in the knowledge extraction domain. A survey on relation extraction revealed that, of all supervised approaches, including feature-based and kernel-based ones, syntactic tree kernel-based techniques were the most effective [19].
Another popular topic is taxonomy extraction, which focuses on extracting hypernym-hyponym relationships from text. SemEval-2016 Task 13 was designed for taxonomy extraction from a specified multi-domain term list [20]. Maitra and Das came fourth in the monolingual competition (English) and second in the multilingual competition (Dutch, Italian, and French) [21]; they used an unsupervised approach with two modules to produce possible hypernym candidates and then merged the results. Panchenko et al. [22] secured first place in the same competition, using substring matching and lexico-syntactic pattern matching to develop their system, TAXI.
The target domain used in AKE significantly influences the choice of methods and techniques and the expected level of accuracy. Knowledge extraction can therefore be classified as domain-dependent or domain-independent, the latter also known as open domain (nontrivial), which can be extended to massive web data. TEXTRUNNER, an unsupervised Open Information Extraction (OIE) system, extracts information from the unconstrained Web [23] and achieved a 33% lower error rate than KNOWITALL. The Wikipedia-based Open Extractor (WOE) [24] is an extension of TEXTRUNNER with improved precision and recall. Etzioni et al. attempted to scale knowledge extraction to a sizable, heterogeneous web corpus [25]; they argued that their work could be dubbed the second generation of OIE because the novel model doubled the precision and recall of previous systems such as TEXTRUNNER and WOE. The hypothesis that “shallow syntactic knowledge and its implied semantics can be easily acquired and can be used in many areas of a question-answering system” was proven in research on AKE from documents [26]; this approach was implemented in the IBM Watson DeepQA system, improving its overall accuracy by 2.4%. Soderland et al. mapped domain-independent open information extractions (tuples) into ontologies using domains from the DARPA Machine Reading Project [27].
Never-Ending Language Learning (NELL) is a learning agent whose task is to learn to read the web continuously. It represents an alternative machine learning paradigm that more closely models human learning. NELL is successfully improving its reading competence over time and continues to expand its knowledge base of world beliefs [28].
Although recent literature has shown many advancements in the AKE domain, it suffers from low precision and semantic drift in the evaluation of results.
According to David Bennet and Alex Bennet, there are three levels of knowledge: deep, shallow, and surface [29]. Surface knowledge is explicit knowledge that carries minimal meaning and requires no context. Our work targets the extraction of surface knowledge from unstructured Sri Lankan news texts. Fig. 1 depicts a high-level design diagram of the proposed approach.
As depicted in the diagram, several subtasks were performed with the aim of extracting surface knowledge in the form of triples. Our primary data source, the Sri Lankan news corpus, consists of various types of news articles in text format from different domains, such as business and finance, politics, education, and breaking news. It contains 5,409 plain-text files with approximately 116,839 sentences and 2,524,876 words.
First, the news articles were pre-processed and tokenized into sentences. However, not all of these sentences are eligible for knowledge extraction. Our in-depth investigations helped us identify and foresee problematic sentences; we therefore ignored such sentences during pre-processing and set them aside for future work. The ignored sentences include: 1) sentences outside the effective sentence-length range (we set the minimum and maximum lengths to five and 65 words, respectively); 2) sentences with more than two commas; 3) sentences containing quotes; 4) sentences containing a question; 5) sentences in which a semi-colon appears more than once; and 6) sentences starting with special characters (these are mostly metadata in news).
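A minimal Python sketch of these filters follows. This is our illustration, not the authors' code; the regular expression for condition 6 is an assumption.

```python
import re

MIN_WORDS, MAX_WORDS = 5, 65  # effective sentence-length range from the paper

def is_eligible(sentence: str) -> bool:
    """Return True only if the sentence passes all six pre-processing filters."""
    words = sentence.split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):   # 1) length range
        return False
    if sentence.count(",") > 2:                      # 2) more than two commas
        return False
    if any(q in sentence for q in ('"', '“', '”')):  # 3) quotes present
        return False
    if "?" in sentence:                              # 4) question
        return False
    if sentence.count(";") > 1:                      # 5) repeated semi-colons
        return False
    if re.match(r"[^A-Za-z0-9]", sentence):          # 6) leading special character
        return False
    return True
```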
Tokenized sentences were parsed using a dependency parser to obtain the grammatical structure of each sentence. This structure is organized as an array of tokens (the words of the sentence in left-to-right order) with details such as the text/token (the original token text), dependency tag (the syntactic relation connecting the child to its head), head text (the original text of the token head), head part-of-speech (POS) (the POS tag of the token head), and children (the immediate syntactic dependents of the token). Fig. 2 shows an example of the grammatical structure of a sentence.
The triple extraction algorithm, in which the discovered validation rules were implemented, navigates the sentence’s grammatical structure and extracts knowledge facts in the form of a triple. Dependency annotations such as ‘nsubj’, ‘dobj’, and ‘ROOT’ are standard tags typically produced by any dependency parser. The spaCy syntactic dependency parser was chosen for dependency parsing, and all coding was performed using Python version 3.6.5.
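As an illustration only (the example sentence and the en_core_web_sm model name are our assumptions), the grammatical-structure details listed above can be read from a spaCy parse as follows:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # model name assumed; any English model works
doc = nlp("Investigations were handed over to the Criminal Investigation Department.")

for token in doc:
    print(token.text,                                  # text/token
          token.dep_,                                  # dependency tag
          token.head.text,                             # head text
          token.head.pos_,                             # head POS
          [child.text for child in token.children])    # children
```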
The surface knowledge extracted from open-domain news should be stored in a semantic pattern that represents structured knowledge. In our study, the triple, in the form “subject | predicate | object”, was chosen as the best template for this purpose. The following example shows a lengthy unstructured sentence and its extracted knowledge in the form of a triple (triple components are separated by the pipe symbol).
Sentence: “…”
Triple: Investigations | handed over to | Criminal Investigation Department
A set of valuable rules for accurately extracting the components of the triple was discovered by comprehensively analyzing the sentence’s grammatical structure. These rules were implemented as a rule validation layer in the surface knowledge extraction process. During execution, every sentence passes through this layer, and the triple components are extracted accurately. The remainder of this section introduces the discovered rules and their implementations as pseudocode.
Rule 01:
“In basic level triple extraction, ROOT token should be assigned as the Predicate component of the triple. Then, a token with the dependency tag ‘nsubj’ or ‘nsubjpass’ for the subject component and ‘dobj’ or ‘pobj’ or ‘attr’ token as the object component of the triple should be assigned depending on their availability in the grammatical structure.”
The pseudocode shown in Fig. 3 illustrates the logic implemented for the basic-level triple extraction rule (Rule 01) using the sentence’s grammatical structure.
News sentences are lengthy and take many forms: complex, compound, direct speech, indirect speech, active voice, passive voice, and so on. When a sentence contains multiple clauses, this complexity creates a high chance of inaccurate triple extraction. For example,
As the above example shows, there may be multiple tokens with the dependency tag ‘nsubj’, ‘dobj’, ‘pobj’, or ‘attr’ that could be selected as the subject or object component of the triple. This is mainly because of multiple clauses within a single sentence. No matter how complex the sentence is, it has only one ‘ROOT’ token. Identifying the most appropriate token for the subject component and for the object component (out of multiple candidates) is vital to forming a meaningful triple. For this purpose, a new rule was discovered, referred to as Rule 02 or the “Main rule.”
Rule 02 (Main rule):
Although there may be multiple candidate tokens for the subject and object components of the triple, the main rule is satisfied by only one candidate token. Therefore, using this rule, the most appropriate candidate can be selected and a meaningful link created. To implement the main rule, the ROOT token is first identified and assigned to the predicate part of the triple. When extracting the other components, the following condition is checked to make the appropriate link to the ROOT/Predicate:
token.head.text = Predicate and token.head.POS = ‘VERB’ (Where the value for “Predicate” has already been selected.)
Refer to line 5 or line 8 of the pseudocode in Fig. 3 for the implementation details. The pseudocode in Fig. 3 implements the regular rules used to extract the three components of the triple at the basic level. Although such basic-level extraction yields a triple, it may lack the detail needed to be meaningful. Therefore, enhancements are essential, and in all such improvements, the main rule must be adhered to.
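The following is a minimal sketch of how Rules 01 and 02 can be realized over a spaCy parse. It is our illustration, not the authors' Fig. 3 pseudocode; note that a ‘pobj’ token attaches to the predicate through its preposition, so the main-rule head check is applied to the preposition’s head.

```python
SUBJECT_DEPS = {"nsubj", "nsubjpass"}

def extract_basic_triple(doc):
    """Rule 01 (ROOT as predicate) with the Rule 02 (main rule) head check."""
    predicate = next((t for t in doc if t.dep_ == "ROOT"), None)
    if predicate is None or predicate.pos_ != "VERB":
        return None
    subject = obj = None
    for token in doc:
        if token.dep_ in SUBJECT_DEPS and token.head == predicate:
            subject = subject or token          # main rule: head is the predicate
        elif token.dep_ in {"dobj", "attr"} and token.head == predicate:
            obj = obj or token
        elif token.dep_ == "pobj" and token.head.head == predicate:
            obj = obj or token                  # pobj hangs off a preposition
    if subject and obj:
        return (subject.text, predicate.text, obj.text)
    return None
```

Here `doc` is a parsed spaCy Doc, e.g. `extract_basic_triple(nlp("The Act was passed by parliament."))`.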
Basic-level triple extraction may produce a triple with acceptable meaning; affixing appropriate terms to the triple components enhances that meaning. Although our target was surface knowledge, extracting triples with the best possible meaning was the ultimate aim. However, keeping the additional terms to a minimum is crucial but challenging.
For example,
Basic level triple: Act | passed | parliament
Enhanced triple: Independence Act | passed by | British parliament
Rule 03:
For example,
Triple before: Decision | taken | meeting
Triple after: Decision | taken at | meeting
In the above example, the word ‘at’ is the preposition, and ‘meeting’ is the prepositional object assigned as the object component of the triple. With the token ‘at’ added as a suffix to the predicate, the expanded triple yields a better meaning. However, one exception was observed when there are words between the ROOT and the preposition, as below.
For example,
Triple before: He | contended of | memorial
As the above example shows, the output triple may not be meaningful without the middle word “construction.” Therefore, the expected triple can be expressed as follows:
Triple after: He | contended construction of | memorial
As a solution, the relevant words between the selected predicate and the preposition are identified by traversing the grammatical structure backward. Refer to the pseudocode in Fig. 4 for the implementation of Rule 03.
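A sketch of this backward traversal, under the same spaCy assumptions (our illustration, not the Fig. 4 pseudocode):

```python
def prepositional_suffix(predicate, obj):
    """Rule 03 sketch: suffix the predicate with the object's preposition,
    recovering any words between the predicate and that preposition
    (e.g. 'contended construction of')."""
    if obj.dep_ != "pobj":
        return predicate.text
    prep = obj.head                       # the preposition, e.g. 'at' or 'of'
    middle, node = [], prep.head
    while node is not None and node != predicate:
        middle.append(node.text)          # walk back toward the ROOT
        node = node.head if node.head != node else None
    middle.reverse()
    return " ".join([predicate.text] + middle + [prep.text])
```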
Rule 04:
For example,
Triple before: investigations | handed to | Investigation Department
Triple after: investigations | handed over to | Investigation Department
In the above example, ‘handed over’ is a phrasal verb. The grammatical structure denotes the token ‘handed’ as the ROOT, and it is selected as the predicate. Additionally, the token ‘over’ is denoted as a particle and selected as the suffix for the predicate. Refer to the pseudocode in Fig. 5 for the implementation of Rule 04.
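A minimal sketch of this particle check (our illustration, not the Fig. 5 pseudocode; spaCy tags verb particles as ‘prt’):

```python
def particle_suffix(predicate):
    """Rule 04 sketch: append any particle child ('prt') of the ROOT verb,
    so phrasal verbs survive ('handed' -> 'handed over')."""
    particles = [child.text for child in predicate.children if child.dep_ == "prt"]
    return " ".join([predicate.text] + particles)
```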
The tokens selected as subject and object components of the basic-level triple may not be meaningful. Therefore, appropriate prefixes can be combined to render the triple components more meaningful.
Rule 05:
For example,
Triple before: countries | achieved | gains
Triple after: most countries | achieved | significant gains
Once the basic-level triple components have been identified, such prefixes are sought by re-traversing the same grammatical structure. Refer to the pseudocode in Fig. 6 for the implementation of Rule 05.
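The paper does not list which modifier tags are used as prefixes; the sketch below assumes common left-hand modifiers such as adjectives, compounds, numerals, and determiners.

```python
PREFIX_DEPS = {"amod", "compound", "nummod", "det"}  # assumed modifier tags

def with_prefix(token):
    """Rule 05 sketch: prefix a subject/object with its left-hand modifiers
    ('countries' -> 'most countries', 'gains' -> 'significant gains')."""
    mods = [left.text for left in token.lefts if left.dep_ in PREFIX_DEPS]
    return " ".join(mods + [token.text])
```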
As in Rule 05, appropriate tokens can be appended as suffixes to the subject and object components of the triple. This becomes mandatory when the word “of” follows the token selected as the subject or object. When a sentence has phrases such as “lot of people”, “percentage of students”, or “crisis of confidence” as its subject or object, the output triple would otherwise be meaningless.
Rule 06:
Example 1:
Triple before: Lot | taught | important characteristics
Triple after: Lot of teachers | taught | important characteristics
Example 2:
Triple before: we | increase | percentage
Triple after: we | increase | percentage of entrants
Refer to pseudocode in Fig. 7 for the Rule 06 implementation.
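A sketch of the “of” suffix rule (our illustration, not the Fig. 7 pseudocode):

```python
def with_of_suffix(token):
    """Rule 06 sketch: when the token heads an 'of' phrase
    ('percentage of entrants'), append the preposition and its object."""
    for child in token.children:
        if child.dep_ == "prep" and child.text.lower() == "of":
            pobjs = [g.text for g in child.children if g.dep_ == "pobj"]
            if pobjs:
                return f"{token.text} of {pobjs[0]}"
    return token.text
```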
In some sentences, the main verb holding the ‘ROOT’ dependency tag comes with an open clausal complement. For example,
In this example, “build” is the open clausal complement. If the predicate is formed from the ‘ROOT’ token alone, the output triple would be:
Central Bank | hoping | foreign reserves
However, the above triple does not provide the expected meaning. Therefore, Rule 07 has been introduced to address this issue.
Rule 07:
Triple after: Central Bank | build | foreign reserves
Refer to pseudocode in Fig. 8 for the Rule 07 implementation.
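In spaCy, the open clausal complement carries the dependency tag ‘xcomp’; a hedged sketch of Rule 07 (our illustration, not the Fig. 8 pseudocode) follows:

```python
def resolve_open_clause(predicate):
    """Rule 07 sketch: when the ROOT verb (e.g. 'hoping') takes an open
    clausal complement, use that complement verb (e.g. 'build') as the
    predicate instead."""
    for child in predicate.children:
        if child.dep_ == "xcomp" and child.pos_ == "VERB":
            return child
    return predicate
```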
If a sentence is complex, it very often contains linking words. Therefore, triple extraction from such sentences should be handled differently.
For example,
In such complex sentences, “discourse markers” or linking phrases/words are commonly used (for example: although, however, since, assuming that, and in fact).
Rule 08:
“When a sentence is a complex type that has multiple clauses linked to each other using discourse markers, split the sentence into separate segments based on the discourse markers and then triple extraction should be carried out from each segment separately.”
Discourse markers can be identified in the grammatical structure using the tokens with their dependency tag set as “mark.” According to Rule 08, the given example results in two triples extracted from two sentence segments/clauses, as follows:
Segment 1:
Token with a discourse mark tag: “although”
Segment 2:
Triples generated:
Government | cut down | cost of living
politicians | continued | hedonistic lifestyles
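A minimal sketch of this segmentation over a spaCy Doc (our illustration):

```python
def split_on_discourse_markers(doc):
    """Rule 08 sketch: split a complex sentence into segments at tokens
    whose dependency tag is 'mark' (e.g. 'although', 'since'), then run
    triple extraction on each segment separately."""
    segments, start = [], 0
    for token in doc:
        if token.dep_ == "mark":
            if token.i > start:
                segments.append(doc[start:token.i].text)
            start = token.i + 1          # drop the marker itself
    segments.append(doc[start:].text)
    return [s.strip() for s in segments if s.strip()]
```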
Reported speech is a frequently used writing style in news articles. Because our research uses news as its data source, reported speech handling cannot be avoided. In the grammatical structure of a sentence containing the word “said,” the parser identifies “said” as the ‘ROOT’ token; the resulting triple is therefore meaningless. Refer to the following two examples.
Example 1:
Extracted triple: Minister Athukorala | said | MoU
Example 2:
Extracted triple: Sri Lanka Freedom Party | said |
To resolve this anomaly, Rule 09 was implemented.
Rule 09:
In reported speech sentences, the reported clause contains what the original speaker said; it is therefore the more important clause for knowledge extraction. Automatically distinguishing reported clauses from reporting clauses is challenging. Although its success rate is not high, a simple heuristic was identified for this task: in many examples, the reported clause is found in the right-side segment (as in Example 1 above), unless the left-side segment is much longer than the right-side segment and contains a comma “,”. The implementation of this logic is depicted as pseudocode in Fig. 9.
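A plain-string sketch of this heuristic (our rendering of the Fig. 9 logic; the "much longer" threshold of 2× is an assumption):

```python
def reported_clause(sentence: str) -> str:
    """Rule 09 sketch: prefer the right-hand segment around 'said' as the
    reported clause, unless the left segment is much longer and contains
    a comma."""
    left, found, right = sentence.partition("said")
    if not found:
        return sentence
    if len(left.split()) > 2 * len(right.split()) and "," in left:
        return left.strip(" ,")
    return right.strip(" ,")
```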
The grammatical structures of active- and passive-voice sentences differ slightly in two main ways: 1) active-voice sentences use the dependency tag ‘nsubj’ for nouns representing the subject, whereas passive-voice sentences use ‘nsubjpass’ for their nominal subject; 2) passive-voice sentences mostly have a token with the dependency tag ‘agent’ associated with the preposition ‘by’. With minor modifications, the same triple-extraction algorithm can therefore also process passive-voice sentences.
Rule 10:
Rule 10 is implemented by modifying lines 3, 4, and 5 of the basic triple-extraction pseudocode of Fig. 3, as shown in the pseudocode in Fig. 10.
For example,
Extracted triple: dolphins | killed by | fishermen
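A sketch of the passive-voice variant (our illustration, not the Fig. 10 pseudocode; in spaCy the preposition ‘by’ itself typically carries the ‘agent’ tag):

```python
def extract_passive_triple(doc):
    """Rule 10 sketch: subject from 'nsubjpass'; logical object from the
    'pobj' under the 'agent' preposition ('by')."""
    predicate = next((t for t in doc if t.dep_ == "ROOT"), None)
    subject = next((t for t in doc if t.dep_ == "nsubjpass"), None)
    obj = None
    for token in doc:
        if token.dep_ == "agent":
            obj = next((g for g in token.children if g.dep_ == "pobj"), None)
    if predicate and subject and obj:
        return (subject.text, f"{predicate.text} by", obj.text)
    return None
```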
Even though our regular triple extraction algorithm suits passive-voice sentences with the minor modifications described above, one anomaly was observed in the ‘Predicate’ part of some triples, as shown below.
For example,
Extracted triple: Temporary workers | paid | wages
In the example above, the passive sense of the triple is lost. When the agent is unknown (done by whom) and the past participle is identical to the past tense form (for example, paid, made, and read), the passive-voice triple conveys an incorrect meaning. Rule 11 was introduced to address this issue.
Rule 11:
Extracted triple after rule implementation:
Temporary workers | are paid | wages
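A sketch of this auxiliary restoration (our illustration; spaCy tags the passive auxiliary as ‘auxpass’):

```python
def restore_passive_auxiliary(predicate):
    """Rule 11 sketch: keep the passive auxiliary ('are', 'was') with the
    predicate so 'paid' becomes 'are paid' and the passive sense survives."""
    aux = [left.text for left in predicate.lefts if left.dep_ == "auxpass"]
    return " ".join(aux + [predicate.text])
```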
The proposed approach was successfully implemented and validated using the Sri Lankan English news corpus. The algorithmic details are presented as pseudocode in Figs. 3-10.
Output generation:
The output triples were written to a CSV file with the pipe character as the delimiter between triple components. Some of the extracted triples are shown in Fig. 11. Table 1 lists some important statistics of the extraction process. According to Table 1, every valid sentence used for extraction resulted in a well-formed triple (a value exists for all three components). This is likely because sentences foreseen as problematic were filtered out before extraction.
Table 1. Statistical facts of surface knowledge extraction over the Sri Lankan news corpus

| Fact | Value |
|---|---|
| Number of documents in the corpus | 5,409 |
| Number of sentences after pre-processing | 116,839 |
| Number of sentences ignored | 62,638 |
| Number of sentences valid for extraction | 54,201 |
| Number of triples extracted | 54,201 |
| Number of distinct predicates extracted | 10,116 |
| Number of distinct subjects extracted | 6,736 |
| Number of distinct objects extracted | 7,937 |
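Writing pipe-delimited output of this kind is straightforward; a minimal sketch (the file name is an assumption):

```python
import csv

def write_triples(triples, path="triples.csv"):
    """Write (subject, predicate, object) rows with '|' as the delimiter."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="|").writerows(triples)

# e.g. write_triples([("Investigations", "handed over to",
#                      "Criminal Investigation Department")])
```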
Sentences ignored during the algorithm execution:
When a sentence is a command, the ROOT token is the first in the grammatical structure array. In such cases, the subject component of the triple becomes empty, which results in a malformed triple. Therefore, these sentences were ignored.
For example, “Shift the insurance liability toward manufacturers.”
The purpose of this validation process was to evaluate how meaningful the output of the proposed surface knowledge extraction algorithm is. Accuracy was measured by conducting an inter-rater agreement test among four participants selected from our testing team, which consisted of academics and professionals who were expert in English and had different professional backgrounds. Four sample sets of 387 triples each were randomly chosen from the extracted triples using sampling without replacement. Altogether, 1,548 triples were selected and distributed among the four participants. This distribution ensured that every triple was verified by at least three participants, with the meaningfulness/correctness of the extraction decided by majority vote. The validation results of all four members were combined and re-processed to determine the final accuracy rate, presented in Table 2. Because the generated output is large, performing a validation test with high variability and reliability is vital: variability was achieved by randomly choosing the test triples from the entire corpus without replacement, and reliability was assessed using an inter-rater agreement mechanism. The results show that the surface knowledge extraction achieved a meaningful triple extraction rate of 83.5%, indicating that the algorithm yields accurate results.
Table 2. Inter-rater agreement test results – Sri Lankan news corpus

| Fact | Value |
|---|---|
| Total number of triples in the samples | 1,548 |
| Number of triples voted as meaningful | 1,293 |
| Number of triples voted as meaningless | 255 |
| Meaningful triple extraction rate (%) | 83.5 |
| Error rate (%) | 16.5 |
| IRR score (%) | 76 |
However, this performance was not achieved in a single attempt; several improvement cycles were performed, experimenting with and discovering new rules one by one. After the main rule was introduced, the accuracy rate increased from 60% to 83.5%. Rule 02 (the main rule) therefore contributes significantly to the improved accuracy, as it makes accurate linkages among tokens in the grammatical structure.
For the inter-rater agreement test, the level of agreement between raters, also known as Inter-Rater Reliability (IRR), was computed using the percent agreement method. The IRR score obtained was 76%, which confirms the reliability of our testing team.
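The percent agreement computation itself is simple; the sketch below (our illustration, not the authors' procedure) averages pairwise agreement over binary meaningful/meaningless labels aligned by triple:

```python
from itertools import combinations

def percent_agreement(ratings):
    """Percent-agreement IRR: the share of (rater pair, triple) cases in
    which both raters gave the same label. `ratings` is a list of equal-
    length label lists, one per rater (1 = meaningful, 0 = meaningless)."""
    agree = total = 0
    for r1, r2 in combinations(ratings, 2):
        agree += sum(a == b for a, b in zip(r1, r2))
        total += len(r1)
    return 100.0 * agree / total

# e.g. three raters over five triples
print(percent_agreement([[1, 1, 0, 1, 1], [1, 1, 0, 0, 1], [1, 0, 0, 1, 1]]))
```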
The same knowledge extraction algorithm was executed on the BBC news dataset, which contains 2,225 news articles and represents a good public open-domain corpus [30]. Table 3 presents some important statistics for this experiment.
Table 3. Statistical facts of surface knowledge extraction over the BBC news dataset

| Fact | Value |
|---|---|
| Number of documents in the corpus | 2,225 |
| Number of sentences after pre-processing (sentence tokenization) | 41,983 |
| Number of sentences ignored | 23,799 |
| Number of sentences selected as valid for extraction | 18,184 |
| Number of well-formed triples extracted | 18,184 |
| Number of malformed triples extracted | 0 |
The resulting triples are shown in Fig. 12. These extracted triples demonstrate the applicability of our triple-extraction algorithm to any open-domain text.
Validating these results is crucial for claiming effectiveness. Triples extracted from the BBC news corpus were also validated using the inter-rater agreement method, as with the Sri Lankan news corpus. However, the BBC news context may not be familiar to Sri Lankans; therefore, to obtain a fair judgment, an examiner panel of four native British examiners was formed. For this validation, 800 triples were selected and distributed among the four examiners such that every triple was manually verified by at least three qualified examiners; at least two examiners had to agree for a triple to be deemed meaningful/correct. The validation results of all four examiners were combined and re-processed to determine the final accuracy rate, presented in Table 4.
Table 4. Inter-rater agreement test results – BBC news corpus

| Fact | Value |
|---|---|
| Total number of triples in the samples | 800 |
| Number of triples voted as meaningful | 741 |
| Number of triples voted as meaningless | 59 |
| Meaningful triple extraction rate (%) | 92.6 |
| Error rate (%) | 7.4 |
| IRR score (%) | 90 |
The results show that the surface knowledge extraction achieved a 92.6% meaningful triple extraction rate, which is significant. Moreover, the IRR score computed for this validation (using the percent agreement method) was 90%, confirming the high reliability of our validation team.
The BBC news corpus yielded better results than the Sri Lankan news corpus, demonstrating the high accuracy and generalized applicability of our surface knowledge extraction algorithm to open-domain unstructured text.
Even though the surface knowledge extraction accuracy is good, it remains below 100% because of some unhandled exceptions. The triples marked as ‘meaningless’ by the evaluators were mostly caused by the exceptional cases described below.
Unexpected behavior in grammatical structure:
Refer to the example sentence;
Sentences with the word “said” as the ‘ROOT’:
This type of sentence was handled using our own logic (Rule 09) introduced into the extraction algorithm. However, some examples deviate from the assumption made in that rule, and the extraction becomes incorrect.
Sentences with synonyms of “said” as the ‘ROOT’:
Some sentences use synonyms of “said,” such as “stated” and “commented.” However, the rule introduced for “said” cannot be applied to these sentences because their formats differ, making the rule inappropriate.
Identifying the best term from a noun phrase containing “of”:
In the phrase “percentage of entrants,” the most appropriate noun for the triple component is “entrants” (the prepositional object). However, in “claims of abuse,” the most appropriate noun is “claims” (not the prepositional object). Selecting the best word for the triple component from such noun phrases is therefore nontrivial.
Even with the aforementioned unhandled exceptions, the accuracy of our surface knowledge extraction algorithm is remarkable. It is important to emphasize that this accuracy rate was achieved even without using the contextual information of the news articles as a whole.
Our tokenized corpus contained 116,839 sentences, and the surface knowledge extraction algorithm processed them at an average speed of 0.02 seconds per sentence. Our statistics indicate that the total triple extraction time for a single pass over the entire corpus is less than one hour (excluding pre-processing time), which demonstrates good efficiency. The experiments were run on an Intel i5-8350U processor (1.70 GHz, 4 cores, 8 threads) with 16 GB of memory.
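As a rough consistency check (our arithmetic, assuming the full tokenized corpus is processed): 116,839 sentences × 0.02 s/sentence ≈ 2,337 s ≈ 39 minutes, which agrees with the reported sub-hour total.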
Handling reported speech in the surface knowledge extraction algorithm requires improvement: a robust mechanism for identifying the reported clause of a sentence is needed to extract knowledge accurately from reported speech. Identifying the most suitable word from a noun phrase is also required when a preposition and its prepositional object have been appended as a suffix to the subject and/or object component of the triple; “lot of students” (subject = lot, preposition = of, prepositional object = students), “percentage of entrants,” “bunch of grapes,” and “claims of abuse” are examples. In such situations, our algorithm uses the entire noun phrase as the triple component (subject or object) for better meaning, but the triple then becomes lengthy, and if a compound word or modifier is appended, the problem worsens. A method for identifying the most appropriate word for the triple component from such noun phrases is therefore preferable.
The ever-increasing effect of information overload requires humans to be extremely selective about what they read and store as knowledge. Much more could be made accessible if such textual content were encoded into forms that lend themselves more readily to inference. Bridging this gap, converting natural-language textual content into a machine-processable form, requires the extraction of surface knowledge. This study provides a solution by automatically extracting surface knowledge, in the form of triples, from open-domain news.
After a thorough examination of all features of the grammatical structure, 11 important rules were discovered for extracting meaningful surface knowledge. Among these, complex sentence handling, passive-voice handling, and reported speech handling contribute notably to the domain of AKE. The results were validated using an inter-rater agreement test to ensure high reliability, and the tests obtained acceptable IRR scores computed using the percent agreement method. The proposed approach achieved meaningful triple extraction rates of 83.5% on the Sri Lankan news corpus and 92.6% on BBC news, demonstrating significant performance. The pre-processed Sri Lankan English news corpus and the rules discovered for surface knowledge extraction from the grammatical structure of a sentence are valuable contributions of this study.
With these research findings, AKE from open-domain texts becomes feasible: machines can interpret open-domain unstructured text using modern computer-processing power. Our findings will also be useful for the development of the Semantic Web. Finally, news/web data, an extremely large knowledge source, need not go to waste in the future.
Muditha Tissera played a lead development role in the software development industry for more than 15 years before moving to academia in 2013. She received her PhD in Computing (Natural Language Processing) from the University of Colombo, Sri Lanka, in 2020. Her research interests include computational linguistics, automatic knowledge extraction, text analytics, the Semantic Web, and ontological modeling. https://orcid.org/0000-0002-3398-5438
Ruvan Weerasinghe received his PhD in Computing from Cardiff University, UK, and leads a research group in natural language processing at the University of Colombo School of Computing, Sri Lanka. His research interests include data-driven language processing, computational biology and, more recently, complex adaptive systems. https://orcid.org/0000-0002-1392-7791
Journal of information and communication convergence engineering 2022; 20(2): 113-124
Published online June 30, 2022 https://doi.org/10.6109/jicce.2022.20.2.113
Copyright © Korea Institute of Information and Communication Engineering.
Muditha Tissera 1* and Ruvan Weerasinghe2
1Department of Software Engineering, University of Kelaniya, Kelaniya 11600, Sri Lanka
2School of Computing, University of Colombo, Colombo 00100, Sri Lanka
Correspondence to:Muditha Tissera (E-mail: mudithat@kln.ac.lk, Tel: +94-1129-12709)
Department of Software Engineering, University of Kelaniya, Kelaniya 11600, Sri Lanka.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
News in the form of web data generates increasingly large amounts of information as unstructured text. The capability of understanding the meaning of news is limited to humans; thus, it causes information overload. This hinders the effective use of embedded knowledge in such texts. Therefore, Automatic Knowledge Extraction (AKE) has now become an integral part of Semantic web and Natural Language Processing (NLP). Although recent literature shows that AKE has progressed, the results are still behind the expectations. This study proposes a method to auto-extract surface knowledge from English news into a machine-interpretable semantic format (triple). The proposed technique was designed using the grammatical structure of the sentence, and 11 original rules were discovered. The initial experiment extracted triples from the Sri Lankan news corpus, of which 83.5% were meaningful. The experiment was extended to the British Broadcasting Corporation (BBC) news dataset to prove its generic nature. This demonstrated a higher meaningful triple extraction rate of 92.6%. These results were validated using the inter-rater agreement method, which guaranteed the high reliability.
Keywords: Automatic Knowledge Extraction, Relation extraction, Natural Language Processing, Semantic Web, Triples Extraction
Open domain also known to as “domain-independent” or “unconstraint-domain” refers to unstructured text from news articles, magazines, World Wide Web (WWW), email text, blogs, and social media comments, where the content is not limited to a single domain. These are the vast information sources among the various types of information generators available today. The knowledge/information facts embedded in these sources are presented using natural language text, which is unstructured and mostly in heterogeneous formats; thus, only humans can read and understand. However, humans bear limited cognitive processing power, and this neverending information generation leads to the problem of information overload. Hence, these knowledge sources are not effectively used.
The main objective of this study is to automatically extract surface knowledge from open-domain news sources and convert it into structured formats so that it can be interpreted by machines. We propose an approach based on the grammatical structure of a sentence to extract triples with a remarkably meaningful knowledge extraction rate. The extracted surface knowledge, in terms of triples, was validated using an interrater agreement validation method, which has high reliability.
Some have already attempted to solve the aforementioned problem by automatically extracting knowledge (AKE) from unstructured text and organizing it in structured knowledge bases that allow machines to reason out knowledge in a useful way. These attempts include, extracting different semantic components such as keywords, key phrases, entities, and relations.
The work, “Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information” introduces a novel algorithm for Keyword extraction [1]. This study is unique because it does not require a corpus of similar documents. Another interesting domain-dependent keyword extraction approach was found in [2], where it extracts biological information from full-text scientific articles, not limiting the effort to only the content of the abstract, as in many other similar approaches. A survey was conducted to analyze the use of graph-based methods for keyword extraction. Furthermore, they highlighted the fact that graph-based methods are better than supervised and unsupervised methods in terms of complexity and computational resources [3]. Testing the hypothesis, “keywords are more likely to be found among influential nodes of a graph of words rather than among its nodes high on eigenvector-related centrality measure,” Tixier, Malliaros, and Vazirgiannis attempted to extract keywords from documents [4]. Hasan, Sanyal, Chaki, and Ali [5] conducted an empirical study of important keyword extraction. They concluded that the Support Vector Machine (SVM) and Conditional Random Field (CRF) methods yielded better results. A survey was conducted by [6] for automatic keyword extraction, which is used for text summarization. They introduced a hybrid extraction technique, which is a codependent algorithm for keyword extraction and text summarization.
Several studies have focused on automatic keyphrase extraction. Liu et al. [7] proposed an unsupervised approach for key phrase extraction by using a clustering mechanism to find exemplar terms, and then used the extracted exemplar terms to find key phrases. Ouyang, Li, and Zhang [8] participated in a task titled “Automatic Keyphrase Extraction from Scientific Articles” in SemEval-2, task number 5. In their approach, they identified the core words of the articles as the most essential words in the document and expanded them toward proper Key phrases using “Word expansion” approach. The competition and its results are detailed in [9]. Key2Vec is a phrase embedding method that is used for ranking keyphrases extracted from scientific articles. According to the experimental results the proposed Key2Vec technique produced state-of-the-art results on benchmark datasets [10]. Rabby and Azad [11] proposed a rooted tree-based, domain-independent, automatic keyphrase extraction method based on nominal statistical knowledge. Typically, unsupervised systems have poor accuracy and require a large corpus. To address these drawbacks, Bennani-Smires et al. used a novel unsupervised method called “EmbedRank,” which leverages sentence embeddings to extract keyphrases from a single document [12].
Named Entity Recognition (NER) is another important task in knowledge extraction research. KNOWITALL [13] extracted a large collection of named entities from web data. KNOWITALL [14] was enhanced to improve the recall and extraction rates. Ritter et al. [15] stated that the performance of standard NLP tools is severely degraded in tweets. Therefore, they proposed a mechanism to rebuild the NLP pipeline, starting with POS tagging, chunking, and NER using a distantly supervised approach.
Semantic binary relation extraction is another area of the AKE. Mintz et al. [16] conducted a useful research based on the concept of “Distance Supervision”. Their methodology does not require labeled corpora; hence, no domain dependency exists. Instead, Freebase was used for distance supervision. Nguyen and Verspoor investigated the performance after integrating character-based word representations into a standard Convolutional Neural Network (CNN) based relation extraction model [17]. The SemEval-2018-Task7 was designed to identify and classify instances of semantic relations between concepts in a set of six discrete categories. When analyzing the approaches of the 32 participants, it was revealed that the most popular methods include CNN and Long Short Term Memory (LSTM) networks, with word embedding-based features calculated based on domain-specific corpora [18]. Triple extraction can be considered as a sort of relation extraction in the knowledge extraction domain. A survey on relation extraction has revealed that, out of all supervised approaches including feature-based and kernel-based, syntactic tree kernel-based techniques were the most effective [19].
Another popular topic is taxonomy extraction, which focuses on extracting hypernym-hyponym relationships from text. SemEval-2016, Task 13, was also designed for taxonomy extraction from a specified multi-domain term list [20]. Maitra and Das came in fourth place in the monolingual competition (English language) and second place in the multilingual competition (Dutch, Italian, and French languages) [21]. They used an unsupervised approach with two modules to produce possible hypernym candidates, then merged the results. Panchenko et al. [22] secured the first place in the same competition. Substring matching and lexico-syntactic pattern matching were used in their methodology to develop their product called TAXI.
The target domain used in AKE significantly influences the decision on the methods and techniques to use and the expected level of accuracy. Therefore, knowledge extraction can be classified as domain-dependent and domain-independent, also known as open domain (nontrivial), which can be extended to massive web data. The unsupervised approach, TEXTRUNNER, is an Open Information Extraction (OIE) system that extracts information from the unconstrained Web [23]. TEXTRUNNER had a 33% lower error rate than KNOWITALL did. The Wikipedia-based Open Extractor (WOE) [24] is an extension of TEXTRUNNER with improved precision and recall. Etzioni et al. attempted to scale knowledge extraction into a sizable and heterogeneous web corpus [25]. The authors of this research highlighted that their work could be dubbed the second generation of OIE because the novel model doubled the precision and recall compared with previous works such as TEXTRUNNER and WOE. The hypothesis “Shallow syntactic knowledge and its implied semantics can be easily acquired and can be used in many areas of a question-answering system,” proved in research on AKE from documents [26]. This approach was implemented in the IBM Watson DeepQA system, and its overall accuracy improved by 2.4%. Soderland et al. mapped domain-independent open information extractions (tuples) into ontologies using domains from the DARPA Machine Reading Project [27].
Never-Ending Language Learning (NELL) is a learning agent whose task is to learn to read the web all the time. This is an alternative paradigm for machine learning that accurately models human learning. NELL is successfully learning to improve its reading competence over time and plans to expand its knowledge base of world beliefs [28].
Although recent literature has shown many advancements in the AKE domain, it suffers from low precision and semantic drift in the evaluation of results.
According to David Bennet and Alex Bennet, there are three levels of knowledge: deep, shallow, and surface [29]. Surface knowledge is explicit knowledge that requires minimal meaning, without the context. Our attempt was made toward extracting surface knowledge from unstructured texts in Sri Lankan news. Fig. 1 depicts a high-level design diagram of the proposed approach.
As depicted in the diagram, several subtasks were performed with the aim of extracting surface knowledge in the form of triples. Our primary data source, the Sri Lankan new corpus, consists of various types of news articles in text format, from different domains such as business and finance, politics, education, and breaking news. It contains 5,409 plain text files with approximately 116,839 sentences. There were 2,524,876 words in the corpus.
First, the news articles in text format were pre-processed and tokenized into sentences. However, these sentences are ineligible for knowledge extraction. Our in-depth investigations helped us to identify and foresee problematic sentences. Hence, we ignored such problematic sentences during pre-processing and kept them aside for handling in future work. These include 1) Sentences that are outside the effective sentence length range. We set the minimum and maximum sentence lengths to between five and 65 words. 2) Sentences with more than two commas. 3) If quotes are found in the sentence. 4) If a question is found in a sentence. 5) If the semi-colon was found more than once in the sentence. 6) If the sentence starts with special characters (these are mostly metadata in news).
Tokenized sentences were parsed using a dependency parser to obtain the grammatical structure of the sentence. The grammatical structure of the sentence is organized as an array of tokens (words of the sentence in its left to right order) with details such as text/token (the original token text), dependency tag (the syntactic relation connecting child to head), head text (the original text of the token head), head part-of-speech (POS) (the POS tag of the token head), and children (the immediate syntactic dependents of the token). Fig. 2 shows an example of the grammatical structure of the sentence: “
The triple extraction algorithm, in which the discovered validation rules were implemented, navigated through the sentence’s grammatical structure and extracted knowledge facts in the form of a triple. Dependency parser annotations such as ‘nsubj’, dobj, and ‘ROOT’ are standard parser tags, which are typically used in any dependency parser. The spacy syntactic dependency parser was chosen for dependency parsing. All coding was performed using Python version 3.6.5.
The surface knowledge extracted from open domain news should be stored in a semantic pattern that represents structured knowledge. In our study, the triple, which is in the form of “subject | predicate | object, was chosen as the best template for this purpose. The following example shows a lengthy unstructured sentence and its extracted knowledge in the form of a triple. (Triple components are separated from the pipe symbol)
Sentence: “
Triple: Investigations | handed over to | Criminal Investigation Department
A set of valuable rules for accurately extracting the components of the triple were discovered by comprehensively analyzing the sentence’s grammatical structure. These rules were implemented as a rule validation layer in the surface knowledge extraction process. During the execution of this algorithm, every sentence is parsed through the rule validation layer, and the triple components are extracted accurately. The remainder of this section introduces the discovered rules and their implementations using pseudocodes.
Rule 01:
“In basic level triple extraction, ROOT token should be assigned as the Predicate component of the triple. Then, a token with the dependency tag ‘nsubj’ or ‘nsubjpass’ for the subject component and ‘dobj’ or ‘pobj’ or ‘attr’ token as the object component of the triple should be assigned depending on their availability in the grammatical structure.”
The pseudocode shown in Fig. 3 illustrates the logic implemented for the basic-level triple extraction rule (Rule 01) using the sentence’s grammatical structure.
News sentences are lengthy and complex. They can be complex, compound, direct speech, indirect speech, active voice, passive voice, and so on. When there are multiple clauses in a sentence, there is a high chance of inaccurate triple extraction because of the complexity. For example,
As you can see in the above example, there could be multiple tokens with the dependency tag ‘nsubj’, ‘dobj’, ‘pobj’, or ‘attr’ that are capable of being selected as subject/object component of the triple. This is mainly because of multiple clauses within a single sentence. No matter how complex it is, there should be only one ‘ROOT’ token in a single sentence. Identifying the most appropriate token for the subject component and the most appropriate token for the object component (out of multiple candidate tokens) are very vital to form a meaningful triple. For that purpose, a new rule has been discovered, and it is mentioned as Rule 02 or “Main rule.”
Rule 02 (Main rule):
Although there may be multiple candidate tokens for the subject and object components of the triple, the aforementioned main rule will be satisfied by only one candidate token. Therefore, using this rule, the most appropriate candidate token can be selected and a meaningful link can be created. To implement this main rule, the ROOT token should first be identified and assigned to the predicate part of the triple. When extracting other components later, the following condition should be checked to make the appropriate link to the ROOT/Predicate.
token.head.text = Predicate and token.head.POS = ‘VERB’ (Where the value for “Predicate” has already been selected.)
Refer to line 5 or line 8 of the pseudocode shown in Fig. 3 for the implementation details. The pseudocode in Fig. 3 implements regular rules that can be used to extract three components of the triple at its basic level. Although such basic-level extraction would result in a triple, it may not have adequate details to be meaningful. Therefore, enhancements are essential for making the triple more meaningful. Furthermore, in all such improvements, the above-mentioned main rule must be adhered to.
Basic-level triple extraction may result in an acceptable meaning for a triple. Affixing appropriate terms to the triple components yields an enhanced meaning for the triple. Although our target was to extract surface knowledge, triple extraction with the best possible meaning was our ultimate target. However, introducing additional terms minimally is crucial but challenging.
For example,
Basic level triple: Act | passed | parliament
Enhanced triple: Independence Act | passed by| British parliament
Rule 03:
For example,
Triple before: Decision | taken | meeting
Triple after: Decision | taken at | meeting
In the abovementioned example, the word ‘at’ is the preposition and ‘meeting’ is the prepositional object assigned as the object component of the triple. With the added token ‘at’ as a suffix to the predicate, expanded triple yields a better meaning. However, one exception was observed when there were words between ROOT and the preposition as below.
For example,
Triple before: He | contended of | memorial
As you can see in the above example, the output triple may not be meaningful without the middle word “construction.” Therefore, the expected triple can be expressed as follows:
Triple after: He | contended construction of | memorial
As a solution to this drawback, relevant words between the selected predicate and the preposition were identified by traversing the grammatical structure in the backward direction. Refer to the pseudocode in Fig. 4 for the implementation of Rule 03.
Rule 04:
For example,
Triple before: investigations | handed to | Investigation Department
Triple after: investigations | handed over to | Investigation Department
In the above example, ‘handed over’ is a phrasal verb. The grammatical structure denotes the token ‘handed’ as the ROOT, and it is selected as the predicate. Additionally, the token ‘over’ is denoted as a particle and selected as the suffix for the predicate. Refer to the pseudocode in Fig. 5 for the implementation of Rule 04.
The tokens selected as subject and object components of the basic-level triple may not be meaningful. Therefore, appropriate prefixes can be combined to render the triple components more meaningful.
Rule 05:
For example,
Triple before: countries | achieved | gains
Triple after: most countries | achieved | significant gains
Once the basic-level triple components have been identified, such prefixes are sought by re-traversing the same grammatical structure. Refer to the pseudocode in Fig. 6 for the implementation of Rule 05.
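A sketch of this re-traversal is given below, assuming spaCy; the set of modifier tags (amod, compound, nummod, poss) is our assumption about which left-hand dependents make useful prefixes, as the paper's Fig. 6 is not reproduced here.

```python
# Sketch of Rule 05: left-hand modifiers of a chosen subject/object token
# (adjectives, compounds, numbers, possessives) are prefixed to it.
import spacy

nlp = spacy.load("en_core_web_sm")

PREFIX_DEPS = {"amod", "compound", "nummod", "poss"}

def with_prefixes(token):
    # Collect left-hand modifiers of the token in document order.
    prefixes = [t.text for t in token.lefts if t.dep_ in PREFIX_DEPS]
    return " ".join([*prefixes, token.text])

doc = nlp("Most countries achieved significant gains.")
root = next(tok for tok in doc if tok.dep_ == "ROOT")
subj = next(t for t in root.children if t.dep_ == "nsubj")
obj = next(t for t in root.children if t.dep_ == "dobj")
print(with_prefixes(subj), "|", root.text, "|", with_prefixes(obj))
# expected: 'Most countries | achieved | significant gains'
```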
As in Rule 05, appropriate tokens can be combined as suffixes for the subject and object components of the triple. This becomes mandatory when the word “of” comes after the token selected as the subject or object component. When a sentence has a phrase such as “lot of people”, “percentage of students”, or “crisis of confidence” as its subject or object, the output triple would otherwise be meaningless.
Rule 06:
Example 1:
Triple before: Lot | taught | important characteristics
Triple after: Lot of teachers | taught | important characteristics
Example 2:
Triple before: we | increase | percentage
Triple after: we | increase | percentage of entrants
Refer to pseudocode in Fig. 7 for the Rule 06 implementation.
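Under the same spaCy assumption, Rule 06 can be sketched as follows; the helper name and sample sentence are illustrative, not the paper's.

```python
# Sketch of Rule 06: when an "of" phrase follows a selected subject/object
# token, the preposition and its object are appended, so that phrases such
# as "lot of teachers" survive as one component.
import spacy

nlp = spacy.load("en_core_web_sm")

def with_of_suffix(token):
    # Look for an "of" prepositional phrase attached to the token.
    for child in token.rights:
        if child.dep_ == "prep" and child.text == "of":
            pobj = next((t for t in child.children if t.dep_ == "pobj"), None)
            if pobj is not None:
                return f"{token.text} of {pobj.text}"
    return token.text

doc = nlp("A lot of teachers taught important characteristics.")
subj = next(t for t in doc if t.dep_ == "nsubj")
print(with_of_suffix(subj))
# expected: 'lot of teachers'
```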
In some sentences, the main verb that holds the ‘ROOT’ dependency tag comes with an open clausal complement. For example,
In this example, “build” is an open clausal complement token. If the predicate is formed from the ‘ROOT’ token, the output triple would be the following:
Central Bank | hoping | foreign reserves
However, the above triple does not provide the expected meaning. Therefore, Rule 07 has been introduced to address this issue.
Rule 07:
Triple after: Central Bank | build | foreign reserves
Refer to pseudocode in Fig. 8 for the Rule 07 implementation.
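A minimal sketch of Rule 07 follows, assuming spaCy, where open clausal complements carry the tag "xcomp"; the sample sentence is reconstructed from the triples above and is illustrative only.

```python
# Sketch of Rule 07: when the ROOT verb carries an open clausal complement
# (spaCy tag "xcomp"), the complement verb replaces the ROOT as predicate.
import spacy

nlp = spacy.load("en_core_web_sm")

def effective_predicate(sentence):
    doc = nlp(sentence)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    # Prefer the open clausal complement ("build" in "hoping to build").
    xcomp = next((t for t in root.children if t.dep_ == "xcomp"), None)
    return (xcomp or root).text

print(effective_predicate("The Central Bank is hoping to build foreign reserves."))
# expected: 'build'
```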
Complex sentences very often contain linking words, so triple extraction from such sentences must be handled differently. In such sentences, “discourse markers” or linking phrases/words are commonly used (for example, although, however, since, assuming that, and in fact).
Rule 08:
“When a sentence is of a complex type with multiple clauses linked by discourse markers, the sentence is split into separate segments at the discourse markers, and triple extraction is then carried out on each segment separately.”
Discourse markers can be identified in the grammatical structure using the tokens with their dependency tag set as “mark.” According to Rule 08, the given example results in two triples extracted from two sentence segments/clauses, as follows:
Segment 1:
Token with a discourse mark tag: “although”
Segment 2:
Triples generated:
Government | cut down | cost of living
politicians | continued | hedonistic lifestyles
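One way to sketch this splitting in spaCy is shown below: the clause introduced by a "mark" token is the subtree of the marker's head verb, and the sentence divides into that clause and the remainder. This heuristic is our own reading of Rule 08, and the example sentence is paraphrased from the generated triples above.

```python
# Sketch of Rule 08: split a complex sentence at a discourse marker
# (dependency tag "mark"), using the marker's head-verb subtree as one
# segment and the rest of the sentence as the other.
import spacy

nlp = spacy.load("en_core_web_sm")

def split_on_discourse_marker(sentence):
    doc = nlp(sentence)
    mark = next((tok for tok in doc if tok.dep_ == "mark"), None)
    if mark is None:
        return [doc.text]
    # The subordinate clause is the subtree of the marker's head verb.
    clause = {t.i for t in mark.head.subtree} - {mark.i}
    seg1 = " ".join(doc[i].text for i in sorted(clause) if not doc[i].is_punct)
    seg2 = " ".join(t.text for t in doc
                    if t.i not in clause and t.i != mark.i and not t.is_punct)
    return [seg1, seg2]

print(split_on_discourse_marker(
    "Although the government cut down the cost of living, "
    "politicians continued their hedonistic lifestyles."))
```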
Reported speech is a frequently used writing style in news articles. Because our research uses news as its data source, handling reported speech cannot be avoided. In sentences containing the word “said”, the grammatical structure identifies “said” as the ‘ROOT’ token; the output triple is therefore meaningless. Refer to the following two examples.
Example 1:
Extracted triple: Minister Athukorala | said | MoU
Example 2:
Extracted triple: Sri Lanka Freedom Party | said |
To resolve this anomaly, Rule 09 was implemented.
Rule 09:
In reported speech sentences, the reported clause contains what the original speaker said; the reported clause is therefore more important for knowledge extraction. Automatically distinguishing reported clauses from reporting clauses is challenging. Although it does not succeed in every case, simple logic was identified to perform this task. Observation of many examples showed that the reported clause is found in the right-side segment (as in Example 1 above), unless the left-side segment is much longer than the right-side segment and contains a comma “,”. The implementation of this logic is depicted as pseudocode in Fig. 9.
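The heuristic can be sketched as follows, assuming spaCy; the length comparison mirrors the description above, while the example sentence is invented to match the triple of Example 1.

```python
# Sketch of the Rule 09 heuristic: split at the "said" ROOT and prefer the
# right-hand segment as the reported clause, unless the left-hand segment
# is longer and contains a comma.
import spacy

nlp = spacy.load("en_core_web_sm")

def reported_clause(sentence):
    doc = nlp(sentence)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    if root.lemma_ != "say":
        return doc.text
    left = doc[:root.i].text
    right = doc[root.i + 1:].text
    # Prefer the right-hand segment, as observed in most examples.
    if len(left) > len(right) and "," in left:
        return left
    return right

print(reported_clause(
    "Minister Athukorala said the MoU would be signed soon."))
# expected: 'the MoU would be signed soon.'
```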
The grammatical structures of active and passive voice sentences differ slightly. The two main differences are as follows: 1) active voice sentences mark the noun representing the subject with the dependency tag ‘nsubj’, whereas passive voice sentences mark their nominal subject with ‘nsubjpass’; 2) passive voice sentences mostly contain a token with the dependency tag ‘agent’ associated with the preposition ‘by’. With minor modifications to the algorithm, passive voice sentences can therefore be processed using the same triple extraction algorithm.
Rule 10:
The implementation of Rule 10 is represented in the pseudocode shown in Fig. 10 by changing lines 3, 4, and 5 of the basic triple extraction pseudocode depicted in Fig. 3.
For example,
Extracted triple: dolphins | killed by | fishermen
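A sketch of the passive voice handling follows, assuming spaCy, where the ‘agent’ tag sits on the “by” token heading the agent phrase; the function name and example sentence (reconstructed from the triple above) are illustrative.

```python
# Sketch of Rule 10: passive sentences use "nsubjpass" for the subject and
# reach the logical object through the "agent" ("by") token's pobj.
import spacy

nlp = spacy.load("en_core_web_sm")

def passive_triple(sentence):
    doc = nlp(sentence)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    subj = next((t for t in root.children if t.dep_ == "nsubjpass"), None)
    # The agent phrase: ROOT -> "by" (dep "agent") -> pobj.
    agent = next((t for t in root.children if t.dep_ == "agent"), None)
    obj = None
    if agent is not None:
        obj = next((t for t in agent.children if t.dep_ == "pobj"), None)
    predicate = f"{root.text} {agent.text}" if agent is not None else root.text
    return (subj.text if subj else None, predicate, obj.text if obj else None)

print(passive_triple("The dolphins were killed by the fishermen."))
# expected: ('dolphins', 'killed by', 'fishermen')
```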
Even though our regular triple extraction algorithm suits passive voice sentences with the minor modifications described above, one anomaly was observed in the ‘Predicate’ part of some triples, as shown below.
For example,
Extracted triple: Temporary workers | paid | wages
In the example above, the passive sense of the triple is lost. When the agent is unknown (done by whom) and the past participle form is identical to the past tense form (for example, paid, made, and read), the passive voice triple conveys an incorrect meaning. Rule 11 was introduced to address this issue.
Rule 11:
Extracted triple after rule implementation:
Temporary workers | are paid | wages
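Under the same spaCy assumption, keeping the passive auxiliary (tag "auxpass") as a predicate prefix can be sketched as follows; names and the sample sentence are illustrative.

```python
# Sketch of Rule 11: when a passive ROOT has no agent, keep the auxiliary
# ("are" in "are paid") as a predicate prefix so the passive sense survives.
import spacy

nlp = spacy.load("en_core_web_sm")

def passive_predicate(sentence):
    doc = nlp(sentence)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    has_agent = any(t.dep_ == "agent" for t in root.children)
    aux = next((t for t in root.children if t.dep_ == "auxpass"), None)
    if aux is not None and not has_agent:
        return f"{aux.text} {root.text}"  # e.g. "are paid"
    return root.text

print(passive_predicate("Temporary workers are paid wages."))
# expected: 'are paid'
```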
The proposed approach was successfully implemented and validated using the Sri Lankan English news corpus. The algorithmic details are presented as pseudocode in Figs. 3-10.
Output generation:
The output triples were written to a CSV file with the pipe character as the delimiter between triple components. Some of the extracted triples are shown in Fig. 11. Table 1 lists some important statistics of the extraction process. According to Table 1, every sentence deemed valid for extraction resulted in a well-formed triple (a value exists for all three components). This is because sentences foreseen to cause problems were ignored before extraction.
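The output step is straightforward; a minimal sketch using Python's standard csv module is shown below (the file name and toy triples are ours).

```python
# Minimal sketch of the output step: triples written to a CSV file with
# "|" as the delimiter between triple components.
import csv

triples = [("parliament", "passed", "Act"),
           ("decision", "taken at", "meeting")]

with open("triples.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(["subject", "predicate", "object"])
    writer.writerows(triples)
```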
Table 1. Statistical facts of surface knowledge extraction over the Sri Lankan news corpus.

| Fact | Value |
|---|---|
| Number of documents in the corpus | 5,409 |
| Number of sentences after pre-processing | 116,839 |
| Number of sentences ignored | 62,638 |
| Number of sentences valid for extraction | 54,201 |
| Number of triples extracted | 54,201 |
| Number of distinct predicates extracted | 10,116 |
| Number of distinct subjects extracted | 6,736 |
| Number of distinct objects extracted | 7,937 |
Sentences ignored during the algorithm execution:
When a sentence is a command, the ROOT token appears first in the grammatical structure array. In such cases, the subject component of the triple is empty, resulting in a malformed triple. Therefore, such sentences were ignored.
For example, “Shift the insurance liability toward manufacturers.”
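A sketch of this filter follows, assuming spaCy; checking whether the ROOT is the first token is our direct reading of the description above.

```python
# Sketch of the command filter: if the ROOT token is the first token, the
# sentence has no subject to extract and is skipped.
import spacy

nlp = spacy.load("en_core_web_sm")

def is_command(sentence):
    doc = nlp(sentence)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    return root.i == 0

print(is_command("Shift the insurance liability toward manufacturers."))
# expected: True
```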
The purpose of this validation process was to evaluate how meaningful the output of the proposed surface knowledge extraction algorithm is. Accuracy was measured by conducting an inter-rater agreement test among four participants selected from our testing team, which consisted of academics and professionals who were experts in English and had different professional backgrounds. Four sample sets of 387 triples each were randomly chosen from the extracted triples using sampling without replacement; altogether, 1,548 triples were selected and distributed among the four participants. This sample distribution ensured that every triple was verified by at least three participants, with the meaningfulness/correctness of the surface knowledge extraction decided by majority vote. The validation results of all four members were combined and re-processed to determine the final accuracy rate; the results are presented in Table 2. Because the generated output is large, performing a validation test with high variability and reliability is vital: variability was achieved by randomly choosing the test triples from the entire corpus without replacement, and reliability was assessed using the inter-rater agreement mechanism. The results show that the surface knowledge extraction achieved a meaningful triple extraction rate of 83.5%, indicating that the algorithm yields accurate results.
Table 2. Inter-rater agreement test results - Sri Lankan news corpus.

| Fact | Value |
|---|---|
| Total number of triples in the samples | 1,548 |
| Number of triples voted as meaningful | 1,293 |
| Number of triples voted as meaningless | 255 |
| Meaningful triple extraction rate (%) | 83.5 |
| Error rate (%) | 16.5 |
| IRR score (%) | 76 |
However, this performance was not achieved in a single attempt; several improvement cycles were performed, experimenting with and discovering new rules one by one. After the main rule was introduced, the accuracy rate increased from 60% to 83.5%. Rule 02 (the main rule) therefore contributes significantly to this improvement, as it creates accurate linkages among tokens in the grammatical structure.
For the inter-rater agreement test, the level of agreement between raters, also known as inter-rater reliability (IRR), was computed using the percent agreement method. The IRR score obtained was 76%, which confirms the high reliability of our testing team.
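As a worked illustration of percent agreement, a small sketch follows; the pairwise averaging shown is one common variant, as the paper does not spell out its exact formula, and the toy verdicts are invented.

```python
# Sketch of percent-agreement IRR: the share of items on which pairs of
# raters gave the same verdict, averaged over all rater pairs.
from itertools import combinations

def percent_agreement(ratings):
    """ratings: one verdict list per rater, all over the same items."""
    pairs = list(combinations(ratings, 2))
    agree = sum(sum(a == b for a, b in zip(r1, r2)) for r1, r2 in pairs)
    total = sum(len(r1) for r1, _ in pairs)
    return 100 * agree / total

r1 = [1, 1, 0, 1]  # toy verdicts: 1 = meaningful, 0 = meaningless
r2 = [1, 0, 0, 1]
r3 = [1, 1, 0, 0]
print(f"{percent_agreement([r1, r2, r3]):.1f}%")  # 66.7% for this toy data
```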
The same knowledge extraction algorithm was executed on the BBC news dataset, which contains 2,225 news articles and represents a good public open-domain corpus [30]. Table 3 presents some important statistics of this experiment.
Table 3. Statistical facts of surface knowledge extraction over the BBC news dataset.

| Fact | Value |
|---|---|
| Number of documents in the corpus | 2,225 |
| Number of sentences after pre-processing (sentence tokenization) | 41,983 |
| Number of sentences ignored | 23,799 |
| Number of sentences valid for extraction | 18,184 |
| Number of well-formed triples extracted | 18,184 |
| Number of malformed triples extracted | 0 |
The resulting triples are shown in Fig. 12. These extracted triples demonstrate the applicability of our triple extraction algorithm to any open-domain text.
Validating these results is crucial for claiming effectiveness. The triples extracted from the BBC news corpus were also validated using the inter-rater agreement method, as were the triples of the Sri Lankan news corpus. However, the BBC news context may not be familiar to Sri Lankans; therefore, to obtain a fair judgment, an examiner panel of four native British examiners was formed. During this validation, 800 triples were selected and distributed among the four examiners. This sample distribution again ensured that every triple was manually verified by at least three qualified examiners, so at least two examiners had to agree for a triple to be counted as meaningful/correct. The validation results of all four examiners were combined and re-processed to determine the final accuracy rate; the results are presented in Table 4.
Table 4. Inter-rater agreement test results - BBC news corpus.

| Fact | Value |
|---|---|
| Total number of triples in the samples | 800 |
| Number of triples voted as meaningful | 741 |
| Number of triples voted as meaningless | 59 |
| Meaningful triple extraction rate (%) | 92.6 |
| Error rate (%) | 7.4 |
| IRR score (%) | 90 |
The results show that the surface knowledge extraction achieved a 92.6% meaningful triple extraction rate, which is significant. Moreover, the IRR score computed for this validation process (using the percent agreement method) was 90%, confirming the high reliability of our validation team.
The BBC news corpus yielded better results than the Sri Lankan news corpus, demonstrating the accuracy and generalized applicability of our surface knowledge extraction algorithm to open-domain unstructured text.
Although the surface knowledge extraction accuracy is high, it remains below 100% because of some unhandled exceptions. The triples marked as ‘meaningless’ by the evaluators were mostly caused by the exceptional cases described below.
Unexpected behavior in grammatical structure:
Refer to the example sentence;
Sentences with the word “said” as the ‘ROOT’:
This type of sentence is handled with our own logic (Rule 09) introduced into the extraction algorithm. However, some examples deviate from the assumption made in that rule, and the extraction becomes incorrect.
Sentences with synonyms of “said” as the ‘ROOT’:
Some sentences use synonyms of “said,” such as “stated” and “commented.” However, the rule introduced for “said” cannot be applied to these sentences because their formats differ, making that rule inappropriate.
Identifying the best term from a noun phrase containing “of”:
In the phrase “percentage of entrants,” the most appropriate noun for the triple component is “entrants” (the prepositional object). However, in “claims of abuse,” the most appropriate noun is “claims” (not the prepositional object). Therefore, selecting the best word for the triple component from such noun phrases is nontrivial.
Even with the aforementioned unhandled exceptions, the accuracy of our surface knowledge extraction algorithm is remarkable. Notably, this accuracy rate was achieved without using the contextual information of the news articles as a whole.
Our tokenized corpus contained 116,839 sentences, and the surface knowledge extraction algorithm ran at an average speed of 0.02 seconds per sentence. Accordingly, a single pass over the entire corpus took less than one hour (116,839 × 0.02 s ≈ 39 minutes, excluding preprocessing time), which demonstrates good efficiency. The experiment was run on an Intel i5-8350U processor (1.70 GHz, 4 cores, 8 threads) with 16 GB of memory.
Handling reported speech in the surface knowledge extraction algorithm requires improvement: a robust mechanism for identifying the reported clause of a sentence is mandatory for extracting knowledge accurately from reported speech. In addition, identifying the most suitable word from a noun phrase is required when a preposition and its prepositional object are appended as a suffix to the subject and/or object component of the triple; “lot of students” (subject = lot, preposition = of, prepositional object = students), “percentage of entrants”, “bunch of grapes”, and “claims of abuse” are examples. In such situations, our algorithm uses the entire noun phrase as the triple component (subject or object) for a better meaning. The triple then becomes lengthy, and if a compound word or modifier is appended, the problem worsens. Therefore, it is preferable to find a method for identifying the most appropriate word for the triple component within the identified noun phrase.
The ever-increasing effect of information overload requires us to be extremely selective about what we read and store as knowledge. Much more can be made accessible to humans if such textual content is encoded into forms that lend themselves more readily to inference. Closing this gap in converting natural-language textual content into a machine-processable form requires the extraction of surface knowledge. This study attempted to solve this problem by automatically extracting surface knowledge, in terms of triples, from open-domain news.
After a thorough examination of all features of the grammatical structure, 11 important rules were discovered that can be used to extract meaningful surface knowledge. Among these, complex sentence handling, passive voice handling, and reported speech handling contribute notably to the domain of AKE. The results were validated using an inter-rater agreement test to ensure high reliability, and these tests obtained acceptable IRR scores computed using the percent agreement method. The proposed approach achieved meaningful triple extraction rates of 83.5% for the Sri Lankan news corpus and 92.6% for the BBC news dataset, demonstrating significant performance. The preprocessed Sri Lankan English news corpus and the discovered rules for surface knowledge extraction using the grammatical structure of the sentence are valuable contributions of this study.
With these research findings, AKE from open-domain texts becomes feasible; hence, machines can interpret open-domain unstructured text with modern computer processing power. Our findings will also be useful for the development of the Semantic Web. Finally, news/web data, an extremely large knowledge source, need not be wasted in the future.