Journal of information and communication convergence engineering 2022; 20(2): 113-124

Published online June 30, 2022

https://doi.org/10.6109/jicce.2022.20.2.113

© Korea Institute of Information and Communication Engineering

Grammatical Structure Oriented Automated Approach for Surface Knowledge Extraction from Open Domain Unstructured Text

Muditha Tissera1* and Ruvan Weerasinghe2

1Department of Software Engineering, University of Kelaniya, Kelaniya 11600, Sri Lanka
2School of Computing, University of Colombo, Colombo 00100, Sri Lanka

Correspondence to : Muditha Tissera (E-mail: mudithat@kln.ac.lk, Tel: +94-1129-12709)
Department of Software Engineering, University of Kelaniya, Kelaniya 11600, Sri Lanka.

Received: June 20, 2021; Revised: October 27, 2021; Accepted: November 21, 2021

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

News in the form of web data generates increasingly large amounts of information as unstructured text. The capability of understanding the meaning of news is limited to humans; thus, it causes information overload. This hinders the effective use of embedded knowledge in such texts. Therefore, Automatic Knowledge Extraction (AKE) has now become an integral part of Semantic web and Natural Language Processing (NLP). Although recent literature shows that AKE has progressed, the results are still behind the expectations. This study proposes a method to auto-extract surface knowledge from English news into a machine-interpretable semantic format (triple). The proposed technique was designed using the grammatical structure of the sentence, and 11 original rules were discovered. The initial experiment extracted triples from the Sri Lankan news corpus, of which 83.5% were meaningful. The experiment was extended to the British Broadcasting Corporation (BBC) news dataset to prove its generic nature. This demonstrated a higher meaningful triple extraction rate of 92.6%. These results were validated using the inter-rater agreement method, which guaranteed the high reliability.

Keywords Automatic Knowledge Extraction, Relation extraction, Natural Language Processing, Semantic Web, Triples Extraction

A. Problem Formation

Open domain, also known as “domain-independent” or “unconstrained-domain,” refers to unstructured text from news articles, magazines, the World Wide Web (WWW), email text, blogs, and social media comments, where the content is not limited to a single domain. These are among the largest information sources available today. The knowledge/information facts embedded in these sources are presented using natural language text, which is unstructured and mostly in heterogeneous formats; thus, only humans can read and understand them. However, humans have limited cognitive processing power, and this never-ending information generation leads to the problem of information overload. Hence, these knowledge sources are not effectively used.

B. Proposed Solution

The main objective of this study is to automatically extract surface knowledge from open-domain news sources and convert it into structured formats so that it can be interpreted by machines. We propose an approach based on the grammatical structure of a sentence to extract triples with a high rate of meaningful knowledge extraction. The extracted surface knowledge, in terms of triples, was validated using an inter-rater agreement validation method, which has high reliability.

Several studies have already attempted to solve the aforementioned problem by automatically extracting knowledge (AKE) from unstructured text and organizing it in structured knowledge bases that allow machines to reason over the knowledge in a useful way. These attempts include extracting different semantic components such as keywords, key phrases, entities, and relations.

The work “Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information” introduces a novel algorithm for keyword extraction [1]. This study is unique because it does not require a corpus of similar documents. Another interesting domain-dependent keyword extraction approach was found in [2], which extracts biological information from full-text scientific articles rather than limiting the effort to the content of the abstract, as in many other similar approaches. A survey analyzed the use of graph-based methods for keyword extraction and highlighted that graph-based methods are better than supervised and unsupervised methods in terms of complexity and computational resources [3]. Testing the hypothesis, “keywords are more likely to be found among influential nodes of a graph of words rather than among its nodes high on eigenvector-related centrality measure,” Tixier, Malliaros, and Vazirgiannis attempted to extract keywords from documents [4]. Hasan, Sanyal, Chaki, and Ali [5] conducted an empirical study of important keyword extraction. They concluded that the Support Vector Machine (SVM) and Conditional Random Field (CRF) methods yielded better results. A survey of automatic keyword extraction, as used for text summarization, was conducted in [6]. The authors introduced a hybrid extraction technique, which is a codependent algorithm for keyword extraction and text summarization.

Several studies have focused on automatic keyphrase extraction. Liu et al. [7] proposed an unsupervised approach for keyphrase extraction by using a clustering mechanism to find exemplar terms, and then used the extracted exemplar terms to find keyphrases. Ouyang, Li, and Zhang [8] participated in a task titled “Automatic Keyphrase Extraction from Scientific Articles” in SemEval-2, task number 5. In their approach, they identified the core words of the articles as the most essential words in the document and expanded them toward proper keyphrases using a “word expansion” approach. The competition and its results are detailed in [9]. Key2Vec is a phrase embedding method that is used for ranking keyphrases extracted from scientific articles. According to the experimental results, the proposed Key2Vec technique produced state-of-the-art results on benchmark datasets [10]. Rabby and Azad [11] proposed a rooted tree-based, domain-independent, automatic keyphrase extraction method based on nominal statistical knowledge. Typically, unsupervised systems have poor accuracy and require a large corpus. To address these drawbacks, Bennani-Smires et al. used a novel unsupervised method called “EmbedRank,” which leverages sentence embeddings to extract keyphrases from a single document [12].

Named Entity Recognition (NER) is another important task in knowledge extraction research. KNOWITALL [13] extracted a large collection of named entities from web data. KNOWITALL [14] was enhanced to improve the recall and extraction rates. Ritter et al. [15] stated that the performance of standard NLP tools is severely degraded in tweets. Therefore, they proposed a mechanism to rebuild the NLP pipeline, starting with POS tagging, chunking, and NER using a distantly supervised approach.

Semantic binary relation extraction is another area of AKE. Mintz et al. [16] conducted useful research based on the concept of “distant supervision.” Their methodology does not require labeled corpora; hence, no domain dependency exists. Instead, Freebase was used for distant supervision. Nguyen and Verspoor investigated the performance after integrating character-based word representations into a standard Convolutional Neural Network (CNN) based relation extraction model [17]. The SemEval-2018-Task7 was designed to identify and classify instances of semantic relations between concepts in a set of six discrete categories. When analyzing the approaches of the 32 participants, it was revealed that the most popular methods include CNN and Long Short Term Memory (LSTM) networks, with word embedding-based features calculated based on domain-specific corpora [18]. Triple extraction can be considered as a sort of relation extraction in the knowledge extraction domain. A survey on relation extraction has revealed that, out of all supervised approaches including feature-based and kernel-based, syntactic tree kernel-based techniques were the most effective [19].

Another popular topic is taxonomy extraction, which focuses on extracting hypernym-hyponym relationships from text. SemEval-2016, Task 13, was also designed for taxonomy extraction from a specified multi-domain term list [20]. Maitra and Das came in fourth place in the monolingual competition (English language) and second place in the multilingual competition (Dutch, Italian, and French languages) [21]. They used an unsupervised approach with two modules to produce possible hypernym candidates, then merged the results. Panchenko et al. [22] secured the first place in the same competition. Substring matching and lexico-syntactic pattern matching were used in their methodology to develop their product called TAXI.

The target domain used in AKE significantly influences the decision on the methods and techniques to use and the expected level of accuracy. Therefore, knowledge extraction can be classified as domain-dependent and domain-independent, also known as open domain (nontrivial), which can be extended to massive web data. The unsupervised approach, TEXTRUNNER, is an Open Information Extraction (OIE) system that extracts information from the unconstrained Web [23]. TEXTRUNNER had a 33% lower error rate than KNOWITALL did. The Wikipedia-based Open Extractor (WOE) [24] is an extension of TEXTRUNNER with improved precision and recall. Etzioni et al. attempted to scale knowledge extraction into a sizable and heterogeneous web corpus [25]. The authors of this research highlighted that their work could be dubbed the second generation of OIE because the novel model doubled the precision and recall compared with previous works such as TEXTRUNNER and WOE. The hypothesis “Shallow syntactic knowledge and its implied semantics can be easily acquired and can be used in many areas of a question-answering system” was proved in research on AKE from documents [26]. This approach was implemented in the IBM Watson DeepQA system, and its overall accuracy improved by 2.4%. Soderland et al. mapped domain-independent open information extractions (tuples) into ontologies using domains from the DARPA Machine Reading Project [27].

Never-Ending Language Learning (NELL) is a learning agent whose task is to learn to read the web all the time. This is an alternative paradigm for machine learning that accurately models human learning. NELL is successfully learning to improve its reading competence over time and plans to expand its knowledge base of world beliefs [28].

Although recent literature has shown many advancements in the AKE domain, it suffers from low precision and semantic drift in the evaluation of results.

According to David Bennet and Alex Bennet, there are three levels of knowledge: deep, shallow, and surface [29]. Surface knowledge is explicit knowledge that requires minimal meaning, without the context. Our attempt was made toward extracting surface knowledge from unstructured texts in Sri Lankan news. Fig. 1 depicts a high-level design diagram of the proposed approach.

Fig. 1. High-level Design Diagram-Automatic Surface Knowledge Extraction.

As depicted in the diagram, several subtasks were performed with the aim of extracting surface knowledge in the form of triples. Our primary data source, the Sri Lankan news corpus, consists of various types of news articles in text format, from different domains such as business and finance, politics, education, and breaking news. It contains 5,409 plain text files with approximately 116,839 sentences. There were 2,524,876 words in the corpus.

First, the news articles in text format were pre-processed and tokenized into sentences. However, not all of these sentences are eligible for knowledge extraction. Our in-depth investigations helped us to identify and anticipate problematic sentences. Hence, we ignored such sentences during pre-processing and set them aside for handling in future work. These include 1) sentences outside the effective length range, which we set to between 5 and 65 words; 2) sentences with more than two commas; 3) sentences containing quotes; 4) sentences containing a question; 5) sentences containing more than one semicolon; and 6) sentences starting with special characters (these are mostly metadata in news).
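The six exclusion criteria can be sketched as a single filter function. This is a minimal illustration, not the authors' implementation: the function name `is_eligible`, whitespace-based word counting, inclusive length bounds, the restriction to double quotation marks, and the reading of "special character" as any non-alphanumeric first character are all assumptions.

```python
import re

MIN_WORDS, MAX_WORDS = 5, 65  # length range stated in the text (assumed inclusive)

def is_eligible(sentence: str) -> bool:
    """Return True if the sentence passes all six pre-processing filters."""
    s = sentence.strip()
    n_words = len(s.split())              # assumption: words are whitespace-separated
    if not (MIN_WORDS <= n_words <= MAX_WORDS):
        return False                      # 1) outside the length range
    if s.count(",") > 2:
        return False                      # 2) more than two commas
    if re.search(r'["“”]', s):
        return False                      # 3) contains (double) quotes
    if "?" in s:
        return False                      # 4) contains a question
    if s.count(";") > 1:
        return False                      # 5) more than one semicolon
    if not s[0].isalnum():
        return False                      # 6) starts with a special character
    return True
```

Applying such a filter before parsing keeps the later rule layer simple, at the cost of discarding a large share of the corpus.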

Tokenized sentences were parsed using a dependency parser to obtain the grammatical structure of the sentence. The grammatical structure of the sentence is organized as an array of tokens (words of the sentence in left-to-right order) with details such as text/token (the original token text), dependency tag (the syntactic relation connecting child to head), head text (the original text of the token head), head part-of-speech (POS) (the POS tag of the token head), and children (the immediate syntactic dependents of the token). Fig. 2 shows an example of the grammatical structure of the sentence: “In fact, for political benefits, many political leaders brought communal politics to the fore.”

Fig. 2. Grammatical structure.
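As a rough illustration of the array described above, each token can be held in a small record with the five listed fields. The class name `TokenInfo` is hypothetical, and the dependency tags below are hand-assigned for the core clause of the example sentence rather than copied from parser output. With spaCy, the corresponding values would come from `token.text`, `token.dep_`, `token.head.text`, `token.head.pos_`, and `token.children`.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TokenInfo:
    text: str                    # original token text
    dep: str                     # dependency tag linking the token to its head
    head_text: str               # text of the head token
    head_pos: str                # POS tag of the head token
    children: List[str] = field(default_factory=list)  # immediate dependents

# Hand-annotated core clause: "many political leaders brought communal politics"
structure = [
    TokenInfo("many", "amod", "leaders", "NOUN"),
    TokenInfo("political", "amod", "leaders", "NOUN"),
    TokenInfo("leaders", "nsubj", "brought", "VERB", ["many", "political"]),
    TokenInfo("brought", "ROOT", "brought", "VERB", ["leaders", "politics"]),
    TokenInfo("communal", "amod", "politics", "NOUN"),
    TokenInfo("politics", "dobj", "brought", "VERB", ["communal"]),
]

# The ROOT token is the entry point for every extraction rule.
root = next(t for t in structure if t.dep == "ROOT")
```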

The triple extraction algorithm, in which the discovered validation rules were implemented, navigated through the sentence’s grammatical structure and extracted knowledge facts in the form of a triple. Dependency parser annotations such as ‘nsubj’, ‘dobj’, and ‘ROOT’ are standard parser tags that are used by many dependency parsers. The spaCy syntactic dependency parser was chosen for dependency parsing. All coding was performed using Python version 3.6.5.

A. Triple as a structured semantic pattern

The surface knowledge extracted from open domain news should be stored in a semantic pattern that represents structured knowledge. In our study, the triple, which is in the form of “subject | predicate | object,” was chosen as the best template for this purpose. The following example shows a lengthy unstructured sentence and its extracted knowledge in the form of a triple. (Triple components are separated by the pipe symbol.)

Sentence: “Investigations into the Thalawa State Bank robbery were handed over to the Criminal Investigation Department (CID) on IGP Pujith Jayasundara's orders, Police Spokesman Ruwan Gunasekara.”

Triple: Investigations | handed over to | Criminal Investigation Department

B. Parsing the Rule Validation Layer

A set of valuable rules for accurately extracting the components of the triple were discovered by comprehensively analyzing the sentence’s grammatical structure. These rules were implemented as a rule validation layer in the surface knowledge extraction process. During the execution of this algorithm, every sentence is parsed through the rule validation layer, and the triple components are extracted accurately. The remainder of this section introduces the discovered rules and their implementations using pseudocodes.

C. Rules to Extract Basic Level Triple

Rule 01:

“In basic level triple extraction, ROOT token should be assigned as the Predicate component of the triple. Then, a token with the dependency tag ‘nsubj’ or ‘nsubjpass’ for the subject component and ‘dobj’ or ‘pobj’ or ‘attr’ token as the object component of the triple should be assigned depending on their availability in the grammatical structure.”

The pseudocode shown in Fig. 3 illustrates the logic implemented for the basic-level triple extraction rule (Rule 01) using the sentence’s grammatical structure.

Fig. 3. Pseudocode for Rule 01 and Rule 02 implementations.

News sentences are lengthy and complex. They can be complex, compound, direct speech, indirect speech, active voice, passive voice, and so on. When there are multiple clauses in a sentence, there is a high chance of inaccurate triple extraction because of the complexity. For example, “Citizen and the Constitution Under the Constitution any citizen who is qualified to be elected to the office of President may be nominated by a recognized political party (Article 31) elected by the people (Art (30-2) and hold office for a period of six years.”

As the above example shows, there can be multiple tokens with the dependency tag ‘nsubj’, ‘dobj’, ‘pobj’, or ‘attr’ that are candidates for the subject/object component of the triple. This is mainly because of multiple clauses within a single sentence. No matter how complex a sentence is, it should have only one ‘ROOT’ token. Identifying the most appropriate token for the subject component and the most appropriate token for the object component (out of multiple candidate tokens) is vital to forming a meaningful triple. For that purpose, a new rule was discovered; it is referred to as Rule 02 or the “Main rule.”

Rule 02 (Main rule):

“Any token utilized in the triple should be directly or indirectly (via another token) linked with the ROOT token.”

Although there may be multiple candidate tokens for the subject and object components of the triple, the aforementioned main rule will be satisfied by only one candidate token. Therefore, using this rule, the most appropriate candidate token can be selected and a meaningful link can be created. To implement this main rule, the ROOT token should first be identified and assigned to the predicate part of the triple. When extracting other components later, the following condition should be checked to make the appropriate link to the ROOT/Predicate.

token.head.text = Predicate and token.head.POS = ‘VERB’ (Where the value for “Predicate” has already been selected.)

Refer to line 5 or line 8 of the pseudocode shown in Fig. 3 for the implementation details. The pseudocode in Fig. 3 implements regular rules that can be used to extract three components of the triple at its basic level. Although such basic-level extraction would result in a triple, it may not have adequate details to be meaningful. Therefore, enhancements are essential for making the triple more meaningful. Furthermore, in all such improvements, the above-mentioned main rule must be adhered to.
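Rules 01 and 02 can be sketched together on a hand-annotated token list. This is an illustrative reading, not a reproduction of the pseudocode in Fig. 3: the `Tok` record, the function names, and the dependency tags in `SENTENCE` are assumptions, and "indirectly (via another token)" is interpreted here as at most one intermediate token between the candidate and the ROOT.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    dep: str   # dependency tag, e.g. "nsubj"
    pos: str   # part-of-speech tag, e.g. "VERB"
    head: int  # index of the head token (the ROOT points to itself)

SUBJECT_DEPS = {"nsubj", "nsubjpass"}
OBJECT_DEPS = {"dobj", "pobj", "attr"}

def linked_to_root(tokens, i, root_i):
    """Rule 02 (assumed reading): head is the verbal ROOT, directly or via one token."""
    h = tokens[i].head
    return tokens[root_i].pos == "VERB" and (h == root_i or tokens[h].head == root_i)

def basic_triple(tokens):
    """Rule 01: ROOT becomes the predicate; pick the first linked subject and object."""
    root_i = next(i for i, t in enumerate(tokens) if t.dep == "ROOT")

    def pick(deps):
        for i, t in enumerate(tokens):
            if t.dep in deps and linked_to_root(tokens, i, root_i):
                return t.text
        return ""  # component not available in this sentence

    return pick(SUBJECT_DEPS), tokens[root_i].text, pick(OBJECT_DEPS)

# Hand-annotated: "Investigations into the robbery were handed over to the Department"
SENTENCE = [
    Tok("Investigations", "nsubjpass", "NOUN", 5),
    Tok("into", "prep", "ADP", 0),
    Tok("the", "det", "DET", 3),
    Tok("robbery", "pobj", "NOUN", 1),   # attached to the subject, not to the ROOT
    Tok("were", "auxpass", "AUX", 5),
    Tok("handed", "ROOT", "VERB", 5),
    Tok("over", "prt", "ADP", 5),
    Tok("to", "prep", "ADP", 5),
    Tok("the", "det", "DET", 9),
    Tok("Department", "pobj", "NOUN", 7),
]
```

In this sketch, "robbery" is a ‘pobj’ candidate that appears before "Department," but it is rejected because its path to the ROOT runs through the subject.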

D. Advanced Rules to Enhance the Basic Level Triple

Basic-level triple extraction may result in an acceptable meaning for a triple. Affixing appropriate terms to the triple components yields an enhanced meaning. Although our target was to extract surface knowledge, we aimed to extract triples with the best possible meaning. The challenge is to add as few additional terms as possible while doing so.

For example, “The Ceylon Independence Act was passed by the British Parliament.”

Basic level triple: Act | passed | parliament

Enhanced triple: Independence Act | passed by | British parliament

1) Appending Preposition as a Suffix to Predicate

Rule 03:

“If the prepositional object (‘pobj’) has contributed to form the triple object, its relevant preposition should be appended to the predicate as a suffix.”

For example,

Triple before: Decision | taken | meeting

Triple after: Decision | taken at | meeting

In the above-mentioned example, the word ‘at’ is the preposition and ‘meeting’ is the prepositional object assigned as the object component of the triple. With the token ‘at’ added as a suffix to the predicate, the expanded triple yields a better meaning. However, one exception was observed when there were words between the ROOT and the preposition, as shown below.

For example, “He contended the construction of the memorial and museum.”

Triple before: He | contended of | memorial

As you can see in the above example, the output triple may not be meaningful without the middle word “construction.” Therefore, the expected triple can be expressed as follows:

Triple after: He | contended construction of | memorial

As a solution to this drawback, relevant words between the selected predicate and the preposition were identified by traversing the grammatical structure in the backward direction. Refer to the pseudocode in Fig. 4 for the implementation of Rule 03.

Fig. 4. Pseudocode for Rule 03 implementation.
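One way to realize Rule 03, including the exception for intervening words, is to walk the head chain backward from the preposition until the ROOT is reached. The sketch below is an assumption about how "relevant words" are selected (every token on the head chain is kept); the `Tok` record, the function name, and the hand-assigned tags are illustrative and are not taken from Fig. 4.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token (the ROOT points to itself)

def predicate_with_preposition(tokens, root_i, obj_i):
    """Rule 03: if the object is a 'pobj', suffix its preposition to the predicate."""
    predicate = tokens[root_i].text
    if tokens[obj_i].dep != "pobj":
        return predicate
    chain = []
    i = tokens[obj_i].head               # the preposition governing the object
    for _ in range(len(tokens)):         # bounded walk guards against malformed input
        if i == root_i:
            break
        chain.append(tokens[i].text)     # collect the words between ROOT and object
        i = tokens[i].head
    return " ".join([predicate] + chain[::-1])

# Hand-annotated: "He contended the construction of the memorial"
CONTENDED = [
    Tok("He", "nsubj", 1),
    Tok("contended", "ROOT", 1),
    Tok("the", "det", 3),
    Tok("construction", "dobj", 1),
    Tok("of", "prep", 3),
    Tok("the", "det", 6),
    Tok("memorial", "pobj", 4),
]

# Hand-annotated: "Decision was taken at the meeting"
DECISION = [
    Tok("Decision", "nsubjpass", 2),
    Tok("was", "auxpass", 2),
    Tok("taken", "ROOT", 2),
    Tok("at", "prep", 2),
    Tok("the", "det", 5),
    Tok("meeting", "pobj", 3),
]
```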

2) Appending Particles as Suffixes to Predicate when Phrasal Verbs Exist

Rule 04:

“When a sentence has a phrasal verb such as ‘call off’, ‘rule out’ etc., the grammatical structure denotes the first word of the phrase as the ROOT and other words as particles. In such sentences, when the predicate is formed, particles of the ‘ROOT’ should also be appended to the predicate as a suffix to get the proper meaning.”

For example, “investigations into the Thalawa state bank robbery had been handed over to the Criminal Investigation Department, Police Spokesperson Ruwan Gunasekara.”

Triple before: investigations | handed to | Investigation Department

Triple after: investigations | handed over to | Investigation Department

In the above example, ‘handed over’ is a phrasal verb. The grammatical structure denotes the token ‘handed’ as the ROOT, and it is selected as the predicate. Additionally, the token ‘over’ is denoted as a particle and selected as the suffix for the predicate. Refer to the pseudocode in Fig. 5 for the implementation of Rule 04.

Fig. 5. Pseudocode for Rule 04 implementation.
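Rule 04 amounts to appending every particle attached to the ROOT. The following minimal sketch assumes that particles carry the ‘prt’ dependency tag, as in spaCy's English label scheme; the `Tok` record, the function name, and the hand-assigned tags are illustrative and are not taken from Fig. 5.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token (the ROOT points to itself)

def predicate_with_particles(tokens, root_i):
    """Rule 04: append the particles ('prt') of the ROOT to the predicate."""
    particles = [t.text for t in tokens if t.dep == "prt" and t.head == root_i]
    return " ".join([tokens[root_i].text] + particles)

# Hand-annotated: "Investigations were handed over to the Department"
HANDED = [
    Tok("Investigations", "nsubjpass", 2),
    Tok("were", "auxpass", 2),
    Tok("handed", "ROOT", 2),
    Tok("over", "prt", 2),
    Tok("to", "prep", 2),
    Tok("the", "det", 6),
    Tok("Department", "pobj", 4),
]
```

Combined with the preposition suffix of Rule 03, this would give the predicate "handed over to" for the example above.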

3) Appending Compound Words and Modifiers as Prefixes to Subject and Object

The tokens selected as subject and object components of the basic-level triple may not be meaningful. Therefore, appropriate prefixes can be combined to render the triple components more meaningful.

Rule 05:

“If Subject and/or Object components of the basic level triple consist of a modifier or a compound word prior to them in the original sentence, such tokens should be added as prefixes to make Subject and Object components more meaningful.”

For example, “Most countries had achieved significant gains in reducing hunger in the last 25 years, progress in the majority of countries affected by conflict had stagnated or deteriorated.”

Triple before: countries | achieved | gains

Triple after: most countries | achieved | significant gains

Once the basic-level triple components have been identified, such prefixes are sought by re-traversing the same grammatical structure. Refer to the pseudocode in Fig. 6 for the implementation of Rule 05.

Fig. 6. Pseudocode for Rule 05 implementation.
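Rule 05 can be sketched as a second pass that collects the dependents standing before the selected subject or object token. Which dependency tags count as "a modifier or a compound word" is not specified in the text, so the set `PREFIX_DEPS` below is an assumption, as are the `Tok` record, the function name, and the hand-assigned tags.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token (the ROOT points to itself)

PREFIX_DEPS = {"compound", "amod"}  # assumed tags for compound words and modifiers

def with_prefixes(tokens, i):
    """Rule 05: prepend compounds/modifiers that precede token i and depend on it."""
    prefix = [t.text for j, t in enumerate(tokens)
              if j < i and t.head == i and t.dep in PREFIX_DEPS]
    return " ".join(prefix + [tokens[i].text])

# Hand-annotated: "Most countries had achieved significant gains"
GAINS = [
    Tok("Most", "amod", 1),
    Tok("countries", "nsubj", 3),
    Tok("had", "aux", 3),
    Tok("achieved", "ROOT", 3),
    Tok("significant", "amod", 5),
    Tok("gains", "dobj", 3),
]
```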

4) When the Word “of” Comes after the Subject or Object token

As in Rule 05, appropriate tokens can be combined as suffixes for the subject and object components of the triple. This becomes mandatory when the word “of” comes after the tokens that have been selected as the subject and object components of the triple. When the sentences have “lot of people”, “percentage of students”, “crisis of confidence” etc., representing its subject or object, the output triple becomes meaningless.

Rule 06:

“If the preposition ‘of’ is found right after the Subject/Object, it cannot stand meaningfully without the relevant prepositional object. In such situations, the preposition and prepositional object should be appended to the subject/object words as suffixes.”

Example 1: “Lot of teachers must be taught these important characteristics”

Triple before: Lot | taught | important characteristics

Triple after: Lot of teachers | taught | important characteristics

Example 2: “This way we can increase the percentage of University entrants to over 50%, the amount now between 15 and 16%.”

Triple before: we | increase | percentage

Triple after: we | increase | percentage of entrants

Refer to pseudocode in Fig. 7 for the Rule 06 implementation.

Fig. 7. Pseudocode for Rule 06 implementation.
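A possible reading of Rule 06 checks whether the token immediately after the subject or object is the preposition "of" attached to it, and if so appends "of" together with its prepositional object. The `Tok` record, the function name, and the hand-assigned tags are illustrative assumptions and are not taken from Fig. 7.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token (the ROOT points to itself)

def with_of_suffix(tokens, i):
    """Rule 06: append 'of' + its prepositional object when 'of' follows token i."""
    nxt = i + 1
    if (nxt < len(tokens)
            and tokens[nxt].text.lower() == "of"
            and tokens[nxt].head == i):
        pobj = next((t.text for t in tokens
                     if t.dep == "pobj" and t.head == nxt), None)
        if pobj is not None:
            return f"{tokens[i].text} of {pobj}"
    return tokens[i].text

# Hand-annotated: "we can increase the percentage of University entrants"
ENTRANTS = [
    Tok("we", "nsubj", 2),
    Tok("can", "aux", 2),
    Tok("increase", "ROOT", 2),
    Tok("the", "det", 4),
    Tok("percentage", "dobj", 2),
    Tok("of", "prep", 4),
    Tok("University", "compound", 7),
    Tok("entrants", "pobj", 5),
]
```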

E. Rule to Replace Predicate with Open Clausal Complement

In some sentences, the main verb that holds the ’ROOT’ dependency tag comes with an open clausal complement. For example, “The Central Bank is hoping to build its foreign reserves of up to US$ 10 billion this year.”

In this example, “build” is an open clausal complement token. If the predicate is formed based on the ‘ROOT’ token, the following would be the output triple.

Central Bank | hoping | foreign reserves

However, the above triple does not provide the expected meaning. Therefore, Rule 07 has been introduced to address this issue.

Rule 07:

“When a token with an open clausal complement dependency tag comes after the ROOT token, which has already been selected as the predicate, the open clausal complement token should be used as the predicate of the triple instead of the ROOT token of the grammatical structure.”

Triple after: Central Bank | build | foreign reserves

Refer to pseudocode in Fig. 8 for the Rule 07 implementation.

Fig. 8. Pseudocode for Rule 07 implementation.
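Rule 07 can be sketched as a predicate selector that prefers an open clausal complement over the ROOT. The sketch assumes the complement carries the ‘xcomp’ tag, as in spaCy's English label scheme; the `Tok` record, the function name, and the hand-assigned tags are illustrative and are not taken from Fig. 8.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token (the ROOT points to itself)

def predicate_index(tokens):
    """Rule 07: use the 'xcomp' token after the ROOT as predicate when one exists."""
    root_i = next(i for i, t in enumerate(tokens) if t.dep == "ROOT")
    for i in range(root_i + 1, len(tokens)):
        if tokens[i].dep == "xcomp" and tokens[i].head == root_i:
            return i
    return root_i  # no open clausal complement: keep the ROOT

# Hand-annotated: "The Central Bank is hoping to build its foreign reserves"
BANK = [
    Tok("The", "det", 2),
    Tok("Central", "compound", 2),
    Tok("Bank", "nsubj", 4),
    Tok("is", "aux", 4),
    Tok("hoping", "ROOT", 4),
    Tok("to", "aux", 6),
    Tok("build", "xcomp", 4),
    Tok("its", "poss", 9),
    Tok("foreign", "amod", 9),
    Tok("reserves", "dobj", 6),
]
```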

F. Handling Complex Sentences

If a sentence is complex, very often it has linking words. Therefore, triple extraction from such sentences should be handled differently.

For example, “The government was unable to significantly reduce the cost of living with subsidies due to the necessary fiscal and monetary discipline to face the upcoming debt repayment cycle, although politicians continued their hedonistic lifestyles.”

In such complex sentences, “Discourse markers” or “linking phrases/words” are commonly used (For example, although, however, since, assuming that, and in fact).

Rule 08:

“When a sentence is a complex type that has multiple clauses linked to each other using discourse markers, split the sentence into separate segments based on the discourse markers and then triple extraction should be carried out from each segment separately.”

Discourse markers can be identified in the grammatical structure using the tokens with their dependency tag set as “mark.” According to Rule 08, the given example results in two triples extracted from two sentence segments/clauses, as follows:

Segment 1: “The government was unable to significantly reduce the cost of living with subsidies due to the necessary fiscal and monetary discipline to face the upcoming debt repayment cycle.”

Token with a discourse mark tag: “although”

Segment 2: “politicians continued their hedonistic lifestyles.”

Triples generated:

Government | cut down | cost of living

politicians | continued | hedonistic lifestyles
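The segmentation step of Rule 08 can be sketched as a split at every token tagged “mark.” The function name and the (text, tag) pairs below are illustrative, with tags hand-assigned for a shortened version of the example sentence; dropping punctuation tokens is an added assumption, and each resulting segment would then be passed to triple extraction separately.

```python
def split_on_markers(tokens):
    """Rule 08: cut the token sequence at tokens whose dependency tag is 'mark'."""
    segments, current = [], []
    for text, dep in tokens:
        if dep == "mark":            # discourse marker: close the current segment
            if current:
                segments.append(current)
            current = []
        elif dep != "punct":         # assumption: punctuation is dropped
            current.append(text)
    if current:
        segments.append(current)
    return [" ".join(seg) for seg in segments]

# Hand-annotated, shortened version of the example sentence
COMPLEX = [
    ("The", "det"), ("government", "nsubj"), ("was", "ROOT"),
    ("unable", "acomp"), ("to", "aux"), ("reduce", "xcomp"),
    ("the", "det"), ("cost", "dobj"), ("of", "prep"), ("living", "pobj"),
    (",", "punct"), ("although", "mark"),
    ("politicians", "nsubj"), ("continued", "advcl"),
    ("their", "poss"), ("hedonistic", "amod"), ("lifestyles", "dobj"),
    (".", "punct"),
]
```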

G. Handling Reported Speech with ‘Said’ Word

Reported speech is a frequently used writing style in news articles. Because our research uses news as its data source, reported speech handling cannot be avoided. The grammatical structure of a sentence containing the word “said” identifies that word as the ‘ROOT’ token. Therefore, the output triple is meaningless. Refer to the following two examples.

Example 1: “Minister Athukorala stated that the MoU indicates both nations' keenness to bring transparency in all stages of recruitment and employment.”

Extracted triple: Minister Athukorala | said | MoU

Example 2: “the Sri Lanka Freedom Party (SLFP) members of the Unity Government would continue to remain in it, backing SLFP Chairman and President Maithripala Sirisena, Minister S.B. Dissanayake today said.”

Extracted triple: Sri Lanka Freedom Party | said |

To resolve this anomaly, Rule 09 has been implemented.

Rule 09:

“If the word “said” is found in the sentence (reported speech), separate the sentence into two clauses, the left side and the right side of the word “said.” Then identify the reported clause from these two clauses and extract the triple from it. Ignore the reporting clause.”

In reported speech sentences, the reported clause includes what the original speaker said. Therefore, the reported clause is more important for knowledge extraction. The automatic classification of reported and reporting clauses is challenging. Although its success rate is not high, a simple logic was identified to perform this task. It was observed from many examples that the reported clause is found in the right-side segment (as in Example 1 above) unless the left-side segment is much longer than the right-side segment and a comma “,” is found in the left-side segment. The implementation of this logic is depicted as pseudocode in Fig. 9.

Fig. 9. Pseudocode for Rule 09 implementation.
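The left/right heuristic of Rule 09 can be sketched as follows. The text does not quantify "much longer," so the `ratio` parameter (default 2.0) is a hypothetical threshold; the function name and whitespace tokenization are likewise assumptions, and the sketch does not remove the speaker's name from a left-side segment.

```python
def reported_clause(sentence: str, ratio: float = 2.0) -> str:
    """Rule 09: pick the segment around 'said' that likely holds the reported clause."""
    words = sentence.rstrip(".").split()
    if "said" not in words:
        return sentence                       # not reported speech: leave unchanged
    k = words.index("said")
    left, right = words[:k], words[k + 1:]
    left_has_comma = any("," in w for w in left)
    # Default to the right side; switch to the left side only when it is much
    # longer than the right side and contains a comma (hypothetical threshold).
    if left_has_comma and len(left) > ratio * len(right):
        return " ".join(left)
    return " ".join(right)
```

For instance, a sentence ending in "..., the minister said." has an empty right-hand segment, so the left-hand segment is selected.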

H. Handling Passive Voice

The grammatical structures of active and passive voice sentences are slightly different from each other. The two main differences include the following: 1) Active voice sentences have tokens of dependency tag ‘nsubj’ for nouns that represent the subject. However, passive voice sentences have tokens of dependency tag ‘nsubjpass’ for nouns that represent their nominal subject, 2) Passive voice sentences mostly have a token with a dependency tag ‘agent’, immediately after the preposition ‘by’. However, with minor modifications to the algorithm, passive-voice sentences can also be processed using the same triple-extraction algorithm.

Rule 10:

“When the Subject component of the triple is selected from the grammatical structure of the sentences or in other words, wherever the ‘nsubj’ dependency tag is searched, ‘nsubjpass’ dependency tag should also be considered for passive voice.”

The implementation of Rule 10 is represented in the pseudocode shown in Fig. 10 by changing lines 3, 4, and 5 of the basic triple extraction pseudocode depicted in Fig. 3.

For example, “When investigated it was found that 12 dolphins were killed by nine fishermen who were arrested by the Trincomalee Crime Prevention Unit.”

Extracted triple: dolphins | killed by | fishermen

Even though our regular triple extraction algorithm is suitable for passive voice sentences with minor modifications, as described above, one anomaly was observed with some ‘Predicate’ part of the triple, as shown below.

For example, “Temporary workers are paid their wages on weekly basis.”

Extracted triple: Temporary workers | paid | wages

In the example above, the passive sense of the triple is lost. If the agent is unknown (done by whom) and past participle verbs are the same as past tense verbs (for example, paid, made, and read), the passive voice triple gives incorrect meaning. Rule 11 was introduced to address this issue.

Rule 11:

“If the sentence is passive voice and the verb has the same token for its past tense and past participle tense, then predicate should be formed with the ROOT token and its auxiliary verb prior to it.”

Extracted triple after rule implementation:

Temporary workers | are paid | wages
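Rule 11 can be sketched as a check on the ROOT verb followed by prepending its passive auxiliary. Several points are assumptions: the verb set `SAME_PAST_AND_PARTICIPLE` contains only the three examples named in the text, the auxiliary is assumed to carry the ‘auxpass’ tag as in spaCy's English label scheme, and the "agent unknown" condition is taken from the explanatory paragraph rather than from the rule statement itself.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token (the ROOT points to itself)

# Only the examples named in the text; a usable list would be much longer.
SAME_PAST_AND_PARTICIPLE = {"paid", "made", "read"}

def passive_predicate(tokens, root_i):
    """Rule 11: keep the passive auxiliary when the verb form alone is ambiguous."""
    root = tokens[root_i]
    is_passive = any(t.dep == "nsubjpass" and t.head == root_i for t in tokens)
    has_agent = any(t.dep == "agent" and t.head == root_i for t in tokens)
    if is_passive and not has_agent and root.text.lower() in SAME_PAST_AND_PARTICIPLE:
        aux = [t.text for j, t in enumerate(tokens)
               if j < root_i and t.dep == "auxpass" and t.head == root_i]
        return " ".join(aux + [root.text])
    return root.text

# Hand-annotated: "Temporary workers are paid their wages"
WORKERS = [
    Tok("Temporary", "amod", 1),
    Tok("workers", "nsubjpass", 3),
    Tok("are", "auxpass", 3),
    Tok("paid", "ROOT", 3),
    Tok("their", "poss", 5),
    Tok("wages", "dobj", 3),
]
```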

A. Experimental Study Over Sri Lankan Open Domain News Corpus

The proposed approach was successfully implemented and validated using the Sri Lankan English news corpus. The algorithmic details are presented as pseudocode in Figs. 3-10.

Fig. 10. Pseudocode for Rule 10 implementation.

Output generation:

The output triples were written into a CSV file with the pipe character used as the delimiter between the triple components. Some of the extracted triples are shown in Fig. 11. Table 1 contains some important statistics related to the extraction process. According to Table 1, every valid sentence used for extraction resulted in a well-formed triple (a value exists for all three components). This is likely because sentences anticipated to be problematic were ignored before extraction.

Table 1. Statistical facts of surface knowledge extraction over the Sri Lankan news corpus

| Fact | Figure |
| --- | --- |
| Number of documents in the corpus | 5,409 |
| Number of sentences after pre-processing | 116,839 |
| Number of sentences ignored | 62,638 |
| Number of sentences valid for extractions | 54,201 |
| Number of triples extracted | 54,201 |
| Number of distinct predicates extracted | 10,116 |
| Number of distinct subjects extracted | 6,736 |
| Number of distinct objects extracted | 7,937 |

Fig. 11. Some extracted surface knowledge (Triples) from the Sri Lankan news.

Sentences ignored during the algorithm execution:

When a sentence is a command, the ROOT token is the first in the grammatical structure array. In such cases, the subject component of the triple becomes empty, which results in a malformed triple. Therefore, these sentences were ignored.

For example, “Shift the insurance liability toward manufacturers.”
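The check described above can be sketched as follows, assuming the grammatical structure is available as a list of token dictionaries in left-to-right order. The function name is_command, the token layout, and the hand-written dependency tags are illustrative assumptions rather than the original code.

```python
from typing import Dict, List


def is_command(tokens: List[Dict[str, str]]) -> bool:
    """True when the ROOT token opens the sentence and has no subject attached."""
    if not tokens:
        return False
    first = tokens[0]
    if first["dep"] != "ROOT":
        return False
    # A subject attached to the ROOT would give the triple a non-empty subject.
    has_subject = any(
        t["dep"] in ("nsubj", "nsubjpass") and t["head"] == first["text"]
        for t in tokens[1:]
    )
    return not has_subject


# Hand-written tags for the example sentence (assumed, not parser output).
command = [
    {"text": "Shift", "dep": "ROOT", "head": "Shift"},
    {"text": "the", "dep": "det", "head": "liability"},
    {"text": "insurance", "dep": "compound", "head": "liability"},
    {"text": "liability", "dep": "dobj", "head": "Shift"},
    {"text": "toward", "dep": "prep", "head": "Shift"},
    {"text": "manufacturers", "dep": "pobj", "head": "toward"},
]
ignored = is_command(command)
```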

B. Inter-rater-agreement-based Validation of Extracted Triples - Sri Lankan News

The purpose of this validation process was to evaluate how meaningful the output of the proposed surface knowledge extraction algorithm is. Accuracy was measured by conducting an inter-rater agreement test among four participants selected from our testing team, which consisted of academics and professionals who were experts in English and came from different professional backgrounds. For this validation process, four sample sets of 387 triples each were randomly chosen from the extracted triples using sampling without replacement. Altogether, 1,548 triples were selected and distributed among the four participants. This sample distribution ensured that every triple was verified by at least three participants, and the meaningfulness/correctness of each extracted triple was decided by a majority vote. The validation results of all four members were combined and re-processed to determine the final accuracy rate. The results are presented in Table 2.

Because the generated output is large, performing a validation test with high variability and reliability is vital. Variability was achieved by randomly choosing the test triples from the entire corpus using sampling without replacement. Reliability was assessed using an inter-rater agreement mechanism. The results show that the surface knowledge extraction achieved a meaningful triple extraction rate of 83.5% (1,293 of 1,548 sampled triples), indicating that the algorithm yields accurate results for most sentences.

Table 2 . Inter-rater-agreement test results – Sri Lankan news corpus

| Fact | Figure |
| --- | --- |
| Total number of triples in the samples | 1,548 |
| Number of triples that voted as meaningful | 1,293 |
| Number of triples that voted as meaningless | 255 |
| Meaningful triple extraction rate (as a percentage) | 83.5 |
| Error rate (as a percentage) | 16.5 |
| IRR score (as a percentage) | 76 |
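The majority-vote decision and the percentages in Table 2 can be reproduced with a short sketch. The function names are hypothetical, and a strict majority (more than half of the votes) is assumed because the tie-breaking policy is not stated in the text.

```python
from typing import List, Sequence


def majority_is_meaningful(votes: Sequence[bool]) -> bool:
    """True when more than half of the raters marked the triple as meaningful."""
    return sum(votes) * 2 > len(votes)


def meaningful_rate(all_votes: List[Sequence[bool]]) -> float:
    """Percentage of triples judged meaningful by majority vote."""
    meaningful = sum(majority_is_meaningful(v) for v in all_votes)
    return 100.0 * meaningful / len(all_votes)


# Recomputing the Table 2 percentages from its reported counts.
rate = round(100.0 * 1293 / 1548, 1)
error = round(100.0 * 255 / 1548, 1)
```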


However, this performance was not achieved in a single attempt. Several improvement cycles were performed, in which new rules were discovered and tested one by one. After the main rule (Rule 02) was introduced, the accuracy rate increased from 60% to 83.5%. Rule 02 therefore contributes significantly to the improved accuracy, as it makes accurate linkages among tokens in the grammatical structure.

For the inter-rater agreement test, the level of agreement between raters, also known as Inter-Rater Reliability (IRR), was computed using the percent agreement method. The IRR score obtained was 76%, which indicates that the judgments of our testing team were reliable.
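One common way to compute percent agreement for more than two raters is to pool the agreements over all rater pairs, counting only the triples that both raters in a pair judged. The sketch below follows that convention; whether the study pooled pairs in this way is an assumption, and the function name and data layout are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional


def percent_agreement(ratings: Dict[str, List[Optional[bool]]]) -> float:
    """Pairwise percent agreement, pooled over all rater pairs.

    ratings maps a rater name to one verdict per triple
    (True = meaningful, False = meaningless, None = triple not assigned).
    """
    agree = 0
    total = 0
    for a, b in combinations(ratings.values(), 2):
        for verdict_a, verdict_b in zip(a, b):
            if verdict_a is None or verdict_b is None:
                continue  # only triples judged by both raters are compared
            total += 1
            agree += int(verdict_a == verdict_b)
    if total == 0:
        raise ValueError("no triple was judged by two raters")
    return 100.0 * agree / total


# Toy data: three raters, four triples, some triples not assigned to everyone.
toy = {
    "rater_1": [True, True, False, None],
    "rater_2": [True, False, False, True],
    "rater_3": [None, True, False, True],
}
score = percent_agreement(toy)
```

Percent agreement does not correct for chance agreement, which is worth keeping in mind when comparing it with statistics such as Cohen's or Fleiss' kappa.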

C. Extended Study Over BBC News Dataset

The same knowledge extraction algorithm was executed on the BBC news dataset, a public open-domain corpus of 2,225 news articles [30]. Table 3 presents some important statistics for this experiment.

Table 3 . Statistical facts relevant to surface knowledge extraction over the BBC news dataset

| Fact | Figure |
| --- | --- |
| Number of documents in the corpus | 2,225 |
| Number of sentences after pre-processing (sentence tokenization) | 41,983 |
| Number of sentences ignored | 23,799 |
| Number of sentences selected as valid for extractions | 18,184 |
| Number of well-formed triples extracted | 18,184 |
| Number of malformed triples extracted | 0 |


Some of the resulting triples are shown in Fig. 12. These extracted triples demonstrate that our triple-extraction algorithm is applicable to open-domain text beyond the Sri Lankan news corpus.

Fig. 12. Some extracted surface knowledge/triples from the BBC news dataset.

D. Inter-rater-agreement-based Validation of Extracted Triples - BBC News

Validating these results is crucial for claiming effectiveness. Triples extracted from the BBC news corpus were also validated using the inter-rater agreement method, in the same way as the triples of the Sri Lankan news corpus. However, the BBC news context may not be familiar to Sri Lankans. Therefore, to obtain a fair judgment, an examiner panel was formed using four native British examiners. During this validation, 800 triples were selected and distributed among the four examiners of the evaluation team. This sample distribution also ensured that every triple was manually verified by at least three qualified examiners. A triple was therefore counted as meaningful/correct when at least two examiners agreed. The validation results of all four examiners were combined and re-processed to determine the final accuracy rate. The results are presented in Table 4.

Table 4 . Inter-rater-agreement test results - BBC news corpus

| Fact | Figure |
| --- | --- |
| Total number of triples in the samples | 800 |
| Number of triples that voted as meaningful | 741 |
| Number of triples that voted as meaningless | 59 |
| Meaningful triple extraction rate (as a percentage) | 92.6 |
| Error rate (as a percentage) | 7.4 |
| IRR score (as a percentage) | 90 |


The results show that the surface knowledge extraction achieved a 92.6% meaningful triple extraction rate (741 of 800 sampled triples). In addition, the IRR score computed for this validation process using the percent agreement method was 90%, which indicates the high reliability of our validation team.

The BBC news corpus yielded better results than the Sri Lankan news corpus. This supports both the accuracy of our surface knowledge extraction algorithm and its general applicability to open-domain unstructured text.

E. Unhandled Exceptions

Even though the surface knowledge extraction accuracy is good, it is still below 100% because of some unhandled exceptions. The triples marked as ‘meaningless’ by the evaluators were mostly caused by the exceptional cases described below.

Unexpected behavior in grammatical structure:

Consider the example sentence “A sizeable share of the Sri Lankan population migrates from rural areas to cities.” The grammatical structure produced for this sentence was unexpected: the word “share” was marked as the ROOT, whereas the word “migrates” was expected to be the ROOT.

Sentences with “said” word as the ‘ROOT’:

This type of sentence was handled differently using our own logic (Rule 08) introduced to the extraction algorithm. However, some examples deviate from the assumption made in the relevant rule, and the extraction becomes incorrect.

Sentences with synonyms of “said” word as the ‘ROOT’:

Some sentences have a synonym of “said,” such as “stated” or “commented,” as the ROOT. However, the rule introduced for “said” cannot be applied to these sentences because their formats differ from those of the “said” sentences.

Identifying the best term from a noun phrase that has “of” word:

In the example “Percentage of entrants,” the most appropriate noun for the triple component should be the word “entrants” (that is, prepositional object). However, with the example “Claims of abuse,” the most appropriate noun for the triple component should be the word “claims” (not the prepositional object). Therefore, the selection of the best word for the triple component from such noun phrases is nontrivial.

Even with the aforementioned unhandled exceptions, the accuracy of our surface knowledge extraction algorithm is remarkable. It is important to emphasize that this accuracy rate was achieved even without using the contextual information of the news articles as a whole.

F. Efficiency Analysis of the Surface Knowledge Extraction Algorithm

Our tokenized corpus contained 116,839 sentences. The surface knowledge extraction algorithm processed them at an average speed of 0.02 seconds per sentence. At this rate, a single pass over the entire corpus takes approximately 116,839 × 0.02 s ≈ 2,337 s (about 39 minutes), which is consistent with our measured total triple extraction time of less than one hour (excluding preprocessing time) and indicates that the algorithm is efficient. The experiment was run on an Intel i5-8350U processor (1.70 GHz, 4 cores, 8 threads) with 16 GB of memory.

Handling reported speech in the surface knowledge extraction algorithm requires improvement. A robust mechanism is needed to identify the reported clause of a sentence, which is essential for extracting knowledge accurately from reported speech sentences. In addition, the most suitable word must be identified from a noun phrase when a preposition and its prepositional object have been appended as a suffix to the subject and/or object component of the triple. Examples include “lot of students” (subject = lot, preposition = of, prepositional object = students), “percentage of entrants,” “bunch of grapes,” and “claims of abuse.” In such situations, our algorithm currently uses the entire noun phrase as the triple component (subject or object) to preserve the meaning. As a result, the triple becomes lengthy, and the problem worsens if a compound word or modifier is appended. Therefore, a method is needed to identify the most appropriate word for the triple component from the identified noun phrase.

The ever-increasing effect of information overload requires humans to be extremely selective regarding what they read and store as knowledge. Much more can be made accessible to humans if such textual content can be encoded into forms that lend themselves more readily to inference. Closing this gap between natural-language textual content and a machine-processable form requires the extraction of surface knowledge. This study attempted to provide a solution to this problem by automatically extracting surface knowledge in terms of triples from open domain news.

After a thorough examination of all features of the grammatical structure, 11 important rules were discovered that could be used in the extraction of meaningful surface knowledge. Among these, complex sentence handling, passive voice handling, and reported speech handling are notable contributions to the domain of AKE. The results were validated using an inter-rater agreement test to ensure high reliability. These tests obtained acceptable IRR scores computed using the percent agreement method. The proposed approach achieved meaningful triple extraction rates of 83.5% for the Sri Lankan news corpus and 92.6% for BBC news. The preprocessed Sri Lankan English news corpus and the discovered rules for surface knowledge extraction using the grammatical structure of the sentence can be highlighted as valuable contributions of this study.

With these research findings, AKE from open-domain texts has become more feasible. Hence, machines can interpret open-domain unstructured text using modern computer-processing power. Our findings will also be useful for the development of the semantic web. Finally, news/web data, which is an extremely large knowledge source, need no longer go to waste.

  1. Y. Matsuo, Keyword extraction from a single document using word co-occurrence statistical information, International Journal on Artificial Intelligence Tools, vol. 13, no. 1, pp. 157-169, Mar, 2004.
  2. P. K. Shah, C. Perez-Iratxeta, P. Bork, and M. A. Andrade, Information extraction from full text scientific articles: where are the keywords?, BMC bioinformatics, vol. 4, no. 1, p. 20, May, 2003.
  3. S. Beliga, A. Mestrovic, and S. Martincic-Ipsic, An overview of graph-based keyword extraction methods and approaches, Journal of information and organizational sciences and JIOS, vol. 39, no. 1, pp. 1-20, Jul, 2015.
  4. A. Tixier, F. Malliaros, and M. Vazirgiannis, A graph degeneracy-based approach to keyword extraction, in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin: TX, USA, pp. 1860-1870, 2016.
  5. H. M. M. Hasan, F. Sanyal, D. Chaki, and M. H. Ali, An empirical study of important keyword extraction techniques from documents, in 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), Aurangabad, India, pp. 91-94, Oct, 2017.
  6. S. K. Bharti, K. S. Babu, and A. Pradhan, Automatic keyword extraction for text summarization in multi-document e-newspapers articles, European Journal of Advances in Engineering and Technology, vol. 4, no. 6, pp. 410-427, 2017.
  7. Z. Liu, P. Li, Y. Zheng, and M. Sun, Clustering to find exemplar terms for keyphrase extraction, in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, Singapore, pp. 257-266, Aug, 2009.
  8. Y. Ouyang, W. Li, and R. Zhang, 273. Task 5. keyphrase extraction based on core word identification and word expansion, in Proceedings of the 5th international workshop on semantic evaluation, Uppsala, Sweden, pp. 142-145, 2010.
  9. S. N. Kim, O. Medelyan, M.-Y. Kan, and T. Baldwin, Automatic keyphrase extraction from scientific articles, Language Resources and Evaluation, vol. 47, pp. 723-742, Dec, 2013.
  10. D. Mahata, J. Kuriakose, R. R. Shah, and R. Zimmermann, Key2Vec: Automatic ranked keyphrase extraction from scientific articles using phrase embeddings, in Proceedings of NAACL-HLT 2018, New Orleans: LA, USA, vol. 2, pp. 634-639, 2018.
  11. G. Rabby, S. Azad, M. Mahmud, K. Z. Zamli, and M. M. Rahman, A flexible keyphrase extraction technique for academic literature, in Procedia Computer Science, Tangerang, Indonesia, vol. 135, pp. 553-563, 2018.
  12. K. Bennani-Smires, C. Musat, A. Hossmann, M. Baeriswyl, and M. Jaggi, Simple unsupervised keyphrase extraction using sentence embeddings, in Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, pp. 221-229, Jan, 2018.
  13. O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, Web-scale information extraction in KnowItAll: (preliminary results), in Proceedings of the 13th international conference on World Wide Web, New York: NY, USA, pp. 100-110, May, 2004.
  14. O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, Unsupervised named-entity extraction from the web: An experimental study, Artificial Intelligence, vol. 165, no. 1, pp. 99-134, Jun, 2005.
  15. A. Ritter, S. Clark, Mausam, and O. Etzioni, Named entity recognition in tweets: An experimental study, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, U.K, pp. 1524-1534, Jul, 2011.
  16. M. Mintz, S. Bills, R. Snow, and D. Jurafsky, Distant supervision for relation extraction without labeled data, in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, Suntec, Singapore, vol. 2, pp. 1003-1011, 2009.
  17. D. Q. Nguyen and K. Verspoor, Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings, in Proceedings of the BioNLP 2018 workshop, Melbourne, Australia, pp. 129-136, May, 2018.
  18. K. Gábor, D. Buscaldi, A.-K. Schumann, B. QasemiZadeh, H. Zargayouna, and T. Charnois, SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers, in Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans: LA, USA, pp. 679-688, 2018.
  19. S. Pawar, G. K. Palshikar, and P. Bhattacharyya, Relation extraction: A survey, arXiv:1712.05191 [cs], Dec, 2017.
  20. G. Bordea, E. Lefever, and P. Buitelaar, SemEval-2016 task 13: Taxonomy extraction evaluation (texeval-2), in SemEval-2016, San Diego: CA, USA, pp. 1081-1091, 2016.
  21. P. Maitra and D. Das, UNLP at SemEval-2016 Task 13: A language independent approach for hypernym identification, in Proceedings of SemEval, San Diego: CA, USA, pp. 1310-1314, 2016.
  22. A. Panchenko, S. Faralli, E. Ruppert, S. Remus, H. Naets, C. Fairon, S. P. Ponzetto, and C. Biemann, TAXI at SemEval-2016 Task 13: A taxonomy induction method based on lexico-syntactic patterns, substrings and focused crawling, in Proceedings of SemEval, San Diego: CA, USA, pp. 1320-1327, 2016.
  23. A. Yates, M. Banko, M. Broadhead, M. Cafarella, O. Etzioni, and S. Soderland, TextRunner: open information extraction on the web, in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations on XX - NAACL '07, Rochester: NY, USA, pp. 25-26, 2007.
  24. F. Wu and D. S. Weld, Open information extraction using Wikipedia, in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 118-127, 2010.
  25. O. Etzioni, A. Fader, J. Christensen, S. Soderland, and M. Mausam, Open information extraction: The second generation, in IJCAI, 2011, vol. 11, pp. 3-10, 04, Jul, 2017.
  26. J. Fan, A. Kalyanpur, D. C. Gondek, and D. A. Ferrucci, Automatic knowledge extraction from documents, IBM Journal of Research and Development, vol. 56, no. 3,4, pp. 5:1-5:10, May, 2012.
  27. S. Soderland, B. Roof, B. Qin, S. Xu, Mausam, and O. Etzioni, Adapting open information extraction to domain-specific relations, AI Magazine, vol. 31, pp. 93-102, Jul, 2010.
  28. T. M. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling, Never-Ending learning, Communications of the ACM, vol. 61, no. 5, pp. 103-115, May, 2018.
  29. D. Bennet, The depth of knowledge: surface, shallow or deep?, VINE, vol. 38, no. 4, pp. 405-420, Oct, 2008.
  30. BBC News Summary, in Kaggle [Online]. [accessed May 20, 2020]. Available: https://www.kaggle.com/pariza/bbc-news-summary.

Muditha Tissera

Having played a lead development role in the software development industry for more than 15 years, she changed her career into academia in 2013. She received her PhD in Computing (Natural Language Processing) from University of Colombo, Sri Lanka in 2020. Her research interests include Computational linguistics, Automatic Knowledge Extraction, Text analytics, Semantic Web and Ontological modeling. https://orcid.org/0000-0002-3398-5438


Ruvan Weerasinghe

He received his PhD in Computing from the University of Cardiff, UK and leads a research group in natural language processing at the University of Colombo School of Computing, Sri Lanka. His research interests include data-driven language processing, computational biology and more recently, complex adaptive systems. https://orcid.org/0000-0002-1392-7791



Keywords: Automatic Knowledge Extraction, Relation extraction, Natural Language Processing, Semantic Web, Triples Extraction

I. INTRODUCTION

A. Problem Formation

Open domain, also known as “domain-independent” or “unconstrained-domain,” refers to unstructured text from news articles, magazines, the World Wide Web (WWW), email text, blogs, and social media comments, where the content is not limited to a single domain. These are vast information sources among the various types of information generators available today. The knowledge/information facts embedded in these sources are presented using natural language text, which is unstructured and mostly in heterogeneous formats; thus, only humans can read and understand them. However, humans have limited cognitive processing power, and this never-ending information generation leads to the problem of information overload. Hence, these knowledge sources are not effectively used.

B. Proposed Solution

The main objective of this study is to automatically extract surface knowledge from open-domain news sources and convert it into structured formats so that it can be interpreted by machines. We propose an approach based on the grammatical structure of a sentence to extract triples with a high meaningful knowledge extraction rate. The extracted surface knowledge, in terms of triples, was validated using an inter-rater agreement validation method, which has high reliability.

II. RELATED WORK

Some researchers have already attempted to solve the aforementioned problem by automatically extracting knowledge (AKE) from unstructured text and organizing it in structured knowledge bases that allow machines to reason over knowledge in a useful way. These attempts include extracting different semantic components such as keywords, key phrases, entities, and relations.

The work “Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information” introduces a novel algorithm for keyword extraction [1]. This study is unique because it does not require a corpus of similar documents. Another interesting domain-dependent keyword extraction approach, presented in [2], extracts biological information from full-text scientific articles rather than limiting the effort to the content of the abstract, as in many other similar approaches. Beliga et al. surveyed the use of graph-based methods for keyword extraction and highlighted that graph-based methods are better than supervised and unsupervised methods in terms of complexity and computational resources [3]. Testing the hypothesis, “keywords are more likely to be found among influential nodes of a graph of words rather than among its nodes high on eigenvector-related centrality measure,” Tixier, Malliaros, and Vazirgiannis attempted to extract keywords from documents [4]. Hasan, Sanyal, Chaki, and Ali [5] conducted an empirical study of important keyword extraction techniques and concluded that the Support Vector Machine (SVM) and Conditional Random Field (CRF) methods yielded better results. Bharti et al. [6] conducted a survey of automatic keyword extraction used for text summarization and introduced a hybrid extraction technique, which is a codependent algorithm for keyword extraction and text summarization.

Several studies have focused on automatic keyphrase extraction. Liu et al. [7] proposed an unsupervised approach for keyphrase extraction that uses a clustering mechanism to find exemplar terms and then uses the extracted exemplar terms to find keyphrases. Ouyang, Li, and Zhang [8] participated in Task 5 of SemEval-2, titled “Automatic Keyphrase Extraction from Scientific Articles.” In their approach, they identified the core words of an article as the most essential words in the document and expanded them into proper keyphrases using a “word expansion” approach. The competition and its results are detailed in [9]. Key2Vec is a phrase embedding method that is used for ranking keyphrases extracted from scientific articles. According to the experimental results, the proposed Key2Vec technique produced state-of-the-art results on benchmark datasets [10]. Rabby et al. [11] proposed a rooted tree-based, domain-independent, automatic keyphrase extraction method based on nominal statistical knowledge. Typically, unsupervised systems have poor accuracy and require a large corpus. To address these drawbacks, Bennani-Smires et al. used a novel unsupervised method called “EmbedRank,” which leverages sentence embeddings to extract keyphrases from a single document [12].

Named Entity Recognition (NER) is another important task in knowledge extraction research. KNOWITALL [13] extracted a large collection of named entities from web data. KNOWITALL [14] was enhanced to improve the recall and extraction rates. Ritter et al. [15] stated that the performance of standard NLP tools is severely degraded in tweets. Therefore, they proposed a mechanism to rebuild the NLP pipeline, starting with POS tagging, chunking, and NER using a distantly supervised approach.

Semantic binary relation extraction is another area of AKE. Mintz et al. [16] conducted useful research based on the concept of “distant supervision.” Their methodology does not require labeled corpora; hence, no domain dependency exists. Instead, Freebase was used for distant supervision. Nguyen and Verspoor investigated the performance after integrating character-based word representations into a standard Convolutional Neural Network (CNN) based relation extraction model [17]. SemEval-2018 Task 7 was designed to identify and classify instances of semantic relations between concepts in a set of six discrete categories. An analysis of the approaches of the 32 participants revealed that the most popular methods include CNN and Long Short Term Memory (LSTM) networks, with word embedding-based features calculated based on domain-specific corpora [18]. Triple extraction can be considered a form of relation extraction in the knowledge extraction domain. A survey on relation extraction revealed that, among all supervised approaches, including feature-based and kernel-based ones, syntactic tree kernel-based techniques were the most effective [19].

Another popular topic is taxonomy extraction, which focuses on extracting hypernym-hyponym relationships from text. SemEval-2016 Task 13 was designed for taxonomy extraction from a specified multi-domain term list [20]. Maitra and Das came in fourth place in the monolingual competition (English language) and second place in the multilingual competition (Dutch, Italian, and French languages) [21]. They used an unsupervised approach with two modules to produce possible hypernym candidates and then merged the results. Panchenko et al. [22] secured first place in the same competition. Substring matching and lexico-syntactic pattern matching were used in their methodology to develop their product called TAXI.

The target domain used in AKE significantly influences the choice of methods and techniques and the expected level of accuracy. Therefore, knowledge extraction can be classified as domain-dependent or domain-independent; the latter, also known as open domain, is nontrivial and can be extended to massive web data. The unsupervised approach, TEXTRUNNER, is an Open Information Extraction (OIE) system that extracts information from the unconstrained Web [23]. TEXTRUNNER had a 33% lower error rate than KNOWITALL did. The Wikipedia-based Open Extractor (WOE) [24] is an extension of TEXTRUNNER with improved precision and recall. Etzioni et al. attempted to scale knowledge extraction to a sizable and heterogeneous web corpus [25]. The authors highlighted that their work could be dubbed the second generation of OIE because the novel model doubled the precision and recall compared with previous works such as TEXTRUNNER and WOE. The hypothesis “Shallow syntactic knowledge and its implied semantics can be easily acquired and can be used in many areas of a question-answering system” was proved in research on AKE from documents [26]. This approach was implemented in the IBM Watson DeepQA system, and its overall accuracy improved by 2.4%. Soderland et al. mapped domain-independent open information extractions (tuples) into ontologies using domains from the DARPA Machine Reading Project [27].

Never-Ending Language Learning (NELL) is a learning agent whose task is to learn to read the web all the time. This is an alternative paradigm for machine learning that accurately models human learning. NELL is successfully learning to improve its reading competence over time and plans to expand its knowledge base of world beliefs [28].

Although recent literature has shown many advancements in the AKE domain, it suffers from low precision and semantic drift in the evaluation of results.

III. METHODOLOGY

According to David Bennet and Alex Bennet, there are three levels of knowledge: deep, shallow, and surface [29]. Surface knowledge is explicit knowledge that requires minimal meaning, without the context. Our attempt was made toward extracting surface knowledge from unstructured texts in Sri Lankan news. Fig. 1 depicts a high-level design diagram of the proposed approach.

Figure 1. High-level Design Diagram-Automatic Surface Knowledge Extraction.

As depicted in the diagram, several subtasks were performed with the aim of extracting surface knowledge in the form of triples. Our primary data source, the Sri Lankan news corpus, consists of various types of news articles in text format, from different domains such as business and finance, politics, education, and breaking news. It contains 5,409 plain text files with approximately 116,839 sentences. There were 2,524,876 words in the corpus.

First, the news articles in text format were pre-processed and tokenized into sentences. However, not all of these sentences are eligible for knowledge extraction. Our in-depth investigations helped us to identify and foresee problematic sentences. Hence, we ignored such sentences during pre-processing and set them aside for handling in future work. The ignored sentences are: 1) sentences outside the effective sentence length range, which we set to between five and 65 words; 2) sentences with more than two commas; 3) sentences containing quotes; 4) sentences containing a question; 5) sentences in which a semi-colon occurs more than once; and 6) sentences that start with special characters (these are mostly metadata in news).
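The six ignore criteria can be expressed as a single filter function, sketched below. Several details are assumptions made for illustration: words are counted by whitespace splitting, only double quotation marks count as quotes (so that possessive apostrophes do not disqualify a sentence), a question is detected by a question mark, and a "special character" is any leading character that is not a letter or digit. The function name is_eligible is hypothetical.

```python
MIN_WORDS, MAX_WORDS = 5, 65                 # length range stated in the text
DOUBLE_QUOTES = ('"', "\u201c", "\u201d")    # assumption: only double quotes count


def is_eligible(sentence: str) -> bool:
    """Apply the six ignore criteria; True means the sentence is kept."""
    text = sentence.strip()
    if not text:
        return False
    words = text.split()
    if not MIN_WORDS <= len(words) <= MAX_WORDS:   # 1) outside the length range
        return False
    if text.count(",") > 2:                        # 2) more than two commas
        return False
    if any(q in text for q in DOUBLE_QUOTES):      # 3) contains quotes
        return False
    if "?" in text:                                # 4) contains a question
        return False
    if text.count(";") > 1:                        # 5) semi-colon more than once
        return False
    if not text[0].isalnum():                      # 6) starts with a special character
        return False
    return True


kept = is_eligible("Temporary workers are paid their wages on weekly basis.")
dropped = is_eligible("Who paid the temporary workers their wages?")
```

Keeping every criterion as a separate early return makes it straightforward to log which rule rejected a sentence, which could help when revisiting the ignored sentences later.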

Tokenized sentences were parsed using a dependency parser to obtain the grammatical structure of the sentence. The grammatical structure is organized as an array of tokens (the words of the sentence in left-to-right order) with details such as text/token (the original token text), dependency tag (the syntactic relation connecting child to head), head text (the original text of the token head), head part-of-speech (POS) (the POS tag of the token head), and children (the immediate syntactic dependents of the token). Fig. 2 shows an example of the grammatical structure of the sentence: “In fact, for political benefits, many political leaders brought communal politics to the fore.”

Figure 2. Grammatical structure.

The triple extraction algorithm, in which the discovered validation rules were implemented, navigated through the sentence’s grammatical structure and extracted knowledge facts in the form of a triple. Dependency parser annotations such as ‘nsubj’, ‘dobj’, and ‘ROOT’ are standard tags used by most dependency parsers. The spaCy syntactic dependency parser was chosen for dependency parsing. All coding was performed using Python version 3.6.5.

A. Triple as a structured semantic pattern

The surface knowledge extracted from open domain news should be stored in a semantic pattern that represents structured knowledge. In our study, the triple, which is in the form of “subject | predicate | object,” was chosen as the best template for this purpose. The following example shows a lengthy unstructured sentence and its extracted knowledge in the form of a triple. (Triple components are separated by the pipe symbol.)

Sentence: “Investigations into the Thalawa State Bank robbery were handed over to the Criminal Investigation Department (CID) on IGP Pujith Jayasundara's orders, Police Spokesman Ruwan Gunasekara.”

Triple: Investigations | handed over to | Criminal Investigation Department

B. Parsing the Rule Validation Layer

A set of rules for accurately extracting the components of the triple was discovered by comprehensively analyzing the sentence’s grammatical structure. These rules were implemented as a rule validation layer in the surface knowledge extraction process. During the execution of this algorithm, every sentence is passed through the rule validation layer, and the triple components are extracted. The remainder of this section introduces the discovered rules and their implementations using pseudocode.

C. Rules to Extract Basic Level Triple

Rule 01:

“In basic level triple extraction, ROOT token should be assigned as the Predicate component of the triple. Then, a token with the dependency tag ‘nsubj’ or ‘nsubjpass’ for the subject component and ‘dobj’ or ‘pobj’ or ‘attr’ token as the object component of the triple should be assigned depending on their availability in the grammatical structure.”

The pseudocode shown in Fig. 3 illustrates the logic implemented for the basic-level triple extraction rule (Rule 01) using the sentence’s grammatical structure.

Figure 3. Pseudocode for Rule 01 and Rule 02 implementations.

News sentences are lengthy and complex. They can be complex, compound, direct speech, indirect speech, active voice, passive voice, and so on. When there are multiple clauses in a sentence, there is a high chance of inaccurate triple extraction because of the complexity. For example, “Citizen and the Constitution Under the Constitution any citizen who is qualified to be elected to the office of President may be nominated by a recognized political party (Article 31) elected by the people (Art (30-2) and hold office for a period of six years.”

As the above example shows, there could be multiple tokens with the dependency tag ‘nsubj’, ‘dobj’, ‘pobj’, or ‘attr’ that are capable of being selected as the subject/object component of the triple. This is mainly because of multiple clauses within a single sentence. No matter how complex the sentence is, there should be only one ‘ROOT’ token in it. Identifying the most appropriate token for the subject component and the most appropriate token for the object component (out of multiple candidate tokens) is vital to form a meaningful triple. For that purpose, a new rule was discovered, which is referred to as Rule 02 or the “Main rule.”

Rule 02 (Main rule):

“Any token utilized in the triple should be directly or indirectly (via another token) linked with the ROOT token.”

Although there may be multiple candidate tokens for the subject and object components of the triple, the aforementioned main rule will be satisfied by only one candidate token. Therefore, using this rule, the most appropriate candidate token can be selected and a meaningful link can be created. To implement this main rule, the ROOT token should first be identified and assigned to the predicate part of the triple. When extracting other components later, the following condition should be checked to make the appropriate link to the ROOT/Predicate.

token.head.text = Predicate and token.head.POS = ‘VERB’ (Where the value for “Predicate” has already been selected.)

Refer to line 5 or line 8 of the pseudocode shown in Fig. 3 for the implementation details. The pseudocode in Fig. 3 implements regular rules that can be used to extract three components of the triple at its basic level. Although such basic-level extraction would result in a triple, it may not have adequate details to be meaningful. Therefore, enhancements are essential for making the triple more meaningful. Furthermore, in all such improvements, the above-mentioned main rule must be adhered to.
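The interplay of Rule 01 and Rule 02 can be sketched in Python as follows. This is an illustration under stated assumptions, not the pseudocode of Fig. 3: `Tok` is a hypothetical stand-in for a parser token (it is not a spaCy object), the parse of the shortened example sentence is hand-annotated, "indirectly linked" is read as "a prepositional object whose preposition attaches to the ROOT", and the head POS check from the condition above is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Tok:
    """Hypothetical stand-in for one parser token (not a spaCy object)."""
    text: str
    dep: str   # dependency tag, e.g. "nsubj", "pobj", "ROOT"
    head: int  # index of the head token; the ROOT points to itself


SUBJECT_TAGS = {"nsubj", "nsubjpass"}
OBJECT_TAGS = {"dobj", "pobj", "attr"}


def linked_to_root(sent: List[Tok], i: int, root: int) -> bool:
    """Rule 02 as read here: a direct child of ROOT, or a pobj whose preposition is."""
    head = sent[i].head
    if head == root:
        return True
    return sent[i].dep == "pobj" and sent[head].head == root


def basic_triple(sent: List[Tok]) -> Tuple[Optional[str], str, Optional[str]]:
    """Rule 01: ROOT becomes the predicate; pick the linked subject and object."""
    root = next(i for i, t in enumerate(sent) if t.dep == "ROOT")
    subject = obj = None
    for i, tok in enumerate(sent):
        if i == root or not linked_to_root(sent, i, root):
            continue  # candidates from other clauses are rejected by Rule 02
        if subject is None and tok.dep in SUBJECT_TAGS:
            subject = tok.text
        elif obj is None and tok.dep in OBJECT_TAGS:
            obj = tok.text
    return subject, sent[root].text, obj


# Hand-annotated parse of "dolphins were killed by fishermen who were arrested by police"
sent = [
    Tok("dolphins", "nsubjpass", 2), Tok("were", "auxpass", 2),
    Tok("killed", "ROOT", 2), Tok("by", "agent", 2),
    Tok("fishermen", "pobj", 3), Tok("who", "nsubjpass", 7),
    Tok("were", "auxpass", 7), Tok("arrested", "relcl", 4),
    Tok("by", "agent", 7), Tok("police", "pobj", 8),
]
triple = basic_triple(sent)
print(" | ".join(str(part) for part in triple))
```

Here "who" and "police" are also subject and object candidates by tag, but they belong to the relative clause and are not linked to the ROOT in the sense above, so only "dolphins" and "fishermen" are selected.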

D. Advanced Rules to Enhance the Basic Level Triple

Basic-level triple extraction may result in an acceptable meaning for a triple. Affixing appropriate terms to the triple components yields an enhanced meaning. Although our target was to extract surface knowledge, our ultimate goal was to extract triples with the best possible meaning. However, keeping the number of added terms to a minimum is crucial but challenging.

For example, “The Ceylon Independence Act was passed by the British Parliament.”

Basic level triple: Act | passed | parliament

Enhanced triple: Independence Act | passed by | British parliament

1) Appending Preposition as a Suffix to Predicate

Rule 03:

“If the prepositional object (‘pobj’) has contributed to form the triple object, its relevant preposition should be appended to the predicate as a suffix.”

For example,

Triple before: Decision | taken | meeting

Triple after: Decision | taken at | meeting

In the abovementioned example, the word ‘at’ is the preposition and ‘meeting’ is the prepositional object assigned as the object component of the triple. With the token ‘at’ added as a suffix to the predicate, the expanded triple yields a better meaning. However, one exception was observed when there were words between the ROOT and the preposition, as shown below.

For example, “He contended the construction of the memorial and museum.”

Triple before: He | contended of | memorial

As you can see in the above example, the output triple may not be meaningful without the middle word “construction.” Therefore, the expected triple can be expressed as follows:

Triple after: He | contended construction of | memorial

As a solution to this drawback, relevant words between the selected predicate and the preposition were identified by traversing the grammatical structure in the backward direction. Refer to the pseudocode in Fig. 4 for the implementation of Rule 03.

Figure 4. Pseudocode for Rule 03 implementation.
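A minimal sketch of Rule 03, including the exception for words between the ROOT and the preposition, is given below. It is not the pseudocode of Fig. 4: `Tok` is a hypothetical stand-in for a parser token, both parses are hand-annotated, and "traversing backward" is interpreted here as following head links from the preposition up to the ROOT.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for one parser token (not a spaCy object)."""
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token; the ROOT points to itself


def predicate_with_preposition(sent: List[Tok], root: int, obj: int) -> str:
    """Rule 03: if the object is a 'pobj', append its preposition to the predicate."""
    if sent[obj].dep != "pobj":
        return sent[root].text
    prep = sent[obj].head          # the preposition governing the object
    middle = []                    # words between the ROOT and the preposition
    i = sent[prep].head
    for _ in range(len(sent)):     # bounded walk up the head links
        if i == root:
            break
        middle.append(sent[i].text)
        i = sent[i].head
    return " ".join([sent[root].text] + middle[::-1] + [sent[prep].text])


# Hand-annotated parse of "Decision was taken at the meeting"
s1 = [Tok("Decision", "nsubjpass", 2), Tok("was", "auxpass", 2),
      Tok("taken", "ROOT", 2), Tok("at", "prep", 2),
      Tok("the", "det", 5), Tok("meeting", "pobj", 3)]

# Hand-annotated parse of "He contended the construction of the memorial"
s2 = [Tok("He", "nsubj", 1), Tok("contended", "ROOT", 1),
      Tok("the", "det", 3), Tok("construction", "dobj", 1),
      Tok("of", "prep", 3), Tok("the", "det", 6), Tok("memorial", "pobj", 4)]

print(predicate_with_preposition(s1, root=2, obj=5))
print(predicate_with_preposition(s2, root=1, obj=6))
```

In the second sentence the preposition attaches to "construction" rather than to the ROOT, so that intermediate word is kept in the predicate.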

2) Appending Particles as Suffixes to Predicate when Phrasal Verbs Exist

Rule 04:

“When a sentence has a phrasal verb such as ‘call off’, ‘rule out’ etc., the grammatical structure denotes the first word of the phrase as the ROOT and other words as particles. In such sentences, when the predicate is formed, particles of the ‘ROOT’ should also be appended to the predicate as a suffix to get the proper meaning.”

For example, “investigations into the Thalawa state bank robbery had been handed over to the Criminal Investigation Department, Police Spokesperson Ruwan Gunasekara.”

Triple before: investigations | handed to | Investigation Department

Triple after: investigations | handed over to | Investigation Department

In the above example, ‘handed over’ is a phrasal verb. The grammatical structure denotes the token ‘handed’ as the ROOT, and it is selected as the predicate. Additionally, the token ‘over’ is denoted as a particle and selected as the suffix for the predicate. Refer to the pseudocode in Fig. 5 for the implementation of Rule 04.

Figure 5. Pseudocode for Rule 04 implementation.
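Rule 04 can be sketched in a few lines. The example below is illustrative only: `Tok` is a hypothetical stand-in for a parser token, the parse of the shortened sentence is hand-annotated, and the particle tag is assumed to be ‘prt’.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for one parser token (not a spaCy object)."""
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token; the ROOT points to itself


def predicate_with_particles(sent: List[Tok], root: int) -> str:
    """Rule 04: append particles attached to the ROOT to the predicate."""
    particles = [t.text for t in sent if t.dep == "prt" and t.head == root]
    return " ".join([sent[root].text] + particles)


# Hand-annotated parse of "investigations had been handed over to the Department"
sent = [Tok("investigations", "nsubjpass", 3), Tok("had", "aux", 3),
        Tok("been", "auxpass", 3), Tok("handed", "ROOT", 3),
        Tok("over", "prt", 3), Tok("to", "prep", 3),
        Tok("the", "det", 7), Tok("Department", "pobj", 5)]

print(predicate_with_particles(sent, root=3))
```

Appending the preposition "to" afterwards, as described in Rule 03, would then yield the predicate "handed over to" shown in the example above.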

3) Appending compound words and modifiers as prefixes to subject and object

The tokens selected as subject and object components of the basic-level triple may not be meaningful. Therefore, appropriate prefixes can be combined to render the triple components more meaningful.

Rule 05:

“If Subject and/or Object components of the basic level triple consist of a modifier or a compound word prior to them in the original sentence, such tokens should be added as prefixes to make Subject and Object components more meaningful.”

For example, “Most countries had achieved significant gains in reducing hunger in the last 25 years, progress in the majority of countries affected by conflict had stagnated or deteriorated.”

Triple before: countries | achieved | gains

Triple after: most countries | achieved | significant gains

Once the basic-level triple components have been identified, such prefixes are sought by re-traversing the same grammatical structure. Refer to the pseudocode in Fig. 6 for the implementation of Rule 05.

Figure 6. Pseudocode for Rule 05 implementation.
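The prefix search of Rule 05 might look as follows. This is a sketch, not the pseudocode of Fig. 6: `Tok` is a hypothetical stand-in for a parser token, the parse is hand-annotated, and the choice of ‘compound’ and ‘amod’ as the prefix tags is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for one parser token (not a spaCy object)."""
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token; the ROOT points to itself


PREFIX_TAGS = {"compound", "amod"}  # assumption: tags treated as prefixes


def with_prefixes(sent: List[Tok], idx: int) -> str:
    """Rule 05: prepend adjacent modifiers/compounds that depend on the token."""
    start = idx
    while (start > 0
           and sent[start - 1].dep in PREFIX_TAGS
           and sent[start - 1].head == idx):
        start -= 1
    return " ".join(t.text for t in sent[start:idx + 1])


# Hand-annotated parse of "Most countries had achieved significant gains"
sent = [Tok("Most", "amod", 1), Tok("countries", "nsubj", 3),
        Tok("had", "aux", 3), Tok("achieved", "ROOT", 3),
        Tok("significant", "amod", 5), Tok("gains", "dobj", 3)]

print(with_prefixes(sent, 1), "|", sent[3].text, "|", with_prefixes(sent, 5))
```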

4) When the Word “of” Comes after the Subject or Object token

As in Rule 05, appropriate tokens can be combined as suffixes for the subject and object components of the triple. This becomes mandatory when the word “of” comes after the tokens that have been selected as the subject and object components of the triple. When phrases such as “lot of people”, “percentage of students”, and “crisis of confidence” represent the subject or object, a triple that contains only the first word of the phrase becomes meaningless.

Rule 06:

“If the preposition ‘of’ is found right after the Subject/Object, it cannot stand meaningfully without the relevant prepositional object. In such situations, the preposition and prepositional object should be appended to the subject/object words as suffixes.”

Example 1: “Lot of teachers must be taught these important characteristics”

Triple before: Lot | taught | important characteristics

Triple after: Lot of teachers | taught | important characteristics

Example 2: “This way we can increase the percentage of University entrants to over 50%, the amount now between 15 and 16%.”

Triple before: we | increase | percentage

Triple after: we | increase | percentage of entrants

Refer to pseudocode in Fig. 7 for the Rule 06 implementation.

Figure 7. Pseudocode for Rule 06 implementation.
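Rule 06 can be illustrated with the shortened form of Example 2. As before, this is a sketch under assumptions rather than the pseudocode of Fig. 7: `Tok` is a hypothetical stand-in for a parser token, and the parse is hand-annotated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for one parser token (not a spaCy object)."""
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token; the ROOT points to itself


def with_of_suffix(sent: List[Tok], idx: int) -> str:
    """Rule 06: if 'of' directly follows the token, append 'of' and its pobj."""
    nxt = idx + 1
    if (nxt < len(sent)
            and sent[nxt].text.lower() == "of"
            and sent[nxt].head == idx):
        for tok in sent[nxt + 1:]:
            if tok.dep == "pobj" and tok.head == nxt:
                return f"{sent[idx].text} of {tok.text}"
    return sent[idx].text


# Hand-annotated parse of "we can increase the percentage of University entrants"
sent = [Tok("we", "nsubj", 2), Tok("can", "aux", 2),
        Tok("increase", "ROOT", 2), Tok("the", "det", 4),
        Tok("percentage", "dobj", 2), Tok("of", "prep", 4),
        Tok("University", "compound", 7), Tok("entrants", "pobj", 5)]

print(with_of_suffix(sent, 0), "|", sent[2].text, "|", with_of_suffix(sent, 4))
```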

E. Rule to Replace Predicate with Open Clausal Complement

In some sentences, the main verb that holds the ‘ROOT’ dependency tag comes with an open clausal complement. For example, “The Central Bank is hoping to build its foreign reserves of up to US$ 10 billion this year.”

In this example, “build” is an open clausal complement token. If the predicate is formed based on the ‘ROOT’ token, the following would be the output triple.

Central Bank | hoping | foreign reserves

However, the above triple does not provide the expected meaning. Therefore, Rule 07 has been introduced to address this issue.

Rule 07:

“When a token with an open clausal complement dependency tag comes after the ROOT token, which has already been selected as the predicate, the open clausal complement token should be used as the predicate of the triple instead of the ROOT token of the grammatical structure.”

Triple after: Central Bank | build | foreign reserves

Refer to pseudocode in Fig. 8 for the Rule 07 implementation.

Figure 8. Pseudocode for Rule 07 implementation.
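The predicate replacement of Rule 07 can be sketched as follows. This is not the pseudocode of Fig. 8: `Tok` is a hypothetical stand-in for a parser token, the parse of the shortened example is hand-annotated, and the open clausal complement tag is assumed to be ‘xcomp’.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for one parser token (not a spaCy object)."""
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token; the ROOT points to itself


def choose_predicate(sent: List[Tok], root: int) -> int:
    """Rule 07: prefer an 'xcomp' child that follows the ROOT as the predicate."""
    for i in range(root + 1, len(sent)):
        if sent[i].dep == "xcomp" and sent[i].head == root:
            return i
    return root


# Hand-annotated parse of "The Central Bank is hoping to build its foreign reserves"
sent = [Tok("The", "det", 2), Tok("Central", "compound", 2),
        Tok("Bank", "nsubj", 4), Tok("is", "aux", 4),
        Tok("hoping", "ROOT", 4), Tok("to", "aux", 6),
        Tok("build", "xcomp", 4), Tok("its", "poss", 9),
        Tok("foreign", "amod", 9), Tok("reserves", "dobj", 6)]

predicate = sent[choose_predicate(sent, root=4)].text
print(predicate)
```

Note that in this parse the object "reserves" attaches to "build", so after the replacement the object would be looked up relative to the new predicate rather than the original ROOT.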

F. Handling complex sentences

If a sentence is complex, very often it has linking words. Therefore, triple extraction from such sentences should be handled differently.

For example, “The government was unable to significantly reduce the cost of living with subsidies due to the necessary fiscal and monetary discipline to face the upcoming debt repayment cycle, although politicians continued their hedonistic lifestyles.”

In such complex sentences, “Discourse markers” or “linking phrases/words” are commonly used (For example, although, however, since, assuming that, and in fact).

Rule 08:

“When a sentence is a complex type that has multiple clauses linked to each other using discourse markers, split the sentence into separate segments based on the discourse markers and then triple extraction should be carried out from each segment separately.”

Discourse markers can be identified in the grammatical structure using the tokens with their dependency tag set as “mark.” According to Rule 08, the given example results in two triples extracted from two sentence segments/clauses, as follows:

Segment 1: “The government was unable to significantly reduce the cost of living with subsidies due to the necessary fiscal and monetary discipline to face the upcoming debt repayment cycle.”

Token with a discourse mark tag: “although”

Segment 2: “politicians continued their hedonistic lifestyles.”

Triples generated:

Government | cut down | cost of living

politicians | continued | hedonistic lifestyles
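The segmentation step of Rule 08 can be sketched as a split on tokens tagged ‘mark’. The snippet below is an illustration only: it works on hand-annotated (text, tag) pairs for a shortened, hypothetical version of the example sentence, and dropping punctuation tokens is an added assumption.

```python
from typing import List, Tuple


def split_on_markers(tagged: List[Tuple[str, str]]) -> List[str]:
    """Rule 08: split a tagged sentence into segments at discourse markers."""
    segments, current = [], []
    for text, dep in tagged:
        if dep == "mark":          # discourse marker closes the current segment
            if current:
                segments.append(current)
            current = []
        elif dep != "punct":       # assumption: punctuation is dropped
            current.append(text)
    if current:
        segments.append(current)
    return [" ".join(words) for words in segments]


# Hand-annotated tags for a shortened, hypothetical version of the example
tagged = [("The", "det"), ("government", "nsubj"), ("reduced", "ROOT"),
          ("costs", "dobj"), (",", "punct"), ("although", "mark"),
          ("politicians", "nsubj"), ("continued", "advcl"),
          ("their", "poss"), ("lifestyles", "dobj"), (".", "punct")]

for segment in split_on_markers(tagged):
    print(segment)
```

Each resulting segment would then be parsed again and passed through the basic and advanced rules on its own.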

G. Handling Reported Speech with ‘Said’ Word

In news articles, reported speech is a frequently used writing style. Because our research uses news as its data source, reported speech handling cannot be avoided. In sentences that contain the word “said,” the grammatical structure identifies “said” as the ‘ROOT’ token. Therefore, the output triple is meaningless. Refer to the following two examples.

Example 1: “Minister Athukorala said that the MoU indicates both nations' keenness to bring transparency in all stages of recruitment and employment.”

Extracted triple: Minister Athukorala | said | MoU

Example 2: “the Sri Lanka Freedom Party (SLFP) members of the Unity Government would continue to remain in it, backing SLFP Chairman and President Maithripala Sirisena, Minister S.B. Dissanayake today said.”

Extracted triple: Sri Lanka Freedom Party | said |

To resolve this anomaly, Rule 09 has been implemented.

Rule 09:

“If the word “said” is found in the sentence (reported speech), separate the sentence into two clauses: the left side and the right side of the word “said.” Then identify the reported clause from these two clauses and extract the triple from that reported clause. Ignore the reporting clause.”

In reported speech sentences, the reported clause includes what the original speaker said. Therefore, the reported clause is more important for knowledge extraction. The automatic classification of reported clauses and reporting clauses is challenging. Although its success rate is not high, a simple logic has been identified to perform this task. It was observed from many examples that the reported clause is found in the right-side segment (as in Example 1 above) unless the left-side segment is much longer than the right-side segment and a comma “,” is found in the left-side segment. The implementation of this logic is depicted as pseudocode in Fig. 9.

Figure 9. Pseudocode for Rule 09 implementation.
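The side-selection heuristic of Rule 09 can be sketched as below. This is not the pseudocode of Fig. 9: the function name, the whitespace tokenization, the two hypothetical example sentences, and in particular the `ratio` threshold that stands in for "much longer" are all assumptions made for illustration.

```python
from typing import Optional


def reported_clause(sentence: str, ratio: float = 2.0) -> Optional[str]:
    """Rule 09 heuristic: pick the side of 'said' that holds the reported clause.

    `ratio` is a hypothetical threshold for "left side much longer than right".
    """
    words = sentence.rstrip(".").split()
    lowered = [w.lower().strip(",") for w in words]
    if "said" not in lowered:
        return None
    k = lowered.index("said")
    left, right = words[:k], words[k + 1:]
    left_has_comma = any("," in w for w in left)
    # Default: reported clause on the right, unless the left side is much
    # longer than the right side and contains a comma.
    if len(left) > ratio * len(right) and left_has_comma:
        return " ".join(left)
    return " ".join(right)


# Hypothetical sentences written for this illustration
s1 = "The minister said that the agreement improves transparency."
s2 = "The members would remain in the government, backing the president, the minister said today."

print(reported_clause(s1))
print(reported_clause(s2))
```

As the second call shows, the selected left side still carries the reporting phrase at its end, which is one reason this logic cannot be expected to succeed in every case.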

H. Handling Passive voice

The grammatical structures of active and passive voice sentences are slightly different from each other. The two main differences include the following: 1) Active voice sentences have tokens of dependency tag ‘nsubj’ for nouns that represent the subject. However, passive voice sentences have tokens of dependency tag ‘nsubjpass’ for nouns that represent their nominal subject, 2) Passive voice sentences mostly have a token with a dependency tag ‘agent’, immediately after the preposition ‘by’. However, with minor modifications to the algorithm, passive-voice sentences can also be processed using the same triple-extraction algorithm.

Rule 10:

“When the Subject component of the triple is selected from the grammatical structure of the sentences or in other words, wherever the ‘nsubj’ dependency tag is searched, ‘nsubjpass’ dependency tag should also be considered for passive voice.”

The implementation of Rule 10 is represented in the pseudocode shown in Fig. 10 by changing lines 3, 4, and 5 of the basic triple extraction pseudocode depicted in Fig. 3.

For example, “When investigated it was found that 12 dolphins were killed by nine fishermen who were arrested by the Trincomalee Crime Prevention Unit.”

Extracted triple: dolphins | killed by | fishermen

Even though our regular triple extraction algorithm is suitable for passive voice sentences with the minor modifications described above, one anomaly was observed in the predicate component of some triples, as shown below.

For example, “Temporary workers are paid their wages on weekly basis.”

Extracted triple: Temporary workers | paid | wages

In the example above, the passive sense of the triple is lost. If the agent is unknown (done by whom) and past participle verbs are the same as past tense verbs (for example, paid, made, and read), the passive voice triple gives incorrect meaning. Rule 11 was introduced to address this issue.

Rule 11:

“If the sentence is passive voice and the verb has the same token for its past tense and past participle tense, then predicate should be formed with the ROOT token and its auxiliary verb prior to it.”

Extracted triple after rule implementation:

Temporary workers | are paid | wages
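Rule 11 can be sketched as follows. This is an illustration under assumptions: `Tok` is a hypothetical stand-in for a parser token, the parse is hand-annotated, passive voice is detected through an ‘nsubjpass’ child of the ROOT, and the tiny verb set only contains the three verbs mentioned above rather than a real lexicon.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for one parser token (not a spaCy object)."""
    text: str
    dep: str   # dependency tag
    head: int  # index of the head token; the ROOT points to itself


# Illustrative only: verbs whose past tense and past participle coincide
SAME_PAST_AND_PARTICIPLE = {"paid", "made", "read"}


def passive_predicate(sent: List[Tok], root: int) -> str:
    """Rule 11: keep the passive auxiliary when the verb form is ambiguous."""
    verb = sent[root].text
    is_passive = any(t.dep == "nsubjpass" and t.head == root for t in sent)
    if is_passive and verb.lower() in SAME_PAST_AND_PARTICIPLE:
        aux = [t.text for t in sent[:root]
               if t.dep == "auxpass" and t.head == root]
        return " ".join(aux + [verb])
    return verb


# Hand-annotated parse of "Temporary workers are paid their wages"
sent = [Tok("Temporary", "amod", 1), Tok("workers", "nsubjpass", 3),
        Tok("are", "auxpass", 3), Tok("paid", "ROOT", 3),
        Tok("their", "poss", 5), Tok("wages", "dobj", 3)]

print(passive_predicate(sent, root=3))
```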

IV. RESULTS AND DISCUSSION

A. Experimental Study Over Sri Lankan Open Domain News Corpus

The proposed approach was successfully implemented and validated using the Sri Lankan English news corpus. The algorithmic details are presented as pseudocode in Figs. 3-10.

Figure 10. Pseudocode for Rule 10 implementation.

Output generation:

The output triples were written into a CSV file with the pipe character used as the delimiter between the triple components. Some of the extracted triples are shown in Fig. 11. Table 1 contains some important statistical facts related to the extraction process. According to Table 1, every valid sentence used for extraction resulted in a well-formed triple (a value exists for all three components). This is likely because the sentences that were foreseen to be problematic were ignored before extraction.

Table 1. Statistical facts of surface knowledge extraction over the Sri Lankan news corpus.

| Fact | Figure |
|---|---|
| Number of documents in the corpus | 5,409 |
| Number of sentences after pre-processing | 116,839 |
| Number of sentences ignored | 62,638 |
| Number of sentences valid for extractions | 54,201 |
| Number of triples extracted | 54,201 |
| Number of distinct predicates extracted | 10,116 |
| Number of distinct subjects extracted | 6,736 |
| Number of distinct objects extracted | 7,937 |

Figure 11. Some extracted surface knowledge (Triples) from the Sri Lankan news.

Sentences ignored during the algorithm execution:

When a sentence is a command, the ROOT token is the first in the grammatical structure array. In such cases, the subject component of the triple becomes empty, which results in a malformed triple. Therefore, these sentences were ignored.

For example, “Shift the insurance liability toward manufacturers.”

B. Inter-rater-agreement-based Validation of Extracted Triples - Sri Lankan News

The purpose of this validation process was to evaluate how meaningful the output of the proposed surface knowledge extraction algorithm is. Accuracy was measured by conducting an inter-rater agreement test among four participants selected from our testing team, which consisted of academics and professionals with expertise in English and different professional backgrounds. For this validation process, four sample sets of 387 triples each were randomly chosen from the extracted triples using sampling without replacement. Altogether, 1,548 triples were selected and distributed among the four participants. This sample distribution ensured that every triple was verified by at least three participants, and the meaningfulness/correctness of each extracted triple was decided by a majority vote. The validation results of all four members were then combined and re-processed to determine the final accuracy rate. The results are presented in Table 2.

Because the generated output is large, performing a validation test with high variability and reliability is vital. Variability was achieved by randomly choosing the test triples from the entire corpus using sampling without replacement. Reliability was assessed using an inter-rater agreement mechanism. The results show that the surface knowledge extraction achieved a meaningful triple extraction rate of 83.5% (1,293 of 1,548 triples), which indicates that the algorithm yields largely accurate results.

Table 2. Inter-rater-agreement test results – Sri Lankan news corpus.

| Fact | Figure |
|---|---|
| Total number of triples in the samples | 1,548 |
| Number of triples that voted as meaningful | 1,293 |
| Number of triples that voted as meaningless | 255 |
| Meaningful triple extraction rate (as a percentage) | 83.5 |
| Error rate (as a percentage) | 16.5 |
| IRR score (as a percentage) | 76 |


However, this performance was not achieved in one attempt. Several improvement cycles were performed by experimenting with and discovering new rules one by one. After introducing the main rule, the accuracy rate increased from 60% to 83.5%. Therefore, Rule 02 (the main rule) contributes significantly to this improved accuracy, as it makes accurate linkages among tokens in the grammatical structure.

For the inter-rater agreement test, the level of agreement between raters, also known as Inter-Rater Reliability (IRR), was computed using the percent agreement method. The IRR score obtained was 76%, which ensured the high reliability of our testing team.
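The majority vote and the percent agreement computation can be sketched as follows. The paper does not state how percent agreement was computed for more than two raters, so the pairwise averaging used here is an assumption, and the `ratings` list is toy data rather than the study's data. The last lines only reproduce the arithmetic behind the rate reported in Table 2.

```python
from itertools import combinations
from typing import List


def is_meaningful(votes: List[int]) -> bool:
    """Majority vote over the raters' 0/1 judgments for one triple."""
    return sum(votes) > len(votes) / 2


def percent_agreement(ratings: List[List[int]]) -> float:
    """Share of agreeing rater pairs across all triples (assumed variant)."""
    agree = total = 0
    for votes in ratings:
        for a, b in combinations(votes, 2):
            agree += int(a == b)
            total += 1
    return 100.0 * agree / total


# Toy data: four triples, each judged by three raters (1 = meaningful)
ratings = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
meaningful = sum(is_meaningful(v) for v in ratings)
print(f"meaningful rate: {100.0 * meaningful / len(ratings):.1f}%")
print(f"percent agreement: {percent_agreement(ratings):.1f}%")

# Arithmetic behind the reported rate for the Sri Lankan sample (Table 2)
rate = 100.0 * 1293 / 1548
print(f"{rate:.1f}% meaningful, {100.0 - rate:.1f}% error")
```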

C. Extended Study Over BBC News Dataset

The same knowledge extraction algorithm was executed on the BBC news dataset, which contains 2,225 news articles and represents a good public open-domain corpus [30]. Table 3 presents some important statistics for this experiment.

Table 3. Statistical facts relevant to surface knowledge extraction over the BBC news dataset.

| Fact | Figure |
|---|---|
| Number of documents in the corpus | 2,225 |
| Number of sentences after pre-processing (sentence tokenization) | 41,983 |
| Number of sentences ignored | 23,799 |
| Number of sentences selected as valid for extractions | 18,184 |
| Number of well-formed triples extracted | 18,184 |
| Number of malformed triples extracted | 0 |


Some of the resulting triples are shown in Fig. 12. These extracted triples demonstrate the applicability of our triple-extraction algorithm to open-domain text beyond the Sri Lankan news corpus.

Figure 12. Some extracted surface knowledge/triples from the BBC news dataset.

D. Inter-rater-agreement-based Validation of Extracted Triples - BBC News

Validating these results is crucial for claiming effectiveness. Triples extracted from the BBC news corpus were also validated using the inter-rater-agreement method (in the same way as the triples of the Sri Lankan news corpus). However, the BBC news context may not be familiar to Sri Lankans. Therefore, to obtain a fair judgment in the validation process, an examiner panel was formed using four native British examiners. During this validation, 800 triples were selected and distributed among the four examiners of the evaluation team. This sample distribution also ensured that every triple was manually verified by at least three qualified examiners. Accordingly, at least two examiners had to agree for a triple to be judged meaningful/correct. The validation results of all four examiners were combined and re-processed to determine the final accuracy rate. The results are presented in Table 4.

Table 4. Inter-rater-agreement test results - BBC news corpus.

| Fact | Figure |
|---|---|
| Total number of triples in the samples | 800 |
| Number of triples that voted as meaningful | 741 |
| Number of triples that voted as meaningless | 59 |
| Meaningful triple extraction rate (as a percentage) | 92.6 |
| Error rate (as a percentage) | 7.4 |
| IRR score (as a percentage) | 90 |


The results show that the surface knowledge extraction achieved a 92.6% meaningful triple extraction rate (741 of 800 triples). In addition, the computed IRR score (using the percent agreement method) for this validation process was 90%, which indicates the high reliability of our validation team.

The BBC news corpus yielded better results than the Sri Lankan news corpus. This demonstrates the higher accuracy and generalized applicability of our surface knowledge extraction algorithm to any open-domain unstructured text.

E. Unhandled Exceptions

Even though the surface knowledge extraction accuracy is good, it is still below 100% because of some unhandled exceptions. The triples marked as ‘meaningless’ by the evaluators were mostly caused by the exceptional cases described below.

Unexpected behavior in grammatical structure:

Refer to the example sentence; “A sizeable share of the Sri Lankan population migrates from rural areas to cities.” The grammatical structure of the example sentence evidenced an unexpected behavior such that, when the word “migrates” is expected to be the ROOT, the word ‘share’ has been marked as the ROOT.

Sentences with “said” word as the ‘ROOT’:

This type of sentence was handled differently using our own logic (Rule 09) introduced to the extraction algorithm. However, some examples deviate from the assumption made in the relevant rule, and the extraction becomes incorrect.

Sentences with synonyms of “said” word as the ‘ROOT’:

Some sentences have synonyms of “said,” such as “stated” and “commented,” as the ‘ROOT’. However, the rule introduced for “said” cannot be applied to these sentences because their formats are different, which makes the same rule inappropriate.

Identifying the best term from a noun phrase that has “of” word:

In the example “Percentage of entrants,” the most appropriate noun for the triple component should be the word “entrants” (that is, prepositional object). However, with the example “Claims of abuse,” the most appropriate noun for the triple component should be the word “claims” (not the prepositional object). Therefore, the selection of the best word for the triple component from such noun phrases is nontrivial.

Even with the aforementioned unhandled exceptions, the accuracy of our surface knowledge extraction algorithm is remarkable. It is important to emphasize that this accuracy rate was achieved even without using the contextual information of the news articles as a whole.

F. Efficiency Analysis of the Surface Knowledge Extraction Algorithm

Our tokenized corpus contained 116,839 sentences. The surface knowledge extraction algorithm ran at an average speed of 0.02 seconds per sentence. Our statistics indicate that the total triple extraction time was less than one hour (excluding preprocessing time) for a single pass over the entire corpus, which indicates good efficiency. The experiment was run on an Intel i5-8350U processor (1.70 GHz, 4 cores, 8 threads) with 16 GB of memory.
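The "less than one hour" figure can be checked with simple arithmetic from the numbers reported in this paper. The snippet below only multiplies the reported average time per sentence by the reported sentence counts; whether the average refers to all tokenized sentences or only to the valid ones is not stated, so both cases are shown.

```python
SECONDS_PER_SENTENCE = 0.02   # reported average extraction time
ALL_SENTENCES = 116_839       # sentences after tokenization (Table 1)
VALID_SENTENCES = 54_201      # sentences valid for extraction (Table 1)

for label, count in (("all", ALL_SENTENCES), ("valid", VALID_SENTENCES)):
    minutes = count * SECONDS_PER_SENTENCE / 60
    print(f"{label} sentences: about {minutes:.1f} minutes")

upper_bound_minutes = ALL_SENTENCES * SECONDS_PER_SENTENCE / 60
```

Under either reading, the estimate stays below 60 minutes, which is consistent with the statement above.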

V. FUTURE WORK

Handling reported speech in a surface knowledge extraction algorithm requires improvement. It is necessary to find a robust mechanism to identify the reported clause of a sentence, as it is mandatory when extracting knowledge accurately from reported speech sentences. Identifying the most suitable word from the noun phrase is required when a preposition and its prepositional object have been appended as the suffix for the subject and/or object component of the triple. “Lot of students” (subject = Lot, preposition = of, prepositional object = students), “percentage of entrants” “bunch of grapes” and “claims of abuse” can be considered as examples. In such situations, our algorithm uses the entire noun phrase as the triple component (subject or object) for a better meaning. Then, the triple becomes lengthy, and if a compound word or modifier is appended, the problem worsens. Therefore, it is preferable to find a method to identify the most appropriate word for the triple component from the identified noun phrase.

VI. SUMMARY AND CONCLUSION

The ever-increasing effect of information overload requires humans to be extremely selective regarding what we read and store as knowledge. Much more can be made accessible to humans if such textual content can be encoded into forms that lend themselves more readily to inference. This gap in converting natural-language textual content into a machine-processable form requires the extraction of surface knowledge. This study attempted to provide a solution to the aforementioned problem by automatically extracting surface knowledge in terms of triples from open domain news.

After a thorough examination of all features of the grammatical structure, 11 important rules were discovered that could be used in the extraction of meaningful surface knowledge. Among these, complex sentence handling, passive voice handling, and reported speech handling contribute notably to the domain of AKE. The results were validated using an inter-rater agreement test to ensure high reliability. These tests obtained acceptable IRR scores (76% and 90%) computed using the percent agreement method. The proposed approach achieved meaningful triple extraction rates of 83.5% for the Sri Lankan news corpus and 92.6% for BBC news. The preprocessed Sri Lankan English news corpus and the discovered rules for surface knowledge extraction using the grammatical structure of the sentence can be highlighted as valuable contributions of this study.

With these research findings, AKE from open-domain texts has become feasible. Hence, machines can interpret open-domain unstructured text with cutting-edge computer-processing power. Our findings will also be useful for the development of the semantic web. Finally, the news/web data, which is an extremely large knowledge source, may not be wasted in the future.

Figure 1. High-level Design Diagram - Automatic Surface Knowledge Extraction.

Figure 2. Grammatical structure.

Figure 3. Pseudocode for Rule 01 and Rule 02 implementations.

Figure 4. Pseudocode for Rule 03 implementation.

Figure 5. Pseudocode for Rule 04 implementation.

Figure 6. Pseudocode for Rule 05 implementation.

Figure 7. Pseudocode for Rule 06 implementation.

Figure 8. Pseudocode for Rule 07 implementation.

Figure 9. Pseudocode for Rule 09 implementation.

Figure 10. Pseudocode for Rule 10 implementation.

Figure 11. Some extracted surface knowledge (triples) from the Sri Lankan news.

Figure 12. Some extracted surface knowledge (triples) from the BBC news dataset.

Table 1. Statistical facts of surface knowledge extraction over the Sri Lankan news corpus.

| Fact | Figure |
|---|---|
| Number of documents in the corpus | 5,409 |
| Number of sentences after pre-processing | 116,839 |
| Number of sentences ignored | 62,638 |
| Number of sentences valid for extractions | 54,201 |
| Number of triples extracted | 54,201 |
| Number of distinct predicates extracted | 10,116 |
| Number of distinct subjects extracted | 6,736 |
| Number of distinct objects extracted | 7,937 |

Table 2. Inter-rater agreement test results - Sri Lankan news corpus.

| Fact | Figure |
|---|---|
| Total number of triples in the samples | 1,548 |
| Number of triples voted as meaningful | 1,293 |
| Number of triples voted as meaningless | 255 |
| Meaningful triple extraction rate (as a percentage) | 83.5 |
| Error rate (as a percentage) | 16.5 |
| IRR score (as a percentage) | 76 |

Table 3. Statistical facts relevant to surface knowledge extraction over the BBC news dataset.

| Fact | Figure |
|---|---|
| Number of documents in the corpus | 2,225 |
| Number of sentences after pre-processing (sentence tokenization) | 41,983 |
| Number of sentences ignored | 23,799 |
| Number of sentences selected as valid for extractions | 18,184 |
| Number of well-formed triples extracted | 18,184 |
| Number of malformed triples extracted | 0 |
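
The sentence counts reported in Tables 1 and 3 can be cross-checked with simple arithmetic. The sketch below is illustrative only: the `CorpusStats` class and its field names are hypothetical and are not part of the study's implementation. It checks that the valid sentences equal the pre-processed sentences minus the ignored ones, and that the number of extracted triples matches the number of valid sentences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusStats:
    """Counts copied from Tables 1 and 3 (class and field names are hypothetical)."""
    name: str
    sentences: int  # sentences after pre-processing
    ignored: int    # sentences ignored
    triples: int    # triples extracted

    @property
    def valid(self) -> int:
        # Sentences left for extraction once the ignored ones are removed.
        return self.sentences - self.ignored

    @property
    def valid_share(self) -> float:
        # Percentage of pre-processed sentences that were valid for extraction.
        return round(100.0 * self.valid / self.sentences, 1)


SRI_LANKAN = CorpusStats("Sri Lankan news", 116_839, 62_638, 54_201)
BBC = CorpusStats("BBC news", 41_983, 23_799, 18_184)

for corpus in (SRI_LANKAN, BBC):
    print(f"{corpus.name}: {corpus.valid} valid sentences "
          f"({corpus.valid_share}% of all sentences), {corpus.triples} triples")
```

In both corpora the counts are consistent with one extracted triple per valid sentence.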

Table 4. Inter-rater agreement test results - BBC news corpus.

| Fact | Figure |
|---|---|
| Total number of triples in the samples | 800 |
| Number of triples voted as meaningful | 741 |
| Number of triples voted as meaningless | 59 |
| Meaningful triple extraction rate (as a percentage) | 92.6 |
| Error rate (as a percentage) | 7.4 |
| IRR score (as a percentage) | 90 |
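
The extraction and error rates in Tables 2 and 4 follow directly from the vote counts. The short sketch below reproduces them; the function name is hypothetical, and rounding to one decimal place is an assumption made to match the precision shown in the tables.

```python
def extraction_rates(meaningful, meaningless):
    """Return (total, meaningful rate %, error rate %) for a rated sample.

    Rates are rounded to one decimal place (an assumption, chosen to match
    the precision reported in the tables).
    """
    total = meaningful + meaningless
    meaningful_rate = round(100.0 * meaningful / total, 1)
    error_rate = round(100.0 * meaningless / total, 1)
    return total, meaningful_rate, error_rate


# Vote counts taken from Table 2 (Sri Lankan news) and Table 4 (BBC news).
sri_lankan = extraction_rates(meaningful=1293, meaningless=255)
bbc = extraction_rates(meaningful=741, meaningless=59)

print("Sri Lankan news:", sri_lankan)  # (1548, 83.5, 16.5)
print("BBC news:", bbc)                # (800, 92.6, 7.4)
```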

References

  1. Y. Matsuo, Keyword extraction from a single document using word cooccurrence statistical information, International Journal on Artificial Intelligence Tools, vol. 13, no. 1, pp. 157-169, Mar, 2004.
    CrossRef
  2. P. K. Shah, and C. Perez-Iratxeta, and P. Bork, and M. A. Andrade, Information extraction from full text scientific articles: where are the keywords?, BMC bioinformatics, vol. 4, no. 1, p. 20, May, 2003.
    Pubmed KoreaMed CrossRef
  3. S. Beliga and A. Mestrovic and S. Martincic-Ipsic, An overview of graph-based key words extraction methods and approaches, Journal of information and organizational sciences and JIOS, vol. 39, no. 1, pp. 1-20, Jul, 2015.
  4. A. Tixier and F. Malliaros and M. Vazirgiannis, A graph degeneracybased approach to keyword extraction, in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin: TX, USA, pp. 1860-1870, 2016.
    CrossRef
  5. H. M. M. Hasan, and F. Sanyal, and D. Chaki, and M. H. Ali, An empirical study of important keyword extraction techniques from documents, in 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), Aurangabad, India, pp. 91-94, Oct, 2017.
    CrossRef
  6. S. K. Bharti and K. S. Babu and A. Pradhan, Automatic keyword extraction for text summarization in multi-document e-newspapers articles, European Journal of Advances in Engineering and Technology, vol. 4, no. 6, pp. 410-427, 2017.
  7. Z. Liu, and P. Li, and Y. Zheng, and M. Sun, Clustering to find exemplar terms for keyphrase extraction, in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, Singapore, pp. 257-266, Aug, 2009.
    CrossRef
  8. Y. Ouyang and W. Li and R. Zhang, 273. Task 5. keyphrase extraction based on core word identification and word expansion, in Proceedings of the 5th international workshop on semantic evaluation, Uppsala, Sweden, pp. 142-145, 2010.
  9. S. N. Kim, and O. Medelyan, and M. -Y. Kan, and T. Baldwin, Automatic keyphrase extraction from scientific articles, Language Resources and Evaluation, vol. 47, pp. 723-742, Dec, 2013.
    CrossRef
  10. D. Mahata, and J. Kuriakose, and R. R. Shah, and R. Zimmermann, Key2Vec: Automatic ranked keyphrase extraction from scientific articles using phrase embeddings, in Proceedings of NAACL-HLT 2018, New Orleans: LA, USA, vol. 2, pp. 634-639, 2018.
    CrossRef
  11. G. Rabby, and S. Azad, and M. Mahmud, and K. Z. Zamli, and M. M. Rahman, A flexible keyphrase extraction technique for academic literature, n Procedia Computer Science, Tangerang, Indonesia, vol. 135, pp. 553-563, 2018.
    CrossRef
  12. K. Bennani-Smires, and C. Musat, and A. Hossmann, and M. Baeriswyl, and M. Jaggi, Simple unsupervised keyphrase extraction using sentence embeddings, in Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, pp. 221-229, Jan, 2018.
    CrossRef
  13. O. Etzioni, and M. Cafarella, and D. Downey, and S. Kok, and A. -M. Popescu, and T. Shaked, and S. Soderland, and D. S. Weld, and A. Yates, Web-scale information extraction in knowitall: (preliminary results), in Proceedings of the 13th international conference on World Wide Web, New York: NY, USA, pp. 100-110, May, 2004.
    CrossRef
  14. O. Etzioni, and M. Cafarella, and D. Downey, and A. -M. Popescu, and T. Shaked, and S. Soderland, and D. S. Weld, and A. Yates, Unsupervised named-entity extraction from the web: An experimental study, Artificial Intelligence, vol. 165, no. 1, pp. 99-134, Jun, 2005.
    CrossRef
  15. A. Ritter, and S. Clark, and Mausam, and O. Etzioni, Named entity recognition in tweets: An experimental study, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, U.K, pp. 1524-1534, Jul, 2011.
  16. M. Mintz, and S. Bills, and R. Snow, and D. Jurafsky, Distant supervision for relation extraction without labeled data, in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, Suntec, Singapore, vol. 2, pp. 1003-1011, 2009.
    CrossRef
  17. D. Q. Nguyen, and K. Verspoor, Convolutional neural networks for chemical-disease relation extraction are improved with characterbased word embeddings, in Proceedings of the BioNLP 2018 workshop, Melbourne, Australia, pp. 129-136, May, 2018.
    CrossRef
  18. K. G'abor, and D. Buscaldi, and A. -K. Schumann, and B. QasemiZadeh, and H. Zargayouna, and T. Charnois, SemEval-2018Task7: Semantic relation extraction and classification in scientific papers, in Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans: LA, USA, pp. 679-688, 2018.
    CrossRef
  19. S. Pawar and G. K. Palshikar and P. Bhattacharyya, Relation extraction: A survey, arXiv:1712.05191 [cs], Dec, 2017.
    CrossRef
  20. G. Bordea and E. Lefever and P. Buitelaar, Semeval-2016 task 13: Taxonomy extraction evaluation (texeval-2), in SemEval-2016, San Diego: CA, USA, pp. 1081-1091, 2016.
    CrossRef
  21. P. Maitra, and D. Das, UNLP at SemEval-2016 Task 13: A language independent approach for hypernym identification, in Proceedings of SemEval, San Diego: CA, USA, pp. 1310-1314, 2016.
    CrossRef
  22. A. Panchenko, and S. Faralli, and E. Ruppert, and S. Remus, and H. Naets, and C. Fairon, and S. P. Ponzetto, and C. Biemann, TAXI at SemEval-2016 Task 13: A taxonomy induction method based on lexico-syntactic patterns, substrings and focused crawling, in Proceedings of SemEval, San Diego: CA, USA, pp. 1320-1327, 2016.
    CrossRef
  23. A. Yates, and M. Banko, and M. Broadhead, and M. Cafarella, and O. Etzioni, and S. Soderland, TextRunner: open information extraction on the web, in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations on XX - NAACL '07, Rochester: NY, USA, pp. 25-26, 2007.
    Pubmed CrossRef
  24. F. Wu, and D. S. Weld, Open information extraction using Wikipedia, in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 118-127, 2010.
  25. O. Etzioni, and A. Fader, and J. Christensen, and S. Soderland, and M. Mausam, Open information extraction: The second generation., in IJCAI, 2011, vol. 11, pp. 3-10, 04, Jul, 2017.
  26. J. Fan, and A. Kalyanpur, and D. C. Gondek, and D. A. Ferrucci, Automatic knowledge extraction from documents, IBM Journal of Research and Development, vol. 56, no. 3,4, pp. 5:1-5:10, May, 2012.
    CrossRef
  27. S. Soderland, and B. Roof, and B. Qin, and S. Xu, and Mausam, and O. Etzioni, Adapting open information extraction to domain-specific relations, AI Magazine, vol. 31, pp. 93-102, Jul, 2010.
    CrossRef
  28. T. M. Mitchell, and W. Cohen, and E. Hruschka, and P. Talukdar, and B. Yang, and J. Betteridge, and A. Carlson, and B. Dalvi, and M. Gardner, and B. Kisiel, and J. Krishnamurthy, and N. Lao, and K. Mazaitis, and T. Mohamed, and N. Nakashole, and E. Platanios, and A. Ritter, and M. Samadi, and B. Settles, and R. Wang, and D. Wijaya, and A. Gupta, and X. Chen, and A. Saparov, and M. Greaves, and J. Welling, Never-Ending learning, Communication of the ACM, vol. 61, no. 5, pp. 103-115, May, 2018.
    CrossRef
  29. D. Bennet, The depth of knowledge: surface, shallow or deep?, VINE, vol. 38, no. 4, pp. 405-420, Oct, 2008.
    CrossRef
  30. BBC News Summary, in Kaggle [Online]. [accessed May 20, 2020]. Available: https://www.kaggle.com/pariza/bbc-news-summary.