Journal of information and communication convergence engineering 2023; 21(2): 117-129
Published online June 30, 2023
https://doi.org/10.56977/jicce.2023.21.2.117
© Korea Institute of Information and Communication Engineering
Correspondence to : Siva S (E-mail: sivaraju.reva@gmail.com)
Department of Computer Science and Applications, Reva University, Bangalore 560064, India
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
High-utility itemset mining (HIUM) has emerged as a key data-mining paradigm for object-of-interest identification and recommendation systems that serve as frequent itemset identification tools, product or service recommendation systems, etc. Recently, it has gained widespread attention owing to its increasing role in business intelligence, top-N recommendation, and other enterprise solutions. Despite the increasing significance and the inability to provide swift and more accurate predictions, most at-hand solutions, including frequent itemset mining, HUIM, and high average- and fast high-utility itemset mining, are limited to coping with real-time enterprise demands. Moreover, complex computations and high memory exhaustion limit their scalability as enterprise solutions. To address these limitations, this study proposes a model to extract high-utility frequent closed itemsets based on an improved cumulative summary list structure (CSLFC-HUIM) to reduce an optimal set of candidate items in the search space. Moreover, it employs the lift score as the minimum threshold, called the cumulative utility threshold, to prune the search space optimal set of itemsets in a nested-list structure that improves computational time, costs, and memory exhaustion. Simulations over different datasets revealed that the proposed CSLFC-HUIM model outperforms other existing methods, such as closed- and frequent closed-HUIM variants, in terms of execution time and memory consumption, making it suitable for different mined items and allied intelligence of business goals.
Keywords Business Intelligence, Cumulative Summary List-Structure, Frequent Closed High-Utility Itemset Mining
The rapid growth in software technologies, the Internet, decentralized computing, and low-cost hardware Fhas broadened the horizon of analytics solutions for real-time decisionmaking. Advanced analytics solutions have become central entities for business intelligence and enterprise solutions, particularly for product recommendations, market segmentation, and supply chain management. It also helps in future demand analysis and prediction, which shapes businesses for proactive decision-making [1-4]. Pattern mining or high-utility itemset mining (HIUM) methods are commonly used in market basket analysis, business intelligence, and recommendation systems [5]. Functionally, HIUM and related pattern- mining approaches involve the identification of vital features, cues, or data objects, thus enabling target segmentation and prediction [6]. In the real world, the patterns can be periodic patterns, sequential patterns, and frequent itemsets. However, most business intelligence solutions employ frequent itemset patterns that enable the segmentation of the top demanding products, probable demands and items, interitem associations, and so on for further prediction and decision- making. Frequent itemset mining methods have been extensively applied for frequently occurring item segmentation over a large transaction dataset [6-8], thus, identifying frequent items with frequencies greater than a defined threshold (e.g., support-value) where prediction decisions are based [9]. Among the major pattern-mining approaches, methods such as Apriori [6-7] were developed for frequent itemset identification by performing (iterative) level-wise searching. These approaches apply the downward bottom closure (DBC) function, also known as the Apriori property, to prune the insignificant itemsets; therefore, it identifies only those items with high frequency to be labeled as high-frequency items. However, the iterative search method and large computation force them to undergo delays and memory exhaustion, making them unsuitable for real-time enterprise solutions. However, in later efforts, the authors proposed different improved solutions, including ECLAT [2], FP-Growth [3-10], and HMine [11], where the focus was on improving the data structure and pruning mechanism to achieve a higher accuracy and lower computation time. However, in these methods, factors such as transaction and its volume, copurchased items and their frequency, and items with highprofit value can affect value-oriented pattern mining and, hence, can confine their efficacy [12]. To address these limitations, high-utility itemset mining (HIUM) [13-14] methods have been developed, which use both volume and its unit profit to improve accuracy, scalability, and time efficacy. The HIUM method applies a utility factor that signifies the total profit of an itemset to identify a set of high-utility items. Numerous state-of-the-art HIUM algorithms [12-15] generate a large volume of candidate itemsets and impose computational costs and delays. However, pruning insignificant itemsets (i.e., items with a low frequency) can reduce the search space and improve performance [20-24]. Unlike the aforementioned DBC-based pruning methods, recent authors [20-21] have found that transaction-weighted utility (TWU) can improve the search space without added computation. In this function, TWU measures the value of the threshold of an upper (utmost) itemset to perform pruning. Numerous efforts have been made towards two-phase implementation [16-19] and single-phase models [20-21] for HIUM, where the first step identifies highly correlated itemsets, followed by a suitability analysis as the HUI in the second step. Unfortunately, owing to its iterative data-scanning nature, Apriori suffers from deficiency, thereby imposing a cost of computation [20]. To improve this, one-phase methods were developed [20] using a list structure, often called a utility list, to identify high-utility items. However, the utility value being proportional to the length of the itemset increases the computational costs for large data. Subsequently, the average-utility method was applied to minimize the influence of item length on high-utility itemset estimation [22- 24]. Unlike support value-based mining, using a probability factor can improve HUIM performance [25]. Thus, the list of high-utility items can be improved, and items that are more profitable and in demand can be proposed.
This study proposes a reliable, lightweight frequent closed high-utility itemset mining model driven by cumulative summary list structures (CSLFC-HUIM). The proposed model aims to determine an improved utility list structure with the most significant and frequent itemsets throughout the search space. It uses a stringent threshold measure to limit the number of candidate itemsets in the search space, thereby reducing computing time and memory fatigue. Unlike classical HUIM methods, in which the authors use the minimum support threshold to filter insignificant items, the proposed CSLFC-HUIM model employs the lift value as a support threshold to perform pruning in a nested-list structure. This value is derived from the support value, conviction, and confidence, signifying the likelihood of an item frequently occuring over transactions. The structure of the nested list was improved using the abovementioned lifting measure as a support threshold, called cumulative support, and the corresponding cumulative utility. (e.g., cumulative summary list structure or simply, CULS) to retain only the highly significant items. This helped reduce computational time and memory exhaustion. Simulation results over different benchmark datasets under various threshold conditions revealed that the proposed CSLFC-HUIM model outperformed other methods, including closed-HUIM and frequent closed-HUIM variants. CSLFC-HUIM was found to be more efficient in terms of execution time and memory consumption and is a potential solution for many types of pattern mining or business intelligence tasks.
Frequent itemset-mining methods include ECLAT [2], FPGrowth [3], Apriori [4], and HMine [11]. However, these methods fail to guarantee time efficiency and accuracy. The authors [6] applied a level-wise search method to design Apriori for frequent itemset estimation. The involved processes, such as iterative database scanning, are time-consuming and exhaustive. In [3], FP-growth was proposed to inculcate tree structure-based itemset identification. First, it generates an FP-tree structure by traversing the search (data) space and identifying frequent itemsets. However, exceedingly high reliance on the search space size and iterative scanning confine for real-time enterprise solutions. The authors in [11] designed an HMine model similar to FPGrowth but with a pointer-based hyperlink to represent items. By contrast, ECLAT [2] was designed as an improved model constituting a vertical layout of the database called “Transaction ID list” to alleviate iterative database scans. In this previous study, the authors applied the support count information for each itemset for comparison with the Transaction ID list to find the frequent itemset(s). Unfortunately, merely applying a support count cannot yield an optimal solution because of the increased search space over a large input dataset. Similar limitations have been observed in other frequent itemset mining methods [6-26-29] that failed to guarantee optimality, particularly in terms of accuracy with minimum time and memory utilization. HIUM methods [13-14] were later designed to identify significant patterns by exploiting both itemset volume and its profit information. HIUM methods were designed as two-phase [16] and onephase methods. In the two-phase method, the first phase generates a set of significant itemsets possessing a higher frequency, whereas the later phase estimates the candidate’s utility to identify high-utility itemsets. To avoid dependence on the utility value, a transaction-weighted utility (TWU) function was designed [16]. The authors [17] designed a TWU with a flexible upper threshold and high-utility itemset skipping ability to improve performance. They used a twophase model with a pruning algorithm called the isolated itemset discarding strategy (IIDS) to perform swift data scanning and pruning towards HIUM. Subsequently, various tree-based models, such as IHUP [18], HUP-tree [30], UPGrowth [19], UP-Growth+ [31], MU-Growth [32], and PB [33], were developed. These methods apply FP-Growth [3] to reduce the scanning cost using a tree structure. Unlike two-phase methods, the single-phase method [20] focuses on identifying candidate itemsets with high utility values. HUIMiner [20] was designed as a one-phase model with a list structure called a utility list (UL) to store the information of high-frequency itemsets. With the improved search space, the pruning efficacy became better. To improve HUI-Miner, the authors [21] reduced the number of joint operations between the utility value and frequency using a pruning method named estimated-utility cooccurrence pruning (EUCP), which mainly relies on a matrix structure called estimatedutility cooccurrence structure (EUCS). EUCP stores the TWU values of two item sets in the EUCS matrix, which are used to prune insignificant items without estimating their utility value. To improve [21], the U-pruning, PU-pruning, and LA-pruning methods were developed [34]. The authors [35] designed the EFIM by applying upper bounds named subtree utility and local utility. To minimize the scanning cost, a transaction merging model was designed in [35]. In [36], HMiner was improved by using utility information storage. Methods such as BAHUI [37], the HUIM-BPSO sign [38], MinHUIs [39], and FHM+ [40] were designed for HIUM; however, they were found to be susceptible to large real-time continuous data. Unlike HIUM methods, high average- utility itemset mining methods use average-utility values to reduce reliance on length constraints. As an evolution from TPAU [22], a two-phase mining model applied the upper bound as average utility upper bound (AUUB) criteria. If the AUUB value of an itemset did not meet the aforementioned criteria, it was discarded as a high-utility item, thereby reducing the search space. However, as a level-wise search method, TPAU [22] can experience higher delays and costs. To improve time efficiency, PBAU [23] was designed as a projection-based method with an indexing structure. With PBAU, an upper-limit-driven prefix is applied to reduce the search space. A tree-based high-average-utility itemset mining method was designed in [41], in which the itemset structure was applied to improve performance. The authors [42] designed HAUI-Growth with a tree structure to minimize unwanted iterative scans. A single-phase HAUI was proposed in [43] using an average utility-based list structure. AUUB is used to prune weak candidates from the search space. The authors [44] designed a HAUI miner called EHAUPM by merging two strictly defined upper thresholds called looser upper-bound utility (LUB) and revised tighter upper bound (RTUB). Despite its ability to reduce the search space, it failed to address the association between different items and the probability of becoming a HUI. In [45], MHAI was designed using the HAI list structure to identify suitable high-utility itemsets. A closed HUI (CHUI) was proposed in [51] to reduce the number of candidate itemsets. To improve the time efficiency, CHUI-Miner [52] was designed as a single- phase solution using the EU-List structure. CHIU-Miner, also known as EFIM-Closed [53], was proposed with two upper thresholds with a forward-backward check facility. It applies local utility and subtree utility thresholds to prune the search space to improve performance [54]. Unfortunately, none of these methods [46-50] can assess the probability of coexisting items for profit.
A. An itemset
B. In sync with (1) and the allied HUIM problem, several definitions are derived. These are expressed as follows:
C.
D. For a transaction set
E. Definition 2 (Cumulative Utility of an Itemset in Database D)
F. For X itemset in D, the cumulative utility value is given by (3).
G. Definition 3 (Utility Values of a Distinct Transaction and a Complete Database D)
H. The utility of a transaction refers to the cumulative utilities of all items, as shown in (4).
I. Definition 4 (Relative Utility of an Itemset)
J. For database D, the relative utility of an itemset is given by (5).
K.
L. In the context of HUIM,
M.
N. An itemset
O.
P. If the support value of itemset
Q.
This work extends the above definitions and derives a factor called a probabilistic frequent closed high-utility itemset to improve overall efficiency. If the cumulative support value of an itemset
For the illustrations in Table 1, CTWU of item
Table 1 . Transaction utility and CTWU estimation
TtID | ||||||
---|---|---|---|---|---|---|
38 | 39 | 11 | 25 | 30 | 37 | |
Item name | a | b | c | d | e | F |
CTWU | 143 | 144 | 77 | 118 | 141 | 123 |
In (7),
To apply PFCHUI, CULS is applied, which is followed by the storage of the residual utility values of the items in CULS for filtering. Here, CULS stores item information in the form of (
The
Let
1) System Model
The overall proposed model is depicted in Fig. 1.
a) Cumulative Utility Summary List (CULS):
Most existing HIUM models use a utility list structure to prune and define HUIs. The utility list of a set of entries in a transactional database can be measured by traversing the utility lists of its subsets without parsing the original database. In contrast, most existing methods perform repetitive intersection functions and contain noncritical elements in the memory. This exhausts the memory and time required by classical methods. Therefore, unlike this approach, classical utility-list models are not optimal for truncating infrequent and insignificant item sets in large transactional databases. Classical methods do not store the associations between itemsets or the probabilistic relationships between items. To address this problem, CULS is applied, which depends on a structure called a utility-list buffer [25]. CULS applies a utility list buffer structure (ULB), followed by its definition. In this study, ULB is designed as a memory-pipeline structure to store the information pertaining to each itemset in the form of (
In (10),
b) CULS Precheck Model:
The use of
c) Cumulative Summary List Structure (CSLS):
This work emphasizes performing frequent closed HUIM, signifying that not only is the itemset’s cumulative utility supposed to be higher than
d) Implementation:
To implement the model, function (12) (i.e.,
Algorithm-1 Pseudocode for CSLFC-HUIM mining Pseudocode-1 CSLFC-HUIM Mining Model
Input: Transaction database
Output: Targeted probabilistic frequent CHUI
Algorithms:
1. Initiate the scanning of the input transaction database D, and estimate the CTWU for first itemset.
2. Consider
3. Sort items in descending order (in
4. Restart scanning database
5. Form the proposed linked list structure.
6. Execute
Algorithm-1 Pseudocode for probabilistic high-frequent and high-utility itemsets
Pseudocode-2 Targeted High-Frequent and High-Utility Itemsets
Input:
Output: Targeted probabilistic frequent CHUI
1. Initiate a loop.
Algorithm-3 Pseudocode for generating targeted high-utility itemsets
Pseudocode-3
Input: CSLFC-HUIM
Output: Targeted probabilistic frequent CHUI
Multiple benchmark datasets were used to assess the efficacy of the proposed CSLFC-HUIM model. In addition, a comparison was made with different state-of-the-art HIUM methods, where the relative performance was characterized in terms of time and memory consumption. Six datasets were obtained from the SPMF Open-Source Data Mining Library [55] (Table 2). Table 2 presents the data specifications, including the total number of transactions, total items, the average length of transactions, and the type of data (sparse or dense). Dense data encompass several items per transaction, whereas sparse data contain relatively fewer items per transaction. The utility per transaction varies according to the number of items per transaction and its corresponding occurrence throughout the dataset. Accordingly, we examined the performance using different datasets, including both sparse and dense (data) structures. To assess efficacy, the performance was tested over different threshold values, including
Table 2 . Data specifications [55]
Dataset | Transactions | Items | Average length of transaction | Size (MB) | Type of the Data |
---|---|---|---|---|---|
Chess | 3196 | 75 | 37 | 656.6 KB | Dense |
Accident | 340,183 | 468 | 33.8 | 66.2 | Dense |
Retail | 88,162 | 16,470 | 10.3 | 6.7 | Sparse |
Chainstore | 1,112,949 | 46,086 | 7.3 | 39.1 | Sparse |
Kosarak | 990,002 | 41,270 | 8.1 | 21.6 | Sparse |
Connect | 67,557 | 129 | 43 | 16.9 | Dense |
Four state-of-the-art algorithms were considered: closed high-utility itemset mining (CHUI-Miner [56]), CLS-Miner [57], fast high average-utility itemset miner (FHUIM) [58], and frequent and closed HUIM (FCHUIM) [59]. The CHUIminer (CHUIM) and FCHUIM models were developed based on the closed HUIM [56] principle, where they focus mainly on improving the utility-list structure. The CSLFC-HUIM model can be considered an extension of these methods. However, unlike in [56-57-59], CSLFC-HUIM uses multiple support values to perform pruning. It applies statistics (probabilistic statistical values), including conviction, confidence, and lift, to design cumulative support values to refine pruning. Because lift is derived as an eventual feature by applying support value, confidence, and conviction, we considered this value at the place of the support value. CSLFC-HUIM uses the lift score as the support value to perform CULS and derive a nested utility list structure. It significantly reduces the candidate itemsets in the list structure, which decisively improves the computation. Fig. 2 shows the execution time analysis over different input datasets for different utility threshold values. The results indicate that the CSLFC-HUIM consumes less time.
This ability of CSLFC-HUIM improved the search space and computation. Moreover, the nested loop architecture further improved the computational efficacy and time consumption. In contrast, classical approaches, such as FCHUIM and CLSM, which are TWU pruning models, require more time. We varied the utility score for the different test datasets (i.e., Chess, Chainstore, Retail, Accident, Kosarak, and Connect) to assess the relative performance. The results confirm that the proposed CSLFC-HUIM model outperforms other stateof- the-art methods. For the chess data, where the utility threshold was varied from 350000 to 550000, CSLFC-HUIM performed better than other state-of-the-art models, such as CHUIM, CLSM, and FHAIM; however, the performance of FCHUIM was found to be close to that of the proposed CSLFC-HUIM model. This is because, similar to the proposed CSLFC-HUIM model, FCHUIM applies a support threshold-driven nested-list structure for pruning, which improves time efficiency. The simulation confirms that the CSLFC-HUIM model outperforms other state-of-the-art methods, whereas FCHUIM, CLSM, FHAIM, and CHUIM perform in decreasing order. CHUIM is a two-phase method that consumes more time than other methods. In contrast, FHAIM focuses on applying a strict upper bound in terms of the support threshold, which also reduces time efficiency. This is because it cannot avoid the presence of itemsets that have support values lower than the defined threshold. The presence of residual itemsets reduces the performance (Fig. 2). In CSLFC-HUIM, increasing the utility threshold decreases time consumption owing to reduced residual itemsets in the search space and, hence, enhances performance.
Because increasing the support threshold helps a model reduce the itemsets in the list structure (in this case, the CULS nested list structure), thereby decreasing the computational time, we varied the support threshold for different inputs in percentiles. For instance, in the chess data, we varied the support threshold from 10 to 50% of the target itemsets. The simulation results (Table 3) revealed that the proposed CSLFC-HUIM model performs better than other state-of-theart methods. The reduction in the residual itemsets in the CULS nested list structure helped reduce the execution time. The results reveal CSLFC-HUIM performs almost four times faster than the classical CHUIM methods. Compared to CLS-Miner, CSLFC-HUIM performs computations almost three times faster. The chainstore dataset carried a total of 1,112,949 transactions with a significantly large number of items (i.e., 46,086), and the different algorithms were simulated with 0.20, 0.40, 0.60, 0.80, and 1% of the item sets. The proposed CSLFC-HUIM model requires an average of 18.2 seconds to process these data for different target itemset batches. In contrast, FCHUIM, CLSM, FHAIM, and CHUIM take 35.5 seconds, 76.2 seconds, 104.7 seconds, and 88 seconds, respectively. For the Chess dataset, the average time consumptions for CHUIM, CLSM, FHAIM, and FCHUIM were 1582.6 seconds, 1255 seconds, 703.2 seconds, and 157.8 seconds, respectively. Compared to these algorithms, CSLFC-HUIM took merely 74.6 seconds, signifying superior performance in HIUM tasks. Similar outputs were found for the Kosarak data, where the existing methods showed higher time consumption (CHUIM (93.18s), CLSM (15.86s), FHAIM (73.4s), and FCHUIM (11.0s)). Compared to these state-ofthe- art methods, CSLFC-HUIM took 8.8 seconds, signifying superior time efficacy and exhaustion towards HUIM.
Table 3 . Relative runtime analysis with varying minimum cumulative support value (support threshold) for the different datasets
Data | MinCumSup | CHUIM | CLSM | FHAIUM | FCHUI | CSLFCHUIM |
---|---|---|---|---|---|---|
Chess | 10% | 1825 | 1461 | 986 | 217 | 111.3 |
20% | 1779 | 1402 | 781 | 206 | 87 | |
30% | 1620 | 1247 | 703 | 131 | 69 | |
40% | 1458 | 1195 | 621 | 123 | 59.6 | |
50% | 1231 | 972 | 425 | 112 | 46.2 | |
Chainstore | 0.20% | 107.4 | 103.6 | 129.4 | 45 | 26.1 |
0.40% | 95.3 | 93.02 | 117.3 | 41.8 | 21.8 | |
0.60% | 89.1 | 79.3 | 104.2 | 36.3 | 17.02 | |
0.80% | 79.4 | 63.1 | 93.8 | 32.5 | 14.91 | |
1% | 68.8 | 42.1 | 79.2 | 21.9 | 11.26 | |
Retail | 0.20% | 24.31 | 21.27 | 14.83 | 7.9 | 6.7 |
0.40% | 16.96 | 13.8 | 11.12 | 7.2 | 6.4 | |
0.60% | 12.09 | 10.96 | 9.36 | 4.8 | 4.6 | |
0.80% | 7.82 | 9.91 | 6.63 | 4.21 | 4.02 | |
1% | 4.21 | 3.27 | 5.71 | 3.2 | 2.81 | |
Accident | 10% | 142 | 127 | 124.5 | 92 | 77.76 |
20% | 134.4 | 116.4 | 118.81 | 88.3 | 68.2 | |
30% | 127.4 | 86.8 | 99.32 | 79.2 | 49.91 | |
40% | 91.8 | 69.2 | 87.6 | 52.2 | 29.2 | |
50% | 77.7 | 56.6 | 79.1 | 29 | 21.6 | |
Kosarak | 2% | 129.91 | 38 | 102.8 | 32 | 26.62 |
4% | 107.2 | 18 | 90.3 | 6 | 4.9 | |
6% | 95.7 | 12 | 84.2 | 5.8 | 4.47 | |
8% | 83.2 | 6.2 | 47.6 | 5.6 | 4.2 | |
10% | 49.9 | 5.1 | 42.3 | 5.63 | 3.98 | |
Connect | 2% | 264 | 248 | 233 | 197 | 124.4 |
4% | 240 | 237 | 231 | 184 | 101.3 | |
6% | 197 | 187 | 219 | 162 | 97.76 | |
8% | 193 | 181.3 | 179.3 | 127.3 | 95.21 | |
10% | 176 | 159 | 161 | 87.3 | 83.22 |
The average memory utilization by CHUIM, CLSM, FHAIM, and FCHUIM algorithms was 216.4 MB, 208 MB, 204 MB, and 159.8 MB, respectively. In contrast, the proposed CSLFC-HUIM model consumes merely 35 MB of memory (Fig. 3), confirming its robustness in terms of memory utilization. Interestingly, CSLFC-HUIM performs almost four times lower in terms of memory exhaustion than the recently proposed FCHUIM algorithm and almost 6 times lower than CHUIM, CLS-Miner, and FHAIM algorithms. For the Chainstore data, the memory utilization measurement shows that the state-of-the-art CHUIM, CLSM, FHAIM, and FCHUIM consume 793 MB, 744 MB, 673 MB, and 696 MB of resources, respectively. In contrast, CSLFC-HUIM consumes only 132 MB of average memory over the different utility thresholds. A similar pattern can be found in the retail dataset, where CSLFC-HUIM exhibits almost five times lower memory (154 MB) than the state-of-the-art methods (CHUIM (732 MB), CLSM (686 MB), FHAIM (671 MB), and FCHUIM (576 MB)). For the Accident dataset, the memory consumption of CHUIM (1486 MB), CLSM (836 MB), FHAIM (1008 MB), and FCHUIM (816 MB) was higher than that of the proposed CSLFC-HUIM model (779 MB). For the Kosarak dataset, the average memory utilization of CHUIM, CLSM, FHAIM, and FCHUIM were 626 MB, 578 MB, 620 MB, and 384 MB, respectively, whereas CSLFC-HUIM consumed only 331.6 MB, signifying better resource efficiency. Similarly, for the Connect dataset, CHUIM (622 MB), CLSM (486 MB), FHAIM (566 MB), and FCHUIM (350 MB) were found to be resource-exhaustive than the proposed CSLFC-HUIM model (288 MB). The overall results confirm that the proposed CSLFC-HUIM can yield both time and memory efficiency to satisfy major HUIM purposes and enterprise demands.
This study proposed a lightweight high-utility closed-element set discovery model (CSLFC-HUIM) for HUIM model mining tasks. The proposed model focuses on improving the utility list structure with frequent and important item sets in the search space to improve overall computational efficiency. Unlike classical HUIM methods, the proposed model applies a lift measure or a value derived as a cumulative score to perform pruning. The use of the lift measure as a support threshold, called cumulative support, and corresponding cumulative utility values helps improve the nested list named cumulative utility list structure (CULS) to refine the search space and the optimal residual itemsets to reduce computational time and memory exhaustion. The use of the cumulative score as a support value as the threshold realized both closed-HUIM and frequent closed-HUIM. The implementation of the cumulative summary list structure with a precheck- assisted nested-list structure helped the CSLFC-HUIM model exhibit better computational time and memory utilization. Interestingly, the average runtime required (i.e., timeconsumption) by the different HUIM methods showed that the proposed method consumed only 42.71 seconds, whereas the other methods like CHUIM, CLSM, FHAIM, and FCHUI consumed 350.92 seconds, 275.5 seconds, 199.57 seconds, and 71.57 seconds, respectively. This confirms that the proposed CSLFC-HUIM model has exceedingly better performance than other state-of-the-art methods, and the role of improved multiple conditions or thresholds in the form of a lift score helped reduce the search space, while CULS enabled swift computation to perform HUIM. Simulations on different benchmark datasets show that the CSLFC-HUIM is nearly five times more responsive while ensuring significantly lower memory consumption. The relative performance with other state-of-the-art methods, such as CHUIM, FHAIM, CLSM, and FCHUIM, confirms that CSLFC-HUIM exhibits superior performance, where the role of the probabilistic derived lift score as a threshold cannot be ignored. As a matter of fact, CSLFC-HUIM is superior towards HUIM tasks; however, it could not address the need for top-N most frequent itemset identification. Future research will focus on identifying the top-N items to assist enterprises in making proactive supply chains and marketing decisions in the business horizon.
He received his Master of Computer Applications (M.CA) from the University of Madras, India in 2004. He has over 16 years of experience in the field of software research and development. He is currently employed as a senior engineering manager in a company in Bangalore, India, where a healthcare mobile solution is being developed using IoT and AI. Currently, he is pursuing a Ph.D. at REVA University, Bangalore, India. He is a member of IEEE.
Currently, she is working as an Associate Professor at the Department of CSE, MSRIT, Bangalore. She has been a technology educator and corporate trainer since 1999. In the last 18 years, she has served in various academic positions in technical institutes in Maharashtra and Karnataka. Her areas of research and teaching include network security,RTOS, computational intelligence, wireless networks, and embedded system development. She is a professional member of the Computer Society of India (CSI), a Life member since 2013, and an IEEE Member #94611631, Bangalore Section.
Journal of information and communication convergence engineering 2023; 21(2): 117-129
Published online June 30, 2023 https://doi.org/10.56977/jicce.2023.21.2.117
Copyright © Korea Institute of Information and Communication Engineering.
Siva S 1* and Shilpa Chaudhari2
1Department of Computer Science and Applications, Reva University, Bangalore 560064, India
2Department of Computer Science and Engineering, MS Ramaiah Institute of Technology, Bangalore 560054, India
Correspondence to:Siva S (E-mail: sivaraju.reva@gmail.com)
Department of Computer Science and Applications, Reva University, Bangalore 560064, India
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
High-utility itemset mining (HIUM) has emerged as a key data-mining paradigm for object-of-interest identification and recommendation systems that serve as frequent itemset identification tools, product or service recommendation systems, etc. Recently, it has gained widespread attention owing to its increasing role in business intelligence, top-N recommendation, and other enterprise solutions. Despite the increasing significance and the inability to provide swift and more accurate predictions, most at-hand solutions, including frequent itemset mining, HUIM, and high average- and fast high-utility itemset mining, are limited to coping with real-time enterprise demands. Moreover, complex computations and high memory exhaustion limit their scalability as enterprise solutions. To address these limitations, this study proposes a model to extract high-utility frequent closed itemsets based on an improved cumulative summary list structure (CSLFC-HUIM) to reduce an optimal set of candidate items in the search space. Moreover, it employs the lift score as the minimum threshold, called the cumulative utility threshold, to prune the search space optimal set of itemsets in a nested-list structure that improves computational time, costs, and memory exhaustion. Simulations over different datasets revealed that the proposed CSLFC-HUIM model outperforms other existing methods, such as closed- and frequent closed-HUIM variants, in terms of execution time and memory consumption, making it suitable for different mined items and allied intelligence of business goals.
Keywords: Business Intelligence, Cumulative Summary List-Structure, Frequent Closed High-Utility Itemset Mining
The rapid growth in software technologies, the Internet, decentralized computing, and low-cost hardware Fhas broadened the horizon of analytics solutions for real-time decisionmaking. Advanced analytics solutions have become central entities for business intelligence and enterprise solutions, particularly for product recommendations, market segmentation, and supply chain management. It also helps in future demand analysis and prediction, which shapes businesses for proactive decision-making [1-4]. Pattern mining or high-utility itemset mining (HIUM) methods are commonly used in market basket analysis, business intelligence, and recommendation systems [5]. Functionally, HIUM and related pattern- mining approaches involve the identification of vital features, cues, or data objects, thus enabling target segmentation and prediction [6]. In the real world, the patterns can be periodic patterns, sequential patterns, and frequent itemsets. However, most business intelligence solutions employ frequent itemset patterns that enable the segmentation of the top demanding products, probable demands and items, interitem associations, and so on for further prediction and decision- making. Frequent itemset mining methods have been extensively applied for frequently occurring item segmentation over a large transaction dataset [6-8], thus, identifying frequent items with frequencies greater than a defined threshold (e.g., support-value) where prediction decisions are based [9]. Among the major pattern-mining approaches, methods such as Apriori [6-7] were developed for frequent itemset identification by performing (iterative) level-wise searching. These approaches apply the downward bottom closure (DBC) function, also known as the Apriori property, to prune the insignificant itemsets; therefore, it identifies only those items with high frequency to be labeled as high-frequency items. However, the iterative search method and large computation force them to undergo delays and memory exhaustion, making them unsuitable for real-time enterprise solutions. However, in later efforts, the authors proposed different improved solutions, including ECLAT [2], FP-Growth [3-10], and HMine [11], where the focus was on improving the data structure and pruning mechanism to achieve a higher accuracy and lower computation time. However, in these methods, factors such as transaction and its volume, copurchased items and their frequency, and items with highprofit value can affect value-oriented pattern mining and, hence, can confine their efficacy [12]. To address these limitations, high-utility itemset mining (HIUM) [13-14] methods have been developed, which use both volume and its unit profit to improve accuracy, scalability, and time efficacy. The HIUM method applies a utility factor that signifies the total profit of an itemset to identify a set of high-utility items. Numerous state-of-the-art HIUM algorithms [12-15] generate a large volume of candidate itemsets and impose computational costs and delays. However, pruning insignificant itemsets (i.e., items with a low frequency) can reduce the search space and improve performance [20-24]. Unlike the aforementioned DBC-based pruning methods, recent authors [20-21] have found that transaction-weighted utility (TWU) can improve the search space without added computation. In this function, TWU measures the value of the threshold of an upper (utmost) itemset to perform pruning. Numerous efforts have been made towards two-phase implementation [16-19] and single-phase models [20-21] for HIUM, where the first step identifies highly correlated itemsets, followed by a suitability analysis as the HUI in the second step. Unfortunately, owing to its iterative data-scanning nature, Apriori suffers from deficiency, thereby imposing a cost of computation [20]. To improve this, one-phase methods were developed [20] using a list structure, often called a utility list, to identify high-utility items. However, the utility value being proportional to the length of the itemset increases the computational costs for large data. Subsequently, the average-utility method was applied to minimize the influence of item length on high-utility itemset estimation [22- 24]. Unlike support value-based mining, using a probability factor can improve HUIM performance [25]. Thus, the list of high-utility items can be improved, and items that are more profitable and in demand can be proposed.
This study proposes a reliable, lightweight frequent closed high-utility itemset mining model driven by cumulative summary list structures (CSLFC-HUIM). The proposed model aims to determine an improved utility list structure with the most significant and frequent itemsets throughout the search space. It uses a stringent threshold measure to limit the number of candidate itemsets in the search space, thereby reducing computing time and memory fatigue. Unlike classical HUIM methods, in which the authors use the minimum support threshold to filter insignificant items, the proposed CSLFC-HUIM model employs the lift value as a support threshold to perform pruning in a nested-list structure. This value is derived from the support value, conviction, and confidence, signifying the likelihood of an item frequently occuring over transactions. The structure of the nested list was improved using the abovementioned lifting measure as a support threshold, called cumulative support, and the corresponding cumulative utility. (e.g., cumulative summary list structure or simply, CULS) to retain only the highly significant items. This helped reduce computational time and memory exhaustion. Simulation results over different benchmark datasets under various threshold conditions revealed that the proposed CSLFC-HUIM model outperformed other methods, including closed-HUIM and frequent closed-HUIM variants. CSLFC-HUIM was found to be more efficient in terms of execution time and memory consumption and is a potential solution for many types of pattern mining or business intelligence tasks.
Frequent itemset-mining methods include ECLAT [2], FPGrowth [3], Apriori [4], and HMine [11]. However, these methods fail to guarantee time efficiency and accuracy. The authors [6] applied a level-wise search method to design Apriori for frequent itemset estimation. The involved processes, such as iterative database scanning, are time-consuming and exhaustive. In [3], FP-growth was proposed to inculcate tree structure-based itemset identification. First, it generates an FP-tree structure by traversing the search (data) space and identifying frequent itemsets. However, exceedingly high reliance on the search space size and iterative scanning confine for real-time enterprise solutions. The authors in [11] designed an HMine model similar to FPGrowth but with a pointer-based hyperlink to represent items. By contrast, ECLAT [2] was designed as an improved model constituting a vertical layout of the database called “Transaction ID list” to alleviate iterative database scans. In this previous study, the authors applied the support count information for each itemset for comparison with the Transaction ID list to find the frequent itemset(s). Unfortunately, merely applying a support count cannot yield an optimal solution because of the increased search space over a large input dataset. Similar limitations have been observed in other frequent itemset mining methods [6-26-29] that failed to guarantee optimality, particularly in terms of accuracy with minimum time and memory utilization. HIUM methods [13-14] were later designed to identify significant patterns by exploiting both itemset volume and its profit information. HIUM methods were designed as two-phase [16] and onephase methods. In the two-phase method, the first phase generates a set of significant itemsets possessing a higher frequency, whereas the later phase estimates the candidate’s utility to identify high-utility itemsets. To avoid dependence on the utility value, a transaction-weighted utility (TWU) function was designed [16]. The authors [17] designed a TWU with a flexible upper threshold and high-utility itemset skipping ability to improve performance. They used a twophase model with a pruning algorithm called the isolated itemset discarding strategy (IIDS) to perform swift data scanning and pruning towards HIUM. Subsequently, various tree-based models, such as IHUP [18], HUP-tree [30], UPGrowth [19], UP-Growth+ [31], MU-Growth [32], and PB [33], were developed. These methods apply FP-Growth [3] to reduce the scanning cost using a tree structure. Unlike two-phase methods, the single-phase method [20] focuses on identifying candidate itemsets with high utility values. HUIMiner [20] was designed as a one-phase model with a list structure called a utility list (UL) to store the information of high-frequency itemsets. With the improved search space, the pruning efficacy became better. To improve HUI-Miner, the authors [21] reduced the number of joint operations between the utility value and frequency using a pruning method named estimated-utility cooccurrence pruning (EUCP), which mainly relies on a matrix structure called estimatedutility cooccurrence structure (EUCS). EUCP stores the TWU values of two item sets in the EUCS matrix, which are used to prune insignificant items without estimating their utility value. To improve [21], the U-pruning, PU-pruning, and LA-pruning methods were developed [34]. The authors [35] designed the EFIM by applying upper bounds named subtree utility and local utility. To minimize the scanning cost, a transaction merging model was designed in [35]. In [36], HMiner was improved by using utility information storage. Methods such as BAHUI [37], the HUIM-BPSO sign [38], MinHUIs [39], and FHM+ [40] were designed for HIUM; however, they were found to be susceptible to large real-time continuous data. Unlike HIUM methods, high average- utility itemset mining methods use average-utility values to reduce reliance on length constraints. As an evolution from TPAU [22], a two-phase mining model applied the upper bound as average utility upper bound (AUUB) criteria. If the AUUB value of an itemset did not meet the aforementioned criteria, it was discarded as a high-utility item, thereby reducing the search space. However, as a level-wise search method, TPAU [22] can experience higher delays and costs. To improve time efficiency, PBAU [23] was designed as a projection-based method with an indexing structure. With PBAU, an upper-limit-driven prefix is applied to reduce the search space. A tree-based high-average-utility itemset mining method was designed in [41], in which the itemset structure was applied to improve performance. The authors [42] designed HAUI-Growth with a tree structure to minimize unwanted iterative scans. A single-phase HAUI was proposed in [43] using an average utility-based list structure. AUUB is used to prune weak candidates from the search space. The authors [44] designed a HAUI miner called EHAUPM by merging two strictly defined upper thresholds called looser upper-bound utility (LUB) and revised tighter upper bound (RTUB). Despite its ability to reduce the search space, it failed to address the association between different items and the probability of becoming a HUI. In [45], MHAI was designed using the HAI list structure to identify suitable high-utility itemsets. A closed HUI (CHUI) was proposed in [51] to reduce the number of candidate itemsets. To improve the time efficiency, CHUI-Miner [52] was designed as a single- phase solution using the EU-List structure. CHIU-Miner, also known as EFIM-Closed [53], was proposed with two upper thresholds with a forward-backward check facility. It applies local utility and subtree utility thresholds to prune the search space to improve performance [54]. Unfortunately, none of these methods [46-50] can assess the probability of coexisting items for profit.
A. An itemset
B. In sync with (1) and the allied HUIM problem, several definitions are derived. These are expressed as follows:
C.
D. For a transaction set
E. Definition 2 (Cumulative Utility of an Itemset in Database D)
F. For X itemset in D, the cumulative utility value is given by (3).
G. Definition 3 (Utility Values of a Distinct Transaction and a Complete Database D)
H. The utility of a transaction refers to the cumulative utilities of all items, as shown in (4).
I. Definition 4 (Relative Utility of an Itemset)
J. For database D, the relative utility of an itemset is given by (5).
K.
L. In the context of HUIM,
M.
N. An itemset
O.
P. If the support value of itemset
Q.
This work extends the above definitions and derives a factor called a probabilistic frequent closed high-utility itemset to improve overall efficiency. If the cumulative support value of an itemset
For the illustrations in Table 1, CTWU of item
Table 1 . Transaction utility and CTWU estimation.
TtID | ||||||
---|---|---|---|---|---|---|
38 | 39 | 11 | 25 | 30 | 37 | |
Item name | a | b | c | d | e | F |
CTWU | 143 | 144 | 77 | 118 | 141 | 123 |
In (7),
To apply PFCHUI, CULS is applied, which is followed by the storage of the residual utility values of the items in CULS for filtering. Here, CULS stores item information in the form of (
The
Let
1) System Model
The overall proposed model is depicted in Fig. 1.
a) Cumulative Utility Summary List (CULS):
Most existing HIUM models use a utility list structure to prune and define HUIs. The utility list of a set of entries in a transactional database can be measured by traversing the utility lists of its subsets without parsing the original database. In contrast, most existing methods perform repetitive intersection functions and contain noncritical elements in the memory. This exhausts the memory and time required by classical methods. Therefore, unlike this approach, classical utility-list models are not optimal for truncating infrequent and insignificant item sets in large transactional databases. Classical methods do not store the associations between itemsets or the probabilistic relationships between items. To address this problem, CULS is applied, which depends on a structure called a utility-list buffer [25]. CULS applies a utility list buffer structure (ULB), followed by its definition. In this study, ULB is designed as a memory-pipeline structure to store the information pertaining to each itemset in the form of (
In (10),
b) CULS Precheck Model:
The use of
c) Cumulative Summary List Structure (CSLS):
This work emphasizes performing frequent closed HUIM, signifying that not only is the itemset’s cumulative utility supposed to be higher than
d) Implementation:
To implement the model, function (12) (i.e.,
Algorithm-1 Pseudocode for CSLFC-HUIM mining Pseudocode-1 CSLFC-HUIM Mining Model
Input: Transaction database
Output: Targeted probabilistic frequent CHUI
Algorithms:
1. Initiate the scanning of the input transaction database D, and estimate the CTWU for first itemset.
2. Consider
3. Sort items in descending order (in
4. Restart scanning database
5. Form the proposed linked list structure.
6. Execute
Algorithm-1 Pseudocode for probabilistic high-frequent and high-utility itemsets
Pseudocode-2 Targeted High-Frequent and High-Utility Itemsets
Input:
Output: Targeted probabilistic frequent CHUI
1. Initiate a loop.
Algorithm-3 Pseudocode for generating targeted high-utility itemsets
Pseudocode-3
Input: CSLFC-HUIM
Output: Targeted probabilistic frequent CHUI
Multiple benchmark datasets were used to assess the efficacy of the proposed CSLFC-HUIM model. In addition, a comparison was made with different state-of-the-art HIUM methods, where the relative performance was characterized in terms of time and memory consumption. Six datasets were obtained from the SPMF Open-Source Data Mining Library [55] (Table 2). Table 2 presents the data specifications, including the total number of transactions, total items, the average length of transactions, and the type of data (sparse or dense). Dense data encompass several items per transaction, whereas sparse data contain relatively fewer items per transaction. The utility per transaction varies according to the number of items per transaction and its corresponding occurrence throughout the dataset. Accordingly, we examined the performance using different datasets, including both sparse and dense (data) structures. To assess efficacy, the performance was tested over different threshold values, including
Table 2 . Data specifications [55].
Dataset | Transactions | Items | Average length of transaction | Size (MB) | Type of the Data |
---|---|---|---|---|---|
Chess | 3196 | 75 | 37 | 656.6 KB | Dense |
Accident | 340,183 | 468 | 33.8 | 66.2 | Dense |
Retail | 88,162 | 16,470 | 10.3 | 6.7 | Sparse |
Chainstore | 1,112,949 | 46,086 | 7.3 | 39.1 | Sparse |
Kosarak | 990,002 | 41,270 | 8.1 | 21.6 | Sparse |
Connect | 67,557 | 129 | 43 | 16.9 | Dense |
Four state-of-the-art algorithms were considered: closed high-utility itemset mining (CHUI-Miner [56]), CLS-Miner [57], fast high average-utility itemset miner (FHUIM) [58], and frequent and closed HUIM (FCHUIM) [59]. The CHUIminer (CHUIM) and FCHUIM models were developed based on the closed HUIM [56] principle, where they focus mainly on improving the utility-list structure. The CSLFC-HUIM model can be considered an extension of these methods. However, unlike in [56-57-59], CSLFC-HUIM uses multiple support values to perform pruning. It applies statistics (probabilistic statistical values), including conviction, confidence, and lift, to design cumulative support values to refine pruning. Because lift is derived as an eventual feature by applying support value, confidence, and conviction, we considered this value at the place of the support value. CSLFC-HUIM uses the lift score as the support value to perform CULS and derive a nested utility list structure. It significantly reduces the candidate itemsets in the list structure, which decisively improves the computation. Fig. 2 shows the execution time analysis over different input datasets for different utility threshold values. The results indicate that the CSLFC-HUIM consumes less time.
This ability of CSLFC-HUIM improved the search space and computation. Moreover, the nested loop architecture further improved the computational efficacy and time consumption. In contrast, classical approaches, such as FCHUIM and CLSM, which are TWU pruning models, require more time. We varied the utility score for the different test datasets (i.e., Chess, Chainstore, Retail, Accident, Kosarak, and Connect) to assess the relative performance. The results confirm that the proposed CSLFC-HUIM model outperforms other stateof- the-art methods. For the chess data, where the utility threshold was varied from 350000 to 550000, CSLFC-HUIM performed better than other state-of-the-art models, such as CHUIM, CLSM, and FHAIM; however, the performance of FCHUIM was found to be close to that of the proposed CSLFC-HUIM model. This is because, similar to the proposed CSLFC-HUIM model, FCHUIM applies a support threshold-driven nested-list structure for pruning, which improves time efficiency. The simulation confirms that the CSLFC-HUIM model outperforms other state-of-the-art methods, whereas FCHUIM, CLSM, FHAIM, and CHUIM perform in decreasing order. CHUIM is a two-phase method that consumes more time than other methods. In contrast, FHAIM focuses on applying a strict upper bound in terms of the support threshold, which also reduces time efficiency. This is because it cannot avoid the presence of itemsets that have support values lower than the defined threshold. The presence of residual itemsets reduces the performance (Fig. 2). In CSLFC-HUIM, increasing the utility threshold decreases time consumption owing to reduced residual itemsets in the search space and, hence, enhances performance.
Because increasing the support threshold helps a model reduce the itemsets in the list structure (in this case, the CULS nested list structure), thereby decreasing the computational time, we varied the support threshold for different inputs in percentiles. For instance, in the chess data, we varied the support threshold from 10 to 50% of the target itemsets. The simulation results (Table 3) revealed that the proposed CSLFC-HUIM model performs better than other state-of-theart methods. The reduction in the residual itemsets in the CULS nested list structure helped reduce the execution time. The results reveal CSLFC-HUIM performs almost four times faster than the classical CHUIM methods. Compared to CLS-Miner, CSLFC-HUIM performs computations almost three times faster. The chainstore dataset carried a total of 1,112,949 transactions with a significantly large number of items (i.e., 46,086), and the different algorithms were simulated with 0.20, 0.40, 0.60, 0.80, and 1% of the item sets. The proposed CSLFC-HUIM model requires an average of 18.2 seconds to process these data for different target itemset batches. In contrast, FCHUIM, CLSM, FHAIM, and CHUIM take 35.5 seconds, 76.2 seconds, 104.7 seconds, and 88 seconds, respectively. For the Chess dataset, the average time consumptions for CHUIM, CLSM, FHAIM, and FCHUIM were 1582.6 seconds, 1255 seconds, 703.2 seconds, and 157.8 seconds, respectively. Compared to these algorithms, CSLFC-HUIM took merely 74.6 seconds, signifying superior performance in HIUM tasks. Similar outputs were found for the Kosarak data, where the existing methods showed higher time consumption (CHUIM (93.18s), CLSM (15.86s), FHAIM (73.4s), and FCHUIM (11.0s)). Compared to these state-ofthe- art methods, CSLFC-HUIM took 8.8 seconds, signifying superior time efficacy and exhaustion towards HUIM.
Table 3 . Relative runtime analysis with varying minimum cumulative support value (support threshold) for the different datasets.
Data | MinCumSup | CHUIM | CLSM | FHAIUM | FCHUI | CSLFCHUIM |
---|---|---|---|---|---|---|
Chess | 10% | 1825 | 1461 | 986 | 217 | 111.3 |
20% | 1779 | 1402 | 781 | 206 | 87 | |
30% | 1620 | 1247 | 703 | 131 | 69 | |
40% | 1458 | 1195 | 621 | 123 | 59.6 | |
50% | 1231 | 972 | 425 | 112 | 46.2 | |
Chainstore | 0.20% | 107.4 | 103.6 | 129.4 | 45 | 26.1 |
0.40% | 95.3 | 93.02 | 117.3 | 41.8 | 21.8 | |
0.60% | 89.1 | 79.3 | 104.2 | 36.3 | 17.02 | |
0.80% | 79.4 | 63.1 | 93.8 | 32.5 | 14.91 | |
1% | 68.8 | 42.1 | 79.2 | 21.9 | 11.26 | |
Retail | 0.20% | 24.31 | 21.27 | 14.83 | 7.9 | 6.7 |
0.40% | 16.96 | 13.8 | 11.12 | 7.2 | 6.4 | |
0.60% | 12.09 | 10.96 | 9.36 | 4.8 | 4.6 | |
0.80% | 7.82 | 9.91 | 6.63 | 4.21 | 4.02 | |
1% | 4.21 | 3.27 | 5.71 | 3.2 | 2.81 | |
Accident | 10% | 142 | 127 | 124.5 | 92 | 77.76 |
20% | 134.4 | 116.4 | 118.81 | 88.3 | 68.2 | |
30% | 127.4 | 86.8 | 99.32 | 79.2 | 49.91 | |
40% | 91.8 | 69.2 | 87.6 | 52.2 | 29.2 | |
50% | 77.7 | 56.6 | 79.1 | 29 | 21.6 | |
Kosarak | 2% | 129.91 | 38 | 102.8 | 32 | 26.62 |
4% | 107.2 | 18 | 90.3 | 6 | 4.9 | |
6% | 95.7 | 12 | 84.2 | 5.8 | 4.47 | |
8% | 83.2 | 6.2 | 47.6 | 5.6 | 4.2 | |
10% | 49.9 | 5.1 | 42.3 | 5.63 | 3.98 | |
Connect | 2% | 264 | 248 | 233 | 197 | 124.4 |
4% | 240 | 237 | 231 | 184 | 101.3 | |
6% | 197 | 187 | 219 | 162 | 97.76 | |
8% | 193 | 181.3 | 179.3 | 127.3 | 95.21 | |
10% | 176 | 159 | 161 | 87.3 | 83.22 |
The average memory utilization by CHUIM, CLSM, FHAIM, and FCHUIM algorithms was 216.4 MB, 208 MB, 204 MB, and 159.8 MB, respectively. In contrast, the proposed CSLFC-HUIM model consumes merely 35 MB of memory (Fig. 3), confirming its robustness in terms of memory utilization. Interestingly, CSLFC-HUIM performs almost four times lower in terms of memory exhaustion than the recently proposed FCHUIM algorithm and almost 6 times lower than CHUIM, CLS-Miner, and FHAIM algorithms. For the Chainstore data, the memory utilization measurement shows that the state-of-the-art CHUIM, CLSM, FHAIM, and FCHUIM consume 793 MB, 744 MB, 673 MB, and 696 MB of resources, respectively. In contrast, CSLFC-HUIM consumes only 132 MB of average memory over the different utility thresholds. A similar pattern can be found in the retail dataset, where CSLFC-HUIM exhibits almost five times lower memory (154 MB) than the state-of-the-art methods (CHUIM (732 MB), CLSM (686 MB), FHAIM (671 MB), and FCHUIM (576 MB)). For the Accident dataset, the memory consumption of CHUIM (1486 MB), CLSM (836 MB), FHAIM (1008 MB), and FCHUIM (816 MB) was higher than that of the proposed CSLFC-HUIM model (779 MB). For the Kosarak dataset, the average memory utilization of CHUIM, CLSM, FHAIM, and FCHUIM were 626 MB, 578 MB, 620 MB, and 384 MB, respectively, whereas CSLFC-HUIM consumed only 331.6 MB, signifying better resource efficiency. Similarly, for the Connect dataset, CHUIM (622 MB), CLSM (486 MB), FHAIM (566 MB), and FCHUIM (350 MB) were found to be resource-exhaustive than the proposed CSLFC-HUIM model (288 MB). The overall results confirm that the proposed CSLFC-HUIM can yield both time and memory efficiency to satisfy major HUIM purposes and enterprise demands.
This study proposed a lightweight high-utility closed-element set discovery model (CSLFC-HUIM) for HUIM model mining tasks. The proposed model focuses on improving the utility list structure with frequent and important item sets in the search space to improve overall computational efficiency. Unlike classical HUIM methods, the proposed model applies a lift measure or a value derived as a cumulative score to perform pruning. The use of the lift measure as a support threshold, called cumulative support, and corresponding cumulative utility values helps improve the nested list named cumulative utility list structure (CULS) to refine the search space and the optimal residual itemsets to reduce computational time and memory exhaustion. The use of the cumulative score as a support value as the threshold realized both closed-HUIM and frequent closed-HUIM. The implementation of the cumulative summary list structure with a precheck- assisted nested-list structure helped the CSLFC-HUIM model exhibit better computational time and memory utilization. Interestingly, the average runtime required (i.e., timeconsumption) by the different HUIM methods showed that the proposed method consumed only 42.71 seconds, whereas the other methods like CHUIM, CLSM, FHAIM, and FCHUI consumed 350.92 seconds, 275.5 seconds, 199.57 seconds, and 71.57 seconds, respectively. This confirms that the proposed CSLFC-HUIM model has exceedingly better performance than other state-of-the-art methods, and the role of improved multiple conditions or thresholds in the form of a lift score helped reduce the search space, while CULS enabled swift computation to perform HUIM. Simulations on different benchmark datasets show that the CSLFC-HUIM is nearly five times more responsive while ensuring significantly lower memory consumption. The relative performance with other state-of-the-art methods, such as CHUIM, FHAIM, CLSM, and FCHUIM, confirms that CSLFC-HUIM exhibits superior performance, where the role of the probabilistic derived lift score as a threshold cannot be ignored. As a matter of fact, CSLFC-HUIM is superior towards HUIM tasks; however, it could not address the need for top-N most frequent itemset identification. Future research will focus on identifying the top-N items to assist enterprises in making proactive supply chains and marketing decisions in the business horizon.
Table 1 . Transaction utility and CTWU estimation.
TtID | ||||||
---|---|---|---|---|---|---|
38 | 39 | 11 | 25 | 30 | 37 | |
Item name | a | b | c | d | e | F |
CTWU | 143 | 144 | 77 | 118 | 141 | 123 |
Table 3 . Relative runtime analysis with varying minimum cumulative support value (support threshold) for the different datasets.
Data | MinCumSup | CHUIM | CLSM | FHAIUM | FCHUI | CSLFCHUIM |
---|---|---|---|---|---|---|
Chess | 10% | 1825 | 1461 | 986 | 217 | 111.3 |
20% | 1779 | 1402 | 781 | 206 | 87 | |
30% | 1620 | 1247 | 703 | 131 | 69 | |
40% | 1458 | 1195 | 621 | 123 | 59.6 | |
50% | 1231 | 972 | 425 | 112 | 46.2 | |
Chainstore | 0.20% | 107.4 | 103.6 | 129.4 | 45 | 26.1 |
0.40% | 95.3 | 93.02 | 117.3 | 41.8 | 21.8 | |
0.60% | 89.1 | 79.3 | 104.2 | 36.3 | 17.02 | |
0.80% | 79.4 | 63.1 | 93.8 | 32.5 | 14.91 | |
1% | 68.8 | 42.1 | 79.2 | 21.9 | 11.26 | |
Retail | 0.20% | 24.31 | 21.27 | 14.83 | 7.9 | 6.7 |
0.40% | 16.96 | 13.8 | 11.12 | 7.2 | 6.4 | |
0.60% | 12.09 | 10.96 | 9.36 | 4.8 | 4.6 | |
0.80% | 7.82 | 9.91 | 6.63 | 4.21 | 4.02 | |
1% | 4.21 | 3.27 | 5.71 | 3.2 | 2.81 | |
Accident | 10% | 142 | 127 | 124.5 | 92 | 77.76 |
20% | 134.4 | 116.4 | 118.81 | 88.3 | 68.2 | |
30% | 127.4 | 86.8 | 99.32 | 79.2 | 49.91 | |
40% | 91.8 | 69.2 | 87.6 | 52.2 | 29.2 | |
50% | 77.7 | 56.6 | 79.1 | 29 | 21.6 | |
Kosarak | 2% | 129.91 | 38 | 102.8 | 32 | 26.62 |
4% | 107.2 | 18 | 90.3 | 6 | 4.9 | |
6% | 95.7 | 12 | 84.2 | 5.8 | 4.47 | |
8% | 83.2 | 6.2 | 47.6 | 5.6 | 4.2 | |
10% | 49.9 | 5.1 | 42.3 | 5.63 | 3.98 | |
Connect | 2% | 264 | 248 | 233 | 197 | 124.4 |
4% | 240 | 237 | 231 | 184 | 101.3 | |
6% | 197 | 187 | 219 | 162 | 97.76 | |
8% | 193 | 181.3 | 179.3 | 127.3 | 95.21 | |
10% | 176 | 159 | 161 | 87.3 | 83.22 |