Journal of information and communication convergence engineering 2022; 20(2): 103-112
Published online June 30, 2022
https://doi.org/10.6109/jicce.2022.20.2.103
© Korea Institute of Information and Communication Engineering
Correspondence to : Hyu Chan Park (E-mail: hcpark@kmou.ac.kr, Tel: +82-51-410-4573)
Department of Computer Engineering, Korea Maritime and Ocean University, Busan 49112, Korea.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
The goal of pattern mining is to identify novel patterns in a database. High utility itemset mining (HUIM) is a research direction for pattern mining. This is different from frequent itemset mining (FIM), which additionally considers the quantity and profit of the commodity. Several algorithms have been used to mine high utility itemsets (HUIs). The original BPSO algorithm lacks local search capabilities in the subsequent stage, resulting in insufficient HUIs to be mined. Compared to the transfer function used in the original PSO algorithm, the V-shaped transfer function more sufficiently reflects the probability between the velocity and position change of the particles. Considering the influence of the acceleration factor on the particle motion mode and trajectory, a nonlinear acceleration strategy was used to enhance the search ability of the particles. Experiments show that the number of mined HUIs is 73% higher than that of the original BPSO algorithm, which indicates better performance of the proposed algorithm.
Keywords High utility itemsets, Particle swarm optimization, V-shaped transfer function, Nonlinear acceleration coefficient strategy.
Data mining is the process of searching for information hidden in large amounts of data using algorithms. The searched information can be either nonnumerical or inductive. It can also be used for information management, decision support, and other purposes.
Frequent itemset mining (FIM) is an important data mining technique. Frequent patterns refer to the set of items that appear frequently in a dataset, such as the set of items that appear frequently in transaction data. However, the same item may be purchased more than once in a single transaction, and the profit of each item varies. FIM does not consider these factors. To address these shortcomings of frequent itemsets, high utility itemset mining was proposed.
High-utility itemset mining (HUIM) mines a set of items that satisfy certain conditions using a certain strategy. A mining method is required to mine itemsets with a relatively low frequency and high utility value. For instance, low sales of jewelry, and high profits; high sales of bread and low profits. Therefore, the target to be mined should be defined beforehand and a minimum utility value should be provided as the judgment condition.
The process of finding all itemsets in the database that exceed the minimum utility value is called utility mining, and the identified itemsets are called utility itemsets. In practical application scenarios, the utility of an item set can be measured by a variety of factors, such as weight, profit, or cost, which can be chosen by the users themselves according to their needs [1]. Therefore, utility mining can be more intuitive for decision problems in practical scenarios.
For the traditional HUIM algorithm, when the size of the mined database is large or the number of items included is extremely large, the search space is large and requires a long time. Metaheuristic algorithms can be used to solve such problems.
Metaheuristic algorithm is a modification of the heuristic algorithm that combines a randomized algorithm with a local search algorithm. Kannimuthu developed high utility pattern extraction using a genetic algorithm with ranked mutation of the minimum utility threshold to mine high utility itemsets (HUIs) [2]. This algorithm requires crossover and mutation during the planning process to randomly generate the next solution. However, many calculations are required in the initial stage to find a satisfactory itemset, which requires a long time.
Another metaheuristic algorithm, particle swarm optimization (PSO), is a population-based stochastic optimization algorithm [3]. It is guided by individual and global solutions to determine the optimal solutions by changing the velocity and position of the particles. Compared with genetic algorithms, PSO algorithms save time by eliminating the need to perform crossover and mutation operations.
The original PSO algorithm is primarily used to solve continuous space optimization problems. Generally, it is used to find the optimal solution of a function; however, when mining high utility item sets, each item has only two states, that is, with or without this item. Therefore, a binary PSO (BPSO) algorithm was designed to solve this problem [4].
In BPSO, the discrete problem space is mapped to a continuous particle motion space with appropriate modifications. It retains the velocity position update strategy of classical PSO to find the optimal solution. The updated formula for the particle velocity remains the same; however, the value of an individual position can only be zero or one.
The probability of particle position change is determined by the transfer function. The absolute value of particle velocity should be considered when selecting the transfer function. For larger absolute values of the velocity, the transfer function should provide a higher possibility of position change. Because the velocity of a particle is related to its superiority or inferiority, a larger velocity indicates that it is farther from the optimal solution and vice versa [5].
The original BPSO algorithm uses a transfer function that is a sigmoid function, also known as an S-shaped function. Mirjalili divided the transfer function of the BPSO into S-shaped and V-shaped functions [6]. The experimental analysis of the S-shaped transfer function shows that the function lacks late local search capability. Mirjalili compared the effectiveness of V-shaped and S-shaped transfer functions on benchmark functions in optimization tests and verified the excellent search capability of the V-shaped transfer function on many benchmark functions.
During the particle search process, the variation in the acceleration coefficients of the particles can also affect the way the particles are searched. The acceleration coefficients were divided into linear and nonlinear. According to Sun's experiments, the nonlinear method is more accurate and stable when searching for an optimal solution [7].
Considering the mining ability and accuracy of the algorithm in the process of high utility itemsets mining, we propose a mining method based on the BPSO algorithm with a V-shaped transfer function and nonlinear acceleration coefficient strategy, called VBPSO.
The experiments show that the proposed VBPSO algorithm mines more HUIs than the BPSO. Although the running time of the VBPSO algorithm was longer, the number of HUIs mined in the same unit time was greater than that of the BPSO algorithm.
In this chapter, works related to HUIM and PSO are briefly reviewed.
High-utility itemset mining is an evolution of weighted FIM. In weighted FIM, only the number of occurrences of an item and its weight are considered; the number of occurrences of the item in a transaction is not considered. However, in a transaction, the same product may be purchased more than once, and the total profit generated by the transaction cannot be calculated solely based on the profit of a single product. Therefore, to meet the demand of merchants to pursue high sales profit without ignoring profitable but low support items, researchers have proposed high-utility itemset mining. It considers the unit profit (the weight of items) and number (the number of items in the transaction set) of items. Traditional weighted FIM is unsuitable for this scenario.
The problem of mining HUIs was first proposed by Chan [8]. Subsequently, Yao found HUIs in terms of the quantity of an item as an internal utility and its profit per unit as an external utility [9]. To solve this problem, they proposed a mining algorithm. Lin proposed using the PSO algorithm to mine HUIs, and subsequently proposed mining HUIs using the binary PSO algorithm to reduce mining time [1].
Utility is at the core of high utility itemset mining. It is determined by two factors: the external effects between different items in the database (the profit of the item), and the internal effects of the item in the transaction (the number of items). The utility of an item set in the database is the sum of the products of the external and internal utilities of each item in the item set. If the utility of an itemset is not less than the minimum threshold set by the user, the itemset is considered a high utility itemset.
The PSO algorithm, proposed by Kennedy and Eberhart in 1995, simulates the behavior of a group of birds searching for food in an area. In this area, there is only one piece of food. None of the birds knew where the food is. However, they know how far their current position is from the food; thus, the optimal strategy to find the food is to search the area around the bird closest to it.
The PSO algorithm is inspired by this biological population behavior property and is used to solve the optimization problem. Particles are used to model the aforementioned bird individuals, and each particle can be considered a search individual in an N-dimensional space. The current position of a particle is a candidate solution to the corresponding optimization problem, and the flight process of the particle is the search process of the individual. The flight velocity of a particle can be dynamically adjusted according to the individual's historical optimal position and the population's historical optimal position (Kennedy and Eberhart 1995).
The steps of PSO can be illustrated as follows：
1. Initialization: First, we set the number of particles and velocity interval, and then we set the maximum number of iterations and position information for the entire search space. The velocity and position of each particle are then randomly initialized within the velocity interval and search space.
2. Iteration: This involves defining the fitness function, the individual optimal solution (pbest) is the optimal solution found by each particle individually, that is, the best result after calculation using the fitness function. The result with the best outcome among these individual optimal solutions is treated as the global optimal solution (gbest). The pbest and gbest values are then compared with the historical values, and the largest values of pbest and gbest are updated and retained. Then, the particle updated its position based on pbest, gbest, and its own inertia. The particle position update method is illustrated in Fig. 1.
3. Termination condition: The global optimal solution of the particle, as well as its position and velocity, are continuously updated until the termination criterion is reached.
During the PSO algorithm update, the corresponding particle position and velocity are described as follows:
In Eqs. (1) and (2),
Kennedy and Eberhart suggested that the values of
Subsequently, Feng showed through experimental validation that
Particle swarm algorithms have been widely used in data mining. Jiang proposed a multi-objective particle swarm algorithm to generate association rules for the relationship between the sentiment dimensions and design attributes [13]. Indira proposed an adaptive method for controlling the inertia and learning factors in particle swarm algorithms to mine association rules [14]. Jitendra proposed a method for mining positive and negative association rules. They applied the coding step to affirmative and negative rules separately, in terms of the statistical correlation between transaction databases [15].
However, in practical scenarios, many problems have variables in a discrete variable space, rather than in a continuous interval. Kennedy and Eberhart designed a discrete BPSO to solve the limitations of continuous PSO [4].
The BPSO algorithm maps a discrete problem space to a continuous particle motion space by converting each particle into a set of binary variables. The BPSO requires a transfer function to convert real values to binary values of zero or one. Transfer functions define the possibility of changing the elements of the position vector from zero to one. The original transfer functions were used to perform probabilistic updates using the sigmoid transition functions. The advantage of the V-shaped transfer function over the sigmoid transfer function is that it encourages particles to remain at their current position when the velocity value is low and switch to their complement when the velocity value is high. This allows to enhance the local search capability of the algorithm, thus improving the performance of the BPSO algorithm. Zhang mined frequent itemsets using the BPSO algorithm based on the V-shaped transfer function [16].
High-utility itemset mining is the calculation of the utility value of an itemset from transactions in the database. It differs from FIM in that it considers the quantity and profit of items. The utility value must be calculated using a series of formulae. The problem description and related definitions in high utility itemset mining are described as follows.
Let
Table 1 presents an example to demonstrate this thesis. In Table 1, there are ten transactions and six different items, denoted by a to
Table 1 . Transaction database
TID | Transaction (item, quantity) |
---|---|
T_{1} | b:13, c:9, d:12 |
T_{2} | b:5, d:6, e:8, f:15 |
T_{3} | d:15, e:13, f:13 |
T_{4} | b:14, d:7 |
T_{5} | a:6, b:10, d:15, e:10 |
T_{6} | b:6, c:6 |
T_{7} | b:2, d:14 |
T_{8} | a:2, b:6, d:9, e:10 |
T_{9} | b:5, d:13, e:12 |
T_{10} | a:10, b:15, d:11, e:4 |
Table 2 . Profit table
Item | a | b | c | d | e | f |
---|---|---|---|---|---|---|
Profit | 2 | 9 | 8 | 10 | 5 | 7 |
There are seven definitions for the high utility of itemset calculations as follows.
Definition 1. The utility value of item
For example,
Definition 2. In transaction
For example, in transaction
Definition 3. The utility of itemset
For example, in the entire transaction database, the utility of itemsets a and bc are calculated as
Definition 4. The total utility of a transaction
For example,
Definition 5. The total utility of a DB database is denoted as
The total utility in this example is calculated as
Definition 6. A HUI is a set of item sets whose utility is greater than the threshold (
HUIM finds the itemsets with utility values greater than the threshold from the transaction database and the profit table, in which the threshold is set by the user.
In this section, the mining process is described in detail in conjunction with the PSO algorithm.
The proposed algorithm is based on the BPSO algorithm. The particles are the basic units of the BPSO algorithm, and each particle in the PSO algorithm corresponds to an itemset, that is, a potential HUI. Each position of the item set is composed of either one or zero, which represents the presence or absence of the corresponding item, respectively. As shown in Fig. 2, only the corresponding positions of items a and e are one, and thus the itemset corresponding to particle (10010) is ae.
Using the same rule for particle encoding, the items contained in each transaction and the quantities corresponding to each item are converted into a binary matrix and a corresponding quantity matrix according to this rule.
The utility value of the itemset is calculated by scanning the binary matrix and quantity matrix. Binarizing the database can reduce the amount of time spent scanning the database, which must be traversed once each time the utility value of a particle is calculated. The scan time of the binary matrix was shorter than that of the character-based data.
Considering Table 1 as an example, the converted binary item matrix and quantity matrix of this database are shown in Figs. 3 and 4, respectively.
In the BPSO algorithm, particle evaluation is based on the results of the fitness function. The fitness value is used to represent the utility value of the item set. The potential solutions generated by each particle are fed into the fitness function, and the merit of the particle is measured according to the magnitude of the fitness value. The best position of an individual is denoted as pbest and the best position in the population is denoted as gbest. They serve as the basis for particle movement. The fitness function in this algorithm can be defined as:
Consider the utility value calculation of item set (ae) as an example. Itemsets (ae) are present in transactions
The updated formula for the velocity of the BPSO algorithm is the same as that of the original PSO algorithm, as shown in Eq. (1). The probability of particle position change is based on the transfer function. The transfer function should be the probability of changing each dimension of the position vector from zero to one and vice versa.
The original transfer function for the BPSO position-vector update is a sigmoid function. The sigmoid transfer function is shown in Fig. 5, where X is the particle velocity. The sigmoid function is a monotonically increasing function that does not conform to the rule that the probability of its position change increases with the absolute value of the velocity. The V-shaped function does not force the particle to change its value from zero to one. Alternatively, it updates the position to its complement when the particle is moving extremely fast and keeps the particle at its original position when the velocity is extremely slow. The velocity of the particles in the PSO algorithm represents the quality of the particles. The V-shaped transfer function is shown in Fig. 6.
According to the velocity update formula, when the particle is farther from the optimal solution, the velocity of the particle increases. Therefore, when the velocity of the particle is larger, the particle should have a greater probability of changing its position vector, that is, it should be updated to the complement of that position vector. Therefore, we selected the V-shaped transfer function as the transfer function of the proposed algorithm.
In addition, compared with the sigmoid function, the V-shaped function has more powerful local search capability, which is necessary when mining the HUIs. This is because the resulting result is not only the global optimal solution, but also more potential solutions can be obtained only when it has a stronger local search capability. This is because the more capable the local search is, the wider the search will be.
The V-shaped function used in this study is shown in Eqs. (10) and (11), where
In Eq. (1),
The ideal state in the process of mining HUIs should be to expand the search space of particles at the early stage of the search to obtain diverse particles. With more potential solutions at the early stage, it can provide more references to particles in the search, as well as enhance the ability to search for global optimal solutions and find more HUIs in subsequent stages.
Therefore, considering the impact of acceleration on the BPSO algorithm, Chen used the proposed nonlinear acceleration adjustment strategy constructed by the arc cosine function (Chen, 2017), as shown in Eqs. (12) and (13). The trends of
The designed mining model, based on BPSO, is divided into four steps. The results are shown in Fig. 8.
Pre-processing consists of generating an item binary matrix and a quantity matrix. The preprocessing step focuses on providing the required type of data and facilitating the mining process. Therefore, in this process, the database is converted into a binary matrix and a quantity matrix. We then calculate the total utility value of the database DB.
The initialization process consists of initializing the particles and obtaining pbest and gbest. In the initialization process, the particles are initialized by a binary matrix, and each particle is provided with a random initial velocity; then, the value of each particle is calculated by the fitness function to derive pbest and gbest. Furthermore, in the subsequent iteration process, the particles are updated by performing the corresponding updates for velocity, pbest, and gbest with a V-shaped transfer function and nonlinear acceleration coefficient strategy. During the iteration process, if the utility value of the particle is greater than the set threshold, the corresponding position of the particle is added to the set of utility itemsets at the time of the update. This process is repeated until the upper limit of the number of iterations is reached, and finally, the set of HUIs is obtained.
In this section, the proposed HUIM-VBPSO (v-shaped) algorithm is compared with HUIM-BPSO (s-shaped). All experiments were performed in Microsoft Windows 10, with an Intel i5 3.0 GHz CPU and 16.38 GB RAM. The experiments were conducted simultaneously using three different databases: chess, mushroom, and foodmart datasets, which are commonly used for HUIM experiments. The information on the three databases is presented in Table 3. The number of particles was set to 20, both algorithms were iterated 10,000 times, and the velocity range of the particles was set between [-10, 10].
Table 3 . Dataset
Dataset | Number of items | Total number of transactions |
---|---|---|
Chess | 76 | 3,196 |
Mushroom | 120 | 8,124 |
Foodmart | 1,559 | 21,557 |
The percentage of discovered HUIs for HUIM-BPSO and the proposed algorithm HUIM-VBPSO are listed in Table 4. The number of HUIs mined by all algorithms for the three datasets is shown in Figs. 9-11.
Table 4 . Percentage of discovered HUIs
Threshold (%) | HUIM-BPSO (%) | HUIM-VBPSO (%) |
---|---|---|
Chess dataset | ||
25.30 | 47.06 | 88.24 |
25.50 | 57.14 | 92.85 |
25.70 | 58.62 | 96.55 |
Mushroom dataset | ||
14.00 | 34.78 | 60.87 |
14.25 | 38.46 | 66.67 |
14.50 | 40.00 | 70.00 |
Foodmart dataset | ||
0.12 | 15.87 | 25.40 |
0.13 | 18.42 | 28.94 |
0.14 | 20.00 | 33.44 |
The experiments show that the proposed VBPSO algorithm mines more HUIs than the BPSO. According to the experimental data, the proportion of HUIs mined by the BPSO algorithm decreased when the number of items increased, which may be a limitation of the PSO algorithm.
The running times of the HUIM-BPSO and HUIM-VBPSO are listed in Table 5. The number of HUIs mined by all the algorithms for the three datasets is shown in Figs. 12-14. For the runtime experiments, the thresholds for the three datasets were set to be consistent with those for the HUI mining experiments.
Table 5 . Running time of algorithms (in seconds)
Threshold (%) | HUIM-BPSO (s) | HUIM-VBPSO (s) |
---|---|---|
Chess dataset | ||
25.30 | 1,072 | 1,475 |
25.50 | 1,089 | 1,496 |
25.70 | 1,099 | 1,483 |
Mushroom dataset | ||
14.00 | 1,850 | 2,451 |
14.25 | 1,796 | 2,475 |
14.50 | 1,882 | 2,468 |
Foodmart dataset | ||
0.12 | 20,073 | 25,753 |
0.13 | 20,186 | 25,920 |
0.14 | 20,124 | 26,968 |
The running time of the VBPSO algorithm was longer than that of the BPSO algorithm. This is because, in the early stage, the local search needs to be intensified to expand the search area. This step requires more time than the BPSO. However, the number of HUIs mined in the same unit time was greater than that of the BPSO.
High-utility itemset mining is an important issue in data mining and can replace FIM to reveal high-profit products. To mine such itemsets, this study proposes a new BPSO algorithm called VBPSO, which is an extension of the original BPSO algorithm with a V-shaped transfer function and nonlinear acceleration coefficient strategy. The proposed VBPSO algorithm exhibited better performance. Several experiments showed that although the running time of the proposed VBPSO is 32% longer, the number of mined HUIs is 73% higher compared with the original BPSO. Therefore, it can be used for practical applications that are not real time.
In addition to the PSO algorithm, other metaheuristic algorithms have been proposed in recent years. In the future, we will try to adopt these new algorithms for HUIM to determine whether they can mine more HUIs or reduce the running time.
is a Ph.D. student in the Department of Computer Engineering at Korea Maritime and Ocean University. He received his B.S. degree in software engineering from East China Jiaotong University in 2015 and his M.S. degree in computer engineering from Korea Maritime and Ocean University in 2022. His research areas include data mining and big data analysis.
is a professor in the Department of Computer Engineering at Korea Maritime and Ocean University. He received his Ph.D. from the Universite de Fanche-Comte in 1995. His research areas include signal processing and embedded systems.
is a professor in the Department of Computer Engineering at Korea Maritime and Ocean University. He received his Ph.D. in computer engineering from the Korea Advanced Institute of Science and Technology in 1995. His research areas include databases, data mining, big data analysis, and marine information systems.
Journal of information and communication convergence engineering 2022; 20(2): 103-112
Published online June 30, 2022 https://doi.org/10.6109/jicce.2022.20.2.103
Copyright © Korea Institute of Information and Communication Engineering.
Bodong Tao , Ok Keun Shin , and Hyu Chan Park^{* }
Department of Computer Engineering, Korea Maritime and Ocean University, Busan 49112, Korea
Correspondence to:Hyu Chan Park (E-mail: hcpark@kmou.ac.kr, Tel: +82-51-410-4573)
Department of Computer Engineering, Korea Maritime and Ocean University, Busan 49112, Korea.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
The goal of pattern mining is to identify novel patterns in a database. High utility itemset mining (HUIM) is a research direction for pattern mining. This is different from frequent itemset mining (FIM), which additionally considers the quantity and profit of the commodity. Several algorithms have been used to mine high utility itemsets (HUIs). The original BPSO algorithm lacks local search capabilities in the subsequent stage, resulting in insufficient HUIs to be mined. Compared to the transfer function used in the original PSO algorithm, the V-shaped transfer function more sufficiently reflects the probability between the velocity and position change of the particles. Considering the influence of the acceleration factor on the particle motion mode and trajectory, a nonlinear acceleration strategy was used to enhance the search ability of the particles. Experiments show that the number of mined HUIs is 73% higher than that of the original BPSO algorithm, which indicates better performance of the proposed algorithm.
Keywords: High utility itemsets, Particle swarm optimization, V-shaped transfer function, Nonlinear acceleration coefficient strategy.
Data mining is the process of searching for information hidden in large amounts of data using algorithms. The searched information can be either nonnumerical or inductive. It can also be used for information management, decision support, and other purposes.
Frequent itemset mining (FIM) is an important data mining technique. Frequent patterns refer to the set of items that appear frequently in a dataset, such as the set of items that appear frequently in transaction data. However, the same item may be purchased more than once in a single transaction, and the profit of each item varies. FIM does not consider these factors. To address these shortcomings of frequent itemsets, high utility itemset mining was proposed.
High-utility itemset mining (HUIM) mines a set of items that satisfy certain conditions using a certain strategy. A mining method is required to mine itemsets with a relatively low frequency and high utility value. For instance, low sales of jewelry, and high profits; high sales of bread and low profits. Therefore, the target to be mined should be defined beforehand and a minimum utility value should be provided as the judgment condition.
The process of finding all itemsets in the database that exceed the minimum utility value is called utility mining, and the identified itemsets are called utility itemsets. In practical application scenarios, the utility of an item set can be measured by a variety of factors, such as weight, profit, or cost, which can be chosen by the users themselves according to their needs [1]. Therefore, utility mining can be more intuitive for decision problems in practical scenarios.
For the traditional HUIM algorithm, when the size of the mined database is large or the number of items included is extremely large, the search space is large and requires a long time. Metaheuristic algorithms can be used to solve such problems.
Metaheuristic algorithm is a modification of the heuristic algorithm that combines a randomized algorithm with a local search algorithm. Kannimuthu developed high utility pattern extraction using a genetic algorithm with ranked mutation of the minimum utility threshold to mine high utility itemsets (HUIs) [2]. This algorithm requires crossover and mutation during the planning process to randomly generate the next solution. However, many calculations are required in the initial stage to find a satisfactory itemset, which requires a long time.
Another metaheuristic algorithm, particle swarm optimization (PSO), is a population-based stochastic optimization algorithm [3]. It is guided by individual and global solutions to determine the optimal solutions by changing the velocity and position of the particles. Compared with genetic algorithms, PSO algorithms save time by eliminating the need to perform crossover and mutation operations.
The original PSO algorithm is primarily used to solve continuous space optimization problems. Generally, it is used to find the optimal solution of a function; however, when mining high utility item sets, each item has only two states, that is, with or without this item. Therefore, a binary PSO (BPSO) algorithm was designed to solve this problem [4].
In BPSO, the discrete problem space is mapped to a continuous particle motion space with appropriate modifications. It retains the velocity position update strategy of classical PSO to find the optimal solution. The updated formula for the particle velocity remains the same; however, the value of an individual position can only be zero or one.
The probability of particle position change is determined by the transfer function. The absolute value of particle velocity should be considered when selecting the transfer function. For larger absolute values of the velocity, the transfer function should provide a higher possibility of position change. Because the velocity of a particle is related to its superiority or inferiority, a larger velocity indicates that it is farther from the optimal solution and vice versa [5].
The original BPSO algorithm uses a transfer function that is a sigmoid function, also known as an S-shaped function. Mirjalili divided the transfer function of the BPSO into S-shaped and V-shaped functions [6]. The experimental analysis of the S-shaped transfer function shows that the function lacks late local search capability. Mirjalili compared the effectiveness of V-shaped and S-shaped transfer functions on benchmark functions in optimization tests and verified the excellent search capability of the V-shaped transfer function on many benchmark functions.
During the particle search process, the variation in the acceleration coefficients of the particles can also affect the way the particles are searched. The acceleration coefficients were divided into linear and nonlinear. According to Sun's experiments, the nonlinear method is more accurate and stable when searching for an optimal solution [7].
Considering the mining ability and accuracy of the algorithm in the process of high utility itemsets mining, we propose a mining method based on the BPSO algorithm with a V-shaped transfer function and nonlinear acceleration coefficient strategy, called VBPSO.
The experiments show that the proposed VBPSO algorithm mines more HUIs than the BPSO. Although the running time of the VBPSO algorithm was longer, the number of HUIs mined in the same unit time was greater than that of the BPSO algorithm.
In this chapter, works related to HUIM and PSO are briefly reviewed.
High-utility itemset mining is an evolution of weighted FIM. In weighted FIM, only the number of occurrences of an item and its weight are considered; the number of occurrences of the item in a transaction is not considered. However, in a transaction, the same product may be purchased more than once, and the total profit generated by the transaction cannot be calculated solely based on the profit of a single product. Therefore, to meet the demand of merchants to pursue high sales profit without ignoring profitable but low support items, researchers have proposed high-utility itemset mining. It considers the unit profit (the weight of items) and number (the number of items in the transaction set) of items. Traditional weighted FIM is unsuitable for this scenario.
The problem of mining HUIs was first proposed by Chan [8]. Subsequently, Yao found HUIs in terms of the quantity of an item as an internal utility and its profit per unit as an external utility [9]. To solve this problem, they proposed a mining algorithm. Lin proposed using the PSO algorithm to mine HUIs, and subsequently proposed mining HUIs using the binary PSO algorithm to reduce mining time [1].
Utility is at the core of high utility itemset mining. It is determined by two factors: the external effects between different items in the database (the profit of the item), and the internal effects of the item in the transaction (the number of items). The utility of an item set in the database is the sum of the products of the external and internal utilities of each item in the item set. If the utility of an itemset is not less than the minimum threshold set by the user, the itemset is considered a high utility itemset.
The PSO algorithm, proposed by Kennedy and Eberhart in 1995, simulates the behavior of a group of birds searching for food in an area. In this area, there is only one piece of food. None of the birds knew where the food is. However, they know how far their current position is from the food; thus, the optimal strategy to find the food is to search the area around the bird closest to it.
The PSO algorithm is inspired by this biological population behavior property and is used to solve the optimization problem. Particles are used to model the aforementioned bird individuals, and each particle can be considered a search individual in an N-dimensional space. The current position of a particle is a candidate solution to the corresponding optimization problem, and the flight process of the particle is the search process of the individual. The flight velocity of a particle can be dynamically adjusted according to the individual's historical optimal position and the population's historical optimal position (Kennedy and Eberhart 1995).
The steps of PSO can be illustrated as follows：
1. Initialization: First, we set the number of particles and velocity interval, and then we set the maximum number of iterations and position information for the entire search space. The velocity and position of each particle are then randomly initialized within the velocity interval and search space.
2. Iteration: This involves defining the fitness function, the individual optimal solution (pbest) is the optimal solution found by each particle individually, that is, the best result after calculation using the fitness function. The result with the best outcome among these individual optimal solutions is treated as the global optimal solution (gbest). The pbest and gbest values are then compared with the historical values, and the largest values of pbest and gbest are updated and retained. Then, the particle updated its position based on pbest, gbest, and its own inertia. The particle position update method is illustrated in Fig. 1.
3. Termination condition: The global optimal solution of the particle, as well as its position and velocity, are continuously updated until the termination criterion is reached.
During the PSO algorithm update, the corresponding particle position and velocity are described as follows:
In Eqs. (1) and (2),
Kennedy and Eberhart suggested that the values of
Subsequently, Feng showed through experimental validation that
Particle swarm algorithms have been widely used in data mining. Jiang proposed a multi-objective particle swarm algorithm to generate association rules for the relationship between the sentiment dimensions and design attributes [13]. Indira proposed an adaptive method for controlling the inertia and learning factors in particle swarm algorithms to mine association rules [14]. Jitendra proposed a method for mining positive and negative association rules. They applied the coding step to affirmative and negative rules separately, in terms of the statistical correlation between transaction databases [15].
However, in practical scenarios, many problems have variables in a discrete variable space, rather than in a continuous interval. Kennedy and Eberhart designed a discrete BPSO to solve the limitations of continuous PSO [4].
The BPSO algorithm maps a discrete problem space to a continuous particle motion space by converting each particle into a set of binary variables. The BPSO requires a transfer function to convert real values to binary values of zero or one. Transfer functions define the possibility of changing the elements of the position vector from zero to one. The original transfer functions were used to perform probabilistic updates using the sigmoid transition functions. The advantage of the V-shaped transfer function over the sigmoid transfer function is that it encourages particles to remain at their current position when the velocity value is low and switch to their complement when the velocity value is high. This allows to enhance the local search capability of the algorithm, thus improving the performance of the BPSO algorithm. Zhang mined frequent itemsets using the BPSO algorithm based on the V-shaped transfer function [16].
High-utility itemset mining is the calculation of the utility value of an itemset from transactions in the database. It differs from FIM in that it considers the quantity and profit of items. The utility value must be calculated using a series of formulae. The problem description and related definitions in high utility itemset mining are described as follows.
Let
Table 1 presents an example to demonstrate this thesis. In Table 1, there are ten transactions and six different items, denoted by a to
Table 1 . Transaction database.
TID | Transaction (item, quantity) |
---|---|
T_{1} | b:13, c:9, d:12 |
T_{2} | b:5, d:6, e:8, f:15 |
T_{3} | d:15, e:13, f:13 |
T_{4} | b:14, d:7 |
T_{5} | a:6, b:10, d:15, e:10 |
T_{6} | b:6, c:6 |
T_{7} | b:2, d:14 |
T_{8} | a:2, b:6, d:9, e:10 |
T_{9} | b:5, d:13, e:12 |
T_{10} | a:10, b:15, d:11, e:4 |
Table 2 . Profit table.
Item | a | b | c | d | e | f |
---|---|---|---|---|---|---|
Profit | 2 | 9 | 8 | 10 | 5 | 7 |
There are seven definitions for the high utility of itemset calculations as follows.
Definition 1. The utility value of item
For example,
Definition 2. In transaction
For example, in transaction
Definition 3. The utility of itemset
For example, in the entire transaction database, the utility of itemsets a and bc are calculated as
Definition 4. The total utility of a transaction
For example,
Definition 5. The total utility of a DB database is denoted as
The total utility in this example is calculated as
Definition 6. A HUI is a set of item sets whose utility is greater than the threshold (
HUIM finds the itemsets with utility values greater than the threshold from the transaction database and the profit table, in which the threshold is set by the user.
In this section, the mining process is described in detail in conjunction with the PSO algorithm.
The proposed algorithm is based on the BPSO algorithm. The particles are the basic units of the BPSO algorithm, and each particle in the PSO algorithm corresponds to an itemset, that is, a potential HUI. Each position of the item set is composed of either one or zero, which represents the presence or absence of the corresponding item, respectively. As shown in Fig. 2, only the corresponding positions of items a and e are one, and thus the itemset corresponding to particle (10010) is ae.
Using the same rule for particle encoding, the items contained in each transaction and the quantities corresponding to each item are converted into a binary matrix and a corresponding quantity matrix according to this rule.
The utility value of the itemset is calculated by scanning the binary matrix and quantity matrix. Binarizing the database can reduce the amount of time spent scanning the database, which must be traversed once each time the utility value of a particle is calculated. The scan time of the binary matrix was shorter than that of the character-based data.
Considering Table 1 as an example, the converted binary item matrix and quantity matrix of this database are shown in Figs. 3 and 4, respectively.
In the BPSO algorithm, particle evaluation is based on the results of the fitness function. The fitness value is used to represent the utility value of the item set. The potential solutions generated by each particle are fed into the fitness function, and the merit of the particle is measured according to the magnitude of the fitness value. The best position of an individual is denoted as pbest and the best position in the population is denoted as gbest. They serve as the basis for particle movement. The fitness function in this algorithm can be defined as:
Consider the utility value calculation of item set (ae) as an example. Itemsets (ae) are present in transactions
The updated formula for the velocity of the BPSO algorithm is the same as that of the original PSO algorithm, as shown in Eq. (1). The probability of particle position change is based on the transfer function. The transfer function should be the probability of changing each dimension of the position vector from zero to one and vice versa.
The original transfer function for the BPSO position-vector update is a sigmoid function. The sigmoid transfer function is shown in Fig. 5, where X is the particle velocity. The sigmoid function is a monotonically increasing function that does not conform to the rule that the probability of its position change increases with the absolute value of the velocity. The V-shaped function does not force the particle to change its value from zero to one. Alternatively, it updates the position to its complement when the particle is moving extremely fast and keeps the particle at its original position when the velocity is extremely slow. The velocity of the particles in the PSO algorithm represents the quality of the particles. The V-shaped transfer function is shown in Fig. 6.
According to the velocity update formula, when the particle is farther from the optimal solution, the velocity of the particle increases. Therefore, when the velocity of the particle is larger, the particle should have a greater probability of changing its position vector, that is, it should be updated to the complement of that position vector. Therefore, we selected the V-shaped transfer function as the transfer function of the proposed algorithm.
In addition, compared with the sigmoid function, the V-shaped function has more powerful local search capability, which is necessary when mining the HUIs. This is because the resulting result is not only the global optimal solution, but also more potential solutions can be obtained only when it has a stronger local search capability. This is because the more capable the local search is, the wider the search will be.
The V-shaped function used in this study is shown in Eqs. (10) and (11), where
In Eq. (1),
The ideal state in the process of mining HUIs should be to expand the search space of particles at the early stage of the search to obtain diverse particles. With more potential solutions at the early stage, it can provide more references to particles in the search, as well as enhance the ability to search for global optimal solutions and find more HUIs in subsequent stages.
Therefore, considering the impact of acceleration on the BPSO algorithm, Chen used the proposed nonlinear acceleration adjustment strategy constructed by the arc cosine function (Chen, 2017), as shown in Eqs. (12) and (13). The trends of
The designed mining model, based on BPSO, is divided into four steps. The results are shown in Fig. 8.
Pre-processing consists of generating an item binary matrix and a quantity matrix. The preprocessing step focuses on providing the required type of data and facilitating the mining process. Therefore, in this process, the database is converted into a binary matrix and a quantity matrix. We then calculate the total utility value of the database DB.
The initialization process consists of initializing the particles and obtaining pbest and gbest. In the initialization process, the particles are initialized by a binary matrix, and each particle is provided with a random initial velocity; then, the value of each particle is calculated by the fitness function to derive pbest and gbest. Furthermore, in the subsequent iteration process, the particles are updated by performing the corresponding updates for velocity, pbest, and gbest with a V-shaped transfer function and nonlinear acceleration coefficient strategy. During the iteration process, if the utility value of the particle is greater than the set threshold, the corresponding position of the particle is added to the set of utility itemsets at the time of the update. This process is repeated until the upper limit of the number of iterations is reached, and finally, the set of HUIs is obtained.
In this section, the proposed HUIM-VBPSO (v-shaped) algorithm is compared with HUIM-BPSO (s-shaped). All experiments were performed in Microsoft Windows 10, with an Intel i5 3.0 GHz CPU and 16.38 GB RAM. The experiments were conducted simultaneously using three different databases: chess, mushroom, and foodmart datasets, which are commonly used for HUIM experiments. The information on the three databases is presented in Table 3. The number of particles was set to 20, both algorithms were iterated 10,000 times, and the velocity range of the particles was set between [-10, 10].
Table 3 . Dataset.
Dataset | Number of items | Total number of transactions |
---|---|---|
Chess | 76 | 3,196 |
Mushroom | 120 | 8,124 |
Foodmart | 1,559 | 21,557 |
The percentage of discovered HUIs for HUIM-BPSO and the proposed algorithm HUIM-VBPSO are listed in Table 4. The number of HUIs mined by all algorithms for the three datasets is shown in Figs. 9-11.
Table 4 . Percentage of discovered HUIs.
Threshold (%) | HUIM-BPSO (%) | HUIM-VBPSO (%) |
---|---|---|
Chess dataset | ||
25.30 | 47.06 | 88.24 |
25.50 | 57.14 | 92.85 |
25.70 | 58.62 | 96.55 |
Mushroom dataset | ||
14.00 | 34.78 | 60.87 |
14.25 | 38.46 | 66.67 |
14.50 | 40.00 | 70.00 |
Foodmart dataset | ||
0.12 | 15.87 | 25.40 |
0.13 | 18.42 | 28.94 |
0.14 | 20.00 | 33.44 |
The experiments show that the proposed VBPSO algorithm mines more HUIs than the BPSO. According to the experimental data, the proportion of HUIs mined by the BPSO algorithm decreased when the number of items increased, which may be a limitation of the PSO algorithm.
The running times of the HUIM-BPSO and HUIM-VBPSO are listed in Table 5. The number of HUIs mined by all the algorithms for the three datasets is shown in Figs. 12-14. For the runtime experiments, the thresholds for the three datasets were set to be consistent with those for the HUI mining experiments.
Table 5 . Running time of algorithms (in seconds).
Threshold (%) | HUIM-BPSO (s) | HUIM-VBPSO (s) |
---|---|---|
Chess dataset | ||
25.30 | 1,072 | 1,475 |
25.50 | 1,089 | 1,496 |
25.70 | 1,099 | 1,483 |
Mushroom dataset | ||
14.00 | 1,850 | 2,451 |
14.25 | 1,796 | 2,475 |
14.50 | 1,882 | 2,468 |
Foodmart dataset | ||
0.12 | 20,073 | 25,753 |
0.13 | 20,186 | 25,920 |
0.14 | 20,124 | 26,968 |
The running time of the VBPSO algorithm was longer than that of the BPSO algorithm. This is because, in the early stage, the local search needs to be intensified to expand the search area. This step requires more time than the BPSO. However, the number of HUIs mined in the same unit time was greater than that of the BPSO.
High-utility itemset mining is an important issue in data mining and can replace FIM to reveal high-profit products. To mine such itemsets, this study proposes a new BPSO algorithm called VBPSO, which is an extension of the original BPSO algorithm with a V-shaped transfer function and nonlinear acceleration coefficient strategy. The proposed VBPSO algorithm exhibited better performance. Several experiments showed that although the running time of the proposed VBPSO is 32% longer, the number of mined HUIs is 73% higher compared with the original BPSO. Therefore, it can be used for practical applications that are not real time.
In addition to the PSO algorithm, other metaheuristic algorithms have been proposed in recent years. In the future, we will try to adopt these new algorithms for HUIM to determine whether they can mine more HUIs or reduce the running time.
Table 1 . Transaction database.
TID | Transaction (item, quantity) |
---|---|
T_{1} | b:13, c:9, d:12 |
T_{2} | b:5, d:6, e:8, f:15 |
T_{3} | d:15, e:13, f:13 |
T_{4} | b:14, d:7 |
T_{5} | a:6, b:10, d:15, e:10 |
T_{6} | b:6, c:6 |
T_{7} | b:2, d:14 |
T_{8} | a:2, b:6, d:9, e:10 |
T_{9} | b:5, d:13, e:12 |
T_{10} | a:10, b:15, d:11, e:4 |
Table 2 . Profit table.
Item | a | b | c | d | e | f |
---|---|---|---|---|---|---|
Profit | 2 | 9 | 8 | 10 | 5 | 7 |
Table 3 . Dataset.
Dataset | Number of items | Total number of transactions |
---|---|---|
Chess | 76 | 3,196 |
Mushroom | 120 | 8,124 |
Foodmart | 1,559 | 21,557 |
Table 4 . Percentage of discovered HUIs.
Threshold (%) | HUIM-BPSO (%) | HUIM-VBPSO (%) |
---|---|---|
Chess dataset | ||
25.30 | 47.06 | 88.24 |
25.50 | 57.14 | 92.85 |
25.70 | 58.62 | 96.55 |
Mushroom dataset | ||
14.00 | 34.78 | 60.87 |
14.25 | 38.46 | 66.67 |
14.50 | 40.00 | 70.00 |
Foodmart dataset | ||
0.12 | 15.87 | 25.40 |
0.13 | 18.42 | 28.94 |
0.14 | 20.00 | 33.44 |
Table 5 . Running time of algorithms (in seconds).
Threshold (%) | HUIM-BPSO (s) | HUIM-VBPSO (s) |
---|---|---|
Chess dataset | ||
25.30 | 1,072 | 1,475 |
25.50 | 1,089 | 1,496 |
25.70 | 1,099 | 1,483 |
Mushroom dataset | ||
14.00 | 1,850 | 2,451 |
14.25 | 1,796 | 2,475 |
14.50 | 1,882 | 2,468 |
Foodmart dataset | ||
0.12 | 20,073 | 25,753 |
0.13 | 20,186 | 25,920 |
0.14 | 20,124 | 26,968 |