Journal of information and communication convergence engineering 2023; 21(1): 82-89
Published online March 31, 2023
https://doi.org/10.56977/jicce.2023.21.1.82
© Korea Institute of Information and Communication Engineering
Correspondence to : Hyun Ahn (E-mail: hyunahn@hs.ac.kr)
Division of Software Convergence, Hanshin University, Osan 18101, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Various machine-learning models may yield high predictive power for massive time series in time series prediction. However, these models are prone to instability in terms of computational cost because of the high dimensionality of the feature space and nonoptimized hyperparameter settings. Considering the potential risk that model training with a high-dimensional feature set can be time-consuming, we evaluate a feature-importance-based feature selection method to derive a tradeoff between predictive power and computational cost for time series prediction. We used two machine-learning techniques for performance evaluation to generate prediction models from a retail sales dataset. First, we ranked the features using impurity-based and Local Interpretable Model-agnostic Explanations (LIME)-based feature importance measures in the prediction models. Then, the recursive feature elimination method was applied to eliminate unimportant features sequentially. Consequently, we obtained a subset of features that could lead to reduced model training time while preserving acceptable model performance.
Keywords Feature importance, Feature selection, LIME, Time series prediction
Over the past decade, we have witnessed a data explosion in terms of scale and variety, and advances in computing technology have kept pace with this phenomenon. In this regard, various prediction models for time series prediction that yield high predictive power by learning patterns of vast and diverse time series have been proposed [1,2].
However, as these data-intensive techniques (e.g., machine learning and deep learning models) generally require massive amounts of computation, the challenge of determining a tradeoff between performance and time cost cannot be overstated. For instance, in several state-of-the-art deep learning models, the learning process tends to require nonlinearly increasing amounts of computation as the amount of data and the size of the feature space grow.
Hence, in this study, we experimentally validated a feature selection method that guarantees acceptable model performance based on feature importance and significantly reduces the training time. To this end, prediction models were created using a retail sales dataset [3], and features were ranked by applying two types of feature importance. Subsequently, using the recursive feature elimination (RFE) process, the model accuracy and training time for each feature subset were measured while sequentially removing features with lower ranks (i.e., less important features).
The remainder of this paper is organized as follows. Section II summarizes previous studies on feature selection methods based on feature importance for time series prediction. Section III describes the retail sales dataset used in the experiment. Section IV presents the feature selection method proposed herein. Section V provides the experimental results regarding the tradeoff between model performance and training time using feature importance. Finally, the conclusions and future directions are presented in Section VI.
Sales prediction is one of the most critical issues in running retail businesses. Specifically, store operations can be optimized through short-term sales forecasting (e.g., day-ahead prediction). It is also feasible to forecast future sales trends by analyzing long-term time series data. As sales prediction is a time-series prediction problem, the raw data are in many cases multivariate rather than univariate. Therefore, computational approaches, such as machine learning techniques that can efficiently deal with multivariate data, are being actively applied [4,5].
For the Rossmann store sales dataset [3], a retail sales dataset provided by Kaggle [6], Kohli et al. [4] performed sales predictions for each drug store using linear regression and k-nearest neighbor (k-NN) regression models, obtaining mean absolute percentage errors (MAPE) of 22.1% and 31.4%, respectively. The experimental results suggest that the predictive power of the linear regression model for nonlinear problems is insufficient. The k-NN algorithm is based on feature similarity; however, it also shows limited performance because the importance of each feature is not considered sufficiently.
Weng et al. [5] performed sales prediction on the Rossmann dataset by combining long short-term memory (LSTM [1]), which is a deep learning technique appropriate for nonlinear problems, and a light gradient boosting machine (LightGBM [7]), which has the capability to avoid overfitting. However, as the feature selection technique based on feature importance was not applied in this study, the high computational cost of the LSTM model, which processes sequences of multiple time steps as inputs, is inevitable. Therefore, concerning retail sales prediction, this study focuses on feature selection methods to explore the optimal solution for the tradeoff between the computational cost and model predictive power.
Feature selection strategies are broadly classified into wrapper, filter, and embedded methods [8]. Feature importance can be used as a criterion for scoring and ranking features and is mainly used as the basis for the operation of filter methods. For some machine learning models, the feature importance of a model is naturally quantified during the model training.
In [9,10], the Gini impurity, which indicates the effectiveness of splitting nodes, was employed as the feature importance for feature selection. Furthermore, in [10,11], the authors applied a recursive feature elimination (RFE) method to the features ranked by feature importance to obtain an optimal feature subset.
In [12], a hybrid RFE method combining different feature importance measures for support vector machine, random forest, and gradient boosting machine models was proposed to ensure the robustness of feature selection. Furthermore, to go beyond basic feature importance, the dynamic feature importance (DFI) index was proposed to readjust the measured feature importance according to the redundancy between features [13].
In summary, some machine learning models (particularly tree-based ones) intrinsically provide measured feature importance, which makes filter-based feature selection easier. In contrast, obtaining feature importance from deep learning models, which are difficult to interpret, requires other explanation means, such as explainable artificial intelligence (XAI) methods.
In [14], the authors adopted Local Interpretable Model-agnostic Explanations (LIME), an XAI method, to measure feature importance for identified clusters. Similarly, in [15], three feature importance criteria (mean decrease accuracy, Shapley value, and LIME) were compared in terms of the stability of feature selection.
In this study, we evaluated the effectiveness of a feature selection method that combines impurity- and LIME-based feature importance measures, considering the computational cost and model predictive power.
This study uses the Rossmann store sales dataset [3] provided by Kaggle [6]. The dataset contains daily sales data for 1,115 drug stores in Europe. Table 1 describes the variables in the dataset.
Variables in the Rossmann dataset
Data source | Variable | Description |
---|---|---|
Store information | Store | A unique identifier for each store |
Store Type | Differentiates among 4 different store models: a, b, c, d | |
Assortment | Describes an assortment level: a = basic, b = extra, c = extended | |
Competition Distance | Distance in meters to the nearest competitor store | |
Competition OpenSince | The approximate year and month when the nearest competitor store was opened | |
Promo2 | A continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating | |
Promo2 Since | The year and calendar week when the store started participating in Promo2 | |
Promo Interval | The consecutive intervals in which Promo2 is started, naming the months in which the promotion begins anew. | |
e.g., “Feb,May,Aug,Nov” indicates that each round starts in February, May, August, November of any given year for that store | ||
Store sales data | Store | |
Day Of Week | The day of the week | |
Date | A date formatted as YYYY-MM-DD | |
Sales | The turnover for any given day (target variable) | |
Customers | The number of customers on a given day | |
Open | An indicator for whether the store was open: 0 = closed, 1 = open | |
Promo | Indicates whether a store is running a promotion on that day | |
State Holiday | Indicates a state holiday: a = public holiday, b = Easter holiday, c = Christmas, 0 = none | |
School Holiday | Indicates a school holiday: 1 = school holiday, 0 = none |
From the perspective of data quality, it is desirable to handle abnormal data using an appropriate method. We applied an interquartile range (IQR)-based outlier detection method to eliminate extreme outliers. The IQR is defined as Q3 − Q1, which represents the difference between the third quartile (Q3) and the first quartile (Q1). The upper fence (f3) and lower fence (f1) that distinguish extreme outliers from normal data are defined as follows:
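Assuming the conventional 3 × IQR multiplier for extreme outliers (the specific multiplier is not fixed by the description above), the fences take the form

$$f_3 = Q_3 + 3 \cdot \mathrm{IQR}, \qquad f_1 = Q_1 - 3 \cdot \mathrm{IQR},$$

and observations above f3 or below f1 are treated as extreme outliers and removed.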
After removing the outliers, the remaining data are rescaled through the min-max normalization to speed up and stabilize the training of the models.
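The rescaling implied here is the standard min-max transform:

$$x_{\mathrm{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$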
where x is the single value for a specific variable, xmax is the maximum value, and xmin is the minimum value. The normalized value xnorm is calculated according to the above equation.
The target variable we aim to predict is the daily retail sales, and the number of customers is an important variable required for the sales prediction. Table 2 presents the descriptive statistics for these two variables.
Descriptive statistics of the sales and number of customers
Variable | Mean | Std. Dev. | Min | Max |
---|---|---|---|---|
Sales | 6955.51 | 3104.21 | 0 | 41551 |
Customers | 762.73 | 401.23 | 0 | 7388 |
Both the daily retail sales and the number of customers have weekly and yearly seasonal components, as shown in Fig. 2. These variables tended to peak at the end of each year.
Promo2, which indicates whether a store is participating in the continuing promotion, is also an important variable for exploring significant differences in the sales data. As shown in Fig. 3, the group of stores that conduct promotions is more likely to achieve high sales.
This section describes the primary concepts of the feature selection method investigated in this study. First, two types of feature importance are briefly introduced: impurity-based and LIME-based feature importance. Subsequently, an RFE method with a threshold is described as the feature selection method.
For tree-structured models, impurity is an indicator used to evaluate the quality of splits at the nodes. In the case of a random forest, the impurity-based feature importance is measured as follows: given an element s of the set of nodes S in a tree t, i(s) and w(s) represent the impurity of s and the fraction of samples reaching s, respectively. For regression tasks, the impurity of a node is measured using a loss function; in this study, we applied the mean squared error (MSE) as the metric for i(s).
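Under this notation, a standard form of the MSE impurity of a node is

$$i(s) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2$$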
where n is the number of samples at node s, yi is the ith observation, and ȳ is the mean of the observations at the node.
Assuming that t is binary, s has two children: sl and sr. The impurity reduction at s, denoted by Δi(s), measures the impurity difference between s and its children, sl and sr. Therefore, a split at s is considered effective if the impurity reduction is significant.
Given a particular feature θ, Sθ is defined as the subset of nodes that split the data on θ. Accordingly, the feature importance of θ for an individual tree t, denoted by I(θ), is measured using the following equation. The final feature importance for the random forest is then calculated by averaging the I(θ) values of the individual trees.
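A common weighted formulation consistent with the definitions of i(s) and w(s) above (e.g., the one used in scikit-learn) is

$$\Delta i(s) = w(s)\, i(s) - w(s_l)\, i(s_l) - w(s_r)\, i(s_r), \qquad I(\theta) = \sum_{s \in S_\theta} \Delta i(s),$$

where I(θ) may additionally be normalized so that the importances of all features within a tree sum to one.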
LIME [16] is an XAI technique used to understand and explain a complicated model at the instance level using surrogate models. As LIME focuses on local explanations, an explanation model that linearly approximates the original model is created, centered on a specific instance that needs to be explained. The explanations provided by an explanation model are given as interpretable visual representations, which vary depending on the type of data (text, image, or tabular data). Therefore, LIME users, including nonexperts, can easily understand the resulting explanations regardless of the actual features used by the model. LIME attempts to derive an explanation model that replicates the behavior of the original model locally while having low model complexity.
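As formulated in [16], the explanation for an instance x is obtained by solving

$$\xi(x) = \operatorname*{arg\,min}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)$$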
where f is the original model and g is an explanation model as an element of a set of interpretable models G. x is an instance that needs to be explained. πx measures the proximity between a perturbed instance z and x.
The term 𝓛(f, g, πx) represents a locality-aware loss. It accumulates the squared errors between the output of f for z, which is used as a label, and the output of g for z′, a perturbed instance of reduced dimensionality. This term evaluates the local fidelity of g with respect to f.
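In [16], this loss is defined over a set Z of perturbed samples as

$$\mathcal{L}(f, g, \pi_x) = \sum_{z,\, z' \in Z} \pi_x(z) \left( f(z) - g(z') \right)^2.$$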
Ω(g) represents the model complexity and is measured in different ways depending on the type of g. For example, if g is a tree-structured model (e.g., decision tree), Ω(g) represents the depth of the tree, whereas for a linear model, it corresponds to the number of nonzero coefficients of the model.
As our objective model is a prediction model for retail sales, the feature importance measured by LIME is a regression coefficient in a ridge regression model, which is a default explanation model for regression tasks.
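As a concrete illustration, the following minimal sketch shows how per-instance LIME weights can be obtained with the lime library listed in Table 3; the variable names (X_train, feature_names, model) are placeholders, and the aggregation of local weights into a global ranking used in this study is not shown.

```python
from lime.lime_tabular import LimeTabularExplainer

# Placeholder names: X_train (2D numpy array), feature_names (list of str),
# model (a fitted regressor exposing a .predict method).
explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    mode="regression",  # the surrogate explanation model defaults to ridge regression
)

# Explain a single instance; the returned weights serve as local feature importance.
exp = explainer.explain_instance(X_train[0], model.predict, num_features=10)
for feature, weight in exp.as_list():
    print(f"{feature}: {weight:.4f}")
```

A global LIME-based importance score can then be obtained, for example, by averaging the absolute weights over a sample of instances.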
RFE is the simplest form of backward feature selection and iteratively removes the lowest-ranked features one at a time. Let ρt(θ) denote the rank of feature θ under feature importance type t, where t ∊ 𝒯 and 𝒯 is the set of feature importance types. The ranking score of each feature is then obtained as the marginal sum of its ranks over the individual feature importance types.
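Following this description, the aggregated ranking score of a feature can be written as

$$S(\theta) = \sum_{t \in \mathcal{T}} \rho_t(\theta),$$

so that a larger S(θ) indicates a less important feature.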
As our method aims to determine a tradeoff between predictive power and computational cost, we introduce the falling threshold β, which bounds how far the prediction accuracy is allowed to fall during the RFE process. For example, taking the predictive power of the baseline model φ(𝓕) trained on the complete feature set as the reference, feature elimination continues only while the prediction error of the reduced model does not worsen by more than the fraction β relative to the baseline.
In each iteration, RFE selects the least important feature θr from the current feature set 𝓕s, that is, the feature with the maximum ranking score S(θ). The model is then trained on 𝓕s excluding θr, and if its predictive power remains within the bound defined by β, the feature is permanently removed.
The detailed procedure is presented in Algorithm 1.
Algorithm 1: Recursive Feature Elimination with β
Input: 𝓕 (set of features), 𝒯 (set of classes of feature importance)
begin
  for … end
  while … end
  return …
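Because the loop bodies of Algorithm 1 are not reproduced above, the following sketch illustrates one way to implement RFE with the falling threshold β as described in this section. The model factory, ranking scores, and data variables are placeholders, and WMAPE (Section V) is used as the error metric.

```python
import numpy as np

def wmape(y_true, y_pred):
    """Weighted mean absolute percentage error."""
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

def rfe_with_beta(make_model, rank_score, X_train, y_train, X_val, y_val, beta):
    """Drop the feature with the largest ranking score S(theta) at each step,
    as long as the validation error stays within (1 + beta) of the baseline."""
    features = list(X_train.columns)
    baseline = make_model().fit(X_train[features], y_train)
    base_err = wmape(y_val, baseline.predict(X_val[features]))

    while len(features) > 1:
        # Least important feature = maximum aggregated ranking score.
        worst = max(features, key=lambda f: rank_score[f])
        candidate = [f for f in features if f != worst]
        model = make_model().fit(X_train[candidate], y_train)
        err = wmape(y_val, model.predict(X_val[candidate]))
        if err > base_err * (1 + beta):  # beta bound exceeded: stop and keep the feature
            break
        features = candidate
    return features
```

Here X_train and X_val are assumed to be pandas DataFrames, and rank_score maps each feature name to its score S(θ).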
To validate the feature selection method, we conducted an experiment involving ranking the features by measuring their importance and analyzing performance for each feature subset arranged by the RFE method with β. The specifications of the equipment and software libraries used in the experiments are listed in Table 3.
Experimental settings
Attribute | Description |
---|---|
GPU | NVIDIA GeForce RTX 2080 Super (8GB) |
CPU | Intel Core i9-9900 |
RAM | 32GB |
OS | Windows 10 Pro 64bit |
Python Libraries | numpy (numerical operations) |
pandas (data I/O and manipulation) | |
scikit-learn (feature engineering, model) | |
lightgbm (model) | |
bayesian-optimization (hyperparameter optimization) | |
lime (feature importance) | |
matplotlib (visualization) |
Various methods and models can be considered for retail sales prediction. As the feature selection method in this study combines impurity- and LIME-based feature importance, we use two machine learning models to which both of these feature importance methods are applicable: random forest regression (RFR) and light gradient boosting machine (LightGBM).
Random forest [17] is an ensemble machine learning model based on the bagging technique and has been widely used for classification and regression tasks. In this experiment, RFR was used, and its hyperparameters were configured as shown in Table 4.
Hyperparameter settings for RFR
Hyperparameter | Default value | Search range | Description |
---|---|---|---|
n_estimators | 100 | [50, 1000] | Number of trees |
max_depth | 50 | [30, 70] | Maximum tree depth |
min_samples_split | 2 | [2, 100] | Minimum number of samples required to split |
max_samples | 1.0 | [0.1, 1.0] | Fraction of the original dataset given to each individual tree |
LightGBM [7] is a computationally efficient gradient boosting algorithm designed to prevent overfitting. Its hyperparameters were set as listed in Table 5; a minimal instantiation of both models with these default values is sketched after the table.
Hyperparameter settings for LightGBM
Hyperparameter | Default value | Search range | Description |
---|---|---|---|
n_estimators | 200 | [50, 1000] | Number of trees |
max_depth | 50 | [30, 70] | Maximum tree depth |
num_leaves | 100 | [2, 512] | Maximum tree leaves |
subsample | 1.0 | [0.1, 1.0] | Subsample ratio of the training instance |
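For reference, a minimal instantiation of the two models with the default values from Tables 4 and 5 might look as follows; any parameter not listed in the tables is left at its library default.

```python
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

# Default values from Table 4 (RFR) and Table 5 (LightGBM).
rfr = RandomForestRegressor(
    n_estimators=100,
    max_depth=50,
    min_samples_split=2,
    max_samples=1.0,
)
lgbm = LGBMRegressor(
    n_estimators=200,
    max_depth=50,
    num_leaves=100,
    subsample=1.0,
)
```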
The feature rankings for the two models were derived based on measurements of feature importance. Tables 6 and 7 present the partial results (top-5 features). In the case of RFR, Promo was evaluated as the most important feature. In addition, DayOfWeek_1 (i.e., Monday) and Month_12 (i.e., December), derived features that model the weekly and annual cycles, were evaluated as high-ranking features.
Partial results of measuring feature importance and overall rankings for RFR
Feature | Importance (impurity) | Importance (lime) | Ranking |
---|---|---|---|
Promo | 0.1367 (2) | 829.5015 (1) | 1 |
DayOfWeek_1 | 0.0335 (5) | 225.5958 (2) | 2 |
Month_12 | 0.0223 (7) | 109.6942 (4) | 3 |
CompetitionOpenElapsedDays | 0.1306 (3) | 59.6449 (9) | 4 |
Promo2ElapsedDays | 0.0416 (4) | 47.9985 (11) | 5 |
Partial results of measuring feature importance and overall rankings for LightGBM
Feature | Importance (impurity) | Importance (lime) | Ranking |
---|---|---|---|
CompetitionDistance | 43182 (1) | 709.5813 (3) | 1 |
Promo | 13054 (4) | 886.1849 (1) | 2 |
DayOfWeek_1 | 5119 (7) | 748.4422 (2) | 3 |
CompetitionOpenElapsedDays | 33031 (2) | 98.6852 (12) | 4 |
Month_12 | 4025 (11) | 463.3038 (5) | 5 |
In contrast, the distance to the nearest competing drug store (i.e., CompetitionDistance) was evaluated as the most important feature in the LightGBM model.
Note that the ranking results of the two separately applied feature importance measures differ significantly, which supports the idea that a hybrid approach aggregating multiple feature importance measures is appropriate for feature selection.
In the performance analysis experiment, the lowest-ranking features were removed one at a time using the RFE method. In each iteration, the weighted mean absolute percentage error (WMAPE) and training time were measured to derive a tradeoff according to the falling threshold β. The β values that determine the tradeoff were set to 0.05, 0.1, and 0.2 and applied to each experiment.
The performance analysis comprises basic performance measurements of the models and additional measurements with Bayesian optimization (BO), which is widely used for model optimization. The search ranges of the hyperparameters in BO are presented in Tables 4 and 5.
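As an illustration of this step, the following sketch tunes RFR over the search ranges of Table 4 with the bayesian-optimization package from Table 3; the data variables, objective definition, and iteration counts are placeholders rather than the exact settings used in the experiments.

```python
import numpy as np
from bayes_opt import BayesianOptimization
from sklearn.ensemble import RandomForestRegressor

def wmape(y_true, y_pred):
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

# Placeholder data: X_train, y_train, X_val, y_val.
def rfr_objective(n_estimators, max_depth, min_samples_split, max_samples):
    model = RandomForestRegressor(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        min_samples_split=int(min_samples_split),
        max_samples=max_samples,
    ).fit(X_train, y_train)
    # BO maximizes the objective, so return the negative validation error.
    return -wmape(y_val, model.predict(X_val))

pbounds = {  # search ranges from Table 4
    "n_estimators": (50, 1000),
    "max_depth": (30, 70),
    "min_samples_split": (2, 100),
    "max_samples": (0.1, 1.0),
}
optimizer = BayesianOptimization(f=rfr_objective, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=5, n_iter=25)  # iteration counts are illustrative
print(optimizer.max)  # best score and corresponding hyperparameters
```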
To evaluate the accuracy of the prediction models, the WMAPE was used as a scale-independent metric.
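In its standard form, the metric is defined as

$$\mathrm{WMAPE} = \frac{\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|}{\sum_{i=1}^{n} \left| y_i \right|},$$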
where n is the number of instances, and yi and ŷi are the observation and prediction for the ith instance, respectively.
Fig. 4 shows the performance measured for RFR and the tradeoffs for each β value (5, 10, and 20%). For RFR, the training time was dramatically reduced as the number of features decreased. However, when the number of features was less than 11, the WMAPE increased steeply because features with high predictive power had been eliminated.
In contrast, LightGBM trains hundreds of times faster than RFR because it performs pruning, which significantly reduces the amount of computation; however, it generally yields higher WMAPE values than RFR. A smaller feature subset still reduces the training time, but the effect on the overall performance is negligible.
As both models are based on the ensemble approach, they naturally exhibit randomness and are sensitive to hyperparameter settings, resulting in a potentially volatile performance. In the experiment, we attempted to achieve a stable performance by applying BO to each basic model, and the results are shown in Fig. 5.
For both models, the results suggest that the degree of reduction in training time from feature elimination is inconsistent. For LightGBM, as in Fig. 4, model training completes much faster than for RFR; therefore, the reduction in training time on a smaller feature subset is negligible. In the case of RFR, although BO adds significant training time, the training time is considerably reduced as RFE proceeds until the β bound is approached, while the predictive power of the model remains at an acceptable level.
The results of the overall performance analysis are presented in Table 8. In summary, the high computational cost of RFR can be significantly reduced by RFE with an appropriate β. In particular, our feature selection method leads to a considerable reduction in training time in cases involving BO. This advantage allows us to derive tradeoffs that ensure reduced computational cost and acceptable model predictive power. In the case of LightGBM, as it is already a computationally efficient algorithm, the tradeoffs discovered by RFE are relatively less important.
Performance evaluation results
Models β | WMAPE | # Features | Training time (s) | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Base | .05 | .1 | .2 | Base | .05 | .1 | .2 | Base | .05 | .1 | .2 | |
RFR | .1023 | .1072 | .1113 | .1132 | 42 | 21 | 16 | 15 | 736 | 384 | 304 | 284 |
LightGBM | .1275 | .1332 | .1382 | .1448 | 42 | 22 | 15 | 13 | 1.89 | 1.60 | 1.53 | 1.45 |
RFR+BO | .1013 | .1063 | .1108 | .1108 | 42 | 17 | 15 | 15 | 11,919 | 4,096 | 4,578 | 4,578 |
LightGBM+BO | .1027 | .1058 | .1095 | .1220 | 42 | 20 | 16 | 13 | 76 | 93 | 111 | 79 |
Data-intensive techniques, such as machine learning and deep learning models, which generally require massive amounts of computation, have been widely applied to time series prediction problems. In this regard, to find an appropriate tradeoff between computational cost and model predictive power, the effectiveness of a feature selection method using impurity- and XAI-based feature importance was evaluated in this study. The performance analysis was benchmarked against two machine learning models: RFR and LightGBM. In the case of RFR, which has a relatively high computational cost, it was confirmed that the feature-importance-based feature selection method affects the overall model performance more positively.
In the future, we plan to study an effective feature selection method that considers deep learning models with poor interpretability and high computational cost.
This work was supported by Hanshin University Research Grant.
Hyun Ahn received his B.S., M.S., and Ph.D. degrees in computer science from Kyonggi University in 2011, 2013, and 2017, respectively. He was a research professor at the Department of Computer Engineering, Kyonggi University, from 2018 to 2021. He is currently an assistant professor in the Division of Software Convergence, Hanshin University. His research interests include process-aware information systems and data engineering.