Journal of information and communication convergence engineering 2023; 21(1): 8289
Published online March 31, 2023
https://doi.org/10.56977/jicce.2023.21.1.82
© Korea Institute of Information and Communication Engineering
Correspondence to : Hyun Ahn (Email: hyunahn@hs.ac.kr)
Division of Software Convergence, Hanshin University, Osan 18101, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution NonCommercial License (http://creativecommons.org/licenses/bync/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Various machinelearning models may yield high predictive power for massive time series for time series prediction. However, these models are prone to instability in terms of computational cost because of the high dimensionality of the feature space and nonoptimized hyperparameter settings. Considering the potential risk that model training with a highdimensional feature set can be timeconsuming, we evaluate a featureimportancebased feature selection method to derive a tradeoff between predictive power and computational cost for time series prediction. We used two machine learning techniques for performance evaluation to generate prediction models from a retail sales dataset. First, we ranked the features using impurity and Local Interpretable Modelagnostic Explanations (LIME) based feature importance measures in the prediction models. Then, the recursive feature elimination method was applied to eliminate unimportant features sequentially. Consequently, we obtained a subset of features that could lead to reduced model training time while preserving acceptable model performance.
Keywords Feature importance, Feature selection, LIME, Time series prediction
Over the past decade, we have witnessed a data explosion in terms of scale and variety, and advances in computing technology have taken such phenomena into account. In this regard, various prediction models for time series prediction that yield a high predictive power by learning patterns of vast and diverse time series have been proposed [1,2].
However, as these dataintensive techniques (e.g., machine learning and deep learning models) generally require massive amounts of computation, the challenge of determining a tradeoff between performance and time cost cannot be understated. For instance, in several stateoftheart deep learning models, as the amount of data and feature space size increase, the learning process tends to cause nonlinearly increasing amounts of computation.
Hence, in this study, we experimentally validated a feature selection method that guarantees acceptable model performance based on feature importance and significantly reduces the training time. To this end, prediction models were created using a retail sales dataset [3], and features were ranked by applying two types of feature importance. Subsequently, using the recursive feature elimination (RFE) process, the model accuracy and training time for each feature subset were measured while sequentially removing features with lower ranks (i.e., less important features).
The remainder of this paper is organized as follows. Section II summarizes previous studies on feature selection methods based on feature importance for time series prediction. Section III describes the retail sales dataset used in the experiment. Section IV presents the feature selection method proposed herein. Section V provides the experimental results regarding the tradeoff between model performance and training time using feature importance. Finally, the conclusions and future directions are presented in Section VI.
Sales prediction is one of the most critical issues in building retail businesses. Specifically, store operations can be optimized through shortterm sales forecasting (e.g., dayahead prediction). It is also feasible to forecast future sales trends by analyzing longterm time series data. As sales prediction is a timeseries prediction problem, in several cases, raw data are given as multivariate rather than univariate. Therefore, computational approaches, such as machine learning techniques that can efficiently deal with it, are being actively applied [4,5].
For the Rossmann store sales dataset [3], a retail sales dataset provided by Kaggle [6], Kohli et al. [4] performed sales predictions for each drug store using linear regression and knearest neighbor (kNN) regression models. The mean absolute percentage error (MAPE) was 22.1 and 31.4%, respectively. The experimental results suggest that the predictive power of the linear regression model for nonlinear problems is insufficient. The kNN algorithm is based on feature similarity; however, it also shows limited performance because the importance of each feature is not considered sufficiently.
Weng et al. [5] performed sales prediction on the Rossmann dataset by combining long shortterm memory (LSTM [1]), which is a deep learning technique appropriate for nonlinear problems, and a light gradient boosting machine (LightGBM [7]), which has the capability to avoid overfitting. However, as the feature selection technique based on feature importance was not applied in this study, the high computational cost of the LSTM model, which processes sequences of multiple time steps as inputs, is inevitable. Therefore, concerning retail sales prediction, this study focuses on feature selection methods to explore the optimal solution for the tradeoff between the computational cost and model predictive power.
Feature selection strategies are broadly classified into wrapper, filter, and embedded methods [8]. Feature importance can be used as a criterion for scoring and ranking features and is mainly used as the basis for the operation of filter methods. For some machine learning models, the feature importance of a model is naturally quantified during the model training.
In [9,10], the Gini impurity, that indicates the effectiveness of splitting nodes, was employed as the feature importance for feature selection. Furthermore, in [10,11], the authors applied an recursive feature elimination (RFE) method to the features ranked by feature importance to obtain an optimal feature subset.
In [12], a hybrid RFE method combining different feature importance measures for support vector machine, random forest, and gradient boosting machine models was proposed to ensure the robustness of feature selection. Furthermore, to go beyond basic feature importance, the dynamic feature importance (DFI) index was proposed to readjust the measured feature importance according to the redundancy between features [13].
Taken together, a part of the machine learning models (particularly treebased) intrinsically provides the measured feature importance, which can make filterbased feature selection easier. On the other hand, to obtain feature importance from deep learning models with difficulty of interpretation, it is necessary to apply other explanation means, such as explainable artificial intelligence (XAI) methods.
In [14], the authors adopted localinterpretable model explanations (LIME), an XAI method, to measure feature importance for identified clusters. Similarly, in [15], three feature importance criteria (mean decrease accuracy, Shapley value, and LIME) were compared in terms of the stability of feature selection.
In this study, we evaluated the effectiveness of a feature selection method that combines impurity and LIMEbased feature importance measures, considering the computational cost and model predictive power.
This study uses the Rossmann store sales dataset [3] provided by Kaggle [6]. The dataset contains daily sales data for 1,115 drug stores in Europe. Table 1 represents the description of the dataset.
Variables in the Rossmann dataset
Data source  Variable  Description 

Store information  Store  A unique identifier for each store 
Store Type  Differentiates among 4 different store models: a, b, c, d  
Assortment  Describes an assortment level: a = basic, b = extra, c = extended  
Competition Distance  Distance in meters to the nearest competitor store  
Competition OpenSince  The approximated year and month of the time the nearest competitor was opened  
Promo2  A continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating  
Promo2 Since  The year and calendar week when the store started participating in Promo2  
Promo Interval  The consecutive intervals Promo2 is started, naming the months the promotion is started anew.  
e.g., “Feb,May,Aug,Nov” indicates that each round starts in February, May, August, November of any given year for that store  
Store sales data  Store  
Day Of Week  The day of the week  
Data  A date formatted as YYYYMMDD  
Sales  The turnover for any given day (target variable)  
Customers  The number of customers on a given day  
Open  An indicator for whether the store was open: 0 = closed, 1 = open  
Promo  Indicates whether a store is running a promotion on that day  
State Holiday  Indicates a state holiday: a = public holiday, b = Easter holiday, c = Christmas, 0 = none  
School Holiday  Indicates a school holiday: 1 = school holiday, 0 = none 
From the perspective of data quality, it is desirable to handle abnormal data using an appropriate method. We applied an interquartile range (IQR)based outlier detection method to eliminate extreme outliers. The IQR is defined as Q3 − Q1, which represents the difference between the third quartile (Q3) and the first quartile (Q1). The upper fence (f_{3}) and lower fence (f_{1}) that distinguish extreme outliers from normal data are defined as follows:
After removing the outliers, the remaining data are rescaled through the minmax normalization to speed up and stabilize the training of the models.
where x is the single value for a specific variable, x_{max} is the maximum value, and x_{min} is the minimum value. The normalized value x_{norm} is calculated according to the above equation.
The target variable we aim to predict is the daily retail sales, and the number of customers is an important variable required for the sales prediction. Table 2 presents the descriptive statistics for these two variables.
Descriptive statistics of the sales and number of customers
Variable  Mean  Std. Dev.  Min  Max 

Sales  6955.51  3104.21  0  41551 
Customers  762.73  401.23  0  7388 
Both the daily retail sales and the number of customers have weekly and yearly seasonal components, as shown in Fig. 2. These variables tended to peak at the end of each year.
Promo2, which indicates whether a store is participating in the promotion, is also an important variable to explore the significant difference from the sales data. As shown in Fig. 3, the group of stores that conduct promotions is more likely to achieve high sales.
This section describes the primary concepts of the feature selection method investigated in this study. First, two types of feature importance are briefly introduced: impuritybased and LIMEbased feature importance. Subsequently, an RFE method with a threshold is described as the feature selection method.
For treestructured models, impurity is an indicator used to evaluate the quality of splits on the nodes. In the case of random forest, the impuritybased feature importance is measured according to the following: Given an element s of a set of nodes S in a tree t, i(s) and w(s) represent the impurity and fraction of a number of samples of s, respectively. For regression tasks, the impurity of a node was measured using a loss function. In this study, we applied the mean squared error (MSE) as the metric for i(s).
where n is the number of samples, y_{i} is the ith observation, and
Assuming that t is binary, s has two children:s_{l} ands_{r}. The impurity reduction at s, denoted by Δi(s), measures the impurity differences between s and its children,s_{l} ands_{r}. Therefore, a split at s is considered effective if the impurity reduction occurs to a significant extent.
Given a particular feature θ, S^{θ} is defined as a subset of nodes that split the data on θ. Accordingly, the feature importance of θ for an individual tree t, denoted by I(θ ), is measured using the following equation. Further, the final feature importance for a random forest is calculated by averaging the I(θ ) values of the individual trees.
LIME [16] is an XAI technique used to understand and explain a complicated model for the instance level using surrogate models. As LIME focuses on local explanations, an explanation model that approximates the original model linearly is created, centered on a specific instance that needs to be explained. Explanations provided by an explanation model are referred to as interpretable visual representations, which vary depending on the type of data (text, image, and tabular data). Therefore, LIME users, including nonexperts, can easily understand the resultant explanations regardless of the actual features used by the model. LIME attempts to derive an explanation model that replicates the behaviors of the original model locally while having low model complexity.
where f is the original model and g is an explanation model as an element of a set of interpretable models G. x is an instance that needs to be explained. π_{x} measures the proximity between a perturbed instance z and x.
The term (f, g, π_{x}) represents a localityaware loss. It accumulates squared errors between the output of f for z, which is used as a label, and the output of g for z', which is a perturbed instance of reduced dimensionality. This term evaluates the local fidelity of g with respect to f.
Ω(g) represents the model complexity and is measured in different ways depending on the type of g. For example, if g is a treestructured model (e.g., decision tree), Ω(g) represents the depth of the tree, whereas for a linear model, it corresponds to the number of nonzero coefficients of the model.
As our objective model is a prediction model for retail sales, the feature importance measured by LIME is a regression coefficient in a ridge regression model, which is a default explanation model for regression tasks.
RFE is the simplest form of the backward feature selection method. This method iteratively removes features ranked by importance one at a time. Let ρ(θ_{t}) be a ranking of θ by t, where t ∊ 𝒯 is the type of feature importance. Then, the ranking score of features and the marginal sum of the ranking by the feature importance of individual features are obtained.
As our method aims to determine a tradeoff between the predictive power and computational cost, we introduce the falling threshold β, limiting the lower prediction accuracy bound in the RFE method. For example, assuming that the predictive power of the baseline model φ(𝓕) using the complete set of features is
For each iteration, RFE selects the least important feature θr from the current feature set Fs. In other words, the model was trained on the feature set 𝓕^{s}, except for the feature with the maximum of S(θ ). If the predictive power was better than or equal to beta, the feature was removed.
The detailed procedure of determining image size
Algorithm 1: Recursive Feature Elimination with β  

Input  𝓕: Set offeatures 
𝒯: Set of classes of feature importance  
begin  
for 

end  
while 

end  
return 
To validate the feature selection method, we conducted an experiment involving ranking the features by measuring their importance and analyzing performance for each feature subset arranged by the RFE method with β. The specifications of the equipment and software libraries used in the experiments are listed in Table 3.
Experimental settings
Attribute  Description 

GPU  NVIDIA GeForce 2080 RTX Super (8GB) 
CPU  Intel Core i99900 
RAM  32GB 
OS  Windows 10 Pro 64bit 
Python Libraries  numpy (numerical operations) 
pandas (data I/O and manipulation)  
scikitlearn (feature engineering, model)  
lightgbm (model)  
bayesianoptimization (hyperparameter optimization)  
lime (feature importance)  
matplotlib (visualization) 
Various methods and models can be considered for retail sales prediction. As the feature selection method in this study combines impurity and LIMEbased feature importance, we use two machine learning models to which both of these feature importance methods are applicable: random forest regression (RFR) and light gradient boosting machine (LightGBM).
Random forest [17] is an ensemble machine learning model based on the bagging technique and has been widely used for classification and regression tasks. In this experiment, RFR was used, and its hyperparameters were configured as shown in Table 4.
Hyperparameter settings for RFR
Hyperparameter  Default value  Search range  Description 

n_estimators  100  [50, 1000]  Number of trees 
max_depth  50  [30, 70]  Maximum tree depth 
min_samples_split  2  [2, 100]  Minimum number of samples required to split 
max_samples  1.0  [0.1, 1.0]  Fraction of the original dataset is given to any individual tree 
LightGBM [7] is a computationally efficient gradient boosting algorithm that prevents overfitting. The hyperparameters of LightGBM were set as listed in Table 5.
Hyperparameter settings for LightGBM
Hyperparameter  Default value  Search range  Description 

n_estimators  200  [50, 1000]  Number of trees 
max_depth  50  [30, 70]  Maximum tree depth 
num_leaves  100  [2, 512]  Maximum tree leaves 
subsample  1.0  [0.1, 1.0]  Subsample ratio of the training instance 
The feature rankings for the two models were derived based on measurements of feature importance. Table 6 and 7 present the partial results (top5 features). In the case of RFR, Promo was evaluated as the most important feature. In addition, DayOfWeek_1 (i.e., Monday) and Month_12 (i.e., December), the derived features for modeling weekly and annual cycles, were evaluated as highranking features.
Partial results of measuring feature importance and overall rankings for RFR
Feature  Importance (impurity)  Importance (lime)  Ranking 

Promo  0.1367 (2)  829.5015 (1)  1 
DayOfWeek_1  0.0335 (5)  225.5958 (2)  2 
Month_12  0.0223 (7)  109.6942 (4)  3 
CompetitionOpenElapsedDays  0.1306 (3)  59.6449 (9)  4 
Promo2ElapsedDays  0.0416 (4)  47.9985 (11)  5 
Partial results of measuring feature importance and overall rankings for LightGBM
Feature  Importance (impurity)  Importance (lime)  Ranking 

CompetitionDistance  43182 (1)  709.5813 (3)  1 
Promo  13054 (4)  886.1849 (1)  2 
DayOfWeek_1  5119 (7)  748.4422 (2)  3 
CompetitionOpenElapsedDays  33031 (2)  98.6852 (12)  4 
Month_12  4025 (11)  463.3038 (5)  5 
However, the distance from the competing drug store (i.e., CompetitionDistance) is considered the most important feature of the LightGBM model.
Note that the ranking results of the two separately applied feature importance methods differ significantly, which supports the idea that the hybrid approach aggregating multiple feature importance is appropriate as a feature selection method.
In the performance analysis experiment, the lowestranking features were removed individually using the RFE method. For each iteration, the weighted mean absolute percentage error (WMAPE) value and training time were measured to derive a tradeoff according to the falling threshold β. Therefore, the β values that determine the tradeoff were divided into 0.05, 0.1, and 0.2 and applied to each experiment.
The performance analysis involves the basic performance measurements of models and additional performance measurements with Bayesian optimization (BO), which is widely used in model optimization. The search ranges of the hyperparameters in BO are presented in Tables 4 and 5.
To evaluate the accuracy of the prediction models, the WMAPE was used as a scaleindependent metric.
where n is the number of instances, and y_{i} and are the observation and prediction in the ith instance, respectively.
Fig. 4 shows the performance measured by the RFR and the tradeoffs for each β value (5, 10, and 20%). In RFR, the training time was dramatically reduced as the number of features decreased. In contrast, when the number of features was less than 11, the WMAPE increased steeply because features of high predictive power were eliminated.
On the other hand, LightGBM trains the model hundreds of times faster than RFR because it performs pruning, which significantly reduces the amount of computation. However, in general, higher WMAPE values than those of RFR are obtained. A small feature subset leads to a reduction in training time. However, the effect on the overall performance is trivial.
As both models are based on the ensemble approach, they naturally exhibit randomness and are sensitive to hyperparameter settings, resulting in a potentially volatile performance. In the experiment, we attempted to achieve a stable performance by applying BO to each basic model, and the results are shown in Fig. 5.
For both models, the results suggest that the degree of reduction in training time by feature elimination is inconsistent. In LightGBM, similar to Fig. 4, model training is completed much faster than in RFR. Therefore, the reduction in training time on a smaller feature subset is negligible. In the case of RFR, although significant training time is required owing to BO, the training time is considerably reduced while RFE is performed until near the beta value, whereas the predictive power of the model remains at an acceptable level.
The results of the overall performance analysis are presented in Table 8. In summary, the high computational cost of the RFR can be significantly reduced by RFE with an appropriate beta. In particular, our feature selection method leads to a considerable reduction in training time in cases involving BO. This advantage allows us to derive tradeoffs that ensure reduced computational cost and acceptable model predictive power. In the case of LightGBM, as it is a computationally efficient algorithm, the validity of the tradeoffs discovered by the RFE is evaluated to be relatively less important.
Performance evaluation results
Models β  WMAPE  # Features  Training time (s)  

Base  .05  .1  .2  Base  .05  .1  .2  Base  .05  .1  .2  
RFR  .1023  .1072  .1113  .1132  42  21  16  15  736  384  304  284 
LightGBM  .1275  .1332  .1382  .1448  42  22  15  13  1.89  1.60  1.53  1.45 
RFR+BO  .1013  .1063  .1108  .1108  42  17  15  15  11,919  4,096  4,578  4,578 
LightGBM+BO  .1027  .1058  .1095  .1220  42  20  16  13  76  93  111  79 
Dataintensive techniques, such as machine learning and deep learning models, which generally require massive amounts of computations, have been widely applied for time series prediction problems. In this regard, to find an appropriate tradeoff between computational cost and model predictive power, the effectiveness of the feature selection method using impurity and XAIbased feature importance was evaluated in this study. The performance analysis was benchmarked against two machine learning models: RFR and LightGBM. In the case of RFR, which has a relatively high computational cost, it was confirmed that the featureimportance based feature selection method affects the overall model performance more positively.
In the future, we plan to study an effective feature selection method that considers deep learning models with poor interpretability and high computational cost.
This work was supported by Hanshin University Research Grant.
received his B.S., M.S., and Ph.D. degrees in computer science from Kyonggi University in 2011, 2013, and 2017, respectively. He was a research professor at the Department of Computer Engineering, Kyonggi University from 2018 to 2021. He is an assistant professor at the Division of Software Convergence, Hanshin University. His research interests include processaware information systems and data engineering.
Journal of information and communication convergence engineering 2023; 21(1): 8289
Published online March 31, 2023 https://doi.org/10.56977/jicce.2023.21.1.82
Copyright © Korea Institute of Information and Communication Engineering.
^{1}Division of Software Convergence, Hanshin University, 18101, Republic of Korea
Correspondence to:Hyun Ahn (Email: hyunahn@hs.ac.kr)
Division of Software Convergence, Hanshin University, Osan 18101, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution NonCommercial License (http://creativecommons.org/licenses/bync/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Various machinelearning models may yield high predictive power for massive time series for time series prediction. However, these models are prone to instability in terms of computational cost because of the high dimensionality of the feature space and nonoptimized hyperparameter settings. Considering the potential risk that model training with a highdimensional feature set can be timeconsuming, we evaluate a featureimportancebased feature selection method to derive a tradeoff between predictive power and computational cost for time series prediction. We used two machine learning techniques for performance evaluation to generate prediction models from a retail sales dataset. First, we ranked the features using impurity and Local Interpretable Modelagnostic Explanations (LIME) based feature importance measures in the prediction models. Then, the recursive feature elimination method was applied to eliminate unimportant features sequentially. Consequently, we obtained a subset of features that could lead to reduced model training time while preserving acceptable model performance.
Keywords: Feature importance, Feature selection, LIME, Time series prediction
Over the past decade, we have witnessed a data explosion in terms of scale and variety, and advances in computing technology have taken such phenomena into account. In this regard, various prediction models for time series prediction that yield a high predictive power by learning patterns of vast and diverse time series have been proposed [1,2].
However, as these dataintensive techniques (e.g., machine learning and deep learning models) generally require massive amounts of computation, the challenge of determining a tradeoff between performance and time cost cannot be understated. For instance, in several stateoftheart deep learning models, as the amount of data and feature space size increase, the learning process tends to cause nonlinearly increasing amounts of computation.
Hence, in this study, we experimentally validated a feature selection method that guarantees acceptable model performance based on feature importance and significantly reduces the training time. To this end, prediction models were created using a retail sales dataset [3], and features were ranked by applying two types of feature importance. Subsequently, using the recursive feature elimination (RFE) process, the model accuracy and training time for each feature subset were measured while sequentially removing features with lower ranks (i.e., less important features).
The remainder of this paper is organized as follows. Section II summarizes previous studies on feature selection methods based on feature importance for time series prediction. Section III describes the retail sales dataset used in the experiment. Section IV presents the feature selection method proposed herein. Section V provides the experimental results regarding the tradeoff between model performance and training time using feature importance. Finally, the conclusions and future directions are presented in Section VI.
Sales prediction is one of the most critical issues in building retail businesses. Specifically, store operations can be optimized through shortterm sales forecasting (e.g., dayahead prediction). It is also feasible to forecast future sales trends by analyzing longterm time series data. As sales prediction is a timeseries prediction problem, in several cases, raw data are given as multivariate rather than univariate. Therefore, computational approaches, such as machine learning techniques that can efficiently deal with it, are being actively applied [4,5].
For the Rossmann store sales dataset [3], a retail sales dataset provided by Kaggle [6], Kohli et al. [4] performed sales predictions for each drug store using linear regression and knearest neighbor (kNN) regression models. The mean absolute percentage error (MAPE) was 22.1 and 31.4%, respectively. The experimental results suggest that the predictive power of the linear regression model for nonlinear problems is insufficient. The kNN algorithm is based on feature similarity; however, it also shows limited performance because the importance of each feature is not considered sufficiently.
Weng et al. [5] performed sales prediction on the Rossmann dataset by combining long shortterm memory (LSTM [1]), which is a deep learning technique appropriate for nonlinear problems, and a light gradient boosting machine (LightGBM [7]), which has the capability to avoid overfitting. However, as the feature selection technique based on feature importance was not applied in this study, the high computational cost of the LSTM model, which processes sequences of multiple time steps as inputs, is inevitable. Therefore, concerning retail sales prediction, this study focuses on feature selection methods to explore the optimal solution for the tradeoff between the computational cost and model predictive power.
Feature selection strategies are broadly classified into wrapper, filter, and embedded methods [8]. Feature importance can be used as a criterion for scoring and ranking features and is mainly used as the basis for the operation of filter methods. For some machine learning models, the feature importance of a model is naturally quantified during the model training.
In [9,10], the Gini impurity, that indicates the effectiveness of splitting nodes, was employed as the feature importance for feature selection. Furthermore, in [10,11], the authors applied an recursive feature elimination (RFE) method to the features ranked by feature importance to obtain an optimal feature subset.
In [12], a hybrid RFE method combining different feature importance measures for support vector machine, random forest, and gradient boosting machine models was proposed to ensure the robustness of feature selection. Furthermore, to go beyond basic feature importance, the dynamic feature importance (DFI) index was proposed to readjust the measured feature importance according to the redundancy between features [13].
Taken together, a part of the machine learning models (particularly treebased) intrinsically provides the measured feature importance, which can make filterbased feature selection easier. On the other hand, to obtain feature importance from deep learning models with difficulty of interpretation, it is necessary to apply other explanation means, such as explainable artificial intelligence (XAI) methods.
In [14], the authors adopted localinterpretable model explanations (LIME), an XAI method, to measure feature importance for identified clusters. Similarly, in [15], three feature importance criteria (mean decrease accuracy, Shapley value, and LIME) were compared in terms of the stability of feature selection.
In this study, we evaluated the effectiveness of a feature selection method that combines impurity and LIMEbased feature importance measures, considering the computational cost and model predictive power.
This study uses the Rossmann store sales dataset [3] provided by Kaggle [6]. The dataset contains daily sales data for 1,115 drug stores in Europe. Table 1 represents the description of the dataset.
Variables in the Rossmann dataset.
Data source  Variable  Description 

Store information  Store  A unique identifier for each store 
Store Type  Differentiates among 4 different store models: a, b, c, d  
Assortment  Describes an assortment level: a = basic, b = extra, c = extended  
Competition Distance  Distance in meters to the nearest competitor store  
Competition OpenSince  The approximated year and month of the time the nearest competitor was opened  
Promo2  A continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating  
Promo2 Since  The year and calendar week when the store started participating in Promo2  
Promo Interval  The consecutive intervals Promo2 is started, naming the months the promotion is started anew.  
e.g., “Feb,May,Aug,Nov” indicates that each round starts in February, May, August, November of any given year for that store  
Store sales data  Store  
Day Of Week  The day of the week  
Data  A date formatted as YYYYMMDD  
Sales  The turnover for any given day (target variable)  
Customers  The number of customers on a given day  
Open  An indicator for whether the store was open: 0 = closed, 1 = open  
Promo  Indicates whether a store is running a promotion on that day  
State Holiday  Indicates a state holiday: a = public holiday, b = Easter holiday, c = Christmas, 0 = none  
School Holiday  Indicates a school holiday: 1 = school holiday, 0 = none 
From the perspective of data quality, it is desirable to handle abnormal data using an appropriate method. We applied an interquartile range (IQR)based outlier detection method to eliminate extreme outliers. The IQR is defined as Q3 − Q1, which represents the difference between the third quartile (Q3) and the first quartile (Q1). The upper fence (f_{3}) and lower fence (f_{1}) that distinguish extreme outliers from normal data are defined as follows:
After removing the outliers, the remaining data are rescaled through the minmax normalization to speed up and stabilize the training of the models.
where x is the single value for a specific variable, x_{max} is the maximum value, and x_{min} is the minimum value. The normalized value x_{norm} is calculated according to the above equation.
The target variable we aim to predict is the daily retail sales, and the number of customers is an important variable required for the sales prediction. Table 2 presents the descriptive statistics for these two variables.
Descriptive statistics of the sales and number of customers.
Variable  Mean  Std. Dev.  Min  Max 

Sales  6955.51  3104.21  0  41551 
Customers  762.73  401.23  0  7388 
Both the daily retail sales and the number of customers have weekly and yearly seasonal components, as shown in Fig. 2. These variables tended to peak at the end of each year.
Promo2, which indicates whether a store is participating in the promotion, is also an important variable to explore the significant difference from the sales data. As shown in Fig. 3, the group of stores that conduct promotions is more likely to achieve high sales.
This section describes the primary concepts of the feature selection method investigated in this study. First, two types of feature importance are briefly introduced: impuritybased and LIMEbased feature importance. Subsequently, an RFE method with a threshold is described as the feature selection method.
For treestructured models, impurity is an indicator used to evaluate the quality of splits on the nodes. In the case of random forest, the impuritybased feature importance is measured according to the following: Given an element s of a set of nodes S in a tree t, i(s) and w(s) represent the impurity and fraction of a number of samples of s, respectively. For regression tasks, the impurity of a node was measured using a loss function. In this study, we applied the mean squared error (MSE) as the metric for i(s).
where n is the number of samples, y_{i} is the ith observation, and
Assuming that t is binary, s has two children:s_{l} ands_{r}. The impurity reduction at s, denoted by Δi(s), measures the impurity differences between s and its children,s_{l} ands_{r}. Therefore, a split at s is considered effective if the impurity reduction occurs to a significant extent.
Given a particular feature θ, S^{θ} is defined as a subset of nodes that split the data on θ. Accordingly, the feature importance of θ for an individual tree t, denoted by I(θ ), is measured using the following equation. Further, the final feature importance for a random forest is calculated by averaging the I(θ ) values of the individual trees.
LIME [16] is an XAI technique used to understand and explain a complicated model for the instance level using surrogate models. As LIME focuses on local explanations, an explanation model that approximates the original model linearly is created, centered on a specific instance that needs to be explained. Explanations provided by an explanation model are referred to as interpretable visual representations, which vary depending on the type of data (text, image, and tabular data). Therefore, LIME users, including nonexperts, can easily understand the resultant explanations regardless of the actual features used by the model. LIME attempts to derive an explanation model that replicates the behaviors of the original model locally while having low model complexity.
where f is the original model and g is an explanation model as an element of a set of interpretable models G. x is an instance that needs to be explained. π_{x} measures the proximity between a perturbed instance z and x.
The term (f, g, π_{x}) represents a localityaware loss. It accumulates squared errors between the output of f for z, which is used as a label, and the output of g for z', which is a perturbed instance of reduced dimensionality. This term evaluates the local fidelity of g with respect to f.
Ω(g) represents the model complexity and is measured in different ways depending on the type of g. For example, if g is a treestructured model (e.g., decision tree), Ω(g) represents the depth of the tree, whereas for a linear model, it corresponds to the number of nonzero coefficients of the model.
As our objective model is a prediction model for retail sales, the feature importance measured by LIME is a regression coefficient in a ridge regression model, which is a default explanation model for regression tasks.
RFE is the simplest form of the backward feature selection method. This method iteratively removes features ranked by importance one at a time. Let ρ(θ_{t}) be a ranking of θ by t, where t ∊ 𝒯 is the type of feature importance. Then, the ranking score of features and the marginal sum of the ranking by the feature importance of individual features are obtained.
As our method aims to determine a tradeoff between the predictive power and computational cost, we introduce the falling threshold β, limiting the lower prediction accuracy bound in the RFE method. For example, assuming that the predictive power of the baseline model φ(𝓕) using the complete set of features is
For each iteration, RFE selects the least important feature θr from the current feature set Fs. In other words, the model was trained on the feature set 𝓕^{s}, except for the feature with the maximum of S(θ ). If the predictive power was better than or equal to beta, the feature was removed.
The detailed procedure of determining image size
Algorithm 1: Recursive Feature Elimination with β  

Input  𝓕: Set offeatures 
𝒯: Set of classes of feature importance  
begin  
for 

end  
while 

end  
return 
To validate the feature selection method, we conducted an experiment involving ranking the features by measuring their importance and analyzing performance for each feature subset arranged by the RFE method with β. The specifications of the equipment and software libraries used in the experiments are listed in Table 3.
Experimental settings.
Attribute  Description 

GPU  NVIDIA GeForce 2080 RTX Super (8GB) 
CPU  Intel Core i99900 
RAM  32GB 
OS  Windows 10 Pro 64bit 
Python Libraries  numpy (numerical operations) 
pandas (data I/O and manipulation)  
scikitlearn (feature engineering, model)  
lightgbm (model)  
bayesianoptimization (hyperparameter optimization)  
lime (feature importance)  
matplotlib (visualization) 
Various methods and models can be considered for retail sales prediction. As the feature selection method in this study combines impurity and LIMEbased feature importance, we use two machine learning models to which both of these feature importance methods are applicable: random forest regression (RFR) and light gradient boosting machine (LightGBM).
Random forest [17] is an ensemble machine learning model based on the bagging technique and has been widely used for classification and regression tasks. In this experiment, RFR was used, and its hyperparameters were configured as shown in Table 4.
Hyperparameter settings for RFR.
Hyperparameter  Default value  Search range  Description 

n_estimators  100  [50, 1000]  Number of trees 
max_depth  50  [30, 70]  Maximum tree depth 
min_samples_split  2  [2, 100]  Minimum number of samples required to split 
max_samples  1.0  [0.1, 1.0]  Fraction of the original dataset is given to any individual tree 
LightGBM [7] is a computationally efficient gradient boosting algorithm that prevents overfitting. The hyperparameters of LightGBM were set as listed in Table 5.
Hyperparameter settings for LightGBM.
Hyperparameter  Default value  Search range  Description 

n_estimators  200  [50, 1000]  Number of trees 
max_depth  50  [30, 70]  Maximum tree depth 
num_leaves  100  [2, 512]  Maximum tree leaves 
subsample  1.0  [0.1, 1.0]  Subsample ratio of the training instance 
The feature rankings for the two models were derived based on measurements of feature importance. Table 6 and 7 present the partial results (top5 features). In the case of RFR, Promo was evaluated as the most important feature. In addition, DayOfWeek_1 (i.e., Monday) and Month_12 (i.e., December), the derived features for modeling weekly and annual cycles, were evaluated as highranking features.
Partial results of measuring feature importance and overall rankings for RFR.
Feature  Importance (impurity)  Importance (lime)  Ranking 

Promo  0.1367 (2)  829.5015 (1)  1 
DayOfWeek_1  0.0335 (5)  225.5958 (2)  2 
Month_12  0.0223 (7)  109.6942 (4)  3 
CompetitionOpenElapsedDays  0.1306 (3)  59.6449 (9)  4 
Promo2ElapsedDays  0.0416 (4)  47.9985 (11)  5 
Partial results of measuring feature importance and overall rankings for LightGBM.
Feature  Importance (impurity)  Importance (lime)  Ranking 

CompetitionDistance  43182 (1)  709.5813 (3)  1 
Promo  13054 (4)  886.1849 (1)  2 
DayOfWeek_1  5119 (7)  748.4422 (2)  3 
CompetitionOpenElapsedDays  33031 (2)  98.6852 (12)  4 
Month_12  4025 (11)  463.3038 (5)  5 
However, the distance from the competing drug store (i.e., CompetitionDistance) is considered the most important feature of the LightGBM model.
Note that the ranking results of the two separately applied feature importance methods differ significantly, which supports the idea that the hybrid approach aggregating multiple feature importance is appropriate as a feature selection method.
In the performance analysis experiment, the lowestranking features were removed individually using the RFE method. For each iteration, the weighted mean absolute percentage error (WMAPE) value and training time were measured to derive a tradeoff according to the falling threshold β. Therefore, the β values that determine the tradeoff were divided into 0.05, 0.1, and 0.2 and applied to each experiment.
The performance analysis involves the basic performance measurements of models and additional performance measurements with Bayesian optimization (BO), which is widely used in model optimization. The search ranges of the hyperparameters in BO are presented in Tables 4 and 5.
To evaluate the accuracy of the prediction models, the WMAPE was used as a scaleindependent metric.
where n is the number of instances, and y_{i} and are the observation and prediction in the ith instance, respectively.
Fig. 4 shows the performance measured by the RFR and the tradeoffs for each β value (5, 10, and 20%). In RFR, the training time was dramatically reduced as the number of features decreased. In contrast, when the number of features was less than 11, the WMAPE increased steeply because features of high predictive power were eliminated.
On the other hand, LightGBM trains the model hundreds of times faster than RFR because it performs pruning, which significantly reduces the amount of computation. However, in general, higher WMAPE values than those of RFR are obtained. A small feature subset leads to a reduction in training time. However, the effect on the overall performance is trivial.
As both models are based on the ensemble approach, they naturally exhibit randomness and are sensitive to hyperparameter settings, resulting in a potentially volatile performance. In the experiment, we attempted to achieve a stable performance by applying BO to each basic model, and the results are shown in Fig. 5.
For both models, the results suggest that the degree of reduction in training time by feature elimination is inconsistent. In LightGBM, similar to Fig. 4, model training is completed much faster than in RFR. Therefore, the reduction in training time on a smaller feature subset is negligible. In the case of RFR, although significant training time is required owing to BO, the training time is considerably reduced while RFE is performed until near the beta value, whereas the predictive power of the model remains at an acceptable level.
The results of the overall performance analysis are presented in Table 8. In summary, the high computational cost of the RFR can be significantly reduced by RFE with an appropriate beta. In particular, our feature selection method leads to a considerable reduction in training time in cases involving BO. This advantage allows us to derive tradeoffs that ensure reduced computational cost and acceptable model predictive power. In the case of LightGBM, as it is a computationally efficient algorithm, the validity of the tradeoffs discovered by the RFE is evaluated to be relatively less important.
Performance evaluation results.
Models β  WMAPE  # Features  Training time (s)  

Base  .05  .1  .2  Base  .05  .1  .2  Base  .05  .1  .2  
RFR  .1023  .1072  .1113  .1132  42  21  16  15  736  384  304  284 
LightGBM  .1275  .1332  .1382  .1448  42  22  15  13  1.89  1.60  1.53  1.45 
RFR+BO  .1013  .1063  .1108  .1108  42  17  15  15  11,919  4,096  4,578  4,578 
LightGBM+BO  .1027  .1058  .1095  .1220  42  20  16  13  76  93  111  79 
Dataintensive techniques, such as machine learning and deep learning models, which generally require massive amounts of computations, have been widely applied for time series prediction problems. In this regard, to find an appropriate tradeoff between computational cost and model predictive power, the effectiveness of the feature selection method using impurity and XAIbased feature importance was evaluated in this study. The performance analysis was benchmarked against two machine learning models: RFR and LightGBM. In the case of RFR, which has a relatively high computational cost, it was confirmed that the featureimportance based feature selection method affects the overall model performance more positively.
In the future, we plan to study an effective feature selection method that considers deep learning models with poor interpretability and high computational cost.
This work was supported by Hanshin University Research Grant.
Variables in the Rossmann dataset.
Data source  Variable  Description 

Store information  Store  A unique identifier for each store 
Store Type  Differentiates among 4 different store models: a, b, c, d  
Assortment  Describes an assortment level: a = basic, b = extra, c = extended  
Competition Distance  Distance in meters to the nearest competitor store  
Competition OpenSince  The approximated year and month of the time the nearest competitor was opened  
Promo2  A continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating  
Promo2 Since  The year and calendar week when the store started participating in Promo2  
Promo Interval  The consecutive intervals Promo2 is started, naming the months the promotion is started anew.  
e.g., “Feb,May,Aug,Nov” indicates that each round starts in February, May, August, November of any given year for that store  
Store sales data  Store  
Day Of Week  The day of the week  
Data  A date formatted as YYYYMMDD  
Sales  The turnover for any given day (target variable)  
Customers  The number of customers on a given day  
Open  An indicator for whether the store was open: 0 = closed, 1 = open  
Promo  Indicates whether a store is running a promotion on that day  
State Holiday  Indicates a state holiday: a = public holiday, b = Easter holiday, c = Christmas, 0 = none  
School Holiday  Indicates a school holiday: 1 = school holiday, 0 = none 
Descriptive statistics of the sales and number of customers.
Variable  Mean  Std. Dev.  Min  Max 

Sales  6955.51  3104.21  0  41551 
Customers  762.73  401.23  0  7388 
Experimental settings.
Attribute  Description 

GPU  NVIDIA GeForce 2080 RTX Super (8GB) 
CPU  Intel Core i99900 
RAM  32GB 
OS  Windows 10 Pro 64bit 
Python Libraries  numpy (numerical operations) 
pandas (data I/O and manipulation)  
scikitlearn (feature engineering, model)  
lightgbm (model)  
bayesianoptimization (hyperparameter optimization)  
lime (feature importance)  
matplotlib (visualization) 
Hyperparameter settings for RFR.
Hyperparameter  Default value  Search range  Description 

n_estimators  100  [50, 1000]  Number of trees 
max_depth  50  [30, 70]  Maximum tree depth 
min_samples_split  2  [2, 100]  Minimum number of samples required to split 
max_samples  1.0  [0.1, 1.0]  Fraction of the original dataset is given to any individual tree 
Hyperparameter settings for LightGBM.
Hyperparameter  Default value  Search range  Description 

n_estimators  200  [50, 1000]  Number of trees 
max_depth  50  [30, 70]  Maximum tree depth 
num_leaves  100  [2, 512]  Maximum tree leaves 
subsample  1.0  [0.1, 1.0]  Subsample ratio of the training instance 
Partial results of measuring feature importance and overall rankings for RFR.
Feature  Importance (impurity)  Importance (lime)  Ranking 

Promo  0.1367 (2)  829.5015 (1)  1 
DayOfWeek_1  0.0335 (5)  225.5958 (2)  2 
Month_12  0.0223 (7)  109.6942 (4)  3 
CompetitionOpenElapsedDays  0.1306 (3)  59.6449 (9)  4 
Promo2ElapsedDays  0.0416 (4)  47.9985 (11)  5 
Partial results of measuring feature importance and overall rankings for LightGBM.
Feature  Importance (impurity)  Importance (lime)  Ranking 

CompetitionDistance  43182 (1)  709.5813 (3)  1 
Promo  13054 (4)  886.1849 (1)  2 
DayOfWeek_1  5119 (7)  748.4422 (2)  3 
CompetitionOpenElapsedDays  33031 (2)  98.6852 (12)  4 
Month_12  4025 (11)  463.3038 (5)  5 
Performance evaluation results.
Models β  WMAPE  # Features  Training time (s)  

Base  .05  .1  .2  Base  .05  .1  .2  Base  .05  .1  .2  
RFR  .1023  .1072  .1113  .1132  42  21  16  15  736  384  304  284 
LightGBM  .1275  .1332  .1382  .1448  42  22  15  13  1.89  1.60  1.53  1.45 
RFR+BO  .1013  .1063  .1108  .1108  42  17  15  15  11,919  4,096  4,578  4,578 
LightGBM+BO  .1027  .1058  .1095  .1220  42  20  16  13  76  93  111  79 
Arpita Nagpal, Deepti Gaur*
The Korea Institute of Information and Commucation Engineering 2015; 13(2): 113122 https://doi.org/10.6109/jicce.2015.13.2.113