
Regular paper


Journal of information and communication convergence engineering 2024; 22(4): 310-315

Published online December 31, 2024

https://doi.org/10.56977/jicce.2024.22.4.310

© Korea Institute of Information and Communication Engineering

Utilization of XGBoost for Behavior Analysis of Lottery Purchasers

Esther Kim 1, Yunjun Park 2, Gwanghyun Jo 2*, and Seong-Yoon Shin3*

1Department of Counselling Psychology, Korea Baptist Theological University/Seminary
2Department of Mathematical Data Science, Hanyang University ERICA, Ansan, Republic of Korea
3Department of Computer Science and Engineering, Kunsan National University, Gunsan-si, Republic of Korea

Correspondence to : Seong-Yoon Shin (E-mail: s3397220@kunsan.ac.kr) Department of Computer Science and Engineering, Kunsan National University
Gwanghyun Jo (E-mail: gwanghyun@hanyang.ac.kr) Department of Mathematical Data Science, Hanyang University, ERICA

Received: October 22, 2024; Revised: November 13, 2024; Accepted: November 13, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In this study, we conducted a data-driven analysis of lottery purchase behavior, using the XGBoost algorithm to predict future lottery purchase amounts from the purchase patterns of the previous four weeks. We began by judiciously defining key features, including the weekly average purchase amount and the variance in purchase amounts. We then evaluated the proposed method’s performance, finding that the predicted future purchase amounts closely matched the actual amounts. A key strength of this study is the interpretability of the feature variables. Through the feature importance scores from XGBoost, we found that features capturing impulsive purchase patterns (e.g., variability in purchase amount) are strongly correlated with future spending, which agrees with conventional behavior analysis. Our study can be extended to the development of early warning systems designed to identify at-risk and potentially addicted purchasers on online lottery platforms.

Keywords: XGBoost, Lottery Purchase, Behavior Analysis, Feature Importance

I. INTRODUCTION

The proliferation of online lottery platforms has significantly increased the accessibility of lottery tickets, raising concerns about the potential for gambling addiction [1-2]. For example, the relationship between online gambling and problematic behaviors was analyzed in [1]. Accordingly, efforts have been made to develop objective indicators in order to identify individuals at high risk of addictive behaviors [4-7]. Traditional approaches typically rely on statistical correlation analyses of carefully selected survey questions, or focus on identifying key risk factors associated with addictive behaviors. For example, factors such as gambling frequency, betting amounts, total money spent, and variability in purchase amounts are commonly linked to addictive tendencies. More recently, advancements in machine learning technologies have offered new opportunities to predict lottery purchasing patterns and address addiction-related issues using data-driven methods [8-11]. A key advantage of these algorithms is their ability to autonomously identify the factors (features) that influence addictive behaviors, even without direct input from predefined survey items. However, a major challenge associated with data-driven algorithms is the loss of explainability, as machine-learning models are often regarded as black-box methods.

This study proposes a prediction algorithm to analyze lottery purchase patterns using a dataset provided by the Dong-Hang Lottery Co., Ltd., which consists of user purchase histories. Given that overall purchase amounts are a critical factor related to lottery addiction, we predicted future purchase amounts based on purchase history. After accumulating historical user purchase data, the prediction model can be used to develop an alarm system for addictive behavior on lottery platforms. Another primary focus of this study was the explainability of the prediction model. We employed XGBoost [12], a highly efficient and accurate tree-based ensemble model. XGBoost-based methods have been successfully applied in various fields including purchasing behavior analysis [13], risk prediction [14-15], and clinical detection [16-17]. One key advantage of this algorithm is its ability to provide feature importance scores, which show correlations between the features and the target variable. Therefore, XGBoost can provide insights into purchase history patterns that correlate with addictive behavior.

Because the performance of a data-driven algorithm depends on its feature variables, we judiciously defined these variables from the purchase history. For example, the weekly average purchase amount per event and the variability in purchase amounts play an important role in our analysis, and both are defined as feature variables. Consistent with well-established theories in behavior analysis, these features capture impulsive and addictive purchasing patterns in individuals. After the feature vectors were selected, the hyperparameters of XGBoost were determined heuristically. In the results section, we report the R² and Pearson’s coefficient scores for the proposed method. Furthermore, we discuss the features that correlate with addictive purchasing behaviors based on the feature importance scores obtained using XGBoost.

The remainder of this paper is organized as follows. In Section 2, we describe the overall algorithm workflow, including feature selection. Section 3 presents the experimental results. Finally, Section 4 concludes the paper.

II. METHODS

We developed an XGBoost-based prediction model for lottery purchase amounts. In the following subsections, we describe the data collection and preprocessing methods, define the features used for the XGBoost-based algorithm, and finally present the algorithm itself.

A. Data collection and preprocessing

Historical purchase data from users, covering January to April of 2024, were provided by Dong-Hang Lottery Co., Ltd. To ensure data quality and focus on active users, we restricted the dataset to users with more than 300 purchase transactions during this period. For each user, we calculated the daily number of purchases and the daily total purchase amount. Thus, our dataset captures both the frequency and the monetary volume of purchases, providing a comprehensive overview of each user’s purchasing behavior. To formalize the data, we introduce the following notation: for a given user u and day t, let I_1^u(t) denote the total purchase amount on day t, and I_2^u(t) the number of purchases on day t. For simplicity, we omit the user superscript u unless otherwise necessary.
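The paper does not publish code, but the daily aggregation described above is straightforward. The following pandas sketch uses hypothetical column names and, for the toy data, a lowered activity threshold in place of the paper's 300-transaction cutoff:

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase (user, date, amount).
tx = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2"],
    "date": pd.to_datetime(["2024-01-05", "2024-01-05", "2024-01-06", "2024-01-05"]),
    "amount": [5000, 10000, 5000, 20000],
})

# Keep active users only (the paper uses >300 transactions; lowered here for toy data).
active = tx.groupby("user")["amount"].transform("size") > 1
tx = tx[active]

# Daily aggregates: I1(t) = total purchase amount, I2(t) = number of purchases.
daily = (tx.groupby(["user", "date"])["amount"]
           .agg(I1="sum", I2="size")
           .reset_index())
print(daily)
```

With the toy log above, only user u1 survives the filter, and u1's two purchases on 2024-01-05 collapse into one daily row with I1 = 15000 and I2 = 2.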

The objective of this study was to predict the total purchase amount over the following week from the purchase pattern of the preceding τ days:

Y = \sum_{s=1}^{7} I_1(t+s).    (1)

An appropriate value of τ was crucial for this analysis. If τ is too large, the feature vectors of each sample become inefficiently bulky. Conversely, if τ is too small, XGBoost cannot capture the user’s purchase pattern. To determine τ, we calculated the mutual information (MI) between I_1(t) and I_1(t−τ) for each user and averaged over all users, denoted MI(I_1(t), I_1(t−τ)). Here, MI is defined as

MI(X, Y) = \sum_{y \in Y} \sum_{x \in X} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}.

The graph of MI(I_1(t), I_1(t−τ)) as a function of τ is shown in Fig. 1. Because the MI levels off after 28 days, we set τ = 28. Notably, there are local peaks at τ = 7, 14, and 21, which we attribute to a weekly periodic pattern in lottery purchasing behavior, as users tend to make purchases more frequently on certain days of the week.
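A histogram-based MI curve like the one in Fig. 1 can be sketched as follows. The series here is synthetic, with an injected weekly spike standing in for I_1(t), since the purchase data are not public; the helper name and bin count are our own choices:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in histogram estimate of MI(X, Y) = sum P(x,y) log(P(x,y)/(P(x)P(y)))."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Synthetic I1(t): a baseline amount, a spike on one weekday, and noise.
rng = np.random.default_rng(0)
days = np.arange(200)
i1 = 10000.0 + 5000.0 * (days % 7 == 5) + rng.normal(0.0, 500.0, days.size)

# MI between I1(t) and I1(t - tau) for a few lags.
mi_by_lag = {tau: mutual_information(i1[tau:], i1[:-tau]) for tau in (1, 7, 14, 28)}
print(mi_by_lag)
```

Because the estimate is a KL divergence between the joint and the product of its marginals, it is nonnegative, and the weekly spike makes the lag-7 MI clearly exceed the lag-1 MI, mirroring the local peaks at multiples of 7 reported in the text.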

Fig. 1. Mutual information between I_1(t) and I_1(t−τ).

The input data, representing the purchase history, are structured as

X_0 = [I_1(t-27), \ldots, I_1(t), I_2(t-27), \ldots, I_2(t)].    (2)

Thus, the primary dataset, consisting of pairs (X_0, Y), can be directly used for supervised learning. However, we defined additional features from X_0 to enhance interpretability.
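Building the (X_0, Y) pairs amounts to sliding a 28-day window over each user's series and summing the following week, shifted by one week at a time as described later in the text. A sketch with our own function and variable names:

```python
import numpy as np

def make_samples(i1, i2, window=28, horizon=7, step=7):
    """Build (X0, Y) pairs: X0 stacks `window` days of I1 and I2 ending at day t;
    Y is the total purchase amount over the following `horizon` days."""
    X0, Y = [], []
    for t in range(window - 1, len(i1) - horizon, step):
        X0.append(np.concatenate([i1[t - window + 1:t + 1],
                                  i2[t - window + 1:t + 1]]))
        Y.append(i1[t + 1:t + 1 + horizon].sum())
    return np.array(X0), np.array(Y)

# Roughly four months (120 days) of toy data yields 13 weekly-shifted samples,
# matching the 13 data points per user mentioned in the text.
i1 = np.arange(120, dtype=float)   # toy daily purchase amounts
i2 = np.ones(120)                  # toy daily purchase counts
X0, Y = make_samples(i1, i2)
print(X0.shape, Y.shape)
```

Each X0 row has 56 entries (28 days × 2 series), and the first target is the sum of days 28–34 of the toy amount series.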

B. Feature extraction

In this subsection, we define the new features derived from a given user’s purchase history. The primary objective is to capture meaningful features that relate to future purchase amounts. The first feature is the weekly sum of purchase amounts:

F_1^i = \sum_{s=1}^{7} I_1(t + s - 7i), \quad i = 1, \ldots, 4.

The second feature is the weekly number of purchase events:

F_2^i = \sum_{s=1}^{7} I_2(t + s - 7i), \quad i = 1, \ldots, 4.

Next, the average purchase amount per event is defined as

F_3^i = F_1^i / F_2^i, \quad i = 1, \ldots, 4.

We also consider the number of days with at least one purchase per week, denoted by F_4^i (i = 1, \ldots, 4). The average time interval between purchases is also used as a feature, denoted by F_5^i (i = 1, \ldots, 4). Finally, the variance in the weekly purchase amount is denoted by F_6^i (i = 1, \ldots, 4).

Together with the purchase history vector X0, the features defined above form the final input variables:

X = [X_0, F_1, F_2, F_3, F_4, F_5, F_6],    (3)

where F_j denotes [F_j^1, \ldots, F_j^4] for each j = 1, \ldots, 6. Because we have four months of data for each individual, we can generate 13 data points of the form (X, Y) by shifting the time t in Eq. (1) by one-week intervals. We then construct the training dataset by aggregating each user’s data from Eqs. (1) and (3):

\{(X_i, Y_i)\}_{i=1}^{N_s},    (4)

where N_s = 44,967 represents the total number of samples (3,459 users × 13 windows).
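As a concrete sketch of the feature definitions above (with our own helper names; the treatment of a week with fewer than two purchases in F_5, defaulting to a gap of seven days, is our assumption rather than the paper's):

```python
import numpy as np

def weekly_features(i1_win, i2_win):
    """F1..F6 for one 28-day window; i = 1..4 indexes the four weeks."""
    i1w = np.asarray(i1_win, float).reshape(4, 7)   # daily amounts, week by week
    i2w = np.asarray(i2_win, float).reshape(4, 7)   # daily purchase counts
    F1 = i1w.sum(axis=1)                                    # weekly purchase amount
    F2 = i2w.sum(axis=1)                                    # weekly number of purchases
    F3 = np.divide(F1, F2, out=np.zeros(4), where=F2 > 0)   # avg amount per event
    F4 = (i2w > 0).sum(axis=1)                              # days with a purchase
    F5 = np.array([np.diff(np.flatnonzero(w > 0)).mean()
                   if (w > 0).sum() > 1 else 7.0            # default gap: our assumption
                   for w in i2w])                           # mean gap between purchase days
    F6 = i1w.var(axis=1)                                    # variance of daily amounts
    return np.concatenate([F1, F2, F3, F4, F5, F6])

# One purchase of 100 every day: F1=700, F2=7, F3=100, F4=7, F5=1, F6=0 per week.
feats = weekly_features(np.full(28, 100.0), np.ones(28))
```

The 24 resulting values (six features × four weeks) are concatenated with X_0 to form the 80-dimensional input X of Eq. (3).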

Before concluding this subsection, we discuss the role of each feature in relation to prior studies on behavioral analysis, which were not necessarily data-driven. The frequency of gambling and the amount of money spent on it have long been recognized as key predictors of addictive behavior. According to Hodgins et al. [18], the higher an individual’s gambling frequency, the greater the risk of gambling addiction. Similarly, Mazar et al. [19] found that individuals who engage in gambling more frequently are more likely to experience addiction. These studies support the idea that features F_1, F_2, and F_4 are major indicators for predicting gambling behavior one week into the future. Naturally, we can expect the target variable Y to increase alongside F_1, F_2, and F_4.

We now discuss F_3, which represents the average amount spent per purchase. LaBrie and Shaffer [20] reported that placing larger bets is associated with a higher likelihood of gambling addiction. Features F_5 and F_6, which capture impulsive purchasing patterns, are also meaningful for the behavioral analysis. Blaszczynski and Nower [21] suggested that individuals who make impulsive purchases tend to gamble at shorter intervals, potentially leading to addiction. Similarly, LaBrie and Shaffer [20] provided insights into the time periods associated with gambling purchases, as addicted gamblers exhibited patterns of concentrated high betting within short periods. Therefore, [20,21] justify our selection of F_5 and F_6, as people exhibiting frequent purchases and high variability in purchase amounts are likely to buy lottery items impulsively.

C. XGBoost-based prediction model

XGBoost is a tree-based ensemble algorithm that improves upon the traditional greedy tree algorithm by incorporating a regularized objective function [12]. We briefly describe XGBoost for the sake of completeness. Suppose that, at some stage of training, k trees \{f_i\}_{i=1}^{k} predict the target variable Y with residual vector r_k; that is,

r_{k,j} = Y_j - \tilde{Y}_j, \quad \text{where} \quad \tilde{Y}_j = \sum_{i=1}^{k} f_i(X_j).

In a conventional greedy-tree-based algorithm [22], a new tree f_{k+1} is added to reduce the residual r_k with respect to a loss function L(Y_j, \tilde{Y}_j), such as the mean absolute error (MAE). After adding f_{k+1}, the prediction model becomes

\tilde{Y} = \sum_{i=1}^{k+1} f_i(X).

In XGBoost, the new tree f_{k+1} is added to optimize a new objective function Ω(θ), which includes both L(Y, \tilde{Y}) and a regularization term Λ(θ). As described in [12], the regularization function is defined as

\Lambda(\theta) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2,    (5)

where T is the number of leaves in the tree and w is the vector of leaf weights.

Therefore, instead of simply reducing the residual error, XGBoost adds a new tree that maintains a balance between predictive accuracy and model complexity, thereby preventing overfitting on the training set.
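To make the residual-fitting recursion concrete, the sketch below implements the conventional greedy boosting loop of [22] on toy data, with a hand-rolled one-split "stump" as the base learner and a standard shrinkage factor. It illustrates the update r_k = Y − Σ f_i(X) that XGBoost refines; it does not include XGBoost's regularizer Λ(θ):

```python
import numpy as np

def fit_stump(X, r):
    """Fit a one-split regression stump to the residual r (exhaustive SSE search)."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())   # (sse, feature, thr, left, right)
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        xs, rs = X[order, j], r[order]
        for i in range(1, len(rs)):
            left, right = rs[:i], rs[i:]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, 0.5 * (xs[i - 1] + xs[i]), left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, X):
    j, thr, left, right = stump
    return np.where(X[:, j] <= thr, left, right)

rng = np.random.default_rng(1)
X = rng.uniform(size=(80, 2))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1])        # toy target

lr, stumps = 0.5, []
residual = y.copy()                               # r_0 = Y (prediction starts at 0)
for _ in range(30):
    f = fit_stump(X, residual)                    # fit f_{k+1} to the residual r_k
    stumps.append(f)
    residual -= lr * predict_stump(f, X)          # r_{k+1} = Y - sum_i lr * f_i(X)

y_tilde = lr * sum(predict_stump(f, X) for f in stumps)
mae = np.abs(y - y_tilde).mean()
```

Each round strictly reduces the training error, which is precisely the overfitting risk that the Λ(θ) penalty of Eq. (5) is designed to keep in check.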

Once XGBoost is trained on the dataset defined by Eq. (4), it can be used to predict future purchase amounts. One primary goal of this study is to capture the correlation between future purchase amounts and past purchase patterns. To achieve this, we utilized gain-type feature importance, which evaluates the contribution of each feature toward improving the model’s performance. Whenever a feature is used to split a node, XGBoost calculates the extent to which the split reduces the loss function. This reduction accumulates across all splits involving the features, allowing us to identify those features that most directly influence the prediction of future purchase amounts.

III. RESULTS

We now examine the performance of XGBoost in predicting future lottery purchases based on a four-week purchase history. The dataset includes the purchase histories of 3,459 users from January to April of 2024. Of these, 70% were used as the training set, with the remaining 30% allocated to the test set. In Subsection A, we report the performance of the model across various hyperparameter configurations. In Subsection B, we present the feature importance scores, highlighting the most influential features in predicting future purchase amounts.

A. Performance of XGBoost

The evaluation of model performance was based on R2 and Pearson’s correlation coefficient (p), which are defined as follows:

R^2 = 1 - \frac{\sum_i (y_i - \tilde{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad p = \frac{\sum_i (y_i - \bar{y})(\tilde{y}_i - \bar{\tilde{y}})}{\sqrt{\sum_i (y_i - \bar{y})^2 \sum_i (\tilde{y}_i - \bar{\tilde{y}})^2}}.

For both metrics, values closer to 1 indicate better prediction performance.
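Both metrics can be implemented directly from their definitions; a small NumPy sketch with our own helper names:

```python
import numpy as np

def r2_score(y, y_hat):
    """Coefficient of determination: 1 - SSE / total sum of squares."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def pearson(y, y_hat):
    """Pearson correlation between targets and predictions."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    yc, pc = y - y.mean(), y_hat - y_hat.mean()
    return (yc * pc).sum() / np.sqrt((yc ** 2).sum() * (pc ** 2).sum())

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.9])
print(r2_score(y_true, y_pred), pearson(y_true, y_pred))
```

A perfect prediction gives R² = 1, and any exact linear rescaling of the targets gives p = 1, which is why the two scores are reported together.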

The XGBoost parameters were selected heuristically. In Table 1, we report the performance of XGBoost for different combinations of the learning rate α, tree depth d, and number of trees n. From these results, we selected the following parameters: number of trees = 1000, learning rate = 0.01, and tree depth = 6. This configuration resulted in a Pearson correlation coefficient of 0.86 on the test set. Additionally, we set γ = 2 and λ = 1 for the regularization parameters in Eq. (5).

Table 1. Predictive accuracy results in terms of R² and p score

(α, d, n)          R² score              p score
                   Train set  Test set   Train set  Test set
(e-2, 6, 800)      0.831      0.743      0.912      0.862
(e-2, 6, 1000)     0.841      0.743      0.918      0.862
(e-2, 6, 1500)     0.864      0.743      0.930      0.862
(2e-2, 6, 800)     0.867      0.742      0.932      0.862
(2e-2, 6, 1000)    0.882      0.742      0.940      0.861
(2e-2, 6, 1500)    0.911      0.740      0.956      0.861
(e-2, 8, 800)      0.900      0.742      0.950      0.861
(e-2, 8, 1000)     0.913      0.741      0.957      0.861


Using these optimized parameters, we plotted the actual future purchase amounts against the predicted amounts, as shown in Fig. 2. These results demonstrate an overall match between the predicted and actual purchase amounts for both the training and test sets.

Fig. 2. Prediction results of XGBoost on training set (top) and test set (bottom).

B. Feature importance analysis

In this subsection, we report the feature importance scores generated by XGBoost after model training. The gain-type feature importance scores are presented in Fig. 3 in descending order. The weekly purchase amount has the highest score, which is natural because the target variable is the total purchase amount for the following week. The average purchase amount per event has the second-highest score, indicating a strong correlation with future purchase amounts. This finding coincides with that of [20], wherein placing larger bets in a single game was associated with a higher likelihood of addiction. The third-highest-scoring feature is the variance in weekly purchase amounts. This finding is consistent with [20,21], as individuals with impulsive purchasing behaviors for lottery items are more likely to exhibit higher variance in their spending. Therefore, we suggest that monitoring both the average bet amount and the variance in purchase amounts could be effective in identifying problematic gambling behavior.

Fig. 3. Gain-type feature importance scores of XGBoost in descending order.

IV. CONCLUSION

In this study, we employed XGBoost to conduct a behavior analysis, using the previous four weeks of purchasing patterns to predict future lottery purchase amounts. A feature importance analysis revealed that the features capturing impulsive purchase patterns are strongly correlated with future purchase amounts, which agrees with the behavior analyses conducted in [20,21]. The present study relies on real purchasing behavior data, rather than subjective data such as surveys, making the analysis robust and objective. Furthermore, by identifying impulsive purchasing behaviors in an explainable manner, this study offers practical insights into the early detection of, and intervention in, gambling problems. This approach paves the way for the development of warning systems for at-risk users of online lottery platforms, which could contribute significantly to preventing and managing gambling addiction through data-driven technological solutions.

ACKNOWLEDGMENTS

This study was conducted using data provided by Dong-Hang Lottery Co., Ltd. for research purposes.

REFERENCES

  1. S. M. Gainsbury, “Online gambling addiction: the relationship between internet gambling and disordered gambling,” Current addiction reports, vol. 2, no. 2, pp. 185-193, Jun. 2015. DOI: 10.1007/s40429-015-0057-8.
  2. M. Guillou-Landreat, K. Gallopel-Morvan, D. Lever, D. Le Goff, and J.-Y. Le Reste, “Gambling marketing strategies and the internet: What do we know? A systematic review,” Frontiers in Psychiatry, vol. 12, Feb. 2021. DOI: 10.3389/fpsyt.2021.583817.
  3. K. Caler, J. R. V. Garcia, and L. Nower, “Assessing problem gambling: A review of classic and specialized measures,” Current Addiction Reports, vol. 3, pp. 437-444, Oct. 2016. DOI: 10.1007/s40429-016-0118-7.
  4. N. M. Petry, “Validity of a gambling scale for the Addiction Severity Index,” The Journal of nervous and mental disease, vol. 191, no. 6, pp. 399-407, Jun. 2003.
  5. D. Pickering, B. Keen, G. Entwistle, and A. Blaszczynski, “Measuring treatment outcomes in gambling disorders: A systematic review,” Addiction, vol. 113, no. 3, pp. 411-426, Mar. 2018. DOI: 10.1111/add.13968.
  6. E. Kim and S. J. Kwon, “Effect of Speculative Experience in Internet Games on Gambling Problems: Moderated Mediating Effects of Illegal Internet Gambling Behavior and Exposure to COVID-19 Risk,” The Korean Journal of Health Psychology, vol. 28, no. 2, pp. 353-365, 2023. DOI: 10.17315/kjhp.2023.28.2.006.
  7. S. J. Kwon, Y. Kim, and E. Kim, “Development of Internet game speculative experience scale,” Korean Journal of Health Psychology, vol. 25, no. 4, pp. 651-666, Jul. 2020. DOI: 10.17315/kjhp.2020.25.4.003.
  8. A. Hassanniakalager and P. W. S. Newall, “A machine learning perspective on responsible gambling,” Behavioural Public Policy, vol. 6, no. 2, pp. 237-260, Apr. 2022. DOI: 10.1017/bpp.2019.9.
  9. J. Lee, O. Jung, Y. Lee, O. Kim, and C. Park, “A comparison and interpretation of machine learning algorithm for the prediction of online purchase conversion,” Journal of Theoretical and Applied Electronic Commerce Research, vol. 16, no. 5, pp. 1472-1491, May 2021. DOI: 10.3390/jtaer16050083.
  10. K. Coussement and K. W. De Bock, “Customer churn prediction in the online gambling industry: The beneficial effect of ensemble learning,” Journal of Business Research, vol. 66, no. 9, pp. 1629-1636, Sep. 2013. DOI: 10.1016/j.jbusres.2012.12.008.
  11. K. K. Mak, K. Lee, and C. Park, “Applications of machine learning in addiction studies: A systematic review,” Psychiatry research, vol. 275, pp. 53-60, May 2019. DOI: 10.1016/j.psychres.2019.03.001.
  12. T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, pp. 785-794, 2016. DOI: 10.1145/2939672.2939785.
  13. P. Song and Y. Liu, “An XGBoost algorithm for predicting purchasing behaviour on E-commerce platforms,” Tehnički vjesnik, vol. 27, no. 5, pp. 1467-1471, Oct. 2020. DOI: 10.17559/TV-20200808113807.
  14. X. Shi, Y. D. Wong, M. Z.-F. Li, C. Palanisamy, and C. Chai, “A feature learning approach based on XGBoost for driving assessment and risk prediction,” Accident Analysis & Prevention, vol. 129, pp. 170-179, Aug. 2019. DOI: 10.1016/j.aap.2019.05.005.
  15. S. Kabiraj, M. Raihan, N. Alvi, M. Afrin, L. Akter, and S. A. Sohagi, “Breast cancer risk prediction using XGBoost and random forest algorithm,” in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, pp. 1-4, 2020. DOI: 10.1109/ICCCNT49239.2020.9225451.
  16. A. Ogunleye and Q.-G. Wang, “XGBoost model for chronic kidney disease diagnosis,” IEEE/ACM transactions on computational biology and bioinformatics, vol. 17, no. 6, pp. 2131-2140, Nov.-Dec. 2019. DOI: 10.1109/TCBB.2019.2911071.
  17. S. Gündoğdu, “Efficient prediction of early-stage diabetes using XGBoost classifier with random forest feature selection technique,” Multimedia Tools and Applications, vol. 82, no. 22, pp. 34163-34181, Mar. 2023. DOI: 10.1007/s11042-023-15165-8.
  18. D. C. Hodgins, D. P. Schopflocher, C. R. Martin, N. el-Guebaly, D. M. Casey, S. R. Currie, G. J. Smith, and R. J. Williams, “Disordered gambling among higher-frequency gamblers: who is at risk?,” Psychological medicine, vol. 42, no. 11, pp. 2433-2444, Apr. 2012. DOI: 10.1017/S0033291712000724.
  19. A. Mazar, M. Zorn, N. Becker, and R. A. Volberg, “Gambling formats, involvement, and problem gambling: which types of gambling are more risky?,” BMC Public Health, vol. 20, no. 711, pp. 1-10, May 2020. DOI: 10.1186/s12889-020-08822-2.
  20. R. LaBrie and H. J. Shaffer, “Identifying behavioral markers of disordered Internet sports gambling,” Addiction Research & Theory, vol. 19, no. 1, pp. 56-65, Sep. 2010. DOI: 10.3109/16066359.2010.512106.
  21. A. Blaszczynski and L. Nower, “A pathways model of problem and pathological gambling,” Addiction, vol. 97, no. 5, pp. 487-499, 2002. DOI: 10.1046/j.1360-0443.2002.00015.x.
  22. J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, vol. 29, no. 5, pp. 1189-1232, Oct. 2001.

Esther Kim

received her M.A. and Ph.D. degrees at the Department of Counseling and Clinical Psychology from the Korea Baptist Theological University in 2019 and 2024, respectively. She is currently a postdoctoral researcher at the same university. Her research interests include understanding addiction problems and mental health issues, as well as the application of AI in these areas.

She can be contacted via email at: est0224@hanmail.net


Yunjun Park

is an undergraduate student at the Department of Mathematical Data Science of Hanyang University ERICA. His research interests include computer vision, numerical analysis, machine learning, and generative AI (Diffusion, GAN).

He can be contacted at email: pyjun0418@hanyang.ac.kr


Seong-Yoon Shin

received his M.S. and Ph.D. degrees from the Dept. of Computer Information Engineering at Kunsan National University, Gunsan, Republic of Korea, in 1997 and 2003, respectively. From 2006 to the present, he has been a professor at the School of Computer Science and Engineering. His research interests include image processing, computer vision, and virtual reality. He can be contacted at email: s3397220@kunsan.ac.kr


Gwanghyun Jo

received his M.S. and Ph.D. degrees from the Department of Mathematical Science, KAIST, in 2013 and 2018, respectively. From 2019 to 2023, he was a faculty member of the Department of Mathematics at Kunsan National University, Republic of Korea. From 2023 to the present, he has been a faculty member of the Department of Mathematical Data Science, Hanyang University ERICA. His research interests include numerical analysis, computational fluid dynamics, and machine learning.

He can be contacted at email: gwanghyun@hanyang.ac.kr


Article

Regular paper

Journal of information and communication convergence engineering 2024; 22(4): 310-315

Published online December 31, 2024 https://doi.org/10.56977/jicce.2024.22.4.310

Copyright © Korea Institute of Information and Communication Engineering.

Utilization of XGBoost for Behavior Analysis of Lottery Purchasers

Esther Kim 1, Yunjun Park 2, Gwanghyun Jo 2*, and Seong-Yoon Shin3*

1Department of Counselling Psychology, Korea Baptist Theological University/Seminary
2Department of mathematical data science, Hanyang university ERICA, Ansan, Republic of Korea
3Department of Computer Science and Engineering, Kunsan National University, Guansan-si, Republic of Korea

Correspondence to:Seong-Yoon Shin (E-mail: s3397220@kunsan.ac.kr) Department of Computer Science and Engineering, Kunsan National University
Gwanghyun Jo (E-mail: gwanghyun@hanyang.ac.kr) Department of Mathematical Data Science, Hanyang University, ERICA

Received: October 22, 2024; Revised: November 13, 2024; Accepted: November 13, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In this study, we conducted a data-driven analysis of lottery purchase behavior by using the XGBoost algorithm to predict future lottery purchase amounts based on purchase patterns of the previous four weeks. We began by judiciously defining key features including the weekly average purchase amount and variance in purchase amount. Subsequently, we evaluated the proposed method’s performance, finding the predicted future purchase amounts to match the actual purchase amounts. A key strength of this study was the interpretability of feature variables. Through the feature importance score from XGBoost, we found that features that capture impulsive patterns in purchases (e.g., variability in purchase amount) are strongly correlated with future spending, which agrees with conventional behavior analysis. Our study can be extended to the development of early warning systems designed to identify at-risk and potentially addicted purchasers on online lottery platforms.

Keywords: XGBoost, Lottery Purchase, Behavior Analysis, Feature Importance

I. INTRODUCTION

The proliferation of online lottery platforms has significantly increased the accessibility of lottery tickets, raising concerns about the potential for gambling addiction [1-2]. For example, the relationship between online gambling and problematic behaviors was analyzed in [1]. Accordingly, efforts have been made to develop objective indicators in order to identify individuals at high risk of addictive behaviors [4-7]. Traditional approaches typically rely on statistical correlation analyses of carefully selected survey questions, or focus on identifying key risk factors associated with addictive behaviors. For example, factors such as gambling frequency, betting amounts, total money spent, and variability in purchase amounts are commonly linked to addictive tendencies. More recently, advancements in machine learning technologies have offered new opportunities to predict lottery purchasing patterns and address addiction-related issues using data-driven methods [8-11]. A key advantage of these algorithms is their ability to autonomously identify the factors (features) that influence addictive behaviors, even without direct input from predefined survey items. However, a major challenge associated with data-driven algorithms is the loss of explainability, as machine-learning models are often regarded as black-box methods.

This study proposes a prediction algorithm to analyze lottery purchase patterns using a dataset provided by the Dong-Hang Lottery Co., Ltd., which consists of user purchase histories. Given that overall purchase amounts are a critical factor related to lottery addiction, we predicted future purchase amounts based on purchase history. After accumulating historical user purchase data, the prediction model can be used to develop an alarm system for addictive behavior on lottery platforms. Another primary focus of this study was the explainability of the prediction model. We employed XGBoost [12], a highly efficient and accurate tree-based ensemble model. XGBoost-based methods have been successfully applied in various fields including purchasing behavior analysis [13], risk prediction [14-15], and clinical detection [16-17]. One key advantage of this algorithm is its ability to provide feature importance scores, which show correlations between the features and the target variable. Therefore, XGBoost can provide insights into purchase history patterns that correlate with addictive behavior.

Because the performance of a data-driven algorithm is affected by feature variables, we judiciously defined these variables from the purchase history. For example, the weekly defined average purchase amount for each event and variability in purchase amounts play an important role in our analysis, with both defined as feature variables. Compared to well-established theories in behavior analysis, our features can be used to capture impulsive and addictive purchasing patterns in individuals. After the feature vectors are selected, the hyper-parameters in XGBoost are determined heuristically. In the results section, we report the R2 and Pearson’s coefficient scores for the proposed method. Furthermore, we discuss the features that correlate with addictive purchasing behaviors based on the feature importance scores obtained using XGBoost.

The remainder of this paper is organized as follows. In Section 2, we describe the overall algorithm workflow, including feature selection. Section 3 presents the experimental results. Finally, Section 4 concludes the paper.

II. METHODS

We developed an XGBoost-based prediction model for lottery purchase amounts. In the following subsections, we describe the data collection and preprocessing methods, define the features used for the XGBoost-based algorithm, and finally present the algorithm itself.

A. Data collection and preprocessing

Historical purchase data from users, obtained from January to April of 2024, were provided by DongHang Lottery Co., Ltd. To ensure data quality and focus on active users, we restricted the dataset to users with more than 300 purchase transactions during this period. For each user, we calculated the daily number of purchases and total purchase amount. Thus, our dataset captures both the frequency and monetary volume of purchases, providing a comprehensive overview of each user’s purchasing behavior. To formalize the data, we introduce the following notation: for a given user u and day t, let Iu1t represent the total purchase amount on day t, and Iu2t represent the number of purchases on day t. For simplicity, we omit the user subscript u unless otherwise necessary.

The objective of this study was to predict the future total purchase amount over the following week using purchase patterns from the preceding τ number of days:

Y= s=17I1t+s

An appropriate value of τ was crucial for this analysis. If τ is too large, the feature vectors of each sample may become inefficiently bulky. Conversely, if τ is too small, XGBoost would be rendered unable to capture the user’s purchase pattern. To determine τ, we calculated the mutual information (MI) between Iu1t and Iu2tτ for each user, and obtained an average for all users, denoted as MII1t, I1tτ. Here, MI is defined as

MI(X, Y) = \sum_{y \in Y} \sum_{x \in X} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}.  (2)

Fig. 1 plots MI(I_1(t), I_1(t − τ)) as a function of τ. Because the MI levels off after 28 days, we set τ = 28. Notably, local peaks appear at τ = 7, 14, and 21, which we attribute to a weekly periodicity in lottery purchasing: users tend to make purchases more frequently on certain days of the week.
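This τ-selection step can be sketched as follows. The histogram-based plug-in estimator below is a standard approximation of Eq. (2), not necessarily the authors' implementation, and `daily_amount` is a hypothetical per-user array of daily purchase amounts:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Plug-in estimate of MI(X, Y) from a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()                 # joint distribution P(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal P(x)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal P(y)
    mask = p_xy > 0                            # skip empty cells to avoid log(0)
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

def mi_by_lag(daily_amount, lags):
    """MI between I1(t) and I1(t - tau) for each candidate lag tau."""
    return {tau: mutual_information(daily_amount[tau:], daily_amount[:-tau])
            for tau in lags}
```

Averaging `mi_by_lag` over users and plotting it against τ reproduces the kind of curve shown in Fig. 1.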

Figure 1. Mutual information between I_1(t) and I_1(t − τ).

The input data, representing the purchase history, are structured as

X_0 = (I_1(t − 27), ..., I_1(t), I_2(t − 27), ..., I_2(t)).

Thus, the primary dataset, consisting of pairs (X_0, Y), can be used directly for supervised learning. However, we defined additional features from X_0 to enhance interpretability.

B. Feature extraction

In this subsection, we define the new features derived from a given user’s purchase history. The primary objective is to capture meaningful features that relate to future purchase amounts. The first feature is the weekly sum of purchase amounts:

F_1(i) = \sum_{s=1}^{7} I_1(t + s − 7i),  i = 1, ..., 4.

The second feature is the weekly number of purchase events:

F_2(i) = \sum_{s=1}^{7} I_2(t + s − 7i),  i = 1, ..., 4.

Next, the average purchase amount per event is defined as

F_3(i) = F_1(i) / F_2(i),  i = 1, ..., 4.

We also consider the number of days with at least one purchase per week, denoted by F_4(i) (i = 1, ..., 4). The average time interval between purchases is also used as a feature, denoted by F_5(i) (i = 1, ..., 4). Finally, the variance in the weekly purchase amount is denoted by F_6(i) (i = 1, ..., 4).

Together with the purchase history vector X0, the features defined above form the final input variables:

X = (X_0, F_1, F_2, F_3, F_4, F_5, F_6),  (3)

where F_j represents [F_j(1), ..., F_j(4)] for each j = 1, ..., 6. Because we have four months of data for each individual, we can generate 13 data points of the form (X, Y) by shifting the time t in Eq. (1) in one-week intervals. We then construct the training dataset by aggregating each user's data from Eqs. (1) and (3):

\{(X_i, Y_i)\}_{i=1}^{N_s},  (4)

where Ns=44,967 represents the total number of samples.
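The feature construction above can be sketched as follows, assuming `amount` and `count` are 28-day arrays (the I_1 and I_2 series) for one window. The exact granularity of F5 (purchase interval) and F6 (variance) is paraphrased from the paper, so the per-week choices below are our assumptions:

```python
import numpy as np

def weekly_features(amount, count):
    """Build F1..F6 from one 28-day purchase window (amount = I1, count = I2)."""
    a = np.asarray(amount, dtype=float).reshape(4, 7)      # 4 weeks x 7 days
    c = np.asarray(count, dtype=float).reshape(4, 7)
    f1 = a.sum(axis=1)                                     # F1: weekly purchase amount
    f2 = c.sum(axis=1)                                     # F2: weekly number of purchases
    f3 = np.divide(f1, f2, out=np.zeros(4), where=f2 > 0)  # F3: avg amount per event
    f4 = (c > 0).sum(axis=1).astype(float)                 # F4: active days per week
    f5 = np.array([np.diff(np.flatnonzero(cw)).mean()      # F5: mean gap between purchase
                   if (cw > 0).sum() > 1 else 7.0          #     days (7.0 fallback for at
                   for cw in c])                           #     most one purchase: our choice)
    f6 = a.var(axis=1)                                     # F6: variance of daily amounts
    return np.concatenate([a.ravel(), c.ravel(), f1, f2, f3, f4, f5, f6])
```

Concatenating the raw 28-day history with the derived features mirrors the input X of Eq. (3); stacking these vectors over all users and window shifts yields a dataset of the form of Eq. (4).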

Before concluding this subsection, we discuss the role of each feature in comparison to those in prior studies on behavioral analysis, which were not necessarily data-driven. The frequency of gambling and the amount of money spent on it have long been recognized as key predictors of addictive behavior. According to Hodgins et al. [18], the higher an individual's gambling frequency, the greater the risk of gambling addiction. Similarly, Mazar et al. [19] found that individuals who engage in gambling more frequently are more likely to experience addiction. The results of these studies support the idea that features F1, F2, and F4 are major indicators for predicting gambling behavior one week into the future. Naturally, we can expect the target variable Y to increase alongside F1, F2, and F4.

We now discuss F3, which represents the average amount spent per purchase. LaBrie and Shaffer [20] reported that the act of placing larger bets is associated with a higher likelihood of gambling addiction. Features F5 and F6, which capture impulsive purchasing patterns, are also meaningful for behavioral analysis. Blaszczynski and Nower [21] suggested that individuals who make impulsive purchases tend to gamble at shorter intervals, potentially leading to addiction. Similarly, LaBrie and Shaffer [20] provided insights into the time periods associated with gambling purchases, as addicted gamblers exhibited patterns of concentrated high betting within short periods. Therefore, [20,21] justify our selection of F5 and F6, as people who purchase frequently and show high variability in purchase amounts are likely to buy lottery items impulsively.

C. XGBoost-based prediction model

XGBoost is a modified tree-based algorithm that improves upon the traditional greedy tree algorithm by incorporating a regularized objective function [12]. We now briefly describe XGBoost for the sake of completeness. Suppose that, at some point in the training stage of the tree algorithm, k trees \{f_i\}_{i=1}^{k} predict the target variable Y with residual vector r_k; that is,

r_{k,j} = Y_j − \tilde{Y}_j,  where  \tilde{Y}_j = \sum_{i=1}^{k} f_i(X_j).

In a conventional greedy-tree-based algorithm [22], a new tree f_{k+1} is added to reduce the residual r_k based on a loss function L(Y_j, \tilde{Y}_j), such as the mean absolute error (MAE). After updating f_{k+1}, the prediction model is modified as

\tilde{Y} = \sum_{i=1}^{k+1} f_i(X).
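The stage-wise residual-fitting update above can be illustrated with a deliberately minimal stand-in: each new "tree" is a depth-0 constant predictor fitted to the current residual mean (a sketch of the update rule, not of real tree learning):

```python
import numpy as np

def greedy_boost(y, n_rounds=50, lr=0.1):
    """Stage-wise boosting with depth-0 'trees' (constants fitted to the residual)."""
    pred = np.zeros_like(y, dtype=float)
    learners = []
    for _ in range(n_rounds):
        residual = y - pred               # r_k = Y - sum_{i<=k} f_i(X)
        f_next = lr * residual.mean()     # f_{k+1}: constant step toward the residual
        learners.append(f_next)
        pred = pred + f_next              # Y~ = sum_{i<=k+1} f_i(X)
    return pred, learners
```

With constant learners the ensemble converges to the mean of Y; real gradient boosting replaces each constant with a regression tree fitted to the residual.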

In XGBoost, the new tree f_{k+1} is added to optimize a new objective function \Omega(\theta), which includes both L(Y, \tilde{Y}) and a regularization term \Lambda(\theta). As described in [12], the regularization function is defined as

\Lambda(\theta) = \gamma T + \frac{1}{2} \lambda \|w\|^2,  (5)

where T denotes the number of leaves in the new tree and w the vector of leaf weights [12].

Therefore, instead of simply reducing the residual error, XGBoost adds a new tree that maintains a balance between predictive accuracy and model complexity, thereby preventing overfitting on the training set.
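The regularized objective described above can be evaluated directly. In the sketch below, `leaf_weights` are a candidate tree's leaf scores w, `T` its number of leaves, and MAE serves as the loss term; all names are ours:

```python
import numpy as np

def xgb_objective(y_true, y_pred, leaf_weights, gamma=2.0, lam=1.0):
    """Omega(theta) = L(Y, Y~) + gamma * T + 0.5 * lambda * ||w||^2."""
    loss = np.abs(np.asarray(y_true) - np.asarray(y_pred)).mean()  # MAE loss term
    T = len(leaf_weights)                                          # number of leaves
    reg = gamma * T + 0.5 * lam * np.square(leaf_weights).sum()    # Lambda(theta)
    return loss + reg
```

A candidate split is only worthwhile if its loss reduction outweighs the γ cost of the extra leaves plus the λ penalty on the new leaf weights, which is exactly the accuracy-complexity balance described above.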

Once XGBoost is trained on the dataset defined by Eq. (4), it can be used to predict future purchase amounts. One primary goal of this study is to capture the correlation between future purchase amounts and past purchase patterns. To achieve this, we utilized gain-type feature importance, which evaluates the contribution of each feature toward improving the model’s performance. Whenever a feature is used to split a node, XGBoost calculates the extent to which the split reduces the loss function. This reduction accumulates across all splits involving the features, allowing us to identify those features that most directly influence the prediction of future purchase amounts.

III. RESULTS

We now examine the performance of XGBoost in predicting future lottery purchases based on a four-week purchase history. The dataset includes the purchase histories of 3,459 users from January to April of 2024. Of these, 70% were used as the training set, with the remaining 30% allocated to the test set. In Subsection A, we report the performance of the model across various hyperparameter configurations. In Subsection B, we present the feature importance scores, highlighting the most influential features in predicting future purchase amounts.

A. Performance of XGBoost

The evaluation of model performance was based on R2 and Pearson’s correlation coefficient (p), which are defined as follows:

R^2 = 1 − \frac{\sum_i (y_i − \tilde{y}_i)^2}{\sum_i (y_i − \bar{y})^2}, \quad p = \frac{\sum_i (y_i − \bar{y})(\tilde{y}_i − \bar{\tilde{y}})}{\sqrt{\sum_i (y_i − \bar{y})^2} \sqrt{\sum_i (\tilde{y}_i − \bar{\tilde{y}})^2}}.

Both metrics are bounded above by 1, with values closer to 1 indicating better prediction performance.
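The two evaluation metrics can be computed directly from their definitions (equivalent, up to naming, to `sklearn.metrics.r2_score` and `np.corrcoef`):

```python
import numpy as np

def r2_score(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def pearson(y, y_hat):
    """Pearson correlation between actual and predicted values."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    dy, dh = y - y.mean(), y_hat - y_hat.mean()
    return np.sum(dy * dh) / np.sqrt(np.sum(dy ** 2) * np.sum(dh ** 2))
```

Note that the two metrics behave differently: Pearson's p is invariant to a constant offset in the predictions, whereas R^2 penalizes it.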

The XGBoost parameters were selected heuristically. In Table 1, we report the performance of XGBoost for different combinations of learning rate α, tree depth d, and number of trees n. From these results, we selected n = 1000, α = 0.01, and d = 6. This configuration yielded a Pearson correlation coefficient of 0.86 on the test set. Additionally, we set γ = 2 and λ = 1 for the regularization parameters in Eq. (5).

Table 1. Predictive accuracy results in terms of R2 and p score.

(α, d, n)       | R2 (train) | R2 (test) | p (train) | p (test)
(e-2, 6, 800)   | 0.831      | 0.743     | 0.912     | 0.862
(e-2, 6, 1000)  | 0.841      | 0.743     | 0.918     | 0.862
(e-2, 6, 1500)  | 0.864      | 0.743     | 0.930     | 0.862
(2e-2, 6, 800)  | 0.867      | 0.742     | 0.932     | 0.862
(2e-2, 6, 1000) | 0.882      | 0.742     | 0.940     | 0.861
(2e-2, 6, 1500) | 0.911      | 0.740     | 0.956     | 0.861
(e-2, 8, 800)   | 0.900      | 0.742     | 0.950     | 0.861
(e-2, 8, 1000)  | 0.913      | 0.741     | 0.957     | 0.861


Using these optimized parameters, we plotted the actual future purchase amounts against the predicted amounts, as shown in Fig. 2. These results demonstrate an overall match between the predicted and actual purchase amounts for both the training and test sets.

Figure 2. Prediction results of XGBoost on training set (top) and test set (bottom).

B. Feature importance analysis

In this subsection, we report the feature importance scores generated by XGBoost following model training. The gain-type feature importance scores are presented in Fig. 3 in descending order. The weekly purchase amount is associated with the highest score, which is natural because the target variable is the total purchase amount for the following week. The average purchase amount per event had the second-highest score, indicating a strong correlation with future purchase amounts. This finding coincides with that of [20], wherein placing larger bets in a single game was associated with a higher likelihood of addiction. The third-highest-scoring feature is the variance in weekly purchase amounts. This finding is consistent with those of [20,21], as individuals with impulsive purchasing behaviors for lottery items are more likely to exhibit higher variance in their spending. Therefore, we suggest that monitoring both the average bet amount and the variance in purchase amounts could be effective in identifying problematic gambling behavior.

Figure 3. Gain-type feature importance scores of the trained XGBoost model, in descending order.

IV. CONCLUSION

In this study, we employed XGBoost to conduct a behavior analysis, using the previous four weeks of purchasing patterns to predict future lottery purchase amounts. A feature importance analysis revealed that the features that capture impulsive purchase patterns are strongly correlated with future purchase amounts, which agrees with the behavior analyses conducted in [20,21]. The present study relies on real purchasing behavior data, rather than subjective data such as surveys, making the analysis robust and objective. Furthermore, by identifying impulsive purchasing behaviors in an explainable manner, this study offers practical insights into the early detection of and intervention in gambling problems. This approach paves the way for the development of warning systems for at-risk users of online lottery platforms, which could contribute significantly to preventing and managing gambling addiction through data-driven technological solutions.

ACKNOWLEDGMENT

This study was conducted using data provided by Dong-Hang Lottery Co., Ltd. for research purposes.



References

1. S. M. Gainsbury, "Online gambling addiction: the relationship between internet gambling and disordered gambling," Current Addiction Reports, vol. 2, no. 2, pp. 185-193, Jun. 2015. DOI: 10.1007/s40429-015-0057-8.
2. M. Guillou-Landreat, K. Gallopel-Morvan, D. Lever, D. Le Goff, and J.-Y. Le Reste, "Gambling marketing strategies and the internet: What do we know? A systematic review," Frontiers in Psychiatry, vol. 12, Feb. 2021. DOI: 10.3389/fpsyt.2021.583817.
3. K. Caler, J. R. V. Garcia, and L. Nower, "Assessing problem gambling: A review of classic and specialized measures," Current Addiction Reports, vol. 3, pp. 437-444, Oct. 2016. DOI: 10.1007/s40429-016-0118-7.
4. N. M. Petry, "Validity of a gambling scale for the Addiction Severity Index," The Journal of Nervous and Mental Disease, vol. 191, no. 6, pp. 399-407, Jun. 2003.
5. D. Pickering, B. Keen, G. Entwistle, and A. Blaszczynski, "Measuring treatment outcomes in gambling disorders: A systematic review," Addiction, vol. 113, no. 3, pp. 411-426, Mar. 2018. DOI: 10.1111/add.13968.
6. E. Kim and S. J. Kwon, "Effect of Speculative Experience in Internet Games on Gambling Problems: Moderated Mediating Effects of Illegal Internet Gambling Behavior and Exposure to COVID-19 Risk," The Korean Journal of Health Psychology, vol. 28, no. 2, pp. 353-365, 2023. DOI: 10.17315/kjhp.2023.28.2.006.
7. S. J. Kwon, Y. Kim, and E. Kim, "Development of Internet game speculative experience scale," Korean Journal of Health Psychology, vol. 25, no. 4, pp. 651-666, Jul. 2020. DOI: 10.17315/kjhp.2020.25.4.003.
8. A. Hassanniakalager and P. W. S. Newall, "A machine learning perspective on responsible gambling," Behavioural Public Policy, vol. 6, no. 2, pp. 237-260, Apr. 2022. DOI: 10.1017/bpp.2019.9.
9. J. Lee, O. Jung, Y. Lee, O. Kim, and C. Park, "A comparison and interpretation of machine learning algorithm for the prediction of online purchase conversion," Journal of Theoretical and Applied Electronic Commerce Research, vol. 16, no. 5, pp. 1472-1491, May 2021. DOI: 10.3390/jtaer16050083.
10. K. Coussement and K. W. De Bock, "Customer churn prediction in the online gambling industry: The beneficial effect of ensemble learning," Journal of Business Research, vol. 66, no. 9, pp. 1629-1636, Sep. 2013. DOI: 10.1016/j.jbusres.2012.12.008.
11. K. K. Mak, K. Lee, and C. Park, "Applications of machine learning in addiction studies: A systematic review," Psychiatry Research, vol. 275, pp. 53-60, May 2019. DOI: 10.1016/j.psychres.2019.03.001.
12. T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, pp. 785-794, 2016. DOI: 10.1145/2939672.2939785.
13. P. Song and Y. Liu, "An XGBoost algorithm for predicting purchasing behaviour on E-commerce platforms," Tehnički vjesnik, vol. 27, no. 5, pp. 1467-1471, Oct. 2020. DOI: 10.17559/TV-20200808113807.
14. X. Shi, Y. D. Wong, M. Z.-F. Li, C. Palanisamy, and C. Chai, "A feature learning approach based on XGBoost for driving assessment and risk prediction," Accident Analysis & Prevention, vol. 129, pp. 170-179, Aug. 2019. DOI: 10.1016/j.aap.2019.05.005.
15. S. Kabiraj, M. Raihan, N. Alvi, M. Afrin, L. Akter, and S. A. Sohagi, "Breast cancer risk prediction using XGBoost and random forest algorithm," in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, pp. 1-4, 2020. DOI: 10.1109/ICCCNT49239.2020.9225451.
16. A. Ogunleye and Q.-G. Wang, "XGBoost model for chronic kidney disease diagnosis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 17, no. 6, pp. 2131-2140, Nov.-Dec. 2019. DOI: 10.1109/TCBB.2019.2911071.
17. S. Gündoğdu, "Efficient prediction of early-stage diabetes using XGBoost classifier with random forest feature selection technique," Multimedia Tools and Applications, vol. 82, no. 22, pp. 34163-34181, Mar. 2023. DOI: 10.1007/s11042-023-15165-8.
18. D. C. Hodgins, D. P. Schopflocher, C. R. Martin, N. el-Guebaly, D. M. Casey, S. R. Currie, G. J. Smith, and R. J. Williams, "Disordered gambling among higher-frequency gamblers: who is at risk?," Psychological Medicine, vol. 42, no. 11, pp. 2433-2444, Apr. 2012. DOI: 10.1017/S0033291712000724.
19. A. Mazar, M. Zorn, N. Becker, and R. A. Volberg, "Gambling formats, involvement, and problem gambling: which types of gambling are more risky?," BMC Public Health, vol. 20, no. 711, pp. 1-10, May 2020. DOI: 10.1186/s12889-020-08822-2.
20. R. LaBrie and H. J. Shaffer, "Identifying behavioral markers of disordered Internet sports gambling," Addiction Research & Theory, vol. 19, no. 1, pp. 56-65, Sep. 2010. DOI: 10.3109/16066359.2010.512106.
21. A. Blaszczynski and L. Nower, "A pathways model of problem and pathological gambling," Addiction, vol. 97, no. 5, pp. 487-499, 2002. DOI: 10.1046/j.1360-0443.2002.00015.x.
22. J. H. Friedman, "Greedy function approximation: a gradient boosting machine," Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, Oct. 2001.