Journal of information and communication convergence engineering 2024; 22(2): 133-138
Published online June 30, 2024
https://doi.org/10.56977/jicce.2024.22.2.133
© Korea Institute of Information and Communication Engineering
Correspondence to : Gwanghyun Jo (E-mail: gwanghyun@hanyang.ac.kr)
Department of Mathematical Data Science, Hanyang University ERICA
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Dissolved oxygen (DO) is an important factor in ecosystems. However, analyzing DO is often complicated by the nonlinear behavior of river systems. Therefore, a convenient model-free algorithm for predicting DO is required. In this study, a data-driven algorithm for predicting DO, called ANN-XGB, was developed by combining XGBoost and an artificial neural network (ANN). To train the model, two years of ecosystem data were collected from the Anyang Stream in Seoul using the Troll 9500 model. One advantage of the proposed algorithm is its ability to capture abrupt changes in climate-related features that arise from sudden events. Moreover, because it uses XGBoost, the algorithm provides a feature importance analysis. The results obtained using the ANN-XGB algorithm are compared with those obtained using the ANN alone in the Results Section. The predictions made by ANN-XGB were mostly in closer agreement with the measured DO values in the river than those made by the ANN.
Keywords: XGBoost, artificial neural network, dissolved oxygen, feature importance
Dissolved oxygen (DO) significantly influences biological and chemical processes in a river system [1-4]. Maintaining a certain level of DO is essential for the survival of aquatic organisms. In [5], fish mortality in an urban river was found to be caused by the depletion of DO. Therefore, real-time prediction of DO levels, ahead of sudden decreases, is needed to maintain a healthy ecosystem. Some researchers have adopted ecosystem models [6-8] to predict water-quality variables. These approaches consider both dynamic relations and stochastic variances. However, reconstructing solutions is complicated because of the nonlinearity of the model. Moreover, it is difficult to include singular external events, such as sudden heavy rain or snow near the river.
Recently, empirical models have been used to analyze and simulate the complex nonlinear relationships between variables in water resources. In particular, artificial neural network (ANN) models have been employed efficiently to predict water quality-related features [9-15]. Some researchers have found that hybrid models that combine ANN with other machine learning algorithms are superior to simple ANN for water quality analysis. Kavousi-Fard developed an ANN-teacher learning algorithm [16], and Jha and Sahoo proposed an ANN-genetic algorithm to simulate groundwater levels [17]. In particular, Ravansalar et al. developed a hybrid Wavelet-ANN model to predict DO levels in the River Calder [18].
In this study, we developed a hybrid method that combines an ANN and extreme gradient boosting (XGBoost) to predict DO in the Anyang Stream in Seoul, Republic of Korea. XGBoost [19] is a gradient boosting machine [20] type algorithm with a decision tree as the basis function. Because of its scalability and robustness, XGBoost has been successfully employed in various research areas, for example, in forecasting crude oil prices [21], classifying rock facies [22], detecting DDoS attacks [23], and forecasting the load in a power system [24]. The first step of our algorithm is to train the ANN using time-series DO data. The naïve ANN-predicted DO is then fed into the XGBoost model together with meteorological data. Therefore, our method can capture the effects of sudden meteorological events such as heavy rain. We call this algorithm ANN-XGB. One of the advantages of ANN-XGB is its capability for feature importance analysis. We analyzed the influence of each feature on DO by counting the number of appearances of each feature in the tree branches. The resulting feature importance indicates that tidal level and water temperature play important roles in predicting DO. The performance of ANN-XGB is reported in the Results Section, where ANN-XGB outperformed the simple ANN in most cases.
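The two-stage data flow described above can be sketched as follows. This is only an illustration of how a first-stage DO estimate and meteorological features might be combined; the function names (`two_stage_predict`, `persistence`, `rain_correction`) and the stand-in models are hypothetical and are not the trained ANN or XGBoost model of this study.

```python
from typing import Callable, Sequence


def two_stage_predict(
    do_history: Sequence[float],
    weather: Sequence[float],
    stage1: Callable[[Sequence[float]], float],
    stage2: Callable[[Sequence[float]], float],
) -> float:
    """Pass the stage-1 DO estimate, together with weather features, to stage 2."""
    naive_do = stage1(do_history)       # stands in for the trained ANN
    features = [naive_do, *weather]     # ANN output + meteorological data
    return stage2(features)             # stands in for the trained XGBoost model


# Hypothetical stand-ins, used only to make the sketch executable.
def persistence(history: Sequence[float]) -> float:
    return history[-1]                  # "predict the last observed value"


def rain_correction(features: Sequence[float]) -> float:
    naive_do, rain_mm = features[0], features[1]
    return naive_do - 0.1 * rain_mm     # made-up correction, not a fitted model


pred = two_stage_predict([8.0, 7.5, 7.0], [5.0], persistence, rain_correction)
print(pred)
```

The point of the design is that the second stage sees both the time-series estimate and the exogenous inputs, so it can adjust the first-stage output when the weather changes abruptly.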
The remainder of this paper is organized as follows. In Section 2, the entire procedure for the development of the ANN-XGB algorithms is presented, including data acquisition, pre-processing, and algorithm design. The results are reported in Section 3, and conclusions are presented in Section 4.
In this section, we describe the ANN-XGB algorithm for predicting DO variables.
Anyang Stream is a tributary of the Han River, the largest river in the Republic of Korea, which flows through the capital city. The stream joins the downstream reaches of the Han River 50 km from the river mouth. The river finally reaches the Yellow Sea, which has a large tidal range of up to 10 m. Because of this large tidal range, the water level in the downstream reach of the Han River is affected by tides. The present study site is located 400 m from the main stream of the Han River, within the reach of tidal influence; the water level increases and the tributary flow velocity decreases during high tide, and vice versa during low tide. Occasional short-term depletion of dissolved oxygen and high turbidity during rain events have caused fish deaths and are regarded as major stress factors for aquatic organisms. Because the stream is located in a densely populated residential area, the clarity of the water and the sight of swimming fish are of concern to many local residents visiting the riverside trails of Anyang Stream.
A water quality sensor (Troll 9500 model, In-situ Inc., USA) was installed at the foot of a bridge (Yangpyung Bridge) in the Anyang Stream to monitor variations in water quality (Fig. 1). Using the Troll 9500 model, temperature, dissolved oxygen (DO), electric conductivity, and turbidity data were collected at 15-minute intervals from 2010 to 2012. The sensors were cleaned and maintained every one or two weeks to remove biofouling and for calibration. Table 1 summarizes the types and ranges of water-quality-related data collected using the Troll 9500 model.
Table 1. Types and ranges of water-quality-related data.
Parameter | Sensor type | Range | Accuracy |
---|---|---|---|
Water temperature | Platinum resistance thermometer | -5 to 50 °C | ±0.1 °C |
Turbidity | Nephelometer, 90° light scattering, 860 nm LED | 0 to 2,000 NTU | 2 NTU |
Dissolved oxygen | Optical fluorescence quenching | 0 to 20 mg/L | ±0.1 mg/L |
Conductivity | 4-cell | 5 to 2,000 μS/cm | 2 μS/cm |
Together with the water quality data, we also considered meteorological data provided by the Korean Meteorological Administration. The meteorological data included air temperature, wind speed, solar radiation, cloud amount, and precipitation at one-hour intervals (Table 2). The water levels of the Han River, monitored every 10 min, were provided by the Han River Flood Control Office.
Table 2. Types and ranges of weather-related data.
Parameter | Range | Average | Standard deviation |
---|---|---|---|
Air temperature | -17.7~33.7 °C | 10.86 °C | ±11.3 °C |
Wind speed | 0~13 m/s | 2.76 m/s | ±1.48 m/s |
Radiation | 0~398 mj/m² | 94.23 mj/m² | ±87.9 mj/m² |
Precipitation | 0.68~8.4 m | 2.72 mm | ±13.94 mm |
Water level | 0.68~8.4 m | 2.23 m | ±0.68 m |
Cloud amount | 0~10 | 5.52 | ±3.99 |
Hydrological and meteorological data measured from 2010 to 2012 were selected as input variables. Hydrological data included water temperature, turbidity, DO, conductivity, water height, and flow rate (Table 1), and meteorological data included air temperature, wind speed, solar radiation, cloud amount, and precipitation (Table 2). A time series of DO of the form $X_{t-\tau}, X_{t-\tau+1}, \ldots, X_t$ was also used to train the prediction model. While it is natural to assume that $X_t$ and $X_{t-\tau}$ are correlated, the length of the sequence must be chosen carefully. If the length, which is determined by the delay parameter $\tau$, is too short, the prediction model loses accuracy. Conversely, if $\tau$ is too large, the sequence may include meaningless information. We used the mutual information (MI) between $X_t$ and $X_{t-\tau}$ to determine the proper sequence length. The mutual information between two variables $X$ and $Y$ is calculated as follows:
$$\mathrm{MI}(X, Y) = \sum_{x}\sum_{y} P(x, y)\,\log\frac{P(x, y)}{P(x)\,P(y)},$$
where $P(x, y)$ is the joint probability mass function of $X$ and $Y$, and $P(x)$ and $P(y)$ are the corresponding marginal probability mass functions. We measured $\mathrm{MI}(X_t, X_{t-\tau})$ with increasing $\tau$ (Fig. 2).
Because $\mathrm{MI}(X_t, X_{t-\tau})$ tends to flatten at seven days, we chose $\tau$ as seven days.
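As an illustration of the lag-selection idea, the following sketch estimates the mutual information between a series and its lagged copy using equal-width histograms. The binning scheme, the number of bins, and the synthetic 24-hour sine series are assumptions made for this example; the source does not state how the probabilities were estimated.

```python
import math
from collections import Counter


def mutual_information(xs, ys, bins=10):
    """Histogram estimate of MI(X, Y) in nats for two equal-length sequences."""

    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins
        if width == 0.0:                 # constant series: everything in one bin
            return [0 for _ in values]
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(xs), discretize(ys)
    n = len(bx)
    joint = Counter(zip(bx, by))         # counts for P(x, y)
    px, py = Counter(bx), Counter(by)    # counts for the marginals
    return sum(
        (c / n) * math.log((c / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), c in joint.items()
    )


def lagged_mi(series, tau, bins=10):
    """MI between X_t and X_{t-tau}; tau must be a positive integer."""
    return mutual_information(series[tau:], series[:-tau], bins)


# Synthetic hourly series with a 24-hour cycle (placeholder, not measured DO).
series = [math.sin(2 * math.pi * t / 24) for t in range(24 * 30)]
mi_by_lag = {tau: lagged_mi(series, tau) for tau in (6, 24)}
print(mi_by_lag)
```

On real data one would evaluate `lagged_mi` over a range of lags and look for the lag beyond which the curve stops changing appreciably.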
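For the ANN-XGB model, which is described in the next subsection, the three-year water quality and meteorological dataset (7,368 samples × 11 variables) was divided into two sets. The first two years were selected as the training set and the remaining year as the test set. The training and test datasets have dimensions of 5,184 samples × 11 features and 2,184 samples × 11 features, respectively. Because normalization is crucial, we normalized the DO variable to the range between 0 and 1, that is,
$$\mathrm{DO}_{\mathrm{norm}} = \frac{\mathrm{DO} - \mathrm{DO}_{\min}}{\mathrm{DO}_{\max} - \mathrm{DO}_{\min}},$$
where $\mathrm{DO}_{\min}$ and $\mathrm{DO}_{\max}$ denote the minimum and maximum DO values, respectively.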
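A chronological split followed by scaling to the range 0 to 1 might look like the sketch below. The sample counts come from the text; the placeholder series and the choice to reuse the training minimum and maximum for the test set are assumptions, since the source does not say which statistics were used.

```python
def min_max_scale(values, lo=None, hi=None):
    """Scale values to [0, 1]; lo/hi default to the min/max of the input."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in values], lo, hi


n_total, n_train = 7368, 5184                        # sample counts from the text
do_series = [float(i % 20) for i in range(n_total)]  # placeholder DO values

# Chronological split: earlier samples for training, later ones for testing.
train, test = do_series[:n_train], do_series[n_train:]

train_scaled, lo, hi = min_max_scale(train)
# Assumption: the test set reuses the training min/max to avoid look-ahead.
test_scaled, _, _ = min_max_scale(test, lo, hi)

print(len(train), len(test), min(train_scaled), max(train_scaled))
```

With this choice, test values outside the training range would fall outside the interval from 0 to 1, which is usually acceptable and keeps the evaluation honest.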
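In this subsection, we propose ANN-XGB, which combines an ANN and XGBoost to predict DO $p$ hours ahead. The ANN-XGB model structure developed in this study is shown in Fig. 3. First, the ANN is trained using a time series of DO. Given a time $t$, the series of values $\{\mathrm{DO}_{t-\tau\Delta T}, \mathrm{DO}_{t-(\tau-1)\Delta T}, \ldots, \mathrm{DO}_{t}\}$ is fed into the ANN to predict $\{\mathrm{DO}_{t+\Delta T}, \mathrm{DO}_{t+2\Delta T}, \ldots, \mathrm{DO}_{t+p\Delta T}\}$. Here, the time sampling interval and delay parameter are chosen as $\Delta T = 1$ h and $\tau = 198$. We report the performance of ANN-XGB in the Results Section by varying $p$ between 1 and 12 h. The sample data $(X_j, Y_j)$ are described as follows:
$$X_j = \left(\mathrm{DO}_{t_j-\tau\Delta T}, \mathrm{DO}_{t_j-(\tau-1)\Delta T}, \ldots, \mathrm{DO}_{t_j}\right), \qquad Y_j = \left(\mathrm{DO}_{t_j+\Delta T}, \mathrm{DO}_{t_j+2\Delta T}, \ldots, \mathrm{DO}_{t_j+p\Delta T}\right),$$
where $t_j$ denotes the time of the $j$-th sample.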
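The construction of input and target windows from a single series can be sketched as follows. The helper name `make_samples` is hypothetical, the series is assumed to be sampled at a fixed step (one hour here), and the toy integers stand in for DO values.

```python
def make_samples(series, tau, p):
    """Return (X, Y): X[j] holds tau + 1 past values, Y[j] the next p values."""
    X, Y = [], []
    for t in range(tau, len(series) - p):
        X.append(series[t - tau : t + 1])    # DO_{t-tau}, ..., DO_t
        Y.append(series[t + 1 : t + p + 1])  # DO_{t+1}, ..., DO_{t+p}
    return X, Y


# Toy series so the windows are easy to read; tau = 198 would be used in practice.
series = list(range(10))
X, Y = make_samples(series, tau=3, p=2)
print(X[0], Y[0])
print(X[-1], Y[-1])
```

Each input window has `tau + 1` entries, so a delay of 198 hourly steps corresponds to 199 input values per sample.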
The ANN is trained by standard supervised learning on the dataset (Xj, Yj), which is generated from the DO measured from 2010 to 2012. We denote the ANN-predicted DO as {
In this section, the performance of ANN-XGB in predicting $\mathrm{DO}_{t+p\Delta T}$ is presented.
We now describe the ANN- and XGBoost-related parameters. We determined the hyperparameters of the ANN experimentally. Interestingly, the ANN predicted DO accurately with a small number of parameters. Therefore, instead of using complex models such as LSTMs or transformers, we chose a simple ANN with two hidden layers and sigmoid activation, where the first hidden layer had 80 nodes and the second had 40 nodes. For XGBoost, we chose a learning rate of 0.01 and used 400 estimators (trees). To prevent overfitting, the subsample ratio was set to 0.75 and the maximum depth was set to 12.
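For reference, the settings above can be collected into two configuration dictionaries, and the size of the network can be checked with a short calculation. The XGBoost keys follow the parameter names used by the library's scikit-learn interface; the input width of 199 (that is, τ + 1 lagged values) and the single output are assumptions for this example, since the source does not state the layer dimensions explicitly.

```python
# Settings reported in the text, gathered in one place.
ANN_CONFIG = {"hidden_layers": (80, 40), "activation": "sigmoid"}
XGB_CONFIG = {
    "learning_rate": 0.01,
    "n_estimators": 400,   # number of trees
    "subsample": 0.75,     # row subsampling ratio, used against overfitting
    "max_depth": 12,
}


def count_parameters(n_inputs, hidden_layers, n_outputs):
    """Number of weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_layers, n_outputs]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


# Assumption: 199 inputs (tau + 1 with tau = 198) and one output (p = 1).
n_params = count_parameters(199, ANN_CONFIG["hidden_layers"], 1)
print(n_params)  # 19281
```

Under these assumptions the network has roughly nineteen thousand trainable parameters, which is small compared with typical recurrent or attention-based models.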
The model was trained on the period from March 2010 to January 2, 2012, and tested on three periods: January 4 to February 7, 2012; February 11 to March 12, 2012; and March 14 to April 7, 2012.
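We use four metrics to evaluate the prediction model: the coefficient of determination ($R^2$), the Nash-Sutcliffe efficiency (NSE), the root mean square error (RMSE), and the mean absolute error (MAE). $R^2$ represents the proportion of the variability in the measured DO that can be predicted by the model.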
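The four scores reported in the Results can be computed as in the sketch below. The definitions are the commonly used ones and are assumptions here, because the source formulas are not available: R2 is taken as the squared Pearson correlation between observed and predicted values (which would be consistent with R2 and NSE differing in Table 3), and NSE as one minus the ratio of the squared error to the variance of the observations.

```python
import math


def evaluate(obs, pred):
    """Return R2, NSE, RMSE and MAE for paired observed/predicted values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))   # squared error
    sst = sum((o - mean_o) ** 2 for o in obs)            # spread of observations
    spp = sum((p - mean_p) ** 2 for p in pred)           # spread of predictions
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    return {
        "R2": cov ** 2 / (sst * spp),     # squared Pearson correlation (assumed)
        "NSE": 1.0 - sse / sst,           # Nash-Sutcliffe efficiency
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(o - p) for o, p in zip(obs, pred)) / n,
    }


# Toy example: predictions with a constant bias of 0.5.
scores = evaluate([6.0, 7.0, 8.0, 9.0], [6.5, 7.5, 8.5, 9.5])
print(scores)
```

In the toy example the predictions are perfectly correlated with the observations, so R2 equals 1, whereas NSE drops to 0.8 because it penalizes the constant bias.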
We first compare the 1-hour-ahead DO predictions of the ANN and ANN-XGB (see Fig. 4). Overall, the predictions of both models follow the actual shape of the DO curve during the test period. However, the predictions made by ANN-XGB were closer to the actual DO than those made by the ANN.
Next, we compare ANN-XGB with the ANN while varying p from 1 to 12 h, as shown in Table 3. ANN-XGB outperformed the ANN on all four metrics when p < 3. For p ≥ 3, however, the performance of the two models was similar. This suggests that ANN-XGB is preferable for short-term prediction.
Table 3. Accuracy of prediction of $\mathrm{DO}_{t+p\Delta T}$ in terms of $R^2$, NSE, RMSE, and MAE.
p | $R^2$ (ANN) | $R^2$ (ANN-XGB) | NSE (ANN) | NSE (ANN-XGB) | RMSE (ANN) | RMSE (ANN-XGB) | MAE (ANN) | MAE (ANN-XGB) |
---|---|---|---|---|---|---|---|---|
1 | 0.88 | 0.94 | 0.84 | 0.94 | 1.35 | 0.89 | 0.88 | 0.55 |
2 | 0.87 | 0.90 | 0.83 | 0.89 | 1.38 | 1.12 | 0.90 | 0.72 |
3 | 0.88 | 0.87 | 0.87 | 0.85 | 1.24 | 1.32 | 0.80 | 0.86 |
4 | 0.87 | 0.86 | 0.85 | 0.84 | 1.31 | 1.37 | 0.84 | 0.89 |
5 | 0.85 | 0.85 | 0.85 | 0.82 | 1.32 | 1.41 | 0.85 | 0.93 |
6 | 0.84 | 0.85 | 0.84 | 0.82 | 1.35 | 1.41 | 0.86 | 0.92 |
7 | 0.82 | 0.83 | 0.83 | 0.81 | 1.39 | 1.46 | 0.89 | 0.96 |
8 | 0.80 | 0.82 | 0.82 | 0.81 | 1.41 | 1.47 | 0.91 | 0.96 |
9 | 0.79 | 0.82 | 0.82 | 0.80 | 1.43 | 1.50 | 0.92 | 0.97 |
10 | 0.78 | 0.82 | 0.82 | 0.80 | 1.44 | 1.51 | 0.91 | 0.97 |
11 | 0.78 | 0.82 | 0.81 | 0.80 | 1.45 | 1.50 | 0.92 | 0.96 |
12 | 0.79 | 0.81 | 0.81 | 0.80 | 1.45 | 1.49 | 0.93 | 0.96 |
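To make the short-horizon advantage concrete, the relative change in RMSE can be computed directly from the first rows of Table 3. The helper name `relative_reduction` is introduced only for this example.

```python
# RMSE values copied from Table 3: p -> (ANN, ANN-XGB).
rmse = {1: (1.35, 0.89), 2: (1.38, 1.12), 3: (1.24, 1.32)}


def relative_reduction(baseline, hybrid):
    """Percentage by which the hybrid model lowers the baseline error."""
    return 100.0 * (baseline - hybrid) / baseline


reductions = {p: relative_reduction(ann, xgb) for p, (ann, xgb) in rmse.items()}
for p, value in reductions.items():
    print(f"p={p}: {value:+.1f}%")
# p=1: +34.1%
# p=2: +18.8%
# p=3: -6.5%
```

A positive value means ANN-XGB has the lower RMSE; the sign flips at p = 3, where the ANN alone is slightly better on this metric.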
One of the advantages of the proposed scheme is its capability to analyze the relationship between the target and feature variables via the feature importance scores given by XGBoost. For example, the weight-type feature importance is obtained by counting the number of times each feature appears in the tree branches. Thus, a high feature importance score for a variable suggests a strong influence on the target variable. Fig. 5 shows the weight-type importance scores for the ANN-XGB model. The tidal level, water temperature, and turbidity have high scores, implying that they are correlated with changes in DO.
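The counting behind a weight-type importance score can be illustrated with a few hand-built trees. The nested-dictionary tree format and the toy trees below are hypothetical and only demonstrate the counting; in the XGBoost Python package, the corresponding scores are exposed through `Booster.get_score(importance_type="weight")`.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count how often each feature is used as a split."""
    if "feature" in node:                    # internal node; leaves only hold a value
        counts[node["feature"]] += 1
        count_splits(node["left"], counts)
        count_splits(node["right"], counts)


def weight_importance(trees):
    """Total number of splits per feature over all trees."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts


leaf = {"value": 0.0}
trees = [
    {"feature": "tidal_level",
     "left": {"feature": "water_temp", "left": leaf, "right": leaf},
     "right": leaf},
    {"feature": "water_temp",
     "left": leaf,
     "right": {"feature": "tidal_level", "left": leaf, "right": leaf}},
    {"feature": "tidal_level", "left": leaf, "right": leaf},
]

importance = weight_importance(trees)
print(importance.most_common())
```

Because this score only counts splits, it indicates how often a feature is used rather than how much each split reduces the error, so it is best read as a rough ranking.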
We proposed a hybrid model combining an ANN and XGBoost to predict DO in the Anyang Stream. First, an ANN model was developed to predict DO, with its hyperparameters tuned experimentally. Next, XGBoost was adopted in the second stage, where the predictions of the first stage were used as input features together with hydrological and meteorological data. For 1-hour-ahead prediction, the R2 and NSE coefficients were both approximately 0.94 for ANN-XGB, compared with 0.88 and 0.84, respectively, for the ANN. We also reported the feature importance scores obtained using XGBoost. The results indicate that tidal level and water temperature are correlated with changes in DO.
A limitation of the proposed algorithm is that it is model-free. Therefore, its accuracy relies heavily on the span of the training period, because a longer training period is more likely to contain various types of meteorological events. Combining the algorithm with ecosystem models [6-8] might enhance its generalization capability.
In future work, we will develop prediction models for other water-quality-related parameters, such as water temperature and turbidity. Additionally, to validate the proposed algorithm, downstream data collected at other sites should be considered.
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (No. 2020R1C1C1A01005396).
Keun Young Lee
received his Ph. D. degree from the Department of Mathematical Science, KAIST, in 2009. From 2017 to 2020, he was a faculty member of the Department of Mathematics at Sejong University, Republic of Korea. From 2020 to the present, he has been an independent scholar in Republic of Korea. His research interests include Banach space theory, machine learning, and fuzzy theory.
Bomchul Kim
received his B.S. degree in oceanography from Seoul University in 1977, received his M. S. in bio-engineering from KAIST in 1980, and received a Ph. D. in oceanography from Seoul University in 1987. He has been a faculty member of the Department of Environmental Science at Kangwon National University since 1981. Currently, he is a professor emeritus at Kangwon National University.
Gwanghyun Jo
received his M.S. and Ph. D. degrees from the Department of Mathematical Science, KAIST, in 2013 and 2018, respectively. From 2019 to 2023, he was a faculty member of the Department of Mathematics at Kunsan University, Republic of Korea. From 2023 to the present, he has been a faculty member of the Department of Mathematical Data Science, Hanyang University, ERICA. His research interests include numerical analysis, computational fluid dynamics, and machine learning.
Keun Young Lee1, Bomchul Kim2, and Gwanghyun Jo3*
1Independent scholar, Republic of Korea
2Department of Environmental Science, Kangwon National University, Chuncheon Kangwon-do, Republic of Korea
3Department of Mathematical Data Science, Hanyang University ERICA, Ansan Gyeonggi-do, Republic of Korea