Journal of Information and Communication Convergence Engineering 2021; 19(2): 79-83
Published online June 30, 2021
https://doi.org/10.6109/jicce.2021.19.2.79
© Korea Institute of Information and Communication Engineering
This paper proposes a novel image classification method based on few-shot learning, aimed at solving model overfitting and non-convergence in image classification tasks on small datasets and at improving classification accuracy. The method extends a basic convolutional neural network (CNN) through structural optimization, extracting more image features by adding convolutional layers and thereby improving classification accuracy. We incorporated several measures to improve model performance. First, we used general techniques such as a lower learning rate and shuffling to promote rapid convergence of the model. Second, we preprocessed the small dataset with data augmentation to increase the number of training samples and suppress overfitting. We applied the model to a dataset of 10 monkey species and achieved outstanding performance. Experiments indicate that the proposed method achieves an accuracy of 87.92%, a relative improvement of 26.1% over a traditional CNN and 1.1% over the deep residual network ResNet50.
Keywords: Deep learning, Feature extraction, Few-shot learning, Image classification
In recent years, deep learning models have been applied with great success to computer vision tasks such as face recognition, object recognition, image classification, and semantic segmentation [1-3]. Several outstanding deep learning models have emerged, such as LeNet, AlexNet, GoogLeNet, VGG (Visual Geometry Group), and ResNet [4-8]. These models, all based on the convolutional neural network (CNN), have different characteristics and can achieve satisfactory results on different tasks. Deep learning models can automatically learn features from data, which generally requires a large amount of available training data, particularly for very high-dimensional inputs such as images and video. If the number of samples is small, the model can extract only limited features, and its results are unsatisfactory.
CNNs are also called shift-invariant artificial neural networks [9] because they perform representation learning and can classify input information in a shift-invariant manner according to their hierarchical structure. Because a deep learning model relies mainly on the sample features extracted by its convolutional layers to realize object recognition and image classification, the number of samples determines the accuracy of the model's predictions to a certain extent. The classic CNNs mentioned above are all trained on large image datasets such as ImageNet [10, 11] and can achieve good results on computer vision tasks. In realistic scenes with small amounts of sample data, however, the model extracts only limited features, which leads to non-convergence and overfitting during training and directly degrades the model's predictions.
This paper proposes an optimized image classification method for small datasets. In this method, we deepen the model to extract features from small-dataset images. On the one hand, we use data augmentation to increase the number of training samples and regularization to suppress model overfitting; on the other hand, we accelerate model convergence through optimization measures such as a lower learning rate and shuffling. Together, these optimizations significantly improve prediction accuracy.
Humans can recognize new objects from very few samples. For example, children can recognize simple objects such as apples and strawberries after seeing only individual pictures in a book. This cognitive ability is rooted in the perception of objects, in which nerve cells play an essential role. Researchers hope that a machine learning model can achieve few-shot learning [12] by first training on a large amount of data from particular categories and then training on only a small number of samples from the new categories of a downstream task. Traditional few-shot learning assumes that the training data and test data come from the same domain; when the downstream task involves an unseen domain, traditional few-shot learning methods perform poorly.
Recently, along with the rapid development of machine learning, few-shot learning has advanced faster in the image processing field than in the natural language processing field, and excellent few-shot learning methods have attracted attention [13, 14]. Few-shot learning is an application of meta-learning in the field of supervised learning. Meta-learning, also known as learning to learn, decomposes a dataset into multiple meta-tasks in the meta-learning phase so that the model learns to generalize when categories change. In the meta-testing stage, brand-new categories can then be classified without changing the existing model. Few-shot learning is defined as follows: a few-shot learning set contains multiple categories, each with multiple samples. In the training phase, C categories are randomly chosen from the training set, along with K samples for each category (C × K samples in total); these form the support set of the meta-task, which is input to the model. Batches of samples drawn from the remaining data serve as the model's prediction targets (batch sets). The model must learn to distinguish the C categories from the C × K support samples; such tasks are called C-way K-shot problems, as sketched below.
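To make the episode construction concrete, the following sketch shows one way to sample a C-way K-shot meta-task; the function name, the dictionary-based dataset layout, and the query-set size are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def sample_episode(images_by_class, C=5, K=1, queries_per_class=5, rng=None):
    """Sample one C-way K-shot meta-task (support set + query/batch set).

    images_by_class: dict mapping a class label to an array of that class's
    images, e.g. {"mantled_howler": np.ndarray of shape (n, H, W, 3)}.
    """
    rng = rng or np.random.default_rng()
    # Randomly choose C categories from the training set.
    chosen = rng.choice(list(images_by_class), size=C, replace=False)
    sx, sy, qx, qy = [], [], [], []
    for label, cls in enumerate(chosen):
        pool = images_by_class[cls]
        order = rng.permutation(len(pool))
        # K samples per category form the support set (C*K samples in total).
        sx.extend(pool[order[:K]])
        sy.extend([label] * K)
        # A batch drawn from the remaining data forms the query (batch) set.
        picked = order[K:K + queries_per_class]
        qx.extend(pool[picked])
        qy.extend([label] * len(picked))
    return np.stack(sx), np.array(sy), np.stack(qx), np.array(qy)
```

Sampling a fresh episode for each training step is what exposes the model to many different category combinations, as described next.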
In the few-shot learning process, different meta-tasks are sampled for each training episode; therefore, the training as a whole covers many different combinations of categories. This mechanism allows the model to learn the parts common to all meta-tasks, such as extracting important features and comparing sample similarities, while abstracting away the task-specific parts of each meta-task. A model trained with this mechanism can therefore classify well even when faced with new, unseen meta-tasks.
Few-shot learning models can be broadly divided into model-based, metric-based, and optimization-based models. A model-based model predicts the label directly from the query and the support set, as shown in (1):

$$P_\theta(y \mid x) = f_\theta(x, S) \tag{1}$$

where $\theta$ denotes the model parameters, $x$ is the query sample, $S$ is the support set, and $f_\theta$ maps the query and the support set to a prediction. A metric-based model instead weights the support labels by a learned similarity, as shown in (2):

$$P_\theta(y \mid x, S) = \sum_{(x_i, y_i) \in S} k_\theta(x, x_i)\, y_i \tag{2}$$

where $k_\theta$ is a similarity kernel between the query $x$ and each support sample $x_i$. Optimization-based models instead meta-learn how to adapt the model parameters to a new task with only a few gradient steps.
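To make the metric-based definition in (2) concrete, the following is a minimal numpy sketch of similarity-weighted classification over a support set; the cosine-similarity kernel, the softmax weighting, and all names are illustrative assumptions, not the paper's method.

```python
import numpy as np

def metric_based_predict(query, support, support_labels, num_classes):
    """Predict class probabilities for a query sample per Eq. (2).

    query:          (d,)   embedding of the query sample x.
    support:        (N, d) embeddings of the support samples x_i.
    support_labels: (N,)   integer labels y_i of the support samples.
    """
    # Cosine similarity plays the role of the kernel k_theta(x, x_i).
    sims = support @ query / (
        np.linalg.norm(support, axis=1) * np.linalg.norm(query) + 1e-8)
    # Softmax converts similarities into weights that sum to one.
    w = np.exp(sims - sims.max())
    w /= w.sum()
    # Weighted sum of one-hot labels y_i gives the class distribution.
    return w @ np.eye(num_classes)[support_labels]

# Example: a 5-way 1-shot episode with random 64-dimensional embeddings.
rng = np.random.default_rng(0)
probs = metric_based_predict(rng.normal(size=64), rng.normal(size=(5, 64)),
                             np.arange(5), num_classes=5)
print(probs.argmax(), round(probs.sum(), 6))  # predicted class; sums to 1.0
```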
The study of monkey species helps researchers investigate the habits, population classification, and genetic characteristics of monkeys, and is of great significance for studying human evolutionary history. In this study, we established a small deep learning model to classify approximately 1,400 images of monkeys from 10 species. The dataset, named 10 Monkey Species, was compiled following Wikipedia's monkey cladogram [15]. The training set had 1,098 images and the test set had 272 images, and the number of images per category was not uniform. Compared to the Dogs vs. Cats dataset [16] (approximately 25,000 images), the number of samples in this dataset is minimal. It was therefore challenging to build a deep model that classifies the monkey species with high accuracy.
When there are enough data samples, a CNN model is adequate for most image recognition and classification tasks owing to its simple structure, small number of parameters, fast feature extraction, and high prediction accuracy. For image classification on small datasets, however, although pre-trained models (such as VGG and ResNet) can extract more data features, they greatly increase the number of model parameters and the depth of the model hierarchy, making training time extremely long. Therefore, we restructured and optimized the CNN model without significantly increasing its parameters, mainly in the following aspects.
First, we extracted more data features by adding a small number of convolutional and pooling layers. We used 3 × 3 convolution kernels with a stride of 1 × 1 and set the convolutional layer activation function to ReLU, and we used max pooling with a 2 × 2 window and a 2 × 2 stride. We then defined a combination module consisting of two convolutional layers followed by a max-pooling layer and added two such modules before the fully connected layers.
Second, in the fully connected layers, we used regularization techniques such as dropout to remove unnecessary neurons and provide the classifier with essential data features, improving classification accuracy. We set the dropout rates of the two fully connected layers to 0.5 and 0.25.
Third, we used softmax as the classification function to map the extracted image features to classification outputs over the 10 categories.
Finally, during training we used categorical cross-entropy as the loss function and Adam [17] as the stochastic gradient descent optimizer to optimize the training process and accelerate model convergence. The structure of the proposed model is illustrated in Fig. 1.
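The description above maps naturally onto a small Keras model. The following is a minimal sketch: the 3 × 3 convolutions with stride 1 and ReLU, the 2 × 2 max pooling, the two dropout rates, the 10-way softmax output, the categorical cross-entropy loss, and the Adam optimizer follow the text, while the input size, filter counts, and dense-layer widths are assumptions the paper does not specify.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(224, 224, 3), num_classes=10):
    """CNN extended with conv/pool combination modules, per the text above."""
    model = models.Sequential()
    model.add(layers.InputLayer(input_shape=input_shape))
    # Each combination module: two 3x3 convs (stride 1x1, ReLU) + 2x2 max pool.
    # One base module plus the two added modules; filter counts are assumptions.
    for filters in (32, 64, 128):
        model.add(layers.Conv2D(filters, 3, strides=1, padding="same",
                                activation="relu"))
        model.add(layers.Conv2D(filters, 3, strides=1, padding="same",
                                activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=2, strides=2))
    # Fully connected head with dropout rates of 0.5 and 0.25.
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.25))
    # Softmax classification output over the 10 monkey species.
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```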
Data augmentation, also known as data expansion, makes limited data provide the value of a larger dataset without actually collecting more data, and it is very effective for small-sample datasets. Data augmentation methods can be divided into supervised and unsupervised augmentation. Supervised augmentation can be further divided into single-sample and multi-sample methods, while unsupervised augmentation covers two directions: generating new data and learning augmentation strategies. In this study, we mainly adopted augmentation by geometric transformation, with the parameters presented in Table 1. Such geometric transformations increase the number of training samples, improve the model's generalization ability, effectively suppress overfitting, and improve classification accuracy.
Table 1. Data augmentation parameters (geometric transformations)

Parameters | Values |
---|---|
Rescale | 1.0/255 |
Rotation range (°) | 40 |
Width shift | 0.2 |
Height shift | 0.2 |
Zoom range | 0.2 |
Shear range | 0.2 |
Horizontal flip | True |
Fill mode | nearest |
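The parameters in Table 1 correspond directly to the Keras ImageDataGenerator interface. The following is a minimal sketch of the preprocessing and training pipeline; the class-per-folder directory names and the 224 × 224 image size are assumptions, and build_model is the sketch from the previous section.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Geometric-transformation augmentation with the parameters from Table 1.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255, rotation_range=40,
    width_shift_range=0.2, height_shift_range=0.2,
    zoom_range=0.2, shear_range=0.2,
    horizontal_flip=True, fill_mode="nearest")
test_gen = ImageDataGenerator(rescale=1.0 / 255)  # no augmentation at test time

# Directory layout (one subfolder per species) is an assumption.
train_flow = train_gen.flow_from_directory(
    "10-monkey-species/training", target_size=(224, 224),
    batch_size=64, class_mode="categorical")  # shuffles by default
test_flow = test_gen.flow_from_directory(
    "10-monkey-species/validation", target_size=(224, 224),
    batch_size=64, class_mode="categorical")

# Batch size 64 and 100 epochs match the experiment settings reported below.
model = build_model()
model.fit(train_flow, epochs=100, validation_data=test_flow)
```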
To test the classification effectiveness and performance of the proposed model, we implemented the deep learning model in Python and deployed it on a graphics workstation equipped with an Intel Core i7-4790 CPU, 16 GB of memory, a 2 TB hard disk, and a GTX 960 graphics card. The experimental procedure was as follows.
Classification experiments were performed on the preprocessed dataset with both the baseline CNN model and the proposed model; the training batch size was set to 64 and the number of training epochs to 100. Fig. 2 depicts the accuracy and loss curves of the CNN model during training, and Fig. 3 shows those of the proposed model. The average accuracy of the CNN model on the test set was 69.74%, whereas the proposed model reached 87.92%. In addition, to evaluate the proposed model comprehensively, we trained and tested the preprocessed data with deep learning models such as VGG16 and ResNet50. The results are presented in Table 2.
Table 2. Training and test accuracy of each model

Model | Training accuracy | Test accuracy |
---|---|---|
CNN | 0.8876 | 0.6974 |
VGG16 | 0.9191 | 0.8213 |
ResNet50 | 0.9268 | 0.8696 |
Ours | 0.9358 | 0.8792 |
This paper proposed a novel few-shot-learning-based image classification method for monkey species. Starting from a basic CNN, the method increases the number of convolutional layers to rapidly extract features from a small dataset. By fine-tuning the fully connected layers and adopting the dropout mechanism, the most informative features are retained for the classification function, achieving fast and accurate classification.
We trained and tested the proposed model on a dataset of 10 monkey species and obtained a test accuracy of 87.92%. Compared to the baseline CNN model, this is a relative improvement of 26.1%; compared to the VGG16 and ResNet50 deep learning models, it is a relative improvement of 7% and 1.1%, respectively. The experimental results indicate that the proposed model performs outstandingly in image classification tasks on a small dataset.
Guangxing Wang received his M.S. degree in Computer Application Technology from Huazhong University of Science and Technology, Wuhan, China, in 2009, and his Ph.D. degree from the School of Computer Information Engineering of Kunsan National University, Korea, in 2020. Since 2016, he has been an associate professor in the Information Technology Center of Jiujiang University, China. His research interests include data science, information systems, and artificial intelligence.
Kwang-Chan Lee received his M.S. degree from the Dept. of Management Information of Hankuk University of Foreign Studies, Seoul, Korea, in 2001. Since 2018, he has been a doctoral student in the Dept. of Computer Information Engineering of Kunsan National University, Gunsan, Korea. From 2006 to the present, he has been a professor in the same department. His research interests include image processing, big data, and ERP.
Seong-Yoon Shin received his M.S. and Ph.D. degrees from the Dept. of Computer Information Engineering of Kunsan National University, Gunsan, Korea, in 1997 and 2003, respectively. From 2006 to the present, he has been a professor in the same department. His research interests include image processing, computer vision, and virtual reality.