Journal of Information and Communication Convergence Engineering 2020; 18(4): 267-277
Published online December 31, 2020
https://doi.org/10.6109/jicce.2020.18.4.267
Dea-Gyu Choe and Dong-Keun Kim, Sangmyung University
© Korea Institute of Information and Communication Engineering
It is important to understand the exact habitat distribution of endangered species because their numbers are decreasing. In this study, we build a system with a deep learning module that automatically collects, processes, and stores image data of endangered animals. The system classifies images more efficiently than human effort and addresses two problems faced in previous studies. First, those studies suggested plausible but incorrect answers because the probability distribution over answer candidates was computed even when the actual answer was not in the candidate group. Second, when more than one entity appeared in an image, only a single entity was considered. We applied an object detection algorithm (YOLO) to resolve these problems. Our system achieves a mean average precision of 86.79%, a mean recall rate of 93.23%, and a processing speed of 13 frames per second.
Keywords System Design, System Development, Object Detection, Endangered Species
Recently, many animals on the planet have become extinct owing to factors such as climate change. Many studies on the maintenance of biodiversity have been conducted. Unmanned cameras have been installed and video recordings of living creatures obtained to collect the information needed for these studies, but huge amounts of human resources were needed to process the acquired data. The processing was slow and inefficient because it had to be performed using human perception and judgment. This issue has led to the emergence of research on machine systems for automatically processing and distinguishing animal images.
Nguyen et al. proposed a convolutional neural network system for classifying the three most commonly observed animal species in Victoria, Australia [1]. Zhuang et al. introduced a deep learning model that automatically categorizes and annotates marine biological image data without manual processing by experts, and conducted experiments on data from SeaCLEF2017 [2]. In another study, Nguyen et al. considered two experimental scenarios for classifying images of wild animals using model architectures based on Lite AlexNet, VGG-16, and ResNet-50 [3]. In the first scenario, the model was trained from scratch; in the second, a technique called fine-tuning was used, in which weights pre-trained on ImageNet, a very large image dataset, were loaded into the model and then adapted to the target data. Such pre-training techniques are widely used for monitoring and classifying large amounts of animal image data because they provide strong local image features.
Two problems arise in wildlife image classification when simple classification is used. First, because the classification results indicate the probabilistic similarity of how close the object in the image is to various correct answer candidates, the model suggests the best wrong answer even when the correct answer is not in the candidate group. Second, the answer is given for only one object, even if multiple objects exist within the image. In this study, we propose an object-detection approach to address these problems. We establish a system for processing and storing image data of endangered species using YOLO, a well-known object detection algorithm.
The contributions of this study are as follows: First, we automated the acquisition, processing, and storage of image data for investigating the ecological preservation of endangered species. Second, by replacing the simple classification algorithm with an object detection algorithm in our new system, the system can overcome the limitations of current classification systems. Third, the developed system can be linked to other systems to establish larger systems for monitoring the illegal poaching and smuggling of endangered species.
In this study, we focused on five species of parrots designated as endangered by the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES). The illegal trade of endangered parrot species has become rampant internationally [4]. The five species of parrots are listed in Table 1. The remainder of this paper is organized as follows. The relevant literature is reviewed and the originality of the developed system is described in Section 2. The components and development of the system are detailed in Section 3. The construction of the object detection model is described in Section 4. The experimental results are presented in Section 5. Finally, the main conclusions and future work are summarized in Section 6.
Table 1. Species information

Scientific Name | CITES Appendix |
---|---|
Ara chloroptera | II |
Ara ararauna | II |
Cacatua galerita | II |
Cacatua goffiniana | I |
Psittacus erithacus | I |
Unmanned cameras are typically used to obtain video recordings of wild animals. However, human labor is required to identify the animals contained in the acquired images, which is time-consuming and inevitably subjective. Researchers have attempted to resolve this problem by building systems that can process the data automatically. Nguyen et al. proposed a deep learning model that can classify the animal species observed in Victoria, Australia [1].
Zhuang et al. presented a model that can handle images of marine species without human input and tested its performance on the SeaCLEF2017 dataset [2]. Norouzzadeh et al. noted that much of the video data from the Snapshot Serengeti project remained unprocessed because the labeling depended on volunteer participants; they showed that a deep learning model can identify the number of objects and even the behavior of each object in an image [3]. Byeong-hyeok and Sun-hyeon described the identification of five species mostly found in Sobaeksan National Park using an object recognition model [5].
There are four main domains in computer vision: simple classification of the objects in an image, localization of the objects in an image, object detection, which classifies and locates multiple objects in an image, and semantic segmentation, which detects objects by classifying their pixels. Object detection algorithms can be divided into two categories. The first consists of two-step detection algorithms, which first propose regions that may contain objects and then perform detection within those regions. The other consists of one-step detection algorithms, which perform detection over the entire image in a single pass. Ren et al. presented a method that generates region proposals using a region proposal network and performs detection in the proposed regions via the Fast R-CNN framework [6]. Redmon et al. reported a unified convolutional network for classifying and localizing objects in an image [7]. One-step algorithms are generally faster than two-step algorithms but less accurate. Nevertheless, because one-step algorithms have improved considerably, we decided to employ YOLO for our system.
In this section, the components of our system are described through several diagrams.
The system developed in this study consists of four main parts, as shown in Fig. 1. In the first part, a camera interface is used to acquire images. In the second part, a server receives the images acquired from the cameras and submits requests for image analysis to a deep learning server. In the third part, the deep learning server analyzes the images. The fourth part is a database to store the analyzed images. The system flow is illustrated in Fig. 2. Images are acquired from the web cameras and sent to the server. The server requests image data analysis from the deep learning server through routing. The deep learning server analyzes the images and sends the results to the server. The database stores the results and image data. Finally, the server returns the results to the origin of the images.
A webcam is used to transmit images to the server in the camera interface, instead of collecting images from physical storage devices in batches. The OpenCV computer vision library is used to retrieve video data from a webcam connected to a laptop [8]. OpenCV is an open-source library that contains hundreds of computer vision algorithms and typically includes the following functionalities:
Image processing (imgproc): Modules that perform image transformations such as geometric image conversion or color changes.
Video analysis (video): Modules for video analysis such as motion estimation, background subtraction, and object tracking algorithms.
Video I/O (videoio): An interface that facilitates the use of features such as video capture and video codecs.
The OpenCV library can be used to create an object that contains the camera module information through a class called VideoCapture. Invoking the read method on this object retrieves video frames from the webcam. After the images are obtained, the server receives requests that match the relevant routing rules and starts the related logic.
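The following is a minimal sketch of this capture-and-forward loop, assuming a hypothetical upload endpoint on the server; the actual routing rule and transfer format are defined by the server described in the next subsection.

```python
import cv2
import requests

SERVER_URL = "http://localhost:5000/detect"    # hypothetical endpoint; set to the real server route

cap = cv2.VideoCapture(0)                      # 0 selects the first connected webcam
try:
    while True:
        ok, frame = cap.read()                 # grab one BGR frame from the webcam
        if not ok:
            break
        ok, buf = cv2.imencode(".jpg", frame)  # encode the frame as JPEG for transmission
        if not ok:
            continue
        requests.post(SERVER_URL, files={"image": buf.tobytes()})
finally:
    cap.release()                              # release the camera handle
```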
The server receives image data from the webcam and sends them to the deep learning server for analysis. The server then receives the results and stores them in the database. Although the deep learning modules can also be loaded on the server, the server and deep learning server are separated for the following reasons:
To facilitate maintenance through the separation of modules by function.
To build the system in a hierarchical structure to facilitate the addition of new functions in the future.
To utilize TensorFlow graph computation, which has speed advantages.
The routing rules are added with Python decorators from the Flask library [9]. The image data to be transmitted are first transformed into an array of pixel values by calling the img_to_array method. Next, JSON data consisting of a keyword and the pixel array values are sent to the deep learning server using the POST method. The results are returned after the operations on the deep learning server are completed.
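A minimal sketch of such a route is shown below. The route path, the JSON key, and the deep learning server address are placeholders; posting an "instances" field to /v1/models/&lt;name&gt;:predict follows the TensorFlow Serving REST convention and is only one possible request format.

```python
import requests
from flask import Flask, request, jsonify
from PIL import Image
from tensorflow.keras.preprocessing.image import img_to_array

app = Flask(__name__)
DL_SERVER_URL = "http://localhost:8501/v1/models/parrot_detector:predict"  # hypothetical address

@app.route("/detect", methods=["POST"])        # routing rule added with a Flask decorator
def detect():
    # Convert the uploaded image into an array of pixel values
    img = Image.open(request.files["image"].stream).convert("RGB")
    pixels = img_to_array(img)
    # Wrap the pixel values in JSON and forward them to the deep learning server
    payload = {"instances": [pixels.tolist()]}
    resp = requests.post(DL_SERVER_URL, json=payload)
    return jsonify(resp.json())                # return the detection results to the caller
```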
The TensorFlow Serving API includes a function that responds to requests using a model expressed as a TensorFlow graph [10]. The process of converting a model into a TensorFlow graph and starting the service is as follows. First, the object detection model is built and trained using the TensorFlow library. The trained model is then exported as a TensorFlow graph using the library. A service for the graph can be started with the tensorflow_model_server command. The service acts as a director for the deep learning operations: once it has been constructed, it performs the tasks requested by clients. All the tasks in the TensorFlow Serving API are performed separately, as shown in Fig. 3.
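As an illustration, the export and serving steps might look as follows; the model file name, export directory, and model name are assumptions rather than values from the paper.

```python
import tensorflow as tf

# Load the trained detector and export it in the SavedModel (graph) format that
# tensorflow_model_server can load. The version subdirectory ("1") is required
# by TensorFlow Serving.
model = tf.keras.models.load_model("parrot_detector.h5")      # hypothetical file name
tf.saved_model.save(model, "serving/parrot_detector/1")

# The service can then be started from the command line, for example:
#   tensorflow_model_server --rest_api_port=8501 \
#       --model_name=parrot_detector \
#       --model_base_path=/absolute/path/to/serving/parrot_detector
```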
MySQL is a widely used relational database [11]. As a relational database, MySQL stores and manages data in a structured, tabular form. To allow the images to be stored in the database and the desired data to be queried according to conditions such as date and class, we store the images in byte form as blobs in a database table. The BytesIO class and the base64 module are used to convert the image data to bytes. The images, detection results, and date information are grouped into a data frame and stored in the database.
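A simplified sketch of this storage step is given below; the connector library (pymysql), the table name, and the column names are illustrative assumptions.

```python
import base64
import datetime
from io import BytesIO

import pymysql
from PIL import Image

# Hypothetical table: detections(img BLOB, result TEXT, created_at DATETIME)
conn = pymysql.connect(host="localhost", user="user", password="pw", db="parrots")

def save_result(image: Image.Image, detections: str) -> None:
    buf = BytesIO()
    image.save(buf, format="JPEG")            # serialize the image to bytes in memory
    blob = base64.b64encode(buf.getvalue())   # base64-encode the bytes for storage
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO detections (img, result, created_at) VALUES (%s, %s, %s)",
            (blob, detections, datetime.datetime.now()),
        )
    conn.commit()                             # persist the row
```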
The overall system was developed based on the model-view-controller (MVC) pattern. The MVC pattern is a software design pattern that separates the user interface from the processing logic so that each can be maintained without affecting the other. Each component is described below (Fig. 4):
Controller: Changes the status of the model by sending commands to the model. The server and the deep learning server constitute the controller.
Model: The view or controller reads the status of the model and processes the associated logic. Here, the database constitutes the model.
View: The view generates the results for viewing by the user. The camera interface constitutes the view.
The details of the system development environment are presented below:
CPU: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10 GHz
GPU: GeForce RTX 2080 Ti (×2)
OS: Ubuntu 16.04 LTS.
All the components of the system were developed in Anaconda using the Python language. The code was written in a Jupyter Notebook.
Many researchers have analyzed image data using computer vision methods based on deep learning. Attempts have been made to extract features that are invariant to changes in the size and rotation of the image [12] and to extract characteristic vectors from the magnitude and direction of the edges in the image [13]. LeCun et al. proposed a method for training a convolutional neural network through backpropagation [14], and convolutional neural networks subsequently became mainstream in computer vision research. In particular, He et al. surpassed human recognition capability with a top-5 error rate of only 3.57% in the 2015 ImageNet Large Scale Visual Recognition Challenge classification task [15]. However, simple classification still proposes a wrong answer when the correct answer is not in the answer group, and when there are multiple objects in an image, the result is determined only by the object with the greatest influence on the image features. Some approaches overcome these disadvantages by also considering the locations of the objects in the image using convolutional neural networks [6, 16, 17]. In this study, object detection was performed using the YOLO model, which has advantages in terms of both speed and accuracy [7, 18, 19]. This section explains the training of the YOLO model implemented on the deep learning server described in the previous section. The training process is illustrated in Fig. 5.
A deep learning model requires substantial data to achieve good performance. To train a model that can recognize handwritten digits from 0 to 9 with more than 90% accuracy, 60,000 training images and 10,000 test images were required [20]. The ImageNet dataset used in many studies consists of 14,197,122 images belonging to more than 20,000 classes [21, 22]. The amount of data can be increased in two ways: acquiring more data and transforming the acquired data to create new data.
In addition to the data obtained at Seoul Grand Park, we developed an image crawling program to obtain sufficient data to train the deep learning models. Websites such as Google and Yahoo provide pages that list images associated with keywords entered by the user, and these pages are accessible through several crawling libraries that enable users to acquire the images. The number of images gathered from crawling and from the zoo was 12,770, that is, 2,554 images per species. However, only 1,250 images per species were used for training (see Section 5). We manually annotated the location of every parrot in the images.
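The paper does not name a specific crawling library; as one example of how such images can be collected, the icrawler package can download search results for each species keyword (the directory names and image counts below are illustrative).

```python
from icrawler.builtin import GoogleImageCrawler

SPECIES = ["Ara chloroptera", "Ara ararauna", "Cacatua galerita",
           "Cacatua goffiniana", "Psittacus erithacus"]

for name in SPECIES:
    # Download images returned for the species keyword into a per-species folder
    crawler = GoogleImageCrawler(storage={"root_dir": f"raw/{name}"})
    crawler.crawl(keyword=f"{name} parrot", max_num=1000)
```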
The data can be manipulated using many techniques to create new images, for example, by rotation or parallel translation or by changing the chroma, color, or brightness of the image. The deliberate mixture of noise with the original data to produce new images has also been studied [23]. Girshick used a technique called image warping to change the size of the original image, as well as a technique in which only the area of the original image that contains objects is cropped out [24]. Data augmentation has been applied in many studies because of constraints in obtaining sufficient data for deep learning [25, 26]. In this study, a large amount of data was obtained by applying data augmentation to the original data obtained from websites and the zoo. We applied only horizontal and vertical flipping, so the number of training images per species became 5,000, four times the original 1,250. The total number of training images was 25,000, as shown in Fig. 6.
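A sketch of this flip-based augmentation is shown below: each original image yields a horizontally flipped, a vertically flipped, and a doubly flipped copy, quadrupling the set. (In practice, the bounding-box annotations must be flipped in the same way; that step is omitted here.)

```python
import glob
import os

import cv2

for path in glob.glob("train/*.jpg"):
    img = cv2.imread(path)
    stem, ext = os.path.splitext(path)
    cv2.imwrite(f"{stem}_h{ext}",  cv2.flip(img, 1))   # horizontal flip
    cv2.imwrite(f"{stem}_v{ext}",  cv2.flip(img, 0))   # vertical flip
    cv2.imwrite(f"{stem}_hv{ext}", cv2.flip(img, -1))  # horizontal + vertical flip
```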
Images can also be acquired by building a model that creates non-existent images [27]. However, the process is very complex and beyond the scope of this study.
We used MobileNet as the model architecture based on the approach in [28]. MobileNet has the following features:
Channel reduction: Because the operation slows down if the model is too large, an appropriate number of channels is maintained.
Depthwise separable convolution: This reduces the number of parameters in the filters and extracts significant features with independent filters.
The model was constructed based on the MobileNet architecture and the YOLO style, as shown at the bottom of Fig. 5. Predictions were made on feature maps at three scales, following the concept of the pyramidal feature hierarchy reported in [29]. Objects of varying sizes can be detected because the pyramidal feature hierarchy outputs a feature map of a different scale at each level, which allows the convolutional neural network to make more accurate predictions on each feature map instead of predicting on a single feature map from the last layer, as shown in Fig. 7. The model has 181 layers and 7,728,666 parameters in total, of which 7,684,506 are trainable and 44,160 are not. The shapes of the three output layers are 13×13×30, 26×26×30, and 52×52×30. The predictions contain the coordinates of the bounding boxes and the class probabilities of the objects. Because the entire network is too large to be shown, a part of the network is shown in Fig. 8.
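The channel depth of 30 in each output layer follows from 3 anchor boxes per grid cell, each carrying 4 box coordinates, 1 objectness score, and 5 class scores: 3 × (4 + 1 + 5) = 30. The sketch below only illustrates how three heads of these shapes can be attached to MobileNet feature maps at strides 32, 16, and 8; the 416×416 input size and the chosen layer taps are our assumptions, and the upsampling and concatenation path of a full YOLO neck is omitted.

```python
import tensorflow as tf

NUM_ANCHORS, NUM_CLASSES = 3, 5
DEPTH = NUM_ANCHORS * (5 + NUM_CLASSES)   # 3 * (4 coords + 1 objectness + 5 classes) = 30

def head(x, name):
    # 1x1 convolution mapping a feature map to raw YOLO predictions
    return tf.keras.layers.Conv2D(DEPTH, 1, name=name)(x)

inputs = tf.keras.Input(shape=(416, 416, 3))
backbone = tf.keras.applications.MobileNet(input_tensor=inputs, include_top=False, weights=None)
f13 = backbone.get_layer("conv_pw_13_relu").output   # stride 32 -> 13x13 grid
f26 = backbone.get_layer("conv_pw_11_relu").output   # stride 16 -> 26x26 grid
f52 = backbone.get_layer("conv_pw_5_relu").output    # stride 8  -> 52x52 grid

model = tf.keras.Model(inputs, [head(f13, "y13"), head(f26, "y26"), head(f52, "y52")])
model.summary()  # output shapes: (13, 13, 30), (26, 26, 30), (52, 52, 30)
```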
One of the difficulties in creating a well-trained deep learning model is that a vast amount of data is required. Possible solutions include creating new images by data augmentation and data crawling, as mentioned in Section 4.A. Another frequently used method is to reuse and fine-tune convolutional layers from networks that were previously trained with large amounts of data from a different domain, such as ImageNet, as shown in Fig. 9. This is known as transfer learning. Transfer learning has been employed in many studies to resolve problems caused by insufficient image data [30-32]. We also exploited weights pretrained on the ImageNet dataset by initializing our model layers with these weights before performing object detection on images of the endangered parrot species [21, 22]. Because the ImageNet dataset contains a variety of bird images, we expect this to improve the performance of our system.
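For example, the backbone can be created with ImageNet-pretrained weights before the detection heads are added and the whole network is fine-tuned on the parrot images; the input size below is one for which pretrained MobileNet weights are published and is only an illustration.

```python
import tensorflow as tf

backbone = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3),   # a size with published ImageNet-pretrained weights
    include_top=False,           # drop the ImageNet classification head
    weights="imagenet",          # initialize with ImageNet-pretrained weights
)
backbone.trainable = True        # fine-tune all layers on the parrot dataset
```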
YOLO extracts the feature maps from the final three layers and predicts the areas in which objects are included. Each feature map is divided into an N × N grid to identify areas where objects are likely to be, and each grid cell contains a number of boxes, called anchor boxes, that represent candidate areas. For each anchor box, the network predicts the center coordinates x and y, the width w and height h of the bounding box surrounding the object, an objectness score indicating the probability that an object exists, and the softmax probabilities for the classes of the object in the bounding box. The predicted values are compared with the ground truth values to obtain the total loss, and the network parameters are then updated through backpropagation using this loss.
Fig. 10 shows the loss function used to train the model. The values of the coordinate, objectness, and classification errors are summed to obtain the total loss. (The authors of [19] did not provide an explicit formula for the loss function, so there are some differences between implementations.)
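Because no canonical formula is given in [19], the following is only one commonly used YOLOv3-style formulation, written here as an assumption about what Fig. 10 depicts: a weighted sum of localization, objectness, and classification terms over all grid cells i and anchor boxes j.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\mathrm{coord}} \sum_{i}\sum_{j} \mathbb{1}^{\mathrm{obj}}_{ij}
      \big[(x_{ij}-\hat{x}_{ij})^{2}+(y_{ij}-\hat{y}_{ij})^{2}
      +(w_{ij}-\hat{w}_{ij})^{2}+(h_{ij}-\hat{h}_{ij})^{2}\big] \\
&+ \sum_{i}\sum_{j} \mathbb{1}^{\mathrm{obj}}_{ij}\,
      \mathrm{BCE}\big(C_{ij},\hat{C}_{ij}\big)
 + \lambda_{\mathrm{noobj}} \sum_{i}\sum_{j} \mathbb{1}^{\mathrm{noobj}}_{ij}\,
      \mathrm{BCE}\big(C_{ij},\hat{C}_{ij}\big) \\
&+ \sum_{i}\sum_{j} \mathbb{1}^{\mathrm{obj}}_{ij}
      \sum_{c \in \mathrm{classes}} \mathrm{CE}\big(p_{ij}(c),\hat{p}_{ij}(c)\big)
\end{aligned}
```

Here, the indicator 1(obj) marks the anchor box responsible for a ground-truth object, C is the objectness score, p(c) are the class probabilities, and the lambda terms are weighting factors.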
Approximately half of the 2,554 images for each species were used for the model evaluation: 1,304 images per species were used to test the model. We did not apply the data augmentation described in the previous section to the test images because doing so would not have been meaningful. The pixel values of all images were divided by 255 so that forward and backward propagation could be performed quickly; because the maximum RGB value of a pixel is 255, all pixel values then lie between 0 and 1.
We evaluated our model using the custom data described in Section 4-A. The model evaluation was conducted using 2 GPUs; see Section 3-C. The evaluation metrics are presented in Section 5-C.
The numbers of each parrot species in the entire image dataset are presented in Table 2. The most common species is Ara ararauna, with 1,657 parrots in the dataset, and the least common is Cacatua goffiniana, with only 1,406 parrots; the difference between the two is 251. Because the number of objects differs for each species, the evaluation must take the number of objects into account.
Table 2. Number of parrots of each species in the dataset

Scientific Name | Number of Parrots in Data |
---|---|
Ara chloroptera | 1,651 |
Ara ararauna | 1,657 |
Cacatua galerita | 1,456 |
Cacatua goffiniana | 1,406 |
Psittacus erithacus | 1,564 |
The main measurements used to evaluate the object detection performance of the models are the mean average precision (mAP), the recall, and the number of frames processed per second (FPS) [6, 19]. The average precision summarizes, for each class, the precision of the predicted bounding boxes. The precision is obtained by dividing the number of true positives by the sum of the numbers of true positives and false positives. True positives, false positives, false negatives, and true negatives are defined in Table 3. For example, if a predicted value is positive and the corresponding ground truth is also positive, it is a true positive.
Table 3. Definitions of true/false positives and negatives

 | Actual Positive | Actual Negative |
---|---|---|
Predicted Positive | True Positive | False Positive |
Predicted Negative | False Negative | True Negative |
The average precision (AP) of each class is computed from the precision of the predicted bounding boxes, and the mAP is the mean over all classes, as defined below. When the IoU threshold was set to 0.5, the mAP was approximately 86.79% (Table 4). The recall rate is the number of true positives divided by the sum of the numbers of true positives and false negatives, also defined below; the mean recall rate in Table 5 is 93.23%. The FPS is the number of images processed by the network per second, and 13 FPS was achieved in this study.
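The standard definitions corresponding to these quantities are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{mAP} = \frac{1}{N_{\mathrm{classes}}} \sum_{k=1}^{N_{\mathrm{classes}}} \mathrm{AP}_{k}
```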
Table 4. Average precision per species (IoU threshold = 0.5)

Scientific Name | Average Precision |
---|---|
Ara chloroptera | 89% |
Ara ararauna | 88% |
Cacatua galerita | 87% |
Cacatua goffiniana | 86% |
Psittacus erithacus | 84% |
Table 5. Recall rate per species

Scientific Name | Recall Rate |
---|---|
Ara chloroptera | 96% |
Ara ararauna | 94% |
Cacatua galerita | 93% |
Cacatua goffiniana | 93% |
Psittacus erithacus | 91% |
Note that a direct comparison with other detection systems is not possible because of differences in three aspects: the backbone architecture used to extract the features (MobileNet vs. others), the method used to detect the objects (YOLO vs. others), and the training and inference data. Table 6 therefore lists the performance as reported in each paper (details in [19, 28]). We expect that our system performance will improve with further optimization.
Table 6. Performance reported for each architecture and method

Architecture | Method | Data | mAP | Time |
---|---|---|---|---|
Darknet | YOLO | MS COCO | 51.5 | 22 |
MobileNet | SSDLite | MS COCO | 22.1 | 200 |
MobileNet | YOLO | Custom | 86 | 13 |
Fig. 12 shows some result images.
Fig. 13 shows the results of the response from the server. The name of the image is presented, and the object detection results are shown in parentheses. The detection results comprise the x and y coordinates of the top left corner, the x and y coordinates of the lower right corner, the predicted classes, and the probability value.
Recently, many animal species have suffered from declining populations or have become extinct owing to various factors. Identifying the habits, habitats, and populations of each species is essential for maintaining biodiversity. Unmanned cameras have been installed in previous studies to acquire video data, and recent advances in deep learning technology and infrastructure have improved data processing, but previous studies faced limitations because of the use of simple classification. In this study, we employed object detection instead of simple classification to establish a system for receiving, analyzing, and storing image data. In particular, the system was tested with five endangered parrot species listed by CITES, and the model showed good performance. We expect that this system will increase the efficiency of research on endangered species, and we further hope that it will be used as part of a larger system, such as one for monitoring illegal poaching and smuggling.
The system in this study was built by applying only a single architecture and object detection method; however, we are confident that a better detection system can be achieved if we experiment with more architectures and object detection methods in our future research.
Received his B.S. degree from the Department of Computer Science, Sangmyung University, Seoul, Korea in 2019. He is currently pursuing his M.S. degree from the Department of Computer Science, Sangmyung University, Seoul, Korea. His current research interests include mobile system development and computer vision.
Received his Ph.D. degree from the Department of Biomedical Engineering, Yonsei University, Seoul, Korea in 2008. He is currently an associate professor in the Department of Intelligent Engineering Informatics for Human and director of the Institute of Intelligence Informatics Technology, Sangmyung University, Seoul, Korea. His research interests include biomedical engineering and human-computer interaction.