
Journal of information and communication convergence engineering 2020; 18(4): 267-277

Published online December 31, 2020

https://doi.org/10.6109/jicce.2020.18.4.267

© Korea Institute of Information and Communication Engineering

Deep Learning-based Image Data Processing and Archival System for Object Detection of Endangered Species

Dea-Gyu Choe, Dong-Keun Kim

Sangmyung University, Sangmyung University

Received: October 31, 2020; Accepted: December 22, 2020

Abstract

It is important to understand the exact habitat distribution of endangered species because of their decreasing numbers. In this study, we build a system with a deep learning module that collects the image data of endangered animals, processes the data, and saves the data automatically. The system provides a more efficient way than human effort for classifying images and addresses two problems faced in previous studies. First, specious answers were suggested in those studies because the probability distributions of answer candidates were calculated even if the actual answer did not exist within the group. Second, when two or more entities were present in an image, only a single entity was focused on. We applied an object detection algorithm (YOLO) to resolve these problems. Our system achieves a mean average precision of 86.79%, a mean recall rate of 93.23%, and a processing speed of 13 frames per second.

Keywords: System Design, System Development, Object Detection, Endangered Species

I. INTRODUCTION

Recently, many animals on the planet have become extinct owing to factors such as climate change, and many studies on the maintenance of biodiversity have been conducted. To collect the information needed for these studies, unmanned cameras have been installed and video recordings of living creatures obtained, but huge amounts of human labor were needed to process the acquired data. The processing was slow and inefficient because it had to be performed using human perception and judgment. This issue has led to the emergence of research on machine systems for automatically processing and distinguishing animal images.

Nguyen et al. proposed a convolutional neural network system for classifying the three most commonly observed animal species in Victoria, Australia [1]. Zhuang et al. introduced a deep learning model that automatically categorizes and annotates marine biological image data without the need for manual processing by experts, and conducted experiments using data from SeaCLEF2017 [2]. In another study, Nguyen et al. considered two experimental scenarios for classifying images of wild animals using model architectures based on Lite AlexNet, VGG-16, and ResNet-50 [3]. In the first scenario, the model was trained from scratch; in the second, a technique called fine-tuning was used, in which weights pre-trained on ImageNet, a dataset containing a very large number of images, were loaded into the model before fitting it to the target data. Such pre-training techniques are widely used when monitoring and classifying large amounts of animal image data because they provide local image features learned from large-scale datasets.

Two problems arise in wildlife image classification when simple classification is used. First, because the classification results indicate the probabilistic similarity of how close the object in the image is to various correct answer candidates, the model suggests the best wrong answer even when the correct answer is not in the candidate group. Second, the answer is given for only one object, even if multiple objects exist within the image. In this study, we propose an object-detection approach to address these problems. We establish a system for processing and storing image data of endangered species using YOLO, a well-known object detection algorithm.

The contributions of this study are as follows: First, we automated the acquisition, processing, and storage of image data for investigating the ecological preservation of endangered species. Second, by replacing the simple classification algorithm with an object detection algorithm in our new system, the system can overcome the limitations of current classification systems. Third, the developed system can be linked to other systems to establish larger systems for monitoring the illegal poaching and smuggling of endangered species.

In this study, we focused on five species of parrots designated as endangered species by the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES). The illegal trade of endangered parrot species has become rampant internationally [4]. The five types of parrots are listed in Table 1. The remainder of this paper is organized as follows. The relevant literature is reviewed in Section 2 and the originality of the developed system is described. The components and the development of the system are detailed in Section 3. The construction of the object detection model is described in Section 4. The results of the experiments are presented in Section 5. Finally, the main conclusions of the study and future work are summarized in Section 6.

Table 1. Five CITES endangered parrot species (pictures omitted)

Scientific Name         Appendix
Ara Chloroptera         II
Ara Ararauna            II
Cacatua Galerita        II
Cacatua Goffiniana      I
Psittacus Erithacus     I

II. RELATED WORKS

A. Image Data Processing System for Endangered Species

Unmanned cameras are typically used to obtain video recordings of wild animals. However, human labor is required to determine the animal objects contained in the acquired images. This incurs time costs and is inevitably subjective because it relies on human judgment. Researchers have attempted to resolve this problem by building systems that can process the data automatically. Nguyen et al. proposed a deep learning model that can classify the animal species observed in Victoria, Australia [1].

Zhuang et al. presented a model that can handle images of marine species without human input and tested its performance on the SeaCLEF2017 dataset [2]. Norouzzadeh et al. noted that much of the video data from the Snapshot Serengeti project remained unprocessed because processing the data obtained from the cameras relied on volunteer participants. They claimed that a deep learning model can identify the number of objects and even the behavior of each object in an image [3]. Kim and Yu described the identification of five species mostly found in the Sobaeksan National Park using an object recognition model [5].

B. Object Detection Process

There are four main domains in computer vision: simple classification of objects in an image, localization of objects in an image, object detection for the classification and localization of multiple objects in an image, and semantic segmentation for detecting objects in an image by classifying their pixels. Among these, object detection algorithms can be divided into two categories. The first category consists of two-step detection algorithms, which first propose regions that may contain objects and then perform detection in those regions. The other category consists of one-step detection algorithms, which perform detection over the entire image in a single pass. Ren et al. presented a method that generates region proposals using a region proposal network and performs detection tasks in the proposed regions via the Fast R-CNN framework [6]. Redmon et al. reported a unified convolutional network for classifying and localizing objects in an image [7]. It is well known among researchers that one-step algorithms are faster than two-step algorithms, but not as accurate. Nevertheless, because one-step algorithms have improved substantially in recent years, we decided to employ YOLO for our system.

III. SYSTEM DESIGN AND CONFIGURATION

In this section, the components of our system are described through several diagrams.

A. System Design

The system developed in this study consists of four main parts, as shown in Fig. 1. In the first part, a camera interface is used to acquire images. In the second part, a server receives the images acquired from the cameras and submits requests for image analysis to a deep learning server. In the third part, the deep learning server analyzes the images. The fourth part is a database to store the analyzed images. The system flow is illustrated in Fig. 2. Images are acquired from the web cameras and sent to the server. The server requests image data analysis from the deep learning server through routing. The deep learning server analyzes the images and sends the results to the server. The database stores the results and image data. Finally, the server returns the results to the origin of the images.

Fig. 1.

System component block diagram.


Fig. 2.

System sequence diagram.


1) Camera Interface

A webcam is used to transmit images to the server in the camera interface instead of using physical storage devices on a batch basis. The OpenCV computer vision library is used to retrieve video data from a webcam connected to a laptop [8]. OpenCV is an open-source library that contains hundreds of computer vision algorithms. The library typically includes the following functionalities:

  • Image processing (imgproc): Modules that perform image transformations such as geometric image conversion or color changes.

  • Video analysis: Modules for video analysis such as motion measurements, background removal, and object tracking algorithms.

  • Video I/O (video): An interface that facilitates the use of features such as video capture and video codecs.

The OpenCV library can be used to create an object that contains the camera module information through a class called VideoCapture. Invoking the read method of this object acquires a video frame from the webcam. After the images are obtained, the server receives requests that match the relevant routing rules and starts the related logic.
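As a rough illustration of this camera interface, the sketch below grabs frames with OpenCV and forwards them to the server over HTTP. The device index, endpoint URL, and JPEG encoding are assumptions made for the sketch; the actual route is whatever the server's Flask rules define.

```python
import cv2
import requests

SERVER_URL = "http://localhost:5000/detect"    # hypothetical server route

cap = cv2.VideoCapture(0)                      # open the default webcam
try:
    while True:
        ok, frame = cap.read()                 # acquire one video frame
        if not ok:
            break
        ok, buf = cv2.imencode(".jpg", frame)  # compress the frame to JPEG
        if ok:
            requests.post(SERVER_URL, files={"image": buf.tobytes()})
finally:
    cap.release()
```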

2) Server

The server receives image data from the webcam and sends them to the deep learning server for analysis. The server then receives the results and stores them in the database. Although the deep learning modules can also be loaded on the server, the server and deep learning server are separated for the following reasons:

  • To facilitate maintenance through the separation of modules by function.

  • To build the system in a hierarchical structure to facilitate the addition of new functions in the future.

  • To utilize Tensorflow graph computation, which has speed advantages.

The routing rules are added via Python decorators from the Flask library [9]. The image data to be transmitted are first transformed into an array of pixel values by calling the img_to_array method. Next, JSON data consisting of a key word and the pixel array values are sent to the deep learning server using the POST method. The results are returned after the operations on the deep learning server are completed.
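A minimal sketch of this relay logic is given below; the /detect route, the model name parrot_detector, and the preprocessing details are assumptions for illustration rather than the authors' exact code.

```python
import cv2
import numpy as np
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
TF_SERVING_URL = "http://localhost:8501/v1/models/parrot_detector:predict"  # assumed name/port

@app.route("/detect", methods=["POST"])          # routing rule added via a decorator
def detect():
    raw = request.files["image"].read()
    frame = cv2.imdecode(np.frombuffer(raw, np.uint8), cv2.IMREAD_COLOR)
    pixels = cv2.resize(frame, (416, 416)).astype("float32") / 255.0
    payload = {"instances": [pixels.tolist()]}   # JSON body for TensorFlow Serving
    result = requests.post(TF_SERVING_URL, json=payload).json()
    # ...store the image and the detection result in the database here...
    return jsonify(result)
```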

3) Deep Learning Server for Image Analysis

The TensorFlow Serving API includes a function that responds to requests using a model expressed as a TensorFlow graph [10]. The process for converting a model to a TensorFlow graph and starting the service is as follows. First, the object detection model is built and trained using the TensorFlow library. The trained model is then exported as a TensorFlow graph using the library. A service for the exported graph can be started with the tensorflow_model_server command. The service acts as the entry point for the deep learning operations: once it has been constructed, it performs tasks in response to client requests. All the tasks in the TensorFlow Serving API are performed separately, as shown in Fig. 3.
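The sketch below shows one way to export a trained Keras model in SavedModel format and launch it with tensorflow_model_server; the file paths and model name are illustrative assumptions, not the exact commands used by the authors.

```python
import tensorflow as tf

# Load the trained detection model and export it in SavedModel format.
# The version subdirectory ("1") is expected by TensorFlow Serving.
model = tf.keras.models.load_model("parrot_yolo.h5", compile=False)  # hypothetical file
tf.saved_model.save(model, "serving/parrot_detector/1")

# Then start the service from a shell (REST API on port 8501):
#   tensorflow_model_server --rest_api_port=8501 \
#       --model_name=parrot_detector \
#       --model_base_path=/absolute/path/to/serving/parrot_detector
```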

Fig. 3.

Tensorflow Serving component diagram


4) Database System

MySQL is a widely used relational database [11]. As a relational database, MySQL can store and manage data in a structured form. To allow the images to be stored in the database and the desired data to be queried and retrieved according to conditions such as the date and class, we decided to store the images in byte form as BLOBs in the database table. The BytesIO class and the base64 module are used to convert the image data to byte form. The images, detection results, and date information are grouped into a data frame and stored in the database.
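A sketch of this storage step is shown below, using BytesIO and base64 as described; the table and column names, and the use of the PyMySQL client, are assumptions made for illustration.

```python
import base64
import datetime
from io import BytesIO

import pymysql
from PIL import Image

def save_detection(image: Image.Image, result: str) -> None:
    buffer = BytesIO()
    image.save(buffer, format="JPEG")              # serialize the image to bytes
    blob = base64.b64encode(buffer.getvalue())     # base64-encoded bytes for the BLOB column

    conn = pymysql.connect(host="localhost", user="user",
                           password="password", database="endangered")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO detections (image, result, created_at) VALUES (%s, %s, %s)",
                (blob, result, datetime.datetime.now()),
            )
        conn.commit()
    finally:
        conn.close()
```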

B. System Configuration

The overall system was developed based on the model-view-controller (MVC) pattern. The MVC pattern is a system design pattern used in software engineering to separate the user interface from the processing logic so that they can be maintained without affecting each other. Each component is described below (Fig. 4):

Fig. 4.

System MVC block diagram.


  • Controller: Changes the status of the model by sending commands to the model. The server and the deep learning server constitute the controller.

  • Model: The view or controller reads the status of the model and processes the associated logic. Here, the database constitutes the model.

  • View: The view generates the results for viewing by the user. The camera interface constitutes the view.

C. System Environment

The details of the system development environment are presented below:

  • CPU: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10 GHz

  • GPU: GeForce RTX 2080 Ti (×2)

  • OS: Ubuntu 16.04 LTS.

All the components of the system were developed in Anaconda using the Python language. The code was written in a Jupyter Notebook.

IV. OBJECT DETECTION MODEL DESIGN

Many researchers have analyzed image data using computer vision methods based on deep learning. Attempts have been made to extract features that are invariant to changes in the size and rotation of the image [12] and to extract characteristic vectors by considering the variance of the size and direction of the edges in the image [13]. LeCun et al. proposed a method for training a convolutional neural network through backpropagation [14]. Convolutional neural networks subsequently became mainstream in computer vision studies. In particular, He et al. surpassed human recognition capability with an error rate of only 3.57% in the ImageNet Large Scale Visual Recognition image dataset classification competition in 2015 [15]. However, simple classification still faces the problem of the proposal of wrong answers if there is no correct answer in the answer group. In addition, when there are multiple objects in an image, the results are determined based on only the object with the most influence on the image characteristics. Some approaches to overcome the disadvantages of simple classification consider the locations of the objects in the image using convolutional neural networks [6, 16, 17]. In this study, the object detection was performed using the YOLO model, which has advantages in terms of both speed and accuracy [7, 18, 19]. The training of the YOLO model implemented on the deep learning server described in the previous section is explained in this section. The process of training the model is illustrated in Fig. 5.

Fig. 5. Model training process diagram.

A. Data Augmentation

A deep learning model requires substantial data to achieve good performance. For example, training a model that recognizes hand-written digits from 0 to 9 with more than 90% accuracy required 60,000 training images and 10,000 test images [20]. The ImageNet dataset used in many studies consists of 14,197,122 images belonging to more than 20,000 classes [21, 22]. The amount of data can be increased in two ways: acquiring more data and transforming the acquired data to create new data.

In addition to the data obtained at the Seoul Grand Park, we developed an image crawling program to obtain sufficient data to train the deep learning models. Websites such as Google and Yahoo provide pages that list images associated with the keywords entered by the user. These pages are accessible through several crawling libraries that enable users to acquire the images. The number of images gathered from crawling and from the zoo was 12,770, that is, 2,554 images per species. However, only 1,250 images per species were used for training (see Section 5). We manually annotated the location of every parrot in the images.

The data can be manipulated using many techniques to create new images, for example, by rotating or translating the images or by changing the image chroma, color, or brightness. The deliberate mixture of noise with the original data to produce new images has also been studied [23]. Girshick used a technique called image warping to change the size of the original image, as well as another technique in which only the area of the original image that contains objects is cropped out [24]. Data augmentation has been applied in many studies because of constraints in obtaining sufficient data for deep learning [25, 26]. In this study, a large amount of data was obtained by applying data augmentation to the original data obtained from websites and the zoo. We applied only horizontal and vertical flipping, so that the number of training images per species was 5,000, four times the original 1,250. The total number of training images was 25,000, as shown in Fig. 6.
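The flipping step can be sketched as follows. Bounding boxes are assumed to be (xmin, ymin, xmax, ymax) in pixel coordinates, which is an assumption of this sketch rather than the authors' annotation format.

```python
import cv2

def flip_augment(image, boxes):
    """Return the original image plus its horizontal, vertical, and
    horizontal+vertical flips, with bounding boxes flipped accordingly."""
    h, w = image.shape[:2]
    hf = lambda b: (w - b[2], b[1], w - b[0], b[3])         # mirror x coordinates
    vf = lambda b: (b[0], h - b[3], b[2], h - b[1])         # mirror y coordinates
    return [
        (image, boxes),
        (cv2.flip(image, 1), [hf(b) for b in boxes]),       # horizontal flip
        (cv2.flip(image, 0), [vf(b) for b in boxes]),       # vertical flip
        (cv2.flip(image, -1), [vf(hf(b)) for b in boxes]),  # both flips
    ]
```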

Fig. 6.

Examples of data augmentation.


Images can also be acquired by building a model that creates non-existent images [27]. However, the process is very complex and beyond the scope of this study.

B. Model Implementation

1) Model Architecture

We used MobileNet as the model architecture based on the approach in [28]. MobileNet has the following features:

  • Channel reduction: If the model has too many channels, computation is slowed down; therefore, an appropriate (reduced) number of channels is maintained.

  • Depthwise separable convolution: This reduces the number of parameters in the filters and extracts significant features with independent filters.

The model was constructed based on the MobileNet architecture and the YOLO style, as shown at the bottom of Fig. 5. Predictions were made using three-scale feature maps. This is the concept of the pyramidal feature hierarchy reported in [29]. Objects of varying sizes can be detected because the pyramidal feature hierarchy outputs different scale feature maps for each layer. This allows the convolutional neural network to perform more accurate predictions on each feature map instead of predictions on a single feature map from the last layer, as shown in Fig. 7. The total number of layers in the model was 181, and the total number of model parameters was 7,728,666, of which 7,684,506 parameters were trainable and 44,160 parameters were not. The shapes of the three output layers were 13×13×30, 26×26×30, and 52×52×30. The predictions contain the coordinates of the bounding box and the class probabilities of the object. Because the entire network is too large to be shown, we show a part of the network in our system in Fig. 8.
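For reference, the output depth of 30 is consistent with the YOLO convention of 3 anchor boxes per grid cell, each predicting 4 box offsets, 1 objectness score, and 5 class probabilities (3 × (4 + 1 + 5) = 30). The sketch below illustrates three-scale detection heads of this kind on a MobileNet backbone; the tapped layer names and head widths are assumptions, not the authors' exact 181-layer network.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_ANCHORS, NUM_CLASSES = 3, 5
DEPTH = NUM_ANCHORS * (4 + 1 + NUM_CLASSES)        # = 30 channels per output scale

backbone = tf.keras.applications.MobileNet(
    input_shape=(416, 416, 3), include_top=False, weights=None)

c5 = backbone.output                                # 13 x 13 feature map
c4 = backbone.get_layer("conv_pw_11_relu").output   # 26 x 26 feature map
c3 = backbone.get_layer("conv_pw_5_relu").output    # 52 x 52 feature map

def head(x, name):
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(DEPTH, 1, name=name)(x)    # 30-channel prediction map

model = tf.keras.Model(backbone.input,
                       [head(c5, "p5"), head(c4, "p4"), head(c3, "p3")])
model.summary()
```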

Fig. 7.

Prediction on single feature map and pyramidal feature hierarchy.


Fig. 8.

A part of the object detection network.


2) Transfer Learning

One of the limitations to creating a well-trained deep learning model is that a vast amount of data is required. Possible solutions include creating new images by data augmentation and by data crawling, as mentioned in Section 4.A. An additional method that is frequently used is to reuse and refine convolution layers from networks that have previously been trained on large amounts of data from a different domain, such as ImageNet, as shown in Fig. 9. This is known as transfer learning. Transfer learning has been employed in many studies to resolve problems caused by a lack of images [30-32]. We also exploited weights pretrained on the ImageNet dataset to perform object detection in images of endangered parrot species by initializing our model layers with these weights [21, 22]. Because the ImageNet dataset contains a variety of bird images, we expect this to improve the efficiency of our system.
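A minimal sketch of this initialization is shown below, assuming a Keras MobileNet backbone; freezing the early layers is one common option, not necessarily what the authors did.

```python
import tensorflow as tf

# Backbone initialized with ImageNet-pretrained weights (transfer learning).
# The paper's detection network uses a 416x416 input; 224x224 is used here
# only because it is the default size for the pretrained weights.
backbone = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

# Optionally freeze the earlier layers so that only the later layers and the
# detection heads are fine-tuned on the parrot images.
for layer in backbone.layers[:40]:
    layer.trainable = False
```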

Fig. 9.

Transfer learning method [33].


C. Model Training

YOLO extracts feature maps from the final three layers and predicts the areas in which objects are included. Each feature map is divided into an N × N grid to identify areas where objects are likely to be. Each grid cell contains a number of boxes, called anchor boxes, that represent candidate areas. For each box, five values are predicted: the objectness score, indicating the probability that an object exists, and the center coordinates x, y and the width and height w, h of the bounding box surrounding the object; in addition, soft-max probabilities for the classes of the object in the bounding box are predicted. The calculated values are compared with the ground truth values to obtain the total loss. The network model parameters are then updated through backpropagation using the loss.

Fig. 10 shows t_x, t_y, t_w, and t_h, which indicate the offsets of the center coordinates, width, and height, respectively. These offset values are predicted to obtain the center coordinates, width, and height of the bounding box for fitting the ground truth box; the logistic or natural exponential function is then applied to adjust the degree of the offset [18]. However, owing to improvements in the YOLO model, the loss function was changed to operate on the t values directly instead of on the bounding box values b_x, b_y, b_w, and b_h [19]. The refined formulas are given in (1)-(3). Note that the b values are the initial prediction targets, obtained by adjusting the t values, in the original YOLO model [18]; in the enhanced YOLO model, the method of producing predictions was changed in [19] such that the t values are regressed directly.
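To make the relation in Fig. 10 concrete, the sketch below converts predicted t values into box coordinates following b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·e^(t_w), b_h = p_h·e^(t_h); inverting these expressions yields (1)-(3). This follows the published YOLO formulation rather than the authors' exact implementation.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Convert raw network outputs (t values) into box center/size (b values)."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx       # box center x, in grid-cell units
    by = sigmoid(ty) + cy       # box center y, in grid-cell units
    bw = pw * math.exp(tw)      # box width, scaled from the anchor prior p_w
    bh = ph * math.exp(th)      # box height, scaled from the anchor prior p_h
    return bx, by, bw, bh

# Inverting these relations gives the regression targets in (1)-(3):
#   t_x = sigma^{-1}(b_x - c_x),  t_w = ln(b_w / p_w),  and similarly for y and h.
```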

Fig. 10.

Prediction of bounding box coordination [19].


t_x = σ^(-1)(b_x − c_x)    (1)

t_y = σ^(-1)(b_y − c_y)    (2)

t_w = ln(b_w / p_w),  t_h = ln(b_h / p_h)    (3)

The t values (t_x, t_y, t_w, t_h) in (1), (2), and (3) can be derived by inverting the expressions on the right in Fig. 10. The new loss function can then be inferred to be

L = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (t_{x,i} − t̂_{x,i})² + (t_{y,i} − t̂_{y,i})² ]
  + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (t_{w,i} − t̂_{w,i})² + (t_{h,i} − t̂_{h,i})² ]
  + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
  + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
  + Σ_{i=0}^{S²} 1_{i}^{obj} Σ_{c ∈ classes} (p_i(c) − p̂_i(c))².    (4)

(The authors of [19] did not provide an explicit formula for the loss function, so there are some differences between implementations.)

The λ_coord and λ_noobj terms in (4) are parameters that weight the loss for the coordinate values and the loss for the objectness score in the absence of an object. The indicator 1_{ij}^{obj} selects only the loss for the j-th bounding box of cell i; that is, if the j-th bounding box in the i-th cell contains the center coordinates of the object and has the highest overlap (IoU, defined in Fig. 11) with the ground truth box, 1_{ij}^{obj} is set to 1, and 0 otherwise. C_i indicates the objectness score, i.e., whether cell i includes the center coordinates of an object regardless of its class, and p_i(c) indicates the probability that the object belongs to class c. The YOLO loss thus consists of the loss values for the coordinates, width, and height of the bounding boxes, the loss values for the objectness scores, and the loss values for the class probabilities.

Fig. 11.

Definition of IoU.


V. EXPERIMENT AND RESULTS

A. Experiment Data

Approximately half of the 2,554 images for each species were used for the model evaluation: 1,304 images per species were used to test the model. We did not apply the data augmentation described in the previous section to the test images because doing so was not meaningful. The pixel values of all images were divided by 255 so that forward propagation and backpropagation could be performed quickly; because the maximum RGB value of each pixel is 255, all pixel values then lie between 0 and 1.

B. Experiment Setup

We evaluated our model using the custom data described in Section 4-A. The model evaluation was conducted using 2 GPUs; see Section 3-C. The evaluation metrics are presented in Section 5-C.

C. Experiment Result

The numbers of each parrot species in the entire image data set are presented in Table 2. The most common species in the data is Ara Ararauna, of which there are 1657 parrots in the dataset. The least abundant species in the dataset is Cacatua Goffiniana, of which there are only 1406 parrots. The difference between the numbers of the two species is 251. Because there are different numbers of objects for each species, the evaluation must be performed considering the number of objects.

Table 2. The numbers of each parrot species in the dataset

Scientific Name         Number of parrots in the data
Ara Chloroptera         1651
Ara Ararauna            1657
Cacatua Galerita        1456
Cacatua Goffiniana      1406
Psittacus erithacus     1564

The main measurements used to evaluate the object detection performance of a model are the mean average precision (mAP), the recall, and the number of frames processed per second (FPS) [6, 19]. The average precision is the precision of the predicted bounding boxes averaged over each class. The precision is obtained by dividing the number of true positives by the sum of the true positives and false positives. True positives, false positives, false negatives, and true negatives are defined in Table 3; for example, if a predicted value is positive and the associated answer is also positive, it is a true positive.

Table 3. Confusion matrix

                          Actual Positive      Actual Negative
Prediction  Positive      True Positive        False Positive
            Negative      False Negative       True Negative

Precision = True Positive / (True Positive + False Positive)    (5)

Recall = True Positive / (True Positive + False Negative)    (6)

Equation (5) gives the expression used to obtain the average class precision of each bounding box. When the IoU threshold was set to 0.5, the mAP was approximately 86.79% (see Table 4). The recall rate is the number of true positives divided by the sum of the true positives and false negatives, as given in (6). The mean recall rate in Table 5 is 93.23%. The FPS is the number of images processed by the network per second; 13 FPS was achieved in this study.
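As a small worked example of (5) and (6), the helper below turns per-class confusion counts into precision and recall; the counts shown are placeholders, not the paper's actual confusion counts.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Placeholder counts for illustration only:
p, r = precision_recall(tp=90, fp=10, fn=5)
print(f"precision = {p:.2%}, recall = {r:.2%}")   # precision = 90.00%, recall = 94.74%
```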

Table 4. Results of average precision for each class

Scientific Name         Average Precision
Ara Chloroptera         89%
Ara Ararauna            88%
Cacatua Galerita        87%
Cacatua Goffiniana      86%
Psittacus erithacus     84%

Table 5. Results of recall rate for each class

Scientific Name         Recall Rate
Ara Chloroptera         96%
Ara Ararauna            94%
Cacatua Galerita        93%
Cacatua Goffiniana      93%
Psittacus erithacus     91%

Note that a direct comparison with other detection systems is impossible because of the differences in three aspects, namely, the backbone architecture used to extract the features (MobileNet vs. others), the method used to detect the objects (YOLO vs. others), and the training and inference data. We present the performance reported in each paper in Table 6 (details in [19], [28]). We expect that our system performance will improve with further optimization.

Table 6. Comparison with other detection systems. We attribute the higher mAP to the number of classes in each dataset (our custom data has only 5 classes, while MS COCO has 80 classes).

Architecture     Method      Data        mAP      Time
Darknet          YOLO        MS COCO     51.5     22
MobileNet        SSDLite     MS COCO     22.1     200
MobileNet        YOLO        Custom      86       13

Fig. 12 shows some result images.

Fig. 12.

Example images of detection results.


D. System Implementation and UI

Fig. 13 shows the results of the response from the server. The name of the image is presented, and the object detection results are shown in parentheses. The detection results comprise the x and y coordinates of the top left corner, the x and y coordinates of the lower right corner, the predicted classes, and the probability value.

Fig. 13.

Response result from server.


VI. CONCLUSIONS

Recently, many species of animals have suffered from declining populations or become extinct owing to various factors. Identifying the habits, habitats, and populations of each species is essential for maintaining biodiversity. Unmanned cameras have been installed in previous studies to acquire video data. Recent advancements in deep learning technology and infrastructure have resulted in improved methods for data processing, but the previous studies faced some limitations because of the use of simple classification. In this study, we employed object detection instead of simple classification to establish a system for receiving, analyzing, and storing image data. In particular, the system was tested with five endangered parrot species listed by CITES. The model showed good performance. We expect that this system will increase the efficiency of research on endangered species. We further hope that this system will be used as a part of a larger system, such as a system for monitoring illegal poaching and smuggling.

The system in this study was built by applying only a single architecture and object detection method; however, we are confident that a better detection system can be achieved if we experiment with more architectures and object detection methods in our future research.

ACKNOWLEDGMENTS

This work was supported by the Korea Environmental Industry and Technology Institute (KEITI) under the Ministry of Environment, Korea (main project code: 1485016979; detailed project code: ARQ201805030003).

REFERENCES
  1. H. Nguyen, S. J. Maclagan, T. D. Nguyen, T. Nguyen, P. Flemons, K. Andrews, E. G. Ritchie, and D. Phung, “Animal Recognition and Identification with Deep Convolutional Neural Networks for Automated Wildlife Monitoring,” in Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Tokyo, Japan, pp. 40-49, 2017. DOI: 10.1109/DSAA.2017.31.
  2. P. Zhuang, L. Xing, Y. Liu, S. Guo, and Y. Qiao, “Marine Animal Detection and Recognition with Advanced Deep Learning Models,” 2017 CLEF Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, vol. 1866, 2017.
  3. M. S. Norouzzadeh, A. Nguyen, M. Kosmala, A. Swanson, M. S. Palmer, C. Packer, and J. Clune, “Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning,” Proceedings of the National Academy of Sciences, vol. 115, no. 25, pp. E5716–E5725, 2018. DOI: 10.1073/pnas.1719367115.
  4. S. F. Pires, “The illegal parrot trade: a literature review,” Global Crime, vol. 13, no. 3, pp. 176-190, 2012. DOI: 10.1080/17440572.2012.700180.
  5. S. H. Kim and B. H. Yu, “Automatic Identification of Wild Animals using Deep Learning,” Proceedings of the Korean Society of Environment and Ecology Conference Korean Society of Environment and Ecology Annual, vol. 2018, no. 1, pp. 34-35, 2018.
  6. S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards realtime object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137-1149, 2016. DOI: 10.1109/TPAMI.2016.2577031.
  7. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 779-788, 2016. DOI: 10.1109/CVPR.2016.91.
  8. Opencv, Opencv Tutorials [Internet], Available: https://docs.opencv.org/master/d9/df8/tutorial_root.html.
  9. Flask, Flask Documentations [Internet], Available: https://flask.palletsprojects.com/en/1.1.x/.
  10. ORACLE, MySQL [Internet], Available: https://www.mysql.com/.
  11. D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, vol. 60, pp. 91-110, 2004. DOI: 10.1023/B:VISI.0000029664.99615.94.
  12. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego: CA, pp. 886-893, 2005. DOI: 10.1109/CVPR.2005.177.
  13. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation Applied to Handwritten Zip Code Recognition,” in Neural Computation, vol. 1, no. 4, pp. 541-551, Dec. 1989. DOI: 10.1162/neco.1989.1.4.541.
  14. K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas: NV, pp. 770-778, 2016. DOI: 10.1109/CVPR.2016.90.
  15. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, “SSD: Single Shot MultiBox Detector,” ECCV 2016: Computer Vision, pp. 21-37, 2016. DOI: 10.1007/978-3-319-46448-0_2.
  16. K. He, X. Zhang, S. Ren, and J. Sun, “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904-1916, 2014. DOI: 10.1109/TPAMI.2015.2389824.
  17. J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu: HI, pp. 6517-6525, 2017. DOI: 10.1109/CVPR.2017.690.
  18. J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” 2018, [Online] Available: https://arxiv.org/abs/1804.02767.
  19. F. Chen, N. Chen, H. Mao, and H. Hu, “Assessing four Neural Networks on Handwritten Digit Recognition Dataset (MNIST),” 2018, [Online] Available: https://arxiv.org/abs/1811.08278.
  20. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Neural Information Processing Systems. vol. 60, no. 6, 2017 DOI: 10.1145/3065386.
  21. J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami: FL, pp. 248-255, 2009. DOI: 10.1109/CVPR.2009.5206848.
  22. N. Sun, X. Mo, T. Wei, D. Zhang, and W. Luo, “The Effectiveness of Noise in Data Augmentation for Fine-Grained Image Classification,” in ACPR 2019: Pattern Recognition, Cham, Switzerland : Springer, pp. 779-792, 2020. DOI: 10.1007/978-3-030-41404-7_55.
  23. R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus: OH, pp. 580-587, 2014. DOI: 10.1109/CVPR.2014.81.
  24. J. K. Min and J. S. Moon, “A Substitute Model Learning Method Using Data Augmentation with a Decay Factor and Adversarial Data Generation Using Substitute Model,” Journal of The Korea Institute of Information Security & Cryptology, vol. 29, no. 6, pp. 1383–1392, 2019. DOI: 10.13089/JKIISC.2019.29.6.1383.
  25. H. J. Shin, S. I. Lee, H. W. Jeoung, and J. W. Park, “Indoor Plants Image Classification Using Deep Learning and Web Application for Providing Information of Plants,” Journal of Knowledge Information Technology and Systems (JKITS), vol. 15, no. 2, pp. 167-175, 2020. DOI: 10.34163/jkits.2020.15.2.002.
  26. S. W. Jung, I. S. Kim, I. K. Kim, and K. S. Lim, “A Study on Data Imbalance Problem Using GAN (Generative Adversarial Network),” Proceedings of Symposium of the Korean Institute of Communications and Information Sciences, pp. 1390-1391, Jan. 2019
  27. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” 2017 [Online] Available: https://arxiv.org/abs/1704.04861.
  28. T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu: HI, pp. 936-944, 2017. DOI: 10.1109/CVPR.2017.106.
  29. Z. Huang, Z. Pan, and B. Lei, “Transfer learning with deep convolutional neural network for SAR target classification with limited labeled data,” Remote Sensing, vol. 9, no. 9, pp. 907, 2017. DOI: 10.3390/rs9090907.
  30. H. S. Lee, J. G. Kim, J. W. Yu, Y. S. Jeong, and S. S. Kim, “Multiclass Classification using Transfer Learning based Convolutional Neural Network,” Journal of Korean Institute of Intelligent Systems, vol. 28, no. 6, pp. 531-537, 2018. DOI: 10.5391/JKIIS.2018.28.6.531.
  31. S. J. Park, J. H. Yoon, and C. B. Ahn, “Compressed-Sensing Cardiac CINE MRI using Neural Network with Transfer Learning,” Journal of IKEEE, vol. 23, no. 4, pp. 293-299, 2019. DOI: 10.7471/ikeee.2019.23.4.1408.
  32. D. G. Cheo, E. J. Choi, and D. K. Kim, “The Real-Time Mobile Application for Classifying of Endangered Parrot Species Using the CNN Models Based on Transfer Learning,” Mobile Information Systems, vol. 2020, pp. 13, 2020. DOI: 10.1155/2020/1475164.

Dea-Gyu Choe

Received his B.S. degree from the Department of Computer Science, Sangmyung University, Seoul, Korea in 2019. He is currently pursuing his M.S. degree from the Department of Computer Science, Sangmyung University, Seoul, Korea. His current research interests include mobile system development and computer vision.


Dong-Keun Kim

Received his Ph.D. degree from the Department of Biomedical Engineering, Yonsei University, Seoul, Korea in 2008. He is currently an associate professor in the Department of Intelligent Engineering Information for Human and director of the Institute of Intelligence Informatics Technology, Sangmyung University, Seoul, Korea. His research interests include biomedical engineering and human-computer interaction.


Article

Journal of information and communication convergence engineering 2020; 18(4): 267-277

Published online December 31, 2020 https://doi.org/10.6109/jicce.2020.18.4.267

Copyright © Korea Institute of Information and Communication Engineering.

Deep Learning-based Image Data Processing and Archival System for Object Detection of Endangered Species

Dea-Gyu Choe, Dong-Keun Kim

Sangmyung University, Sangmyung University

Received: October 31, 2020; Accepted: December 22, 2020

Abstract

It is important to understand the exact habitat distribution of endangered species because of their decreasing numbers. In this study, we build a system with a deep learning module that collects the image data of endangered animals, processes the data, and saves the data automatically. The system provides a more efficient way than human effort for classifying images and addresses two problems faced in previous studies. First, specious answers were suggested in those studies because the probability distributions of answer candidates were calculated even if the actual answer did not exist within the group. Second, when there were more than two entities in an image, only a single entity was focused on. We applied an object detection algorithm (YOLO) to resolve these problems. Our system has an average precision of 86.79%, a mean recall rate of 93.23%, and a processing speed of 13 frames per second.

Keywords: System Design, System Development, Object Detection, Endangered Species

I. INTRODUCTION

Recently, many animals on the planet have become extinct owing to factors such as climate change. Many studies on the maintenance of biodiversity have been conducted. Unmanned cameras have been installed and video recordings of living creatures obtained to collect the information needed for these studies, but huge amounts of human resources were needed to process the acquired data. The processing was slow and inefficient because it had to be performed using human perception and judgment. This issue has led to the emergence of research on machine systems for automatically processing and distinguishing animal images.

Nguyen et al. proposed a convolutional neural network system for classifying the three most commonly observed animal species in Victoria, Australia [1]. Zhuang et al. introduced a deep learning model that automatically categorizes and annotates marine biological image data without the need for manual processing by experts, and conducted experiments using data from SeaCLEF2017 [2]. In another study, Nguyen et al. considered two experimental scenarios for classifying images of wild animals using model architectures based on Lite AlexNet, VGG-16, and ResNet-50 [3]. In the first scenario, the model was trained from scratch, and in the second, a technique called fine-tuning was used. Weights pre-trained by ImageNet, which contains large-capacity images, were fed into the model to fit the model to the target data. Many pre-training techniques are used for monitoring and classifying large amounts of animal image data to provide a good view of the local image features in big data.

Two problems arise in wildlife image classification when simple classification is used. First, because the classification results indicate the probabilistic similarity of how close the object in the image is to various correct answer candidates, the model suggests the best wrong answer even when the correct answer is not in the candidate group. Second, the answer is given for only one object, even if multiple objects exist within the image. In this study, we propose an object-detection approach to address these problems. We establish a system for processing and storing image data of endangered species using YOLO, a well-known object detection algorithm.

The contributions of this study are as follows: First, we automated the acquisition, processing, and storage of image data for investigating the ecological preservation of endangered species. Second, by replacing the simple classification algorithm with an object detection algorithm in our new system, the system can overcome the limitations of current classification systems. Third, the developed system can be linked to other systems to establish larger systems for monitoring the illegal poaching and smuggling of endangered species.

In this study, we focused on five species of parrots designated as endangered species by the Convention on International Trade in Endangered Species of Wild Flora and Fauna (CITES). The illegal trade of endangered parrot species has become rampant internationally [4]. The five types of parrots are listed in Table 1. The remainder of this paper is organized as follows. The relevant literature are reviewed in Section 2 and the originality of the developed system is described. The components and the development of the system are detailed in Section 3. The construction of the object detection model is described in Section 4. The results of the experiments are presented in Section 5. Finally, the main conclusions of the study and future works are summarized in Section 6.

Five CITES endangered parrot species

Species InformationPicture
Scientific Name
Ara Chloroptera
Appendix
II
Scientific Name
Ara Ararauna
Appendix
II
Scientific Name
Cacatua Galerita
Appendix
II
Scientific Name
Cacatua Goffiniana
Appendix
I
Scientific Name
Psittacus Erithacus
Appendix
I

II. RELATED WORKS

A. Image Data Processing System for Endangered Species

Unmanned cameras are typically used to obtain video recordings of wild animals. However, human labor is required for the determination of animal objects contained in the acquired images. This incurs time cost and is inevitably subjective because of human decisions. Researchers have attempted to resolve this problem by building systems that can process the data automatically. Nguyen et al. proposed a deep learning model that can classify the animal species observed in Victoria, Australia [1].

Zhuang et al. presented a model that can handle images of marine species without human input and tested its performance on the SeaCLEF2017 dataset [2]. Noruzadeh et al. noted that much of the video data from the Snapshot Serengeti project remained unprocessed because the data obtained from the cameras came from volunteer participants. They claimed that a deep learning model can identify the number of objects and even the behavior of each object in an image [3]. Byeong-hyeok and Sun-hyeon described the identification of five species mostly found in the Sobaeksan National Park using an object recognition model [5].

B. Object Detection Process

There are four main domains in computer vision, namely, simple classification of objects in an image, localization of objects in an image, object detection for the classification and location tracking of many objects in the image, and semantic segmentation for detecting objects in the image by classifying their pixels. Among these domains, object detection algorithms can be divided into two categories. The first category consists of two-step detection algorithms which perform detection tasks in regions that may contain objects. The other category consists of one-step detection algorithms which can simultaneously perform detection tasks over the entire image. Ren et al. presented a method that generates region proposals using a region proposal network and performs detection tasks in the proposed regions via the Fast R-CNN framework [6]. Redmon et al. reported a unified convolutional network for classifying and localizing objects in an image [7]. It is well-known amongst researchers that one-step algorithms are faster than two-step algorithms, but not as accurate. Nevertheless, because one-step algorithms have improved to date, we decided to employ YOLO for our system.

III. SYSTEM DESIGN AND CONFIGURATION

In this section, the components of our system are described through several diagrams.

A. System Design

The system developed in this study consists of four main parts, as shown in Fig. 1. In the first part, a camera interface is used to acquire images. In the second part, a server receives the images acquired from the cameras and submits requests for image analysis to a deep learning server. In the third part, the deep learning server analyzes the images. The fourth part is a database to store the analyzed images. The system flow is illustrated in Fig. 2. Images are acquired from the web cameras and sent to the server. The server requests image data analysis from the deep learning server through routing. The deep learning server analyzes the images and sends the results to the server. The database stores the results and image data. Finally, the server returns the results to the origin of the images.

Figure 1.

System component block diagram.


Figure 2.

System sequence diagram.


1) Camera Interface

A webcam is used to transmit images to the server in the camera interface instead of using physical storage devices on a batch basis. The Opencv computer vision library is used to retrieve video data from a webcam connected to a laptop [8]. Opencv is an open-source library that contains hundreds of computer vision algorithms. The library typically includes the following functionalities:

  • Image processing (imgproc): Modules that perform image transformations such as geometric image conversion or color changes.

  • Video analysis: Modules for video analysis such as motion measurements, background removal, and object tracking algorithms.

  • Video I/O (video): An interface that facilitates the use of features such as video capture and video codecs.

The Opencv Library can be used to create an object that contains the camera module information through a class called VideoCapture. Invoking the read method for this object will initiate video frame acquisition by the webcam. After obtaining the images, the server receives requests that meet the relevant routing rules and starts a related logic process.

2) Server

The server receives image data from the webcam and sends them to the deep learning server for analysis. The server then receives the results and stores them in the database. Although the deep learning modules can also be loaded on the server, the server and deep learning server are separated for the following reasons:

  • To facilitate maintenance through the separation of modules by function.

  • To build the system in a hierarchical structure to facilitate the addition of new functions in the future.

  • To utilize Tensorflow graph computation, which has speed advantages.

The routing rules are added by the Python decorator in the Flask Library [9]. The image data to be transmitted are first transformed into an array of the pixel values by calling the img_to-array method. Next, the Json-type data that consist of a key word and the pixel array values are sent to the deep learning server using the post method. The results are returned after the operations on the deep learning server are completed.

3) Deep Learning Server for Image Analysis

The TensorFlow Serving API includes a function that responds to requests generated by a model expressed as a Tensorflow graph [10]. The process for converting a model to a Tensorflow graph and starting the service is as follows: First, the object detection model is built and trained using the TensorFlow Library. The trained model is then converted to a TensorFlow graph using the library. The graph can start a service using the tensorflow_model_server instruction. The service serves as a director for deep learning operations. Once the service has been constructed, it performs tasks related to the client requests. All the tasks in the TensorFlow Serving API are performed separately, as shown in Fig. 3.

Figure 3.

Tensorflow Serving component diagram


4) Database System

MySql is a widely used relational database [11]. As a relational database, MySql can store and manage data in a structured structure. To allow the images to be stored in the database and the desired data to be queried and retrieved according to conditions such as the date and class, we decided to store the images in byte form as blobs in the database table. The BytesIO class and base64 module are used to store image data in byte form. The images, detection results, and date information are grouped into a data frame and stored in the database.

B. System Configuration

The overall system was developed based on a module-view-controller (MVC) pattern. The MVC pattern is a system design pattern used in software engineering to separate the user interface from the process logic so that they can be easily maintained without affecting each other. Each component is described below (Fig. 4):

Figure 4.

System MVC block diagram.


  • Controller: Changes the status of the model by sending commands to the model. The server and the deep learning server constitute the controller.

  • Model: The view or controller reads the status of the model and processes the associated logic. Here, the database constitutes the model.

  • View: The view generates the results for viewing by the user. The camera interface constitutes the view.

C. System Environment

The details of the system development environment are presented below:

  • CPU: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10 GHz

  • GPU: GeForce RTX 2080 Ti (×2)

  • OS: Ubuntu 16.04 LTS.

All the components of the system were developed in Anaconda using the Python language. The code was written in a Jupyter Notebook.

IV. OBJECT DETECTION MODEL DESIGN

Many researchers have analyzed image data using computer vision methods based on deep learning. Attempts have been made to extract features that are invariant to changes in the size and rotation of the image [12] and to extract characteristic vectors by considering the variance of the size and direction of the edges in the image [13]. LeCun et al. proposed a method for training a convolutional neural network through backpropagation [14]. Convolutional neural networks subsequently became mainstream in computer vision studies. In particular, He et al. surpassed human recognition capability with an error rate of only 3.57% in the ImageNet Large Scale Visual Recognition image dataset classification competition in 2015 [15]. However, simple classification still faces the problem of the proposal of wrong answers if there is no correct answer in the answer group. In addition, when there are multiple objects in an image, the results are determined based on only the object with the most influence on the image characteristics. Some approaches to overcome the disadvantages of simple classification consider the locations of the objects in the image using convolutional neural networks [6, 16, 17]. In this study, the object detection was performed using the YOLO model, which has advantages in terms of both speed and accuracy [7, 18, 19]. The training of the YOLO model implemented on the deep learning server described in the previous section is explained in this section. The process of training the model is illustrated in Fig. 5.

Figure 5. Model training process diagram.

A. Data Augmentation

A deep learning model requires substantial data to achieve good performance. To train a model which can recognize hand-written digits ranging from 0 to 9 with more than 90% accuracy, 60,000 training sets and 10,000 test sets were required [20]. The ImageNet dataset used in many studies consists of 14,197,122 images belonging to more than 20,000 classes [21, 22]. The amount of data can be increased using the two techniques of acquiring more data and transforming the acquired data to create new data.

In addition to the data obtained at the Seoul Grand Park, we developed an image crawling program to obtain sufficient data to train the deep learning models. Websites such as Google and Yahoo have sections that show a list of images associated with the keywords entered by the user. This section is accessible through several crawling libraries that enable users to acquire images. The number of images gathered from crawling and from zoos was 12,770; therefore, 2,554 images were used for each species. However, only 1,250 images per species were injected for training (see Section 5). We manually annotated the location of every parrot in the images.

The data can be manipulated by using many techniques to create new images, for example, by the rotation or parallel translation of the images or by changing the image chroma, color or brightness. The deliberate mixture of noise with the original data to produce new images has also been studied [23]. Girschick used a technique called image warping to change the size of the original image. He also used another technique in which only the area of the original image that contains objects is cut off [24]. Data augmentation has been applied in many studies because of constraints in obtaining sufficient data for deep learning [25, 26]. In this study, a large amount of data was obtained by applying data augmentation to the original data obtained from websites and zoos. We only applied horizontal and vertical flipping so that the number of training data images per species was 5,000, which is four times that of 1,250. The total number of training data images was 25,000, as shown in Fig. 6.

Figure 6.

Examples of data augmentation.


Images can also be acquired by building a model that creates non-existent images [27]. However, the process is very complex and beyond the scope of this study.

B. Model Implementation

1) Model Architecture

We used MobileNet as the model architecture based on the approach in [28]. MobileNet has the following features:

  • Channel Reduction: The operation is slowed down if the model size is too large so that an appropriate number of channels is maintained.

  • Depthwise separable convolution: This reduces the number of parameters in the filters and extracts significant features with independent filters.

The model was constructed based on the MobileNet architecture and the YOLO style, as shown at the bottom of Fig. 5. Predictions were made using three-scale feature maps. This is the concept of the pyramidal feature hierarchy reported in [29]. Objects of varying sizes can be detected because the pyramidal feature hierarchy outputs different scale feature maps for each layer. This allows the convolutional neural network to perform more accurate predictions on each feature map instead of predictions on a single feature map from the last layer, as shown in Fig. 7. The total number of layers in the model was 181, and the total number of model parameters was 7,728,666, of which 7,684,506 parameters were trainable and 44,160 parameters were not. The shapes of the three output layers were 13×13×30, 26×26×30, and 52×52×30. The predictions contain the coordinates of the bounding box and the class probabilities of the object. Because the entire network is too large to be shown, we show a part of the network in our system in Fig. 8.

Figure 7.

Prediction on single feature map and pyramidal feature hierarchy.


Figure 8.

A part of the object detection network.


2) Transfer Learning

One of the limitations to creating a well-trained deep-running model is that a vast amount of data is required. The solutions include creating new images by data augmentation and by data crawling, as mentioned in Section 4.A. An additional method that is frequently used is the refinement of convolution layers extracted from entire networks, such as ImageNet, which have been previously trained with large amounts of data from different domains, as shown in Fig. 9. This is known as transfer learning. Transfer learning has been employed in many studies to resolve problems caused by poor image quantity [30-32]. We also exploited the pretrained weights in the ImageNet dataset to perform object detection in images of endangered parrot species by initializing our model layers with the weights [21, 22]. Because the ImageNet dataset contains a variety of bird images, we expect that this would improve the efficiency of our system.

Figure 9.

Transfer learning method [33].


C. Model Training

YOLO extracts feature maps from the final three layers and predicts the regions in which objects are located. Each feature map is divided into an N × N grid to identify areas where objects are likely to be, and each grid cell contains a number of candidate regions called anchor boxes. For each anchor box, the network predicts the center coordinates x and y and the width w and height h of the bounding box surrounding the object, an objectness score indicating the probability that an object exists, and the softmax probabilities for the classes of the object in the box. The predicted values are compared with the ground-truth values to obtain the total loss, and the network model parameters are then updated through backpropagation of this loss.
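As a toy illustration of the grid assignment described above (the grid size corresponds to the 13×13 feature map; the input resolution and object coordinates are made up), the cell responsible for predicting an object is the one containing the object's center:

```python
GRID = 13     # N x N grid on the 13 x 13 feature map
IMG = 416     # network input resolution (assumed)

def responsible_cell(center_x, center_y):
    """Return the (row, col) of the grid cell that must predict this object."""
    cell = IMG / GRID                          # pixels per grid cell
    return int(center_y // cell), int(center_x // cell)

print(responsible_cell(200, 310))              # -> (9, 6)
```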

Fig. 10 shows tx, ty, tw, and th, which represent the offsets of the center coordinates, width, and height, respectively. These offsets are predicted and then transformed into the center coordinates, width, and height of a bounding box that fits the ground-truth box; the logistic (sigmoid) or natural exponential function is applied to control the degree of adjustment [18]. In the improved YOLO model, however, the loss function was changed to operate on the t values directly instead of on the bounding-box values bx, by, bw, and bh [19]. The refined formulas are given in Equations (1)-(3). Note that in the original YOLO model [18] the b values are the prediction targets, obtained by adjusting the t values, whereas in the enhanced model [19] the t values themselves are regressed directly.
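The transformation illustrated in Fig. 10, i.e., the inverse of Eqs. (1)-(3) below, can be sketched as follows: the raw t values predicted for the grid cell at offset (cx, cy) with an anchor prior of size (pw, ph) are converted into the bounding-box center and size (all quantities in grid-cell units).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Convert raw network outputs (t values) into a bounding-box center and size,
    following the YOLO parameterization [18, 19]."""
    bx = sigmoid(tx) + cx     # center x, constrained to lie inside cell (cx, cy)
    by = sigmoid(ty) + cy     # center y
    bw = pw * np.exp(tw)      # width relative to the anchor prior
    bh = ph * np.exp(th)      # height relative to the anchor prior
    return bx, by, bw, bh
```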

Figure 10.

Prediction of bounding box coordination [19].


$$t_x = \sigma^{-1}(b_x - c_x) \tag{1}$$

$$t_y = \sigma^{-1}(b_y - c_y) \tag{2}$$

$$t_w = \ln\left(\frac{b_w}{p_w}\right), \qquad t_h = \ln\left(\frac{b_h}{p_h}\right) \tag{3}$$

The t values (tx, ty, tw, th) in (1), (2), and (3) can be derived by inverting the expressions shown on the right side of Fig. 10. The new loss function can then be inferred to be

$$\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{\text{obj}} \left[ \left(t_{x_i} - \hat{t}_{x_i}\right)^2 + \left(t_{y_i} - \hat{t}_{y_i}\right)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{\text{obj}} \left[ \left(t_{w_i} - \hat{t}_{w_i}\right)^2 + \left(t_{h_i} - \hat{t}_{h_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2} I_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2 \tag{4}
\end{aligned}$$

(The authors of [19] did not provide an explicit formula for the loss function, so there are some differences between implementations.)

The λcoord and λnoobj terms in (4) are weights for the coordinate loss and for the objectness loss in the absence of an object, respectively. The indicator Iij^obj restricts the loss to the j-th bounding box of cell i: it is set to 1 if the j-th bounding box of the i-th cell contains the center coordinates of an object and has the highest overlap (intersection over union, IoU; see Fig. 11) with the ground-truth box, and 0 otherwise; Iij^noobj selects the remaining boxes. Ci denotes the objectness score, that is, whether cell i contains the center coordinates of an object regardless of its class, and Pi(c) denotes the probability that the object belongs to class c. The YOLO loss therefore consists of the losses for the coordinates, width, and height of the bounding boxes, the losses for the class probabilities, and the losses for the objectness scores.
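A simplified sketch of Eq. (4) for a single output scale is given below. It assumes that the ground-truth t values, objectness and class targets, and the object/no-object indicator masks have already been computed during target assignment; the λ values are the commonly used defaults, not necessarily those used in this study, and this is not the exact implementation of our system.

```python
import tensorflow as tf

# Weighting terms of Eq. (4); 5.0 and 0.5 are commonly used values (assumed here).
LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def yolo_loss(t_pred, t_true, obj_pred, obj_true, cls_pred, cls_true,
              obj_mask, noobj_mask):
    """Sum-of-squares YOLO-style loss over one output scale.
    t_*:   (batch, S, S, B, 4)            box offsets (tx, ty, tw, th)
    obj_*: (batch, S, S, B)               objectness scores C
    cls_*: (batch, S, S, B, num_classes)  class probabilities p(c)
    obj_mask / noobj_mask: (batch, S, S, B) indicator terms from Eq. (4)."""
    coord_loss = LAMBDA_COORD * tf.reduce_sum(
        obj_mask[..., None] * tf.square(t_true - t_pred))
    obj_loss = tf.reduce_sum(obj_mask * tf.square(obj_true - obj_pred))
    noobj_loss = LAMBDA_NOOBJ * tf.reduce_sum(
        noobj_mask * tf.square(obj_true - obj_pred))
    cls_loss = tf.reduce_sum(obj_mask[..., None] * tf.square(cls_true - cls_pred))
    return coord_loss + obj_loss + noobj_loss + cls_loss
```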

Figure 11.

Definition of IoU.


V. EXPERIMENT AND RESULTS

A. Experiment Data

Of the approximately 2,554 images collected for each species, 1,304 (roughly half) were used to test the model. We did not apply the data augmentation described in the previous section to the test images because doing so is not meaningful for evaluation. The pixel values of all images were divided by 255; because the maximum RGB value of a pixel is 255, all pixel values then lie between 0 and 1, which allows forward propagation and backpropagation to be performed quickly.
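This preprocessing amounts to a single scaling step; a small sketch assuming the images are loaded as 8-bit arrays (the file path and input size are placeholders):

```python
import cv2
import numpy as np

image = cv2.imread("parrot.jpg")               # uint8 values in [0, 255] (placeholder path)
image = cv2.resize(image, (416, 416))          # assumed network input size
normalized = image.astype(np.float32) / 255.0  # values now lie in [0, 1]
```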

B. Experiment Setup

We evaluated our model on the custom data described in Section IV-A. The evaluation was conducted using the two GPUs described in Section III-C, and the evaluation metrics are presented in Section V-C.

C. Experiment Result

The numbers of parrots of each species in the entire image dataset are presented in Table 2. The most common species is Ara Ararauna, with 1,657 parrots, and the least common is Cacatua Goffiniana, with 1,406 parrots, a difference of 251. Because the number of objects differs between species, the evaluation must take these counts into account.

Table 2. The numbers of each parrot species in the datasets.

Scientific Name         Number of parrots in data
Ara Chloroptera         1,651
Ara Ararauna            1,657
Cacatua Galerita        1,456
Cacatua Goffiniana      1,406
Psittacus erithacus     1,564

The main metrics used to evaluate the object detection performance of a model are the mean average precision (mAP), the recall, and the number of frames processed per second (FPS) [6, 19]. The average precision summarizes the precision of the predicted bounding boxes for each class. The precision is obtained by dividing the number of true positives by the sum of the true positives and false positives. True positives, false positives, false negatives, and true negatives are defined in Table 3; for example, if a predicted value is positive and the corresponding ground truth is also positive, the prediction is a true positive.

Table 3. Confusion matrix.

                         Actual
                         Positive           Negative
Prediction   Positive    True Positive      False Positive
             Negative    False Negative     True Negative

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{5}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{6}$$

Equation (5) is the expression used to obtain the average precision for each class over the predicted bounding boxes. When the IoU threshold was set to 0.5, the mAP was approximately 86.79% (see Table 4). The recall rate is the number of true positives divided by the sum of the true positives and false negatives, as given in (6); the mean recall rate in Table 5 is 93.23%. The FPS is the number of images processed by the network per second, and 13 FPS was achieved in this study.
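The following sketch shows how these quantities can be computed once each detection has been matched to a ground-truth box; the IoU test against the 0.5 threshold (cf. Fig. 11) decides whether a detection counts as a true positive. The boxes in the usage line are made-up coordinates.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """Equations (5) and (6) from the counts of true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: a detection is a true positive if its IoU with the matched ground truth >= 0.5
is_true_positive = iou((10, 10, 60, 80), (12, 8, 58, 84)) >= 0.5
```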

Table 4. Results of average precision for each class.

Scientific Name         Average Precision
Ara Chloroptera         89%
Ara Ararauna            88%
Cacatua Galerita        87%
Cacatua Goffiniana      86%
Psittacus erithacus     84%

Table 5. Results of recall rate for each class.

Scientific Name         Recall Rate
Ara Chloroptera         96%
Ara Ararauna            94%
Cacatua Galerita        93%
Cacatua Goffiniana      93%
Psittacus erithacus     91%

Note that a direct comparison with other detection systems is not possible because of differences in three aspects: the backbone architecture used to extract features (MobileNet vs. others), the method used to detect objects (YOLO vs. others), and the training and inference data. The performance figures reported in each paper are listed in Table 6 (details in [19] and [28]). We expect that the performance of our system will improve with further optimization.

Table 6. Comparison with other detection systems. We attribute the higher mAP to the number of classes in each dataset (our custom data has only 5 classes, whereas MS COCO has 80 classes).

Architecture    Method     Data        mAP     Time
Darknet         YOLO       MS COCO     51.5    22
MobileNet       SSDLite    MS COCO     22.1    200
MobileNet       YOLO       Custom      86      13

Fig. 12 shows some example detection results.

Figure 12.

Example images of detection results.


D. System Implementation and UI

Fig. 13 shows the response returned by the server. The name of the image is presented, and the object detection results are shown in parentheses. Each detection result comprises the x and y coordinates of the top-left corner, the x and y coordinates of the bottom-right corner, the predicted class, and the probability value.
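The exact response format is shown in Fig. 13 and is not reproduced here; as a hedged illustration, a client could unpack such a detection result roughly as follows (the field ordering and the values are assumptions of this sketch).

```python
# Hypothetical detection tuples, following the fields described above:
# (x_min, y_min, x_max, y_max, predicted_class, probability)
detections = [(104, 57, 388, 402, "Ara Ararauna", 0.94)]

for x1, y1, x2, y2, species, prob in detections:
    print(f"{species}: {prob:.2f} at top-left ({x1}, {y1}), bottom-right ({x2}, {y2})")
```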

Figure 13.

Response result from server.


VI. DISCUSSION AND CONCLUSION

Recently, many animal species have suffered declining populations or become extinct owing to various factors. Identifying the habits, habitats, and populations of each species is essential for maintaining biodiversity. In previous studies, unmanned cameras were installed to acquire video data, and recent advances in deep learning technology and infrastructure have improved the methods used to process such data; however, those studies faced limitations because they relied on simple classification. In this study, we employed object detection instead of simple classification to establish a system for receiving, analyzing, and storing image data. The system was tested on five endangered parrot species listed by CITES and showed good performance, with a mean average precision of 86.79%, a mean recall rate of 93.23%, and a processing speed of 13 FPS. We expect that this system will increase the efficiency of research on endangered species, and we hope that it can serve as part of a larger system, for example, one for monitoring illegal poaching and smuggling.

The system in this study was built using only a single backbone architecture and a single object detection method; however, we believe that a better detection system can be achieved by experimenting with additional architectures and detection methods in future research.


Table 1. Five CITES endangered parrot species.

Scientific Name          CITES Appendix
Ara Chloroptera          II
Ara Ararauna             II
Cacatua Galerita         II
Cacatua Goffiniana       I
Psittacus Erithacus      I


References

  1. H. Nguyen, S. J. Maclagan, T. D. Nguyen, T. Nguyen, P. Flemons, K. Andrews, E. G. Ritchie, and D. Phung, “Animal Recognition and Identification with Deep Convolutional Neural Networks for Automated Wildlife Monitoring,” in Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Tokyo, Japan, pp. 40-49, 2017. DOI: 10.1109/DSAA.2017.31.
  2. P. Zhuang, L. Xing, Y. Liu, S. Guo, and Y. Qiao, “Marine Animal Detection and Recognition with Advanced Deep Learning Models,” Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, vol. 1866, 2017.
  3. M. S. Norouzzadeh, A. Nguyen, M. Kosmala, A. Swanson, M. S. Palmer, C. Packer, and J. Clune, “Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning,” Proceedings of the National Academy of Sciences, vol. 115, no. 25, pp. E5716-E5725, 2018. DOI: 10.1073/pnas.1719367115.
  4. S. F. Pires, “The illegal parrot trade: a literature review,” Global Crime, vol. 13, no. 3, pp. 176-190, 2012. DOI: 10.1080/17440572.2012.700180.
  5. S. H. Kim and B. H. Yu, “Automatic Identification of Wild Animals using Deep Learning,” Proceedings of the Korean Society of Environment and Ecology Conference, vol. 2018, no. 1, pp. 34-35, 2018.
  6. S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137-1149, 2016. DOI: 10.1109/TPAMI.2016.2577031.
  7. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 779-788, 2016. DOI: 10.1109/CVPR.2016.91.
  8. OpenCV, OpenCV Tutorials [Internet], Available: https://docs.opencv.org/master/d9/df8/tutorial_root.html.
  9. Flask, Flask Documentation [Internet], Available: https://flask.palletsprojects.com/en/1.1.x/.
  10. TensorFlow, TensorFlow Serving [Internet], Available: https://tensorflowkorea.gitbooks.io/tensorflow-kr/content/g3doc/tutorials/tfserve/.
  11. ORACLE, MySQL [Internet], Available: https://www.mysql.com/.
  12. D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, vol. 60, pp. 91-110, 2004. DOI: 10.1023/B:VISI.0000029664.99615.94.
  13. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, pp. 886-893, 2005. DOI: 10.1109/CVPR.2005.177.
  14. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation Applied to Handwritten Zip Code Recognition,” Neural Computation, vol. 1, no. 4, pp. 541-551, Dec. 1989. DOI: 10.1162/neco.1989.1.4.541.
  15. K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 770-778, 2016. DOI: 10.1109/CVPR.2016.90.
  16. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, “SSD: Single Shot MultiBox Detector,” in ECCV 2016: Computer Vision, pp. 21-37, 2016. DOI: 10.1007/978-3-319-46448-0_2.
  17. K. He, X. Zhang, S. Ren, and J. Sun, “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904-1916, 2014. DOI: 10.1109/TPAMI.2015.2389824.
  18. J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 6517-6525, 2017. DOI: 10.1109/CVPR.2017.690.
  19. J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” 2018, [Online] Available: https://arxiv.org/abs/1804.02767.
  20. F. Chen, N. Chen, H. Mao, and H. Hu, “Assessing four Neural Networks on Handwritten Digit Recognition Dataset (MNIST),” 2018, [Online] Available: https://arxiv.org/abs/1811.08278.
  21. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Communications of the ACM, vol. 60, no. 6, pp. 84-90, 2017. DOI: 10.1145/3065386.
  22. J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, pp. 248-255, 2009. DOI: 10.1109/CVPR.2009.5206848.
  23. N. Sun, X. Mo, T. Wei, D. Zhang, and W. Luo, “The Effectiveness of Noise in Data Augmentation for Fine-Grained Image Classification,” in ACPR 2019: Pattern Recognition, Cham, Switzerland: Springer, pp. 779-792, 2020. DOI: 10.1007/978-3-030-41404-7_55.
  24. R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, pp. 580-587, 2014. DOI: 10.1109/CVPR.2014.81.
  25. J. K. Min and J. S. Moon, “A Substitute Model Learning Method Using Data Augmentation with a Decay Factor and Adversarial Data Generation Using Substitute Model,” Journal of The Korea Institute of Information Security & Cryptology, vol. 29, no. 6, pp. 1383-1392, 2019. DOI: 10.13089/JKIISC.2019.29.6.1383.
  26. H. J. Shin, S. I. Lee, H. W. Jeoung, and J. W. Park, “Indoor Plants Image Classification Using Deep Learning and Web Application for Providing Information of Plants,” Journal of Knowledge Information Technology and Systems (JKITS), vol. 15, no. 2, pp. 167-175, 2020. DOI: 10.34163/jkits.2020.15.2.002.
  27. S. W. Jung, I. S. Kim, I. K. Kim, and K. S. Lim, “A Study on Data Imbalance Problem Using GAN (Generative Adversarial Network),” Proceedings of Symposium of the Korean Institute of Communications and Information Sciences, pp. 1390-1391, Jan. 2019.
  28. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” 2017, [Online] Available: https://arxiv.org/abs/1704.04861.
  29. T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 936-944, 2017. DOI: 10.1109/CVPR.2017.106.
  30. Z. Huang, Z. Pan, and B. Lei, “Transfer learning with deep convolutional neural network for SAR target classification with limited labeled data,” Remote Sensing, vol. 9, no. 9, p. 907, 2017. DOI: 10.3390/rs9090907.
  31. H. S. Lee, J. G. Kim, J. W. Yu, Y. S. Jeong, and S. S. Kim, “Multiclass Classification using Transfer Learning based Convolutional Neural Network,” Journal of Korean Institute of Intelligent Systems, vol. 28, no. 6, pp. 531-537, 2018. DOI: 10.5391/JKIIS.2018.28.6.531.
  32. S. J. Park, J. H. Yoon, and C. B. Ahn, “Compressed-Sensing Cardiac CINE MRI using Neural Network with Transfer Learning,” Journal of IKEEE, vol. 23, no. 4, pp. 293-299, 2019. DOI: 10.7471/ikeee.2019.23.4.1408.
  33. D. G. Cheo, E. J. Choi, and D. K. Kim, “The Real-Time Mobile Application for Classifying of Endangered Parrot Species Using the CNN Models Based on Transfer Learning,” Mobile Information Systems, vol. 2020, 13 pages, 2020. DOI: 10.1155/2020/1475164.