Journal of information and communication convergence engineering 2023; 21(4): 329-336
Published online December 31, 2023
https://doi.org/10.56977/jicce.2023.21.4.329
© Korea Institute of Information and Communication Engineering
Jin Ho Lee1, In Su Kim1, Hector Acosta1, Hyeong Bok Kim2, Seung Won Lee2, and Soon Ki Jung1*
1School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
2Testworks, Inc., Seoul 01000, Korea
Correspondence to: Soon Ki Jung (E-mail: skjung@knu.ac.kr)
Department of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper introduces an edge AI-based scene-specific object detection system for long-term traffic management, focusing on analyzing congestion and movement via cameras. It aims to balance fast processing and accuracy in traffic flow data analysis using edge computing. We adapt the YOLOv5 model, with four heads, to a scene-specific model that utilizes the fixed camera’s scene-specific properties. This model selectively detects objects based on scale by blocking nodes, ensuring only objects of certain sizes are identified. A decision module then selects the most suitable object detector for each scene, enhancing inference speed without significant accuracy loss, as demonstrated in our experiments.
Keywords: Scene-specific System, You Only Look Once Version 5 (YOLOv5), Edge AI, Embedded System
In urban planning, analyzing traffic congestion and movement is vital for effective long-term traffic control [1]. Traditional methods, like monitoring traffic flow through CCTV cameras, are costly and labor-intensive [2]. To address these challenges, AI cameras are increasingly being used as intelligent traffic detection systems [3,4]. These cameras, functioning as edge computing devices with embedded GPUs, can run lightweight deep learning models. Such an Edge AI system enables efficient monitoring of crossroads, providing high-level traffic data, including flow and congestion insights, thus offering a more resource-efficient solution to traffic management [5].
This research focuses on employing Edge AI systems in urban planning and traffic control to analyze traffic congestion and the movement of people and vehicles. Utilizing AI cameras as edge computing devices, equipped with embedded GPUs, we can run lightweight deep learning models for real-time traffic flow analysis. This approach addresses the high cost and resource intensity of monitoring traffic through CCTV cameras. Our Edge AI system analyzes high-level traffic data such as flow and congestion at crossroads, avoiding the use of cloud computing due to concerns over personal information leakage, transmission delays, and increased network traffic [6,7]. Edge computing offers a solution to these issues, making it a more suitable choice for our study.
Edge computing, however, often struggles to match the performance of cloud computing because its limited resources constrain inference speed and accuracy [8]. Examples of AI services applied to edge computing include traffic surveillance and monitoring research using the Faster R-CNN network [9], a study on enhancing power efficiency and security in Intelligent Transportation Systems (ITS) [10], and research on FD-YOLOv5, a YOLOv5 network-based system for detecting safety helmets on operators [4]. However, these studies primarily emphasize detection accuracy over inference speed. While these methods enhance accuracy and object detection efficacy, considering inference speed is crucial when integrating such models with CCTV in an embedded system.
In our research, we enhanced an Edge AI system for analyzing traffic flow by developing a scene-specific model based on YOLOv5. The model is specifically designed to address the slow inference speed of embedded systems and to move beyond traditional lightweight-model approaches. We made significant modifications to the existing model architecture to detect smaller objects more effectively. Our main contributions are as follows:
• In addition to the existing structure, we added layers to the backbone, neck, and head of the model, enabling it to detect smaller objects more efficiently than the standard model.
• We designed the scene-specific model by customizing object size detection for each image grid, and selectively deactivating certain layers in the head and corresponding neck modules. This strategy enhances computational speed while ensuring minimal loss in accuracy for specific object sizes.
• A bespoke decision module was developed to adapt the scene-specific system model to different CCTV environments, further enhancing its applicability and effectiveness.
The structure of this paper is organized as follows: Section II provides an overview of the YOLOv5 model, highlighting its status as a cutting-edge lightweight model. In Section III, we delve into the detailed implementation process specific to scene-based systems. Section IV discusses our experimental results, offering in-depth interpretations. The paper concludes with Section V, which presents our final thoughts and potential future research directions.
In object detection based on deep learning, there are two primary methods. The first method, known as two-stage detection, includes techniques like R-CNN, Fast R-CNN, Faster R-CNN [11], and Mask R-CNN [12]. These methods initially extract region proposals using selective search algorithms or region proposal networks (RPN), followed by object detection based on these proposals. While two-stage detectors are highly accurate, they are characterized by slower inference speeds. The second method involves one-stage detectors like the YOLO series [13-16]. These algorithms employ regression to simplify learning the target's generalized characteristics, effectively addressing the challenge of inference speed. In this context, YOLOv5 has been chosen for its suitability in embedded system environments. Among the various models offered by YOLOv5, the lightweight versions are particularly apt for our research, providing satisfactory performance even in embedded system environments.
YOLOv5 encompasses five models: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, each varying in parameters of depth and width. Among these, YOLOv5x offers the highest accuracy but at the slowest speed, while YOLOv5n provides the fastest inference with the least accuracy. As depicted in Fig. 1, the fundamental architecture of YOLOv5 consists of three main components: the backbone, neck, and head. The backbone, using the SPPF layer and CSPDarknet53 [17], primarily extracts features to create a feature map. The neck, utilizing PANet [18], forms a feature pyramid at various scales, linking the backbone and head. The head component is responsible for image classification and bounding box location regression [19]. Our analysis and modifications focus on the neck and head modules of the network.
Given the constraints of computing resources in embedded systems, such as limited memory and processing capacity, optimizing deep learning models for these environments is essential. To deploy models on resource-constrained embedded devices, they need to be made faster and lighter through processes like model light-weighting or model compression. Common methods include knowledge distillation [20], where knowledge from a larger teacher model is transferred to a smaller student model; TensorRT [21], which optimizes the model on embedded GPUs to enhance speed; quantization [22], a technique for minimizing redundant bits in model parameters; and pruning [23], which involves eliminating superfluous parameters from the original network.
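As a point of reference for the distillation idea mentioned above, the following is a minimal, generic sketch of the standard soft-target loss; it illustrates the technique rather than any part of our system, and the temperature `T` and weighting `alpha` are arbitrary example values.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened teacher and
    # student distributions, scaled by T^2 as is conventional.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```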
Our research introduces a scene-specific method, distinct from the aforementioned techniques. This approach retains submodules related to the head in the model and increases inference speed by selecting a model tailored to each specific scene.
Before implementing our system, we established hypotheses to simplify the model for a specific scene with a stationary camera. These hypotheses are:
• Road images will not contain objects larger than cars or trucks, implying a limit on the object size in input images.
• Due to the geometric relationship between the camera and the ground, objects of the same class appear smaller when they are farther away or positioned higher in the image.
• The variety of object classes detectable in a given environment is limited. For instance, in urban road settings, focusing only on cars and pedestrians simplifies the model, reducing the likelihood of detecting false classes.
Our scene-specific approach is designed in line with these hypotheses, aiming to streamline the detection process by focusing on relevant object sizes and classes.
The system comprises two main components: the Server block and the Edge AI block. The Server block handles training and designing the model, while the Edge AI block is responsible for inference. An overview of the scene-specific system is depicted in Fig. 2. In the Server block, the model undergoes training and is subsequently tailored for a specific scene. In the Edge AI block, the scene-specific model received from the server processes the test image. The model's prediction is then sent to the Comparison block along with the model itself. The Comparison block determines the most effective object detector (OD) by evaluating both the input model and its prediction results. The chosen model then receives the test image, initiating the inference process.
In our experiments, we observed a challenge in detecting distant, small-sized objects using a stationary camera. To address this, inspired by successful instances of enhanced small object detection [24,25], we not only added an extra tiny head for detecting these small objects but also incorporated corresponding layers in the backbone and neck of the model. This comprehensive update, integrating the tiny head with aligned layers in the backbone and neck, results in a 4-head structure. This structure efficiently handles variations in object scales and improves the detection of smaller objects. However, it is important to note that these enhancements in detection capability come at the cost of increased computational requirements and memory consumption.
In YOLOv5, prediction heads are differentiated by their roles as tiny, small, medium, and large, based on the object scale they detect. We propose that in certain scenes, not all four prediction heads are necessary due to the limited scale of objects in the input image. For instance, in high-altitude environments where objects appear smaller, only the tiny and small prediction heads might be required. Consequently, in such scenarios, the medium and large prediction heads, along with their corresponding neck nodes, are disabled.
Our model is designed to offer four modes of access through a single trained model. Fig. 3 visualizes this redesigned scene-specific model. It consists of four detectors, each specialized for a different object-size range (micro, mini, middle, and big), determined by which nodes in the neck and prediction head are deactivated. Except for OD-Micro, each object detector (OD) is hierarchically structured, accumulating itself and all smaller variants; for clarity, we use the term Accumulated Object Detector (AOD) to denote this property. AOD-Big corresponds to the complete, unmodified base model. OD-Micro utilizes only the tiny prediction head, AOD-Mini employs both the tiny and small heads, and AOD-Middle combines the tiny, small, and medium heads. Each detector activates only its defined prediction heads, blocking the others and their related neck nodes to optimize performance for specific scene requirements.
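To make the accumulated-detector idea concrete, below is a minimal PyTorch-style sketch of how the four prediction heads, one per feature-map scale, can be skipped selectively at inference time. It is an illustration rather than our exact implementation: the channel widths and the number of classes are placeholder values, and in the full system the corresponding neck nodes are blocked as well, whereas this sketch only masks the head outputs.

```python
import torch.nn as nn

# Accumulated Object Detectors expressed as sets of active prediction heads.
# Index 0 is the added "tiny" head (P2, stride 4); 1-3 are the standard
# small/medium/large heads of YOLOv5 (P3/8, P4/16, P5/32).
AOD_CONFIGS = {
    "OD-Micro":   {0},
    "AOD-Mini":   {0, 1},
    "AOD-Middle": {0, 1, 2},
    "AOD-Big":    {0, 1, 2, 3},   # full, unmodified 4-head model
}

class MaskedDetect(nn.Module):
    """Four per-scale 1x1 prediction convolutions; deactivated heads are skipped."""

    def __init__(self, channels=(64, 128, 256, 512), anchors=3, classes=2):
        super().__init__()
        outputs = anchors * (5 + classes)  # box (4) + objectness (1) + class scores
        self.heads = nn.ModuleList(nn.Conv2d(c, outputs, 1) for c in channels)
        self.active = AOD_CONFIGS["AOD-Big"]

    def select(self, mode):
        # Called once the decision module has chosen a detector for the scene.
        self.active = AOD_CONFIGS[mode]

    def forward(self, features):
        # `features`: the four neck outputs, ordered finest (P2) to coarsest (P5).
        return [self.heads[i](f) for i, f in enumerate(features) if i in self.active]
```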
To enhance the selection of scene-specific model detectors in terms of efficiency and accuracy, our system incorporates an automated decision module. This module evaluates each scene to identify the most fitting detector, taking into account both inference speed and accuracy, with accuracy benchmarked against the original XLarge (AOD-Big) model, whose predictions are treated as ground truth. The selection process involves a comparative analysis of both the inference speed and the accuracy of each detector against the AOD-Big model. The aim is to select a detector that maintains high inference speed while keeping the accuracy loss within an acceptable threshold, preferably less than a specified percentage of the original XLarge model's accuracy. This approach, grounded in experimental findings, ensures a balanced consideration of speed and accuracy for each scene-specific application. Detailed procedures and outcomes of this evaluative process are detailed in Section IV.
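The selection rule can be summarized as a small function. The sketch below is illustrative rather than our exact code: it assumes each candidate detector has already been benchmarked on the target scene against the AOD-Big predictions, and it returns the fastest detector whose relative mAP loss stays within the threshold (1% in our experiments; see Section IV).

```python
def select_detector(candidates, max_loss=0.01):
    """candidates: dict mapping detector name -> (mAP, FPS), where mAP is
    measured against AOD-Big predictions used as pseudo ground truth."""
    reference = candidates["AOD-Big"][0]            # accuracy of the full model
    admissible = {
        name: fps
        for name, (m_ap, fps) in candidates.items()
        if (reference - m_ap) / reference <= max_loss
    }
    return max(admissible, key=admissible.get)      # highest FPS among admissible

# Example with the Small-model mAP_0.5 and FPS values from Table 3 (scene (d)):
scene_d = {
    "OD-Micro":   (0.904, 35),
    "AOD-Mini":   (0.945, 32),
    "AOD-Middle": (0.970, 29),
    "AOD-Big":    (0.970, 27),
}
print(select_detector(scene_d))  # -> "AOD-Middle"
```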
In the initial phase of our study, we designed 12 different models by applying scene-specific detectors to three base models: YOLOv5 Nano, YOLOv5 Small, and YOLOv5 Medium. This approach, however, proved to be challenging due to the necessity of training 12 distinct models and the limited reusability resulting from the removal of nodes related to the model's neck and head. To address these issues, we revised our strategy by partially blocking nodes instead of removing them, thus reducing the number to only three models. This adjustment increased the models' flexibility, eased the training burden, and enhanced reusability.
For our training and experimental data, we used the 2021 AI City Challenge dataset (www.aicitychallenge.org), which includes data on vehicle and pedestrian movement captured by stationary cameras at various altitudes. We processed this dataset by extracting video segments, converting them into images frame by frame, and then using these images as input to a pretrained YOLOv5 XLarge model to establish ground-truth data. In extracting the ground-truth data, we focused on two labels, pedestrian and vehicle, to minimize false positives. The dataset comprised 9,266 images for training, 2,926 for validation, and 1,030 for testing.
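A sketch of this preprocessing pipeline is shown below, assuming the standard torch.hub interface to a pretrained YOLOv5x model. The frame-sampling interval, confidence threshold, and the COCO class indices kept as "pedestrian" and "vehicle" are illustrative choices rather than our exact settings.

```python
import cv2
import torch

# Pretrained YOLOv5 XLarge used as the pseudo-ground-truth labeler.
model = torch.hub.load("ultralytics/yolov5", "yolov5x", pretrained=True)
model.conf = 0.5                         # confidence threshold (illustrative)
KEEP = {0: "pedestrian", 2: "vehicle"}   # COCO person/car indices (illustrative)

def extract_and_label(video_path, every_n=10):
    """Convert a video segment to frames and pseudo-label every n-th frame."""
    cap = cv2.VideoCapture(video_path)
    frame_id, samples = 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_id % every_n == 0:
            det = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            boxes = det.xyxy[0]          # rows of (x1, y1, x2, y2, conf, cls)
            mask = torch.tensor([int(c) in KEEP for c in boxes[:, 5]], dtype=torch.bool)
            samples.append((frame, boxes[mask]))
        frame_id += 1
    cap.release()
    return samples
```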
The training of the models was conducted on a server in a PyTorch 1.10.2 and Torchvision 0.11.3 environment, utilizing an Intel(R) Core(TM) i7-10700F CPU @ 2.90GHz and an NVIDIA GeForce RTX 2080 Super GPU. Key training parameters included a momentum of 0.937, an initial learning rate of 0.01, a batch size of 4, and 300 epochs. We also employed an auto-anchor algorithm during training to ensure the anchors were optimally suited to the current data set.
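For reference, a training run with these settings could be launched programmatically through the YOLOv5 repository's train module, roughly as sketched below. The dataset, hyperparameter, and 4-head model configuration file names are placeholders, the hyperparameter file is assumed to carry lr0 = 0.01 and momentum = 0.937, and the auto-anchor step runs by default.

```python
import train  # train.py from the ultralytics/yolov5 repository

train.run(
    data="traffic.yaml",         # pedestrian/vehicle dataset definition (placeholder)
    cfg="yolov5s-4head.yaml",    # 4-head model definition (placeholder)
    hyp="hyp.traffic.yaml",      # lr0=0.01, momentum=0.937 (placeholder)
    weights="yolov5s.pt",        # repeated likewise for yolov5n.pt and yolov5m.pt
    epochs=300,
    batch_size=4,
    imgsz=640,                   # assumed input resolution
)
```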
Inference experiments were performed on an embedded system, simulated using a Jetson AGX Xavier. The GPU experiment environment, set up with JetPack 4.6.2 through the NVIDIA SDK Manager, included PyTorch 1.8 and Torchvision 0.9.
Our experiments aimed to assess various model optimization methods, focusing on their effects on model size, inference speed, and accuracy. Initially, we experimented with unstructured pruning but found that it did not significantly reduce the model file size. This lack of size reduction is because the pruned (zeroed) filters are still stored in the weight file; without acceleration techniques such as skipping zeros during computation, no inference speed improvement was observed either. Table 1 reports the accuracy after pruning; no notable speed improvements were recorded.
Table 1. Comparison of accuracy after pruning of each model.

| Model | Pruning ratio | Precision | Recall | mAP_0.5 | mAP_0.5:0.95 |
|---|---|---|---|---|---|
| Nano | 0% | 0.913 | 0.897 | 0.949 | 0.828 |
| Nano | 10% | 0.913 | 0.897 | 0.948 | 0.813 |
| Nano | 20% | 0.894 | 0.864 | 0.933 | 0.754 |
| Nano | 30% | 0.837 | 0.767 | 0.857 | 0.593 |
| Nano | 40% | 0.832 | 0.396 | 0.644 | 0.348 |
| Small | 0% | 0.931 | 0.909 | 0.957 | 0.867 |
| Small | 10% | 0.931 | 0.909 | 0.957 | 0.867 |
| Small | 20% | 0.931 | 0.903 | 0.956 | 0.816 |
| Small | 30% | 0.911 | 0.883 | 0.948 | 0.696 |
| Small | 40% | 0.859 | 0.772 | 0.886 | 0.516 |
| Medium | 0% | 0.944 | 0.917 | 0.962 | 0.89 |
| Medium | 10% | 0.943 | 0.917 | 0.963 | 0.889 |
| Medium | 20% | 0.943 | 0.914 | 0.962 | 0.855 |
| Medium | 30% | 0.941 | 0.906 | 0.96 | 0.761 |
| Medium | 40% | 0.921 | 0.872 | 0.926 | 0.58 |
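The lack of size reduction described above can be reproduced with PyTorch's built-in unstructured pruning. The minimal sketch below, on a small stand-in network rather than our YOLOv5 models, shows that the checkpoint size is essentially unchanged because the zeroed weights are still stored as dense FP32 values.

```python
import os
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.Conv2d(64, 64, 3))

def checkpoint_size(m, path="tmp.pt"):
    torch.save(m.state_dict(), path)
    return os.path.getsize(path)

size_before = checkpoint_size(model)

# L1 unstructured pruning: zero out 40% of each convolution's weights.
for module in model:
    prune.l1_unstructured(module, name="weight", amount=0.4)
    prune.remove(module, "weight")   # bake the zeros into the weight tensor

size_after = checkpoint_size(model)
# The sizes match (up to a few bytes): zeros occupy the same storage as any
# other value, and without sparse kernels they are still multiplied at inference.
print(size_before, size_after)
```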
Subsequently, we conducted TFLite quantization experiments on the same models, initially testing on a server. The results in Table 2 reveal that after quantization to the FP16 and INT8 types, the models' speed actually decreased compared to the baseline. This slowdown was likely due to the server's Intel CPU, as TFLite quantization is optimized for ARM CPUs, rendering it less effective for our setup.
Table 2. Evaluated results of each model after quantization.

| Model | Type | Precision | Recall | mAP_0.5 | mAP_0.5:0.95 | CPU speed (ms) |
|---|---|---|---|---|---|---|
| Nano | Base | 0.915 | 0.916 | 0.969 | 0.844 | 57.9 |
| Nano | FP16 | 0.91 | 0.916 | 0.968 | 0.835 | 99.4 |
| Nano | INT8 | 0.814 | 0.883 | 0.922 | 0.645 | 99.3 |
| Small | Base | 0.932 | 0.928 | 0.977 | 0.883 | 118.6 |
| Small | FP16 | 0.932 | 0.926 | 0.976 | 0.875 | 318.3 |
| Small | INT8 | 0.846 | 0.892 | 0.942 | 0.683 | 242.1 |
| Medium | Base | 0.946 | 0.938 | 0.98 | 0.906 | 257 |
| Medium | FP16 | 0.943 | 0.937 | 0.979 | 0.899 | 892.6 |
| Medium | INT8 | 0.828 | 0.89 | 0.938 | 0.658 | 591.3 |
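For reference, post-training TFLite quantization of this kind follows the standard tf.lite.TFLiteConverter workflow, sketched below with placeholder paths and random calibration data; our actual export details may differ (the YOLOv5 repository also ships an export script for TFLite).

```python
import numpy as np
import tensorflow as tf

# FP16 post-training quantization of a SavedModel export (placeholder path).
converter = tf.lite.TFLiteConverter.from_saved_model("yolov5s_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
open("yolov5s-fp16.tflite", "wb").write(converter.convert())

# INT8 quantization additionally needs a representative dataset so that
# activation ranges can be calibrated (random data here as a placeholder).
def representative_images():
    for _ in range(100):
        yield [np.random.rand(1, 640, 640, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("yolov5s_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_images
open("yolov5s-int8.tflite", "wb").write(converter.convert())
```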
Before experimenting with the scene-specific method, we set a target inference speed of 30 frames per second (FPS) for analyzing people and vehicle movement on the embedded system, as this rate is generally perceived as real time. Table 3 presents the results of applying the scene-specific model method to the Nano, Small, and Medium models. In these experiments, accuracy and inference speed were measured using scene (d) from Fig. 4. When comparing the models to identify the most suitable one for our embedded system, there was little difference in total accuracy (mAP_0.5 and mAP_0.5:0.95) across the three models. However, in terms of inference speed (FPS), the Nano and Small models approached our target, whereas the Medium model did not. Therefore, considering both accuracy and inference speed, the Small model is the most suitable for our current embedded system.
Table 3. Comparison of accuracy and inference speed for each model.

| Model | Metric | OD-Micro | AOD-Mini | AOD-Middle | AOD-Big |
|---|---|---|---|---|---|
| Nano | mAP_0.5 | 0.899 | 0.938 | 0.966 | 0.966 |
| Nano | mAP_0.5:0.95 | 0.766 | 0.811 | 0.842 | 0.843 |
| Nano | FPS | 37 | 33 | 30 | 27 |
| Small | mAP_0.5 | 0.904 | 0.945 | 0.97 | 0.97 |
| Small | mAP_0.5:0.95 | 0.802 | 0.847 | 0.875 | 0.874 |
| Small | FPS | 35 | 32 | 29 | 27 |
| Medium | mAP_0.5 | 0.904 | 0.945 | 0.971 | 0.98 |
| Medium | mAP_0.5:0.95 | 0.802 | 0.847 | 0.874 | 0.874 |
| Medium | FPS | 19 | 17 | 16 | 15 |
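The FPS figures above correspond to single-image inference latency on the Jetson. A minimal sketch of how such a figure can be measured is shown below; the warm-up count, iteration count, and input size are illustrative.

```python
import time
import torch

def measure_fps(model, device="cuda", imgsz=640, warmup=20, iters=200):
    """Average single-image inference speed in frames per second."""
    model = model.to(device).eval()
    dummy = torch.zeros(1, 3, imgsz, imgsz, device=device)
    with torch.no_grad():
        for _ in range(warmup):          # let GPU clocks and cuDNN settle
            model(dummy)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(dummy)
        torch.cuda.synchronize()
    return iters / (time.time() - start)
```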
In Section III, we established a 1% accuracy loss threshold as the selection criterion for detectors. This decision is based on the accuracy analysis of each detector in the Small model, as seen in Table 3, where the scene-specific model was applied. The accuracy difference between the AOD-Mini and AOD-Middle of the Small model was approximately 2.58% for mAP_0.5 (0.945 vs. 0.97) and 3.2% for mAP_0.5:0.95 (0.847 vs. 0.875). However, because some objects were not properly detected in the actual prediction results, we set the selection criterion to within 1% to mitigate this issue. Furthermore, we compared the accuracy of AOD-Middle and AOD-Big to determine the most suitable detector for the current scene. Since their accuracy was comparable, AOD-Middle was chosen.
To validate our proposed method, experiments were conducted across six different scenes. Fig. 4 illustrates these results using the scene-specific model method with images captured by a fixed camera. Images (a) to (f) in Fig. 4 display prediction results ranging from OD-Micro (left) to AOD-Big (right). The experiments revealed that in scenes (a), (c), and (e), middle-scale objects were not present, suggesting the suitability of AOD-Mini. Conversely, in scenes (b), (d), and (f), where big-scale objects were absent, AOD-Middle was deemed appropriate. This confirms that applying the scene-specific model across various scenes can increase inference speed while maintaining model accuracy.
In tackling the complexities of real-time traffic analysis with edge AI devices, this study introduced a scene-specific system designed for cameras in fixed environments. We adapted this system to the YOLO network in three variants, enabling the use of four distinct detectors via a single model. A specialized module was developed to select the most appropriate detector for each scene. Our experimental evaluation, conducted across six different scenes, demonstrated that AOD-Mini is the optimal choice for three scenes, while AOD-Middle is more suitable for the other three. This approach successfully improved inference speed without sacrificing accuracy. The positive results from the Scene-Specific System indicate its potential for broader application and its benefits to other neural network architectures. We anticipate that our research will contribute to the development of innovative lightweight methods, offering alternatives to traditional approaches to model lightweighting.
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the Innovative Human Resource Development for Local Intellectualization support program (IITP-2023-RS-2022-00156389) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation) and also was supported by the BK21 FOUR project (AI-driven Convergence Software Education Research Program) funded by the Ministry of Education, School of Computer Science and Engineering, Kyungpook National University, Korea (4199990214394).
Jin Ho Lee received his bachelor's degree in mechanical engineering from Dong-A University, South Korea. He is currently pursuing his master's degree at the Virtual Reality Lab in the Department of Computer Science at Kyungpook National University. His research interest is model compression for object detection.
In Su Kim received his bachelor's and master's degrees from the Department of Computer Science at Kyungpook National University. He is currently pursuing his Ph.D. at the Virtual Reality Lab in the Department of Computer Science at Kyungpook National University. His research interests include machine learning, computer vision, and virtual reality.
Hector Acosta is currently pursuing a master's degree at the Virtual Reality Lab in the Department of Computer Science at Kyungpook National University. His research interest is model compression for object detection.
Hyeong Bok Kim is currently working as a researcher at Testworks, Inc. (Seoul 01000, Korea).
Seung Won Lee is currently working as a researcher at Testworks, Inc. (Seoul 01000, Korea).
Soon Ki Jung is a full-time professor in the School of Computer Science and Engineering at Kyungpook National University, Korea. His research spans a range of topics including augmented reality (AR), 3D computer graphics, computer vision, human-computer interaction (HCI), mobile application development, wearable computing, and other related fields.