Abstract
A real-time target detection method based on center points is proposed for infrared imaging systems equipped only with CPUs. Following lightweight design principles, a backbone with low computational cost is first introduced for feature extraction. Correspondingly, an efficient feature fusion module is designed to exploit the spatial and contextual information extracted at multiple stages. In addition, an auxiliary background suppression module is proposed to predict foreground regions and thereby enhance the feature representation. Finally, a simple detection head predicts the target center point and its associated properties. Evaluations on an infrared aerial target dataset show that the proposed method achieves 90.24% mAP at 21.69 ms per frame on the CPU. It surpasses the state-of-the-art Tiny-YOLOv3 by 10.16% mAP with only 21% of the FLOPs and 11% of the parameters, while also running 10.02 ms faster. The results demonstrate its great potential for real-time infrared applications.
Target detection is one of the most critical yet challenging tasks in infrared (IR) imaging systems, as it involves a combination of target classification and localization.
From the perspective of detection methods, current CNN-based detection models can be roughly divided into anchor-based detectors and anchor-free detectors. Anchor-based detectors start by tiling a huge number of pre-defined rectangular bounding boxes (anchors) with different ratios and scales on the high-level feature maps extracted from images. Taking these anchors as proposal candidates, two-stage detectors such as Faster R-CNN first select a sparse set of region proposals and then classify and refine them in a second stage.
By avoiding the intricate design and heavy computation of anchors, anchor-free detectors based on key points, such as CornerNet and CenterNet, have drawn much attention recently.
To alleviate the resource consumption of CNNs, a number of efficient architectures have been designed, including SqueezeNet, MobileNets, and ShuffleNet.
To achieve a better balance between detection accuracy and speed for CPU-only IR systems, we propose a real-time infrared target detection model, termed TCPD, inspired by both the neat anchor-free detector CenterNet and the lightweight units introduced by ShuffleNetV2.
In this section, we present the details of TCPD, including the network design and workflow. Although the model is designed primarily for detection efficiency, its accuracy still reaches a high level.

Fig. 1 The overall architecture of TCPD
The feature extraction module (FEM), commonly called the backbone network, is the computationally heaviest part of a detection model. Therefore, designing a lightweight backbone with strong representation power is fundamental to fast and accurate detection. Starting from ShuffleNetV2, we build a new lightweight FEM, which requires only 365 million FLOPs at an input resolution of 384×384 pixels. The detailed structure of FEM is listed in
As listed, FEM consists of five stages in total. After each stage, the feature resolution is halved while the number of feature channels increases. In “Stage1” and “Stage2”, FEM quickly down-samples the input to 1/4 resolution and expands the feature channels to 24 through a simple 3×3 convolution and a 3×3 max pooling. From “Stage3” to “Stage5”, each stage is built by stacking several repeated blocks, as shown in Fig. 2.

Fig. 2 The structure of blocks in FEM
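Figure 2 itself is not reproduced here. As a rough reference, below is a minimal PyTorch sketch of the Stage1-Stage2 stem described above and of a ShuffleNetV2-style stride-1 unit of the kind we assume the later stages stack; the exact layer ordering, normalization choices, and channel widths are assumptions, not the published structure.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    # Interleave channels so information is exchanged between the two branches.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class Stem(nn.Module):
    """Stage1-Stage2: a 3x3 convolution plus a 3x3 max pooling that reduce the
    input to 1/4 resolution with 24 channels."""
    def __init__(self, out_channels=24):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, out_channels, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)

    def forward(self, x):
        return self.pool(self.conv(x))

class ShuffleUnit(nn.Module):
    """ShuffleNetV2-style stride-1 unit: channel split, 1x1 -> 3x3 depthwise -> 1x1,
    concatenation, then channel shuffle."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False), nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)  # channel split
        return channel_shuffle(torch.cat((x1, self.branch(x2)), dim=1))
```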
Image features extracted by FEM at different stages represent different levels of information. Low-level features in early-stage feature maps describe more spatial details. By contrast, high-level features in late-stage feature maps capture more contextual information. As a result, localization is more sensitive to the larger early-stage feature maps, while classification relies more on the smaller late-stage feature maps. To better leverage both spatial and contextual information for detection, a simple feature fusion module (FFM) is designed.

Fig. 3 The network structure of FFM
Starting from “Stage5”, FFM combines four stages of FEM through a “bottom-to-up” structure. As the dimensions of the feature maps (size and channel) differ between two adjacent stages, two steps are needed to complete a single fusion. The first step is channel compression, performed by “Block3” for the first two fusions and by a 3×3 convolution for the last one. As shown in Fig. 3, the second step resizes the compressed feature maps so that they can be merged with those of the adjacent earlier stage.
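A minimal sketch of one such fusion step under these assumptions is given below; a 1×1 convolution stands in for the compression block and nearest-neighbor interpolation for the resizing, neither of which is confirmed by the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class FuseStep(nn.Module):
    """One FFM fusion step: compress the channels of the deeper feature map,
    up-sample it to the resolution of the shallower one, then merge by addition."""
    def __init__(self, deep_channels, shallow_channels):
        super().__init__()
        self.compress = nn.Sequential(
            nn.Conv2d(deep_channels, shallow_channels, 1, bias=False),
            nn.BatchNorm2d(shallow_channels),
            nn.ReLU(inplace=True))

    def forward(self, deep, shallow):
        deep = self.compress(deep)
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="nearest")
        return deep + shallow
```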
Generally speaking, a high-performance network is expected to focus on features in the foreground region rather than those in the background. To achieve this goal, we design a computation-friendly Background Suppression Module (BSM) to explicitly guide the network to learn a proper feature distribution.

Fig. 4 The network structure of BSM
BSM has two functions: predicting foreground regions and re-weighting feature maps over the spatial dimensions. Foreground prediction is the basis of feature re-weighting. During training, BSM first passes the input from FEM through two convolutional layers to a single-layer detection head. The detection head then predicts foreground regions within one heatmap. The ground-truth foreground region is defined as the union of all ground-truth targets mapped to the heatmap, where the region of each target is produced by a 2D Gaussian kernel, formulated as:
$Y_{xy}=\exp\left(-\frac{(x-c_{x})^{2}}{2\sigma_{x}^{2}}-\frac{(y-c_{y})^{2}}{2\sigma_{y}^{2}}\right)$ ,  (1)
where $(c_{x}, c_{y})$ is the center point of the mapped ground-truth box, and $\sigma_{x}$ and $\sigma_{y}$ are determined by the width and height of the ground-truth target and the hyper-parameter $t$, which is set to 0.95 by default. All points inside the kernel are regarded as positive samples. If two kernels overlap, the element-wise maximum is taken. Focal loss is adopted to supervise the foreground prediction, formulated as:
$L_{\mathrm{bsm}}=-\frac{1}{N}\sum_{xy}\begin{cases}\left(1-\hat{P}_{xy}\right)^{\alpha}\log\left(\hat{P}_{xy}\right), & P_{xy}=1\\ \left(1-P_{xy}\right)^{\beta}\hat{P}_{xy}^{\alpha}\log\left(1-\hat{P}_{xy}\right), & \text{otherwise}\end{cases}$ ,  (2)
where $N$ is the total number of ground-truth targets, $P_{xy}$ specifies the ground-truth foreground regions, $\hat{P}_{xy}$ denotes the estimated foreground probability at location $(x,y)$, and $\alpha$ and $\beta$ are the focusing hyper-parameters of the focal loss.
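As a concrete sketch of Eqs. (1)-(2) under our assumptions: the penalty-reduced focal-loss form commonly paired with Gaussian heatmaps is used, and the exponents alpha and beta as well as the function names are ours, not values or names taken from the paper.

```python
import torch

def draw_gaussian(heatmap, center, sigma_x, sigma_y):
    """Splat the 2D Gaussian of Eq. (1) onto an H x W heatmap, keeping the element-wise maximum."""
    h, w = heatmap.shape
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    cx, cy = center
    g = torch.exp(-((xs - cx) ** 2 / (2 * sigma_x ** 2) + (ys - cy) ** 2 / (2 * sigma_y ** 2)))
    torch.maximum(heatmap, g, out=heatmap)
    return heatmap

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss over a Gaussian heatmap (a sketch of the form assumed for Eq. (2))."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_term = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_term = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return -(pos_term.sum() + neg_term.sum()) / num_pos
```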
As the trained BSM has the ability to predict foreground regions, the intermediate layer before the detection head can guide the feature distribution. For computational efficiency, only an element-wise convolution followed by the sigmoid function is used to re-weight the input feature maps over the spatial dimensions.
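A possible PyTorch sketch of BSM under these assumptions follows; the two 3×3 convolutions before the single-layer head, the 1×1 convolution standing in for the "element-wise convolution", and the channel widths are all our guesses rather than the exact published layers.

```python
import torch.nn as nn

class BSM(nn.Module):
    """Background Suppression Module: predict a foreground heatmap (training-time supervision)
    and re-weight the input features over the spatial dimensions."""
    def __init__(self, channels):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.fg_head = nn.Conv2d(channels, 1, 1)   # single-layer detection head, supervised by Eq. (2)
        self.gate = nn.Conv2d(channels, 1, 1)      # assumed 1x1 "element-wise" convolution

    def forward(self, x):
        mid = self.convs(x)
        fg_pred = self.fg_head(mid).sigmoid()      # foreground probability map
        weight = self.gate(mid).sigmoid()          # spatial re-weighting map
        return x * weight, fg_pred
```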
The target prediction module (TPM) is the last module of TCPD. It is responsible for predicting all the information needed to localize and classify targets. To match the lightweight design of the other modules, a unified structure with only one 3×3 convolutional layer is used in TPM, as shown in Fig. 5.

Fig. 5 The network structure of TPM
TPM treats target detection as center localization plus size regression. For center localization, it predicts center confidence scores of the different target categories on the corresponding center heatmaps. The ground-truth heatmaps are produced by the same Gaussian kernel defined in Eq. (1), and the center heatmaps are trained with the same focal loss:
$L_{\mathrm{center}}=-\frac{1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right), & Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\hat{Y}_{xyc}^{\alpha}\log\left(1-\hat{Y}_{xyc}\right), & \text{otherwise}\end{cases}$ ,  (3)
where $N$ is the total number of ground-truth center points, $\hat{Y}_{xyc}$ is the center confidence score of class $c$ at location $(x,y)$, and $Y_{xyc}$ is its corresponding ground-truth value. In addition to center heatmaps, TPM also predicts coordinate offsets to compensate for the discretization error caused by downsampling. Center locations are adjusted slightly by these offsets when remapping from the heatmap to the original image. An L1 loss is adopted for training, defined as:
$L_{\mathrm{off}}=\frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}}-\left(\frac{p}{R}-\tilde{p}\right)\right|$ ,  (4)
where $N$ is again the total number of ground-truth center points, $\hat{O}_{\tilde{p}}$ is the predicted offset, $p$ is the center coordinate on the original image, $R$ is the downsampling factor (4 by default), and $\tilde{p}=\left\lfloor p/R\right\rfloor$ is the discretized center coordinate on the heatmap.
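For a concrete example of the offset target with the default $R=4$ (the helper name is ours):

```python
import math

def offset_target(px, py, R=4):
    """Offset between the true center p and its discretized heatmap cell (the target in Eq. (4))."""
    qx, qy = math.floor(px / R), math.floor(py / R)
    return px / R - qx, py / R - qy

# A center at (101, 54) maps to heatmap cell (25, 13) with offset (0.25, 0.5).
```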
For size regression, TPM directly predicts the width and height of each target at its center point. The target size is also trained with an L1 loss:
$L_{\mathrm{size}}=\frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{k}-S_{k}\right|$ ,  (5)
where $\hat{S}_{k}$ is the predicted size of the $k$-th target, and $S_{k}=\left(w_{k},h_{k}\right)$ is its ground-truth size.
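Putting the three outputs together, a minimal sketch of TPM as we read it, with one 3×3 convolution per branch; the use of separate branches and the sigmoid on the heatmap are assumptions.

```python
import torch.nn as nn

class TPM(nn.Module):
    """Target Prediction Module: one 3x3 convolution per output branch."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.heatmap = nn.Conv2d(in_channels, num_classes, 3, padding=1)  # center confidences per class
        self.offset = nn.Conv2d(in_channels, 2, 3, padding=1)             # (dx, dy) per center
        self.size = nn.Conv2d(in_channels, 2, 3, padding=1)               # (w, h) per center

    def forward(self, x):
        return self.heatmap(x).sigmoid(), self.offset(x), self.size(x)
```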
Combined with the loss in BSM, the total loss for training is:
$L_{\mathrm{total}}=L_{\mathrm{center}}+\lambda_{\mathrm{off}}L_{\mathrm{off}}+\lambda_{\mathrm{size}}L_{\mathrm{size}}+L_{\mathrm{bsm}}$ ,  (6)
where $\lambda_{\mathrm{off}}$ and $\lambda_{\mathrm{size}}$ are the weights of $L_{\mathrm{off}}$ and $L_{\mathrm{size}}$, respectively. Both are set to 0.1 in our experiments unless otherwise specified.
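Read as code, Eq. (6) amounts to the following combination, assuming (as above) that the two 0.1 weights apply to the offset and size terms; the pairing is our reading of the text.

```python
def total_loss(l_center, l_off, l_size, l_bsm, lam_off=0.1, lam_size=0.1):
    # Eq. (6): center heatmap loss plus weighted offset/size regressions plus the BSM loss.
    return l_center + lam_off * l_off + lam_size * l_size + l_bsm
```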
Different from training, a simple post-processing procedure is used to generate the final predictions during inference. Instead of IoU-based NMS, a 3×3 max-pooling operation is applied to the center heatmaps to keep local peaks, and the top 100 center points with the highest confidence scores are selected. After adjustment by the coordinate offsets, all selected center points and their corresponding target sizes are remapped to the original image. The final results are those with confidence scores above a manually set threshold.
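A sketch of this decoding step is given below; the 3×3 max pooling and the top-100 selection follow the text, while the threshold value, batch handling, and the assumption that sizes are regressed at the original-image scale are ours.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def decode(heatmap, offset, size, R=4, topk=100, thresh=0.3):
    """Select heatmap peaks with a 3x3 max-pool, keep the top-k, and remap them to the image."""
    b, c, h, w = heatmap.shape                               # batch size assumed to be 1 here
    keep = (F.max_pool2d(heatmap, 3, stride=1, padding=1) == heatmap).float()
    scores, idx = (heatmap * keep).view(b, -1).topk(topk)    # top-k over classes and positions
    cls = idx // (h * w)
    ys = (idx % (h * w)) // w
    xs = idx % w
    boxes = []
    for i in range(topk):
        if scores[0, i] < thresh:                            # manual confidence threshold
            continue
        x, y, k = xs[0, i], ys[0, i], cls[0, i]
        dx, dy = offset[0, 0, y, x], offset[0, 1, y, x]
        tw, th = size[0, 0, y, x], size[0, 1, y, x]
        cx, cy = (x + dx) * R, (y + dy) * R                  # remap the center to the original image
        boxes.append((int(k), float(scores[0, i]),
                      float(cx - tw / 2), float(cy - th / 2),
                      float(cx + tw / 2), float(cy + th / 2)))
    return boxes
```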
In this section, we first evaluate the performance of TCPD on both a self-built infrared aerial target dataset and the public visible-light dataset PASCAL VOC. An ablation study is then conducted to further evaluate our design.
In our experiments, an infrared aerial target dataset is built for training and testing. It contains 2 758 images with 3 000 labeled infrared targets, all captured from ground-to-air infrared videos. The labeled targets fall into five categories: bird, helicopter, airliner, trainer, and fighter. The training and test sets are split at a ratio of 7:3. Results on the public PASCAL VOC dataset are also reported to verify the generalization ability of TCPD. PASCAL VOC contains natural images of 20 categories. The VOC 2007 and 2012 trainval sets are combined for training, while the VOC 2007 test set is used for testing.
We implement TCPD in PyTorch. It is trained on a single GTX 1080 Ti GPU and tested on an Intel Core i9-9900KS CPU. During training, the input resolution is set to 384×384. Standard data augmentation is applied, including random flipping, random scaling, cropping, and color jittering. Adam is adopted to optimize the total loss. By default, TCPD is trained with a batch size of 32 for 150 epochs. The learning rate starts from 1.25e-3 and decays by a factor of 0.1 at the 70th and 120th epochs.
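This schedule maps onto a standard PyTorch training loop, sketched below; the `compute_loss` method and the data loader are hypothetical stand-ins, while the optimizer, learning rate, milestones, and epoch count follow the values reported above.

```python
import torch

def train(model, train_loader, epochs=150):
    """Training loop following the reported schedule; `model.compute_loss` is hypothetical."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1.25e-3)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[70, 120], gamma=0.1)
    for epoch in range(epochs):
        for images, targets in train_loader:            # batch size 32, 384x384 inputs
            loss = model.compute_loss(images, targets)  # total loss of Eq. (6)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```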
Accuracy is one of the most critical metrics for a detection model. A good lightweight model must classify and localize accurately while remaining efficient. We first evaluate our model on the infrared dataset. The results are shown in
As shown in the table, TCPD achieves 90.24% mAP on the infrared dataset, surpassing Tiny-YOLOv3 by 10.16% mAP. Some detection examples are shown in Fig. 6.

Fig. 6 Examples on infrared dataset
In addition to the evaluation on the infrared dataset, the model is also trained on the VOC dataset to verify its generalization ability. The network and all training hyper-parameters are kept the same as those used on the infrared dataset. The results are reported in
As the VOC dataset contains more target categories and more complex scenes, it is reasonable that large GPU-based models with more powerful representation abilities perform better than TCPD. However, TCPD still achieves 66.76% mAP, which is close to CenterNet while being two times faster. Compared with Tiny-YOLOv3 and FKPD, TCPD surpasses them by 8.26% and 5.05% mAP, respectively. As for the latest Tiny-YOLOv4, TCPD still outperforms it by 1.05%. These results demonstrate that TCPD adapts well to target detection in different applications. Some examples are shown in Fig. 7.

Fig. 7 Examples on the VOC dataset
As discussed, inference speed plays a significant role in determining whether a model can be applied in IR systems without GPU acceleration. Computational cost (FLOPs) and model size (parameters) are two key metrics for evaluating a lightweight model. The computational cost has a direct influence on the inference speed: lower FLOPs generally mean faster detection. The model size, in turn, directly affects the storage cost, and a model with fewer parameters is easier to deploy and usually has lower FLOPs.
With only 0.49 billion FLOPs and 0.95 million parameters, TCPD achieves real-time single-frame detection on the CPU at 21.69 ms per frame. It is 10.02 ms and 4.17 ms faster than Tiny-YOLOv3 and FKPD, with merely 21% and 34% of their FLOPs, respectively. The speed of Tiny-YOLOv4 is on par with FKPD, which is 4.54 ms slower than TCPD. Compared with the other two GPU-based models, the speed advantage of TCPD is even more significant. Combined with the discussion in subsection 2.2, TCPD achieves a better overall performance that balances accuracy and speed. As a result, it is more suitable for CPU-only IR systems, which require accurate target detection at real-time speed.
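Per-frame CPU latency of this kind is commonly measured as follows; this is a generic timing sketch with assumed warm-up and run counts, not the authors' exact protocol.

```python
import time
import torch

@torch.no_grad()
def cpu_latency_ms(model, input_size=384, warmup=10, runs=100):
    """Average single-frame forward time on the CPU, in milliseconds."""
    model = model.eval().cpu()
    x = torch.randn(1, 3, input_size, input_size)
    for _ in range(warmup):                 # warm-up runs are excluded from timing
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs * 1e3
```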
In this subsection, we first evaluate the network design of TCPD. Experiments include varying the input resolution, compressing the feature channels, and ablating individual modules. The results are shown in
Input resolution is an important factor with a notable influence on the performance of TCPD. Smaller images mean lower-resolution feature maps, which leads to the loss of detailed features; larger images can improve detection accuracy but slow down inference. Line 2 and line 3 in
In addition to the network design, we also investigate the influence of the Gaussian kernel defined in Eq. (1).
Ranging from 0.35 to 0.95, the variation of $t$ actually affects the extent of negative samples inside the ground-truth box whose penalty is reduced (i.e., whose ground-truth value is non-zero). An appropriate $t$ can improve the detection accuracy. For the infrared dataset, which contains more small targets, the choice of $t$ is relatively flexible; for the VOC dataset with larger targets, the impact of $t$ is significant, so it should be chosen more carefully.
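If, as in CornerNet-style detectors, $t$ acts as a minimum-overlap threshold that sets the Gaussian radius (an assumption on our part, shown here only for the simplest symmetric-shrinking case), the effect on the radius can be sketched as:

```python
import math

def gaussian_radius_shrink_case(width, height, min_overlap=0.95):
    """Largest r such that a box inset by r on every side still has IoU >= min_overlap
    with the ground truth.  Derivation: (w - 2r)(h - 2r) / (w h) >= t
    =>  4 r^2 - 2 (w + h) r + (1 - t) w h >= 0, take the smaller root."""
    t = min_overlap
    b = width + height
    return (b - math.sqrt(b * b - 4.0 * (1.0 - t) * width * height)) / 4.0

# A higher threshold gives a smaller radius, i.e. fewer negatives with a reduced penalty:
# gaussian_radius_shrink_case(40, 40, 0.35) ~= 8.2,  gaussian_radius_shrink_case(40, 40, 0.95) ~= 0.5
```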
We have proposed TCPD, a new real-time infrared target detection model based on center points. Benefiting from its lightweight design, the computational cost is low and fast inference is maintained on CPU-only devices. In addition to the fundamental feature extraction and target prediction, the Feature Fusion Module and the Background Suppression Module are designed to improve the feature representation. Evaluations on both the infrared and VOC datasets demonstrate the outstanding performance of TCPD, as it achieves a better balance between accuracy and speed. In summary, it provides a new option for real-time detection in IR systems. In the future, we plan to investigate methods such as network pruning to further speed up the model while keeping the detection accuracy, and finally deploy it as a key module in real infrared tracking systems.
References
Wu S C, Zuo Z R. Small target detection in infrared images using deep convolutional neural networks[J]. J. Infrared Millim. Waves, 2019, 38(3): 371-380.
Xie J R, Li F M, Wei H, et al. Enhancement of single shot multibox detector for aerial infrared target detection[J]. Acta Optica Sinica, 2019, 39(6): 0615001.
Miao Z, Zhang Y, Chen R M, et al. Method for fast detection of infrared targets based on key points[J]. Acta Optica Sinica, 2020, 40(23): 2312006.
Ren S Q, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems (NIPS), 2015: 91-99.
He K, Gkioxari G, Dollár P, et al. Mask R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2017: 2961-2969.
Cai Z W, Vasconcelos N. Cascade R-CNN: Delving into high quality object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018: 6154-6162.
Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016: 779-788.
Redmon J, Farhadi A. YOLO9000: Better, faster, stronger[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017: 7263-7271.
Redmon J, Farhadi A. YOLOv3: An incremental improvement[EB/OL]. (2018-04-08)[2021-02-10]. https://arxiv.org/abs/1804.02767.
Bochkovskiy A, Wang C Y, Liao H Y. YOLOv4: Optimal speed and accuracy of object detection[EB/OL]. (2020-04-23)[2021-02-10]. https://arxiv.org/abs/2004.10934.
Law H, Deng J. CornerNet: Detecting objects as paired keypoints[C]//Proceedings of the European Conference on Computer Vision (ECCV), 2018: 116-131.
Duan K W, Bai S, Xie L X, et al. CenterNet: Keypoint triplets for object detection[C]//Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2019: 6569-6578.
Zhou X Y, Wang D Q, Krähenbühl P. Objects as points[EB/OL]. (2019-05-25)[2021-02-10]. https://arxiv.org/abs/1904.07850.
Iandola F N, Han S, Moskewicz M W, et al. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size[EB/OL]. (2016-11-04)[2021-02-10]. https://arxiv.org/abs/1602.07360.
Howard A G, Zhu M L, Chen B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[EB/OL]. (2017-05-17)[2021-02-10]. https://arxiv.org/abs/1704.04861.
Sandler M, Howard A, Zhu M L, et al. MobileNetV2: Inverted residuals and linear bottlenecks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018: 4510-4520.
Zhang X Y, Zhou X Y, Lin M X, et al. ShuffleNet: An extremely efficient convolutional neural network for mobile devices[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018: 6848-6856.
Ma N N, Zhang X Y, Zheng H T, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design[C]//Proceedings of the European Conference on Computer Vision (ECCV), 2018: 116-131.
Wen W, Wu C, Wang Y, et al. Learning structured sparsity in deep neural networks[C]//Advances in Neural Information Processing Systems (NIPS), 2016: 2074-2082.
Li C, Shi C J. Constrained optimization based low-rank approximation of deep neural networks[C]//Proceedings of the European Conference on Computer Vision (ECCV), 2018: 732-747.
Rastegari M, Ordonez V, Redmon J, et al. XNOR-Net: ImageNet classification using binary convolutional neural networks[C]//European Conference on Computer Vision (ECCV), 2016: 525-542.
Hubara I, Courbariaux M, Soudry D, et al. Quantized neural networks: Training neural networks with low precision weights and activations[J]. The Journal of Machine Learning Research, 2017, 18(1): 6869-6898.
Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2017: 2980-2988.