Abstract
Because of long imaging distances and complex backgrounds, it is hard for infrared detection and tracking systems to find and locate dim, small targets in time. The proposed method, ACE-STDN, tackles this task and improves detection accuracy. First, an adaptive contrast enhancement subnetwork preprocesses the input infrared image, which benefits low-contrast dim targets. Next, a detection subnetwork with a hybrid backbone takes advantage of both convolution and self-attention mechanisms. In addition, the regression loss is based on a 2D Gaussian distribution representation of bounding boxes instead of the Intersection over Union (IoU) measurement. To verify the effectiveness and efficiency of our method, we conduct extensive experiments on two public infrared small target datasets. The results show that the model trained with our method significantly improves detection accuracy over traditional and data-driven algorithms, with the average precision reaching 93.76%. ACE-STDN also achieves strong detection performance on a multiclass object dataset and a general small object dataset, which verifies its effectiveness and robustness.
Single-frame infrared small target detection (SIRST) aims to locate small targets in complex backgrounds using differences in infrared radiation, and it is a research hot spot in many applications. Infrared acquisition technology surpasses visible-light and radar detection systems in several respects, namely the strong concealment of passive detection, a long imaging range that is robust to weather, and high sensitivity to the plumes of boosting missiles. It is therefore widely used in the military field, for example in long-range precision strikes, aerospace defense confrontation, battlefield intelligence, and reconnaissance. It has also achieved remarkable results in remote sensing, medical imaging, and industrial flaw detection.
The bottleneck of infrared acquisition systems and their applications lies in the capability to detect small, dim targets. This is very challenging because an infrared small-dim target occupies fewer than 9 × 9 pixels of a 256 × 256 image, and the images have low contrast and a low signal-to-noise ratio (SNR). As a result, the performance and efficiency of small-object detection are far from satisfactory. The technical roadblocks are as follows. (1) Targets lack structural features such as fine texture and shape. (2) Small objects have characteristics similar to unpredictable background clutter. (3) Infrared imagery is frequently affected by interference such as smog and waves. (4) There are few public infrared small target datasets.
Up to now, the research in the single-frame detection domain can be categorized into four groups. Filter-based algorithms
Small object detection has been a hot topic in computer vision for years. Multiple methods are proposed to solve this difficult but important task. The first kind of method goes deep into multi-scale feature learning. Feature pyramid network (FPN)
In recent years, several learning-based methods have been proposed for infrared small target detection. MDvsFA-cGAN
This work is motivated by the following observation from SIRST tasks. Because the targets occupy very few pixels, high-quality detection requires an efficient method to extract features and make full use of the information in the infrared image. However, a single detection network may not achieve such performance, especially when the small target is also dim. Inspired by traditional background-suppression methods, we find that a better approach may be to add an auxiliary preprocessing module that reduces noise and improves contrast.
Following the above idea, we propose an end-to-end deep learning framework to improve the performance of SIRST. In this framework, the pipeline is divided into two phases, i.e., enhancing the contrast and detecting small targets, and a deep neural network is constructed for each phase. The contrast enhancement subnetwork adaptively improves the SNR and the relative local thermal contrast, and its output is passed to the small target detection subnetwork. Existing CNN-based detection networks extract features by convolution within a small neighborhood, but the limited receptive field makes it hard to capture global dependency. Another framework, the Transformer, has become popular and dominates various vision tasks because of its capacity to capture long-range dependency and learn global contextual information via self-attention. However, it also incurs costly computation, which means a pure transformer structure is not a wise choice. In addition, these state-of-the-art networks are designed for generic image datasets. Directly using them for infrared small target detection can fail catastrophically because of the large difference in data distribution. It is therefore important to redesign the structure for infrared small target detection tasks. To make the best use of these two dominant frameworks and alleviate their respective deficiencies, we design a hybrid detection subnetwork that integrates the self-attention mechanism into deep layers and applies a novel transformer-styled convolution block. In this way, the network can capture long-range dependency and reduce local redundancy at the same time.
The contributions of this paper can be summarized as follows. First, we propose a novel framework for infrared small object detection that uses a two-stage learning paradigm. Compared with existing learning-based methods that use a single network for detection, our approach favors dim targets. Second, an adaptive contrast enhancement subnetwork is proposed to preprocess the input infrared image by suppressing the complex background and highlighting the target. Third, a hybrid backbone is designed in the small target detection subnetwork, which helps to find and locate small targets under the difficult conditions of long distances and complex backgrounds. This backbone also demonstrates the feasibility and superiority of mixing self-attention and convolution. Last, we use a loss function that measures the similarity between bounding boxes of tiny objects by the distance between their corresponding Gaussian distributions instead of IoU-based measurements.
In this section, we introduce the proposed infrared dim-small target detection pipeline in detail. Following the workflow, the contrast enhancement subnetwork first adjusts the thermal contrast adaptively to suppress the complex background and highlight the target. Then the hybrid backbone extracts features, combining the advantages of both convolution and self-attention mechanisms. Finally, the small-target-oriented detector predicts infrared dim-small targets with two detection heads based on feature maps from different layers. In addition, we apply a novel regression loss to improve accuracy and speed up convergence.

Fig. 1 The training pipeline of the proposed ACE-STDN framework. Our method consists of two subnetworks to preprocess the infrared image and detect small targets respectively. The contrast enhancement subnetwork aids the small target detection subnetwork to achieve better performance, especially for dim targets.
The image captured by an infrared imaging system usually has a low signal-to-noise ratio and lacks relative local thermal contrast, so some small targets have indistinct characteristics. These infrared dim-small targets make object detection difficult. To tackle this challenge, we design an adaptive contrast enhancement subnetwork (ACESN) to preprocess the input infrared image before detection. Given an infrared image $I$, the preprocessing procedure can be modeled as:
$$\hat{I} = F(I; \theta), \tag{1}$$
where $\hat{I}$ represents the enhanced infrared image. $F$ is the enhancement network with trainable parameter $\theta$, which is illustrated in

Fig. 2 The adaptive contrast enhancement subnetwork for infrared images. This network consists of three main modules, where gray arrows denote convolution layers, and the green ones are deconvolution layers
The ACESN can be divided into three modules. First of all, the multi-level feature extraction module (MLFEM) is a simple 4-layer CNN, while in each convolution layer, the kernel is in size and 1 in stride. Besides, it applies ReLU as the activation function. The input of MLFEM is the low-contrast infrared image, and each feature map is the input to its corresponding feature enhancement sub-module as well as the input of the next layer. Secondly, the branch-independent enhancement module (BIEM) is composed of 4 feature enhancement sub-modules. The output of each branch is an enhanced image . Each sub-module has an identical symmetric architecture, operating downsampling and upsampling. Except for the kernel size, these layers have the same settings in MLFEM, stride 1, and ReLU nonlinearity. The last one is the fusion module (FM), which concatenates the 4 output images from BIEM to produce the final enhanced image using a convolution kernel. This merging equals a weighted sum with learnable weights.
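The statement that the fusion step equals a weighted sum can be illustrated with a minimal pure-Python sketch. It assumes the fusion kernel is 1 × 1 (the kernel size is not recoverable from the text above) and uses hypothetical 2 × 2 arrays in place of the four BIEM outputs. The name `fuse_1x1`, the weights, and the bias are illustrative choices, not the authors' implementation.

```python
def fuse_1x1(branches, weights, bias=0.0):
    """Fuse equally sized single-channel images with a 1x1 convolution.

    A 1x1 convolution over C stacked channels computes, at every pixel,
    sum_c weights[c] * branches[c][i][j] + bias, i.e. a weighted sum.
    """
    if len(branches) != len(weights):
        raise ValueError("one weight per branch is required")
    height, width = len(branches[0]), len(branches[0][0])
    fused = [[bias] * width for _ in range(height)]
    for image, weight in zip(branches, weights):
        for i in range(height):
            for j in range(width):
                fused[i][j] += weight * image[i][j]
    return fused


# Four hypothetical 2x2 branch outputs (stand-ins for the BIEM outputs).
branches = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[4.0, 0.0], [0.0, 4.0]],
    [[2.0, 2.0], [0.0, 0.0]],
]
weights = [0.1, 0.2, 0.3, 0.4]  # assumed values; in a network these are learned
fused = fuse_1x1(branches, weights)
```

In a trained network the four weights would be updated by backpropagation, so the fusion can emphasize whichever branch enhances the target best.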
The feature extraction network, known as the backbone, is the bedrock of detection tasks. It conducts convolution on input images and provides concise semantic information for the subsequent detectors. However, in most classic networks, the convolution block has a relatively small receptive field, which limits detection performance. Inspired by the effectiveness of the vision transformer, we propose a novel hybrid backbone (HB) with transformer-styled convolution (TSConv) blocks and a transformer encoder block.
Instead of simply connecting transformer encoders with convolution blocks, we design the TSConv block to unify them. Following the design of ConvNeXt

Fig. 3 The structure of the transformer encoder block and TSConv block.
We apply a transformer encoder block to capture long-range dependency and learn global contextual information. As shown in
The self-attention mechanism is inefficient at encoding local features in the shallow layers: it simply captures detailed visual features, resembling the feature extraction result of convolution. Moreover, self-attention applied to high-resolution shallow feature maps brings a large and unnecessary computation burden. In contrast, the convolution operation focuses on local dependency within a small neighborhood. It can obtain similar features in the shallow layers while reducing local redundancy and avoiding unnecessary computation. Therefore, as shown in

Fig. 4 The architecture of the detection subnetwork
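The computational argument for keeping self-attention out of the shallow layers can be made concrete with a rough count. The sketch below assumes a 256 × 256 input and hypothetical backbone strides of 4, 8, 16, and 32, and counts how many pairwise scores a global self-attention layer would compute on each feature map. It ignores channels, heads, and any windowing, so it indicates scaling only.

```python
def attention_pairs(image_size, stride):
    """Return (tokens, pairwise scores) for global self-attention on one feature map."""
    side = image_size // stride      # feature-map side length
    tokens = side * side             # one token per spatial position
    return tokens, tokens * tokens   # every token attends to every token


# Assumed strides; the exact strides of the hybrid backbone are not given here.
for stride in (4, 8, 16, 32):
    tokens, pairs = attention_pairs(256, stride)
    print(f"stride {stride:>2}: {tokens:>5} tokens, {pairs:>9} attention scores")
```

Under these assumptions the shallowest map needs 4096 times as many attention scores as the deepest one, which is consistent with placing self-attention only in deep layers.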
According to the definition by SPIE, an infrared dim-small target occupies fewer than 9 × 9 pixels of a 256 × 256 image. After multiple feature extraction layers, the feature map fed into the prediction head lacks sufficient information about small objects. Therefore, we need to redesign the prediction head using low-level, high-resolution feature maps.
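A short worked calculation shows why deep feature maps carry little information about such targets. The sketch assumes the usual YOLOv5 head strides of 8, 16, and 32, plus a hypothetical stride-4 map standing in for a higher-resolution head; the function name is illustrative.

```python
def footprint_on_feature_map(target_px, stride):
    """Side length (in feature-map cells) covered by a target of target_px pixels."""
    return target_px / stride


# A target at the 9-pixel upper bound, on feature maps of decreasing resolution.
for stride in (4, 8, 16, 32):
    side = footprint_on_feature_map(9, stride)
    print(f"stride {stride:>2}: about {side:.2f} x {side:.2f} cells")
```

At strides 16 and 32 the largest admissible target covers less than one cell, so its evidence is mixed with background inside a single feature vector.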
First of all, we use a weighted bi-directional feature pyramid network (BiFPN)
Secondly, in order to improve the detection performance for infrared dim-small targets, we apply the coordinate attention mechanism to adaptively enhance multi-level features, which embeds smaller-scale details into high-level coarse feature maps. The coordinate attention embeds positional information into channel attention, aggregating features along the two spatial directions. One spatial direction captures long-range dependencies and the other preserves precise position information. In this way, the resulting feature maps can be applied to the input feature map to augment the representations of the infrared dim-small targets.
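The direction-wise aggregation described above can be sketched in plain Python for a single-channel map. This is a deliberately simplified illustration: it keeps only the two directional average poolings and a sigmoid gate, and omits the learned 1 × 1 convolutions, normalization, and channel dimension of the published coordinate attention block. All names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def directional_pool(feature):
    """Average an H x W map along each spatial direction.

    Returns one descriptor per row (pooled over width) and one per
    column (pooled over height), so position along each axis is kept.
    """
    height, width = len(feature), len(feature[0])
    row_desc = [sum(row) / width for row in feature]
    col_desc = [sum(feature[i][j] for i in range(height)) / height
                for j in range(width)]
    return row_desc, col_desc


def coordinate_gate(feature):
    """Re-weight each pixel by the gates of its row and its column."""
    row_desc, col_desc = directional_pool(feature)
    row_gate = [sigmoid(v) for v in row_desc]
    col_gate = [sigmoid(v) for v in col_desc]
    return [
        [value * row_gate[i] * col_gate[j] for j, value in enumerate(row)]
        for i, row in enumerate(feature)
    ]


# A 3x3 map with a single bright pixel, mimicking a tiny target.
feature = [[0.0, 0.0, 0.0], [0.0, 6.0, 0.0], [0.0, 0.0, 0.0]]
gated = coordinate_gate(feature)
```

Because the row and column descriptors both peak at the target's position, the gate is largest exactly where the two directions intersect.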
At the end of the detection pipeline, we remove the detection head for large-scale and medium-scale objects in YOLOv5. At the same time, we conduct one more upsampling and add a new prediction head for tiny infrared targets, illustrated in

(a)

(b)
Fig. 5 Two different frameworks: (a) the framework of YOLOv5; (b) our improved framework
Because of interference from atmospheric scattering, atmospheric refraction, optical compensation, etc., the signals of small objects received by the infrared imaging system are extremely weak. As illustrated in
$$s(x, y) = \gamma \exp\left\{-\frac{1}{2}\left[\left(\frac{x}{\sigma_x}\right)^2 + \left(\frac{y}{\sigma_y}\right)^2\right]\right\}, \tag{2}$$

(a)

(b)
Fig. 6 Infrared small-dim targets in the real world and their local intensity distribution: (a) simple background; (b) complex background
where $\sigma_x$ and $\sigma_y$ are the horizontal and vertical scale parameters, respectively, $\gamma$ is the gray value of the object, and $s(x, y)$ denotes the gray-level spatial distribution function of this dim-small target.
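To make the role of the scale parameters and the gray value tangible, the following sketch renders a synthetic target patch. It assumes the anisotropic 2D Gaussian intensity model commonly used for infrared small targets, with coordinates measured from the patch center; the function name and the parameter values are hypothetical.

```python
import math


def gaussian_target(size, gamma, sigma_x, sigma_y):
    """Return a size x size patch holding a Gaussian-shaped target at its center.

    gamma is the peak gray value; sigma_x and sigma_y are the horizontal
    and vertical scale parameters (assumed model, not measured data).
    """
    center = (size - 1) / 2.0
    patch = []
    for row in range(size):
        y = row - center
        patch.append([
            gamma * math.exp(-0.5 * (((col - center) / sigma_x) ** 2
                                     + (y / sigma_y) ** 2))
            for col in range(size)
        ])
    return patch


# A 9 x 9 patch, the upper bound on target size quoted in the text.
patch = gaussian_target(size=9, gamma=200.0, sigma_x=1.5, sigma_y=1.0)
```

With `sigma_x` larger than `sigma_y`, the intensity decays more slowly along the horizontal axis, and the corners of the patch are already close to zero, which is why the pixels near a bounding-box boundary carry mostly background.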
Inspired by the formula above, we investigate the infrared dim-small target dataset and find that these small instances do not fit a rectangle perfectly. A bounding box contains both target and background information, and the background information is distributed near the boundaries. Therefore, we can use a 2D Gaussian distribution to describe the bounding box, modeling the importance of pixels inside the box by weights. Take a bounding box $R = (c_x, c_y, w, h)$ as an example, where $(c_x, c_y)$, $w$, and $h$ denote the center coordinates, width, and height, respectively. Its inscribed ellipse can be represented as:
$$\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} = 1, \tag{3}$$
where $(\mu_x, \mu_y)$ is the center coordinates of the ellipse, and $\sigma_x$ and $\sigma_y$ are the lengths of the semi-axes along the $x$ and $y$ axes. Accordingly, $\mu_x = c_x$, $\mu_y = c_y$, $\sigma_x = w/2$, $\sigma_y = h/2$. This ellipse will be a density contour of the 2D Gaussian distribution. Therefore, the mentioned bounding box can be modeled as:
$$\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad \boldsymbol{\mu} = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix}, \tag{4}$$
Therefore, we can measure the similarity between two bounding boxes by the distribution distances of their corresponding Gaussian distributions, replacing the common measurement, IoU. Both measurements are illustrated in

Fig. 7 The schematic diagram of measurements using a discrete bounding box and 2D Gaussian Distribution
Following the Ref. [
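$$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\| \left[c_{x,a},\, c_{y,a},\, \frac{w_a}{2},\, \frac{h_a}{2}\right]^{\mathrm{T}} - \left[c_{x,b},\, c_{y,b},\, \frac{w_b}{2},\, \frac{h_b}{2}\right]^{\mathrm{T}} \right\|_2^2, \tag{5}$$
where $\mathcal{N}_a$ and $\mathcal{N}_b$ are the Gaussian distributions modeled from the bounding boxes $A = (c_{x,a}, c_{y,a}, w_a, h_a)$ and $B = (c_{x,b}, c_{y,b}, w_b, h_b)$. Next, the distance is changed into its exponential form, in which the value is constrained within $(0, 1]$:
$$NWD(\mathcal{N}_a, \mathcal{N}_b) = \exp\left(-\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C}\right), \tag{6}$$
where $C$ is a normalization constant. Therefore, the regression loss function can be redesigned as:
$$\mathcal{L}_{NWD} = 1 - NWD(\mathcal{N}_p, \mathcal{N}_g), \tag{7}$$
where $\mathcal{N}_p$ and $\mathcal{N}_g$ are the Gaussian distributions of the predicted box and the ground-truth box, respectively.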
The main advantage of the Normalized Wasserstein Distance (NWD) is that it provides a gradient for optimizing the network in two special cases. One is when there is no overlap between the predicted bounding box and the ground truth. The other is when the predicted box completely contains the ground truth box, or vice versa. Therefore, the NWD loss is suitable for our infrared dim-small target detector.
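The two special cases can be checked numerically with a small pure-Python sketch. It assumes boxes given as (cx, cy, w, h), the closed-form Wasserstein distance between the axis-aligned Gaussians of two boxes, and a hypothetical normalization constant `c = 8.0`; the function names and box values are illustrative and not taken from the authors' code.

```python
import math


def iou(box_a, box_b):
    """Intersection over Union of two (cx, cy, w, h) boxes."""
    ax1, ax2 = box_a[0] - box_a[2] / 2, box_a[0] + box_a[2] / 2
    ay1, ay2 = box_a[1] - box_a[3] / 2, box_a[1] + box_a[3] / 2
    bx1, bx2 = box_b[0] - box_b[2] / 2, box_b[0] + box_b[2] / 2
    by1, by2 = box_b[1] - box_b[3] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union


def wasserstein_sq(box_a, box_b):
    """Squared 2-Wasserstein distance between the Gaussians of two boxes."""
    return ((box_a[0] - box_b[0]) ** 2 + (box_a[1] - box_b[1]) ** 2
            + ((box_a[2] - box_b[2]) / 2) ** 2
            + ((box_a[3] - box_b[3]) / 2) ** 2)


def nwd(box_a, box_b, c):
    """Normalized Wasserstein Distance; c is a dataset-dependent constant."""
    return math.exp(-math.sqrt(wasserstein_sq(box_a, box_b)) / c)


def nwd_loss(pred, target, c):
    return 1.0 - nwd(pred, target, c)


gt = (10.0, 10.0, 4.0, 4.0)
near, far = (16.0, 10.0, 4.0, 4.0), (30.0, 10.0, 4.0, 4.0)  # no overlap with gt
inner_centered = (10.0, 10.0, 2.0, 2.0)                      # inside gt
inner_shifted = (11.0, 10.0, 2.0, 2.0)                       # inside gt, off-center
```

In this example IoU is zero for both non-overlapping predictions and identical for both contained predictions, so it cannot rank them, whereas the NWD loss is smaller for `near` than for `far` and smaller for `inner_centered` than for `inner_shifted`.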
In this section, we evaluate the effectiveness of our method in the scenarios under complex backgrounds. First of all, we describe the experimental setting, which includes the datasets, evaluation metrics, network implementation details and comparison methods. Next, we conduct ablation studies to examine the effectiveness and practicability of each module in our proposed framework. Then, the visual and numerical comparison between ACE-STDN and state-of-the-art methods further demonstrates that ACE-STDN can accurately detect infrared small targets. Finally, we show detection results on visible images to verify the generalization ability of our ACE-STDN.
Dataset. We use the public dataset
Evaluation Metrics. The evaluation metrics used in this paper are Precision, Recall, F1-score, and Average Precision (AP). Precision represents the credibility of detection results, while Recall reflects whether the detection algorithm locates all the infrared small targets. F1-score is used to measure the relationship between them.
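$$Precision = \frac{TP}{TP + FP}, \tag{8}$$
$$Recall = \frac{TP}{TP + FN}, \tag{9}$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \tag{10}$$
TP (true positive) means that a real target is correctly detected as a target. FP (false positive) means that a non-target is wrongly detected as a target. FN (false negative) means that a real target is missed, i.e., predicted as background.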
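The three metrics can be computed directly from detection counts. The sketch below uses the standard definitions of precision, recall, and F1-score; the counts in the example are hypothetical, and the function names are illustrative.

```python
def precision(tp, fp):
    """Fraction of detections that are real targets."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of real targets that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 80 correct detections, 20 false alarms, 10 misses.
p = precision(tp=80, fp=20)
r = recall(tp=80, fn=10)
f1 = f1_score(p, r)

# Consistency check against the precision and recall reported for ACE-STDN.
f1_reported_inputs = f1_score(0.8537, 0.8362)
```

Feeding the precision and recall reported for ACE-STDN (0.8537 and 0.8362) into `f1_score` gives a value that agrees with the reported F1-score of 0.8448 up to rounding of the inputs.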
Implementation Details. We adopt the training protocol of YOLOv5 for the proposed ACE-STDN algorithm. During training, data augmentation methods such as image flipping, mosaic, and random perspective are applied to expand the training dataset. ACE-STDN is trained with the Adam optimizer for 200 epochs. The start learning rate is and the batch size is 8. ACE-STDN predicts bounding boxes at two scales (tiny and small), with three anchors at each scale. We implement ACE-STDN in PyTorch 1.11.0 and run it on an NVIDIA TITAN V for training and testing.
Comparison Methods. This paper chooses seven detection methods based on deep learning to compare with ACE-STDN in two aspects. The first group aims to prove the availability and superiority of the combination of self-attention mechanism and convolution operation. We compare our method with a pure CNN-based network – YOLOv5, a pure transformer framework – ViT
To examine the effectiveness of each module in our proposed framework, we conduct ablation experiments on different settings, including hybrid backbone, adaptive contrast enhancement module, and 2D Gaussian Distribution Regression Loss. As shown in
Model | HB | ACESN | NWD | AP |
---|---|---|---|---|
A | √ | √ | 88.25% | |
B | √ | √ | 93.02% | |
C | √ | √ | 92.48% | |
D | √ | √ | √ | 93.76% |
In this section, we compare ACE-STDN with state-of-the-art methods through visual and numerical evaluation to further verify the effectiveness of our method.
The effectiveness of hybrid usage of self-attention and convolution. As shown in

Fig. 8 Illustration of detection results on ATDT
According to
Model | Average Precision | Inference Time |
---|---|---|
YOLOv5 | 87.27% | 5.3 ms |
ViT | 59.63% | 5.5 ms |
TPH-YOLOv5 | 74.95% | 8.7 ms |
ACE-STDN | 93.76% | 4.8 ms |
Comparison with specialized methods for infrared small target detection. As shown in

Fig. 9 Illustration of detection results on SIRST
In order to illustrate the effectiveness of ACE-STDN straightforwardly, we use numerical methods for quantitative evaluation. From
Method | Precision | Recall | F1-score |
---|---|---|---|
TopHat | 0.6873 | 0.0818 | 0.1461 |
LCM | 0.6201 | 0.1443 | 0.2341 |
NRAM | 0.7549 | 0.1544 | 0.2563 |
IPI | 0.7640 | 0.1813 | 0.2931 |
ACE-STDN | 0.8537 | 0.8362 | 0.8448 |
In addition to evaluating ACE-STDN on infrared datasets containing targets of just one type, the model is also trained on the multiclass dataset to verify its outstanding detection ability. The results are shown in Fig.

Fig. 10 Illustration of detection results on a multiclass infrared dataset

Fig. 11 Illustration of detection results on a multiclass RGB dataset
We propose ACE-STDN, a new infrared target detection model with adaptive contrast enhancement. Benefiting from the preprocessing subnetwork, it works well in detecting dim, small targets in infrared images. In addition, the hybrid backbone is designed to improve feature representation, which shows that a proper mix of self-attention and convolution is superior to either mechanism alone. The 2D Gaussian distribution-based regression loss is suitable for infrared small target detection because it accounts for the relative position between two bounding boxes. Evaluations on single-class and multiclass datasets, and on infrared and visible datasets, demonstrate the outstanding performance of ACE-STDN, as it achieves a better balance between precision and recall. In summary, ACE-STDN provides a new choice for small-dim target detection in IP systems. In the future, we plan to speed up the network for real-time detection tasks.
References
Tom V T, Peli T, Leung M, et al. Morphology-based algorithm for point target detection in infrared backgrounds[C]. Signal & Data Processing of Small Targets. International Society for Optics and Photonics, 1993. 10.1117/12.157758
Wang X, Peng Z, Zhang P, et al. Infrared Small Target Detection via Nonnegativity-Constrained Variational Mode Decomposition[J]. IEEE Geoscience and Remote Sensing Letters, 2017, 14(10): 1700–1704. 10.1109/lgrs.2017.2729512
Chen, Philip C L, Li H, et al. A Local Contrast Method for Small Infrared Target Detection[J]. IEEE Transactions on Geoscience & Remote Sensing, 2014, 52(1): 574–581. 10.1109/tgrs.2013.2242477
Bai X, Bi Y. Derivative Entropy-Based Contrast Measure for Infrared Small-Target Detection[J]. IEEE Transactions on Geoscience and Remote Sensing, 2018, PP (99): 1-15. 10.1109/tgrs.2017.2781143
Gao C, Meng D, Yang Y, et al. Infrared Patch-Image Model for Small Target Detection in a Single Image[J]. IEEE Transactions on Image Processing, 2013, 22(12): 4996–5009. 10.1109/tip.2013.2281420
Zhang L, Peng L, Zhang T, et al. Infrared Small Target Detection via Non-Convex Rank Approximation Minimization Joint l2,1 Norm[J]. Remote Sensing, 2018, 10(11): 1821. 10.3390/rs10111821
Lin T Y, Dollar P, Girshick R, et al. Feature Pyramid Networks for Object Detection[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 936-944. 10.1109/cvpr.2017.106
Li Y, Chen Y, Wang N, et al. Scale-Aware Trident Networks for Object Detection[C]. IEEE/CVF International Conference on Computer Vision (ICCV), 2019: 6053-6062. 10.1109/iccv.2019.00615
Bai Y, Zhang Y, Ding M, et al. SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network[C]. European Conference on Computer Vision (ECCV), 2018. 10.1007/978-3-030-01261-8_13
Li J, Liang X, Wei Y, et al. Perceptual Generative Adversarial Networks for Small Object Detection[C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 1951-1959. 10.1109/cvpr.2017.211
Noh J, Bae W, Lee W, et al. Better to Follow, Follow to Be Better: Towards Precise Supervision of Feature Super-Resolution for Small Object Detection[C]. IEEE/CVF International Conference on Computer Vision (ICCV), 2019: 9724-9733. 10.1109/iccv.2019.00982
Bell S, Zitnick C L, Bala K, et al. Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 2874-2883. 10.1109/cvpr.2016.314
Hu H, Gu J, Zhang Z, et al. Relation Networks for Object Detection[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 3588-3597. 10.1109/cvpr.2018.00378
Xu T, Du D K, He Z, et al. PyramidBox: A Context-assisted Single Shot Face Detector[C]. European Conference on Computer Vision (ECCV), 2018. 10.1007/978-3-030-01240-3_49
Zhang S, Zhu X, Lei Z, et al.
Zhang S, Zhu X, Lei Z, et al. FaceBoxes: A CPU real-time face detector with high accuracy[C]. IEEE International Joint Conference on Biometrics (IJCB), 2017: 1-9. 10.1109/btas.2017.8272675
Yu J, Jiang Y, Wang Z, et al. UnitBox: An Advanced Object Detection Network[C]. ACM Proceedings of the 24th ACM international conference on Multimedia, 2016: 516-520. 10.1145/2964284.2967274
Zheng Z, Wang P, Liu W, et al. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression[C]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019. 10.1609/aaai.v34i07.6999
Xu C, Wang J, Yang W, et al. Dot Distance for Tiny Object Detection in Aerial Images[C]. IEEE/CVF Computer Vision and Pattern Recognition Workshops (CVPRW), 2021: 1192-1201. 10.1109/cvprw53098.2021.00130
Xu C, Wang J, Yang W, et al. Detecting Tiny Objects in Aerial Images: A Normalized Wasserstein Distance and a New Benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2021, 190: 79-93. 10.1016/j.isprsjprs.2022.06.002
Wang H, Zhou L, Wang L. Miss Detection vs. False Alarm: Adversarial Learning for Small Object Segmentation in Infrared Images[C]. IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 10.1109/iccv.2019.00860
Li B, Xiao C, Wang L, et al. Dense Nested Attention Network for Infrared Small Target Detection[J]. IEEE Transactions on Image Processing, 2022. 10.1109/tip.2022.3199107
Dai Y, Wu Y, Zhou F, et al. Asymmetric Contextual Modulation for Infrared Small Target Detection[C]. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021: 950-959. 10.1109/wacv48630.2021.00099
Tong X, Sun B, Wei J. EAAU-Net: Enhanced Asymmetric Attention U-Net for Infrared Small Target Detection[J]. Remote Sensing, 2021, 13. 10.3390/rs13163200
Liu F, Gao C, Chen F, et al. Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds[J]. arXiv e-prints, 2021. 10.48550/arXiv.2109.14379
Liu Z, Mao H, Wu C Y, et al. A ConvNet for the 2020s[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022: 11966-11976. 10.1109/cvpr52688.2022.01167
Tan M, Pang R, Le Q V. EfficientDet: Scalable and Efficient Object Detection[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 10778-10787. 10.1109/cvpr42600.2020.01079
Sungho Kim, Joo-Hyoung Lee. Robust Scale Invariant Target Detection Using the Scale-Space Theory and Optimization for IRST[J]. Pattern Analysis & Applications, 2011, 14: 57-66. 10.1007/s10044-010-0190-x
Hui B, Song Z, Fan H, et al. A Dataset for Infrared Image Dim-Small Aircraft Target Detection and Tracking under Ground / Air Background[DS/OL]. Science Data Bank, 2019.
Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[C]. International Conference on Learning Representations. 2021.
Zhu X, Lyu S, Wang X, et al. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios[C]. IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021: 2778-2788. 10.1109/iccvw54120.2021.00312