Abstract
Point target detection in Infrared Search and Track (IRST) is a challenging task due to the limited information a point target carries. Traditional methods based on hand-crafted features can hardly accomplish the detection intelligently. A novel deep spatial-temporal convolution neural network is proposed to suppress background and detect point targets. The proposed method is built on a fully convolutional network, so an input of arbitrary size can be fed into the network and a correspondingly-sized output is obtained. To meet the real-time requirement of practical applications, a factorization technique is adopted: the 3D convolution is decomposed into a 2D spatial convolution and a 1D temporal convolution, which leads to significantly less computation. A multi-weighted loss function is designed according to the relation between prediction error and detection performance for point targets. A number-balance weight and an intensity-balance weight are introduced to deal with the imbalanced sample distribution and the imbalanced error distribution. The experimental results show that the proposed method can effectively suppress background clutters and detect point targets with low runtime.
Infrared Search and Track (IRST) systems were developed to automatically search, capture and track small incoming targets in infrared sequences. They have been widely applied to many important fields, including unmanned aerial vehicle (UAV) defense, territory surveillance, space situation awareness (SSA), precise guidance, and so on.
Many methods have been presented in the past decades, such as mean subtraction filtering, morphology-based filtering, Max-Mean/Max-Median filtering, and matched filtering.
Generally, the assumptions adopted to support such hand-crafted filters include: (1) a point target can be modeled as a 2D Gaussian function; (2) the background is spatially correlated and varies slowly across frames.
In this paper, a deep spatial-temporal convolution neural network based on deep learning theory is proposed to detect point targets intelligently. To detect targets in images of any size, a fully convolutional architecture is adopted. Thus, the proposed network can be trained on small-size images and tested on large-size images without any modification. To meet the real-time requirement, the 3D convolution in the proposed method is factorized into a 2D spatial convolution and a 1D temporal convolution, so fewer parameters are needed and the computing burden is greatly decreased. Additionally, a specialized loss function is introduced to take the number imbalance and the error imbalance into account simultaneously. Simulation results demonstrate that the proposed approach can robustly and effectively suppress background clutters and detect weak point targets in infrared sequences.
The rest of the paper is organized as follows: Section 1 reviews the related work on point target detection and deep learning based methods. Section 2 presents the deep spatial-temporal convolution neural network; the network architecture, the factorized 3D convolution, the fully convolutional design and the novel loss function are introduced in detail. The overall performance of the proposed method and comparisons with other methods are presented in Section 3. Finally, conclusions are drawn in Section 4.
Benefiting from advances in computing power, deep learning techniques have recently been used in object detection, visual recognition, and time series classification. In particular, the deep convolutional neural network (CNN) achieved impressive results at the 2015 ImageNet contest. Since then, many deep learning based methods have been proposed in the field of target location and identification, such as R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, and YOLO.
Although many deep learning based methods have been studied in recent years, the targets involved in the above methods are large targets, called area targets, such as human faces, vehicles, and animals. The features of area targets are distinctly different from those of point targets in IRST. In fact, an area target may extend over hundreds of pixels with abundant texture, geometry, and color information, which provides plenty of details for processing. However, a point target is hardly distinguishable in shape, size, and color, because it is generated by a point source at long distance, and only gray information is obtained by the infrared sensor. As a result, the existing deep learning based detection methods for area targets are not suitable for point target detection in IRST.
Some methods inspired by deep neural networks to detect point targets have been proposed recently. In brief, these methods can be divided into three categories.
(1) Some methods simply convert the detection problem into a pattern recognition problem. Liu et al. used traditional 2D CNN networks to judge whether infrared small targets exist in infrared patches.
(2) Many researchers try to combine traditional pre-detection with deep learning based recognition. In May 2018, DDP-CNN was proposed by Ryu and Kim based on data-driven proposals and deep learning-based classification.
(3) 2D CNN networks based on spatial features are used by some researchers to detect small targets. Lin's method is designed to detect infrared targets in oversampled images.
This work focuses on detecting point targets with high performance and low runtime using a deep spatial-temporal convolution network.
For several reasons, existing CNN networks cannot be directly used to detect point targets. Firstly, traditional CNN networks are used to detect large area targets, and accurate spatial coordinates are thrown away by pooling layers and fully connected layers. However, the targets in IRST are point targets, which must be located precisely at the pixel or sub-pixel level. Secondly, for point target detection, each pixel may contain a small target, so the infrared image should be processed pixel by pixel. Thus, a batch processing method is desired. Thirdly, the input of traditional networks is usually of fixed size because of the fully connected layers, which limits the flexibility in practical applications. Thus, a special network must be designed based on the characteristics of point targets.
Fortunately, we have found some characteristics of point targets that make it possible to overcome the above shortcomings. For example, the main features of a point target can be obtained by statistical analysis of its small neighboring region. Thus, a smaller receptive field (RF) is enough for point targets in comparison with area targets. As a result, on the one hand, fewer stacked layers are needed, which reduces the complexity and computation. On the other hand, pooling layers and fully connected layers, which are used to enlarge the receptive field by image compression and feature integration, can be discarded.
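As a back-of-the-envelope check of this receptive-field argument, the sketch below computes the receptive field of a stack of stride-1 convolutional layers; the 3×3 kernel size is an illustrative assumption.

```python
def receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field of stacked stride-1 convolutions: rf = 1 + n * (k - 1)."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

# Four stacked 3x3 layers already see a 9x9 neighborhood, which matches the
# small neighboring region that characterizes a point target.
print(receptive_field(4))  # -> 9
```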
For the above reasons, the proposed network is hierarchically constructed by stacking different convolutional layers. The network architecture is shown in Fig. 1.

Fig. 1 The proposed network architecture.
The proposed point target detection network takes a video clip as input, and produces a residual image with the same size as the input. The residual image represents the estimated point target intensity after background suppression. The final target indices can then be obtained easily by threshold segmentation.
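As an illustration of this final step, here is a minimal sketch of threshold segmentation on the residual image; the mean-plus-k-sigma rule and the value of k are assumptions, since the text only states that a threshold is applied.

```python
import numpy as np

def segment_targets(residual: np.ndarray, k: float = 5.0):
    """Report pixels of the background-suppressed residual image whose
    value exceeds mean + k * std (an assumed thresholding rule)."""
    thresh = residual.mean() + k * residual.std()
    rows, cols = np.nonzero(residual > thresh)
    return list(zip(rows.tolist(), cols.tolist()))
```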
As illustrated in Fig. 1, this method consists of three parts. Firstly, the bottom of the network (C-1 to C-4) is a stack of 3D convolutional layers, which focus on low-order spatiotemporal features. To improve their efficiency, the factorized 3D convolution is adopted, as introduced in Section 2.2. In the second part, the 3D spatiotemporal feature maps generated from the video clip are compressed into one 2D hybrid feature map. This operation is carried out by a convolution over the whole video clip in the time dimension. In the third part, the high-order hybrid features are intensively learned by more convolution kernels. Finally, feature fusion across different channels is achieved by convolution in the channel dimension, which ensures that the output size is the same as that of the input. This convolution is equivalent to a cross-channel parametric pooling layer, and allows complex interactions of cross-channel information.
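To make the three-part structure concrete, the following PyTorch sketch mirrors the description above; the channel widths, the 3×3 spatial kernels, and the exact layer counts are illustrative assumptions rather than the paper's published settings.

```python
import torch
import torch.nn as nn

class PointTargetNet(nn.Module):
    """Sketch of the three-part architecture (assumed hyper-parameters)."""
    def __init__(self, clip_len: int = 5, ch: int = 8):
        super().__init__()
        def fact3d(cin, cout):
            # factorized 3D convolution: 2D spatial pass, then 1D temporal pass
            return nn.Sequential(
                nn.Conv3d(cin, cout, (1, 3, 3), padding=(0, 1, 1)), nn.ReLU(True),
                nn.Conv3d(cout, cout, (3, 1, 1), padding=(1, 0, 0)), nn.ReLU(True))
        # part 1: stacked factorized 3D convolutions (C-1 to C-4)
        self.spatiotemporal = nn.Sequential(
            fact3d(1, ch), fact3d(ch, ch), fact3d(ch, ch), fact3d(ch, ch))
        # part 2: compress the whole clip along the time dimension
        self.temporal_pool = nn.Conv3d(ch, 2 * ch, (clip_len, 1, 1))
        # part 3: high-order hybrid features, then 1x1 cross-channel fusion
        self.hybrid = nn.Sequential(
            nn.Conv2d(2 * ch, 2 * ch, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(2 * ch, 1, 1))

    def forward(self, clip):              # clip: (B, 1, D, H, W)
        x = self.spatiotemporal(clip)     # (B, ch, D, H, W)
        x = torch.relu(self.temporal_pool(x)).squeeze(2)  # (B, 2ch, H, W)
        return self.hybrid(x)             # (B, 1, H, W) residual image
```

Because every layer is a stride-1 convolution without pooling, the residual map keeps the spatial size of the input clip.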
The process is modeled with a fully convolutional network. A convolutional layer is an architecture with shared parameters, so all pixels can be processed by the same operation. The feature maps, i.e., the input and output of each convolutional layer, can be modeled as tensors of size $D \times C \times H \times W$, where $D$, $C$, $H$ and $W$ are the depth, the number of channels, the height and the width, respectively. For the first convolutional layer, the input is the infrared video clip with size $D \times 1 \times H \times W$: the length of the video clip is $D$ ($D$ is set to 5 in this paper), and the number of channels is 1 (because the output of the infrared sensor is a gray image). The output feature map indicates a particular feature representation extracted over all locations of the input, which can be obtained by convolving the input feature map with a trainable kernel and adding a trainable bias parameter. In this work, the input feature map is denoted as $X$, and the weights and bias of the convolution kernel are represented by $W$ and $b$. Thus, the output feature map $Y$ can be computed by
$Y = W \ast_s X + b$ ,  (1)
where $\ast_s$ denotes the convolution operation with stride $s$ ($s$ is 1 in this work). The feature representation ability can be enhanced by a point-wise nonlinearity following each convolutional layer, and ReLU is adopted in this method. Unlike traditional methods, non-linear down-sampling (e.g., max pooling) is discarded, because pixel-wise prediction is very important for point target detection.
Point targets can be detected by 2D convolutional neural networks under smooth backgrounds. However, these methods do not provide robust detection in complex backgrounds, especially when the clutters are strong, as shown in Section 3.2. The reason is that spatial information alone is not enough to discriminate true targets from false ones. Thus, both spatial and temporal information must be fully utilized.
2D convolution is performed only spatially, and temporal information of the input is lost. 3D convolution is done spatiotemporally, and both spatial information and temporal information of the input are preserved. Thus, 3D convolution is well-suited for spatiotemporal feature learning, and it is adopted in the proposed method.
Traditional 3D convolution can be regarded as 2D spatial convolution and 1D temporal projection performed simultaneously. Thus, its computational cost is exceptionally high, often beyond the capability of common computing platforms. Although many studies have proven that deep 3D convolutional neural networks can learn spatiotemporal features at or even above human-level accuracy, the cost is prohibitive for deployment. To meet the real-time and limited-memory constraints of practical applications, factorized 3D convolution is adopted, which unravels the spatial and temporal convolutions apart.

Fig. 2 The sketches of 3D convolution and factorized 3D convolution: (a) 3D convolution; (b) factorized 3D convolution.
As shown in Fig. 2, the 3D convolution kernel is factorized into a 2D spatial kernel and a 1D temporal kernel, which can be formulated as
$W_{3D} = W_s \otimes W_t$ ,  (2)
where $\otimes$ represents the Kronecker product, $W_{3D}$ is the 3D convolution kernel, and $W_s$ and $W_t$ denote the 2D (spatial) convolution kernel and the 1D (temporal) convolution kernel, respectively.
The computation comparison of 3D convolution and factorized 3D convolution is listed in Table 1.
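A rough sanity check of this comparison counts the multiplications per output position, assuming C input channels, C output channels, and a k×k×k kernel (the concrete values below are illustrative).

```python
def mults_3d(c: int, k: int) -> int:
    return c * c * k ** 3        # full 3D convolution

def mults_factorized(c: int, k: int) -> int:
    return c * c * (k ** 2 + k)  # 2D spatial pass + 1D temporal pass

c, k = 8, 3
print(mults_3d(c, k), mults_factorized(c, k))  # 1728 vs 768, about 2.3x fewer
```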
Pooling layers and fully connected layers play important roles in traditional deep-learning methods for area target detection.
Typical pooling operations include average pooling and max pooling, which can be considered as non-linear down-sampling. For example, the size of the feature map is reduced to a quarter of the original after a pooling layer. Obviously, the compressed feature maps are coarse and of reduced resolution, and much detailed information is lost.
Furthermore, the fully connected layer brings multiple local features from different regions together, but it fixes the input size. It means that the size (height and width) of the training images must be the same as that of the test images.
To overcome these shortcomings, the proposed network consists only of convolutional layers, and is thus called a fully convolutional method. The 2D convolutional layers are used to extract spatial features, while the 1D convolutional layers are adopted for feature extraction in the time domain. Different convolutional layers are connected in series, which allows the features from both spatial convolution and temporal convolution to be deeply integrated. By choosing a stride of 1, the intensity generated by the point target is estimated pixel by pixel. Thus, pixel-wise processing is carried out by the proposed fully convolutional network.
Additionally, it has been demonstrated that convolution kernels with small receptive fields can learn complex features better in deeper architectures.
A multi-weighted norm loss function is proposed in this work, and it can be expressed as
$L = \dfrac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M}\alpha_{n,m}\,\beta_{n,m}\left(\hat{y}_{n,m}-y_{n,m}\right)^{2}$ ,  (3)
where $N$ is the number of training samples, $M$ is the number of pixels in a training sample, $y_{n,m}$ is the ground truth for the $m$-th pixel in the $n$-th training sample, and $\hat{y}_{n,m}$ is the output of the proposed network. $\alpha_{n,m}$ is the number-balance weighting parameter, and $\beta_{n,m}$ is the intensity-balance weighting parameter. The two weighting parameters are jointly used to trade off false alarms and missed detections. The loss function is minimized during training, which drives the predicted target intensity toward the ground truth.
The sample imbalance encountered in training for point target detection may cause extreme errors. The imbalance between target samples and background samples can overwhelm training and lead to a degenerate model, especially when background samples far outnumber target samples. In fact, the background samples generally belong to the majority class, while the target samples are in the minority. In practical applications, background images can be obtained easily and frequently, whereas true targets are relatively rare. If the sample imbalance is not addressed, the training is inefficient; in the extreme case, true targets may be completely ignored. In order to alleviate the bias caused by the imbalanced sample distribution, the number-balance weight $\alpha$ is assigned to each sample to weaken the relative impact of background samples while strengthening the impact of target samples. $\alpha$ can be calculated by
$\alpha_{n,m} = \dfrac{N_b}{N_b+N_t}$ , for target samples,  (4)
$\alpha_{n,m} = \dfrac{N_t}{N_b+N_t}$ , for background samples,  (5)
where $N_b$ and $N_t$ are the number of background samples and the number of target samples, respectively. The training set includes both the background samples and the target samples, i.e., $N = N_b + N_t$.
For a point target, the detection result is obtained by threshold segmentation of the output after background suppression. Although the error sum may be the same for different predicted results, the detection results can be obviously different, as shown in Fig. 3.

Fig. 3 The example of different error distributions: (a) the ground truth; (b) the first predicted result, with uniform error; (c) the second predicted result, with concentrated error.
In this work, the intensity-balance weight is represented by $\beta$. It is calculated as a function of the prediction error, and the weighting curve is shown in Fig. 4.

Fig. 4 The function of intensity weighting parameter.
As described in Fig. 4, larger prediction errors receive larger weights, so the network is discouraged from producing the concentrated error distribution illustrated in Fig. 3.
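Putting the two weights together, here is a minimal sketch of the loss of Eq. (3); the squared-error norm, the per-pixel application of $\alpha$, and the linear error-dependent form of $\beta$ are assumptions standing in for the curve in Fig. 4.

```python
import torch

def multi_weighted_loss(pred: torch.Tensor, truth: torch.Tensor,
                        n_bg: int, n_tg: int) -> torch.Tensor:
    """Weighted squared error with number-balance and intensity-balance weights."""
    n = n_bg + n_tg
    alpha = torch.where(truth > 0,
                        torch.full_like(truth, n_bg / n),   # rare target pixels
                        torch.full_like(truth, n_tg / n))   # abundant background
    err = (pred - truth).abs().detach()                     # detached: used as weight only
    beta = 1.0 + err / err.max().clamp(min=1e-6)            # assumed monotone weighting
    return (alpha * beta * (pred - truth) ** 2).mean()
```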
In this section, three experiments are performed to evaluate the performance of the proposed method. A large number of infrared samples are generated based on the point target model and real background images, as shown in Fig. 5.

Fig. 5 The example of samples: (a) the target samples; (b) the background samples.
In this paper, the weak target is regarded as a point target because of the long observation range. The point targets are generated by a 2D Gaussian function as follows:
$s(x,y) = I\exp\left\{-\left[\dfrac{(x-x_0)^2}{2\sigma_x^2}+\dfrac{(y-y_0)^2}{2\sigma_y^2}\right]\right\}$ ,  (7)
where $s(x,y)$ is the target projection on the image plane, $x$ and $y$ represent the spatial coordinates, $(x_0, y_0)$ represents the target center position, $I$ denotes the peak intensity, and $\sigma_x$ and $\sigma_y$ are the variances in the row and column directions.
Furthermore, the observed optical image embedded with a dim point target can be obtained as follows:
$f(x,y) = s(x,y) + b(x,y) + n(x,y)$ ,  (8)
where $f(x,y)$ represents the output image, $b(x,y)$ denotes the cloud background, and $n(x,y)$ is the noise.
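A sketch of this sample-generation process under Eqs. (7)-(8); the Gaussian spread, the noise level, and the cutoff that defines the target pixels of the ground truth are illustrative assumptions.

```python
import numpy as np

def make_sample(background: np.ndarray, center, peak: float,
                sigma=(0.7, 0.7), noise_std: float = 1.0, rng=None):
    """Embed a 2D Gaussian point target (Eq. (7)) into a real background
    frame and add noise (Eq. (8)). Returns the observation and ground truth."""
    rng = rng or np.random.default_rng()
    h, w = background.shape
    yy, xx = np.mgrid[0:h, 0:w]
    x0, y0 = center
    s = peak * np.exp(-((xx - x0) ** 2 / (2 * sigma[0] ** 2)
                        + (yy - y0) ** 2 / (2 * sigma[1] ** 2)))
    f = s + background + rng.normal(0.0, noise_std, background.shape)
    truth = np.where(s > 0.05 * peak, s, 0.0)  # assumed target-pixel cutoff
    return f, truth
```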
For the background samples, the ground truth is an all-zero image. If a training sample contains both target pixels and background pixels, the ground truth contains only the gray values of the target pixels, and all other pixels are set to zero. Preprocessing is necessary for all training and testing samples before they enter the network. In this work, the preprocessing consists of de-averaging and normalization. It should be noted that the same average is used for all samples, because the sequences are obtained by the same sensor. As a result, the absolute intensity of point targets is better preserved to support discrimination.
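A minimal sketch of this preprocessing, assuming a single sensor-wide mean estimated once from the training set and a fixed normalization scale (the paper does not give the constants).

```python
import numpy as np

GLOBAL_MEAN = 0.0   # shared sensor-wide mean, estimated once from training data
SCALE = 255.0       # assumed normalization constant for 8-bit gray images

def preprocess(img: np.ndarray) -> np.ndarray:
    """De-average with the shared mean, then normalize. Using one mean for
    all samples preserves the absolute intensity of point targets."""
    return (img - GLOBAL_MEAN) / SCALE
```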
In the first experiment, the point targets are detected by the proposed method, Lin's method, the Max-Mean filter, the TopHat filter, and the Spatial-Temporal Accumulative Difference (STDA) method.
In the second experiment, the applicability of the proposed method to different input sizes is verified, demonstrating the advantage inherited from the fully convolutional design.
In the third experiment, the detection performance of the proposed method is evaluated under different conditions, including the original signal-to-clutter ratio (SCR) and the jitter of the sensor.
In order to measure the performance, the following evaluation metrics are introduced.
To measure the ability to remove background, the background suppression factor (BSF) is introduced:
$\mathrm{BSF} = \dfrac{\sigma_{\mathrm{in}}}{\sigma_{\mathrm{out}}}$ ,  (9)
where $\sigma_{\mathrm{in}}$ represents the standard deviation of the original image, and $\sigma_{\mathrm{out}}$ denotes that of the output image.
The quality of an image can also be indicated by the SCR, which is defined as
$\mathrm{SCR} = \dfrac{I_t}{\sigma_b}$ ,  (10)
where $I_t$ is the target intensity without background, and $\sigma_b$ represents the standard deviation of the local background region.
The comprehensive detection result is evaluated by the Receiver Operating Characteristic (ROC). The ROC curve describes the detection result through the probability of detection ($P_d$) and the probability of false alarm ($P_f$). $P_d$ is defined by
$P_d = \dfrac{N_{dt}}{N_t}$ ,  (11)
where $N_{dt}$ denotes the number of detection reports from true point targets, and $N_t$ denotes the number of true targets. $P_f$ is defined by
$P_f = \dfrac{N_{df}}{N_p}$ ,  (12)
where $N_{df}$ represents the number of detection reports from false alarms, and $N_p$ is the number of pixels in all testing images.
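For reference, the four metrics of Eqs. (9)-(12) translate directly into code; the argument names follow the definitions above.

```python
import numpy as np

def bsf(original: np.ndarray, output: np.ndarray) -> float:
    return original.std() / output.std()        # Eq. (9)

def scr(target_intensity: float, local_bg: np.ndarray) -> float:
    return target_intensity / local_bg.std()    # Eq. (10)

def roc_point(n_true_det: int, n_targets: int, n_false_det: int, n_pixels: int):
    return n_true_det / n_targets, n_false_det / n_pixels   # Eqs. (11)-(12)
```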
The simulation environment used in this work is shown in Table 2.
In this simulation, 10000 training samples are obtained based on 10 real background sequences. Meanwhile, 10000 testing samples are obtained from another 10 real background sequences. The point targets in the samples are randomly added based on the point target model. The mean SCR of the original images is about 6. The jitter of the sensor is simulated as a random Gaussian distribution. All infrared images share the same fixed size.

Fig. 6 The original image and the results of different methods for background group 1: (a) the original input; (b) the result of our method; (c) the result of Lin's method; (d) the result of Max-Mean; (e) the result of TopHat; (f) the result of STDA.

Fig. 7 The original image and the results of different methods for background group 2: (a) the original input; (b) the result of our method; (c) the result of Lin's method; (d) the result of Max-Mean; (e) the result of TopHat; (f) the result of STDA.

Fig. 8 The original image and the results of different methods for Target 1: (a) the original input; (b) the result of our method; (c) the result of Lin's method; (d) the result of Max-Mean; (e) the result of TopHat; (f) the result of STDA.

Fig. 9 The original image and the results of different methods for Target 2: (a) the original input; (b) the result of our method; (c) the result of Lin's method; (d) the result of Max-Mean; (e) the result of TopHat; (f) the result of STDA.
The original image and the standard deviation in the time domain for Target 2 are shown in Fig. 10.

Fig. 10 The display of Target 2: (a) the original gray image; (b) the standard deviation in the time domain.
In order to intuitively show the background suppression performance, the comparisons of SCR and BSF for the two point targets are listed in Table 3.
The ROC curves of the different methods are depicted in Fig. 11.

Fig. 11 The ROC curves of different methods.
For IRST applications, high real-time performance is required. The comparison of average runtime is listed in Table 4.
In the above experiments, the input size is fixed for convenience, but it is not mandatory. Benefiting from the fully convolutional design, every pixel is processed with the same parameters at the same time. As a result, although the proposed network is trained with a fixed input size, an input of arbitrary size can be fed into the network and a correspondingly-sized output is obtained at once. In contrast, patch-wise methods are restricted to a fixed input size.
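As a usage sketch, reusing the hypothetical PointTargetNet class from the architecture sketch in Section 2, the same trained weights process clips of different spatial sizes without modification.

```python
import torch

net = PointTargetNet()             # architecture sketch from Section 2
net.eval()
with torch.no_grad():
    for size in (35, 45):          # arbitrary H x W, no retraining needed
        clip = torch.randn(1, 1, 5, size, size)   # (B, C, D, H, W)
        print(net(clip).shape)     # (1, 1, 35, 35), then (1, 1, 45, 45)
```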
The images in Fig. 12 show the results for inputs of 35×35 and 45×45 pixels, both processed by the same trained network.

Fig. 12 The results for different input sizes: (a) the input image with 35×35 pixels; (b) the result for the 35×35 image; (c) the input image with 45×45 pixels; (d) the result for the 45×45 image.
In order to illustrate the detection performance of the proposed method under different conditions, infrared sequences with different jitters and original SCRs are tested.
In Fig. 13, the ROC curves of the proposed method under different jitter conditions are presented.

Fig. 13 The ROC curves with different jitters.
The detection performance of the proposed method for point targets with different original SCRs is analyzed, with the jitter fixed to 0.2. The corresponding ROC curves are shown in Fig. 14.

Fig. 14 The ROC curves with different mean original SCRs.
In IRST, point target detection remains a great challenge. Traditional methods cannot robustly and intelligently detect point targets in complex backgrounds. In this work, a deep spatial-temporal convolution neural network is proposed to address this problem. The network is built as a fully convolutional architecture without pooling layers or fully connected layers, and factorized 3D convolution and a multi-weighted loss function are adopted to enhance the performance. The proposed method is compared with four other methods, including traditional methods (the Max-Mean filter, the TopHat filter and the Spatial-Temporal Accumulative Difference method) and a deep learning based method (Lin's method). The detection performance is evaluated by different metrics, such as the signal-to-clutter ratio and the background suppression factor. Meanwhile, ROC curves are drawn to confirm the robustness of the proposed approach. Additionally, the proposed method is evaluated under different conditions, and the effects of the original SCR and the sensor's jitter are demonstrated in detail. Consequently, the deep spatial-temporal convolution neural network can effectively detect point targets with low runtime.
References
Ryu J, Kim S. Small infrared target detection by data-driven proposal and deep learning-based classification [C]. Proc. of SPIE on Infrared Technology and Applications, 2018: 10621J-10624J.
Sui X, Chen Q, Bai L. Detection algorithm of targets for infrared search system based on area infrared focal plane array under complicated background [J]. Optik, 2012, 123: 235-239.
Zhao J, Tang Z, Yang J, et al. Infrared small target detection using sparse representation [J]. Journal of Systems Engineering and Electronics, 2011, 22(6): 897-904.
Li M, Lin Z, Long Y, et al. Joint detection and tracking of size-varying infrared targets based on block-wise sparse decomposition [J]. Infrared Physics and Technology, 2016, 76: 131-138.
Gao J, Lin Z, An W. Infrared small target detection using a temporal variance and spatial patch contrast filter [J]. IEEE Access, 2019, 7: 32217-32226.
Zhang W, Cong M, Wang L. Algorithms for optical weak small targets detection and tracking: review [C]. International Conference on Neural Networks and Signal Processing, 2003: 643-647.
Chen C, Li H, Wei Y, et al. A local contrast method for small infrared target detection [J]. IEEE Transactions on Geoscience and Remote Sensing, 2014, 52: 574-581.
Warren R. Detection of distant airborne targets in cluttered backgrounds in infrared image sequences [D]. University of South Australia, 2002.
Barnett J. Statistical analysis of median subtraction filtering with application to point target detection in infrared backgrounds [C]. Proc. of SPIE on Infrared Systems and Components III, 1989: 10-18.
Tom V, Peli T, Leung M, et al. Morphology-based algorithm for point target detection in infrared backgrounds [C]. Proc. of SPIE on Signal and Data Processing of Small Targets, 1993: 2-11.
Deshpande S D, Meng H E, Venkateswarlu R, et al. Max-mean and max-median filters for detection of small targets [C]. Proc. of SPIE on International Symposium on Optical Science, Engineering, and Instrumentation, 1999: 74-83.
Yu Q, Huang S, Zhao W, et al. A fusion detection algorithm of small infrared target based on spatial-temporal accumulative difference [J]. Journal of Projectiles, Rockets, Missiles and Guidance, 2014, 34(6): 181-189.
Schmidt W. Modified matched filter for cloud clutter suppression [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1990, 12: 594-600.
Gao C, Meng D, Yang Y, et al. Infrared patch-image model for small target detection in a single image [J]. IEEE Transactions on Image Processing, 2013, 22(12): 4996-5009.
Girshick R, Donahue J, Darrell T. Region-based convolutional networks for accurate object detection and segmentation [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38: 142-158.
Girshick R. Fast R-CNN [C]. Proc. of IEEE International Conference on Computer Vision, 2015: 1440-1448.
Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [C]. Advances in Neural Information Processing Systems, 2015: 91-99.
He K, Gkioxari G, Dollar P, et al. Mask R-CNN [C]. Proc. of IEEE International Conference on Computer Vision, 2017: 2980-2988.
Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection [C]. Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2016: 779-788.
Redmon J, Farhadi A. YOLOv3: an incremental improvement [J]. arXiv preprint arXiv:1804.02767, 2018: 1-6.
Liu M, Du H, Zhao Y, et al. Image small target detection based on deep learning with SNR controlled sample generation [C]. Proc. of CSMA, 2017: 211-220.
Lin L, Wang S, Tang Z. Using deep learning to detect small targets in infrared oversampling images [J]. Journal of Systems Engineering and Electronics, 2018, 5: 947-952.
Goodfellow I, Bengio Y, Courville A. Deep Learning [M]. MIT Press, 2017.
Wang M, Liu B, Foroosh H. Factorized convolutional neural networks [C]. Proc. of IEEE International Conference on Computer Vision Workshops, 2017: 1-10.
Sun L, Jia K, Yeung D, et al. Human action recognition using factorized spatio-temporal convolutional networks [C]. Proc. of IEEE International Conference on Computer Vision, 2015: 4597-4605.
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition [J]. arXiv preprint arXiv:1409.1556, 2014.
Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation [C]. Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2015: 1-10.
Kim S. Double layered-background removal filter for detecting small infrared targets in heterogeneous backgrounds [J]. Journal of Infrared, Millimeter and Terahertz Waves, 2011, 32: 79-101.
He Y, Li M, Zhang J, et al. Small infrared target detection based on low-rank and sparse representation [J]. Infrared Physics & Technology, 2015, 68: 98-109.