Abstract
In response to the scarcity of infrared aircraft samples and the tendency of traditional deep learning methods to overfit, a few-shot infrared aircraft classification method based on a cross-correlation network is proposed. The method combines two core modules: a simple parameter-free self-attention module and a cross-attention module. By analyzing the self-correlation within images and the cross-correlation between support and query images, it achieves effective classification of infrared aircraft under few-shot conditions. The proposed cross-correlation network integrates the two modules and is trained in an end-to-end manner. The parameter-free self-attention extracts the internal structure of each image, while the cross-attention computes the cross-correlation between images, further extracting and fusing inter-image features. Compared with existing few-shot infrared target classification models, this model focuses on the geometric structure and thermal texture information of infrared images by modeling the semantic relevance between the features of the support set and the query set, thus attending better to the target objects. Experimental results show that this method outperforms existing infrared aircraft classification methods across various classification tasks, with the largest accuracy improvement exceeding 3%. Ablation and comparative experiments further confirm the effectiveness of the method.
In recent years, the rapid advancement of infrared detection and imaging technology has led to an expanding application scope for infrared imagery.
In contrast, the human visual system has the remarkable ability to rapidly form cognitive frameworks for new entities based on only a few examples.
Infrared images possess unique characteristics, such as low contrast and low signal-to-noise ratio. Moreover, apart from the target objects, infrared images may also contain various background interference, such as buildings and clouds. Therefore, designing a network model that can focus more on the target objects in infrared images under the constraint of extremely limited samples is crucial for our research. Recent advancements in few-shot learning have seen widespread application of meta-learning and transfer learning. Chen et al. proposed an infrared aircraft few-shot classification method based on meta-learning.
In the task of few-shot classification, test images in the query set come from novel classes, making it challenging for the extracted features to focus on the target objects.
We propose a few-shot infrared aircraft classification method based on cross-correlation networks, which integrates two crucial attention modules and is trained in an end-to-end manner. Firstly, by utilizing the parameter-free self-attention module (SAM), we extract the intra-image correlation to acquire feature representations in both the spatial and channel dimensions. Subsequently, the cross-attention module (CA) is employed to generate cross-attention between support and query images, thereby enhancing the model's generalization capability. By efficiently fusing features within and between images with minimal parameters, the model reduces computational complexity. In contrast to current models for few-shot infrared aircraft classification, our approach enhances the focus on the geometric and textural details of infrared imagery. It achieves this by establishing a semantic connection between the features of the support and query samples, thereby improving the model's ability to accurately identify target objects. The proposed model receives robust support and validation from classification experiments and ablation studies, all achieved without introducing excessive parameters.
In this section, we provide a detailed introduction to the Cross-Correlation Network (CCNet) proposed in this paper for few-shot infrared aircraft classification. The overall architecture of CCNet is illustrated in Fig. 1.

Fig. 1 The overall architecture of the CCNet model
Attention mechanisms allocate different weights to the importance of key information contained within channels, thereby enhancing the network's focus on important information. Common attention mechanisms are typically composed of convolutional layers, pooling layers, activation functions, etc., introducing additional parameters to the network. To improve network performance without increasing computational complexity, we introduce a simple parameter-free attention mechanism module called SAM, whose structure is shown in Fig. 2.

Fig. 2 Parameter-free self-attention module
Building upon this, SAM defines an energy function to measure the difference between each neuron and the other neurons in the same channel, thereby evaluating the importance of each neuron. The energy function is defined as Eq. (1):
$$e_t\left(\omega_t, b_t, y, x_i\right)=\frac{1}{M-1} \sum_{i=1}^{M-1}\left[-1-\left(\omega_t x_i+b_t\right)\right]^2+\left[1-\left(\omega_t t+b_t\right)\right]^2+\lambda \omega_t^2 \tag{1}$$
where $t$ and $x_i$ represent the target neuron and the other neurons within a single channel of the input feature $X \in \mathbb{R}^{C \times H \times W}$. Here, $i$ denotes the index over the spatial dimension, $M=H\times W$ represents the number of neurons in that channel, while $\omega_t$ and $b_t$ stand for the weight and bias of the linear transform, respectively. Minimizing Eq. (1) admits a closed-form solution, giving the minimal energy shown in Eq. (2):
$$e_t^*=\frac{4\left(\hat{\sigma}^2+\lambda\right)}{\left(t-\hat{\mu}\right)^2+2 \hat{\sigma}^2+2 \lambda} \tag{2}$$
where $\hat{\mu}=\frac{1}{M} \sum_{i=1}^{M} x_i$ and $\hat{\sigma}^2=\frac{1}{M} \sum_{i=1}^{M}\left(x_i-\hat{\mu}\right)^2$ represent the mean and variance of all neurons in the channel, and $\lambda$ is a hyperparameter used for balancing. A lower energy $e_t^*$ indicates that the neuron $t$ is more distinct from its neighbors and therefore more important, so the refined feature is obtained as Eq. (3):
$$\tilde{X}=\operatorname{sigmoid}\left(\frac{1}{E}\right) \odot X \tag{3}$$
where $E$ groups all $e_t^*$ across the channel and spatial dimensions, and $\odot$ denotes the Hadamard product. The sigmoid function constrains excessively large values in $E$; since it is monotonic, it preserves the relative importance of each neuron. Therefore, employing the parameter-free self-attention as a three-dimensional weighting module allows a unique weight to be assigned to each neuron, enhancing the attentional importance of each neuron in the feature maps. This three-dimensional weighting mechanism effectively exploits and highlights the structural features and background information within infrared images, facilitating efficient and reliable identification of infrared targets.
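For concreteness, the following is a minimal PyTorch sketch of this parameter-free weighting (Eqs. (1)-(3)), following the public SimAM formulation; the tensor shape convention and the default value of λ are illustrative assumptions, not settings taken from this paper.

```python
import torch

def parameter_free_attention(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """SimAM-style 3-D weighting; x has shape (B, C, H, W)."""
    _, _, h, w = x.shape
    n = h * w - 1                                # M - 1 neurons besides the target
    mu = x.mean(dim=(2, 3), keepdim=True)        # per-channel spatial mean
    d = (x - mu).pow(2)
    var = d.sum(dim=(2, 3), keepdim=True) / n    # per-channel spatial variance
    inv_energy = d / (4 * (var + lam)) + 0.5     # proportional to 1 / e_t* (Eq. (2))
    return x * torch.sigmoid(inv_energy)         # Eq. (3)
```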
In contrast to previous methods that independently extract features from support sets and query samples, we introduce a cross-attention module to compute the cross-correlation between support and query images. The CA module enhances the model's focus on the target object by modeling the semantic relevance between class features and query features, thereby improving the efficiency and accuracy of the subsequent matching process. The cross-attention module first takes the self-correlation representations of the support set and query samples (denoted $Z_s$ and $Z_q$) as inputs, and then produces the corresponding cross-correlation representations (denoted $\bar{Z}_s$ and $\bar{Z}_q$). The architecture of the cross-attention module is shown in Fig. 3.

Fig. 3 The architecture of the cross-attention module
In order to reduce computational complexity and obtain a more effective feature representation, we first employ a convolutional layer to decrease the channel dimension of $Z_q$ and $Z_s$ from $C$ to $C'$, resulting in the outputs $F_q$ and $F_s$. Subsequently, the cross-correlation representation of $F_q$ and $F_s$ is computed using Eq. (4):
$$C\left(x^q, x^s\right)=\frac{F_q\left(x^q\right)^{\mathrm{T}} F_s\left(x^s\right)}{\left\|F_q\left(x^q\right)\right\|\left\|F_s\left(x^s\right)\right\|} \tag{4}$$
where $x^q$ and $x^s$ denote spatial locations in the query and support feature maps, $\mathrm{T}$ signifies matrix transposition, and $C \in \mathbb{R}^{H \times W \times H \times W}$ represents the four-dimensional cross-correlation tensor. This computation method not only yields a reliable cross-correlation representation but also reduces the computational load.
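A minimal sketch of this computation in PyTorch is given below, assuming batched feature maps and the cosine normalization of Eq. (4); the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_correlation(fq: torch.Tensor, fs: torch.Tensor) -> torch.Tensor:
    """Four-dimensional cosine cross-correlation tensor (Eq. (4)).
    fq, fs: (B, C', H, W) -> returns (B, H, W, H, W)."""
    b, c, h, w = fq.shape
    fq = F.normalize(fq.reshape(b, c, h * w), dim=1)  # unit norm per location
    fs = F.normalize(fs.reshape(b, c, h * w), dim=1)
    corr = torch.einsum('bcq,bcs->bqs', fq, fs)       # cosine similarity of all pairs
    return corr.reshape(b, h, w, h, w)
```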
In the process of fine-grained classification of infrared aircraft, due to the similarity of some target shapes, the cross-correlation tensor may contain unreliable correlations. Therefore, we adopt a convolutional matching process to obtain a more reliable cross-correlation representation. Specifically, we use four-dimensional convolution, which enhances the expressive ability of target features and improves classification accuracy by analyzing the consistency of adjacent matches in the four-dimensional space and achieving geometric matching on the tensor.
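Since standard frameworks provide no native 4-D convolution, a common practice is to factorize it into two 3-D convolutions, one over the query spatial dimensions and one over the support spatial dimensions. The sketch below illustrates this factorization under assumed kernel size and single-channel configuration; it is not the paper's exact design.

```python
import torch
import torch.nn as nn

class SepConv4d(nn.Module):
    """Factorized 4-D convolution over a correlation tensor (B, Hq, Wq, Hs, Ws)."""
    def __init__(self, k: int = 3):
        super().__init__()
        p = k // 2
        self.conv_q = nn.Conv3d(1, 1, (k, k, 1), padding=(p, p, 0))  # over (Hq, Wq)
        self.conv_s = nn.Conv3d(1, 1, (1, k, k), padding=(0, p, p))  # over (Hs, Ws)

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        b, hq, wq, hs, ws = c.shape
        x = self.conv_q(c.reshape(b, 1, hq, wq, hs * ws))  # filter query dims
        x = self.conv_s(x.reshape(b, 1, hq * wq, hs, ws))  # filter support dims
        return x.reshape(b, hq, wq, hs, ws)
```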
After obtaining the reliable cross-correlation tensor, it is necessary to generate the common attention maps $A_s$ and $A_q$ for the support set and query set. Taking the calculation of the query attention map as an example, the calculation method is shown in Eq. (5):
$$A_q\left(x^q\right)=\frac{1}{H W} \sum_{x^s} \frac{\exp \left(C\left(x^q, x^s\right) / \sigma\right)}{\sum_{x^{q^{\prime}}} \exp \left(C\left(x^{q^{\prime}}, x^s\right) / \sigma\right)} \tag{5}$$
where $x^q$ represents a position in the query feature map and $\sigma$ is the temperature factor; a lower temperature leads to lower entropy, making the distribution concentrate on a few positions with higher confidence. The support attention map $A_s$ is computed in a symmetric manner.
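Under the formulation above, the query attention map can be sketched as follows; the temperature value and tensor layout are assumptions for illustration.

```python
import torch

def query_attention(corr: torch.Tensor, sigma: float = 5.0) -> torch.Tensor:
    """Query attention map from the 4-D correlation tensor (Eq. (5)).
    corr: (B, Hq, Wq, Hs, Ws) -> (B, Hq, Wq)."""
    b, hq, wq, hs, ws = corr.shape
    c = corr.reshape(b, hq * wq, hs * ws)
    attn = torch.softmax(c / sigma, dim=1)       # softmax over query positions
    return attn.mean(dim=2).reshape(b, hq, wq)   # average over support positions
```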
Unlike many recent few-shot learning methods that adopt a 'pre-training + fine-tuning' two-stage training scheme, we propose an end-to-end training strategy for CCNet. This strategy jointly trains the designed modules and the backbone network by combining the metric loss $L_{metric}$ and the global classification loss $L_{global}$. The metric loss is based on the cosine similarity between the query prototype feature vector and the support prototype feature vectors, and is calculated as shown in Eq. (6):
$$L_{metric}=-\log \frac{\exp \left(\cos \left(q, p_n\right) / \tau\right)}{\sum_{n^{\prime}=1}^{N} \exp \left(\cos \left(q, p_{n^{\prime}}\right) / \tau\right)} \tag{6}$$
where $\cos(\cdot,\cdot)$ denotes the calculation of cosine similarity, $q$ is the query prototype vector and $p_n$ is the support prototype vector of the $n$-th category, $N$ indicates the total number of categories, and $\tau$ is the temperature factor.
The global classification loss is computed using a fully connected layer followed by a softmax function, in order to classify each query sample among all available training categories. The specific calculation method is shown in Eq. (7):
$$L_{global}=-\log \frac{\exp \left(\boldsymbol{w}_c^{\mathrm{T}} q+b_c\right)}{\sum_{j=1}^{C} \exp \left(\boldsymbol{w}_j^{\mathrm{T}} q+b_j\right)} \tag{7}$$
where $\boldsymbol{w}_j$ represents the weights of the fully connected layer, $b_j$ the corresponding bias, $c$ the ground-truth training class of the query sample, and $C$ the number of training categories. The overall classification loss is defined as Eq. (8):
$$L=L_{metric}+\lambda L_{global} \tag{8}$$
where $\lambda$ is the weight that balances the effects of the two losses. By optimizing the overall loss $L$, the network can be trained end-to-end using the gradient descent algorithm.
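A compact sketch of this joint objective is given below, assuming episode labels and fully-connected logits are available; the cross-entropy calls are equivalent to Eqs. (6) and (7), and the additive weighting follows Eq. (8).

```python
import torch
import torch.nn.functional as F

def ccnet_loss(q_protos, s_protos, episode_labels, global_logits, global_labels,
               tau: float = 0.2, lam: float = 0.25) -> torch.Tensor:
    """Joint metric + global classification loss (Eqs. (6)-(8)).
    q_protos: (B, D); s_protos: (N, D); episode_labels: (B,) in [0, N);
    global_logits: (B, C_train); global_labels: (B,) in [0, C_train)."""
    sims = F.cosine_similarity(q_protos.unsqueeze(1), s_protos.unsqueeze(0), dim=2)
    l_metric = F.cross_entropy(sims / tau, episode_labels)    # Eq. (6)
    l_global = F.cross_entropy(global_logits, global_labels)  # Eq. (7)
    return l_metric + lam * l_global                          # Eq. (8)
```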
All experiments were conducted on hardware with an Intel i7-13700 processor, an NVIDIA RTX 4080 graphics card, and 64 GB of DDR4 memory, under Windows 10 with the PyTorch deep learning framework. During the training phase, we adopted a training strategy based on N-way K-shot meta-tasks. Specifically, in each training episode, N categories are randomly selected from the training data, and K labeled samples are selected from each category to construct the support set. Subsequently, a certain number of samples are randomly drawn from the remaining samples of these N categories to form the query set. Finally, the model predicts the category labels of the query samples. In the validation and testing phases, we use the same meta-task protocol for evaluation. It should be noted that the validation set, test set, and training set contain mutually disjoint categories, i.e., $\mathcal{C}_{train} \cap \mathcal{C}_{val}=\mathcal{C}_{train} \cap \mathcal{C}_{test}=\mathcal{C}_{val} \cap \mathcal{C}_{test}=\varnothing$.
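The meta-task construction can be sketched as below; the number of query samples per class is an assumption (15 is a common choice in the few-shot literature), as are the function and variable names.

```python
import random

def sample_episode(data_by_class: dict, n_way: int = 5, k_shot: int = 1,
                   n_query: int = 15):
    """Sample one N-way K-shot meta-task from {class_label: [samples, ...]}."""
    classes = random.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = random.sample(data_by_class[cls], k_shot + n_query)
        support += [(img, episode_label) for img in picks[:k_shot]]
        query += [(img, episode_label) for img in picks[k_shot:]]
    return support, query
```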
In order to validate the effectiveness of the model proposed in this study, we conducted experiments using two datasets: the miniImageNet dataset and the miniInfra infrared dataset. Sample images from both datasets are shown in Fig. 4.

Fig. 4 (a) Samples from the miniImageNet dataset; (b) samples from the miniInfra dataset
In this study, we employ ResNet12 as the backbone network for feature extraction.
The miniImageNet dataset is composed of 100 categories, each containing 600 images, for a total of 60 000 visible-light images. Following the partitioning standard of previous literature, we divide it into 64 categories for training, 16 for validation, and 20 for testing.
During the experiments on the miniImageNet dataset, we use the Stochastic Gradient Descent (SGD) optimizer for 80 epochs of training, each epoch consisting of 300 meta-tasks. The initial learning rate is set to 0.1, and a learning rate decay strategy is adopted. At the 60th and 70th epochs, the learning rate is multiplied by a decay factor of 0.05. In the experiments, the temperature factor τ of the metric loss function is set to 0.2, and the hyperparameter λ for balancing the loss weight is set to 0.25.
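The schedule described above can be expressed as the following sketch; the momentum value, the stand-in model, and the loop body are assumptions for illustration.

```python
import torch

model = torch.nn.Linear(640, 5)  # stand-in for the CCNet network; illustrative only
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 70], gamma=0.05)  # decay at epochs 60 and 70

for epoch in range(80):
    for task in range(300):   # 300 meta-tasks per epoch
        pass                  # sample episode, forward, compute Eq. (8), backward, step
    scheduler.step()
```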
Table 1 Comparison of classification accuracy (%) on the miniImageNet dataset

| Method | Backbone | 5-way 1-shot | 5-way 5-shot |
|---|---|---|---|
| MAML | ConvNet | 48.70±0.84 | 63.11±0.92 |
| RelationNet | ConvNet | 50.44±0.82 | 65.32±0.70 |
| CAN | ResNet12 | 63.85±0.48 | 79.44±0.34 |
| AFHN | ResNet18 | 62.38±0.72 | 78.16±0.56 |
| PSST | WRN-28-10 | 64.05±0.49 | 80.24±0.45 |
| NCA | ResNet12 | 62.55±0.12 | 78.27±0.09 |
| Meta-Baseline | ResNet12 | 63.17±0.23 | 79.26±0.17 |
| MIAN | ResNet12 | 64.27±0.35 | 81.24±0.26 |
| TFH | ResNet18 | 64.49±0.84 | 79.94±0.60 |
| CCNet (ours) | ResNet12 | 66.20±0.43 | 81.82±0.31 |
The miniInfra dataset comprises 33 classes of terrestrial targets and 8 classes of aircraft targets. Terrestrial targets encompass various categories such as buildings, bicycles, pedestrians, cars, animals, and boats, with each class containing 100 to 200 infrared images. The 8 classes of aircraft targets include trainer aircraft, civil aviation aircraft, three types of helicopters (Z-8, Z-9, Z-15), and three types of jet aircraft (J-7, J-8, J-11), with each class containing 40 to 80 images. The granularity of aircraft target classification is finer than that of terrestrial targets.
Given the severe shortage of infrared aircraft data, and to validate the model's ability to recognize fine-grained targets, we selected 25 types of ground targets as the training set, 8 types of ground targets as the validation set, and the 8 types of aircraft targets as the test set. The experiments include two standard few-shot classification tasks: 5-way 1-shot and 5-way 5-shot. Considering that there are 8 types of aircraft, we added two task settings, 8-way 1-shot and 8-way 5-shot, to test the model's generalization ability for few-shot infrared aircraft in a realistic environment. Consistent with the experimental setup on the miniImageNet dataset, we still use the SGD optimizer with a learning-rate decay strategy. Since the miniInfra dataset is much smaller than the miniImageNet dataset, to prevent overfitting we reduced the number of training epochs to 20 and set the initial learning rate to 0.01 for the infrared aircraft classification task.
We compared the experimental results with existing infrared aircraft classification methods; the results are reported in Table 2.
Table 2 Comparison of classification accuracy (%) on the miniInfra dataset

| Method | Pre-train | 5-way 1-shot | 5-way 5-shot | 8-way 1-shot | 8-way 5-shot |
|---|---|---|---|---|---|
| Improved Relation Network | No | 84.37±1.31 | 93.66±0.76 | 77.56±1.46 | 90.58±0.64 |
| Improved Relation Network | Yes | 82.79±0.75 | 94.51±0.82 | 78.47±0.94 | 89.82±1.02 |
| MLF | No | — | — | 78.58±0.97 | 91.12±0.37 |
| MLF | Yes | — | — | 81.27±0.91 | 92.74±0.35 |
| CCNet (ours) | No | 85.58±0.97 | 95.09±0.46 | 81.95±0.62 | 93.26±0.38 |
To delve deeper into the impact of the core modules in CCNet, we conducted a series of ablation experiments on the miniImageNet and miniInfra datasets. These experiments included scenarios where both core modules were removed simultaneously, as well as cases where only one module was used independently. We constructed a baseline model that contains only the backbone network, without any additional modules, to evaluate the effectiveness of the core modules in CCNet. We carried out 5-way 1-shot ablation experiments on the miniImageNet and miniInfra datasets. As can be seen from Fig. 5, the complete CCNet model reaches higher training and validation accuracy than the baseline model on both datasets.

Fig. 5 (a) Training and validation accuracy curves of the baseline model and the CCNet model on the miniImageNet dataset; (b) training and validation accuracy curves of the baseline model and the CCNet model on the miniInfra dataset
In this study, further ablation experiments are conducted on the 5-way 1-shot tasks of the two datasets to individually validate the effectiveness of the SAM module and the CA module. When only the CA module is used, the basic representation $Z_q$ is taken as its input; when only the SAM module is used, its output is directly utilized for classification. The results of the ablation experiments are presented in Fig. 6.

Fig. 6 Ablation experiment results on the miniImageNet and miniInfra datasets
We also present the results of class activation mapping (CAM) feature visualization using our CCNet, encompassing both visible and infrared images, as illustrated in Fig. 7.

(a) cat; (b); (c) dog; (d); (e) airplane; (f)
Fig. 7 Class activation mapping (CAM) feature visualization of CCNet
In this study, we replaced the attention modules in the proposed CCNet model with existing alternatives to compare the accuracy and parameter scale of the proposed modules against existing attention modules. We first evaluated self-attention and cross-attention methods based on feature similarity, which focus on the correlation of spatial structure features within images.
As shown in Table 3, the proposed SAM and CA modules achieve accuracy comparable to or better than existing attention modules while introducing few or no additional parameters.
Table 3 Comparison of attention modules (5-way 1-shot accuracy, %) and additional parameters

| Module | Self | Cross | miniImageNet | miniInfra | Added params |
|---|---|---|---|---|---|
| Baseline | × | × | 64.86 | 81.27 | 0 k |
| SE | √ | × | 66.37 | 81.99 | 102.4 k |
| SC | √ | × | 62.96 | 79.80 | 89.2 k |
| LSA | √ | × | 64.77 | 80.62 | 1 644.16 k |
| NLSA | √ | × | 65.67 | 82.34 | 822.1 k |
| CBAM | √ | × | 64.77 | 82.79 | 102.5 k |
| SCR | √ | × | 64.43 | 78.80 | 157.3 k |
| CCA | × | √ | 66.00 | 84.26 | 45.8 k |
| SAM | √ | × | 65.84 | 83.31 | 0 k |
| CA | × | √ | 65.69 | 84.30 | 9.41 k |
In this study, we have proposed a few-shot infrared aircraft classification method based on a cross-correlation network, which can effectively solve the classification problem of infrared aircraft when the number of samples is severely insufficient. To reduce model parameters and to specifically target the structural features of infrared aircraft images, we introduce a parameter-free self-attention mechanism to analyze the self-correlation within images. Meanwhile, we design a cross-attention mechanism to investigate the cross-correlation between images, which effectively enhances the model's capability to extract features from infrared images. The experimental results show that our method significantly outperforms existing methods in aerial target classification accuracy on the infrared dataset, with an improvement of up to 3% in classification accuracy for specific tasks. Furthermore, the tests on the public miniImageNet dataset and the ablation experiments further verify the effectiveness and contributions of the proposed modules. The method proposed in this paper not only has broad application potential in aircraft detection, but also has great value in data-scarce civilian fields such as medicine. However, this paper addresses only the single task of aircraft classification, whereas practical infrared detection systems involve a series of complex tasks such as target detection, target recognition, and target tracking. Therefore, how to deploy the few-shot model in these practical application scenarios while maintaining good performance across multiple tasks will be the focus of the next stage of our work.
References
Ning C, Liu W, Wang X. Infrared Object Recognition Based on Monogenic Features and Multiple Kernel Learning[C]//2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC). 2018: 204-208. 10.1109/icivc.2018.8492872
Chen R M, Liu S J, Miao Z, et al. Infrared aircraft few-shot classification method based on meta learning[J]. Journal of Infrared and Millimeter Waves, 2021, 40(4): 554-560.
Li W, Chen Q, Gu G, et al. Visible-infrared image matching based on parameter-free attention mechanism and target-aware graph attention mechanism[J]. Expert Systems with Applications, 2024, 238: 122038. 10.1016/j.eswa.2023.122038
Jin L, Liu S J, Wang X, et al. Infrared aircraft classification method with small samples based on improved relation network[J]. Acta Optica Sinica, 2020, 40(8): 0811005. 10.3788/aos202040.0811005
Luo X, Wu H, Zhang J, et al. A closer look at few-shot classification again[C]//Proceedings of the 40th International Conference on Machine Learning. 2023, 202: 23103-23123.
Li X, Yang X, Ma Z, et al. Deep metric learning for few-shot image classification: A review of recent developments[J]. Pattern Recognition, 2023, 138: 109381. 10.1016/j.patcog.2023.109381
Shi B, Li W, Huo J, et al. Global- and local-aware feature augmentation with semantic orthogonality for few-shot image classification[J]. Pattern Recognition, 2023, 142: 109702. 10.1016/j.patcog.2023.109702
Hou R, Chang H, Ma B, et al. Cross Attention Network for Few-shot Classification[C]//Advances in Neural Information Processing Systems. 2019, 32.
Kang D, Kwon H, Min J, et al. Relational Embedding for Few-Shot Classification[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021: 8802-8813. 10.1109/iccv48922.2021.00870
Li J. EACCNet: Enhanced Auto-Cross Correlation Network for Few-Shot Classification[C]//Knowledge Science, Engineering and Management. 2023, 14117: 354-365. 10.1007/978-3-031-40283-8_30
Kwon H, Kim M, Kwak S, et al. Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021: 13045-13055. 10.1109/iccv48922.2021.01282
Lee S, Lee S, Seong H, et al. Revisiting Self-Similarity: Structural Embedding for Image Retrieval[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023: 23412-23421. 10.1109/cvpr52729.2023.02242
Wang L, Lei S, He J, et al. Self-Correlation and Cross-Correlation Learning for Few-Shot Remote Sensing Image Semantic Segmentation[C]//Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems. 2023: 1-10. 10.1145/3589132.3625570
Zhong Y, Su Y, Zhao H. Self-similarity feature based few-shot learning via hierarchical relation network[J]. International Journal of Machine Learning and Cybernetics, 2023, 14(12): 4237-4249. 10.1007/s13042-023-01892-9
Yang L, Zhang R-Y, Li L, et al. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks[C]//Proceedings of the 38th International Conference on Machine Learning. 2021: 11863-11874.
Wen X, Cao C, Li Y, et al. DRSN with Simple Parameter-Free Attention Module for Specific Emitter Identification[C]//2022 IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). 2022: 192-200. 10.1109/trustcom56396.2022.00036
Tan S, Zhang L, Shu X, et al. A feature-wise attention module based on the difference with surrounding features for convolutional neural networks[J]. Frontiers of Computer Science, 2023, 17(6): 176338. 10.1007/s11704-022-2126-1
Webb B S, Dhruv N T, Solomon S G, et al. Early and Late Mechanisms of Surround Suppression in Striate Cortex of Macaque[J]. Journal of Neuroscience, 2005, 25(50): 11666-11675. 10.1523/jneurosci.3414-05.2005
Vinyals O, Blundell C, Lillicrap T, et al. Matching Networks for One Shot Learning[C]//Advances in Neural Information Processing Systems. 2016, 29.
Liu Q, Li X, Yuan D, et al. LSOTB-TIR: A Large-Scale High-Diversity Thermal Infrared Single Object Tracking Benchmark[J]. IEEE Transactions on Neural Networks and Learning Systems, 2023: 1-14. 10.1109/tnnls.2023.3236895
He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016: 770-778. 10.1109/cvpr.2016.90
Ravi S, Larochelle H. Optimization as a model for few-shot learning[C]//International Conference on Learning Representations. 2017.
Finn C, Abbeel P, Levine S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks[C]//Proceedings of the 34th International Conference on Machine Learning. 2017: 1126-1135.
Hou R, Chang H, Ma B, et al. Cross Attention Network for Few-shot Classification[C]//Advances in Neural Information Processing Systems. 2019, 32.
Li K, Zhang Y, Li K, et al. Adversarial Feature Hallucination Networks for Few-Shot Learning[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020: 13467-13476. 10.1109/cvpr42600.2020.01348
Chen Z, Ge J, Zhan H, et al. Pareto Self-Supervised Training for Few-Shot Learning[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021: 13658-13667. 10.1109/cvpr46437.2021.01345
Laenen S, Bertinetto L. On Episodes, Prototypical Networks, and Few-Shot Learning[C]//Advances in Neural Information Processing Systems. 2021, 34: 24581-24592.
Chen Y, Liu Z, Xu H, et al. Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021: 9062-9071. 10.1109/iccv48922.2021.00893
Qin Z, Wang H, Mawuli C B, et al. Multi-instance attention network for few-shot learning[J]. Information Sciences, 2022, 611: 464-475. 10.1016/j.ins.2022.07.013
Lazarou M, Stathaki T, Avrithis Y. Tensor feature hallucination for few-shot learning[C]//2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2022: 2050-2060. 10.1109/wacv51458.2022.00211
Hu J, Shen L, Sun G. Squeeze-and-Excitation Networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7132-7141. 10.1109/cvpr.2018.00745
Huang S, Wang Q, Zhang S, et al. Dynamic Context Correspondence Network for Semantic Alignment[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019: 2010-2019. 10.1109/iccv.2019.00210
Ramachandran P, Parmar N, Vaswani A, et al. Stand-Alone Self-Attention in Vision Models[C]//Advances in Neural Information Processing Systems. 2019, 32.
Wang X, Girshick R, Gupta A, et al. Non-local Neural Networks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 7794-7803. 10.1109/cvpr.2018.00813
Woo S, Park J, Lee J-Y, et al. CBAM: Convolutional Block Attention Module[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 3-19. 10.1007/978-3-030-01234-2_1