Abstract
People easily become distracted or tired during long-duration activities such as driving and attending online classes, which can lead to accidents or poor efficiency. To detect such human behaviors while protecting personal privacy, a head motion detection method based on a low-resolution infrared array sensor is proposed. First, salient regions of the human body are extracted using image processing techniques. Then a 3D image fusion algorithm is developed to extract change information in the spatiotemporal domain. Finally, an improved residual network is developed to classify head motions. Ten head movements are designed for driving and online classroom scenarios. Experimental results show that within the detection range of 50 cm to 100 cm, our average recognition rate is 96.76% and the processing speed is 9 frames per second, outperforming existing state-of-the-art algorithms. The accuracy of the system is 93.7% in an in-vehicle experiment.
Head motion detection is widely used in face recognition, human-computer interaction, target tracking, virtual reality games, suspicious behavior detection, assistive systems for the physically challenged, and driver behavior analysis.
One of the most popular approaches to head motion detection is based on wearable devices. Han et al. monitored driver head posture using a MEMS magnetometer and a neural network for long-distance driving fatigue analysis; related work has applied MEMS sensors to head-orientation control of robotic wheelchairs and wearables to head gesture recognition.
However, wearable devices require the user to wear or carry them at all times, which is inconvenient. Visible-light cameras do not restrict the user in this way, but they raise privacy concerns and usually do not work in dark environments. The infrared array sensor images a target without contact by detecting its infrared radiation temperature, and it is insensitive to lighting conditions. Low-resolution infrared sensors also have the advantage of protecting privacy. Therefore, infrared sensors have a wide range of applications, including gesture recognition, indoor human localization and tracking, in-bed posture recognition, human behavior recognition, and facing-direction detection.
In this study, we propose a head motion detection system based on a low-resolution (32×32) infrared array sensor. It detects head motion at short distances ranging from 50 cm to 100 cm, where generally only the head and a small portion of the shoulders fall into view. Our algorithm adapts to the detection distance, which suits application scenarios such as driving and online classrooms, where the relative motion between the user's head and the fixed sensor is typically small.
Our detection system is built with the HTPA32×32d (HEIMANN) infrared array sensor, as shown in Fig. 1, and its main specifications are listed in Table 1.
To build the head motion detection system, the infrared sensor is connected to a Raspberry Pi 3B through the I2C interface, and the collected temperature data are transmitted to the host computer through a Wi-Fi module for the final identification processing. For driving and online classroom applications, the sensor is fixed in front of the user by a tripod, as shown in Fig. 2.
Table 1 Main specifications of the infrared array sensor

| Item | Specification |
|---|---|
| Infrared sensor model | HTPA 32×32 |
| Camera | 1 |
| Object temperature range | −40 °C to 85 °C |
| Viewing angle | 66° |
| Number of pixels | 1024 (32×32) |
| Temperature output mode | |
| Frame rate | 5 frames/s |

Fig. 1 Infrared array sensor, (a) physical view of the sensor, (b) infrared temperature image (palm)

Fig. 2 System composition design
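As a rough illustration of the data path in Fig. 2, the sketch below reads raw frames over I2C with the smbus2 library and streams them to the host over a TCP socket. This is a minimal sketch under stated assumptions: the I2C address and the single plain block-read are placeholders, since the HTPA32×32d register map and read protocol are not reproduced in this paper.

```python
# Minimal acquisition sketch (assumptions: smbus2 on the Raspberry Pi;
# SENSOR_ADDR and the plain block-read are placeholders, not the real
# HTPA32x32d protocol; little-endian pixel order is assumed).
import socket
import struct
import numpy as np
from smbus2 import SMBus, i2c_msg

SENSOR_ADDR = 0x1A                    # hypothetical I2C address
HOST = ("192.168.1.100", 5005)        # hypothetical host-computer endpoint

def read_frame(bus: SMBus) -> np.ndarray:
    """Read one 32x32 raw frame (2 bytes per pixel) in a single I2C read."""
    msg = i2c_msg.read(SENSOR_ADDR, 32 * 32 * 2)
    bus.i2c_rdwr(msg)
    raw = np.array(list(msg), dtype=np.uint8)
    return raw.view("<u2").reshape(32, 32)

with SMBus(1) as bus, socket.create_connection(HOST) as sock:
    while True:                                # sensor runs at ~5 frames/s
        frame = read_frame(bus)
        sock.sendall(struct.pack("<I", frame.nbytes))  # length-prefixed
        sock.sendall(frame.tobytes())                  # ship frame over Wi-Fi
```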
Our detection algorithm includes three modules: temperature image preprocessing, image fusion, and classification, as shown in Fig. 3.

Fig. 3 Flow chart of the head motion detection algorithm
The data acquired by the infrared array sensor contain noise, so the temperature images are preprocessed before recognition.

Fig. 4 Original image and pseudo-color images (head and shoulder region of a human body), (a) original image (32×32), (b) pseudo-color image (32×32), (c) pseudo-color image (64×64)
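For illustration, a pseudo-color rendering like Fig. 4 can be produced as sketched below. The paper does not name its colormap or interpolation method, so COLORMAP_JET and bicubic upscaling are assumptions.

```python
# Sketch of the pseudo-color rendering in Fig. 4 (colormap and
# interpolation choices are assumptions, not the paper's exact ones).
import cv2
import numpy as np

def pseudo_color(temp: np.ndarray, upscale: int = 2) -> np.ndarray:
    """Map a 32x32 temperature frame to a pseudo-color image,
    optionally upscaled (e.g., 32x32 -> 64x64 as in Fig. 4(c))."""
    norm = cv2.normalize(temp.astype(np.float32), None, 0, 255,
                         cv2.NORM_MINMAX).astype(np.uint8)
    color = cv2.applyColorMap(norm, cv2.COLORMAP_JET)
    h, w = temp.shape
    return cv2.resize(color, (w * upscale, h * upscale),
                      interpolation=cv2.INTER_CUBIC)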
We propose a human salient region extraction method, described in Algorithm 1; its overall flow is shown in Fig. 5.
Algorithm 1 Adaptive Threshold
Input: the Canny edge-detection image; the local window size
1. Obtain the threshold value C by the OTSU method.
2. Obtain the image after mean filtering over the local window.
3. Obtain the continuous boundary image.
Output: the continuous boundary image
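The sketch below is one plausible reading of Algorithm 1 in Python with OpenCV. Since the paper's exact per-pixel rule is not reproduced here, the way the global OTSU level, the local mean, and the Canny edges are combined is an assumption.

```python
# One plausible reading of Algorithm 1 (the combination rule is assumed).
import cv2
import numpy as np

def extract_boundary(temp: np.ndarray, win: int = 5) -> np.ndarray:
    """Return a continuous boundary image from a temperature frame."""
    img = cv2.normalize(temp.astype(np.float32), None, 0, 255,
                        cv2.NORM_MINMAX).astype(np.uint8)
    edges = cv2.Canny(img, 50, 150)                       # edge input
    c, otsu = cv2.threshold(img, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # step 1
    local = cv2.blur(img, (win, win))                     # step 2: mean filter
    # step 3: keep pixels warmer than their local mean inside the OTSU
    # foreground, then OR with edges to close gaps into a continuous boundary
    warm = ((img > local) & (otsu > 0)).astype(np.uint8) * 255
    return cv2.bitwise_or(warm, edges)
```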

Fig. 5 Flowchart of the head salient region extraction algorithm
We compare the extraction results of our algorithm with classical methods such as OTSU, the Prewitt operator, and the LoG operator, as shown in Fig. 6.

Fig. 6 Comparison of preprocessed results
Our infrared sensor outputs data as a sequence. Head movements are also continuous, such as bowing the head, raising the head, and turning the head. Therefore, each action's information is reflected simultaneously in the current frame (spatial information) and in multiple adjacent frames (temporal information). To this end, we design a 3D image fusion algorithm that synthesizes the spatial and temporal information of each action, as shown in Fig. 7. The three channels of the fused image are computed as
$$C_n^B(x,y)=I_n(x,y)\cdot B_n(x,y),\quad C_n^G(x,y)=\left|B_n(x,y)-B_{\mathrm{ref}}(x,y)\right|,\quad C_n^R(x,y)=\left|B_n(x,y)-B_{n-1}(x,y)\right|, \tag{1}$$
where $C_n(x,y)$ denotes the 3D color image fused from the B channel $C_n^B(x,y)$, the G channel $C_n^G(x,y)$, and the R channel $C_n^R(x,y)$. $I_n(x,y)$ denotes the $n$-th frame of the original temperature sequence, where the subscript $n$ is the frame number; $B_n(x,y)$ and $B_{n-1}(x,y)$ denote the binary mask images of frames $n$ and $n-1$, respectively, obtained by the preprocessing method described in Sect. 2.2. $B_{\mathrm{ref}}(x,y)$ is the binary mask image of a frame with the head facing front; we use it as the reference frame since this orientation is the most common posture in driving and online learning. $C_n^B(x,y)$ carries the spatial information, extracted by multiplying the current frame with its binary mask; $C_n^G(x,y)$ and $C_n^R(x,y)$ carry the temporal information, obtained as the difference between the binary mask of the current frame and those of the reference frame and the previous frame, respectively.
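A direct NumPy transcription of Eq. (1) might look as follows; scaling of the channels to a common display range is omitted in this sketch.

```python
# Sketch of the 3D image fusion in Eq. (1): B channel = masked current frame,
# G/R channels = mask differences vs. the reference and the previous frame.
import numpy as np

def fuse_frame(I_n, B_n, B_prev, B_ref):
    """Fuse one frame into a 3-channel image per Eq. (1).

    I_n: current temperature frame (float32, HxW).
    B_n, B_prev, B_ref: binary masks (0/1) of the current, previous,
    and front-facing reference frames, as float32 to avoid uint8 wraparound.
    """
    c_b = I_n * B_n                  # B channel: spatial information
    c_g = np.abs(B_n - B_ref)        # G channel: change vs. reference frame
    c_r = np.abs(B_n - B_prev)       # R channel: change vs. previous frame
    return np.stack([c_b, c_g, c_r], axis=-1)
```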

Fig. 7 3D image fusion of a sequence of frames
We design an improved model based on the residual network and call it Pro-ResNeXt50. Deeper networks can theoretically achieve higher accuracy; however, as network depth increases, gradients tend to vanish or explode and accuracy degrades. To address this, He et al. proposed the deep residual network ResNet. ResNeXt further combines the residual structure with the split-transform-merge strategy of the Inception network, as illustrated in Fig. 8.

Fig. 8 Residual learning structures, (a) BottleNeck structure of the ResNet network, (b) split-transform-merge structure of the Inception network, (c) block structure of the ResNeXt network
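For reference, a ResNeXt-style block can be expressed compactly with a grouped 3×3 convolution, which is equivalent to the split-transform-merge structure in Fig. 8(c). This is a minimal sketch using standard PyTorch layers; the channel widths are illustrative.

```python
# Minimal ResNeXt block: 1x1 reduce -> grouped 3x3 (cardinality via
# groups=32) -> 1x1 expand, with an identity shortcut.
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    def __init__(self, channels: int, width: int = 128, groups: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, 1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual connection

y = ResNeXtBlock(256)(torch.randn(1, 256, 16, 16))
```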
Our network is an improvement on ResNeXt50 and is named Pro-ResNeXt50. The improvement aims to let the network achieve higher accuracy at a faster speed. The Pro-ResNeXt50 network model is shown in Fig. 9.
The feature maps of a CNN can be considered as features of the input image, as shown on the left of Fig. 9.
The number of block modules in each convolutional layer of the ResNeXt50 network is reduced from [3,4,6,3] to [3,3,5,3]. The last fully-connected layer is then replaced with a 1×1 convolutional layer, so that the network becomes a fully convolutional network. We compared the parameter counts and floating-point operations of ResNet50, ResNeXt50, and Pro-ResNeXt50, as shown in Table 2.
Table 2 Comparison of parameters and computation of the three networks

| Network | ResNet50 | ResNeXt50 | Pro-ResNeXt50 |
|---|---|---|---|
| #Params (M) | 25.5 | 25.0 | 22 |
| FLOPs (G) | 4 | 4 | 4 |
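As a concrete sketch of these modifications, the code below builds a ResNeXt50 (32×4d) backbone with [3,3,5,3] blocks via torchvision and swaps the fully-connected head for a 1×1 convolution. The class name ProResNeXt50 and the exact head placement are assumptions based on the description of Fig. 9, not the paper's released implementation.

```python
# Sketch of Pro-ResNeXt50: ResNeXt50 (groups=32, width=4) with [3,3,5,3]
# blocks and a 1x1 conv classifier instead of the FC layer (assumptions).
import torch
import torch.nn as nn
from torchvision.models.resnet import ResNet, Bottleneck

class ProResNeXt50(ResNet):
    def __init__(self, num_classes: int = 10):
        super().__init__(Bottleneck, [3, 3, 5, 3], num_classes=num_classes,
                         groups=32, width_per_group=4)
        # replace the FC classifier with a 1x1 convolution -> fully convolutional
        self.classifier = nn.Conv2d(512 * Bottleneck.expansion,
                                    num_classes, kernel_size=1)
        del self.fc

    def forward(self, x):
        x = self.conv1(x); x = self.bn1(x); x = self.relu(x); x = self.maxpool(x)
        x = self.layer1(x); x = self.layer2(x); x = self.layer3(x); x = self.layer4(x)
        x = self.avgpool(x)            # global average pool -> N x 2048 x 1 x 1
        x = self.classifier(x)         # 1x1 conv as the classifier head
        return torch.flatten(x, 1)     # N x num_classes

model = ProResNeXt50(num_classes=10)
logits = model(torch.randn(1, 3, 64, 64))  # fused 3-channel input (assumed size)
```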

Fig. 9 Pro-ResNeXt50 network

Fig. 10 The training accuracy and loss curves of the three networks

Fig. 11 Experiment scenarios, (a) experiments in an indoor hall, (b) experiments in a car

Experiments were conducted in an indoor hall to simulate driving and online learning environments, as shown in Fig. 11(a). The test user sits on a chair, and the sensor is fixed by a tripod at a height of 1.2 m above the ground so that it is aligned with the user's face. The collection distance ranges from 0.5 m to 1 m, and data were collected under both day and night lighting conditions.
For driving and online classroom scenarios, we designed 10 head movements: (1) facing forward (Front), (2) bowing (Bow), (3) turning left 45 degrees (L45°), (4) turning left 90 degrees (L90°), (5) turning right 45 degrees (R45°), (6) turning right 90 degrees (R90°), (7) turning left with shoulders (TL), (8) turning right with shoulders (TR), (9) making phone calls with the left hand (CL), and (10) making phone calls with the right hand (CR). Among them, facing forward (Front) is considered the normal posture for concentrating on driving or listening to lectures, and the other 9 actions are considered distracting behaviors.
The temperature data acquired by the sensor are saved as video files. When collecting data, we let the user perform the 10 actions continuously. Our training and testing data are formed by extracting single-frame data from the video files. When collecting training data, we asked the users to move their heads slowly and hold for 6 s at the end of each movement, so that the sensor could acquire enough data frames for each action. When collecting testing data, the users moved their heads at normal speed and held the final state of each movement for 2 s. Training data were collected at three distances: 50 cm, 75 cm, and 100 cm. For testing data, in addition to 50 cm, 75 cm, and 100 cm, the distances 60 cm, 70 cm, 80 cm, and 90 cm were also randomly added. The test users consisted of three men and two women. About 960 frames of training data were collected for each action of the 5 users, giving a total of 9 600 frames of training data for the 10 actions, of which 4 600 frames were collected at night. About 1 900 frames were used as the test dataset, of which 800 frames were collected at night.
We implement the proposed model in PyTorch. Adam is used as the optimizer, ReLU is used as the activation function, and the learning rate is set to 0.0001.
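A minimal training configuration matching these settings might look like the following; the loss function is not specified in the paper, so cross-entropy is an assumption, and ProResNeXt50 refers to the sketch above.

```python
# Training setup as stated: Adam optimizer, learning rate 1e-4.
import torch

model = ProResNeXt50(num_classes=10)       # hypothetical class from the sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()    # assumed; not stated in the paper
```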
We use accuracy (Acc) as the performance metric of the algorithm; it is calculated as
$$\mathrm{Acc}=\frac{\sum_i n_i}{N}\times 100\%, \tag{2}$$
where $n_i$ denotes the number of correctly identified samples of class $i$, and $N$ denotes the total number of test samples. The classification results are shown in Fig. 12.

Fig. 12 Accuracy using different methods
We randomly selected a continuous action sequence at a distance of 100 cm for testing; the results are shown in Table 3 and Fig. 13.
Table 3 Recognition results of a random continuous action sequence

| Activity | Front | Bow | L45° | L90° | CL | TL | R45° | R90° | CR | TR |
|---|---|---|---|---|---|---|---|---|---|---|
| Total | 300 | 280 | 159 | 154 | 160 | 170 | 173 | 167 | 171 | 182 |
| Precision | 0.947 | 0.989 | 0.962 | 0.961 | 0.981 | 0.970 | 0.948 | 0.964 | 0.982 | 0.962 |

Fig. 13 Random continuous head movement steering
First, we tried to input the original temperature image directly into the network for training and recognition; the accuracy was only 87.73%. Next, we introduced an attention mechanism (CBAM) into ResNeXt50, which raised the accuracy to 94.47%. We then fed different combinations of the fused channels into Pro-ResNeXt50; using all three channels achieves the highest accuracy of 96.76%, as shown in Table 4.
Table 4 Accuracy of different input and network combinations

| Method | Accuracy |
|---|---|
| Original + Pro-ResNeXt50 | 87.73% |
| Original + CBAM + ResNeXt50 | 94.47% |
| Channel(1) + Pro-ResNeXt50 | 89.35% |
| Channel(1,2) + Pro-ResNeXt50 | 92.06% |
| Channel(1,3) + Pro-ResNeXt50 | 87.31% |
| Channel(1,2,3) + Pro-ResNeXt50 | 96.76% |
We compare the performance of different network models; the results are shown in Table 5.
Table 5 Performance comparison of different networks

| Method | Accuracy | Speed |
|---|---|---|
| ResNet50 | 94.10% | 7 frames/s |
| ResNeXt50 | 94.83% | 8 frames/s |
| Pro-ResNeXt50 | 96.76% | 9 frames/s |

Fig. 14 Recognition accuracy in different detection distances and light conditions
We further analyze the recognition accuracy of the model under different detection distances and lighting conditions; the results are shown in Fig. 14.
To verify the recognition accuracy of the model in a real-world application, we further tested it in a car driving environment. The infrared array sensor is fixed above the vehicle dashboard by a bracket, facing the driver's face area, as shown in Fig. 11(b). The recognition results are listed in Table 6.
Table 6 Recognition results of the in-car experiment

| Activity | Front | Bow | L90° | R45° | CR | TR | Average Acc |
|---|---|---|---|---|---|---|---|
| Total | 267 | 184 | 205 | 195 | 189 | 192 | — |
| Precision | 0.948 | 0.902 | 0.942 | 0.910 | 0.963 | 0.958 | 0.937 |
A head motion detection algorithm based on a low-resolution infrared array sensor is proposed, which can be applied to behavior monitoring and analysis in driving and online classroom learning. First, the sensor acquires head temperature images at a resolution of 32×32. This extremely low resolution effectively protects the privacy of users but also poses challenges for action recognition. Therefore, we design a salient region extraction algorithm for low-resolution images to extract the head region. Further, a 3D image fusion algorithm is developed to fully extract the spatiotemporal information of continuous head movements. Finally, we propose an improved network, Pro-ResNeXt50, for the recognition of ten kinds of head movements. The experimental results show that our algorithm achieves an average recognition accuracy of 96.76% and adapts to detection distances from 0.5 m to 1 m. The processing speed of 9 frames per second meets the needs of real-time applications. In addition, the accuracy in the in-car experiment was 93.7%. In the future, we will focus on recognizing movement transition stages, especially fast movements of the head and hands. Model lightweighting is another direction worth further study.
References
Badave H, Kuber M. Head pose estimation based robust multicamera face recognition[C]. In: 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS), 2021, 492-495. 10.1109/icais50930.2021.9395954
Agrawal A, Raj R, Porwal S. Vision-based multimodal human-computer interaction using hand and head gestures[C]. In: IEEE Conference on Information & Communication Technologies, 2013, 1288-1292. 10.1109/cict.2013.6558300
Qin Z, Shelton C R. Social grouping for multi-target tracking and head pose estimation in video[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(10): 2082-2095. 10.1109/tpami.2015.2505292
Sherstyuk A, Treskunov A. Natural head motion for 3D social games[C]. In: 2013 IEEE Virtual Reality (VR), 2013, 69-70. 10.1109/vr.2013.6549367
Tiwari C, Hanmandlu M, Vasikarla S. Suspicious face detection based on eye and other facial features movement monitoring[C]. In: 2015 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), 2015, 1-8. 10.1109/aipr.2015.7444523
Manogna, Vaishnavi, Geethanjali. Head movement based assist system for physically challenged[C]. In: 2010 4th International Conference on Bioinformatics and Biomedical Engineering, 2010, 1-4. 10.1109/icbbe.2010.5517790
Ito M, Fukumi M, Sato K. Analysis of safety verification behavior and classification of driver's head posture[C]. In: 2013 IEEE International Conference on Mechatronics and Automation, 2013, 884-889. 10.1109/icma.2013.6618032
Han H, Jang H, Sang W Y. Driver head posture monitoring using MEMS magnetometer and neural network for long-distance driving fatigue analysis[C]. In: 2019 IEEE Sensors, 2019, 1-4. 10.1109/sensors43011.2019.8956799
Ruzaij M F, Neubert S, Stoll N, et al. Auto calibrated head orientation controller for robotic-wheelchair using MEMS sensors and embedded technologies[C]. In: 2016 IEEE Sensors Applications Symposium (SAS), 2016, 1-6. 10.1109/sas.2016.7479886
Wu C W, Yang H Z, Chen Y A, et al. Applying machine learning to head gesture recognition using wearables[C]. In: 2017 IEEE 8th International Conference on Awareness Science and Technology (iCAST), 2017, 436-440. 10.1109/icawst.2017.8256495
Pangestu G, Utaminingrum F, Bachtiar F. Eye state recognition using multiple methods for applied to control smart wheelchair[J]. International Journal of Intelligent Engineering and Systems, 2019, 12(1): 232-241. 10.22266/ijies2019.0228.23
Ugurlu Y. Head posture detection using skin and hair information[C]. In: Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), 2012, 1-4.
Li S P, Wang Z F, Wang J. Study of face orientation recognition based on neural network[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2018, 32(11): 1856015. 10.1142/s0218001418560153
Tateno S, Zhu Y, Meng F. Hand gesture recognition system for in-car device control based on infrared array sensor[C]. In: 2019 58th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), 2019, 701-706. 10.23919/sice.2019.8859832
Qu D N, Yang B, Gu N H. Indoor multiple human targets localization and tracking using thermopile sensor[J]. Infrared Physics & Technology, 2019, 97: 349-359. 10.1016/j.infrared.2019.01.011
Chen Z J, Wang Y. Remote recognition of in-bed postures using a thermopile array sensor with machine learning[J]. IEEE Sensors Journal, 2021, 21(9): 10428-10436. 10.1109/jsen.2021.3059681
WANG Zhao-Jun, XU Zhi-Meng, CHEN Liang-Qin. Research on human behaviour recognition system based on infrared array sensors[J]. Infrared Technology (王召军, 许志猛, 陈良琴. 基于红外阵列传感器的人体行为识别系统研究. 红外技术), 2020, 42(3): 7. 10.3724/sp.j.7101182184
Chen Z J, Wang Y, Liu H W. Unobtrusive sensor-based occupancy facing direction detection and tracking using advanced machine learning algorithm[J]. IEEE Sensors Journal, 2018, 18(15): 6360-6368. 10.1109/jsen.2018.2844252
He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, 770-778. 10.1109/cvpr.2016.90
Woo S, Park J, Lee J Y, et al. CBAM: Convolutional block attention module[C]. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018, 3-19. 10.1007/978-3-030-01234-2_1