A Recognition Model for Passenger Boarding and Alighting Action Based on Improved Temporal Pyramid Network
Abstract: Traditional image-processing-based algorithms for identifying illegal passenger-carrying behavior rely on manually crafted human-vehicle interaction rules to detect boarding and alighting actions. However, because traffic scenes are highly complex, such hand-crafted rule sets are inevitably incomplete, which degrades recognition performance. A deep learning model based on the temporal pyramid network (TPN) is therefore introduced for boarding and alighting action recognition: by training on a large sample set, more complete features of taxi passengers' boarding and alighting behavior are extracted, improving recognition accuracy. To address the TPN model's inability to distinguish driver and passenger roles, the output layer is redesigned around door-area perception, which improves the efficiency of multi-dimensional feature extraction. To address the large spatiotemporal span of boarding and alighting actions, which makes the model susceptible to interference from irrelevant movements, a sliding window mechanism based on dynamic window weights is added to capture the key video frames of an action and improve recognition efficiency. Combining these improvements, a boarding and alighting neural network (BANN) model based on door-area perception and dynamic weights is proposed to recognize illegal passenger-carrying behavior efficiently and accurately. A training set of 4,047 annotated video clips and a test set of 810 unannotated video clips, built from surveillance videos at Beijing Capital Airport, are used to validate the model. Experimental results show that the BANN model achieves a precision of 90.21% and a recall of 88.53%, improvements of 9.78 and 11.04 percentage points over the baseline TPN model, and can well meet the needs of traffic order supervision at transportation hubs.
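The dynamic-window-weight mechanism is only summarized above. As a rough illustration of the idea (not the authors' implementation), the following Python sketch aggregates per-frame classifier scores inside each sliding window using weights that emphasize frames near the door region; the function name, the use of door-region overlap as the weight signal, and all parameter values are illustrative assumptions.

```python
# A minimal sketch of a dynamic-weight sliding window over per-frame scores.
# All names and the weighting scheme are assumptions for illustration only.
import numpy as np

def window_score(frame_scores, door_overlap, win=32, stride=8):
    """Return (start_index, weighted_score) of the best-scoring window.

    frame_scores : (T,) per-frame action confidences (assumed given).
    door_overlap : (T,) overlap of the person box with the door region,
                   used here as a stand-in for the dynamic window weights.
    """
    best = (0, -np.inf)
    for s in range(0, len(frame_scores) - win + 1, stride):
        w = door_overlap[s:s + win]
        w = w / (w.sum() + 1e-8)          # normalize weights inside the window
        score = float(np.dot(w, frame_scores[s:s + win]))
        if score > best[1]:
            best = (s, score)
    return best
```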
Table 1. Positive and negative sample classification for passenger boarding and alighting actions
Positive sample:
  Passenger boarding or alighting
Negative samples (partial):
  Driver opening or closing a door
  Driver getting out, moving around briefly, then getting back in
  Driver getting in alone and driving away
  Pedestrian passing by the vehicle
  Passenger talking with the driver, then leaving on foot
  Passenger boarding and then alighting again
  Passenger alighting and then returning for luggage, etc.

Table 2. Experimental training parameters
Parameter                 Value
Iterations                1,000
Initial learning rate     0.0003
Momentum                  0.99
Weight decay              0.0001
Learning rate schedule    Cosine annealing
Batch size                2
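As a minimal sketch, the hyperparameters in Table 2 map directly onto a standard PyTorch optimizer and scheduler configuration; `model` below is a placeholder, since the BANN architecture itself is not defined in this excerpt.

```python
import torch

model = torch.nn.Linear(1, 1)  # placeholder; stands in for the BANN network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=3e-4,            # initial learning rate 0.0003
    momentum=0.99,      # momentum
    weight_decay=1e-4,  # weight decay 0.0001
)
# cosine annealing over the 1,000 training iterations (Table 2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000)
# batch size 2 would be set on the data loader, e.g.:
# loader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=True)
```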
Table 3. Dataset sample distribution
Sample category                         Count
Total                                   4,047
Negative samples                        1,116
Boarding samples                        1,437
Alighting samples                       1,512
Driver boarding/alighting samples       1,213
Passenger boarding/alighting samples    2,298

Table 4. Experimental results of existing action recognition methods
Network model                               Precision/%    Recall/%
C3D                                         85.89          84.39
SlowFast                                    88.90          87.39
TimeSformer                                 90.21          89.25
TPN baseline (without door loss function)   90.39          89.91
BANN                                        95.47          93.89

Table 5. Test results of driver and passenger boarding and alighting actions
Model          Precision/%    Recall/%    TP     FP     FN
C3D            61.14          69.88       225    143    97
SlowFast       64.40          75.24       237    131    78
TimeSformer    66.30          75.08       244    124    81
TPN            80.43          77.49       296    72     86
BANN           90.21          88.53       332    36     43
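As a quick consistency check, the precision and recall values in Table 5 follow directly from the TP/FP/FN counts; the sketch below reproduces the BANN row.

```python
# Precision = TP/(TP+FP), Recall = TP/(TP+FN), as used in Table 5.
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(tp=332, fp=36, fn=43)  # BANN row of Table 5
print(f"precision = {p:.4f}, recall = {r:.4f}")
# precision = 0.9022, recall = 0.8853 (Table 5 reports 90.21% and 88.53%)
```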