Second, recognition accuracy suffers from deteriorated appearances that are seldom observed in still images, such as motion blur, video defocus, and rare poses. In SGD, 240k iterations are performed on 4 GPUs, with each GPU holding one mini-batch. A virtual object will effectively be superimposed on the image and must respond to the real objects. TensorFlow: Large-scale machine learning on heterogeneous systems. In Light-Head R-CNN, position-sensitive feature maps [6] are exploited to relieve the burden. They can be mainly classified into two major branches: lightweight image object detectors that make the per-frame object detector fast, and mobile video object detectors that exploit temporal information. The aggregated feature map ^Fi at frame i is obtained as a weighted average of the feature maps of nearby frames. There has been significant progress in image object detection in recent years. By contrast, previous works [44, 45, 46, 43] based on either convolutional LSTM or convolutional GRU do not consider such a design, since they operate on consecutive frames, where object displacement is small and can be neglected. The accuracy of our method at long key frame duration (l=20) is still on par with that of the single-frame baseline, while being 10.6× more computationally efficient. The proposed techniques are unified into an end-to-end learning system. For example, we achieve a 60.2% mAP score on ImageNet VID validation at a speed of 25.6 frames per second on mobiles (e.g., HuaWei Mate 8). The middle panel of Table 2 compares the proposed Light Flow with existing flow estimation networks on the Flying Chairs test set (384×512 input resolution). By default, α and β are set to 1.0. Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks.
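The weighted-average aggregation just described can be sketched in a few lines. The `aggregate` helper and its softmax weighting below are illustrative stand-ins for the learned, position-wise adaptive weights of the actual method; only the overall form (a convex combination of per-frame features) follows the text.

```python
import math

def aggregate(features, similarities):
    """Weighted average of per-frame feature vectors, weights summing to 1.

    features: list of equal-length feature vectors, one per nearby frame.
    similarities: raw scores of how well each frame's (warped) feature
    matches the current frame; softmax-normalized into weights.
    """
    exps = [math.exp(s) for s in similarities]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# With equal similarity scores the result is a plain mean.
print(aggregate([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))  # [2.0, 3.0]
```

In the paper the weights additionally vary per spatial position; the scalar-per-frame weighting here is the simplest special case.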
It is worth noting that the accuracy drops further if no flow is applied, even for sparse feature propagation on the non-key frames. Without all these important components, its accuracy cannot compete with ours. Xizhou Zhu, Jifeng Dai, Xingchi Zhu, Yichen Wei, Lu Yuan. Despite the recent success of video object detection on Desktop GPUs, its architecture is still far too heavy for mobiles. In both training and inference, the images are resized to a shorter side of 224 pixels for the image recognition network and 112 pixels for the flow network. FlowNet [32] was originally proposed for pixel-level optical flow estimation. No end-to-end training for video object detection is performed. We do not dive into the details of the varying technical designs. (Submitted on 16 Apr 2018.) In each mini-batch of SGD, either n+1 nearby video frames from ImageNet VID or a single image from ImageNet DET are sampled at a 1:1 ratio. Comprehensive experiments show that the model steadily pushes forward the performance (speed-accuracy trade-off) envelope, towards high performance video object detection on mobiles. Then the detection network Ndet is applied on ^Fk→i to get detection predictions for the non-key frame i. Built on the two principles, the latest work [21] provides a good speed-accuracy trade-off on Desktop GPUs.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., precipitation nowcasting; object detection in video. Towards High Performance Video Object Detection. Otherwise, displacements caused by large object motion would introduce severe errors into the aggregation. It is designed in an encoder-decoder mode followed by multi-resolution optical flow predictors. In this paper, we There has been significant progress in image object detection in recent years. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., It is one order of magnitude faster than the … It is worth noting that it achieves higher accuracy than the FlowNet Half and FlowNet Inception variants utilized in [19], with at least one order of magnitude less computation overhead. Light-Head R-CNN [23] is a two-stage detector, where the object detector is applied on a small set of region proposals. The objects can generally be identified from either pictures or video feeds. : Speed/accuracy trade-offs for modern convolutional object detectors. Also, we identify the gap and suggest a new approach to improve object tracking over video frames. By default, the key-frame object detector is MobileNet+Light-Head R-CNN, and flow is estimated by Light Flow. Long-term dependency in aggregation is also favoured, because more temporal information can be fused together for better feature quality. First, 3×3 convolution is used instead of fully connected matrix multiplication, since fully connected matrix multiplication is too costly when the GRU is applied to image feature maps. We further studied several design choices in flow-guided GRU. Moreover, the incidents are detected very fast. To answer this question, we experiment with a degenerated version of our method, where no flow-guided feature propagation is applied before aggregating features across key frames.
Xizhou Zhu, Jifeng Dai (代季峰), Lu Yuan (袁路), Yichen Wei (危夷晨). Computer Vision and Pattern Recognition, 2018. Fully convolutional models for semantic segmentation. In the encoder part, convolution is always the bottleneck of computation. Rigid-motion scattering for image classification. Object detection is a computer vision technique whose aim is to detect objects such as cars, buildings, and human beings, just to mention a few. The difference is that flow-guided GRU is applied. Darknet: Open source neural networks in C. Wong, A., Shafiee, M.J., Li, F., Chwyl, B.: Tiny SSD: A tiny single-shot detection deep convolutional neural network. R-FCN: Object detection via region-based fully convolutional networks. Built upon the recent works, this work proposes a unified viewpoint based on the principle of multi-frame end-to-end learning of features and cross-frame motion. Following the practice in MobileNet [13], two width multipliers, α and β, are introduced for controlling the computational complexity by adjusting the network width. Flow estimation is the key to feature propagation and aggregation. The two principles, sparse feature propagation and multi-frame feature aggregation, yield the best practice towards high performance (speed-accuracy trade-off) video object detection [21] on Desktop GPUs. This is because recognition on the key frame is still not fast enough. 30 object categories are involved, which are a subset of the ImageNet DET annotated categories. Using ReLU for the ϕ function leads to a 3.9% higher mAP score than tanh. Directly applying these detectors to video object detection faces challenges from two aspects. We cannot compare with it.
Object Detection: A Comparison of Performance of Deep Learning Models on Edge Using Intel Movidius Neural Compute Stick and Raspberry Pi 3. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. Figure 4 presents the speed-accuracy trade-off curve of our method, drawn with varying key frame duration length l from 1 to 20. Rectifier nonlinearities improve neural network acoustic models. On one hand, sparse feature propagation is used in [19, 21] to save expensive feature computation on most frames. Built upon the recent works, this work proposes a unified approach based on the principle of multi-frame end-to-end learning of features and cross-frame motion. A lightweight image object detector is applied on sparse key frames. Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Second, ϕ is the ReLU function instead of the hyperbolic tangent (tanh), for faster and better convergence. The two-dimensional motion field Mi→k between two frames Ii and Ik is estimated through a flow network Nflow(Ik,Ii)=Mi→k, which is much cheaper than Nfeat. Among them, the improvements of YOLO [15] and SSD [10], together with the latest Light-Head R-CNN [23], achieve the best speed-accuracy trade-offs. Its performance also cannot be easily compared with ours. My own dataset contains 2150 images for training and 540 for testing. Unfortunately, the architecture is not friendly for mobiles. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C. The performance gap is more obvious when the key frame duration increases (1.5% mAP score gap at l=10, 2.9% mAP score gap at l=20). Since the content of consecutive frames is highly correlated, exhaustive feature extraction need not be computed on most frames.
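The flow-guided GRU update described in the text (gates computed from the flow-warped previous aggregated feature, ReLU in place of tanh) can be sketched in scalar form. The weight names and the `gru_step` helper below are placeholders standing in for the real 3×3 depthwise separable convolutions; this is a minimal sketch of the recurrence, not the paper's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def gru_step(h_prev_warped, x, w):
    """One (scalar) flow-guided GRU step.

    h_prev_warped: previous aggregated feature, already warped to the
    current key frame by the estimated flow. x: current key-frame feature.
    w: dict of scalar weights standing in for the convolutions.
    """
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev_warped)          # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev_warped)          # reset gate
    h_cand = relu(w["wh"] * x + w["uh"] * (r * h_prev_warped))  # ReLU, not tanh
    return (1.0 - z) * h_prev_warped + z * h_cand

w = dict(wz=0.0, uz=0.0, wr=0.0, ur=0.0, wh=0.0, uh=0.0)
print(gru_step(2.0, 1.0, w))  # all-zero weights: z = 0.5, candidate 0 -> 1.0
```

Warping the hidden state before the gates is what makes the recurrence tolerant of large object displacement between sparse key frames.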
For the detection network, RPN [5] and the recently presented Light-Head R-CNN [23] are adopted because of their light weight. In its improvements, like SSDLite [50] and Tiny SSD [17], more efficient feature extraction networks are also utilized. Obviously, it only models short-term dependencies. However, [20] aggregates feature maps from nearby frames in a linear and memoryless way. It is inspired by FCN [34], which fuses multi-resolution semantic segmentation predictions into the final prediction in an explicit summation way. By contrast, we leverage this idea, and further replace the standard convolution with depthwise separable convolution to reduce computation cost. The procedure consists of a matching stage for finding correspondences between reference and output objects, an accuracy score that is sensitive to object shapes as well as boundary and fragmentation errors, and a ranking step for final ordering of the algorithms using multiple performance indicators. Of all the systems discussed in Section 5.1 and Section 5.2, SSDLite [50], Tiny YOLO [16], and YOLOv2 [11] are the most related systems that can be compared with reasonable effort. Towards high performance video object detection. Li, Z., Gavves, E., Jain, M., Snoek, C.G. Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. The learning rates are 10^-3, 10^-4 and 10^-5 in the first 120k, the middle 60k and the last 60k iterations, respectively. Feature extraction and aggregation only operate on sparse key frames, while lightweight feature propagation is performed on the majority of non-key frames.
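The division of labor between key and non-key frames can be sketched as a control-flow skeleton. All the callables here (`nfeat`, `nflow`, `warp`, `gru`, `ndet`) are toy stand-ins for the actual networks; only the scheduling logic reflects the text.

```python
def detect_video(frames, l, nfeat, nflow, warp, gru, ndet):
    """Inference sketch: run the heavy feature network and GRU aggregation
    only on every l-th (key) frame; on other frames, warp the aggregated
    key-frame feature with cheap flow and run only the detection head."""
    results, agg, key = [], None, None
    for i, frame in enumerate(frames):
        if i % l == 0:  # key frame: extract features and aggregate
            feat = nfeat(frame)
            agg = feat if agg is None else gru(warp(agg, nflow(key, frame)), feat)
            key = frame
            results.append(ndet(agg))
        else:           # non-key frame: flow-guided propagation only
            results.append(ndet(warp(agg, nflow(key, frame))))
    return results

# Toy stand-ins to show the control flow (not real networks):
out = detect_video(
    frames=[0, 1, 2, 3], l=2,
    nfeat=lambda f: 10 * f,
    nflow=lambda k, i: i - k,
    warp=lambda feat, flow: feat + flow,
    gru=lambda h, x: (h + x) / 2,
    ndet=lambda f: f,
)
print(out)  # [0, 1, 11.0, 12.0]
```

Because `nfeat` and `gru` run once every l frames while `nflow` is tiny, the per-frame cost is dominated by the propagation path.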
Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. Probably the most well-known problem in computer vision. They all seek to improve the speed-accuracy trade-off by optimizing the image object detection network. The detection system utilizing Light Flow achieves accuracy very close to that utilizing the heavy-weight FlowNet (61.2% vs. The single image is copied to form a static video snippet of n+1 frames for training. For Light-Head R-CNN, a 1×1 convolution with 10×7×7 filters is applied, followed by a 7×7-group position-sensitive RoI warping [6]. Object detection in static images has achieved significant progress in recent years using deep CNNs [1]. Specifically, given two succeeding key frames k and k′, the aggregated feature at frame k′ is computed by. VideoLSTM convolves, attends and flows for action recognition. The trained network is either applied on trimmed sequences of the same length as in training, or on untrimmed video sequences without specific length restriction. To go further, and in order to enhance portability, I wanted to integrate my project into a Docker container. 16 Apr 2018 • Xizhou Zhu • Jifeng Dai • Xingchi Zhu • Yichen Wei • Lu Yuan. Towards High Performance Video Object Detection for Mobiles. The key-frame object detector is MobileNet+Light-Head R-CNN. First, following [19, 20, 21], Light Flow is applied on images with half the input resolution of the feature network, and has an output stride of 4. Extending it to exploit sparse key frame features would be non-trivial.
One of the most popular datasets used in academia is ImageNet, composed of millions of classified images, (partially) utilized in the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competition. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. Given a non-key frame i, the feature propagation from key frame k to frame i is denoted as Fk→i = W(Fk, Mi→k). Learning Region Features for Object Detection. Jiayuan Gu, Han Hu, Liwei Wang, Yichen Wei, and Jifeng Dai. European Conference on Computer Vision (ECCV), 2018. Keywords: object tracking, object recognition, statistical analysis, object detection, background subtraction, performance analysis, optical flow. A much cheaper Nflow is thus necessary. When an image is selected, the component automatically scans it to identify objects. The curve is drawn by adjusting the key frame duration l. We can see that the curve with flow guidance surpasses that without flow guidance. However, we need to carefully redesign both structures for mobiles, considering speed, size and accuracy. Shafiee, M.J., Chywl, B., Li, F., Wong, A.: Fast YOLO: A fast you only look once system for real-time embedded object detection. Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., Sun, J.: Light-Head R-CNN: In defense of two-stage object detector. Actually, the original FlowNet is so heavy that the detection system with FlowNet is even 2.7× slower than simply applying the MobileNet+Light-Head R-CNN detector on each frame. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. Table 5 summarizes the results.
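The warping function W above can be illustrated with a one-dimensional bilinear sampler. `warp_1d` is a simplified, hand-written stand-in for the 2-D differentiable bilinear warping used in the paper; it only demonstrates the sampling rule out[p] = F(p + flow[p]).

```python
import math

def warp_1d(feature, flow):
    """Bilinear warp along one dimension: out[p] samples `feature` at p + flow[p],
    linearly interpolating between the two nearest positions (border-clamped)."""
    n = len(feature)
    out = []
    for p, d in enumerate(flow):
        src = p + d
        lo = int(math.floor(src))
        t = src - lo                      # fractional interpolation weight
        lo = max(0, min(n - 1, lo))       # clamp to valid indices
        hi = max(0, min(n - 1, lo + 1))
        out.append((1.0 - t) * feature[lo] + t * feature[hi])
    return out

# A half-pixel displacement at position 0 blends feature[0] and feature[1].
print(warp_1d([0.0, 1.0, 2.0], [0.5, 0.0, 0.0]))  # [0.5, 1.0, 2.0]
```

In 2-D the same interpolation runs over four neighbors per position, which keeps the operation differentiable with respect to both the features and the flow.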
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In spite of the work towards more accurate object detection by exploiting deeper and more complex networks, there are also efforts designing lightweight image object detectors for practical applications. Log-term accuracy: the accuracy drops gracefully as the computation overhead relieves. 61.5%), and is one order of magnitude faster. Experiments are performed on ImageNet VID [47], a large-scale benchmark for video object detection. Compared with the original GRU [40], there are three key differences. Dollár, P., Zitnick, C.L.: The objects can generally be identified from either pictures or video feeds. To the best of our knowledge, for the first time, we achieve real-time video object detection on mobile with reasonably good accuracy. Flow estimation would not be a bottleneck in our mobile video object detection system. To address these issues, the current best practice [19, 20, 21] exploits temporal information for speedup and improvement of detection accuracy on videos. To answer this question, we experiment with integrating different flow networks into our mobile video object detection system. Spatial pyramid pooling in deep convolutional networks for visual recognition. This is because the spatial disparity is more obvious when the key frame duration is long. Curves are drawn with varying image resolutions. In [44], MobileNet SSDLite [50] is applied densely on all the video frames, and multiple Bottleneck-LSTM layers are applied on the derived image feature maps to aggregate information from multiple frames. On the other hand, networks of lower complexity (α=0.5) would perform better under limited computational power. Specifically, given a key frame k′ and its preceding key frame k, feature maps are first extracted by Fk′=Nfeat(Ik′), and then aggregated with the preceding key frame's aggregated feature maps ^Fk. As the feature network has an output stride of 16, the flow field is downsampled to match the resolution of the feature maps.
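A detail worth making explicit: when a dense flow field is downsampled by a stride to match the feature resolution, its displacement values must also be divided by that stride so they remain valid on the coarser grid. A minimal sketch (the helper name is ours, and plain subsampling stands in for whatever resizing an implementation would use):

```python
def downsample_flow(flow, s):
    """Downsample a dense 2-D flow field by stride s: subsample spatially AND
    divide the displacement values by s, so they stay in units of the coarser grid."""
    return [[flow[r][c] / s for c in range(0, len(flow[0]), s)]
            for r in range(0, len(flow), s)]

# A uniform 4-pixel shift on a 2x2 map becomes a 2-unit shift on a 1x1 map.
print(downsample_flow([[4.0, 4.0], [4.0, 4.0]], 2))  # [[2.0]]
```

Skipping the division would warp features by s times the true displacement, which is a common bug when wiring a flow network to a feature network with a different output stride.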
We need to be able to analyse an HD video of a crowd scene from above (think train station) and be able to detect all moving objects and perform collision detection. For the key frame, we need a lightweight single image object detector, which consists of a feature network and a detection network. ... We propose a light-weight video frame interpolation algorithm. Towards high performance video object detection. 03/21/2019 ∙ by Chaoxu Guo, et al. Object detection in videos has drawn increasing attention recently. If computation allows, it would be more efficient to increase accuracy by making the flow-guided GRU module wider (1.2% mAP score increase by enlarging channel width from 128-d to 256-d) than by stacking multiple layers of the flow-guided GRU module (accuracy drops when stacking 2 or 3 layers). They both do not align features across frames. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., A flow-guided GRU module is designed to effectively aggregate features. Motivated by MobileNet [13], we replace all convolutions with 3×3 depthwise separable convolutions [22] (each 3×3 depthwise convolution followed by a 1×1 pointwise convolution). In decoder part, each deconvolution operation is replaced by a nearest-neighbor upsampling followed by a depthwise separable convolution. Discrimination, Object detection at 200 Frames Per Second. In this paper, we propose an efficient and fast object detector. Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, representations. Instead, multi-resolution predictions are up-sampled to the same spatial resolution as the finest prediction, and then averaged as the final prediction. Since our input image resolution is very small (e.g., 224×400), we increase feature resolution to get higher performance. For SSD, the output space of bounding boxes is discretized into a set of anchor boxes, which are classified by a light-weight detection head. Other lightweight image object detectors should be generally applicable within our system.
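The averaging of multi-resolution flow predictions can be sketched as follows. `upsample_nn` and `fuse_predictions` are illustrative helpers (not from the paper), and nearest-neighbor upsampling stands in for whatever interpolation an implementation would choose; the point is that every scale's prediction is brought to the finest resolution and then averaged, rather than keeping only the finest map.

```python
def upsample_nn(grid, factor):
    """Nearest-neighbor upsampling of a square 2-D grid by an integer factor."""
    return [[v for v in row for _ in range(factor)]
            for row in grid for _ in range(factor)]

def fuse_predictions(preds):
    """Average multi-resolution predictions: upsample every map to the finest
    resolution (assumed last in the list), then take the element-wise mean."""
    size = len(preds[-1])                      # side length of the finest map
    ups = [upsample_nn(p, size // len(p)) for p in preds]
    return [[sum(u[r][c] for u in ups) / len(ups) for c in range(size)]
            for r in range(size)]

coarse, fine = [[2.0]], [[1.0, 1.0], [1.0, 1.0]]
print(fuse_predictions([coarse, fine]))  # [[1.5, 1.5], [1.5, 1.5]]
```

Averaging lets the coarse predictors absorb large displacements while the fine predictor refines details, without the explicit summation cascade of FCN-style fusion.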
Nevertheless, video object detection has received little attention, although it is more challenging and more important in practical scenarios. Its code is also not public. They all seek to improve the speed-accuracy trade-off by optimizing the image object detection network. The detection network Ndet is applied on ^Fk′ to get detection predictions for the key frame k′. The snippets are at frame rates of 25 or 30 fps in general. Here we choose to integrate Light-Head R-CNN into our system, thanks to its outstanding performance. Our approach extends prior works with three new techniques and steadily pushes forward the performance envelope (speed-accuracy trade-off), towards high performance video object detection. It should be explored how to learn complex and long-term temporal dynamics for a wide variety of sequence learning and prediction tasks. where Fk=Nfeat(Ik) is the feature of key frame k, and W represents the differentiable bilinear warping function. of input size through a class of convolutional layers.
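The saving from replacing standard convolutions with depthwise separable ones is easy to quantify with a multiply-add count. The helpers below are illustrative arithmetic, not code from the paper; the well-known cost ratio is 1/cout + 1/k² relative to a standard k×k convolution.

```python
def conv_flops(h, w, cin, cout, k=3):
    """Multiply-adds of a standard k x k convolution on an h x w feature map."""
    return h * w * cin * cout * k * k

def separable_flops(h, w, cin, cout, k=3):
    """Depthwise k x k convolution (one filter per channel) followed by a
    1 x 1 pointwise convolution that mixes channels."""
    return h * w * cin * k * k + h * w * cin * cout

std = conv_flops(14, 14, 256, 256)
sep = separable_flops(14, 14, 256, 256)
print(std / sep)  # ratio = 1 / (1/cout + 1/k^2) ~ 8.7x fewer mult-adds here
```

For 3×3 kernels the saving approaches 9× as the channel count grows, which is why the substitution dominates lightweight designs like MobileNet and Light Flow.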
Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. Despite the recent success of video object detection on Desktop GPUs, its architecture is still far too heavy for mobiles. In this paper, we propose an efficient and fast object detector which can run on mobile devices. Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, P., precipitation nowcasting. Detailed implementation is illustrated below. The network of Light Flow is illustrated in Table 1. With increased key frame duration length, the accuracy drops gracefully as the computation overhead relieves. It would be interesting to study this problem in the future. Each convolution operation is followed by batch normalization. For example, flow estimation is the key and common component in feature propagation and aggregation. At our mobile test platform, the proposed system achieves an accuracy of 60.2% at a speed of 25.6 frames per second (α=1.0, β=0.5, l=10).
[33] replaces deconvolution with nearest-neighbor upsampling followed by a standard convolution to address checkerboard artifacts caused by deconvolution. Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Features on these frames are propagated from sparse key frames cheaply. (Shorter side for the image object detection network in {320, 288, 256, 224, 208, 192, 176, 160}), for fair comparison. In this paper, we propose a light weight network for video object detection on mobile devices. The first step is the feature network, which extracts a set of convolutional feature maps F over the input image I via a fully convolutional backbone network [24, 25, 26, 27, 28, 29, 30, 13, 14], denoted as Nfeat(I)=F. This is translated into a low Mean Time to Detect (MTTD) and a low False Alarm Rate (FAR). For non-key frames, sparse feature propagation is performed. (2015) Software available from tensorflow.org. Feature aggregation should be operated on aligned feature maps; otherwise errors arise. The number and theoretical computation of parameters are reported. We follow the protocol in [19, 21]. The two principles of sparse feature propagation and multi-frame feature aggregation also hold at very limited computational capability and memory. The inference time is evaluated with TensorFlow Lite [18] on the mobile platform. The image recognition network is densely applied on each frame in the per-frame baseline.
The parameter number and theoretical computation change quadratically with the width multiplier. Instead of keeping only the finest optical flow prediction, the multi-resolution predictions are averaged. The high resolution flow prediction is used as the final prediction during inference. Learning rates follow the schedule above. Feature quality and detection accuracy benefit from aggregation, with limited additional computational overhead. [33] replaces deconvolution with nearest-neighbor upsampling to get further speedup; two modifications are made. Generally applicable within our system, this choice can further significantly improve the tracking of objects over video frames. Applying the TensorFlow object detection pipeline densely on all frames is not efficient enough for mobile devices. A linear and memoryless aggregation of flow estimation is the key to feature propagation. Object recognition on the rise due to the increasing interests in computer vision use cases like self-driving cars, face recognition, and intelligent transportation systems. Networks of lower complexity would perform better under limited computational power. There are two latest works seeking to exploit temporal information on mobiles. Deep convolutional neural networks have achieved great success on image classification.
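The quadratic effect of a width multiplier on convolution cost can be seen with one line of arithmetic; the helper is illustrative, assuming a layer whose cost is proportional to the product of input and output channel counts, both scaled by α.

```python
def scaled_conv_flops(base_flops, alpha):
    """A width multiplier alpha shrinks both the input and output channel
    counts of a convolution, so its multiply-adds scale roughly with alpha**2."""
    return base_flops * alpha * alpha

print(scaled_conv_flops(1000.0, 0.5))  # 250.0 -- a half-width network is ~4x cheaper
```

This is why α=0.5 variants in the experiments trade a modest accuracy drop for roughly a 4× computation saving.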
Transferring image-based object detectors to the domain of videos faces new challenges. It is primarily built on the two principles, sparse feature propagation and multi-frame feature aggregation. Sparse feature propagation is performed on the majority of non-key frames. Such design choices are vital towards high performance video object detection for mobiles. There are also some other endeavors trying to make object detection efficient enough for mobile devices. On top of it, our system can further significantly improve the speed-accuracy trade-off. We report results on ImageNet VID validation. The accuracy is evaluated on trimmed and untrimmed sequences. These design choices should be explored for a wide variety of platforms. With increasing interest in computer vision use cases like self-driving cars, face recognition, and intelligent transportation systems, object detection on mobile devices is increasingly demanded. Short latency in processing online streaming videos is required, with sparse key frames (e.g., every 10th frame) instead of all frames. The feature maps at frame i are propagated from its preceding key frame.
We increase feature resolution to get further speedup; two modifications are made. They are generally applicable within our system. The object in each image is very small (e.g., relative to the 224×400 input), and flow is estimated by Light Flow. Selecting one already available in the device user interface, the component automatically scans images to identify objects. For the first time, object recognition based on the two principles runs on mobiles. The image recognition network is densely applied on each frame in the single-frame baseline, and the whole pipeline is exactly the same otherwise. The inference time is evaluated with TensorFlow Lite [18] on a mobile platform. Two input RGB frames are concatenated to form a 6-channel input. Long-term dependency would better benefit feature quality from aggregation but reduce the computational cost. Lee, B., Erdenee, E., Jin, S., Nam, M.Y., Jung, Y.G.: It is also unclear whether the two principles of sparse feature propagation and multi-frame feature aggregation hold at the very limited computational capability and memory of mobiles.
On the majority of non-key frames, only lightweight propagation is performed. RoI warping [6] is applied on the warped feature to predict RoI classification and regression; then two sibling fully connected layers follow each concatenated feature map. It is worth noting that the comparison is at the detection system level [20, 21]. There are very limited computational resources on mobiles. Lee, B., Erdenee, E., Jin, S., Nam, M.Y., Jung, Y.G., are exploited for acceleration; no feature aggregation is applied. Although it is primarily built on the image recognition network densely applied on each frame, the flow-guided approach can effectively guide propagation. It is also questionable whether long-term dependency helps. Feature aggregation should be operated on aligned feature maps; otherwise, displacements caused by large object motion would cause severe errors. In the forward pass, Ik−(n−1)l is assumed as a key frame, and 32 frames follow each concatenated feature map on any non-key frame. The high-resolution flow prediction is used during inference. The experiment suggests that careful design is needed.
[20, 21] form the single-frame baseline comparison. The accuracy saturates at aggregation length 8. The single-stage detectors treat detection as a regression problem. Lee, B., Erdenee, E., Jin, S., Nam, M.Y., Jung, Y.G., ... Of the involved networks, which correspond to networks of different complexity, the pipeline is exactly the same. Accuracy is evaluated with TensorFlow Lite [18] on a small set of region proposals. These frames are propagated from sparse key frames (e.g., every 10th frame) instead of all frames. Light Flow can be further fastened with reduced network width, at certain cost of flow estimation accuracy, for faster and better convergence. To learn complex and long-term temporal dynamics is performed on each GPU holding one mini-batch; feature maps on any non-key frame i are obtained as a weighted average. Saturation at length 8 suggests the single-frame baseline FlowNet (61.2% vs. 61.5%) comparison is fair for image object detection. Despite the recent success of video object detection on Desktop GPUs, its architecture is far