FPGA-based Acceleration System for Visual Tracking
Ke Song1,2, Chun Yuan2*, Peng Gao1, Yunxu Sun1
1 Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen 518055, China
2 Graduate School at Shenzhen, Tsinghua University, Shenzhen 518055, China
* Email: [email protected]
arXiv:1810.05367v2 [cs.CV] 16 Oct 2018
Abstract

Visual tracking is one of the most important application areas of computer vision. At present, most algorithms are implemented on PCs, and it is difficult to ensure real-time performance when they are applied in real scenarios. To improve the tracking speed and reduce the overall power consumption of visual tracking, this paper proposes a real-time visual tracking algorithm based on the DSST (Discriminative Scale Space Tracking) approach. We implement a hardware system based on the proposed algorithm on a Xilinx XC7K325T FPGA platform. The hardware system runs at more than 153 frames per second. To reduce resource occupation, the system adopts batch processing in the feature extraction module, and the FFT IP core is time-division multiplexed in the filter processing module. As a result, the hardware system uses 33% of the LUTs and 40% of the storage blocks. Test results show that the proposed visual tracking hardware system has excellent performance.

1. Introduction

Visual tracking is an important branch of computer vision. Tracking on moving platforms uses embedded devices as the processor. Their computing and storage resources are limited, and some applications also impose constraints on power consumption, volume, weight, and so on. In addition, high-resolution cameras increase the amount of computation required for visual tracking and make high frame rates difficult to handle. A Field Programmable Gate Array (FPGA) has a parallel processing structure and superior flexibility, which makes it suitable for implementing high-throughput data interfaces and visual algorithms. At the same time, an FPGA has low power consumption and high integration capability, which can reduce the overall power consumption and volume of the embedded device. Therefore, using an FPGA as the hardware development platform has both theoretical significance and application value.
Many researchers have studied the design of visual tracking hardware systems. Miao et al. [1] completed a dedicated processing chip for high-speed visual tracking that can track in real time against a simple background. Jung [2] designed a real-time tracking system based on an adaptive color histogram on an FPGA; its processing rate reaches 81 FPS for 15x15 pixels. Elkhatib [3] proposed a hardware system on the Altera Cyclone II platform that processes 640x480 images with a 20 MHz clock at a frame rate of 21.7 FPS. Wang [4] proposed a SIFT+BRIEF architecture that processes 720p images at 60 FPS on an FPGA. Tahara [5] proposed a particle-filter-based target tracking system on a Xilinx Kintex-7 FPGA and achieved good results. These systems achieve excellent results, but all of them do so by sacrificing real-time capability. In addition, some of the FPGAs used are relatively expensive and are not suitable for general application scenarios.

In this paper, we propose a real-time visual tracking hardware system based on the Xilinx XC7K325T FPGA platform. We use a lower-cost FPGA and optimize the algorithm to reduce the amount of computation. In addition, we apply resource and structure optimizations to the hardware implementation to reduce resource consumption, so that speed is greatly improved while accuracy is preserved.

2. Algorithm Theory and Hardware Implementation

DSST [6] is a tracking algorithm based on correlation filters. It is robust to motion blur and illumination changes and can estimate the target scale. In this paper, a 1-D gray feature and a 32-dimensional HOG (Histogram of Oriented Gradients) feature are used to estimate the target position, and the 32-dimensional HOG feature is used to estimate the target scale over 7 scales. The two processes are independent, which makes the algorithm easier to parallelize. The hardware design is divided into seven modules: 1) image extraction module; 2) scale calculation module; 3) interpolation module; 4) HOG feature extraction module; 5) correlation filter calculation and update module; 6) target position and scale calculation module; 7) target information update module.
2.1 Algorithm Theory

During sample extraction, DSST uses multidimensional features indexed by the feature channel $l \in \{1, 2, \ldots, d\}$. The multidimensional features of the input sample $f$ consist of grayscale and HOG channels. Compared with MOSSE (Minimum Output Sum of Squared Error filter), the added HOG feature lets the algorithm adapt better to textured scenes. The sum of squared errors to be minimized can be expressed as

$$\varepsilon = \left\| \sum_{l=1}^{d} h^{l} \star f^{l} - g \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2} \tag{1}$$

where $g$ is the desired output, which is assumed to follow a Gaussian distribution, and $j$ denotes the index of the frame currently being processed (Eq. (1) is written for a single frame, so the index is omitted there). The arrays $h$, $g$, and $f$ have size $M \times N$. The parameter $\lambda$ is a regularization term that suppresses the influence of zero-frequency components in the spectrum of the sample $f$ and keeps the denominator of the solution in Eq. (2) from being zero. Based on the circulant property of correlation filters, the time-domain solution can be converted to the frequency domain:

$$H^{l} = \frac{\overline{G}\, F^{l}}{\sum_{k=1}^{d} \overline{F^{k}}\, F^{k} + \lambda} = \frac{A_{t}^{l}}{B_{t} + \lambda} \tag{2}$$

where capital letters denote discrete Fourier transforms, the bar denotes complex conjugation, $l$ is a feature dimension, and $t$ is the current frame. However, solving the $d \times d$ linear systems is time-consuming. To obtain a robust approximation, the numerator $A_{t}^{l}$ and the denominator $B_{t}$ in Eq. (2) are updated separately:

$$A_{t}^{l} = (1-\eta)\, A_{t-1}^{l} + \eta\, \overline{G_{t}}\, F_{t}^{l} \tag{3}$$

$$B_{t} = (1-\eta)\, B_{t-1} + \eta \sum_{k=1}^{d} \overline{F_{t}^{k}}\, F_{t}^{k} \tag{4}$$
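The separate running update of the numerator A and the denominator B described above can be sketched in NumPy as follows. This is only an illustrative software model, not the hardware implementation: the function name `update_filter`, the `(d, M, N)` array layout, and the default learning rate of 0.025 are assumptions introduced here.

```python
import numpy as np

def update_filter(A_prev, B_prev, g, f, eta=0.025):
    """Running update of the filter numerator A and denominator B.

    g      : (M, N) desired Gaussian-shaped output for the current frame
    f      : (d, M, N) feature channels of the current training sample
    A_prev : (d, M, N) complex numerator from the previous frame, or None
    B_prev : (M, N) real denominator from the previous frame, or None
    eta    : learning rate (0.025 is an assumed value, not from the text)
    """
    G = np.fft.fft2(g)
    F = np.fft.fft2(f, axes=(-2, -1))             # per-channel 2-D FFT
    A_new = np.conj(G)[None, :, :] * F            # conj(G) * F^l for each l
    B_new = np.sum(np.conj(F) * F, axis=0).real   # sum_k |F^k|^2
    if A_prev is None:                            # first frame: no history
        return A_new, B_new
    A = (1.0 - eta) * A_prev + eta * A_new
    B = (1.0 - eta) * B_prev + eta * B_new
    return A, B
```

Keeping A and B as two separate accumulators means each frame costs only element-wise multiply-adds in the frequency domain, which is what makes the update attractive for a pipelined hardware design.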
Here $\eta$ is the learning rate. For a new sample $Z$ of size $M \times N$, the response $y$, whose maximum gives the new target position, is

$$y = \mathcal{F}^{-1}\left\{ \frac{\sum_{l=1}^{d} \overline{A^{l}}\, Z^{l}}{B + \lambda} \right\} \tag{5}$$

Furthermore, DSST proposes a fast scale estimation method to deal with target scale variations. In each frame, the optimal target scale is found while the target position is estimated. The algorithm first uses a position filter to determine the new target position. Then, centered on the estimated target position, 33 candidate blocks of different scales are selected, and the best-matching scale is found with a scale filter. The image blocks are selected according to

$$a^{n} P \times a^{n} R, \quad n \in \left[ -\frac{S-1}{2}, \ldots, \frac{S-1}{2} \right]$$

where $P$ and $R$ denote the width and height of the target, $S$ is the number of scale layers, which we set to 33, and $a$ is the scale factor used to obtain blocks of different scales. The trained filters are then used to estimate the scale. Finally, the target state is obtained by combining the results of the position filter and the scale filter.

2.2 Algorithm Optimization

There are two main methods of scale estimation: the method proposed in DSST and the method proposed in SAMF [7]. DSST uses 33 scale spaces. After the HOG feature is extracted for each scale, it is stretched into a one-dimensional vector, and a one-dimensional correlation filter is trained online to calculate the response for each scale space. SAMF generates 7 scale spaces from 7 scale factors. Each scale space is then interpolated to a fixed size, and the position and scale of the target are calculated. The scale and position of the new target are determined by searching for the peak value of the responses. However, the scale estimation method proposed in DSST consumes a large amount of storage and computing resources, so we adopt the scale estimation method proposed in SAMF. In the scale estimation, the target response is computed from the same samples as in the position estimation. First, the gray-level feature and the HOG feature are used to estimate the target position, and then the HOG feature is used to estimate the scale. These two processes are independent of each other, so they can be parallelized, which reduces resource consumption and processing time significantly.

2.3 Hardware Architecture

In this paper, we implement the hardware system on the Xilinx XC7K325T development platform. Figure 1 shows the overall architecture of the system. As shown in Figure 1, the input data must first be preprocessed. The image block extraction module extracts the largest-scale image block, and the image blocks of the other scales are read from this largest-scale block. The interpolation module interpolates the extracted candidate blocks of different scales to a fixed size of 128x128. The feature extraction module performs feature extraction on the data in the candidate block, and the extracted block size is 32x32.
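As a small illustration of the scale sampling rule from Section 2.1, in which candidate blocks have size a^n P by a^n R, the following Python sketch lists the candidate sizes for a given number of scale layers. The function name `candidate_sizes` and the scale factor 1.02 are assumptions made for this example; the text does not state the value of a.

```python
def candidate_sizes(P, R, S=7, a=1.02):
    """Return the S candidate block sizes (width, height) = (a^n P, a^n R).

    P, R : width and height of the current target
    S    : number of scale layers (33 in DSST, 7 in the SAMF-style variant)
    a    : scale factor between layers (1.02 is an assumed value)
    """
    if S % 2 == 0:
        raise ValueError("S is assumed to be odd so that n = 0 is included")
    half = (S - 1) // 2
    # n runs over -(S-1)/2, ..., 0, ..., (S-1)/2
    return [(a ** n * P, a ** n * R) for n in range(-half, half + 1)]
```

For example, `candidate_sizes(64, 48)` returns seven sizes centered on the unscaled target `(64, 48)`. The last entry is the largest block, which is the one the image block extraction module would fetch, with the smaller blocks read from inside it.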
Figure 1. Overall architecture of the hardware system

2.4 Implementation Detail

The system uses 33-dimensional features (a 1-D grayscale feature and 32-dimensional HOG features). Computing the filter responses of all 33 feature channels in parallel would consume too many resources. Therefore, the 33-dimensional features are divided into eight batches for processing. In this way, only five layers of correlation filtering structures need to be designed, which saves a large share of the storage and computing resources. The batch processing structure is shown in Fig. 2.
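The batching idea can be modeled in software as accumulating the per-channel products batch by batch instead of all at once. The sketch below is a hypothetical NumPy model: the text does not say how the 33 channels are assigned to the eight batches, so an even split (one batch of five channels and seven batches of four) is assumed here, and the name `batched_numerator` is introduced only for illustration.

```python
import numpy as np

def batched_numerator(A, Z, n_batches=8):
    """Accumulate sum_l conj(A^l) * Z^l one batch of channels at a time.

    A, Z      : (d, M, N) complex arrays (filter numerator, sample spectrum)
    n_batches : number of sequential batches (eight in the text)
    Returns the accumulated (M, N) sum and the list of batch sizes.
    """
    d = A.shape[0]
    acc = np.zeros(A.shape[1:], dtype=complex)
    sizes = []
    # Assumed assignment: split the channel indices as evenly as possible.
    for idx in np.array_split(np.arange(d), n_batches):
        # Only len(idx) channel pipelines are active during this batch.
        acc += np.sum(np.conj(A[idx]) * Z[idx], axis=0)
        sizes.append(len(idx))
    return acc, sizes
```

With d = 33 and eight batches, no batch holds more than five channels, so five parallel filtering structures suffice, and the accumulated result equals the sum computed over all 33 channels at once.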
Figure 2. Schematic diagram of batch processing

For the FFT and IFFT, we perform a 2-D Fourier transform on the input feature map of size 32x32 and a 2-D inverse Fourier transform on the final result to obtain the response map. The 2-D Fourier transform first applies a 1-D Fourier transform to the row data; the intermediate data is stored and then read out by column for the 1-D Fourier transform of the column data. Because the Fourier and inverse Fourier transforms run one after another rather than at the same time, they can share hardware: a single IP core is time-multiplexed to reduce resource consumption. Figure 3 shows the structure before and after optimization.
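The row-then-column evaluation and the reuse of a single 1-D transform can be sketched in NumPy as follows. This is a software analogy under stated assumptions: `fft1d_core` stands in for the shared FFT IP core, and obtaining the inverse transform through conjugation is one well-known way to reuse a forward core; the text does not say how the actual IP core switches between forward and inverse operation.

```python
import numpy as np

def fft1d_core(x, inverse=False):
    """Stand-in for the single 1-D FFT core shared by all passes.

    The inverse is obtained from the forward transform by conjugation,
    ifft(x) = conj(fft(conj(x))) / N, so only one forward core is needed.
    The transform is applied along the last axis (one row at a time).
    """
    if inverse:
        return np.conj(np.fft.fft(np.conj(x))) / x.shape[-1]
    return np.fft.fft(x)

def fft2_row_column(x, inverse=False):
    """2-D (inverse) FFT as a row pass followed by a column pass."""
    x = np.asarray(x, dtype=complex)
    rows = fft1d_core(x, inverse)       # 1-D transform of every row
    # 'rows' plays the role of the intermediate buffer; reading it out
    # column-wise corresponds to transposing before the second pass.
    cols = fft1d_core(rows.T, inverse)  # 1-D transform of every column
    return cols.T
```

For a 32x32 feature map, `fft2_row_column(x)` agrees with `np.fft.fft2(x)`, and calling it again with `inverse=True` recovers the input, which shows that one 1-D routine is enough for all four passes.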
Figure 3. FFT2 calculation structure

3. Discussion

We use MATLAB to verify the algorithm. The algorithm runs on a PC with an Intel i7-4790 [email protected] GHz, and the experiment was conducted on the OTB50 subset. The results are as follows. Table 1 compares our algorithm with other trackers.
Table 1. Tracking speed of our tracker and other trackers

Method   OURS   DSST   Struck   TLD    MIL
FPS      153    24     5.97     22.2   22.84

As Table 1 shows, our algorithm greatly improves accuracy and speed compared with traditional algorithms. Table 2 shows the resource usage of the FPGA.

Table 2. FPGA Resource Utilization Summary

Resource          Used     Utilization
Slice Registers   95485    23%
Slice LUTs        68433    33%
Block RAM/FIFO    179      40%
DSP48E1s          143      17%

Table 2 shows that the optimized system structure keeps resource consumption low: no resource type exceeds 40% utilization. With this lower resource consumption, the required functions are still implemented, stability and robustness are maintained, the goal of optimizing the hardware implementation is met, and the performance of the tracker is significantly improved.

4. Summary

In this paper, we improve and optimize the DSST algorithm to meet the requirements of a hardware implementation of visual tracking. We then use the FPGA development platform to implement the visual tracking hardware system and carry out system-level experiments and verification. The performance of the tracker is evaluated through both the software and the hardware system. In application, our algorithm has low cost and low power consumption. It can process HD video in real time with superior robustness and speed, and it is suitable for low-cost hardware implementation. In future work, we will further improve the accuracy of the algorithm so that it can be applied in complex environments.

References

[1] Miao W, Lin Q and Wu N. Japanese Journal of Applied Physics, 46(4B):2220 (2007).
[2] Cho J U, Jin S H, Xuan D P, et al. IEEE International Conference on Robotics and Biomimetics, p. 172-177 (2008).
[3] Lina N. Elkhatib, Fawnizu A. Hussin, et al. IEEE 4th International Conference on Intelligent and Advanced Systems, p. 745-749 (2012).
[4] Wang J, et al. IEEE Transactions on Circuits & Systems for Video Technology, 24(3), p. 525-538 (2014).
[5] Akane Tahara, Yoshiki Hayashida, et al. IEEE 4th International Symposium on Computing and Networking, p. 422-428 (2016).
[6] M. Danelljan, G. Häger, F. S. Khan, et al. British Machine Vision Conference, p. 1-14 (2014).
[7] Li Y and Zhu J. A Scale Adaptive Kernel Correlation Filter Tracker with Feature Integration. p. 254-265 (2014).