This paper presents a novel frame-pair based method for visual object tracking. Instead of adopting two-stream Convolutional Neural Networks (CNNs) to represent each frame, we stack frame pairs as the input, resulting in a single-stream CNN tracker with far fewer parameters. The proposed tracker can learn generic motion patterns of objects from far fewer annotated videos than previous methods require. Moreover, we find that trackers trained on two successive frames tend to predict the centers of search windows as the locations of tracked targets. To alleviate this problem, we propose a novel sampling strategy for offline training: we construct each pair by sampling two frames with a random temporal offset, where the offset controls the motion smoothness of objects. Experiments on the challenging VOT14 and OTB datasets show that the proposed tracker performs on par with recently developed generic trackers while using much less memory. In addition, our tracker runs at over 100 (30) fps on a GPU (CPU), much faster than most deep neural network based trackers.
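The random-offset pair construction described above could be sketched as follows; the function name, the `max_offset` hyper-parameter, and the uniform sampling of the offset are illustrative assumptions, not details taken from the paper.

```python
import random

def sample_frame_pair(num_frames, max_offset=5):
    """Sample indices of a training frame pair with a random temporal offset.

    Instead of always taking two successive frames (offset 1), the offset is
    drawn at random; larger offsets simulate less smooth object motion, which
    discourages the tracker from trivially predicting the search-window center.
    `max_offset` is a hypothetical hyper-parameter bounding the gap.
    """
    offset = random.randint(1, max_offset)          # random temporal gap >= 1
    start = random.randint(0, num_frames - 1 - offset)  # keep pair in range
    return start, start + offset
```

During offline training, each sampled pair would then be stacked along the channel dimension to form the single-stream network input.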