Vision-based localization is a temporally informative task: by examining consecutive frames, we can recover information about a vehicle's ego-motion from its history. Sufficient temporal information helps to reduce the search space for the next location, which can improve both the efficiency and the accuracy of the localization system. This paper presents a semi-supervised, deep vision-based localization algorithm that uses a novel tubing strategy to find the starting location of a vehicle. We group varying numbers of consecutive frames into sets of tubes based on their temporal correlation, which enables pair searching with variable tube sizes. We also enhance an off-the-shelf network model with a modified training data generation method to improve the discriminative power of the features it produces. Experimental results show that the proposed temporal-correlation-based initialization module can confidently localize the starting location of a vehicle for a given journey and achieves a 40% improvement in precision over conventional CNN approaches.
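
As a rough illustration of grouping consecutive frames into variable-size tubes, the sketch below splits a sequence of per-frame feature vectors wherever adjacent frames stop being similar. It is not the method of this paper: the abstract does not specify how temporal correlation is measured, so the cosine similarity between adjacent frame features, the `threshold` and `max_size` parameters, and the function name `group_into_tubes` are all assumptions made for illustration.

```python
import numpy as np


def group_into_tubes(features, threshold=0.9, max_size=None):
    """Group consecutive frames into tubes (hypothetical helper).

    Assumption: "temporal correlation" is approximated by the cosine
    similarity between the feature vectors of adjacent frames. A new tube
    starts when that similarity drops below `threshold`, or when the
    current tube already holds `max_size` frames.

    Returns a list of half-open (start, end) frame-index ranges.
    """
    features = np.asarray(features, dtype=float)
    if features.ndim != 2 or len(features) == 0:
        return []

    # Normalize each frame's feature vector (guarding against zero norm).
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)

    # sims[i] is the cosine similarity between frame i and frame i + 1.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)

    tubes = []
    start = 0
    for i, sim in enumerate(sims):
        end = i + 1  # index of the frame that would extend the tube
        too_long = max_size is not None and (end - start) >= max_size
        if sim < threshold or too_long:
            tubes.append((start, end))
            start = end
    tubes.append((start, len(features)))
    return tubes


if __name__ == "__main__":
    # Toy 2-D features: two similar frames, then two frames of another scene.
    frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
    print(group_into_tubes(frames))
```

Under these assumptions, tube length adapts to the scene: slowly changing stretches yield long tubes and abrupt changes yield short ones, which is one plausible way to obtain variable tube sizes for pair searching.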