We introduce a learning-based method for extracting distinctive features of video objects. From the extracted features, we derive dense correspondences between the object in the current video frame and a reference template, and use these correspondences to identify grasping points on the object. We train a deep-learning model to predict dense feature maps, using training data collected by solving simultaneous localization and mapping (SLAM). Furthermore, a new feature-aggregation technique based on the optical flow between consecutive frames integrates multiple feature maps to alleviate uncertainty. We also use the optical-flow information to assess the reliability of feature matching. Experimental results show that our approach effectively rejects unreliable correspondences and thus improves matching accuracy.
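The flow-based feature aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a dense flow field is already available, warps the previous frame's feature map into the current frame with nearest-neighbor sampling, and blends the two maps with an exponential moving average (the function names, the `alpha` weight, and the sampling scheme are all illustrative choices).

```python
import numpy as np

def warp_features(prev_feat, flow):
    """Warp an (H, W, C) feature map from the previous frame into the
    current frame using a dense optical-flow field of shape (H, W, 2).
    Nearest-neighbor sampling keeps this sketch dependency-free."""
    h, w = prev_feat.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # flow[..., 0] is the x displacement, flow[..., 1] the y displacement;
    # a pixel at (x, y) in the current frame came from (x - u, y - v).
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, h - 1)
    return prev_feat[src_y, src_x]

def aggregate(curr_feat, prev_feat, flow, alpha=0.5):
    """Blend the flow-warped previous feature map with the current one,
    smoothing per-frame uncertainty in the predicted features."""
    return alpha * curr_feat + (1.0 - alpha) * warp_features(prev_feat, flow)
```

With zero flow the aggregation reduces to a plain average of the two maps; with nonzero flow, features are first re-aligned so that the average is taken over the same object point across frames.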