The dynamic displacement response of civil structures is an important index for structural condition assessment, both during construction and in service. However, accurately measuring the displacement of large-scale civil structures such as high-rise buildings remains a challenging task. To address this problem, a vision-based system using an industrial digital camera and image processing has been developed for long-distance, remote, real-time monitoring of the dynamic displacement of supertall structures. Rather than acquiring image signals, the proposed system tracks only the coordinates of the target points, thereby enabling real-time monitoring and display of displacement responses at a relatively high sampling rate. This study addresses the in-situ experimental verification of the developed vision-based system on the 600 m high Canton Tower. To facilitate the verification, a GPS system is used to calibrate and verify the structural displacement responses measured by the vision-based system. Meanwhile, an accelerometer deployed in the vicinity of the target point provides frequency-domain information for comparison. Special attention is given to understanding the influence of ambient light on the monitoring results. For this purpose, the experimental tests are conducted in daytime and at nighttime, with the vision-based system placed outside the tower (a bright environment) and inside the tower (a dark environment), respectively. The results indicate that the displacement response time histories monitored by the vision-based system not only match well with those acquired by the GPS receiver, but also have higher fidelity and are less corrupted by noise. In addition, the low-order modal frequencies of the building identified from the data obtained by the vision-based system are all in good agreement with those obtained from the accelerometer, the GPS receiver, and an elaborate finite element model.
In particular, the vision-based system placed at the bottom of the enclosed elevator shaft yields better monitoring data than the system placed outside the tower. Using a wavelet filtering technique, the displacement response time histories obtained by the vision-based system are readily decomposed into two parts: a quasi-static component primarily resulting from temperature variation and a dynamic component mainly caused by fluctuating wind load.
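
As an illustration only, the decomposition described above can be sketched with a simple wavelet low-pass filter. The wavelet family and decomposition level used in the study are not stated here, so the Haar wavelet, the `levels` parameter, and the function names `haar_lowpass` and `decompose` are assumptions made for this sketch, and the demonstration signal is synthetic rather than measured data.

```python
import math


def haar_lowpass(signal, levels):
    """Return the level-`levels` Haar approximation of `signal` (same length).

    Assumption for this sketch: Haar wavelet, unnormalised averaging form.
    Zeroing every detail coefficient and inverting the transform is
    equivalent to replacing each block of 2**levels samples by its mean.
    """
    block = 2 ** levels
    if len(signal) % block != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    # Forward transform: repeated pairwise averaging keeps only the
    # approximation branch; the detail branch is discarded.
    approx = list(signal)
    for _ in range(levels):
        approx = [(approx[i] + approx[i + 1]) / 2.0
                  for i in range(0, len(approx), 2)]
    # Inverse transform with zero details: expand each coefficient
    # back into a constant block of the original resolution.
    return [a for a in approx for _ in range(block)]


def decompose(displacement, levels):
    """Split a displacement record into quasi-static and dynamic parts."""
    quasi_static = haar_lowpass(displacement, levels)
    dynamic = [x - q for x, q in zip(displacement, quasi_static)]
    return quasi_static, dynamic


if __name__ == "__main__":
    # Synthetic record (hypothetical): slow linear drift standing in for a
    # temperature effect, plus a fast oscillation standing in for wind.
    record = [0.01 * i + math.sin(2 * math.pi * i / 16) for i in range(64)]
    quasi_static, dynamic = decompose(record, levels=4)
    print(quasi_static[0], quasi_static[-1])
    print(max(abs(d) for d in dynamic))
```

The Haar choice keeps the sketch dependency-free; a smoother wavelet and a level matched to the sampling rate would normally give a cleaner separation of the slow and fast components.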