Abstract:
Real-time, high-resolution 3D scene maps are essential for many computer-vision applications such as autonomous vehicle navigation and collision avoidance. However, stereo vision is an ill-posed and computationally demanding problem that requires both efficient regularisation and hardware acceleration to generate accurate depth maps from image pairs with large disparity ranges. Although dozens of stereo algorithms exist in the literature, few are adaptable to modern GPU architectures. Several candidate real-time stereo algorithms were implemented on GPUs and their speed and accuracy measured. Symmetric Dynamic Programming Stereo was found to be the best candidate, providing good accuracy at the target rate of 30 frames per second. Its GPU solution was further optimised for high-definition, low-latency performance and analysed in detail. A comparison with a known FPGA implementation of a similar algorithm highlighted the relative strengths and limitations of the two hardware architectures. This thesis presents the highest-throughput GPU-based stereo system reported in the literature to date: it evaluates more than 15 billion disparities per second on an inexpensive commercial GPU. The system was further extended with a multi-resolution computation framework for faster processing. This approach not only reduces the overall computational demand, and thus increases achievable frame rates, but also incorporates scene constraints and inter-scan-line consistency directly into the disparity calculation to improve matching performance. The thesis also applies the developed stereo system to the challenging problem of real-time object detection and tracking. An observed scene is described with 3D point clouds produced by the system, inevitably trading some accuracy for speed. Noisy depth measurements and artefacts further hinder tracking accuracy. A joint depth-and-colour filter was therefore used to reduce artefacts and to segment the foreground objects to be tracked.
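To make the disparity-evaluation workload concrete, the following is a minimal sketch of winner-takes-all block matching, the unregularised baseline that dynamic-programming methods such as Symmetric Dynamic Programming Stereo improve upon. It is not the thesis's algorithm; the function name, window size, and disparity range are illustrative only.

```python
import numpy as np

def sad_disparity(left, right, max_disp=4, window=3):
    """Winner-takes-all disparity map from sum-of-absolute-differences
    (SAD) block matching, written with naive loops for clarity."""
    h, w = left.shape
    pad = window // 2
    best_cost = np.full((h, w), np.inf)
    disp = np.zeros((h, w), dtype=np.int32)
    for d in range(max_disp):
        # Absolute difference between the left image and the right image
        # shifted by the candidate disparity d; columns < d are invalid.
        diff = np.full((h, w), np.inf)
        diff[:, d:] = np.abs(left[:, d:] - right[:, :w - d])
        for y in range(pad, h - pad):
            for x in range(pad, w - pad):
                win = diff[y - pad:y + pad + 1, x - pad:x + pad + 1]
                if not np.isfinite(win).all():
                    continue  # window overlaps the invalid left border
                cost = win.sum()
                if cost < best_cost[y, x]:
                    best_cost[y, x] = cost
                    disp[y, x] = d
    return disp
```

Each pixel here picks its disparity independently, which is exactly what makes plain block matching noisy on weakly textured surfaces; dynamic-programming stereo instead optimises whole scan-lines jointly, at the cost of the heavier computation the GPU acceleration in this thesis targets.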
A fast clustering procedure based on a set of simple volume rules then identified candidate objects, and an opportunistic tagging system tracked objects through occlusions. Kalman filtering with distance-specific error covariance was implemented to accurately predict object positions in the next frame. Experiments with numerous synthetic and real-world video sequences confirmed accurate tracking of multiple people in a variety of environments.
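The distance-specific error covariance can be sketched as follows: stereo depth uncertainty grows roughly quadratically with distance (sigma_Z ≈ Z² · sigma_d / (f · B) for disparity noise sigma_d, focal length f, and baseline B), so the Kalman measurement-noise matrix R is rescaled per update from the measured distance. This is a minimal constant-velocity illustration, not the thesis's implementation; the class name, camera parameters, and noise values are assumed.

```python
import numpy as np

class DistanceAwareKalman:
    """Constant-velocity Kalman tracker in the ground plane (x, z) whose
    depth-measurement noise grows quadratically with distance z."""

    def __init__(self, dt=1 / 30, sigma_d=0.5, focal=700.0, baseline=0.12):
        self.dt = dt
        self.sigma_d = sigma_d      # disparity noise in pixels (assumed)
        self.focal = focal          # focal length in pixels (assumed)
        self.baseline = baseline    # stereo baseline in metres (assumed)
        # State [x, z, vx, vz] with a constant-velocity transition model.
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.zeros((2, 4))
        self.H[0, 0] = self.H[1, 1] = 1.0   # we observe position only
        self.Q = np.eye(4) * 1e-3           # process noise (assumed)
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def predict(self):
        """Propagate the state to the next frame; returns predicted (x, z)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, meas_xz):
        """Fuse a position measurement, weighting depth by its distance."""
        z_dist = max(meas_xz[1], 1e-3)
        # Stereo depth error model: sigma_Z ~ Z^2 * sigma_d / (f * B).
        sigma_z = (z_dist ** 2) * self.sigma_d / (self.focal * self.baseline)
        R = np.diag([0.01, sigma_z ** 2])   # lateral noise held constant
        y = meas_xz - self.H @ self.x
        S = self.H @ self.P @ self.H.T + R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

Because R inflates with z², distant (hence noisy) depth measurements pull the state estimate less strongly than nearby ones, which is the practical benefit of making the covariance distance-specific.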