In this paper, we propose a novel extension to the Class-specific Hough Forest (CHF) framework for object detection and localization. Our approach exploits depth information during training to build a more discriminative codebook that simultaneously encodes features from the object and its surrounding context. In particular, we augment the CHF with contextual image patches and design a series of depth-aware uncertainty measures for the binary tests used in CHF training. The new splitting criterion integrates the relative physical scales of image patches, the 3D offset uncertainty of votes, and a 3D-distance-modulated voting confidence. We show that the extended CHF learns better context models and builds higher-quality appearance codebooks. Since the model relies on depth information only during training, our system can be applied to object localization in 2D images. We demonstrate the efficacy of our method through experiments on two challenging RGB-D object datasets, and we empirically show that it achieves significant improvement over the state of the art with a more robust Hough voting scheme.