Omnidirectional cameras are a suitable and cost-effective choice for Visual Place Recognition (VPR), as they capture the full surrounding scene regardless of the robot's orientation. However, vision sensors are vulnerable to changes in environmental appearance (e.g., illumination, season). While multi-modal approaches can overcome these challenges, they add significant cost and system complexity. This paper introduces a novel fusion framework that enhances VPR robustness by integrating visual data with geometric features derived from monocular depth estimation, while retaining a single-camera setup. An ablation study evaluates both early and late fusion strategies for optimally combining appearance-based and depth-derived descriptors. Extensive evaluation on challenging indoor and outdoor datasets demonstrates that the proposed method consistently boosts retrieval performance across multiple state-of-the-art VPR backbones. Furthermore, this improvement is achieved without requiring end-to-end retraining, allowing our method to function as a pluggable module for pre-trained models. Consequently, this work presents a powerful, practical, and low-cost solution for robust VPR, with strong potential to scale as monocular depth estimation and VPR models continue to improve.
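
The abstract does not specify how the two fusion strategies operate; as a minimal illustrative sketch (not the paper's actual implementation), assuming the appearance-based and depth-derived descriptors are fixed-length vectors and using a hypothetical blending weight `w`, early fusion might concatenate normalized descriptors before retrieval, while late fusion might blend per-modality similarity scores:

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each descriptor to unit L2 norm so modalities contribute comparably."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def early_fusion(visual_desc: np.ndarray, depth_desc: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Early fusion (assumed form): weight and concatenate the normalized
    appearance and depth descriptors into one global descriptor before retrieval."""
    v = np.sqrt(w) * l2_normalize(visual_desc)
    d = np.sqrt(1.0 - w) * l2_normalize(depth_desc)
    return l2_normalize(np.concatenate([v, d], axis=-1))

def late_fusion_scores(query_v, db_v, query_d, db_d, w: float = 0.5) -> np.ndarray:
    """Late fusion (assumed form): retrieve with each modality separately,
    then blend the two cosine-similarity score matrices."""
    sim_v = l2_normalize(query_v) @ l2_normalize(db_v).T
    sim_d = l2_normalize(query_d) @ l2_normalize(db_d).T
    return w * sim_v + (1.0 - w) * sim_d

# Toy usage: 3 queries, 10 database images, 256-D visual / 128-D depth descriptors.
rng = np.random.default_rng(0)
qv, dv = rng.normal(size=(3, 256)), rng.normal(size=(10, 256))
qd, dd = rng.normal(size=(3, 128)), rng.normal(size=(10, 128))

fused_q, fused_db = early_fusion(qv, qd), early_fusion(dv, dd)
print(np.argmax(fused_q @ fused_db.T, axis=1))                 # top-1 match per query (early fusion)
print(np.argmax(late_fusion_scores(qv, dv, qd, dd), axis=1))   # top-1 match per query (late fusion)
```

Because both strategies operate on descriptors produced by frozen backbones, neither requires end-to-end retraining, which is what allows the approach to act as a pluggable module for pre-trained VPR models.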