Dr Pedro Porto Buarque de Gusmão
Academic and research departments
Computer Science Research Centre, School of Computer Science and Electronic Engineering

Publications
The ubiquity of camera-enabled mobile devices has led to large amounts of unlabelled video data being produced at the edge. Although various self-supervised learning (SSL) methods have been proposed to harvest their latent spatio-temporal representations for task-specific training, practical challenges including privacy concerns and communication costs prevent SSL from being deployed at large scales. To mitigate these issues, we propose applying Federated Learning (FL) to the task of video SSL. In this work, we evaluate the performance of current state-of-the-art (SOTA) video-SSL techniques and identify their shortcomings when integrated into the large-scale FL setting simulated with the Kinetics-400 dataset. We then propose a novel federated SSL framework for video, dubbed FedVSSL, that integrates different aggregation strategies and partial weight updating. Extensive experiments demonstrate the effectiveness and significance of FedVSSL, as it outperforms the centralized SOTA for the downstream retrieval task by 6.66% on UCF-101 and 5.13% on HMDB-51.
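To make the two ingredients mentioned in the abstract concrete, the following is a minimal sketch of a server-side aggregation rule in the spirit of FedVSSL: only backbone weights are aggregated (partial weight updating), and two client-weighting strategies, by dataset size and by reported SSL loss, are blended with a coefficient alpha. The function name, the `backbone.` key prefix, and the exact weighting are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def aggregate_round(global_weights, client_weights, client_sizes, client_losses,
                    alpha=0.5):
    """Sketch of federated video-SSL aggregation with partial weight updating.

    Only parameters belonging to the SSL backbone are replaced in the global
    model; the remaining parameters (e.g. predictor heads) keep their current
    global values. Two client-weighting strategies (by dataset size and by SSL
    training loss) are blended with a coefficient alpha.
    """
    size_w = np.array(client_sizes, dtype=float)
    size_w /= size_w.sum()                        # FedAvg-style weights
    loss_w = np.array(client_losses, dtype=float)
    loss_w /= loss_w.sum()                        # loss-based weights
    mix = alpha * size_w + (1.0 - alpha) * loss_w

    new_global = dict(global_weights)             # start from the current global model
    for key in global_weights:
        if not key.startswith("backbone."):       # partial update: skip non-backbone params
            continue
        new_global[key] = sum(m * w[key] for m, w in zip(mix, client_weights))
    return new_global
```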
The ubiquity of camera-enabled devices has led to large amounts of unlabeled image data being produced at the edge. The integration of self-supervised learning (SSL) and federated learning (FL) into one coherent system can potentially offer data privacy guarantees while also advancing the quality and robustness of the learned visual representations without needing to move data around. However, client bias and divergence during FL aggregation caused by data heterogeneity limits the performance of learned visual representations on downstream tasks. In this paper, we propose a new aggregation strategy termed Layer-wise Divergence Aware Weight Aggregation (L-DAWA) to mitigate the influence of client bias and divergence during FL aggregation. The proposed method aggregates weights at the layer-level according to the measure of angular divergence between the clients' model and the global model. Extensive experiments with cross-silo and cross-device settings on CIFAR-10/100 and Tiny ImageNet datasets demonstrate that our methods are effective and obtain new SOTA performance on both contrastive and non-contrastive SSL approaches.
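The following is a minimal sketch of layer-wise, divergence-aware aggregation, assuming the angular divergence is measured as the cosine similarity between each client's flattened layer weights and the corresponding global layer weights; the exact normalisation and clipping used in the paper may differ, and the function name is illustrative.

```python
import torch

def layerwise_divergence_aware_aggregate(global_state, client_states):
    """Sketch: each client's contribution to a layer is weighted by the cosine
    similarity (angular agreement) between its layer weights and the current
    global layer weights, so strongly diverging clients contribute less to
    that layer."""
    new_state = {}
    for name, g in global_state.items():
        g_flat = g.flatten().float()
        weighted_sum = torch.zeros_like(g, dtype=torch.float32)
        weight_total = 0.0
        for client in client_states:
            c = client[name].float()
            sim = torch.nn.functional.cosine_similarity(
                g_flat, c.flatten(), dim=0).clamp(min=0.0)
            weighted_sum += sim * c
            weight_total += sim.item()
        new_state[name] = weighted_sum / max(weight_total, 1e-8)
    return new_state
```

Weighting per layer, rather than per model, lets layers that agree across clients (often early feature extractors) be averaged aggressively while divergent layers are damped individually.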
Simultaneous localization and mapping (SLAM) systems typically employ vision-based sensors to observe the surrounding environment. However, the performance of such systems highly depends on the ambient illumination conditions. In scenarios with adverse visibility or in the presence of airborne particulates (e.g., smoke, dust, etc.), alternative modalities such as those based on thermal imaging and inertial sensors are more promising. In this article, we propose the first complete thermal-inertial SLAM system that combines neural abstraction in the SLAM front end with robust pose-graph optimization in the SLAM back end. We model the sensor abstraction in the front end by employing probabilistic deep learning parameterized by mixture density networks (MDNs). Our key strategies to successfully model this encoding from thermal imagery are the usage of normalized 14-bit radiometric data, the incorporation of hallucinated visual (RGB) features, and the inclusion of feature selection to estimate the MDN parameters. To enable a full SLAM system, we also design an efficient global image descriptor that is able to detect loop closures from thermal embedding vectors. We performed extensive experiments and analysis using three datasets: self-collected ground-robot and hand-held data taken in an indoor environment, and one public dataset (SubT-tunnel) collected in an underground tunnel. Finally, we demonstrate that an accurate thermal-inertial SLAM system can be realized in conditions of both benign and adverse visibility.
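As a small illustration of the probabilistic front end described above, here is a sketch of a mixture density network head that maps an encoder feature vector to the parameters of a Gaussian mixture over the 6-DoF relative pose, together with the usual mixture negative log-likelihood. Layer sizes, component counts, and names are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MDNPoseHead(nn.Module):
    """Sketch of an MDN head for relative-pose regression from thermal features."""
    def __init__(self, feat_dim=512, n_components=5, pose_dim=6):
        super().__init__()
        self.n_components, self.pose_dim = n_components, pose_dim
        self.pi = nn.Linear(feat_dim, n_components)                    # mixture weights
        self.mu = nn.Linear(feat_dim, n_components * pose_dim)         # component means
        self.log_sigma = nn.Linear(feat_dim, n_components * pose_dim)  # component scales

    def forward(self, features):
        B = features.shape[0]
        log_pi = torch.log_softmax(self.pi(features), dim=-1)
        mu = self.mu(features).view(B, self.n_components, self.pose_dim)
        sigma = torch.exp(self.log_sigma(features)).view(B, self.n_components, self.pose_dim)
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, target):
    """Negative log-likelihood of the target pose under the predicted mixture."""
    dist = torch.distributions.Normal(mu, sigma)
    log_prob = dist.log_prob(target.unsqueeze(1)).sum(dim=-1)  # (B, K)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```

Predicting a mixture rather than a point estimate is what lets the back end treat the front-end odometry as an uncertain measurement in pose-graph optimization.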
Real-time positioning of emergency personnel has been an active research topic for many years. However, studies on how to improve navigation accuracy by using prior information on the idiosyncratic motion characteristics of firefighters are scarce. This paper presents an algorithm for generating pseudo observations of position and orientation based on standard search patterns used by firefighters. The iterative closest point algorithm is used to compare walking trajectories estimated from inertial odometry with search patterns generated from digital maps. The resulting fitting errors are then used to integrate the pseudo observations into a map-aided navigation filter. Specifically, we present a sequential Monte Carlo solution where the pattern comparison is used to both update particle weights and create new particle samples. Experimental results involving professional firefighters demonstrate that the proposed pseudo observations can achieve a stable localization error of about one meter, and offer increased robustness in the presence of map errors.
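To illustrate the pattern-comparison step described above, the following is a minimal 2-D sketch: an inertial-odometry trajectory is rigidly aligned to a search-pattern polyline with point-to-point ICP, and the residual fitting error is returned so it can weight the resulting pseudo observation in the filter. The point-to-point formulation, names, and error definition are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_fit_to_pattern(trajectory, pattern, n_iter=20):
    """Align a trajectory (N x 2) to a map-derived search pattern (M x 2)
    with point-to-point ICP; return the aligned points and mean fit error."""
    tree = cKDTree(pattern)
    src = trajectory.copy()
    for _ in range(n_iter):
        _, idx = tree.query(src)                  # closest pattern point per pose
        tgt = pattern[idx]
        # best-fit rigid transform (Kabsch) between matched point sets
        src_c, tgt_c = src.mean(axis=0), tgt.mean(axis=0)
        H = (src - src_c).T @ (tgt - tgt_c)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                  # guard against reflections
            Vt[-1, :] *= -1
            R = Vt.T @ U.T
        t = tgt_c - R @ src_c
        src = src @ R.T + t
    dists, _ = tree.query(src)
    return src, float(np.mean(dists))             # fit error weights the pseudo observation
```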
Federated Learning (FL) enables training ML models on edge clients without sharing data. However, the federated model's performance on local data varies, disincentivising the participation of clients who benefit little from FL. Fair FL reduces accuracy disparity by focusing on clients with higher losses, while personalisation locally fine-tunes the model. Personalisation provides a participation incentive when an FL model underperforms relative to one trained locally: fine-tuning the pre-trained federated weights can raise their accuracy to match or exceed that of a model trained entirely locally by the client. This paper evaluates two Fair FL (FFL) algorithms as starting points for personalisation. Our results show that FFL provides no benefit to relative performance in a language task and may double the number of underperforming clients for an image task. Instead, we propose Personalisation-aware Federated Learning (PaFL) as a paradigm that pre-emptively uses personalisation losses during training. Our technique reduces the number of underperforming clients by 50% for the language task and, rather than doubling it, lowers the number of underperforming clients in the image task. Thus, evidence indicates that it may allow a broader set of devices to benefit from FL, and it represents a promising avenue for future experimentation and theoretical analysis.
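As an illustration of what "using personalisation losses during training" can look like, here is a minimal, first-order (Reptile-style) sketch of a personalisation-aware client update: each batch fine-tunes a personalised copy of the shared model, and the shared model is then pulled toward the personalised weights, so the federated weights are optimised for how well they adapt locally. This is a sketch under those assumptions, not the paper's exact PaFL update; all names and hyperparameters are illustrative.

```python
import copy
import torch

def personalisation_aware_client_update(global_model, loader, loss_fn,
                                         inner_lr=0.01, outer_lr=0.5,
                                         inner_steps=1):
    """Sketch: train the shared weights against post-personalisation behaviour."""
    model = copy.deepcopy(global_model)
    for x, y in loader:
        # 1) simulate personalisation: short local fine-tune of a copy
        personal = copy.deepcopy(model)
        opt = torch.optim.SGD(personal.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            opt.zero_grad()
            loss_fn(personal(x), y).backward()
            opt.step()
        # 2) move the shared model toward the personalised weights (first-order meta step)
        with torch.no_grad():
            for p, q in zip(model.parameters(), personal.parameters()):
                p.add_(outer_lr * (q - p))
    return model.state_dict()
```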
In the last decade, numerous supervised deep learning approaches have been proposed for visual–inertial odometry (VIO) and depth map estimation, which require large amounts of labelled data. To overcome the data limitation, self-supervised learning has emerged as a promising alternative that exploits constraints such as geometric and photometric consistency in the scene. In this study, we present a novel self-supervised deep learning-based VIO and depth map recovery approach (SelfVIO) using adversarial training and self-adaptive visual–inertial sensor fusion. SelfVIO learns the joint estimation of 6 degrees-of-freedom (6-DoF) ego-motion and a depth map of the scene from unlabelled monocular RGB image sequences and inertial measurement unit (IMU) readings. The proposed approach is able to perform VIO without requiring IMU intrinsic parameters or extrinsic calibration between the IMU and the camera. We provide comprehensive quantitative and qualitative evaluations of the proposed framework and compare its performance with state-of-the-art VIO, VO, and visual simultaneous localization and mapping (VSLAM) approaches on the KITTI, EuRoC and Cityscapes datasets. Detailed comparisons prove that SelfVIO outperforms state-of-the-art VIO approaches in terms of pose estimation and depth recovery, making it a promising approach among existing methods in the literature.
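The photometric-consistency constraint mentioned above is the core self-supervised signal in this family of methods: a source frame is warped into the target view using the predicted depth and relative pose, and the reconstruction error trains both networks. The sketch below assumes a pinhole camera with intrinsics K and a (B, 4, 4) target-to-source transform; shapes and names are illustrative, not SelfVIO's exact loss (which also includes adversarial and structural terms).

```python
import torch
import torch.nn.functional as F

def photometric_consistency_loss(target, source, depth, pose, K):
    """L1 error between the target frame and the source frame warped via
    predicted depth (B, 1, H, W) and relative pose (B, 4, 4)."""
    B, _, H, W = target.shape
    # back-project target pixels to 3-D rays using the predicted depth
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(3, -1)
    cam = torch.linalg.inv(K) @ pix                              # (3, HW) rays
    pts = depth.view(B, 1, -1) * cam                             # (B, 3, HW)
    pts_h = torch.cat([pts, torch.ones(B, 1, H * W)], dim=1)     # homogeneous
    # transform into the source frame and project back to pixels
    proj = K @ (pose @ pts_h)[:, :3]                             # (B, 3, HW)
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    warped = F.grid_sample(source, grid.view(B, H, W, 2), align_corners=True)
    return (target - warped).abs().mean()                        # photometric error
```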