Publications

Journal article

Goubard C, Demiris Y, 2025,

Cognitive modelling of visual attention captures trust dynamics in human-robot collaboration

, ACM Transactions on Human-Robot Interaction, Vol: 14, ISSN: 2573-9522

Understanding how humans perceive and interact with robots is crucial for collaborative scenarios. Trust, a pivotal factor in such interactions, is inherently volatile and subjective, posing significant challenges for robots. However, trust has also been shown to influence specific human bio-signals and behaviours, suggesting that it could be inferred from those indicators. One such indicator is visual attention, the cognitive process of focusing on distinct environmental elements, often manifested through eye gaze. Despite recent research connecting eye gaze and trust in Human-Robot Collaboration scenarios, this relationship remains largely unexplored. This paper presents a novel signal, the Attention Arbitration Ratio (AAR), which is shown to be a promising real-time predictor of subjective and objective trust measures. We obtain this signal using a visual attention modelling framework that explicitly emulates the Bottom-Up and Top-Down processes, two key cognitive components. We demonstrate the connection between the AAR and trust using Bayesian data analysis, and we analyse the sensitivity of that connection with different visual attention models. For evaluation purposes, we collected gaze data and trust questionnaires from 49 interactions where 29 participants engaged in a collaborative assistive cooking task with a robot, for a total duration of 24h53 of data collection.

Journal article

Chacón Quesada R, Estévez Casado F, Demiris Y, 2025,

An integrated 3D eye-gaze tracking framework for assessing trust in human–robot interaction

, ACM Transactions on Human-Robot Interaction, Vol: 14, Pages: 1-28, ISSN: 2573-9522

We introduce a comprehensive approach to examining the complexities of trust during Human–Robot Interactions (HRIs) through an innovative 3D eye-gaze tracking framework. Trust is a fundamental psychological factor in HRI studies, influencing how humans perceive and interact with robots. Although researchers have previously highlighted eye-tracking as a promising tool for capturing behavioural manifestations of trust continuously and non-intrusively, traditional approaches have been limited to 2D setups, leaving their applicability to real-world HRI largely unexplored. Thus, there still is limited evidence for the feasibility and validity of using eye-tracking to assess human–robot trust in more realistic settings. To this end, our framework employs Head-Mounted Displays with 3D eye-gaze and spatial tracking capabilities to gather continuous eye-gaze data alongside real-time user and robot positions. In addition to 3D eye-gaze tracking capabilities, we designed and incorporated a Bayesian model to evaluate experimental treatments’ effectiveness while identifying eye-gaze features correlating with participants’ subjective trust scores. The latter are measured using Likert-type instruments, widely used in HRI research. We applied our framework to a user study involving 25 participants performing an inspection task with a robot under two reliability conditions—high versus low. Our results revealed significant differences in subjective trust between conditions. Moreover, the results show that participants exposed to the low-reliability condition fixate for longer and have higher fixation and saccade amplitudes when compared to those in the high-reliability condition. Additionally, the group with low reliability had a greater rate of transitions between fixations. These findings are consistent with previous research on 2D settings. However, we observed differences in scan-path length and total fixation count compared to previous studies. Last

Journal article

Kotsovolis S, Demiris Y, 2025,

Garment diffusion models for robot-assisted dressing

, IEEE Robotics and Automation Letters, Vol: 10, Pages: 1217-1224, ISSN: 2377-3766

Robots have the potential to assist people with disabilities and the elderly. One of the most common and burdensome tasks for caregivers is dressing. Two challenges of robot-assisted dressing are modeling the dynamics of garments and handling visual occlusions that obstruct the perception of the full state of the garment due to the proximity between the garment, the robot, and the human. In this letter, we propose a diffusion-based dynamics model for garments during robot-assisted dressing that can deal with partial point cloud observations. The diffusion model, conditioned on the observation and the robot's action, is used to predict a full point cloud of the garment's opening of the future state. The model is utilized in a model predictive controller, that is trained iteratively with model-based reinforcement learning. In our experiments, we examine a common problem of dressing: the insertion of a garment's sleeve on an arm. As demonstrated by the performed experiments, the proposed diffusion-based model predictive controller can be effectively used for robot-assisted dressing and handle visual occlusions. Moreover, our approach is highly sample-efficient. Specifically, the controller achieved 91.2% success rate in the examined dressing task with less than 100 sampled trajectories. Real-wold experiments demonstrate that the proposed method can adapt to the sim-to-real gap and generalize well to novel garments and configurations of the body.

Abstract
Cite

Conference paper

Gu Y, Demiris Y, 2024,

Learning bimanual manipulation policies for bathing bed-bound people

, 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Publisher: IEEE, Pages: 8936-8943

Assistive robots hold promise in enhancing the quality of life for older adults and people with mobility impairments in daily bed bathing routines. When providing bathing assistance to bed-bound people, human caregivers often support the joints when lifting the arms and legs to properly wash and dry occluded areas. This research introduces a novel approach to robotic bed bathing manipulation, where a bimanual robot learns to lift a target limb while controlling a cleaning tool to bath the surface within safe force bounds. To ensure safe, cooperative bath manipulation, our work combines Multi-Agent Reinforcement Learning (MARL) framework with a variable impedance action space enabling adaptive interaction with the environment and carefully-designed reward functions regulating contact force on the human body. Simulation results demonstrate improved bathing area coverage compared to unimanual models and exhibit great adaptability to contact-rich interaction within a safe force boundary. We validate our approach across various human body sizes, showcasing its generalizability. We also transfer our models to a physical Baxter robot bathing a medical-grade manikin. We further incorporate a force tracking controller with the trained models to enhance adaptation to noisy real-world bathing scenarios. To the best of our knowledge, this is the first robot-assisted bed bathing application that performs autonomous bathing around the human body using bimanual robot arms.

Conference paper

Chacón Quesada R, Estévez Casado F, Demiris Y, 2024,

On the effect of augmented-reality multi-user interfaces and shared mental models on human-robot trust

, 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (ROMAN), Publisher: IEEE, Pages: 1316-1322, ISSN: 1944-9437

Augmented Reality multi-user interfaces facilitate communication, coordination and collaboration among teams. Moreover, these interfaces can help to align the team’s perceptions and expectations under a shared mental model. This model is a psychological construct that represents the common knowledge, beliefs, and understandings held by team members. In this paper, we study to what extent, if any, the combination of Augmented Reality multi-user interfaces and shared mental models affects human-robot trust. To this end, we developed an Augmented Reality multi-user interface to perform a user study (N = 37) comparing non-dyadic human-robot interactions with a quadruped robot exhibiting low reliability (Group 3), against dyadic interactions while the robot exhibited high-reliability (Group 1) or low-reliability (Group 2). We made this comparison using validated trust questionnaires relevant to HRI. Our results, obtained via Bayesian data analysis methods, show differences in the distribution of answers between groups 1 and 2. Notably, this difference is smaller between groups 1 and 3, which suggests that the combination of shared mental models and multi-user interfaces holds promise as an effective way to manage and calibrate human-robot trust.

Journal article

Luo H, Demiris Y, 2024,

Benchmarking and simulating bimanual robot shoe lacing

, IEEE Robotics and Automation Letters, Vol: 9, Pages: 8202-8209, ISSN: 2377-3766

Manipulation of deformable objects is a challenging domain in robotics. Although it has been gaining attention in recent years, long-horizon deformable object manipulation remains largely unexplored. In this letter, we propose a benchmark for the bi-manual Shoe Lacing (SL) task for evaluating and comparing long-horizon deformable object manipulation algorithms. SL is a difficult sensorimotor task in everyday life as well as the shoe manufacturing sector. Due to the complexity of the shoe structure, SL naturally requires sophisticated long-term planning. We provide a rigorous definition of the task and protocols to ensure the repeatability of SL experiments. We present 6 benchmark metrics for quantitatively measuring the ecological validity of approaches towards bi-manual SL. We further provide an open-source simulation environment for training and testing SL algorithms, as well as details of the construction and usage of the environment. We evaluate a baseline solution according to the proposed metrics in both reality and simulation.

Conference paper

Goubard C, Demiris Y, 2024,

Learning self-confidence from semantic action embeddings for improved trust in human-robot interaction

, 2024 IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE

In HRI scenarios, human factors like trust can greatly impact task performance and interaction quality. Recent research has confirmed that perceived robot proficiency is a major antecedent of trust. By making robots aware of their capabilities, we can allow them to choose when to perform low-confidence actions, thus actively controlling the risk of trust reduction. In this paper, we propose SCONE, a policy to learn self-confidence from experience using semantic action embeddings. Using an assistive cooking setting, we show that the semantic aspect allows SCONE to learn self-confidence faster than existing approaches, while also achieving promising performance in simple instructions following. Finally, we share results from a pilot study with 31 participants, showing that such a self-confidence-aware policy increases capability-based human trust.

Conference paper

Kotsovolis S, Demiris Y, 2024,

Model predictive control with graph dynamics for garment opening insertion during robot-assisted dressing

, 2024 IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 883-890

Robots have a great potential to help people with movement limitations in activities of daily living, such as dressing. A common problem in almost all dressing tasks is the insertion of a garment’s opening around a part of the human body. The rich contact environment and the deformations of the garment make the task a challenging problem for robots. In this paper, we propose a bi-manual control method for garment opening insertion during robot-assisted dressing. Specifically, we propose a model predictive controller that uses an Attention-based Relational Graph Convolutional Network (ARGCN) for modeling the dynamics of the opening in the presence of the body. We train the model entirely in simulation and validate our method in four real-world dressing scenarios of a medical training manikin. We show that our method generalizes well in the real-world opening insertion tasks achieving an overall success rate of 97.5%, even though the dynamics and the shapes vastly differ from the simulation setup.

Journal article

Chacón Quesada R, Demiris Y, 2024,

Multi-dimensional evaluation of an augmented reality head-mounted display user interface for controlling legged manipulators

, ACM Transactions on Human-Robot Interaction, Vol: 13, ISSN: 2573-9522

Controlling assistive robots can be challenging for some users, especially those lacking relevant experience. Augmented Reality (AR) User Interfaces (UIs) have the potential to facilitate this task. Although extensive research regarding legged manipulators exists, comparatively little is on their UIs. Most existing UIs leverage traditional control interfaces such as joysticks, Hand-Held (HH) controllers and 2D UIs. These interfaces not only risk being unintuitive, thus discouraging interaction with the robot partner, but also draw the operator’s focus away from the task and towards the UI. This shift in attention raises additional safety concerns, particularly in potentially hazardous environments where legged manipulators are frequently deployed. Moreover, traditional interfaces limit the operators’ availability to use their hands for other tasks. Towards overcoming these limitations, in this article, we provide a user study comparing an AR Head-Mounted Display (HMD) UI we developed for controlling a legged manipulator against off-the-shelf control methods for such robots. This user study involved 27 participants and 135 trials, from which we gathered over 405 completed questionnaires. These trials involved multiple navigation and manipulation tasks with varying difficulty levels using a Boston Dynamics’s Spot, a 7 df Kinova robot arm and a Robotiq 2F-85 gripper that we integrated into a legged manipulator. We made the comparison between UIs across multiple dimensions relevant to a successful human–robot interaction. These dimensions include cognitive workload, technology acceptance, fluency, system usability, immersion and trust. Our study employed a factorial experimental design with participants undergoing five different conditions, generating longitudinal data. Due to potential unknown distributions and outliers in such data, using parametric methods for its analysis is questionable, and while non-parametric alternatives exist, they may

Journal article

Gu Y, Demiris Y, 2024,

VTTB: a visuo-tactile learning approach for robot-assisted bed bathing

, IEEE Robotics and Automation Letters, Vol: 9, Pages: 5751-5758, ISSN: 2377-3766

Robot-assisted bed bathing holds the potential to enhance the quality of life for older adults and individuals with mobility impairments. Yet, accurately sensing the human body in a contact-rich manipulation task remains challenging. To address this challenge, we propose a multimodal sensing approach that perceives the 3D contour of body parts using the visual modality while capturing local contact details using the tactile modality. We employ a Transformer-based imitation learning model to utilize the multimodal information and learn to focus on crucial visuo-tactile task features for action prediction. We demonstrate our approach using a Baxter robot and a medical manikin to simulate the robot-assisted bed bathing scenario with bedridden individuals. The robot adeptly follows the contours of the manikin's body parts and cleans the surface based on its curve. Experimental results show that our method can adapt to nonlinear surface curves and generalize across multiple surface geometries, and to human subjects. Overall, our research presents a promising approach for robots to accurately sense the human body through multimodal sensing and perform safe interaction during assistive bed bathing.

Conference paper

Zhong Y, Demiris Y, 2024,

DanceMVP: Self-Supervised Learning for Multi-Task Primitive-Based Dance Performance Assessment via Transformer Text Prompting

, 38th AAAI Conference on Artificial Intelligence (AAAI) / 36th Conference on Innovative Applications of Artificial Intelligence / 14th Symposium on Educational Advances in Artificial Intelligence, Publisher: ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE, Pages: 10270-10278, ISSN: 2159-5399

Cite

Conference paper

Luo H, Demiris Y, 2023,

Bi-manual robot shoe lacing

, 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Publisher: IEEE, ISSN: 2153-0866

Shoe lacing (SL) is a challenging sensorimotor task in daily life and a complex engineering problem in theshoe-making industry. In this paper, we propose a system for autonomous SL. It contains a mathematical definition of the SL task and searches for the best lacing pattern corresponding to the shoe configuration and the user preferences. We propose a set of action primitives and generate plans of action sequences according to the designed pattern. Our system plans the trajectories based on the perceived position of the eyelets and aglets with an active perception strategy, and deploys the trajectories on a bi-manual robot. Experiments demonstrate that the proposed system can successfully lace 3 different shoesin different configurations, with a completion rate of 92.0%,91.6% and 77.5% for 6, 8 and 10-eyelet patterns respectively.To the best of our knowledge, this is the first demonstration of autonomous SL using a bi-manual robot.

Journal article

Zhang X, Demiris Y, 2023,

Visible and infrared image fusion using deep learning

, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 45, Pages: 10535-10554, ISSN: 0162-8828

Visible and infrared image fusion (VIF) has attracted a lot of interest in recent years due to its application in many tasks, such as object detection, object tracking, scene segmentation, and crowd counting. In addition to conventional VIF methods, an increasing number of deep learning-based VIF methods have been proposed in the last five years. Different types of methods, such as CNN-based, autoencoder-based, GAN-based, and transformer-based methods, have been proposed. Deep learning-based methods have undoubtedly become dominant methods for the VIF task. However, while much progress has been made, the field will benefit from a systematic review of these deep learning-based methods. In this paper we present a comprehensive review of deep learning-based VIF methods. We discuss motivation, taxonomy, recent development characteristics, datasets, and performance evaluation methods in detail. We also discuss future prospects of the VIF field. This paper can serve as a reference for VIF researchers and those interested in entering this fast-developing field.

Abstract
Cite

Journal article

Zhang F, Demiris Y, 2023,

Visual-tactile learning of garment unfolding for robot-assisted dressing

, IEEE Robotics and Automation Letters, Vol: 8, Pages: 5512-5519, ISSN: 2377-3766

Assistive robots have the potential to support disabled and elderly people in daily dressing activities. An intermediate stage of dressing is to manipulate the garment from a crumpled initial state to an unfolded configuration that facilitates robust dressing. Applying quasi-static grasping actions with vision feedback on garment unfolding usually suffers from occluded grasping points. In this work, we propose a dynamic manipulation strategy: tracing the garment edge until the hidden corner is revealed. We introduce a model-based approach, where a deep visual-tactile predictive model iteratively learns to perform servoing from raw sensor data. The predictive model is formalized as Conditional Variational Autoencoder with contrastive optimization, which jointly learns underlying visual-tactile latent representations, a latent garment dynamics model, and future predictions of garment states. Two cost functions are explored: the visual cost, defined by garment corner positions, guarantees the gripper to move towards the corner, while the tactile cost, defined by garment edge poses, prevents the garment from falling from the gripper. The experimental results demonstrate the improvement of our contrastive visual-tactile model predictive control over single sensing modality and baseline model learning techniques. The proposed method enables a robot to unfold back-opening hospital gowns and perform upper-body dressing.

Conference paper

Goubard C, Demiris Y, 2023,

Cooking up trust: eye gaze and posture for trust-aware action selection in human-robot collaboration

, TAS '23: First International Symposium on Trustworthy Autonomous Systems, Publisher: ACM, Pages: 1-5

Conference paper

Chacon Quesada R, Demiris Y, 2023,

Design and evaluation of an augmented reality head-mounted display user interface for controlling legged manipulators

, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 11950-11956, ISSN: 1050-4729

Designing an intuitive User Interface (UI) for controlling assistive robots remains challenging. Most existing UIs leverage traditional control interfaces such as joysticks, hand-held controllers, and 2D UIs. Thus, users have limited availability to use their hands for other tasks. Furthermore, although there is extensive research regarding legged manipulators, comparatively little is on their UIs. Towards extending the state-of-art in this domain, we provide a user study comparing an Augmented Reality (AR) Head-Mounted Display (HMD) UI we developed for controlling a legged manipulator against off-the-shelf control methods for such robots. We made this comparison baseline across multiple factors relevant to a successful interaction. The results from our user study ( N=17 ) show that although the AR UI increases immersion, off-the-shelf control methods outperformed the AR UI in terms of time performance and cognitive workload. Nonetheless, a follow-up pilot study incorporating the lessons learned shows that AR UIs can outpace hand-held-based control methods and reduce the cognitive requirements when designers include hands-free interactions and cognitive offloading principles into the UI.

Conference paper

Kotsovolis S, Demiris Y, 2023,

Bi-manual manipulation of multi-component garments towards robot-assisted dressing

, 2023 IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE

In this paper, we propose a strategy for robot-assisted dressing with multi-component garments, such as gloves. Most studies in robot-assisted dressing usually experiment with single-component garments, such as sleeves, while multi-component tasks are often approached as sequential single-component problems. In dressing scenarios with more complex garments, robots should estimate the alignment of the human body to the manipulated garments, and revise their dressing strategy. In this paper, we focus on a glove dressing scenario and propose a decision process for selecting dressing action primitives on the different components of the garment, based on a hierarchical representation of the task and a set of environmental conditions. To complement this process, we propose a set of bi-manual control strategies, based on hybrid position, visual, and force feedback, in order to execute the dressing action primitives with the deformable object. The experimental results validate our method, enabling the Baxter robot to dress a mannequin's hand with a gardening glove.

Journal article

Candela E, Doustaly O, Parada L, Feng F, Demiris Y, Angeloudis Pet al., 2023,

Risk-aware controller for autonomous vehicles using model-based collision prediction and reinforcement learning

, Artificial Intelligence, Vol: 320, ISSN: 0004-3702

Autonomous Vehicles (AVs) have the potential to save millions of lives and increase the efficiency of transportation services. However, the successful deployment of AVs requires tackling multiple challenges related to modeling and certifying safety. State-of-the-art decision-making methods usually rely on end-to-end learning or imitation learning approaches, which still pose significant safety risks. Hence the necessity of risk-aware AVs that can better predict and handle dangerous situations. Furthermore, current approaches tend to lack explainability due to their reliance on end-to-end Deep Learning, where significant causal relationships are not guaranteed to be learned from data.This paper introduces a novel risk-aware framework for training AV agents using a bespoke collision prediction model and Reinforcement Learning (RL). The collision prediction model is based on Gaussian Processes and vehicle dynamics, and is used to generate the RL state vector. Using an explicit risk model increases the post-hoc explainability of the AV agent, which is vital for reaching and certifying the high safety levels required for AVs and other safety-sensitive applications. Experimental results obtained with a simulator and state-of-the-art RL algorithms show that the risk-aware RL framework decreases average collision rates by 15%, makes AVs more robust to sudden harsh braking situations, and achieves better performance in both safety and speed when compared to a standard rule-based method (the Intelligent Driver Model). Moreover, the proposed collision prediction model outperforms other models in the literature.

Conference paper

Zhong Y, Zhang F, Demiris Y, 2023,

Contrastive self-supervised learning for automated multi-modal dance performance assessment

, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE

A fundamental challenge of analyzing human motion is to effectively represent human movements both spatially and temporally. We propose a contrastive self-supervised strategy to tackle this challenge. Particularly, we focus on dancing, which involves a high level of physical and intellectual abilities. Firstly, we deploy Graph and Residual Neural Networks with Siamese architecture to represent the dance motion and music features respectively. Secondly, we apply the InfoNCE loss to contrastively embed the high-dimensional multimedia signals onto the latent space without label supervision. Finally, our proposed framework is evaluated on a multi-modal Dance- Music-Level dataset composed of various dance motions, music, genres and choreographies with dancers of different expertise levels. Experimental results demonstrate the robustness and improvements of our proposed method over 3 baselines and 6 ablation studies across tasks of dance genres, choreographies classification and dancer expertise level assessment.

Conference paper

Casado FE, Demiris Y, 2022,

Federated learning from demonstration for active assistance to smart wheelchair users

, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Publisher: IEEE, Pages: 9326-9331, ISSN: 2153-0858

Learning from Demonstration (LfD) is a very appealing approach to empower robots with autonomy. Given some demonstrations provided by a human teacher, the robot can learn a policy to solve the task without explicit programming. A promising use case is to endow smart robotic wheelchairs with active assistance to navigation. By using LfD, it is possible to learn to infer short-term destinations anywhere, without the need of building a map of the environment beforehand. Nevertheless, it is difficult to generalize robot behaviors to environments other than those used for training. We believe that one possible solution is learning from crowds, involving a broad number of teachers (the end users themselves) who perform demonstrations in diverse and real environments. To this end, in this work we consider Federated Learning from Demonstration (FLfD), a distributed approach based on a Federated Learning architecture. Our proposal allows the training of a global deep neural network using sensitive local data (images and laser readings) with privacy guarantees. In our experiments we pose a scenario involving different clients working in heterogeneous domains. We show that the federated model is able to generalize and deal with non Independent and Identically Distributed (non-IID) data.

Conference paper

Chacon Quesada R, Demiris Y, 2022,

Holo-SpoK: Affordance-aware augmented reality control of legged manipulators

, 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Publisher: IEEE, Pages: 856-862

Although there is extensive research regarding legged manipulators, comparatively little focuses on their User Interfaces (UIs). Towards extending the state-of-art in this domain, in this work, we integrate a Boston Dynamics(BD) Spot® with a light-weight 7 DoF Kinova® robot arm and a Robotiq® 2F-85 gripper into a legged manipulator. Furthermore, we jointly control the robotic platform using an affordance-aware Augmented Reality (AR) Head-Mounted Display (HMD) UI developed for the Microsoft HoloLens 2. We named the combined platform Holo-SpoK. Moreover, we explain how this manipulator colocalises with the HoloLens 2 for its control through AR. In addition, we present the details of our algorithms for autonomously detecting grasp-ability affordances and for the refinement of the positions obtainedvia vision-based colocalisation. We validate the suitability of our proposed methods with multiple navigation and manipulation experiments. To the best of our knowledge, this is the first demonstration of an AR HMD UI for controlling legged manipulators.

Conference paper

Zolotas M, Demiris Y, 2022,

Disentangled sequence clustering for human intention inference

, IEEE/RSJ International Conference on Intelligent Robots and Systems, Publisher: IEEE, Pages: 9814-9820, ISSN: 2153-0866

Equipping robots with the ability to infer human intent is a vital precondition for effective collaboration. Most computational approaches towards this objective derive a probability distribution of “intent” conditioned on the robot’s perceived state. However, these approaches typically assumetask-specific labels of human intent are known a priori. To overcome this constraint, we propose the Disentangled Sequence Clustering Variational Autoencoder (DiSCVAE), a clustering framework capable of learning such a distribution of intent in an unsupervised manner. The proposed framework leverages recent advances in unsupervised learning to disentangle latentrepresentations of sequence data, separating time-varying local features from time-invariant global attributes. As a novel extension, the DiSCVAE also infers a discrete variable to form a latent mixture model and thus enable clustering over these global sequence concepts, e.g. high-level intentions. We evaluate the DiSCVAE on a real-world human-robot interaction datasetcollected using a robotic wheelchair. Our findings reveal that the inferred discrete variable coincides with human intent, holding promise for collaborative settings, such as shared control.

Journal article

Amadori PV, Fischer T, Wang R, Demiris Yet al., 2022,

Predicting secondary task performance: a directly actionable metric for cognitive overload detection

, IEEE Transactions on Cognitive and Developmental Systems, Vol: 14, Pages: 1474-1485, ISSN: 2379-8920

In this paper, we address cognitive overload detection from unobtrusive physiological signals for users in dual-tasking scenarios. Anticipating cognitive overload is a pivotal challenge in interactive cognitive systems and could lead to safer shared-control between users and assistance systems. Our framework builds on the assumption that decision mistakes on the cognitive secondary task of dual-tasking users correspond to cognitive overload events, wherein the cognitive resources required to perform the task exceed the ones available to the users. We propose DecNet, an end-to-end sequence-to-sequence deep learning model that infers in real-time the likelihood of user mistakes on the secondary task, i.e., the practical impact of cognitive overload, from eye-gaze and head-pose data. We train and test DecNet on a dataset collected in a simulated driving setup from a cohort of 20 users on two dual-tasking decision-making scenarios, with either visual or auditory decision stimuli. DecNet anticipates cognitive overload events in both scenarios and can perform in time-constrained scenarios, anticipating cognitive overload events up to 2s before they occur. We show that DecNet’s performance gap between audio and visual scenarios is consistent with user perceived difficulty. This suggests that single modality stimulation induces higher cognitive load on users, hindering their decision-making abilities.

Journal article

Nunes UM, Demiris Y, 2022,

Robust Event-Based Vision Model Estimation by Dispersion Minimisation

, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, Vol: 44, Pages: 9561-9573, ISSN: 0162-8828

Cite
Citations: 22

Journal article

Zhang X, Angeloudis P, Demiris Y, 2022,

ST CrossingPose: a spatial-temporal graph convolutional network for skeleton-based pedestrian crossing intention prediction

, IEEE Transactions on Intelligent Transportation Systems, Vol: 23, Pages: 20773-20782, ISSN: 1524-9050

Pedestrian crossing intention prediction is crucial for the safety of pedestrians in the context of both autonomous and conventional vehicles and has attracted widespread interest recently. Various methods have been proposed to perform pedestrian crossing intention prediction, among which the skeleton-based methods have been very popular in recent years. However, most existing studies utilize manually designed features to handle skeleton data, limiting the performance of these methods. To solve this issue, we propose to predict pedestrian crossing intention based on spatial-temporal graph convolutional networks using skeleton data (ST CrossingPose). The proposed method can learn both spatial and temporal patterns from skeleton data, thus having a good feature representation ability. Extensive experiments on a public dataset demonstrate that the proposed method achieves very competitive performance in predicting crossing intention while maintaining a fast inference speed. We also analyze the effect of several factors, e.g., size of pedestrians, time to event, and occlusion, on the proposed method.

Abstract
Cite

Journal article

Zhang X, Feng Y, Angeloudis P, Demiris Yet al., 2022,

Monocular visual traffic surveillance: a review

, IEEE Transactions on Intelligent Transportation Systems, Vol: 23, Pages: 14148-14165, ISSN: 1524-9050

To facilitate the monitoring and management of modern transportation systems, monocular visual traffic surveillance systems have been widely adopted for speed measurement, accident detection, and accident prediction. Thanks to the recent innovations in computer vision and deep learning research, the performance of visual traffic surveillance systems has been significantly improved. However, despite this success, there is a lack of survey papers that systematically review these new methods. Therefore, we conduct a systematic review of relevant studies to fill this gap and provide guidance to future studies. This paper is structured along the visual information processing pipeline that includes object detection, object tracking, and camera calibration. Moreover, we also include important applications of visual traffic surveillance systems, such as speed measurement, behavior learning, accident detection and prediction. Finally, future research directions of visual traffic surveillance systems are outlined.

Abstract
Cite

Conference paper

Al-Hindawi A, Vizcaychipi M, Demiris Y, 2022,

Faster, better blink detection through curriculum learning by augmentation

, ETRA '22: 2022 Symposium on Eye Tracking Research and Applications, Publisher: ACM, Pages: 1-7

Blinking is a useful biological signal that can gate gaze regression models to avoid the use of incorrect data in downstream tasks. Existing datasets are imbalanced both in frequency of class but also in intra-class difficulty which we demonstrate is a barrier for curriculum learning. We thus propose a novel curriculum augmentation scheme that aims to address frequency and difficulty imbalances implicitly which are are terming Curriculum Learning by Augmentation (CLbA).Using Curriculum Learning by Augmentation (CLbA), we achieve a state-of-the-art performance of mean Average Precision (mAP) 0.971 using ResNet-18 up from the previous state-of-the-art of mean Average Precision (mAP) of 0.757 using DenseNet-121 whilst outcompeting Curriculum Learning by Bootstrapping (CLbB) by a significant margin with improved calibration. This new training scheme thus allows the use of smaller and more performant Convolutional Neural Network (CNN) backbones fulfilling Nyquist criteria to achieve a sampling frequency of 102.3Hz. This paves the way for inference of blinking in real-time applications.

Conference paper

Al-Hindawi A, Vizcaychipi MP, Demiris Y, 2022,

What is the patient looking at? Robust gaze-scene intersection under free-viewing conditions

, 47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Publisher: IEEE, Pages: 2430-2434, ISSN: 1520-6149

Locating the user’s gaze in the scene, also known as Point of Regard (PoR) estimation, following gaze regression is important for many downstream tasks. Current techniques either require the user to wear and calibrate instruments, require significant pre-processing of the scene information, or place restrictions on user’s head movements.We propose a geometrically inspired algorithm that, despite its simplicity, provides high accuracy and O(J) performance under a variety of challenging situations including sparse depth maps, high noise, and high dynamic parallax between the user and the scene camera. We demonstrate the utility of the proposed algorithm in regressing the PoR from scenes captured in the Intensive Care Unit (ICU) at Chelsea & Westminster Hospital NHS Foundation Trust a .

Conference paper

Bin Razali MH, Demiris Y, 2022,

Using a single input to forecast human action keystates in everyday pick and place actions

, IEEE International Conference on Acoustics, Speech and Signal Processing, Publisher: IEEE, Pages: 3488-3492

We define action keystates as the start or end of an actionthat contains information such as the human pose and time.Existing methods that forecast the human pose use recurrentnetworks that input and output a sequence of poses. In this pa-per, we present a method tailored for everyday pick and placeactions where the object of interest is known. In contrast toexisting methods, ours uses an input from a single timestep todirectly forecast (i) the key pose the instant the pick or placeaction is performed and (ii) the time it takes to get to the pre-dicted key pose. Experimental results show that our methodoutperforms the state-of-the-art for key pose forecasting andis comparable for time forecasting while running at least anorder of magnitude faster. Further ablative studies reveal thesignificance of the object of interest in enabling the total num-ber of parameters across all existing methods to be reduced byat least 90% without any degradation in performance.

Journal article

Zhang F, Demiris Y, 2022,

Learning garment manipulation policies toward robot-assisted dressing.

, Science Robotics, Vol: 7, Pages: eabm6010-eabm6010, ISSN: 2470-9476

Assistive robots have the potential to support people with disabilities in a variety of activities of daily living, such as dressing. People who have completely lost their upper limb movement functionality may benefit from robot-assisted dressing, which involves complex deformable garment manipulation. Here, we report a dressing pipeline intended for these people and experimentally validate it on a medical training manikin. The pipeline is composed of the robot grasping a hospital gown hung on a rail, fully unfolding the gown, navigating around a bed, and lifting up the user's arms in sequence to finally dress the user. To automate this pipeline, we address two fundamental challenges: first, learning manipulation policies to bring the garment from an uncertain state into a configuration that facilitates robust dressing; second, transferring the deformable object manipulation policies learned in simulation to real world to leverage cost-effective data generation. We tackle the first challenge by proposing an active pre-grasp manipulation approach that learns to isolate the garment grasping area before grasping. The approach combines prehensile and nonprehensile actions and thus alleviates grasping-only behavioral uncertainties. For the second challenge, we bridge the sim-to-real gap of deformable object policy transfer by approximating the simulator to real-world garment physics. A contrastive neural network is introduced to compare pairs of real and simulated garment observations, measure their physical similarity, and account for simulator parameters inaccuracies. The proposed method enables a dual-arm robot to put back-opening hospital gowns onto a medical manikin with a success rate of more than 90%.

Conference paper

Jang Y, Demiris Y, 2022,

Message passing framework for vision prediction stability in human robot interaction

, IEEE International Conference on Robotics and Automation 2022, Publisher: IEEE, ISSN: 2152-4092

In Human Robot Interaction (HRI) scenarios, robot systems would benefit from an understanding of the user's state, actions and their effects on the environments to enable better interactions. While there are specialised vision algorithms for different perceptual channels, such as objects, scenes, human pose, and human actions, it is worth considering how their interaction can help improve each other's output. In computer vision, individual prediction modules for these perceptual channels frequently produce noisy outputs due to the limited datasets used for training and the compartmentalisation of the perceptual channels, often resulting in noisy or unstable prediction outcomes. To stabilise vision prediction results in HRI, this paper presents a novel message passing framework that uses the memory of individual modules to correct each other's outputs. The proposed framework is designed utilising common-sense rules of physics (such as the law of gravity) to reduce noise while introducing a pipeline that helps to effectively improve the output of each other's modules. The proposed framework aims to analyse primitive human activities such as grasping an object in a video captured from the perspective of a robot. Experimental results show that the proposed framework significantly reduces the output noise of individual modules compared to the case of running independently. This pipeline can be used to measure human reactions when interacting with a robot in various HRI scenarios.

Abstract
Cite

Conference paper

Bin Razali MH, Demiris Y, 2022,

Using eye-gaze to forecast human pose in everyday pick and place actions

, IEEE International Conference on Robotics and Automation

Collaborative robots that operate alongside hu-mans require the ability to understand their intent and forecasttheir pose. Among the various indicators of intent, the eyegaze is particularly important as it signals action towards thegazed object. By observing a person’s gaze, one can effectivelypredict the object of interest and subsequently, forecast theperson’s pose. We leverage this and present a method thatforecasts the human pose using gaze information for everydaypick and place actions in a home environment. Our method firstattends to fixations to locate the coordinates of the object ofinterest before inputting said coordinates to a pose forecastingnetwork. Experiments on the MoGaze dataset show that ourgaze network lowers the errors of existing pose forecastingmethods and that incorporating prior in the form of textualinstructions further lowers the errors by a significant amount.Furthermore, the use of eye gaze now allows a simple multilayerperceptron network to directly forecast the keypose.

Abstract
Cite

Journal article

Quesada RC, Demiris Y, 2022,

Proactive robot assistance: affordance-aware augmented reality user interfaces

, IEEE Robotics and Automation magazine, Vol: 29, ISSN: 1070-9932

Assistive robots have the potential to increase the autonomy and quality of life of people with disabilities [1] . Their applications include rehabilitation robots, smart wheelchairs, companion robots, mobile manipulators, and educational robots [2] . However, designing an intuitive user interface (UI) for the control of assistive robots remains a challenge, as most UIs leverage traditional control interfaces, such as joysticks and keyboards, which might be challenging and even impossible for some users. Augmented reality (AR) UIs introduce more natural interactions between people and assistive robots, potentially reaching a more diverse user base.

Conference paper

Nunes UM, Demiris Y, 2022,

Kinematic Structure Estimation of Arbitrary Articulated Rigid Objects for Event Cameras

, Pages: 508-514, ISSN: 1050-4729

We propose a novel method that estimates the Kinematic Structure (KS) of arbitrary articulated rigid objects from event-based data. Event cameras are emerging sensors that asynchronously report brightness changes with a time resolution of microseconds, making them suitable candidates for motion-related perception. By assuming that an articulated rigid object is composed of body parts whose shape can be approximately described by a Gaussian distribution, we jointly segment the different parts by combining an adapted Bayesian inference approach and incremental event-based motion estimation. The respective KS is then generated based on the segmented parts and their respective biharmonic distance, which is estimated by building an affinity matrix of points sampled from the estimated Gaussian distributions. The method outperforms frame-based methods in sequences obtained by simulating events from video sequences and achieves a solid performance on new high-speed motions sequences, which frame-based KS estimation methods can not handle.

Abstract
Cite
Citations: 3

Conference paper

Candela E, Parada L, Marques L, Georgescu T-A, Demiris Y, Angeloudis Pet al., 2022,

Transferring Multi-Agent Reinforcement Learning Policies for Autonomous Driving using Sim-to-Real

, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Publisher: IEEE, Pages: 8814-8820, ISSN: 2153-0858

Cite
Citations: 26

Conference paper

Al-Hindawi A, Vizcaychipi MP, Demiris Y, 2021,

Continuous non-invasive eye tracking in intensive care

, 43rd Annual International Conference of the IEEE-Engineering-in-Medicine-and-Biology-Society (IEEE EMBC), Publisher: IEEE, Pages: 1869-1873, ISSN: 1557-170X

Delirium, an acute confusional state, is a common occurrence in Intensive Care Units (ICUs). Patients who develop delirium have globally worse outcomes than those who do not and thus the diagnosis of delirium is of importance. Current diagnostic methods have several limitations leading to the suggestion of eye-tracking for its diagnosis through in-attention. To ascertain the requirements for an eye-tracking system in an adult ICU, measurements were carried out at Chelsea & Westminster Hospital NHS Foundation Trust. Clinical criteria guided empirical requirements of invasiveness and calibration methods while accuracy and precision were measured. A non-invasive system was then developed utilising a patient-facing RGB camera and a scene-facing RGBD camera. The system’s performance was measured in a replicated laboratory environment with healthy volunteers revealing an accuracy and precision that outperforms what is required while simultaneously being non-invasive and calibration-free The system was then deployed as part of CONfuSED, a clinical feasibility study where we report aggregated data from 5 patients as well as the acceptability of the system to bedside nursing staff. To the best of our knowledge, the system is the first eye-tracking systems to be deployed in an ICU for delirium monitoring.

Conference paper

Nunes UM, Demiris Y, 2021,

Live demonstration: incremental motion estimation for event-based cameras by dispersion minimisation

, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Publisher: IEEE COMPUTER SOC, Pages: 1322-1323, ISSN: 2160-7508

Live demonstration setup. (Left) The setup consists of a DAVIS346B event camera connected to a standard consumer laptop and undergoes some motion. (Right) The motion estimates are plotted in red and, for rotation-like motions, the angular velocities provided by the camera IMU are also plotted in blue. This plot exemplifies an event camera undergoing large rotational motions (up to ~ 1000 deg/s) around the (a) x-axis, (b) y-axis and (c) z-axis. Overall, the incremental motion estimation method follows the IMU measurements. Optionally, the resultant global optical flow can also be shown, as well as the corresponding generated events by accumulating them onto the image plane (bottom left corner).

Conference paper

Chacon-Quesada R, Demiris Y, 2021,

Augmented reality eser interfaces for heterogeneous multirobot control

, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Publisher: IEEE, Pages: 11439-11444, ISSN: 2153-0858

Recent advances in the design of head-mounted augmented reality (AR) interfaces for assistive human-robot interaction (HRI) have allowed untrained users to rapidly and fluently control single-robot platforms. In this paper, we investigate how such interfaces transfer onto multirobot architectures, as several assistive robotics applications need to be distributed among robots that are different both physically and in terms of software. As part of this investigation, we introduce a novel head-mounted AR interface for heterogeneous multirobot control. This interface generates and displays dynamic joint-affordance signifiers, i.e. signifiers that combine and show multiple actions from different robots that can be applied simultaneously to an object. We present a user study with 15 participants analysing the effects of our approach on their perceived fluency. Participants were given the task of filling-out a cup with water making use of a multirobot platform. Our results show a clear improvement in standard HRI fluency metrics when users applied dynamic joint-affordance signifiers, as opposed to a sequence of independent actions.

Journal article

Girbes-Juan V, Schettino V, Demiris Y, Tornero Jet al., 2021,

Haptic and Visual Feedback Assistance for Dual-Arm Robot Teleoperation in Surface Conditioning Tasks

, IEEE TRANSACTIONS ON HAPTICS, Vol: 14, Pages: 44-56, ISSN: 1939-1412

Conference paper

Tian Y, Balntas V, Ng T, Barroso-Laguna A, Demiris Y, Mikolajczyk Ket al., 2021,

D2D: Keypoint Extraction with Describe to Detect Approach

, 15th Asian Conference on Computer Vision-ACCV-Biennial, Publisher: SPRINGER INTERNATIONAL PUBLISHING AG, Pages: 223-240, ISSN: 3004-9946

Cite
Citations: 3

Conference paper

Behrens JK, Nazarczuk M, Stepanova K, Hoffmann M, Demiris Y, Mikolajczyk Ket al., 2021,

Embodied Reasoning for Discovering Object Properties via Manipulation

, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 10139-10145, ISSN: 1050-4729

Cite
Citations: 2

Journal article

Fischer T, Demiris Y, 2020,

Computational modelling of embodied visual perspective-taking

, IEEE Transactions on Cognitive and Developmental Systems, Vol: 12, Pages: 723-732, ISSN: 2379-8920

Humans are inherently social beings that benefit from their perceptional capability to embody another point of view, typically referred to as perspective-taking. Perspective-taking is an essential feature in our daily interactions and is pivotal for human development. However, much remains unknown about the precise mechanisms that underlie perspective-taking. Here we show that formalizing perspective-taking in a computational model can detail the embodied mechanisms employed by humans in perspective-taking. The model's main building block is a set of action primitives that are passed through a forward model. The model employs a process that selects a subset of action primitives to be passed through the forward model to reduce the response time. The model demonstrates results that mimic those captured by human data, including (i) response times differences caused by the angular disparity between the perspective-taker and the other agent, (ii) the impact of task-irrelevant body posture variations in perspective-taking, and (iii) differences in the perspective-taking strategy between individuals. Our results provide support for the hypothesis that perspective-taking is a mental simulation of the physical movements that are required to match another person's visual viewpoint. Furthermore, the model provides several testable predictions, including the prediction that forced early responses lead to an egocentric bias and that a selection process introduces dependencies between two consecutive trials. Our results indicate potential links between perspective-taking and other essential perceptional and cognitive mechanisms, such as active vision and autobiographical memories.

Conference paper

Goncalves Nunes UM, Demiris Y, 2020,

Entropy minimisation framework for event-based vision model estimation

, 16th European Conference on Computer Vision 2020, Publisher: Springer, Pages: 161-176

We propose a novel Entropy Minimisation (EMin) frame-work for event-based vision model estimation. The framework extendsprevious event-based motion compensation algorithms to handle modelswhose outputs have arbitrary dimensions. The main motivation comesfrom estimating motion from events directly in 3D space (e.g.eventsaugmented with depth), without projecting them onto an image plane.This is achieved by modelling the event alignment according to candidateparameters and minimising the resultant dispersion. We provide a familyof suitable entropy loss functions and an efficient approximation whosecomplexity is only linear with the number of events (e.g.the complexitydoes not depend on the number of image pixels). The framework is eval-uated on several motion estimation problems, including optical flow androtational motion. As proof of concept, we also test our framework on6-DOF estimation by performing the optimisation directly in 3D space.

Conference paper

Zhang F, Demiris Y, 2020,

Learning grasping points for garment manipulation in robot-assisted dressing

, 2020 IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 9114-9120

Assistive robots have the potential to provide tremendous support for disabled and elderly people in their daily dressing activities. Recent studies on robot-assisted dressing usually simplify the setup of the initial robot configuration by manually attaching the garments on the robot end-effector and positioning them close to the user's arm. A fundamental challenge in automating such a process for robots is computing suitable grasping points on garments that facilitate robotic manipulation. In this paper, we address this problem by introducing a supervised deep neural network to locate a predefined grasping point on the garment, using depth images for their invariance to color and texture. To reduce the amount of real data required, which is costly to collect, we leverage the power of simulation to produce large amounts of labeled data. The network is jointly trained with synthetic datasets of depth images and a limited amount of real data. We introduce a robot-assisted dressing system that combines the grasping point prediction method, with a grasping and manipulation strategy which takes grasping orientation computation and robot-garment collision avoidance into account. The experimental results demonstrate that our method is capable of yielding accurate grasping point estimations. The proposed dressing system enables the Baxter robot to autonomously grasp a hospital gown hung on a rail, bring it close to the user and successfully dress the upper-body.

Journal article

Gao Y, Chang HJ, Demiris Y, 2020,

User modelling using multimodal information for personalised dressing assistance

, IEEE Access, Vol: 8, Pages: 45700-45714, ISSN: 2169-3536

Conference paper

Zolotas M, Demiris Y, 2020,

Towards explainable shared control using augmented reality

, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019), Publisher: IEEE, Pages: 3020-3026

Shared control plays a pivotal role in establishing effective human-robot interactions. Traditional control-sharing methods strive to complement a human’s capabilities at safely completing a task, and thereby rely on users forming a mental model of the expected robot behaviour. However, these methods can often bewilder or frustrate users whenever their actions do not elicit the intended system response, forming a misalignment between the respective internal models of the robot and human. To resolve this model misalignment, we introduce Explainable Shared Control as a paradigm in which assistance and information feedback are jointly considered. Augmented reality is presented as an integral component of this paradigm, by visually unveiling the robot’s inner workings to human operators. Explainable Shared Control is instantiated and tested for assistive navigation in a setup involving a robotic wheelchair and a Microsoft HoloLens with add-on eye tracking. Experimental results indicate that the introduced paradigm facilitates transparent assistance by improving recovery times from adverse events associated with model misalignment.

Conference paper

Chacon-Quesada R, Demiris Y, 2020,

Augmented reality controlled smart wheelchair using dynamic signifiers for affordance representation

, 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Publisher: IEEE

The design of augmented reality interfaces for people with mobility impairments is a novel area with great potential, as well as multiple outstanding research challenges. In this paper we present an augmented reality user interface for controlling a smart wheelchair with a head-mounted display to provide assistance for mobility restricted people. Our motivation is to reduce the cognitive requirements needed to control a smart wheelchair. A key element of our platform is the ability to control the smart wheelchair using the concepts of affordances and signifiers. In addition to the technical details of our platform, we present a baseline study by evaluating our platform through user-trials of able-bodied individuals and two different affordances: 1) Door Go Through and 2) People Approach. To present these affordances to the user, we evaluated fixed symbol based signifiers versus our novel dynamic signifiers in terms of ease to understand the suggested actions and its relation with the objects. Our results show a clear preference for dynamic signifiers. In addition, we show that the task load reported by participants is lower when controlling the smart wheelchair with our augmented reality user interface compared to using the joystick, which is consistent with their qualitative answers.

Journal article

Zambelli M, Cully A, Demiris Y, 2020,

Multimodal representation models for prediction and control from partial information

, Robotics and Autonomous Systems, Vol: 123, ISSN: 0921-8890

Similar to humans, robots benefit from interacting with their environment through a number of different sensor modalities, such as vision, touch, sound. However, learning from different sensor modalities is difficult, because the learning model must be able to handle diverse types of signals, and learn a coherent representation even when parts of the sensor inputs are missing. In this paper, a multimodal variational autoencoder is proposed to enable an iCub humanoid robot to learn representations of its sensorimotor capabilities from different sensor modalities. The proposed model is able to (1) reconstruct missing sensory modalities, (2) predict the sensorimotor state of self and the visual trajectories of other agents actions, and (3) control the agent to imitate an observed visual trajectory. Also, the proposed multimodal variational autoencoder can capture the kinematic redundancy of the robot motion through the learned probability distribution. Training multimodal models is not trivial due to the combinatorial complexity given by the possibility of missing modalities. We propose a strategy to train multimodal models, which successfully achieves improved performance of different reconstruction models. Finally, extensive experiments have been carried out using an iCub humanoid robot, showing high performance in multiple reconstruction, prediction and imitation tasks.

Abstract
Cite

Conference paper

Buizza C, Fischer T, Demiris Y, 2020,

Real-time multi-person pose tracking using data assimilation

, IEEE Winter Conference on Applications of Computer Vision, Publisher: IEEE

We propose a framework for the integration of data assimilation and machine learning methods in human pose estimation, with the aim of enabling any pose estimation method to be run in real-time, whilst also increasing consistency and accuracy. Data assimilation and machine learning are complementary methods: the former allows us to make use of information about the underlying dynamics of a system but lacks the flexibility of a data-based model, which we can instead obtain with the latter. Our framework presents a real-time tracking module for any single or multi-person pose estimation system. Specifically, tracking is performed by a number of Kalman filters initiated for each new person appearing in a motion sequence. This permits tracking of multiple skeletons and reduces the frequency that computationally expensive pose estimation has to be run, enabling online pose tracking. The module tracks for N frames while the pose estimates are calculated for frame (N+1). This also results in increased consistency of person identification and reduced inaccuracies due to missing joint locations and inversion of left-and right-side joints.

Conference paper

Cortacero K, Fischer T, Demiris Y, 2019,

RT-BENE: A Dataset and Baselines for Real-Time Blink Estimation in Natural Environments

, IEEE International Conference on Computer Vision Workshops, Publisher: Institute of Electrical and Electronics Engineers Inc.

In recent years gaze estimation methods have made substantial progress, driven by the numerous application areas including human-robot interaction, visual attention estimation and foveated rendering for virtual reality headsets. However, many gaze estimation methods typically assume that the subject's eyes are open; for closed eyes, these methods provide irregular gaze estimates. Here, we address this assumption by first introducing a new open-sourced dataset with annotations of the eye-openness of more than 200,000 eye images, including more than 10,000 images where the eyes are closed. We further present baseline methods that allow for blink detection using convolutional neural networks. In extensive experiments, we show that the proposed baselines perform favourably in terms of precision and recall. We further incorporate our proposed RT-BENE baselines in the recently presented RT-GENE gaze estimation framework where it provides a real-time inference of the openness of the eyes. We argue that our work will benefit both gaze estimation and blink estimation methods, and we take steps towards unifying these methods.

Abstract
Cite

Search or filter publications

Filter by type:

Filter by year:

Results

Search results

DanceMVP: Self-Supervised Learning for Multi-Task Primitive-Based Dance Performance Assessment via Transformer Text Prompting

Disentangled sequence clustering for human intention inference

Message passing framework for vision prediction stability in human robot interaction

Using eye-gaze to forecast human pose in everyday pick and place actions

Real-time multi-person pose tracking using data assimilation

RT-BENE: A Dataset and Baselines for Real-Time Blink Estimation in Natural Environments

­