Abstract

Space robot teleoperation is an important technology in space human-robot interaction and collaboration. Hand-based visual teleoperation makes the operation more natural and convenient. Fast and accurate hand detection is one of the most difficult and important problems in hand-based space robot teleoperation. In this work, we propose a fast and accurate hand detection method using a spatial-channel attention single shot multibox detector (SCA-SSD). Our method adopts and improves the SSD framework by introducing spatial-channel attention with feature fusion. To enlarge the restricted receptive field of the shallow layers, two shallow layers are fused with deep layers through feature fusion modules, and spatial attention and channel-wise attention are used to extract more effective features. This method not only eases the computational burden but also brings in more contextual information. To evaluate the effectiveness of the proposed method, experiments are conducted on public datasets and a custom astronaut hand detection (AHD) dataset. The results show that our method improves hand detection accuracy by 2.7% over the original SSD with a speed drop of only 15 FPS. In addition, a space robot teleoperation experiment shows that our hand detection method can be readily applied in a space robot teleoperation system.

1. Introduction

Due to the limited intelligence of space robots, space human-robot interaction plays an important role in space tasks [1]. Teleoperation is one of the most widely used space human-robot interaction methods [2]. Teleoperation does not depend on highly intelligent space robots; it effectively combines human decision-making ability with the precise operation ability of space robots to improve their overall capability. Several kinds of devices exist for space teleoperation. Traditional devices, such as haptic feedback controllers [3–5], offer stable and robust performance but lack convenience. Hand-based teleoperation devices [6], such as data gloves [7–9] and surface electromyography (sEMG) wristbands [10–12], are convenient to use. However, because they are wearable devices, their performance varies greatly across users, so complex calibration is required before use. Hand-based visual teleoperation [6] is an emerging teleoperation method with the advantages of being noncontact, natural, and convenient.

Hand detection is an important and difficult issue in hand-based visual teleoperation for three reasons. (1) Space robot teleoperation requires real-time and robust operation, so hand detection must be both fast and accurate. (2) Complex backgrounds and changing illumination inside and outside the space station cabin make astronaut hands difficult to detect and locate. (3) The hand is a small object, and small object detection has always been a difficult problem in computer vision.

To deal with the above problems, a fast and accurate hand detection method is proposed in this paper. The SSD framework [13] is used to design the hand detector because of its good balance of speed and precision and its ease of structural modification. However, the SSD is not good at detecting small objects, because it uses shallow layers to detect them, and shallow layers have enough contextual information but lack semantic information. To address this lack of semantic information in the shallow layers, a multiattention module with feature fusion (MA-FF) is proposed to combine shallow layers with deep layers. The multiattention module extracts channel attention features from deep, low-resolution feature maps and spatial attention features from high-resolution layers. The feature fusion module then fuses these features to obtain new shallow feature maps with sufficient contextual and semantic information.

The main contributions and innovations are as follows. (1) A spatial-channel attention SSD (SCA-SSD) is proposed for fast and accurate hand detection. The detection layers in the SSD structure are visualized to find out which layers play the most important role in small object detection, and these layers are improved and fused with deep layers. A multiattention module with feature fusion (MA-FF) is proposed, which includes a channel attention branch, a spatial attention branch, and a feature fusion branch. (2) A custom astronaut hand detection (AHD) dataset is designed. This dataset collects a large number of astronaut hand images and is used to verify hand detection for space robot teleoperation. (3) Experiments on hand detection datasets prove that the proposed SCA-SSD achieves fast and accurate hand detection, outperforming several state-of-the-art methods. Experiments on the space robot teleoperation platform prove that the designed hand detector works well in hand-based space robot teleoperation.

The rest of this paper is structured as follows. Section 2 reviews prior work on hand-based robot teleoperation and hand detection methods. In Section 3, we first describe and visualize the original SSD and then elaborate the structural details of the proposed hand detection method. In Section 4, we provide the results of ablation and comparative experiments on public datasets and the custom AHD dataset, as well as an application experiment on a hand-based space robot teleoperation platform. Finally, conclusions and future work are given in Section 5.

2. Related Work

2.1. Hand-Based Robot Teleoperation

The hand-based robot teleoperation methods include contact and noncontact methods. The main contact methods are haptic feedback-based, sEMG-based, and data glove-based methods. Haptic feedback-based teleoperation [3–5] is a traditional teleoperation method. It transmits the 6-DoF position and orientation of the human hand to the robot through a haptic feedback controller. For example, the da Vinci surgical telemanipulator [3] transmits the dual-hand motion of the chief surgeon through two main joysticks to control the instruments and a 3D high-definition endoscope. The principle of sEMG-based teleoperation [10–12] is that when the hand moves, the arm generates corresponding motor neuron information, which can be obtained by decoding the sEMG signal. For example, Raspopovic et al. [10] used sEMG equipment to collect sEMG signals of hand gestures and used these gestures to control a dexterous hand. Data glove-based teleoperation [7–9] uses curvature sensors to collect the bending degrees of the fingers and the posture change of the entire hand in order to decode its movement. Fang et al. [7] designed a novel data glove to control a robotic hand-arm teleoperation system. The above contact teleoperation methods lack robustness across different users. Visual teleoperation is robust to different users thanks to its noncontact nature [14–16]. For example, Li et al. [14] designed a mobile robot hand-arm teleoperation system using vision and an IMU. Handa et al. [15] designed a vision-based teleoperation method for a dexterous robotic hand-arm system. Table 1 shows a comparison and summary of the above hand-based robot teleoperation methods.

2.2. Hand Detection Methods

Traditional visual hand detection methods [17] mainly include skin color-based, motion flow-based, and shape model-based hand detection. These methods extract only shallow hand information and are therefore subject to many conditions. Nowadays, deep learning-based hand detection methods achieve better detection performance in complex environments [18–20]. Hand detection can be regarded as a kind of object detection. Typical deep learning-based object detectors include the R-CNN series [21, 22], YOLO series [23–25], and SSD series [13, 26, 27]. Among them, the SSD is a lightweight one-stage network that balances the speed and accuracy trade-off and is easy to modify. For example, Gao et al. [18] designed a feature-map-fused SSD for robust real-time hand detection and localization and also used SSD and body pose estimation for dual-hand detection [19]. Yu et al. designed a deep temporal model-based identity-aware hand detector using the SSD framework for space human-robot interaction [20]. However, the SSD is stuck in the speed-accuracy dilemma for small object detection. Some useful methods and tricks have been proposed to resolve this dilemma. DSSD [26] recovers higher-resolution features and adds them to the primary features through shortcut connections. FSSD [28], DF-SSD [29], RSSD [30], and ESSD [31] provide various feature fusion methods to add more contextual information to shallow feature maps. Table 2 shows a comparison and summary of the above hand detection methods.

3. Spatial-Channel Attention SSD

In this section, first, the original SSD is introduced and visualized. Then, the proposed SCA-SSD is introduced, which includes the multiattention module and the feature fusion module.

3.1. SSD Introduction and Visualization

In this subsection, the SSD architecture is introduced first. Then, the detection visualization in SSD is shown to find out which layers are suitable for improvement.

3.1.1. SSD Architecture

The SSD [13] is one of the outstanding one-stage detectors with high speed and accuracy. The architecture is shown in Figure 1. VGG-16 is used as its backbone, and several extra convolution layers on top of the network are used directly for localization and classification by convolutional filters. Unlike other detectors, SSD uses pyramidal multiresolution feature maps as detector inputs, which means it handles different object scales in feature maps of different resolutions. The SSD brings a significant improvement in speed because of its one-stage architecture. However, it cannot achieve high detection accuracy on small objects, because the shallow layers used for detection have rich contextual information but little semantic information, while the deep layers are the reverse. Small object detection needs enough semantic and contextual information because of the low resolution of small objects. Therefore, feature maps with enough semantic and contextual information should be designed for hand detection.

3.1.2. Detection Visualization in SSD

To find out which layers are suitable for improvement in small object detection, the feature maps used for object detection in SSD are visualized. We select one convolution layer as the input of the detector and block the other convolution layers, which means only one specific convolution layer is used to detect objects. The results are shown in Figure 2: small objects are more easily detected in the shallow layers (conv4_3 and conv7), and large objects are more easily detected in the deep layers (conv8_2, conv9_2, and conv10_2). This is because contextual information is vital to small object detection, and shallow layers have enough contextual information. However, due to the lack of semantic information, some small objects are missed in the conv4_3 and conv7 layers. Once an object is missed in the shallow layers, it has no chance of being detected in the subsequent deep layers. To increase the accuracy of small object detection, we propose the SCA-SSD: a multiattention module is employed on the conv4_3 and conv7 layers, which are then fused with conv8_2 and conv11_2, respectively. The details are presented below.
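One way to produce such per-layer feature visualizations is to register forward hooks on the layers of interest and average the captured activations over the channel dimension. The following is a minimal sketch with a toy backbone standing in for VGG-16; the hook mechanism is standard PyTorch, but the layer names and structure here are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

# Store captured activations by name.
activations = {}

def save_activation(name):
    def hook(module, inp, out):
        activations[name] = out.detach()
    return hook

# Toy backbone standing in for VGG-16 up to a "conv4_3"-like layer.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 512, 3, padding=1), nn.ReLU(),
)
backbone[2].register_forward_hook(save_activation("conv4_3"))

x = torch.randn(1, 3, 300, 300)   # SSD300-sized input
backbone(x)

# Average over channels to get a (1, H, W) heatmap of that layer.
heatmap = activations["conv4_3"].mean(dim=1)
print(heatmap.shape)
```

The resulting heatmap can then be resized to the input resolution and overlaid on the image to inspect which regions each layer responds to.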

3.2. SCA-SSD Architecture

In this subsection, the overview of the SCA-SSD is introduced first. Then, the multiattention module and feature fusion module are introduced, respectively.

3.2.1. Overview of SCA-SSD

The architecture of the proposed SCA-SSD is shown in Figure 3. The SCA-SSD reuses the multiscale, one-stage architecture of the original SSD. Two multiattention with feature fusion (MA-FF) modules are employed on the shallow layers conv4_3 and conv7, respectively. They use multiattention modules to extract channel and spatial features and feature fusion modules to fuse the two shallow layers (conv4_3 and conv7) with the deep layers (conv8_2 and conv11_2). Finally, the two new feature maps output by the MA-FF modules are mainly used for small object detection.

3.2.2. Multiattention Module

To address the lack of information in shallow layers, we propose a multiattention module with feature fusion; the improved structure for conv4_3 is shown in Figure 4 as an example. The design of the attention module is inspired by the bottleneck attention module (BAM) [32]. Specifically, the spatial attention branch Att_s is first employed after conv4_3 and conv7, respectively. Then, the channel attention branch Att_c is employed after conv11_2, whose resolution is so low that the global pooling operation in the squeeze stage can be skipped. Next, Att_s and Att_c are combined by an element-wise add operation to generate the cross-resolution spatial-channel attention, termed Att. Finally, the sigmoid is applied to Att to obtain the attention weights, which are multiplied with the corresponding feature map. For instance, as shown in Figure 4, the weighted Att is obtained from conv4_3, so it is multiplied and added with conv4_3.
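The combination step described above can be sketched in a few lines. This is a minimal sketch under assumed shapes (an SSD300-style conv4_3 map); the channel attention broadcasts over spatial positions, the sum is passed through a sigmoid, and the weighted map is added back to the input as a residual.

```python
import torch

B, C, H, W = 2, 512, 38, 38
feat = torch.randn(B, C, H, W)     # shallow feature map, e.g. conv4_3
att_s = torch.randn(B, C, H, W)    # output of the spatial attention branch
att_c = torch.randn(B, C, 1, 1)    # output of the channel attention branch

# Cross-resolution spatial-channel attention: element-wise add, then sigmoid.
att = torch.sigmoid(att_s + att_c)

# Reweight the feature map and add it back to the original (residual form).
out = feat * att + feat
print(out.shape)
```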

The spatial branch structure is shown in Figure 5(1). This branch follows an encoder-decoder structure, but we do not reduce the resolution of the feature map, in order to preserve more information. The branch consists of a convolution layer that reduces the number of channels, two dilated convolution layers that capture long-range information with a wide receptive field, and a final convolution layer that restores the number of channels to that of the input. In practice, each convolution layer and dilated convolution layer is followed by batch normalization and a ReLU activation function, except for the last convolution layer. Let the input feature map be F; the output can be expressed as

Att_s(F) = f^{1×1}(f_d^{3×3}(f_d^{3×3}(f^{1×1}(F)))),

where f^{k×k} denotes a convolution operation with a filter size of k×k and f_d^{3×3} denotes a 3×3 dilated convolution.

The structure of the channel branch is shown in Figure 5(2). In the squeeze stage, in order not to affect the values of the conv11_2 feature map, one convolution layer followed by ReLU is employed after conv11_2. In the excitation stage, two fully connected layers are used to blend the values across different channels. The outputs are then expanded to match the sizes of conv4_3 and conv7. Let the input feature map be F; the output can be expressed as

Att_c(F) = FC_2(FC_1(f^{1×1}(F))),

where f^{1×1} denotes the 1×1 convolution followed by ReLU and FC_1, FC_2 denote the two fully connected layers.
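The channel branch can be sketched as follows. The channel counts and reduction ratio are assumptions; since conv11_2 is already at 1×1 resolution, flattening it directly plays the role of the squeeze step, and the excitation output is expanded to the spatial size of the target shallow layer.

```python
import torch
import torch.nn as nn

class ChannelBranch(nn.Module):
    def __init__(self, in_ch=256, out_ch=512, reduction=16):
        super().__init__()
        # squeeze: 1x1 conv + ReLU that leaves the conv11_2 values intact in shape
        self.conv = nn.Sequential(nn.Conv2d(in_ch, in_ch, 1), nn.ReLU())
        # excitation: two fully connected layers blend values across channels
        self.fc = nn.Sequential(
            nn.Linear(in_ch, in_ch // reduction), nn.ReLU(),
            nn.Linear(in_ch // reduction, out_ch),
        )

    def forward(self, x, target_hw):
        y = self.conv(x).flatten(1)              # (B, C); input is 1x1 spatial
        y = self.fc(y)                           # (B, out_ch)
        # expand to the spatial size of the target shallow layer
        return y[:, :, None, None].expand(-1, -1, *target_hw)

att_c = ChannelBranch()(torch.randn(1, 256, 1, 1), (38, 38))
print(att_c.shape)
```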

3.2.3. Feature Fusion Module

Even though the multiattention module brings extra contextual information to the shallow layers, the spatial attention branch still has a drawback. The context is encoded as an attention mask, so its values are limited between zero and one. Multiplying the mask with the input feature map can enhance the information useful for detection; however, the context and long-range information encoded in the mask provide only weights. Therefore, to capture more context, a feature fusion module that can be embedded within the multiattention module is proposed, as shown in Figure 4.

In the feature fusion module, two deconvolution layers are employed to restore the resolution of the deep feature map so that it matches the sizes of conv7 and conv4_3. Our feature fusion module is inspired by DSSD [26], but only two deconvolution layers are used to avoid adding much computational burden. Each Deconv-n block includes a deconvolution layer and batch normalization (BN). After deconvolution, a fusion operation merges the reweighted feature map with the output of the deconvolution. Following the feature-fusion SSD [27], element-wise add is used as the fusion operation; it has been shown that element-wise add outperforms the concatenate operation in the feature-fusion SSD [27]. At the end of this module, a ReLU activation function is applied.
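A sketch of one Deconv-n block plus the element-wise fusion follows. The channel counts and the 19×19 → 38×38 upsampling are illustrative assumptions, not the paper's exact configuration; a stride-2 transposed convolution doubles the spatial size, and the upsampled deep map is added to the reweighted shallow map before the final ReLU.

```python
import torch
import torch.nn as nn

# Deconv-n block: transposed convolution followed by batch normalization.
deconv_block = nn.Sequential(
    nn.ConvTranspose2d(256, 512, kernel_size=2, stride=2),  # 19x19 -> 38x38
    nn.BatchNorm2d(512),
)

deep = torch.randn(1, 256, 19, 19)      # deep, low-resolution feature map
shallow = torch.randn(1, 512, 38, 38)   # reweighted shallow map (e.g. conv4_3)

# Element-wise add as the fusion operation, then ReLU at the module's end.
fused = torch.relu(deconv_block(deep) + shallow)
print(fused.shape)
```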

4. Experiments and Analysis

In this section, to compare performance with state-of-the-art object detection methods, experiments are first conducted on the Pascal VOC dataset [33]. Then, experiments are conducted on the Oxford hands dataset [34] to demonstrate the effectiveness of the proposed method on a public hand detection dataset. After that, the AHD dataset is introduced, and experiments are conducted on it to verify astronaut hand detection performance. The mean average precision (mAP) is adopted as the evaluation metric.

We implement the SCA-SSD in PyTorch [35]. The data augmentation follows SSD [13], and VGG-16 is used as the pretrained backbone. All experiments are performed on 4 NVIDIA RTX 2080 Ti GPUs.

4.1. Experiments on the Pascal VOC Dataset
4.1.1. Training

In the training stage, the batch size is set to 32, and the learning rate is initially set with a warm-up phase over the first 500 iterations. However, experiments showed that this default learning rate was too small. Instead, a larger learning rate is used, with a warm-up phase over the first 2800 iterations in which the rate increases gradually from a warm-up factor of 0.03333 times the base rate. Training runs for 140 k iterations in total, and the learning rate is divided by 10 at 84 k and 112 k iterations, which differs from the original SSD [13] but is similar to RFB-Net [36]. Following the trick in RFB-Net, the number of prior boxes in conv4_3 is increased to 6.
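The schedule described above can be sketched as follows. This is a hedged reconstruction: the linear ramp shape and the base learning rate are assumptions, since the text does not state them explicitly; only the warm-up factor (0.03333), the warm-up length (2800 iterations), and the decay milestones (84 k and 112 k) come from the text.

```python
def warmup_lr(base_lr, iteration, warmup_iters=2800, warmup_factor=0.03333):
    """Linearly ramp from warmup_factor * base_lr up to base_lr."""
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters
    return base_lr * (warmup_factor * (1 - alpha) + alpha)

def lr_at(base_lr, iteration):
    """Full schedule: warm-up, then divide by 10 at 84k and 112k iterations."""
    lr = warmup_lr(base_lr, iteration)
    for milestone in (84_000, 112_000):
        if iteration >= milestone:
            lr /= 10
    return lr
```

For example, with an assumed base rate of 1e-3, the rate starts near 3.3e-5, reaches 1e-3 at iteration 2800, and drops to 1e-4 and 1e-5 at the two milestones.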

4.1.2. Introduction of the Pascal VOC Dataset

The objects in the Pascal VOC 2007 dataset include 4 categories and 20 subcategories: vehicle (car, bus, bicycle, motorbike, airplane, boat, and train), household (chair, sofa, dining table, TV, bottle, and potted plant), animal (cat, dog, cow, horse, sheep, and bird), and person. The images are collected from Flickr and the Microsoft Research Cambridge (MSRC) dataset. The dataset includes 9,963 images containing 24,640 annotated objects.

4.1.3. Comparative Experiments

To demonstrate the performance of the proposed SCA-SSD, it is compared with other state-of-the-art methods. The results are shown in Table 3. For a fair comparison, the updated SSD [37] is used as our baseline; it achieves a 77.7% mAP on the VOC test set, slightly higher than the 77.2% mAP of the original SSD [13]. By employing the multiattention and fusion modules on the SSD, our method achieves a 79.9% mAP, which is 2.7% higher than the original SSD [13] and 2.2% higher than the baseline [37]. The SCA-SSD brings a significant improvement to SSD with the least impact on speed: it is only 15 FPS slower than the original SSD. The mAP of the SCA-SSD is even higher than that of SSD512, which has a higher input resolution (512×512) than the SCA-SSD (300×300). We also compare the proposed SCA-SSD with state-of-the-art object detection methods such as Faster R-CNN [21], YOLOv4 [25], R-FCN [38], and StairNet [39]. From Table 3, we can see that the SCA-SSD outperforms most of the state-of-the-art methods in both accuracy and speed. In addition, we also show the results of some SSD-series methods such as DSSD [26] and FSSD [28]. To the best of our knowledge, our SCA-SSD achieves the best performance among SSD-series methods. This proves that the proposed SCA-SSD achieves great object detection performance in both speed and accuracy.

4.2. Experiments on Oxford Hands Dataset
4.2.1. Introduction of the Oxford Hands Dataset

Hand detection differs from general object detection: hands are small and have changeable shapes. To better prove the performance of the SCA-SSD for hand detection, experiments are also conducted on a hand detection dataset. The Oxford hands dataset [34], a public hand detection dataset, is used for training and testing. In the dataset, a total of 13,050 hand instances are annotated. Hand instances larger than a fixed bounding-box area (1,500 sq. pixels) are considered "big" enough for detection and are used for evaluation, giving around 4,170 high-quality hand instances. In each image, all hands that can be perceived clearly by humans are annotated.

4.2.2. Ablation Experiment

To understand the SCA-SSD structure more deeply, several ablation experiments are conducted to show the effectiveness of each module of the network for hand detection. The results are summarized in Table 4. First, we add the channel attention and spatial attention modules to the baseline structure, respectively. The mAP increases by 1.6% and 1.3% compared with the baseline, while the speed drops by only 3 FPS, which proves that the proposed channel attention and spatial attention modules are effective for hand detection. Second, we remove the feature fusion module from the SCA-SSD, termed SCA-SSD w/o fusion. The mAP decreases from 44.6% to 43.8% compared with SCA-SSD w/ fusion, which indicates that the feature fusion module is effective for hand detection. The feature fusion module improves the mAP by 0.8% with little impact on inference speed, which stays above 100 FPS (104 FPS). In summary, the ablation results show that the proposed channel attention, spatial attention, and feature fusion modules all improve hand detection performance.

4.3. Experiments on AHD Dataset
4.3.1. AHD Dataset

To further verify the effectiveness of the designed SCA-SSD hand detector in hand-based space robot teleoperation, experiments on space-environment images should be conducted. Since no such hand detection dataset exists, we collect a set of astronaut hand images covering various intra/extravehicular activities from sci-fi movies and YouTube resources and name it the AHD dataset. The dataset includes a total of 2,000 images and more than 4,000 instances. All hands in the images are labelled as "hand."

4.3.2. Verification Experiment and Visualization

The AHD dataset is used only for verification. The hand detector trained on the Oxford hands dataset is verified on the AHD dataset, and the results are shown in Table 5. From Table 5, we can see that when the IoU threshold is 0.50:0.95, the hand detection accuracy is 0.69, and when the IoU threshold is 0.50, it is 0.88. This proves that the SCA-SSD hand detector achieves good performance on the AHD dataset. For small, medium, and large hand areas, the detection accuracies are 0.56, 0.62, and 0.81, respectively, which proves that the SCA-SSD hand detector performs well on hands of various sizes.
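The 0.50 and 0.50:0.95 thresholds above refer to the intersection-over-union overlap between a predicted and a ground-truth box. For reference, a minimal IoU computation for axis-aligned boxes in (x1, y1, x2, y2) form can be written as:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # overlap extents, clamped at zero when the boxes do not intersect
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # -> 0.3333333333333333
```

The 0.50:0.95 metric averages the precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05, which is why it is stricter than the single 0.50 threshold.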

To better show the results of astronaut hand detection, some of the resulting images are visualized in Figure 6. From Figure 6, we can see that the proposed SCA-SSD hand detector can detect astronauts' hands in various scenes.

4.4. Experiments on Space Robot Teleoperation Platform

The SCA-SSD hand detector is utilized in a designed space robot teleoperation platform, shown in Figure 7. The teleoperation platform includes a hand teleoperation space and a hand-arm robot motion space. A RealSense camera captures the astronaut's hands in real time. The SCA-SSD hand detector then detects hands in the RGB images, and the 2D hand positions are mapped to the corresponding depth images to obtain the 3D hand positions. The real-time hand positions in the hand teleoperation space are then transferred to the hand-arm robot motion space by the following mapping relationship:

P_r^t = P_r^{t-1} + k (P_h^t - P_h^{t-1}),

where P_r is the position of the robot end effector, P_h = (x, y, z) is the hand position in the camera coordinate system, (x_t, y_t, z_t) is the hand position in the t-th frame, (x_{t-1}, y_{t-1}, z_{t-1}) is the hand position in the (t-1)-th frame, and k is a scale factor set empirically in the teleoperation experiment.
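The incremental mapping described above can be sketched as follows: the robot end effector moves by the scaled inter-frame displacement of the detected hand. The function name, the tuple representation, and the value of `k` are illustrative assumptions.

```python
def update_robot_position(robot_pos, hand_prev, hand_curr, k=1.0):
    """Move the end effector by k times the hand's inter-frame displacement.

    robot_pos, hand_prev, hand_curr are (x, y, z) tuples; hand positions are
    in the camera coordinate system, the robot position in the robot frame.
    """
    return tuple(r + k * (hc - hp)
                 for r, hp, hc in zip(robot_pos, hand_prev, hand_curr))

# Example: hand moves +5 cm in x and -5 cm in z; with k = 0.5 the end
# effector moves half that distance in each axis.
robot = update_robot_position((0.0, 0.0, 0.0),
                              (0.10, 0.20, 0.30),
                              (0.15, 0.20, 0.25), k=0.5)
print(robot)
```

In a live loop, this update runs once per camera frame, so the robot trajectory tracks the hand trajectory scaled by `k`.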

By collecting the motion trajectories of hand in the camera coordinate system and robot end effector in the robot coordinate system, the trajectories are shown in Figure 8.

From Figure 8, we can see that the end effector of the robot can track the movement trajectory of the hand very well. And the maximum error is only 9.3 mm.

5. Conclusion and Future Work

In this work, a fast and accurate hand detection method was proposed using a spatial-channel attention single shot multibox detector (SCA-SSD), and the proposed hand detector was utilized in a hand-based space robot teleoperation system. Specifically, two shallow layers were fused with deep layers through feature fusion modules to enlarge the restricted receptive field of the shallow layers, and spatial attention and channel-wise attention were used to extract more effective features. This method not only eases the computational burden but also brings in more contextual information. The comparative, ablation, and verification experiments have proved the good performance of the proposed SCA-SSD hand detector. Finally, the experiment on the space robot teleoperation platform has demonstrated that the proposed SCA-SSD hand detector can be applied well in space robot teleoperation. The proposed hand detection and teleoperation method has some limitations. First, the method is trained only on public datasets, and due to their small sizes, the generalization ability of the hand detector is limited. Second, detection and localization of hands alone cannot fully control the space robots; subsequent recognition of hand gestures and poses is also required.

In the future, hand gesture recognition methods need further research to realize space robot teleoperation for complex tasks. In addition, skeleton-based hand detection and pose estimation also require further research to achieve more precise teleoperation.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (62006204 and 62103407) and partly supported by the Shenzhen Outstanding Scientific and Technological Innovation Talents Training Project (RCBS20210609104516043).