Two-Stage Transfer Learning for Heterogeneous Robot Detection and 3D Joint Position Estimation in a 2D Camera Image using CNN

Justinas Mišeikis1, Inka Brijačak2, Saeed Yahyanejad3, Kyrre Glette4, Ole Jakob Elle5, Jim Torresen6

Abstract - Collaborative robots are becoming more common on factory floors as well as in regular environments; however, their safety is still not a fully solved issue. Collision detection does not always perform as expected, and collision avoidance is still an active research area. Collision avoidance works well for fixed robot-camera setups, but if these are shifted around, the Eye-to-Hand calibration becomes invalid, making it difficult to accurately run many of the existing collision avoidance algorithms. We approach the problem by presenting a stand-alone system capable of detecting the robot and estimating its position, including individual joints, using a simple 2D colour image as an input, where no Eye-to-Hand calibration is needed. As an extension of previous work, a two-stage transfer learning approach is used to re-train a multi-objective convolutional neural network (CNN) to allow it to be used with heterogeneous robot arms. Our method is capable of detecting the robot in real-time, and new robot types can be added using significantly smaller training datasets compared to the requirements of a fully trained network. We present the data collection approach, the structure of the multi-objective CNN, the two-stage transfer learning training, and test results using real robots from Universal Robots, Kuka, and Franka Emika.

Finally, we analyse possible application areas of our method together with possible improvements.

I. INTRODUCTION

Collaborative robots are gaining popularity as an advanced version of traditional industrial robots. Not only are they capable of reliably performing high-precision complex movements repetitively without any fatigue or rest, but they are also claimed to be safe to operate around humans. Instead of fully separating them from people (e.g., using fences or light curtains), they are capable of sharing the same workspace with humans thanks to sophisticated collision detection systems. However, these systems do not always work as expected and might exert excess forces before stopping [1]. Furthermore, in some situations, such as a robot located in a surgical theatre, collisions are not acceptable, and full collision avoidance should be implemented. This coincides with the goals of the Industry 4.0 concept [2].

A crucial part of obstacle avoidance is getting real-time measurements of the workspace and the environment

1,4,5,6 Justinas Mišeikis, Kyrre Glette, Ole Jakob Elle and Jim Torresen are with the Department of Informatics, University of Oslo, Oslo, Norway

2,3 Inka Brijačak and Saeed Yahyanejad are with Joanneum Research - Robotics, Klagenfurt am Wörthersee, Austria

4,6 Kyrre Glette and Jim Torresen also have an affiliation with RITMO, University of Oslo

5 Ole Jakob Elle has his main affiliation with The Intervention Centre, Oslo University Hospital, Oslo, Norway. oelle@ous-hf.no

1,4,6 {justinm,kyrrehg,jimtoer}@ifi.uio.no

2Inka.Brijacak@joanneum.at

3Saeed.Yahyanejad@joanneum.at

around the robot. Such sensing can be done using a variety of sensors: laser scanners, mono and stereo vision, RGB-D cameras, ultrasound sensors and motion capture systems.

Fig. 1. Robot manipulators used in our experiments: (a) UR3, UR5, UR10; (b) KUKA LBR iiwa; (c) Franka Emika Panda.

Even with advanced sensing systems, the problem remains that a reliable calibration between the sensors and the robot is required - the so-called Eye-to-Hand calibration [3]. Such a calibration maps the coordinate frames of the robot and the vision sensors into a common coordinate frame. As a result, the position of an obstacle detected by one of the sensors can be easily calculated from the point of view of the robot, and the necessary action can be taken to avoid it. There are reliable and even automatic ways of performing Eye-to-Hand calibration; however, if any of the sensors is unexpectedly moved in relation to the robot, and this is unaccounted for, the calibration becomes invalid and the system might malfunction [4]. This can be an issue in dynamic environments like a surgical theatre, where there is a lot of human movement and the equipment is constantly shifted around. Similar works and research on dynamic obstacle avoidance for robot arms normally require a fully-calibrated robot-camera system, which can be a challenge in non-static configuration setups [5] [6].
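To make the role of Eye-to-Hand calibration concrete, the short sketch below (our illustration, not part of the original paper; the transform `T_robot_camera` and all values are hypothetical) shows how a detected obstacle point is mapped from the camera frame into the robot base frame with a single 4x4 homogeneous transform.

```python
import numpy as np

def to_homogeneous(p):
    """Append a 1 so a 3D point can be multiplied by a 4x4 transform."""
    return np.append(p, 1.0)

# Hypothetical Eye-to-Hand calibration result: camera frame -> robot base frame.
T_robot_camera = np.eye(4)
T_robot_camera[:3, 3] = [0.5, -0.2, 1.0]   # translation in metres (rotation kept as identity)

# Obstacle position detected in the camera frame.
p_camera = np.array([0.1, 0.0, 1.5])

# Express the obstacle in the robot base frame so the robot can avoid it.
p_robot = (T_robot_camera @ to_homogeneous(p_camera))[:3]
print(p_robot)   # [ 0.6 -0.2  2.5]
```

If the camera is moved without updating `T_robot_camera`, the same multiplication silently returns a wrong position, which is exactly the failure mode that motivates the calibration-free approach presented here.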

We have shown that a transfer learning approach can be used to adapt a system trained to recognise and estimate the position of the robot base and joints from one robot model to a new, unseen one using a limited amount of data [7] [8].

We base our work on a multi-objective CNN previously trained on Universal Robots (UR) and extend it in the following manner. Instead of adapting the network to a new robot type, we adjust the CNN to incorporate new robot types while still being able to recognise previously trained robots. Eventually, the proposed system is capable of identifying 5 different robots. Furthermore, with the help of a motion capture system tracking the camera, we collected complex training datasets with the camera being moved around in an unconstrained manner, obtaining a variety of viewing angles of the robots in front of complex backgrounds. A more thorough analysis also shows the impact on the accuracy depending on the distance between the camera and the robot.

This paper is organized as follows. First, we provide an overview of related work in Section II. We present the system setup and dataset collection in Section III. Then, we explain the proposed method and CNN structure and configuration in Section IV and the transfer learning procedure in Section V.

We provide experiments and results in Section VI, followed by relevant conclusions and future work in Section VII.

II. RELATED WORK

With the recent deep learning revolution in computer vision, especially for classification tasks such as ImageNet, it has been proven that it is possible to learn to identify objects in difficult environments and conditions [9].

In order to train a deep learning network, large amounts of training data with precisely marked ground truth are needed. Collecting such training datasets can be a time-consuming task. However, a transfer learning approach is useful when a fully trained system exists for one type of problem and can be adapted for different datasets by adjusting some of the parameters of the network while keeping the others fixed [9]. This has been proven to work for mid-level image representations in object classification, using a network pre-trained on natural images and adapting it for medical image recognition and even emotion recognition [10] [11] [12]. Another interesting application of transfer learning is to take a network fully trained on night-time satellite imagery of poverty areas and adapt it to recognise poverty areas from daytime satellite imagery [13].

Furthermore, detailed analyses of transfer learning approaches have been made, with surveys of the techniques used and of various CNN structures [14] [15].

Moreover, CNN-based work in the field of 2D human pose estimation [16], known as OpenPose, allowed further improvements on 3D human pose estimation with the help of a depth sensor [17]. The accuracy for a human keypoint in 3D is around 11 cm, mainly due to the inaccuracy of the depth sensor, which grows with distance from the sensor.

On the other hand, many purely geometrical techniques have been employed to determine the position and orientation of an object from a single image by using some prior knowledge about the target object [18] [19]. In general, these methods try to find patterns and features, such as edges and corners, which match the expected model, and accordingly estimate the position and orientation. Other researchers exploited the existence of a 3D model, such as a CAD model [20] [21], to increase the accuracy of the estimation. Although the precision of their methods is higher compared to our CNN-based method, they suffer from a major drawback: they can only work with solid and rigid objects, which clearly does not apply to robot manipulators. Another problem is the necessity of having a 3D model available beforehand, which in our method is substituted with the training procedure. However, our method performs more robustly in the case of deviations from the model due to physical damage or attached end-effectors, and it can also use the image colour information which is normally missing in a 3D model.

III. SYSTEM SETUP AND DATASET COLLECTION

Deep learning networks are capable of robustly recognising objects in complex backgrounds, but in order to achieve good performance, a large amount of precisely marked and diverse training data is needed. Considering the setup of three heterogeneous robotic manipulators, a system had to be set up to generate training data with accurate ground truth marked automatically, given that manual ground truth generation for such datasets would take a significant amount of time and effort.

Our setup consists of the following three robot types:

Universal Robots: UR3, UR5, UR10, 6 DoF, Figure 1(a)

KUKA LBR iiwa - 7 R800, 7 DoF, Figure 1(b)

Franka Emika - Panda, 7 DoF, Figure 1(c)

Fig. 2. Setup with the Optitrack Motion Capture System

This two-stage transfer learning work uses, as its basis, an already trained multi-objective CNN [7], which was trained on datasets containing all three robot models from UR. Datasets containing the KUKA LBR iiwa were previously collected for our one-stage transfer learning project [8]. All these datasets were collected using a Kinect V2 sensor, with the necessary Eye-to-Hand calibration [22] performed every time the position of the camera changed relative to the robot base, in order to achieve a high variety of backgrounds.

TABLE I. Dataset summary describing the number of samples collected for each robot type.

Robot Type       | Number of Datasets | Total Number of Samples
Universal Robots | 9                  | 4350
Kuka LBR iiwa    | 14                 | 1837
Franka Panda     | 5                  | 2513

Table I summarizes all the datasets collected for each robot, where the recordings differ by camera placement relative to the robot, illumination, and background.

New datasets containing the Franka Emika Panda robot were recorded with a free-moving Intel RealSense R200 RGB-D camera instead of the Kinect V2. Since Eye-to-Hand calibration is only valid for fixed camera setups, we could not use this method for the camera-to-robot coordinate transformation measurements. Instead of performing Eye-to-Hand calibration, we placed Optitrack (Motion Capture System) [23] markers on the moving camera and around the base of the robot in order to bring both systems into the single coordinate frame of Optitrack (Figure 2).


Fig. 3. Samples from the collected robot datasets. Robots used are Universal Robots (silver-blue), Kuka LBR iiwa (silver-orange) and Franka Emika - Panda (white-black). A variety of robot configurations, camera movements and angles as well as lighting conditions were used. In some cases, even other robots in the background were present.

Since the Optitrack marker (Rigid-Body or Rig) origin is not exactly aligned with the camera's optical frame origin, an extrinsic calibration was performed as described in [24], by observing and detecting one additional rig, fixed in the Optitrack frame, with our RGB camera from multiple positions. Example frames taken from the whole dataset can be seen in Figure 3.
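As an illustration of how the motion capture system replaces Eye-to-Hand calibration, the sketch below (a simplified sketch assuming 4x4 rigid-body transforms; the names and values are hypothetical) chains the Optitrack poses of the camera rig and the robot-base rig with the fixed rig-to-optical-frame offset to obtain a camera-to-robot-base transform for every frame.

```python
import numpy as np

def invert(T):
    """Invert a 4x4 rigid-body transform."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

# Poses reported by the motion capture system for the two tracked rigs
# (identity rotations and made-up translations, purely for illustration).
T_opti_camrig = np.eye(4); T_opti_camrig[:3, 3] = [1.0, 0.5, 1.2]   # camera rig in the Optitrack frame
T_opti_base = np.eye(4);   T_opti_base[:3, 3] = [0.0, 0.0, 0.8]     # robot-base rig in the Optitrack frame

# Fixed offset between the tracked camera rig and the camera optical frame,
# obtained once by the extrinsic calibration described in [24].
T_camrig_optical = np.eye(4)

# Camera optical frame -> robot base, recomputed for every frame as the camera moves.
T_base_optical = invert(T_opti_base) @ T_opti_camrig @ T_camrig_optical
print(T_base_optical[:3, 3])   # [1.  0.5 0.4]
```

Because both rigs are tracked continuously, the transform stays valid even when the camera is carried around freely.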

Once all the transformations are connected in one coordinate system, a precise robot mask, separating the robot body from the background when overlaid on a colour image, can be calculated and used as ground truth for teaching the CNN. This is done automatically using ROS with the MoveIt package [25]. The robot model, taken from the Unified Robot Description Format (URDF) files and mesh files provided by the robot manufacturers, is updated with live information from the robot's joint encoder readings to create the robot shape in real-time [26]. This shape is transformed into the depth camera's coordinate frame and a mask image is constructed.
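The actual mask rendering is done through ROS and MoveIt as described above; purely as a conceptual sketch (not the authors' pipeline), the snippet below paints a binary mask by projecting sampled 3D surface points of the robot model, already expressed in the camera frame, through a pinhole camera model. The intrinsics and the point cloud are made up for illustration.

```python
import numpy as np

def project_points_to_mask(points_cam, K, height, width):
    """Paint a binary mask from 3D points given in the camera frame.

    points_cam : (N, 3) surface points of the robot model (camera frame, metres)
    K          : (3, 3) pinhole intrinsic matrix
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    in_front = points_cam[:, 2] > 0              # keep points in front of the camera
    pts = points_cam[in_front]
    uv = (K @ pts.T).T                           # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]                  # normalise by depth
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    mask[v[valid], u[valid]] = 1
    return mask

# Hypothetical intrinsics for a 480x360 image and a dummy point cloud.
K = np.array([[300.0, 0.0, 240.0], [0.0, 300.0, 180.0], [0.0, 0.0, 1.0]])
points = np.random.uniform([-0.3, -0.3, 0.8], [0.3, 0.3, 1.5], size=(5000, 3))
mask = project_points_to_mask(points, K, height=360, width=480)
```

In practice, the URDF meshes are rendered densely rather than projected as sparse points, so this only illustrates the geometry involved.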

In order to ensure the robustness of the system, the robot movements were programmed so that each robot joint is moved through its full range of motion in combination with the other joints, taking into account self-collision and table collision avoidance. A trigger signal is used to save the data after each robot movement. With each trigger, we save the camera colour image, the robot joint and Cartesian coordinates, as well as the ground-truth robot mask image. Moreover, to ensure a perfect overlap between colour and depth images, the internal extrinsic camera calibration was used. All the input images are also rectified and have a resolution of 480×360 pixels. The testing and validation sets were divided by the ratios of 80% and 20%, respectively, based on random sampling.
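A minimal sketch of the random 80/20 split mentioned above (our illustration; the paper does not describe the exact implementation):

```python
import numpy as np

def split_indices(num_samples, first_ratio=0.8, seed=0):
    """Randomly split sample indices into two disjoint sets with the given ratio."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)
    cut = int(first_ratio * num_samples)
    return order[:cut], order[cut:]

# Example with the UR dataset size from Table I.
first_idx, second_idx = split_indices(4350)
```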

IV. CNN STRUCTURE AND CONFIGURATION

The structure of the multi-objective CNN is identical to our previous work, where it was trained on UR robots [7]. The network trains for multiple outputs simultaneously by taking a single 2D colour image as an input. The network in this paper is trained on four objectives: the robot mask, the robot type, the 3D robot base position and the 3D positions of the robot joints.

The network has multiple branches, with some of the convolutional layers shared and then branched off to optimise specifically for each of the objectives. The structure of the multi-objective CNN can be seen in Figure 4.
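The exact layer configuration is given in Figure 4. Purely as an illustration of the shared-trunk, multi-head idea (not the authors' exact architecture; the paper does not state the deep learning framework, so PyTorch and the layer sizes below are our assumptions), a minimal sketch could look as follows.

```python
import torch
import torch.nn as nn

class MultiObjectiveCNN(nn.Module):
    """Shared convolutional trunk with four task-specific heads (illustrative only)."""
    def __init__(self, num_robot_types=5, num_joints=7):
        super().__init__()
        # Shared feature extractor (placeholder layer sizes, not the Figure 4 values).
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=2, dilation=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=3, dilation=3), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Head 1: per-pixel robot mask, upsampled back to the input resolution.
        self.mask_head = nn.Sequential(
            nn.Conv2d(64, 1, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Sigmoid(),
        )
        # Heads 2-4 operate on globally pooled features.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.joint_head = nn.Linear(64, num_joints * 3)   # 3D joint coordinates
        self.base_head = nn.Linear(64, 3)                 # 3D base position
        self.type_head = nn.Linear(64, num_robot_types)   # robot type logits

    def forward(self, x):
        f = self.trunk(x)
        v = self.pool(f).flatten(1)
        return {
            "mask": self.mask_head(f),
            "joints": self.joint_head(v),
            "base": self.base_head(v),
            "type": self.type_head(v),
        }

model = MultiObjectiveCNN()
out = model(torch.randn(1, 3, 212, 256))   # the 256x212 input size used in the paper
```

Sharing the early convolutional layers lets all four objectives reuse the same low-level features, while each head specialises on its own output.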

A. Loss Functions

The loss function is needed to define the quality of training and drive the CNN towards achieving better results. Given the four different outputs of the network, four loss functions need to be defined and eventually combined into a single one for the training process. First, we describe each of them, and then we explain how they are connected.

Normally, the robot body takes up a rather small area in the whole image. For the UR datasets, the area of the robot body is between 6-17%, for the Kuka datasets it is between 8-18%, and for the Franka Panda it is between 5-22%. Given a standard approach for the pixel classification loss function, an accuracy of over 78% could be reached by classifying the whole image as background. This is conceptually wrong, so the loss function is adapted by calculating the foreground weight $w_{fg}$ as defined in Equation 1. It is based on the inverse probability of the foreground and background classes, where $Z \in \{fg, bg\}$.

$$w_{fg} = \frac{1}{P(Z = fg)} \quad (1)$$

The background weight $w_{bg}$ is calculated in Equation 2.

$$w_{bg} = \frac{1}{P(Z = bg)} \quad (2)$$

The definition of the loss function for the robot mask is done in two steps. First, a per-pixel loss $l_n$ is calculated in Equation 3.

[Figure 4: architecture diagram of the multi-objective CNN - shared convolutional layers (32F, 64F and 128F filters, 2x2-3x3 kernels with 2x2-5x5 dilations) interleaved with 2x2 max pooling, followed by fully connected layers (FC512, FC1024) branching into the four outputs: robot mask, joint 3D coordinates, robot base position and robot type; the Stage 1 and Stage 2 transfer learning adjustment regions are marked.]

Fig. 4. The multi-objective CNN with two-stage transfer learning. The CNN optimises for four objectives at the same time: the robot mask, the 3D coordinates of the robot joints, the 3D coordinates of the robot base and the robot type. The network is taught in two stages using the transfer learning approach. In stage 1, the parameters of all the layers besides the final ones marked in blue are frozen, and the system is trained until there is no more improvement. Afterwards, in stage 2, the parameters marked in red, as well as all the stage 1 layers, are adjusted during the training. This approach allows faster training compared to full training, while still reaching good accuracy.

In Equation 3, $i_{est}$ is $P(Y = fg)$, $(1 - i_{est})$ is $P(Y = bg)$ and $i_{gt}$ is the ground truth value from the mask image.

$$l_n(I^n_{est}, I^n_{gt}) = -w_{fg}\, i_{gt} \log(i_{est}) - w_{bg}\, (1 - i_{gt}) \log(1 - i_{est}) \quad (3)$$

Then, a normalised loss is calculated for the whole image, $L_{mask}$, in Equation 4. In order to keep the same learning parameters independent of the input image size, a normalisation factor $N$ is used, which is the number of pixels in the image.

$$L_{mask}(I_{est}, I_{gt}) = \frac{1}{N} \sum_{n} l_n(i_{est}, i_{gt}) \quad (4)$$
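A minimal sketch of this class-weighted per-pixel loss (Equations 1-4), assuming PyTorch (the paper does not state the framework) and estimating the class probabilities from the current batch rather than from dataset statistics:

```python
import torch

def mask_loss(i_est, i_gt, eps=1e-7):
    """Class-weighted per-pixel cross-entropy (Equations 1-4).

    i_est : predicted foreground probability per pixel, shape (B, 1, H, W)
    i_gt  : ground-truth mask with the same shape, values in {0, 1}
    """
    # Inverse class probabilities (Equations 1 and 2), here estimated from the batch.
    p_fg = i_gt.mean().clamp(min=eps)
    w_fg = 1.0 / p_fg
    w_bg = 1.0 / (1.0 - p_fg).clamp(min=eps)

    i_est = i_est.clamp(eps, 1.0 - eps)
    # Per-pixel loss (Equation 3).
    l_n = -(w_fg * i_gt * torch.log(i_est)
            + w_bg * (1.0 - i_gt) * torch.log(1.0 - i_est))
    # Normalise by the number of pixels N (Equation 4); also averaged over the batch.
    return l_n.mean()

loss = mask_loss(torch.rand(2, 1, 212, 256), (torch.rand(2, 1, 212, 256) > 0.9).float())
```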

Estimation of the 3D coordinates of the robot base and the robot joints is defined as a regression problem instead of classification. The loss function uses the Euclidean distance between the ground truth and the values estimated by the CNN. For the robot joint estimation, the loss function $L_{Jcoords}$ is described in Equation 5, where $N_j$ is the number of joints, $J_i$ is the ground truth position of each joint and $E_i$ is the value estimated by the neural network.

$$L_{Jcoords} = \frac{1}{N_j} \sum_{i=1}^{N_j} \| J_i - E_i \|_2 \quad (5)$$

The loss function for the coordinates of the robot base, $L_{Bcoords}$, is calculated in Equation 6. $B_{xyz}$ is the ground truth position of the robot base in 3D, and $E_{xyz}$ is the estimated 3D position of the robot base. These positions are relative to the coordinate frame of the camera.

$$L_{Bcoords} = \| B_{xyz} - E_{xyz} \|_2 \quad (6)$$

A multi-class categorical cross-entropy approach is used to identify the robot type, $L_{type}$. $L_{type}$ is calculated in Equation 7, where $p$ are the ground truth labels, $q$ are the predicted labels and $c \in R$, where $R$ contains all the available types of robots in the dataset.

$$L_{type} = -\sum_{c} p(c) \log q(c) \quad (7)$$

Eventually, all four previously defined loss functions are combined into a single loss function to be used in the training of the multi-objective CNN. The final loss function $L_{final}$

is calculated as a weighted sum of all the loss functions, as shown in Equation 8. The larger the weight $W$, the higher the impact of the corresponding term.

$$L_{final} = W_{mask} L_{mask} + W_{Jcoords} L_{Jcoords} + W_{Bcoords} L_{Bcoords} + W_{type} L_{type} \quad (8)$$

V. TRAINING USING TRANSFER LEARNING

A common approach to training such a system would be to train the whole network using the full datasets of all the robots. However, this would take a significant amount of computation and time. Overall, the goal of this work is to analyse the possibility of taking a pre-trained system and expanding it to include more robot types while having a limited amount of training data and time.

Transfer learning allows us to use a fully trained system for one robot type and then adjust it to include the newly provided training data. This is done by freezing the parameters in some of the layers while adjusting the remaining ones.

Given this partial adjustment, the training time and amount of training data required can be significantly reduced.


(a) Loss function and training time against the number of training samples used. This was acquired by running a number of experiments using input datasets of different sizes. Using more than 500 training samples per robot type does not give a significant accuracy benefit while increasing the training time.


(b) Loss function against the number of training iterations. Stage 1 and stage 2 of our two-stage transfer learning approach are marked by background colours on the graph. It can be seen that when stage 1 training saturates, unlocking the parameters of more CNN layers allows the network to further improve its results in stage 2 training.


(c) Errors in the 3D position estimation of the robot joints depending on the camera distance from the robot. Results are grouped by robot type. Error and distance have a close to linear correlation, but the Franka Panda has a higher error compared to the other robots. This is due to the robot body, which has low contrast against the background.

Fig. 5. Evaluation of the two-stage transfer learning method using the test dataset in various categories.

The system was fully trained using the UR robots, with the Kuka and Franka Panda robots added using the transfer learning approach. One crucial difference is that the UR robots have 6 joints, while the Kuka and Franka Panda are 7-joint robots. In general, it has been found that the first convolutional layers tend to learn general visual features, while further layers figure out specific visual cues of the objects. Both the UR and Kuka robots have brightly coloured joint covers, while the rest of the robot is silver; the Franka Panda, however, is mainly black and white, as seen in Figure 1.

Due to these differences, a two-stage transfer learning approach was adopted, as shown in Figure 4. In the first stage, just the final layers of the multi-objective CNN are trained. This allows the network to adjust the dense layers to select the best learned features for robot recognition using the currently learned visual cues. Only a small part of the CNN is adjusted, so the learning process is fast, and it re-learns the robot classification and position estimation using the current convolutional layer parameters.

However, at some point the learning process saturates and no more improvement, defined by the reduction of the loss, is observed. At this stage, the second part of the CNN is unlocked, allowing the parameters of the additional convolutional layers to be modified. This results in modifications of the visual cues that are learned, as well as adjustments of the final dense layers. The training speed is slower compared to the first stage, but the loss is reduced even further.
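The two training stages can be illustrated with the following sketch (our assumption of a PyTorch-style implementation with a stand-in model; the real layer split follows Figure 4): stage 1 freezes the shared convolutional trunk and trains only the final layers, and stage 2 unlocks the remaining layers once the loss saturates.

```python
import torch
import torch.nn as nn

# Stand-in for the multi-objective CNN: a shared trunk plus a final head.
trunk = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 64, 3, padding=1))
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 5))
model = nn.Sequential(trunk, head)

def set_trainable(module, trainable):
    """Freeze or unlock all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: freeze the shared convolutional trunk, train only the final layers.
set_trainable(trunk, False)
set_trainable(head, True)
opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)

# ... train until the loss saturates, then switch to stage 2 ...

# Stage 2: unlock the additional convolutional layers and continue training.
set_trainable(trunk, True)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```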

In order to add new robot types using the transfer learning approach, the new training dataset has to include both the robot that the network was originally trained on and the new robot(s) that should be recognised.

The weights for the loss function are adjusted to give more impact to identifying the mask and the robot type compared to our previous work. The selected weight values, based on trial and error from a number of experiments, were as follows: $W_{mask}$: 1.2, $W_{Jcoords}$: 1.2, $W_{Bcoords}$: 1.2 and $W_{type}$: 0.6.
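With these weights, the combination in Equation 8 reduces to a weighted sum of the four loss terms, sketched below (illustrative only, PyTorch assumed; the individual losses are dummy scalars here):

```python
import torch

# Loss weights listed above (Equation 8).
W = {"mask": 1.2, "jcoords": 1.2, "bcoords": 1.2, "type": 0.6}

def combined_loss(l_mask, l_jcoords, l_bcoords, l_type):
    """Weighted sum of the four objective losses (Equation 8)."""
    return (W["mask"] * l_mask + W["jcoords"] * l_jcoords
            + W["bcoords"] * l_bcoords + W["type"] * l_type)

# Example with dummy scalar losses.
total = combined_loss(torch.tensor(0.3), torch.tensor(0.8), torch.tensor(0.5), torch.tensor(0.1))
```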

The number of training samples varied by experiment, and the input images were scaled down and cropped to 256×212 pixels for all the datasets. The pixel intensity values of the input images were normalised to the range between 0 and 1. The learning rate was set to 0.001 at the start of the training and then gradually decreased towards 0.000001 as the training progressed.

VI. EXPERIMENTS AND RESULTS

The main goal of the experiments was to evaluate the capability of including new robot types using the two-stage transfer learning method, starting from a multi-objective CNN fully trained on UR robots.

Fig. 6. Visualisation of the output from the presented multi-objective CNN trained using the two-stage transfer learning approach. Each column corresponds to one robot type, in the following order: Kuka, UR and Franka Panda. The first row shows the estimated 3D joint positions (red circles) against the ground truth positions (green crosses). The second row shows the estimated mask of the robot and the third row shows the estimated robot base position.

Each experiment was conducted by taking a transfer learning dataset of a different size, using randomised sample selection to maximise the diversity of the data. The maximum amount of data was limited by the Kuka robot dataset, to ensure the same number of samples in each test for each of the robot types.

The system was evaluated on a testing set by comparing the output against the ground truth data. The robot mask accuracy is defined by counting the number of pixels in the CNN output image that match the ground truth mask. For the robot joint and base coordinates, the Euclidean distance between the CNN-estimated results and the ground truth was calculated. We compare the results between the robot types in a number of categories. The results are summarised in Table II, and a visualisation of the estimations plotted on top of testing set samples can be seen in Figure 6.
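Both evaluation measures are straightforward to compute; the sketch below (our illustration with hypothetical values) shows the pixel-matching mask accuracy and the Euclidean position error used for the joints and base.

```python
import numpy as np

def mask_accuracy(pred_mask, gt_mask):
    """Fraction of pixels in the predicted mask that match the ground truth."""
    return float((pred_mask == gt_mask).mean())

def position_error(pred_xyz, gt_xyz):
    """Euclidean distance (in metres) between predicted and ground-truth 3D positions."""
    return float(np.linalg.norm(np.asarray(pred_xyz) - np.asarray(gt_xyz)))

# Example: a joint predicted 3 cm away from its ground-truth position.
err = position_error([0.40, 0.10, 0.85], [0.40, 0.13, 0.85])   # -> 0.03 m
```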

TABLE II. Summary of the transfer learning results (using 500 samples of each robot type for training) on the test set.

Measure                  | UR      | Kuka    | Franka Panda
Mask Accuracy            | 97.0%   | 97.7%   | 94.3%
Robot Type Accuracy      | 97.5%   | 100%    | 96.1%
Joint Pos Error (Median) | 3.12 cm | 3.30 cm | 3.64 cm
Base Pos Error (Median)  | 2.42 cm | 2.36 cm | 3.01 cm

All of the robots had a joint 3D position estimation error under 3.64 cm, with the base position estimation error under 3 cm. The mask accuracy exceeded 97% for the UR and Kuka robots, while for the Franka Panda it was a bit lower at 94.3%. The robot type was recognised correctly in all of the cases for the Kuka robot, while the UR and Franka Panda recognition rates were 97.5% and 96.1% respectively. Overall, it was noticed that, given its distinct features, the CNN performed best on the Kuka robot, while the low-contrast Franka Panda robot had the worst results, but not far behind.

Considering the overall performance of the two-stage transfer learning for the multi-objective CNN, as shown in Figure 5(a), it can be seen that the loss function stops improving with datasets of between 500 and 750 training samples for each robot type, which corresponds to 7 to 10 hours of training time. Increasing the number of training samples beyond 750 does not improve the learning process, but significantly increases the training time.

Compared to the previously presented work in [7], the current two-stage transfer learning approach achieved similar accuracy, with a joint position error of 2.46 cm vs 3.12 cm, and slightly worse accuracy for the robot mask estimation: 97% vs 98% in the previous work. The full training of the multi-objective CNN for the UR robots took 60 hours, compared to 10 hours in the current work.

The performance of each training stage of the transfer learning is shown in Figure 5(b). Stage 1, with only the parameters of the final CNN layers being adjusted, saturates after 6000 iterations. Afterwards, further layers are unlocked, switching to Stage 2, and the loss function reduces even further, settling down between 10000 and 12000 iterations.

Furthermore, we analyse the impact of the distance between the camera and the robot on the joint and base position estimation, visualised in Figure 5(c). There is a close to linear relationship between the distance to the robot and the accuracy of the 3D position estimation of the robot joints. Interestingly, at a very close distance of 1.2 meters, the Franka Panda robot shows worse performance compared to a 1.5 meter distance.

The detection time, or forward-propagation time, of the multi-objective CNN was measured to be 19-23 ms per frame on an Nvidia GTX 1080 Ti graphics card.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a two-stage transfer learning approach, which allows re-training a previously trained multi-objective CNN to include numerous new robot types using a limited amount of training data. This approach reduces the time spent on collecting datasets with ground truth data, as well as the training time of the network itself.

Furthermore, the concept of a multi-objective CNN capable of identifying heterogeneous robots, classifying their types and estimating the 3D positions of their joints and base was proven. A simple 2D colour image was used as an input, and the Kuka and Franka Panda robots were mounted with two-finger grippers on the end-effector, which were not taken into account by the CNN. The network successfully estimated the position of the robot as if no end-effector were present. If the TCP of the end-effector were required, it could be calculated by adding the necessary CAD model or offset information to the estimated position of the end-effector.

With a detection time of under 23 ms, the system has proven to be capable of working in real-time. At the current stage, a powerful GPU is needed to run it; however, the goal of optimising it for smaller mobile systems could be pursued. In this case, it could be implemented in small wearable cameras to be used both for mobile robots and for human operators working in robotised environments, acting as a safety system that can detect possible collisions without any direct communication between the devices. The outcome could be a valuable measure for various safety applications in Human-Robot Interaction scenarios, where we need to know the positions of the human and the robot and of their individual joints relative to each other.

The achieved robot joint position estimation is not accurate enough for visual servoing operations, but future work can focus on accuracy improvements. We believe that by using higher resolution images, multi-sensor detection and tracking in time series, the accuracy of our system could be improved.

Furthermore, an analysis of the impact of the loss function weight selection and of the layer selection for transfer learning on the detection accuracy will be done.

With the current results, high-level control is still possible for human-robot and robot-robot interaction.

Additionally, with the given robot mask detection in a 2D image, some robot self-inspection could be done to detect damage, especially for autonomous robots operating in remote or disaster areas which people cannot access, for example planetary exploration rovers.

ACKNOWLEDGMENT

This work is partially supported by The Research Council of Norway as a part of the Engineering Predictability with Embodied Cognition (EPEC) project, under grant agreement 240862, and by the Research Council of Norway through its Centres of Excellence scheme, project number 262762, as well as by the Austrian Ministry for Transport, Innovation and Technology (BMVIT) within the project framework CollRob (Collaborative Robotics).


REFERENCES

[1] I. Bonev, "Should We Fence the Arms of Universal Robots?" http://coro.etsmtl.ca/blog/?p=299, ETS, 2014, accessed September 5, 2018.

[2] J. Lee, B. Bagheri, and H.-A. Kao, "A cyber-physical systems architecture for industry 4.0-based manufacturing systems," Manufacturing Letters, vol. 3, pp. 18-23, 2015.

[3] R. K. Lenz and R. Y. Tsai, "Calibrating a cartesian robot with eye-on-hand configuration independent of eye-to-hand relationship," IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 9, pp. 916-928, 1989.

[4] J. Miseikis, K. Glette, O. J. Elle, and J. Torresen, "Automatic calibration of a robot manipulator and multi 3d camera system," in System Integration (SII), 2016 IEEE/SICE International Symposium on. IEEE, 2016, pp. 735-741.

[5] J. Mainprice and D. Berenson, "Human-robot collaborative manipulation planning using early prediction of human motion," in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 299-306.

[6] J. Mišeikis, K. Glette, O. J. Elle, and J. Torresen, "Multi 3D camera mapping for predictive and reflexive robot manipulator trajectory estimation," in Computational Intelligence (SSCI), 2016 IEEE Symposium Series on. IEEE, 2016, pp. 1-8.

[7] J. Miseikis, I. Brijacak, S. Yahyanejad, K. Glette, O. J. Elle, and J. Torresen, "Multi-Objective Convolutional Neural Networks for Robot Localisation and 3D Position Estimation in 2D Camera Images," in 2018 15th International Conference on Ubiquitous Robots (UR), June 2018, pp. 597-603.

[8] J. Mišeikis, I. Brijacak, S. Yahyanejad, K. Glette, O. J. Elle, and J. Torresen, "Transfer learning for unseen robot detection and joint estimation on a multi-objective convolutional neural network," in 2018 IEEE International Conference on Intelligence and Safety for Robotics (ISR), Aug 2018, pp. 337-342.

[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.

[10] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717-1724.

[11] H. Greenspan, B. van Ginneken, and R. M. Summers, "Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1153-1159, 2016.

[12] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, "Deep learning for emotion recognition on small datasets using transfer learning," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction. ACM, 2015, pp. 443-449.

[13] S. M. Xie, N. Jean, M. Burke, D. B. Lobell, and S. Ermon, "Transfer learning from deep features for remote sensing and poverty mapping," CoRR, vol. abs/1510.00098, 2015. [Online]. Available: http://arxiv.org/abs/1510.00098

[14] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1285-1298, 2016.

[15] K. Weiss, T. M. Khoshgoftaar, and D. Wang, "A survey of transfer learning," Journal of Big Data, vol. 3, no. 1, p. 9, 2016.

[16] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2d pose estimation using part affinity fields," in CVPR, 2017.

[17] C. Zimmermann, T. Welschehold, C. Dornhege, W. Burgard, and T. Brox, "3d human pose estimation in rgbd images for robotic task learning," in IEEE International Conference on Robotics and Automation (ICRA), 2018. [Online]. Available: https://lmb.informatik.uni-freiburg.de/projects/rgbd-pose3d/

[18] C.-M. Cheng, H.-W. Chen, T.-Y. Lee, S.-H. Lai, and Y.-H. Tsai, "Robust 3d object pose estimation from a single 2d image," in Visual Communications and Image Processing (VCIP), 2011 IEEE. IEEE, 2011, pp. 1-4.

[19] M. Zhu, K. G. Derpanis, Y. Yang, S. Brahmbhatt, M. Zhang, C. Phillips, M. Lecce, and K. Daniilidis, "Single image 3d object detection and pose estimation for grasping," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 3936-3943.

[20] C.-Y. Tsai, W.-Y. Wang, C.-H. Huang, and B.-R. Shih, "Cad model-based 3d object pose estimation using an edge-based nonlinear model fitting algorithm," in Proceedings of the 3rd IIAE International Conference on Intelligent Systems and Image Processing 2015. IIAE, 2015, pp. 59-62.

[21] J. J. Lim, A. Khosla, and A. Torralba, "Fpm: Fine pose parts-based model with 3d cad models," in European Conference on Computer Vision. Springer, 2014, pp. 478-493.

[22] R. Y. Tsai and R. K. Lenz, "A New Technique for Fully Autonomous and Efficient 3D Robotics Hand/Eye Calibration," Robotics and Automation, IEEE Transactions on, vol. 5, no. 3, pp. 345-358, 1989.

[23] N. Point, "Optitrack," Natural Point, Inc. [Online]. Available: http://www.naturalpoint.com/optitrack/. [Accessed 5 5 2018], 2011.

[24] S. Chiodini, M. Pertile, R. Giubilato, F. Salvioli, M. Barrera, P. Franceschetti, and S. Debei, "Camera rig extrinsic calibration using a motion capture system," in 2018 5th IEEE International Workshop on Metrology for AeroSpace (MetroAeroSpace). IEEE, 2018, pp. 590-595.

[25] I. A. Sucan and S. Chitta, "MoveIt!" [Online]. Available: http://moveit.ros.org, 2013.

[26] W. Meeussen, J. Hsu, and R. Diankov, "URDF - Unified Robot Description Format," 2012.
