
4.1.2 Robot Body and Pose Estimation Using Deep Learning

With the following work, the research focus shifted to machine learning approaches for robot-camera systems. The previously presented solutions relied heavily on uninterrupted communication channels to transmit camera images and real-time robot positions, and even required an exact robot model. If any of this information were delayed or inaccurate, the whole system could malfunction. To reduce this dependency on multiple systems working reliably, deep learning approaches were used to teach the computer to recognise the robot in the image and estimate its exact position in 3D space without any direct information from the robot sensors or even an explicit model of the robot.

Given the task requirements of using a 2D colour image as an input and predicting both a mask of the robot in the image and the 3D position of each robot joint, convolutional neural networks (CNNs) were chosen as the most suitable approach. CNNs are capable of learning complex tasks by effectively adjusting large numbers of weights between neurons and selecting the most appropriate filters automatically. However, they require extensive training datasets with correctly marked ground truth to train efficiently. An automatic training data collection system was set up using Universal Robots and RGB-D cameras, which made it possible to collect thousands of training samples for training the CNN.
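
To make the role of the collected data concrete, the sketch below shows one possible structure for a single training sample. The field names and shapes are illustrative assumptions, not the exact format used in the papers.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class RobotPoseSample:
        rgb: np.ndarray        # H x W x 3 colour image from the RGB-D camera
        mask: np.ndarray       # H x W binary ground-truth mask of the robot
        joints_3d: np.ndarray  # J x 3 joint positions relative to the camera
        robot_type: str        # e.g. "UR5"; used later for robot type identification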

The first successful approach used cascaded CNNs: the first CNN estimated a mask of the robot in the image, and the result was passed on to a second CNN, which predicted the 3D positions of the robot joints, as described in Paper IV. The system proved to work with a few-centimetre accuracy. However, the training process was cumbersome. A separate CNN was necessary for each robot type, and if the robot mask estimation contained significant errors, the joint position estimation was prone to fail because of the imperfect input.
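
The cascaded structure can be illustrated with a minimal PyTorch sketch: one network predicts the robot mask, and a second network takes the image together with the predicted mask and regresses the 3D joint positions. The layer sizes, channel counts and module names are assumptions made for illustration and do not reproduce the architecture of Paper IV.

    import torch
    import torch.nn as nn

    class MaskNet(nn.Module):
        """First CNN: per-pixel robot mask from the colour image."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 1), nn.Sigmoid(),
            )

        def forward(self, rgb):
            return self.net(rgb)

    class JointNet(nn.Module):
        """Second CNN: 3D joint positions from the image plus the predicted mask."""
        def __init__(self, num_joints=6):
            super().__init__()
            self.num_joints = num_joints
            self.features = nn.Sequential(
                nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Linear(32, num_joints * 3)

        def forward(self, rgb, mask):
            x = torch.cat([rgb, mask], dim=1)  # the mask becomes an extra input channel
            return self.head(self.features(x)).view(-1, self.num_joints, 3)

    mask_net, joint_net = MaskNet(), JointNet()
    rgb = torch.randn(1, 3, 128, 128)
    mask = mask_net(rgb)               # output of the first CNN
    joints = joint_net(rgb, mask)      # (1, 6, 3): x, y, z for each joint

The sketch also makes the weakness visible: any error in the predicted mask propagates directly into the input of the second network.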

Naturally, a better CNN structure was explored to make the system more robust. Instead of passing the output of one CNN as the input of the next, the two CNNs were merged into one. The result was a multi-objective convolutional neural network capable of estimating four outputs at the same time: robot type, robot mask, 3D base position of the robot, and 3D positions of the robot joints. With these improvements, the system was able to estimate all the same information as the cascaded CNNs and, in addition, identify the robot type, all within a single neural network. The improved system resulted in Paper V. On the other hand, the larger multi-objective CNN meant that teaching it to recognise all the robots required the training data for all of them to be collected and used in the training process at once.
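
A minimal sketch of the merged, multi-objective structure could look as follows: a shared backbone feeds four heads for robot type, robot mask, 3D base position and 3D joint positions. The channel counts, head designs and the number of robot types are illustrative assumptions, not the published architecture of Paper V.

    import torch
    import torch.nn as nn

    class MultiObjectiveCNN(nn.Module):
        def __init__(self, num_robot_types=3, num_joints=6):
            super().__init__()
            self.num_joints = num_joints
            self.backbone = nn.Sequential(              # shared low-level features
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            )
            self.mask_head = nn.Conv2d(32, 1, 1)        # per-pixel robot mask (logits)
            self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.type_head = nn.Linear(32, num_robot_types)   # robot type logits
            self.base_head = nn.Linear(32, 3)                 # 3D base position
            self.joint_head = nn.Linear(32, num_joints * 3)   # 3D joint positions

        def forward(self, rgb):
            feat = self.backbone(rgb)
            pooled = self.pool(feat)
            return {
                "robot_type": self.type_head(pooled),
                "mask": self.mask_head(feat),
                "base_xyz": self.base_head(pooled),
                "joints_xyz": self.joint_head(pooled).view(-1, self.num_joints, 3),
            }

During training, the four outputs would typically be combined into a single weighted loss, so that one backward pass updates the shared backbone for all objectives at once.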

Despite having an automatic collection method for the robot training data, it was still a time-consuming process. The robot had to perform hundreds of physical movements, with camera frames captured and saved for each movement.

In some cases, unexpected irregularities or accuracy issues were still present when the trained system was tested in operation. Furthermore, even though a multi-objective CNN was capable of recognising multiple robots, once the system was trained, it was limited to the robots it had been taught. If a new robot was added, the whole training process had to be repeated. A new approach was therefore explored to reduce the amount of training data and time needed for adding a new robot to the recognition system. Transfer learning can re-train the network for a new robot type using a limited amount of training data. It allowed the low-level features to be reused, reduced the amount of needed training data ten-fold, and cut the training time from 60 to 2 hours. The work was described in Paper VI.
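
The transfer-learning step can be sketched in a few lines, reusing the MultiObjectiveCNN sketch above: the shared backbone holding the low-level features is frozen, and only the task heads are re-trained on the much smaller dataset of the new robot. The checkpoint name and optimiser settings are assumptions for illustration.

    import torch

    model = MultiObjectiveCNN()
    # model.load_state_dict(torch.load("base_universal_robots.pt"))  # assumed checkpoint name

    for p in model.backbone.parameters():   # reuse the learned low-level features as-is
        p.requires_grad = False

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-4)
    # ...train only on the (roughly ten times smaller) dataset of the new robot type...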

Re-training the CNN using transfer learning is very efficient, but one significant issue remained. The CNN effectively learned to recognise the new robot; however, it forgot the previously known robots and was no longer able to identify them or estimate their positions, which would not be useful in practical applications. To solve this problem, the transfer learning approach was adjusted not only to teach the CNN to identify the new robot but also to use the information about the previously known robots in the training process. This way, the network is capable of including the new robot in the current recognition system without losing previous knowledge.

Given the more complicated setup, a two-stage transfer learning approach was used, in which the number of layers to be adjusted changed during the training process to allow the system to adapt better while still keeping the training time low.

Eventually, the system was capable of taking the base CNN trained on Universal Robots and including both Kuka and Franka Panda robot recognition in the same multi-objective CNN by using the two-stage transfer learning method presented in Paper VII.
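
The two-stage procedure can be sketched as follows, again based on the MultiObjectiveCNN sketch above and on two assumptions: the training batches mix samples of the new robot with samples of the already-known robots so that earlier knowledge is not overwritten, and the set of trainable layers changes between the two stages. The stage lengths and the choice of which layers to unfreeze are illustrative, not the settings of Paper VII.

    def set_backbone_trainable(model, trainable: bool):
        for p in model.backbone.parameters():
            p.requires_grad = trainable

    def two_stage_finetune(model, mixed_loader, train_step):
        # Stage 1: adapt only the task heads while the shared features stay fixed.
        set_backbone_trainable(model, False)
        for _ in range(5):                       # assumed number of epochs
            for batch in mixed_loader:           # new robot + replayed known robots
                train_step(model, batch)
        # Stage 2: unfreeze more layers so the whole network can adapt, while the
        # mixed batches keep the previously-known robots represented in training.
        set_backbone_trainable(model, True)
        for _ in range(2):                       # assumed number of epochs
            for batch in mixed_loader:
                train_step(model, batch)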

Referring back to research question 2 (RQ 2), we have been able to teach the system to recognise the robot not only in an input image but also to estimate its position in 3D space in relation to the camera. This was made possible by using deep learning and CNNs. It allows the robot to be recognised under various illumination conditions or in a dynamically re-configurable camera-robot environment. To train the network with only a limited amount of training data, a readily trained network can be taken and the transfer learning approach applied to it to incorporate the recognition of new robot types.