On Writing Custom Loss Functions in Keras

Yanfeng Liu
Jan 1, 2019 · 4 min read

If you are doing research in deep learning, chances are that you have to write your own loss functions pretty often. I was playing with a toy problem of solving inverse kinematics with neural networks, and I came across something that at first was very strange to me.

The inverse kinematics problem is formulated as follows: given a robot arm of a certain length (or an arm consisting of several segments with different lengths) and a target location in 2D or 3D coordinates, produce the angles to which the arm must rotate its segments so that the end tip of the arm lands at that location.

Consider a robot arm with 2 segments, of lengths L1 and L2. It needs to rotate these two segments to relative angles q1 and q2. The end tip (p1, p2) will then be at:
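p1 = L1*cos(q1) + L2*cos(q1 + q2)
p2 = L1*sin(q1) + L2*sin(q1 + q2)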

Illustration borrowed from [1]

I decided to try the simplest version of the problem first, which involves only L1 and q1. The end tip calculation is then as easy as:
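p1 = L1*cos(q1)
p2 = L1*sin(q1)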

This should be pretty easy to learn for a neural network. To train it, we provide the end tip location and … hold on, how should we train it?

You see, the problem is that robot arms can freely rotate to any degree. It might not be directly obvious when we only have 1 segment, but when we have 2 or more segments, there are many more ways to reach the same location. In other words, there is no single, fixed solution to our problem, even though it is pretty easy to solve. If we train the neural network in a strictly supervised way, i.e. feeding it the target location and the expected angle outputs, the network will be biased towards one type of solution over the others. If the bias in the ground truth is not consistent (solutions are picked at random), it is even worse: the network gets confused and tries to mix different solutions into one. Consequently, the angles it predicts will land the end tip around the target but not on it.

A better way to formulate the problem is to let the network predict the angles, then convert those angles to the xy location they take us to, and check the distance between the resulting xy and the target xy. That conversion is the loss function’s job, which is the main focus of this blog post.

We basically need to write the transform matrix in TensorFlow and then perform a matrix multiplication between the location vector and the transform matrix. The rest is as simple as a mean squared error.
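A minimal sketch of such a loss is below, assuming the network outputs the two angles (q1, q2), y_true carries the target (x, y), and both segments have length 1; the exact form of the penalty term is also just one possible choice.

    import numpy as np
    from keras import backend as K

    L1, L2 = 1.0, 1.0  # segment lengths (assumed to be 1 here)

    def ik_loss(y_true, y_pred):
        # y_pred holds the predicted angles [q1, q2],
        # y_true holds the target end-tip location [x, y].
        # NOTE: this version only ever looks at the first example in the batch.
        q1 = y_pred[0, 0]
        q2 = y_pred[0, 1]

        # 2x2 rotation (transform) matrices built from the predicted angles
        rot1 = K.stack([K.stack([K.cos(q1), -K.sin(q1)]),
                        K.stack([K.sin(q1),  K.cos(q1)])])
        rot2 = K.stack([K.stack([K.cos(q1 + q2), -K.sin(q1 + q2)]),
                        K.stack([K.sin(q1 + q2),  K.cos(q1 + q2)])])

        # rotate each segment and add them up to get the end tip
        seg1 = K.constant([[L1], [0.0]])
        seg2 = K.constant([[L2], [0.0]])
        tip = K.dot(rot1, seg1) + K.dot(rot2, seg2)   # shape (2, 1)

        # loss1: squared distance between the predicted tip and the target
        loss1 = K.mean(K.square(K.flatten(tip) - y_true[0]))

        # loss2: soft penalty keeping the angles inside their allowed ranges
        loss2 = K.relu(K.abs(q1) - np.pi) + K.relu(K.abs(q2) - 0.5 * np.pi)

        return loss1 + loss2

Note that this sketch indexes only the first example along the batch dimension; keep that in mind for later.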

We added a little regularization term, loss2, at the end just to make sure the angles don’t go too far outside the boundary, because otherwise the network could easily add 2*pi to any solution and it would still be valid. In this case we set the boundary of the first angle to [-pi, pi], and of the second angle to [-0.5*pi, 0.5*pi].

The only thing we still need in order to train the network is a way to generate data, such as the following block:
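A sketch of such a generator, assuming we sample angles within the allowed ranges, push them through the forward kinematics, and use the resulting end tip as both the network input and the target the loss compares against:

    import numpy as np

    def generate_data(num_samples, l1=1.0, l2=1.0):
        # sample angles uniformly inside the allowed ranges
        q1 = np.random.uniform(-np.pi, np.pi, num_samples)
        q2 = np.random.uniform(-0.5 * np.pi, 0.5 * np.pi, num_samples)
        # forward kinematics: where those angles put the end tip
        x = l1 * np.cos(q1) + l2 * np.cos(q1 + q2)
        y = l1 * np.sin(q1) + l2 * np.sin(q1 + q2)
        xy = np.stack([x, y], axis=1)   # shape (num_samples, 2)
        # the end-tip location serves as both the input and the "label";
        # the loss only ever compares against the target it encodes
        return xy, xy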

The network is a fully connected network. I added many layers just to make sure it is complex enough, but fewer layers will probably suffice.
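Something along these lines, with the layer count and widths picked arbitrarily rather than tuned:

    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    model.add(Dense(128, activation='relu', input_shape=(2,)))  # input: target (x, y)
    for _ in range(4):                                          # a few hidden layers
        model.add(Dense(128, activation='relu'))
    model.add(Dense(2, activation='linear'))                    # output: angles (q1, q2)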

The training starts, and the loss value starts to go down, as does the mean squared error on the test set. Everything works and life is great. I smile a little, and decide to increase the batch size so that training is more stable and converges faster. BOOM! The loss goes up all of a sudden, leaving me utterly confused.

How is it possible that an already trained network suddenly gets destroyed simply because I increase the batch size? If anything, bigger batch sizes should help with the training.

And then it occurred to me that I wrote my own loss function, which at this point did not yet support batch sizes greater than 1. The loss was simply not calculated correctly. This is the key. When you write your own custom loss function, please keep in mind that it won’t handle batch training unless you specifically tell it how to. Basically, you have to take the average loss over every example in the batch.

The tricky part is how to write such a loss function. In my case, all I really need it to do in batches is the matrix multiplication between the location vector and the transform matrix. The following block should do the trick:
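Here is a sketch of a batch-aware version, using the same assumed names and segment lengths as before. Since rotating the vector [L, 0] by an angle q just gives L*[cos(q), sin(q)], the per-example matrix multiplication can be written element-wise; every operation then broadcasts over the batch dimension, and the final means average over all examples in the batch.

    import numpy as np
    from keras import backend as K

    L1, L2 = 1.0, 1.0  # assumed segment lengths, as before

    def ik_loss_batch(y_true, y_pred):
        # y_pred: (batch, 2) predicted angles, y_true: (batch, 2) target xy
        q1 = y_pred[:, 0]
        q2 = y_pred[:, 1]

        # rotating [L, 0] by q is just L*[cos(q), sin(q)],
        # so the per-example matrix product reduces to element-wise ops
        x = L1 * K.cos(q1) + L2 * K.cos(q1 + q2)
        y = L1 * K.sin(q1) + L2 * K.sin(q1 + q2)
        tip = K.stack([x, y], axis=1)                 # (batch, 2)

        # average the squared end-tip error over the whole batch
        loss1 = K.mean(K.square(tip - y_true))

        # boundary penalty, also averaged over the batch
        loss2 = K.mean(K.relu(K.abs(q1) - np.pi) +
                       K.relu(K.abs(q2) - 0.5 * np.pi))

        return loss1 + loss2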

Let’s test it with batch size = 32:
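Wiring everything together might look like this, reusing the sketches above; the optimizer, epoch count, and sample counts are arbitrary choices, not the original settings:

    x_train, y_train = generate_data(10000)
    x_test, y_test = generate_data(1000)

    model.compile(optimizer='adam', loss=ik_loss_batch)
    model.fit(x_train, y_train,
              batch_size=32,
              epochs=10,
              validation_data=(x_test, y_test))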

Results after 58 mini-batches
Results after 400 mini-batches

Below is the full source code for this blog post, run in JupyterLab.

Library versions: tensorflow-gpu=1.10.0; keras=2.2.2; Python=3.6.6

[1] Tejomurtula, Sreenivas, and Subhash Kak. “Inverse kinematics in robotics using neural networks.” Information Sciences 116.2–4 (1999): 147–164.
