The Confusing Metrics of AP and mAP for Object Detection / Instance Segmentation
If you are working on an object detection or instance segmentation algorithm, you have probably come across the messy pile of performance metrics: AP, AP50, AP75, mAP, AP@[0.5:0.95], and all kinds of fun stuff. Researchers are inventing even more new metrics as we speak. Pretty soon you will be pulling your hair out and asking yourself what the **** all this is. “I have an algorithm, and I have my dataset. All I want to do is to evaluate it. Why is it so hard?” Well, I have gone through the same thing and probably spent a ridiculous amount of time trying to figure these metrics out. Now that I have, I would like to make it easier for you.
Instance segmentation is a relatively new field. Until the age of deep learning, there were not many datasets for it because the algorithms were simply not good enough. These days you mainly see MSCOCO, PASCAL VOC, Cityscapes, and other datasets that include an instance segmentation task. The main idea of instance segmentation is that we have to segment out each instance of each category, so that, for example, two people in an image will share the category label “person” but carry different instance labels like “person 1” and “person 2”, as shown below.
To evaluate a certain algorithm on instance segmentation, usually you will see the terms mentioned above, namely AP (Average Precision) and mAP (Mean Average Precision). Let’s start with the definition of precision and recall.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
where tp = true positive, fp = false positive, fn = false negative. What are they? The table below should make things clear:
So basically precision measures the percentage of correct positive predictions among all predictions made, and recall measures the percentage of correct positive predictions among all positive cases in reality. There is always a trade-off between the two metrics. Imagine if we label everything as positive: recall will be 1 because we have no false negatives, but precision will be horrible because only a small percentage of our positive predictions are actually correct. In the other extreme case, we can be very careful about which predictions we call positive, so precision will be very good, but we might have labelled many positive cases as negative and consequently lowered recall.
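To make the definitions concrete, here is a minimal sketch in Python (the function name and the counts are mine, made up for illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts of true positives,
    false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Example: 8 correct detections, 2 spurious detections, 12 objects missed.
# precision = 8 / 10 = 0.8, recall = 8 / 20 = 0.4
print(precision_recall(tp=8, fp=2, fn=12))
```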
Now that we understand precision and recall, we are well equipped to understand average precision. It is defined as the area under the precision-recall curve (PR curve). The x-axis is recall, and the y-axis is precision. How do we get multiple precision-recall value pairs? In the case of object detection or instance segmentation, this is done by changing the score cutoff. Any detection with a score lower than the cutoff is treated as a false positive. At each unique classification score present in the detection results, we calculate the precision and recall at that point. After we have gone through all the precision-recall value pairs corresponding to each unique score cutoff, we have a precision-recall curve. It might look like the image below:
Note that it is monotonically decreasing, which is what we want, and it makes sense: there is always a trade-off between precision and recall, and increasing one tends to decrease the other. Sometimes the raw curve is not monotonically decreasing. When that happens, we make it so by setting the precision at recall R to the maximum precision obtained at any recall R' ≥ R. So the green curve will look like the red one after we are done.
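Here is a minimal sketch of the whole procedure, assuming we already know, for each detection, whether it is a true positive, and how many ground-truth objects there are (the name compute_ap is mine, not from any particular library):

```python
import numpy as np

def compute_ap(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve, with the curve first made
    monotonically decreasing as described above.

    scores           : confidence score of each detection
    is_true_positive : 1 if the detection matched a ground truth, else 0
    num_ground_truth : total number of ground-truth objects
    """
    order = np.argsort(-np.asarray(scores))        # highest score first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp

    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / num_ground_truth             # x-axis
    precision = cum_tp / (cum_tp + cum_fp)         # y-axis

    # Precision envelope: P(R) = max precision at any recall R' >= R.
    precision = np.maximum.accumulate(precision[::-1])[::-1]

    # Integrate the resulting step curve.
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([precision[0]], precision))
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

# Three detections, of which the first and third are correct,
# and two ground-truth objects in total: AP ≈ 0.83.
print(compute_ap(scores=[0.9, 0.8, 0.6], is_true_positive=[1, 0, 1], num_ground_truth=2))
```

(COCO actually samples the envelope at 101 fixed recall points rather than integrating it exactly, but the idea is the same.)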
Now the mystery of AP is solved. But what about the numbers after it? What are AP50, AP75, and AP@[0.5:0.95]? First we have to understand what IoU (Intersection over Union) is. The image below is my favorite explanation.
IoU is a good way of measuring the amount of overlap between two bounding boxes or segmentation masks. If the prediction is perfect, IoU = 1; if it completely misses, IoU = 0; and a partial overlap produces an IoU value somewhere in between. We have to decide when a prediction counts as good enough, i.e. a true positive. 50% IoU is a good place to start, but what about 60%? 78.1%? 99%? 100%? Good question. Favoring one particular threshold and ignoring the others would introduce a bias, so one option is to sweep the IoU threshold over a range. In COCO the IoU threshold goes from 50% to 95% in steps of 5%, and at each of those 10 thresholds we compute an AP (each one the area under its own precision-recall curve, obtained by varying the score cutoff as before). Averaging those 10 AP values gives AP@[0.5:0.95].
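Concretely, for axis-aligned bounding boxes the overlap is cheap to compute; below is a small sketch with boxes given as (x1, y1, x2, y2) corners (for segmentation masks you would count overlapping pixels instead):

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```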
Sometimes the IoU threshold is fixed, for example at 50% or 75%, and the resulting metrics are called AP50 and AP75, respectively. In that case, the metric is simply the AP value computed with the IoU threshold held at that value. Remember, again, we still have to calculate the precision-recall pairs at the different score cutoffs.
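Putting the pieces together, here is a sketch of how the named metrics relate. It assumes a hypothetical helper average_precision(detections, ground_truths, iou_threshold) that does the matching and PR-curve integration for a single IoU threshold; none of these names come from an actual library.

```python
import numpy as np

def coco_style_metrics(detections, ground_truths, average_precision):
    """AP50, AP75, and AP@[0.5:0.95] from a single-threshold AP routine."""
    thresholds = np.linspace(0.5, 0.95, 10)        # 0.50, 0.55, ..., 0.95
    aps = [average_precision(detections, ground_truths, t) for t in thresholds]
    return {
        "AP50": aps[0],                            # IoU threshold fixed at 0.50
        "AP75": aps[5],                            # IoU threshold fixed at 0.75
        "AP@[0.5:0.95]": float(np.mean(aps)),      # mean over all 10 thresholds
    }
```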
Mean average precision (mAP) is much easier to understand once we understand AP. mAP is simply all the AP values averaged over different classes/categories. However, researchers do not like to make things easy. On the COCO website, it confusingly says:
AP is averaged over all categories. Traditionally, this is called “mean average precision” (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.
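In code, the distinction really is that trivial: if you already have one AP per category, the “mAP” most papers report is just their mean (the numbers below are made up for illustration).

```python
# Hypothetical per-category AP values (one AP@[0.5:0.95] per class).
ap_per_class = {"person": 0.52, "car": 0.61, "dog": 0.44}

mAP = sum(ap_per_class.values()) / len(ap_per_class)
print(round(mAP, 3))  # 0.523
```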
One more thing before we get ready for some evaluation:
Usually in an object detection/instance segmentation algorithm, there are multiple categories. In fact, even if there is only one class, we still have to make the distinction between that class and the background. To make that distinction, our algorithm has to output a classification score (or confidence score); we will just call it the score for short. This score is required by most instance segmentation/object detection datasets out there. If you are using a top-down approach, this is no big deal: your network naturally produces a score for the region enclosed by each bounding box, and that score is actually very helpful in the non-max suppression stage because it can filter out duplicate results. However, if you are working on a bottom-up approach, meaning that you do not use bounding boxes at all and instead assign a class label and an instance label to each pixel, then you have to use some kind of clustering algorithm to form each mask, and the score associated with that mask is not immediately obvious. I see two ways of computing it: either averaging the per-pixel scores over all pixels in the mask, or training a separate branch of the network to predict a single score for the mask.
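Here is a minimal sketch of the first option, assuming the network gives us a per-pixel confidence map for the predicted class and the clustering step gives us a binary mask; both names are mine, just for illustration.

```python
import numpy as np

def mask_score(pixel_scores, instance_mask):
    """Score a bottom-up instance mask as the mean per-pixel confidence.

    pixel_scores  : HxW array of class confidence for each pixel
    instance_mask : HxW boolean array, True where the instance was clustered
    """
    if not instance_mask.any():
        return 0.0
    return float(pixel_scores[instance_mask].mean())
```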
There are some slight differences among the datasets' evaluation mechanisms. For example:
In COCO, if you look at their source code, they rank all the detections by score from high to low, and then cut off the results at the maximum number of detections allowed. For each detection, taken in that order, the algorithm iterates through all ground truths, and the previously unmatched ground truth with the highest IoU (provided it passes the IoU threshold) is matched with that detection.
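A much-simplified sketch of that matching logic is below; the real pycocotools code additionally handles crowd regions, area ranges, and per-category evaluation, and the iou helper is assumed to be something like the box IoU above.

```python
def greedy_match(detections, ground_truths, iou, iou_threshold=0.5, max_dets=100):
    """COCO-style greedy matching (simplified).

    detections    : list of (score, box_or_mask) pairs, in any order
    ground_truths : list of boxes or masks
    Returns a list of (score, is_true_positive) pairs, ready for AP computation.
    """
    # Rank detections by score, high to low, and cut off at max_dets.
    detections = sorted(detections, key=lambda d: d[0], reverse=True)[:max_dets]
    matched_gt = set()
    results = []
    for score, det in detections:
        # Find the best still-unmatched ground truth by IoU.
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched_gt:
                continue
            overlap = iou(det, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched_gt.add(best_idx)
            results.append((score, 1))   # true positive
        else:
            results.append((score, 0))   # false positive
    return results
```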
In Cityscapes, for each ground truth, the algorithm iterates through all predictions that have a non-zero intersection with it. When more than one prediction is matched with the same ground truth, the lower-scoring ones are automatically counted as false positives; IoU is only used to decide whether a match passes the threshold.