I was trying to fine-tune an object detection model on a custom dataset. I was looking for example code, but most examples I could find were for classification. What could be the reason for this?
- Features: After feature extraction, classification is straightforward: it is just a fully connected layer on top of the extracted features. For object detection (say YOLO), the head instead predicts a tensor of shape S × S × B × (5 + C), where S × S is the grid into which the image is divided, B is the number of bounding boxes per grid cell, 5 covers the four bounding box coordinates plus an objectness score, and C is the number of classes.
- Loss computation: The training loss is also different. After the bounding boxes are generated, we need to compute separate loss terms for the bounding box coordinates, the objectness score, and the class predictions.
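To make the output shape concrete, here is a minimal sketch with NumPy. The numbers (7×7 grid, 2 boxes, 20 classes) are toy values for illustration, not from any particular model:

```python
import numpy as np

# Toy hyperparameters, chosen only for illustration:
S, B, C = 7, 2, 20  # 7x7 grid, 2 boxes per cell, 20 classes

# A YOLO-style head emits, per image, a tensor of shape S x S x B x (5 + C):
# for each grid cell and each box, 4 coordinates + 1 objectness score + C class scores.
pred = np.random.rand(S, S, B, 5 + C)

coords      = pred[..., 0:4]  # (S, S, B, 4)  box coordinates
objectness  = pred[..., 4]    # (S, S, B)     confidence that the box contains an object
class_probs = pred[..., 5:]   # (S, S, B, C)  per-class scores

print(pred.shape)  # (7, 7, 2, 25)
```

Compare this with a classifier, which outputs just a flat vector of C logits per image; the extra structure is what the detection-specific loss and post-processing have to handle.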
- Dataset formats: Labels and bounding boxes are represented in several different formats (for example, COCO JSON, Pascal VOC XML, and YOLO TXT).
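As an example of why the formats differ, here is a small sketch converting a YOLO-style box (normalized center x/y, width, height) to Pascal VOC corners (absolute xmin, ymin, xmax, ymax). The function name and numbers are my own for illustration:

```python
def yolo_to_voc(cx, cy, w, h, img_w, img_h):
    """Convert a YOLO-format box (normalized center, width, height)
    to Pascal VOC corners (absolute xmin, ymin, xmax, ymax)."""
    xmin = (cx - w / 2) * img_w
    ymin = (cy - h / 2) * img_h
    xmax = (cx + w / 2) * img_w
    ymax = (cy + h / 2) * img_h
    return xmin, ymin, xmax, ymax

# A box centered in a 640x480 image, 20% of the width and 40% of the height:
print(yolo_to_voc(0.5, 0.5, 0.2, 0.4, img_w=640, img_h=480))
# -> (256.0, 144.0, 384.0, 336.0)
```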
Tools such as Roboflow make it easy to convert between these formats.
- Inference: From all the bounding boxes generated, we keep the ones with the highest objectness scores and then apply non-max suppression (NMS) to remove overlapping boxes.
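The inference step above can be sketched as a greedy NMS in NumPy. This is a minimal, single-class version for illustration; real pipelines typically run it per class and on top of a confidence threshold:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression.
    boxes: (N, 4) array of [xmin, ymin, xmax, ymax]; scores: (N,) objectness.
    Returns indices of the boxes to keep, highest score first."""
    order = np.argsort(scores)[::-1]  # process boxes from highest to lowest score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the current top box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],     # heavy overlap with the first box
                  [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]: the overlapping lower-score box is suppressed
```

None of this post-processing exists in a classification pipeline, which is another reason detection example code is longer and rarer.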