
Some code questions #33

Open
xavidzo opened this issue May 15, 2021 · 19 comments

Comments

@xavidzo

xavidzo commented May 15, 2021

Hello @pangsu0613, could you please explain in words the idea behind this algorithm for finding the overlaps between the projected 3D boxes (called just 'boxes' in the code) and the 2D boxes (called 'query_boxes' in the code)?
a)

# pang added to build the tensor for the second stage of training
@numba.jit(nopython=True,parallel=True)
def build_stage2_training(boxes, query_boxes, criterion, scores_3d, scores_2d, dis_to_lidar_3d,overlaps,tensor_index):
    N = boxes.shape[0] #70400
    K = query_boxes.shape[0] #30

    max_num = 900000
    ind=0
    ind_max = ind
    for k in range(K):
        qbox_area = ((query_boxes[k, 2] - query_boxes[k, 0]) *
                     (query_boxes[k, 3] - query_boxes[k, 1]))
        for n in range(N):

            iw = (min(boxes[n, 2], query_boxes[k, 2]) -
                  max(boxes[n, 0], query_boxes[k, 0]))
            if iw > 0:
                ih = (min(boxes[n, 3], query_boxes[k, 3]) -
                      max(boxes[n, 1], query_boxes[k, 1]))
                if ih > 0:
                    if criterion == -1:
                        ua = (
                            (boxes[n, 2] - boxes[n, 0]) *
                            (boxes[n, 3] - boxes[n, 1]) + qbox_area - iw * ih)
                    elif criterion == 0:
                        ua = ((boxes[n, 2] - boxes[n, 0]) *
                              (boxes[n, 3] - boxes[n, 1]))
                    elif criterion == 1:
                        ua = qbox_area
                    else:
                        ua = 1.0

                    overlaps[ind,0] = iw * ih / ua
                    overlaps[ind,1] = scores_3d[n,0]
                    overlaps[ind,2] = scores_2d[k,0]
                    overlaps[ind,3] = dis_to_lidar_3d[n,0]
                    tensor_index[ind,0] = k
                    tensor_index[ind,1] = n
                    ind = ind+1

                elif k==K-1:
                    overlaps[ind,0] = -10
                    overlaps[ind,1] = scores_3d[n,0]
                    overlaps[ind,2] = -10
                    overlaps[ind,3] = dis_to_lidar_3d[n,0]
                    tensor_index[ind,0] = k
                    tensor_index[ind,1] = n
                    ind = ind+1
            elif k==K-1:
                overlaps[ind,0] = -10
                overlaps[ind,1] = scores_3d[n,0]
                overlaps[ind,2] = -10
                overlaps[ind,3] = dis_to_lidar_3d[n,0]
                tensor_index[ind,0] = k
                tensor_index[ind,1] = n
                ind = ind+1
    if ind > ind_max:
        ind_max = ind
    return overlaps, tensor_index, ind

b) Here, when you calculate the 'distance to the LiDAR' feature, why do you divide by 82.0?

dis_to_lidar = torch.norm(box_preds[:,:2],p=2,dim=1,keepdim=True)/82.0

c) Also, I don't understand why the output scores of the fusion network 'cls_pred' are in raw logit format even though the input 3D and 2D scores were in sigmoid format. Can you please tell me the reason?

@pangsu0613
Owner

Hello @xavidzo
(a) The 3D box -> 2D box projection is not done here; it is done in voxelnet.py. For this function, the inputs are the projected 3D boxes (after projection there are 8 corner points for each box, but we take the min and max x-y to form an axis-aligned 2D box, which is the same format as the 2D detector output), the 2D boxes from the 2D detector, and some related information. The purpose of this function is to build the input tensor for fusion: since we only care about overlapping projected 3D and 2D detections, we calculate the IoU between them. The main idea of the IoU calculation is to first check whether the two boxes overlap in the x direction; if yes, check the y direction; if both overlap, the boxes intersect, and then we just compute the overlapped region and so on.
(b) Because in SECOND the detection field of view in the LiDAR coordinate frame is set to 0 < x < 70.4, -40 < y < 40, -3 < z < 1, the longest distance in the x-y plane is sqrt(70.4^2 + 40^2), which is around 81. I use 82 to keep the normalized value below 1; this value does not have a big impact on the final results, 80 or 81 also works fine.
(c) Because the fusion layers are a few CNN layers and the final output layer does not have a nonlinearity, so the output is raw logits. This is similar to most of the detection and classification heads used in other works.
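To put (a) and (b) in code form, here is a minimal sketch (my own, not the exact repository code); the helper name corners_to_aabb is hypothetical:

import numpy as np

# Hypothetical helper: collapse the 8 projected corner points of one 3D box
# (shape [8, 2], in image pixels) into an axis-aligned 2D box [x1, y1, x2, y2],
# the same format the 2D detector outputs.
def corners_to_aabb(corners_2d):
    x1, y1 = corners_2d[:, 0].min(), corners_2d[:, 1].min()
    x2, y2 = corners_2d[:, 0].max(), corners_2d[:, 1].max()
    return np.array([x1, y1, x2, y2])

# Normalization constant for the distance feature: the farthest x-y point
# inside the SECOND detection range (0 < x < 70.4, -40 < y < 40).
max_xy_range = np.sqrt(70.4 ** 2 + 40.0 ** 2)  # ~80.97, rounded up to 82.0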

@xavidzo
Author

xavidzo commented May 16, 2021

Thank you. I have some more questions, if you could answer them please:
a) Why did you write the function mentioned above in plain Python/Numba and not in PyTorch?

iou_test, tensor_index, max_num = se.build_stage2_training(box_2d_preds.detach().cpu().numpy(),
                                                box_2d_detector,
                                                -1,
                                                final_scores.detach().cpu().numpy(),
                                                box_2d_scores,
                                                dis_to_lidar.detach().cpu().numpy(),
                                                overlaps1,
                                                tensor_index1)
time_iou_build_end = time.time()
iou_test_tensor = torch.FloatTensor(iou_test)  # iou_test_tensor shape: [160000, 4]
tensor_index_tensor = torch.LongTensor(tensor_index)

I mean, here you first detach the inputs and transfer them to the CPU to cast them to numpy arrays... in my experience, moving tensors from GPU to CPU can be quite slow, 20 ms or more... and then you convert the outputs 'iou_test' and 'tensor_index' back to torch tensors. Could the function 'build_stage2_training()' be written directly in PyTorch, and if not, why not?

b) I am working with CenterPoint on my project with a custom dataset, but in order to try CLOCs first, I trained CenterPoint with a PointPillars backbone on KITTI, though the results are not that great on the 3D task. The author tianweiy himself doesn't know the reason (he also tried it on KITTI).
In case you are interested, I report here my results on the validation split, before and after CLOCs:

CenterPoint alone
2021-05-16 00:14:22,183 - INFO - Evaluation official:
car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP:89.03, 81.67, 81.39
bev  AP:87.42, 80.24, 77.66
3d   AP:70.29, 62.64, 61.73
aos  AP:88.99, 81.28, 80.77
car AP(Average Precision)@0.70, 0.50, 0.50:
bbox AP:89.03, 81.67, 81.39
bev  AP:89.87, 89.02, 88.11
3d   AP:89.70, 88.47, 87.26
aos  AP:88.99, 81.28, 80.77


CenterPoint with CLOCs
2021-05-16 00:04:11,577 - INFO - Evaluation official:
car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP:89.44, 87.76, 86.56
bev  AP:87.51, 80.13, 80.18
3d   AP:82.28, 73.62, 68.91
aos  AP:89.42, 87.54, 86.26
car AP(Average Precision)@0.70, 0.50, 0.50:
bbox AP:89.44, 87.76, 86.56
bev  AP:89.70, 88.72, 88.31
3d   AP:89.58, 88.51, 87.92
aos  AP:89.42, 87.54, 86.26

As you can see, CLOCs did help to boost the results in 3D @0.70, 0.70, 0.70.
I trained CLOCs for 45 epochs; in your paper you said you trained for 15 epochs. Do you think more epochs would help to increase the accuracy, or how did you choose the number of epochs in your case?

c) Now I would like to apply CLOCs for multi-class fusion on KITTI. For this I want to use the same function as before,
def build_stage2_training(boxes, query_boxes, criterion, scores_3d, scores_2d, dis_to_lidar_3d, overlaps, tensor_index)
but only calculate the overlaps when the 2D and 3D detections have the same class label, e.g. if both the 2D and 3D labels are 'Car' I would calculate the IoU, otherwise not.
Then, here, where you compute the 'targets_for_fusion':

            iou_bev = d3_box_overlap(d3_gt_boxes_camera.detach().cpu().numpy(), pred_3d_box.squeeze().detach().cpu().numpy(), criterion=-1)
            iou_bev_max = np.amax(iou_bev,axis=0)
            target_for_fusion = ((iou_bev_max >= 0.7)*1).reshape(1,-1,1)
            positive_index = ((iou_bev_max >= 0.7)*1).reshape(1,-1)
            positives = torch.from_numpy(positive_index).type(torch.float32).cuda()
            negative_index = ((iou_bev_max <= 0.5)*1).reshape(1,-1)

Is it a good idea to keep >= 0.7 as the threshold for positive targets and <= 0.5 for negatives for all boxes of all classes?
Or should this threshold be different for each class because the box sizes are different? For example, Truck boxes are larger than Car boxes, Pedestrian boxes are the smallest, etc.
You already told me that performance is better when you train the fusion network separately for each class, but I would like to fuse all classes in a single pass, all classes at once, because my application needs to run in real time...
I cannot afford to run three or four instances of the fusion network, that would be too slow, hence my question.

d1) In relation to c), why do you calculate the 3D overlap of ground-truth and predicted boxes in the camera reference frame?
Can or should this not be done in the LiDAR frame instead?

d2) Do you have the 2D detections from Cascade R-CNN for the Pedestrian and Cyclist classes? If yes, could you kindly share the files with me, please?

e) Do you think it would make sense to train CLOCs with the same loss function as the 3D detector? In the case of CenterPoint, this would be FastFocalLoss, because CenterPoint is not anchor-based but relies on predicting peaks on a heatmap. During training, it targets a 2D Gaussian produced by projecting the 3D centers of the ground-truth bounding boxes into the map view.

https://github.com/tianweiy/CenterPoint/blob/400ae24bb4e8f4ccabd46757738bb0304bfa2681/det3d/models/losses/centernet_loss.py#L26

https://github.com/tianweiy/CenterPoint/blob/400ae24bb4e8f4ccabd46757738bb0304bfa2681/det3d/models/bbox_heads/center_head.py#L253

@pangsu0613
Owner

a) Yes, you are right, this part can be written in PyTorch; it is slower to move data back and forth between CPU and GPU. Back then I just wanted to use some off-the-shelf functions from SECOND, and then I forgot to change it.

b) Thank you very much for showing your CenterPoint results. I have also tested CenterPoint before. One reason I can think of is that the true-positive metrics are different between KITTI and nuScenes: KITTI uses 3D/2D IoU, while nuScenes uses center distance (0.5, 1.0, 1.5, 2.0 meters); in other words, KITTI has stricter true-positive criteria. Another potential reason is that the KITTI dataset is smaller than nuScenes and only covers the front view, while nuScenes has a 360-degree field of view.
As you can see in the code, CLOCs is a very small network, so I don't think you need to train CLOCs for too many epochs; you can run evaluation every certain number of epochs and watch the validation loss to decide when to stop.

c) Yes, you need different thresholds for different classes. For KITTI, they use 0.7 for Car (large objects) and 0.5 for Pedestrian and Cyclist (smaller objects); you can follow this as well. The CLOCs fusion network is just 4 CNN layers, which is smaller than most detection heads (such as the center_head in CenterNet). My thought is that having multiple CLOCs instances is not a big challenge for real-time performance; I remember that CenterPoint and SECOND both use multiple, much heavier detection heads when doing multi-class detection.
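As a rough illustration (not code from this repo; the negative thresholds for Pedestrian and Cyclist are just placeholders I chose, not values from the paper), per-class thresholds for building the fusion targets could look like this:

import numpy as np

# Positive thresholds follow the KITTI convention mentioned above (0.7 for Car,
# 0.5 for Pedestrian/Cyclist); the negative thresholds are illustrative guesses.
IOU_THRESHOLDS = {
    "Car":        {"pos": 0.7, "neg": 0.5},
    "Pedestrian": {"pos": 0.5, "neg": 0.25},
    "Cyclist":    {"pos": 0.5, "neg": 0.25},
}

def fusion_targets(iou_bev_max, class_name):
    # iou_bev_max: [N] max IoU of each 3D detection candidate with any ground truth
    thr = IOU_THRESHOLDS[class_name]
    positives = (iou_bev_max >= thr["pos"]).astype(np.float32)
    negatives = (iou_bev_max <= thr["neg"]).astype(np.float32)
    return positives, negatives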

d1) Note that the camera coordinate frame is different from the image coordinate frame (in pixels); the camera frame is also a 3D frame (x pointing right, y pointing down, z pointing forward). In KITTI, all the ground-truth labels are given in the 3D camera coordinate frame.

d2) Cascade R-CNN does not provide weights and parameters for Pedestrian and Cyclist, so I used MS-CNN. I have uploaded the 2D detections to the same shared folder mentioned in the README; you can download them from there. The file name is "mscnn_ped_cyc_trainval_sigmoid_data_scale_1000.zip". Note that the scores are in the range [0, 1000] (MS-CNN default settings); when you use them for fusion, remember to divide them by 1000 to bring them into the range [0, 1].

e) Thank you very much for the loss-function links. I am sorry, I am not very familiar with FastFocalLoss; I know that CenterPoint is anchor-free, so they need to modify the loss function. I will have a look at this new loss.

@xavidzo
Author

xavidzo commented May 31, 2021

Hello @pangsu0613, I wanted to give you an update:
So I trained CLOCs on top of CenterPoint, but this time I used the same heatmap FastFocalLoss from CenterPoint instead of SigmoidFocalLoss; the results look similar to before:

2021-05-31 15:55:27,236 - INFO - Evaluation official:
car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP:89.31, 88.18, 86.81
bev  AP:87.36, 81.06, 81.27
3d   AP:81.80, 73.45, 70.43
aos  AP:89.29, 87.93, 86.47
car AP(Average Precision)@0.70, 0.50, 0.50:
bbox AP:89.31, 88.18, 86.81
bev  AP:89.59, 89.11, 88.51
3d   AP:89.48, 88.91, 88.13
aos  AP:89.29, 87.93, 86.47

I wanted to ask you the following:
a) Why do you put a threshold of -100 in this line of code if the scores for the car class are in the range [0, 1]? So basically you don't filter any scores?

top_predictions=middle_predictions[np.where(middle_predictions[:,4]>=-100)]

b) I measured the time it takes to move the tensors from GPU to CPU for casting to numpy; this time is actually negligible, about 0.1 ms.
But I measured that building the input tensor takes 5 ms on average, and that is only for one class. I imagine that building a tensor for each class of interest, in order to train separate CLOCs heads, would then take 5 ms times the number of classes, which I think would be slow, e.g. 15 ms for 3 classes. Do you have any idea how to speed up this function even further?

iou_test,tensor_index, max_num = se.build_stage2_training(box_2d_preds.detach().cpu().numpy(),
                                                box_2d_detector,
                                                -1,
                                                final_scores.detach().cpu().numpy(),
                                                box_2d_scores,
                                                dis_to_lidar.detach().cpu().numpy(),
                                                overlaps1,
                                                tensor_index1)

def build_stage2_training(boxes, query_boxes, criterion, scores_3d, scores_2d, dis_to_lidar_3d,overlaps,tensor_index):

I tried running the function in pure PyTorch code and it was much slower than using numba, around 8 seconds... As far as I understand, with the decorator @numba.jit(nopython=True,parallel=True) you make use of the CPU cores, right? Do you know if numba can also work on the GPU, and if yes, what would the syntax be? As I said, I care about the speed of the algorithm because it should run as fast as possible for my real-time application.

@pangsu0613
Owner

Hello @xavidzo, thank you very much for the updates; it looks like FastFocalLoss gives better performance in BEV mAP.
(a) Yes, this is a parameter that can be changed according to the 2D detections you want to fuse; -100 means no filtering, but you can try other thresholds.
(b) I don't think numba can be used for the GPU here. My suggestion for this part is to change the "build_stage2_training" function into matrix operations: currently "build_stage2_training" is done with for loops, and if you check the code (it is basically about calculating 2D IoU), it can be done with matrix and mask operations on the GPU, which I think could be faster. I thought about this a while ago, but I haven't had time to test it.
I uploaded an early-stage Python script about this to the same shared folder mentioned in the README; the file name is "iou_mat_gpu.py". I tried to build the input tensor using matrix operations; it is unfinished, but you can have a look and take it as a reference, maybe it could help. Let me know if you have further questions or new findings.
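As a rough sketch of that idea (my own example, not the "iou_mat_gpu.py" script): the pairwise 2D IoU between the N projected 3D boxes and the K 2D boxes can be computed with broadcasting on the GPU instead of the nested loops, for example:

import torch

def pairwise_iou_2d(boxes, query_boxes):
    # boxes: [N, 4], query_boxes: [K, 4], both as (x1, y1, x2, y2)
    b = boxes[:, None, :]        # [N, 1, 4]
    q = query_boxes[None, :, :]  # [1, K, 4]
    iw = (torch.min(b[..., 2], q[..., 2]) - torch.max(b[..., 0], q[..., 0])).clamp(min=0)
    ih = (torch.min(b[..., 3], q[..., 3]) - torch.max(b[..., 1], q[..., 1])).clamp(min=0)
    inter = iw * ih
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    area_q = (q[..., 2] - q[..., 0]) * (q[..., 3] - q[..., 1])
    return inter / (area_b + area_q - inter).clamp(min=1e-6)  # [N, K] IoU matrix

The overlapping (n, k) pairs and their features could then be gathered with a mask such as iou > 0 instead of the running index.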

@pangsu0613
Owner

@xavidzo Thank you so much for sharing your results and comprehensive analysis!
Just to confirm one thing: you mentioned that the total delay is around 20 ms, and this includes the CLOCs heads (3-4 ms) and building the input tensors through numba (3-4 ms), so I guess the remaining 10+ ms comes from some functions in prepare_fusion_inputs, am I right?

@xavidzo
Author

xavidzo commented Jun 14, 2021

No, the 10 ms delay alone comes from the steps before I build the input tensors inside the function prepare_input_tensor().
So basically the 20 ms total delay I mentioned earlier is the sum of the operations in prepare_input_tensor() plus running the 3 CLOCs instances. By the way, I am using an Nvidia GeForce RTX 2080 Super GPU.

In the function prepare_fusion_inputs(), before I get the inference results from the 3D detector, it also takes some time to read and parse the 2D detection data to get the list of top_predictions for each class; I measured that this needs around 30 ms. But I don't care much about this time now, since for my application I will not store the 2D data in the KITTI format, but in a format that can be passed directly downstream in the pipeline with few or no extra operations, so hopefully this shouldn't be a big issue.

Moreover, in this loop where I build the tensors in prepare_input_tensor():

    for i in range(3):
        iou_test, tensor_ind, max_num = eval.build_stage2_training(box_2d_preds_numpy[(i)*13392:(i+1)*13392, :],
                                            boxes_2d_detector[i],
                                            -1,
                                            final_scores_numpy[(i)*13392:(i+1)*13392,:].reshape(-1,1),
                                            boxes_2d_scores[i],
                                            dis_to_lidar_numpy[(i)*13392:(i+1)*13392,:],
                                            overlaps[i],
                                            tensor_indices[i])

        iou_test_tensor = torch.FloatTensor(iou_test)
        iou_test_tensor = iou_test_tensor.permute(1,0)
        iou_test_tensor = iou_test_tensor.reshape(1,4,1,900000)

        tensor_ind = torch.LongTensor(tensor_ind)
        tensor_ind = tensor_ind.reshape(-1,2)

        if max_num == 0:
            non_empty_iou_test_tensor = torch.zeros(1,4,1,2)
            non_empty_iou_test_tensor[:,:,:,:] = -1
            non_empty_tensor_index_tensor = torch.zeros(2,2)
            non_empty_tensor_index_tensor[:,:] = -1
        else:
            non_empty_iou_test_tensor = iou_test_tensor[:,:,:,:max_num]
            non_empty_tensor_index_tensor = tensor_ind[:max_num,:]

I tried to include the operations below inside the numba function build_stage2_training() (still in numpy code), and then I cast the outputs non_empty_iou_test_tensor and non_empty_tensor_index_tensor to torch.FloatTensor() and torch.LongTensor() respectively:

        iou_test_tensor = torch.FloatTensor(iou_test)
        iou_test_tensor = iou_test_tensor.permute(1,0)
        iou_test_tensor = iou_test_tensor.reshape(1,4,1,900000)

        tensor_ind = torch.LongTensor(tensor_ind)
        tensor_ind = tensor_ind.reshape(-1,2)

    
        if max_num == 0:
            non_empty_iou_test_tensor = torch.zeros(1,4,1,2)
            non_empty_iou_test_tensor[:,:,:,:] = -1
            non_empty_tensor_index_tensor = torch.zeros(2,2)
            non_empty_tensor_index_tensor[:,:] = -1
        else:
            non_empty_iou_test_tensor = iou_test_tensor[:,:,:,:max_num]
            non_empty_tensor_index_tensor = tensor_ind[:max_num,:]

I tried what I just described, and building the 3 input tensors took less than a millisecond in total; however, when I run inference with the CLOCs heads, the inference time increases to more than 100 ms. Weird behavior, I don't know why this happens...

Anyway, if you have some ideas to further optimize my code, I would be glad to hear any suggestions, since I believe your PyTorch and Python skills are better than mine. I looked into the inference time of other state-of-the-art multi-modality sensor fusion works; the majority run at 15 Hz or much slower, so my approach with your method is about the same, meaning in total CenterPoint + CLOCs = 60 ms ~ 15 Hz, but my thesis supervisor wants it even faster. It's true that the tiny CLOCs fusion layer itself is pretty fast to run; only the pre-processing functions before it take "a long time". I was thinking about how to turn CLOCs from a late-fusion method into an early-fusion or intermediate-fusion one, but I guess in terms of speed it would not make a difference.

I also wanted to ask whether you could perhaps share with me some of the code you used in the experiments presented in your paper regarding the improvement of CLOCs at various distance ranges? I would like to do a similar analysis.
Furthermore, your work with CLOCs will be properly cited in my master's thesis; in case you are interested, I could forward you a copy after I finish it. I have already reviewed CLOCs in one paper we submitted from the robotics chair at my university.

@pangsu0613
Owner

Hello @xavidzo, sorry for the late response. You could refer to #28 on how to generate results for different distance ranges. Let me know if you have further questions. I'll look into the speed issue and let you know my thoughts.

@urbansound8K

@xavidzo Could you share the steps you followed to build the real-time inference, please?

@xavidzo
Author

xavidzo commented Jul 1, 2021

Hi @urbansound8K, I think it's fairly clear from the code snippet I posted.
I could help you better if you ask more specific questions.
My code is not exactly real-time at the moment; it's just the piece I use for training and evaluation. For real-time, I still have to integrate the ROS components and YOLOv4 into the pipeline.

@xavidzo
Author

xavidzo commented Jul 1, 2021

Hello @pangsu0613, do you already have some feedback on the matter of speed, please? Any ideas on how to make it faster?

@pangsu0613
Owner

Hello @xavidzo, sorry for the late response, I am swamped with many tasks. One piece of advice I have for speeding things up is to reduce the number of 3D and 2D detection candidates (especially for the SECOND 3D detector). Currently there are 70400 3D detection candidates from SECOND for each frame, but most of them (probably more than 90%) are of very low quality (with very low scores, such as 0.01) and do not contribute much to the fusion and the final output. I suggest setting a score threshold of 0.1, 0.15 or 0.2; this reduces the number of detection candidates to only hundreds (even fewer than 100 for many frames) without affecting the final performance too much. The same goes for the 2D detection candidates. I believe this could speed things up. I'll let you know if I have more feedback.
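A minimal sketch of that suggestion (the variable names are illustrative, not the exact ones in the repo): filter the candidates by score before building the fusion input tensor.

import torch

def filter_candidates(projected_boxes, scores_3d, dis_to_lidar, score_thresh=0.1):
    # Keep only the 3D detection candidates whose score exceeds the threshold,
    # shrinking e.g. 70400 SECOND candidates down to a few hundred per frame.
    keep = scores_3d.squeeze(-1) >= score_thresh
    return projected_boxes[keep], scores_3d[keep], dis_to_lidar[keep]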

@xavidzo
Author

xavidzo commented Sep 15, 2021

Hello @pangsu0613, I just wanted to let you know that your CLOCs method does indeed work for real-time inference in combination with a super-fast detector like YOLOv4. I tested this in my thesis with the classes Car, Pedestrian, Cyclist and Van on KITTI. The accuracy was improved for all classes, although the accuracy of CenterPoint with a PointPillars backbone is poor on KITTI when trained on all classes; tianweiy, the author of CenterPoint, himself doesn't know why. Qualitatively speaking, yes, one can see a reduced number of false positives and missed detections.
But my point is, having a separate CLOCs network for every class is the way to proceed. The only thing one has to consider is having a powerful GPU with more than 8 GB of memory, to be able to run the 3D detector (in my case CenterPoint) and the 2D detector (in my case YOLOv4) in parallel. I used an Nvidia RTX 3090 GPU, and the latency introduced by CLOCs (I mean everything related to CLOCs: the preprocessing functions to project the 3D boxes to the image plane and build the input tensors for fusion, plus the pass through the 4 CLOCs layers) is in total only around 10 or 15 ms, so it's super fast on the RTX 3090. The time efficiency is also due to the numba function for building the input tensors.

I read that you are testing CenterPoint on nuScenes. Can you already confirm that CLOCs also improves the evaluation metrics of CenterPoint on nuScenes?
Will you publish a new paper showing the applicability of CLOCs on the nuScenes or Waymo datasets? If yes, when do you plan to publish your results? Thanks

@pangsu0613
Owner

Hello @xavidzo, that's wonderful! Thank you for keeping me posted! May I know which 3D detector you used for KITTI? Did you set a score threshold for the 3D detection candidates?
As for CenterPoint, one potential reason is that the true-positive metrics for nuScenes and KITTI are different: nuScenes uses center-distance-based true-positive metrics that ignore the dimensions of the bounding boxes, while KITTI uses the stricter 3D IoU.
Yes, I have tested a new version of CLOCs on nuScenes, and yes, CLOCs also improves the evaluation metrics of CenterPoint on nuScenes. The new paper is under review right now.

@xavidzo
Author

xavidzo commented Sep 16, 2021

Yes, as I said, I trained a CenterPoint model with a PointPillars backbone on KITTI, but the accuracy is not that great... at least in theory, judging from the evaluation results... I didn't set any score threshold for the 3D detection candidates. I took the raw output of CenterPoint, which means I assigned each class to one detection head, and the size of each detection head is 248 x 216 = 53568. These numbers depend on the point cloud range, the voxel size and the output stride of the neck in the network configuration, but I don't know the exact formula for calculating them beforehand (a guess at the relation is sketched below).
Anyway, the number of 3D boxes / scores fed into the fusion for one class alone is 53568.
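My guess at the relation, assuming the usual KITTI PointPillars settings (the range, voxel size and stride below are assumptions, not values read from my config): the head size is the voxelized BEV grid divided by the output stride of the neck.

pc_range   = [0.0, -39.68, 69.12, 39.68]   # x_min, y_min, x_max, y_max [m] (assumed)
voxel_size = [0.16, 0.16]                  # x, y [m] (assumed)
out_stride = 2                             # output stride of the neck (assumed)

grid_x = round((pc_range[2] - pc_range[0]) / voxel_size[0]) // out_stride   # 216
grid_y = round((pc_range[3] - pc_range[1]) / voxel_size[1]) // out_stride   # 248
print(grid_x * grid_y)                     # 53568 = 248 * 216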

I understand that the evaluation metrics are different / stricter on KITTI than on nuScenes, but that does not explain why the CenterPoint performance is not that great. I think PointPillars performs better on KITTI, so CenterPoint lags a little behind the PointPillars baseline (more noticeably for the car class).

Could you please tell me how you handled the different losses for all classes with CLOCs on nuScenes?
Did you use the suggestion I posted earlier in this thread? Or if you did it another way, could you maybe paste a code snippet of your loss function for CLOCs on nuScenes? Thanks a lot in advance.

Also, will your new paper have the same title it currently has on arXiv? Otherwise, can you please tell me the new title?
I would really like to read it as soon as it is available.

@pangsu0613
Owner

Hello @xavidzo, sorry for the late response. From the loss-function perspective, what I did is very straightforward: I have one loss function for each class, because I trained the different classes separately (the disadvantage is that it is a little time-consuming, because there are 10 classes in nuScenes, not 3 as in KITTI; the good thing is that training CLOCs is fast because CLOCs is a small network). I used the same SigmoidFocalClassificationLoss provided in this repo (originally from the SECOND codebase). I should have tried to train multiple classes simultaneously, but I didn't have enough time; I think it is definitely feasible to train them together, since SigmoidFocalClassificationLoss can also handle multiple classes. I didn't use the FastFocalLoss you suggested due to the limited time I had, sorry about that.
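For reference, a generic sketch of sigmoid focal loss for one class (the standard formulation with the usual alpha/gamma defaults; not copied from the SECOND/CLOCs implementation):

import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits, targets: tensors of the same shape, targets in {0, 1}
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()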
Because it is a double-blind review process, we cannot share the paper for now, but we will release it on arXiv once we know the final decision.
Let me know if you have further questions.

@xavidzo
Author

xavidzo commented Oct 13, 2021

Hello @pangsu0613, just to be clear: so you trained separate CLOCs networks for every class in nuScenes, and then for evaluation (or inference) you group all the CLOCs networks together and add them to a single CenterPoint model (and then also load a different checkpoint for every class-specific CLOCs instance)? Or did you also train different CenterPoint models for every class?
Were you able to isolate the CLOCs checkpoint from CenterPoint, or can you be a bit more specific about how you integrated CLOCs? I mean, in my experiments I defined a 'Fusion Layer' module that contains all the CLOCs CNNs I need, and a CenterPoint detector is a member inside this module, frozen so that its weights are not updated. So when I train the whole 'Fusion Layer' class, the resulting checkpoint contains the weights of both CenterPoint and all the CLOCs CNNs together. But here in your repository with SECOND, I saw you were able to isolate and save the CLOCs checkpoint independently of the pretrained SECOND checkpoint, which I was not able to do.
If you trained one CLOCs network per class in nuScenes, how did you define the threshold for the positive targets of the loss function: again > 0.7 for 'Car' and similar, and what for other classes like 'Truck', 'Pedestrian', etc.?
Maybe after your new paper is released, could you share some of your code, please?
Would you be interested in reading my thesis, which is heavily based on your work, with experiments and qualitative results on KITTI? I could forward you the PDF.

@pangsu0613
Owner

pangsu0613 commented Nov 9, 2021

Hello @xavidzo, sorry for the late response.

just to be clear, so you trained separate CLOCs networks for every class in nuScenes, and then for the evaluation (or inference) you group all the CLOCs networks together and add these to a single CenterPoint model (then you also load many different checkpoints for every CLOCs instance specialized in one class)? Or did you trained also different CenterPoint models for every class?

Yes, I have one CenterPoint model and multiple CLOCs networks; each CLOCs network is for one class in nuScenes, and I have multiple checkpoints for the different CLOCs networks. nuScenes has a different evaluation metric based on center distance (not 3D IoU), which I think is not as strict as 3D IoU, so I set lower thresholds for all classes: for Car, Bus, Truck and other large objects I set 0.5, and for Pedestrian, Bicycle and other smaller objects I set 0.25.
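In dictionary form (just restating the thresholds above; the exact class list here is illustrative):

# Center-distance-based positive thresholds (meters) per class, as described above.
CENTER_DIST_POS_THRESH = {
    "car": 0.5, "truck": 0.5, "bus": 0.5,        # large objects
    "pedestrian": 0.25, "bicycle": 0.25,         # smaller objects
}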

@xavidzo
Author

xavidzo commented Jan 28, 2022

Hello @pangsu0613, first of all, congratulations on your improvement over the baseline, I mean Fast-CLOCs, well done sir!
Do you have an estimated date for when the code will be released?
And secondly, could you tell me where to find the supplementary material for the paper? Thank you.
