Some code questions #33
Hello @xavidzo,
> Thank you. I have some more questions, if you could answer them please (`CLOCs/second/pytorch/models/voxelnet.py`, line 509 in b2f0e23):
>
> a) Here you first detach the inputs to move them to the CPU for casting to NumPy arrays. In my experience, moving tensors from GPU to CPU can be quite slow, 20 ms or more, and afterwards you convert the outputs 'iou_test' and 'tensor_index' back to torch tensors. Could the function `def build_stage2_training()` be written directly in PyTorch, and if not, why not?
>
> b) I am working with CenterPoint on my project with a custom dataset, but in order to try CLOCs first, I trained CenterPoint with a PointPillars backbone on KITTI, though the results are not that great on the 3D task. The author tianweiy himself doesn't know the reason (he also tried it on KITTI). As you can see, CLOCs did help to boost the results in 3D @0.70, 0.70, 0.70.
>
> c) Now I would like to apply CLOCs for multi-class fusion on KITTI. For this I want to use the same function as before. Is it a good idea to keep > 0.7 as the threshold for positive targets and < 0.5 for negatives for all boxes of all classes?
>
> d1) In relation to c), why do you calculate the 3D overlap of ground-truth and predicted boxes in the camera reference frame?
>
> d2) Do you have the 2D detections from Cascade R-CNN for the Pedestrian and Cyclist classes? If yes, could you kindly share the files with me, please?
>
> e) Do you think it would make sense to train CLOCs with the same loss function as the 3D detector? For CenterPoint this would be the FastFocalLoss, because CenterPoint is not anchor-based but relies on predicting peaks on a heatmap; during training it targets a 2D Gaussian produced by projecting the 3D centers of the ground-truth bounding boxes into the map view.
a) Yes, you are right, this part can be written in PyTorch; it is slower to move data back and forth between CPU and GPU. Back then I just wanted to use some off-the-shelf functions from SECOND, and then I forgot to change it. b) Thank you very much for showing your CenterPoint results. I have also tested CenterPoint before. One reason I can think of is that the true-positive metrics differ between KITTI and nuScenes: KITTI uses 3D/2D IoU, while nuScenes uses center distance (0.5, 1.0, 1.5, 2.0 meters); in other words, KITTI has a stricter true-positive metric. Another potential reason is that the KITTI dataset is smaller than nuScenes and only covers the front view, while nuScenes has a 360° field of view. c) Yes, you need different thresholds for different classes. For KITTI, they use 0.7 for Car (large objects) and 0.5 for Pedestrian and Cyclist (smaller objects); you can follow this as well. The CLOCs fusion network is just 4 layers of CNNs, smaller than most detection heads (such as the center_head in CenterNet), so my thought is that having multiple CLOCs instances is not a big challenge for real-time performance. I remember CenterPoint and SECOND both have multiple, much heavier detection heads when doing multi-class detection. d1) Note that the camera coordinate frame is different from the image coordinate frame (in pixels); camera coordinates are also 3D (x pointing right, y pointing down, z pointing forward). In KITTI, all the ground-truth labels are given in 3D camera coordinates. d2) Cascade R-CNN does not provide weights and parameters for Pedestrian and Cyclist, so I used MS-CNN. I have uploaded the 2D detections to the same shared folder mentioned in the README; you can download them from there. The file name is "mscnn_ped_cyc_trainval_sigmoid_data_scale_1000.zip". Note that the scores are in the range [0, 1000] (MS-CNN default settings); when you use them for fusion, remember to divide them by 1000 to bring them into [0, 1]. e) Thank you very much for the loss function links. I am sorry, I am not very familiar with FastFocalLoss; I know that CenterPoint is anchor-free, so they need to modify the loss function. I will have a look at this new loss.
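For illustration, a pure-PyTorch version of the pairwise 2D overlap that stays on the GPU could look roughly like the sketch below. This is a minimal sketch, not code from this repo; the `[x1, y1, x2, y2]` box layout and function name are assumptions.

```python
import torch

def pairwise_iou_torch(boxes: torch.Tensor, query_boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise 2D IoU entirely on the GPU; boxes (N, 4), query_boxes (K, 4)."""
    # Intersection rectangle corners via broadcasting: (N, K, 2).
    lt = torch.max(boxes[:, None, :2], query_boxes[None, :, :2])
    rb = torch.min(boxes[:, None, 2:], query_boxes[None, :, 2:])
    wh = (rb - lt).clamp(min=0)                      # empty intersections become 0
    inter = wh[..., 0] * wh[..., 1]                  # (N, K)
    area_a = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_b = (query_boxes[:, 2] - query_boxes[:, 0]) * (query_boxes[:, 3] - query_boxes[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)
```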
Hello @pangsu0613, I wanted to give you an update:
a) I wanted to ask you about the following code: `CLOCs/second/pytorch/models/voxelnet.py`, line 398 in b2f0e23.
b) I measured the time it takes to detach from GPU to CPU for casting to NumPy; this time is actually negligible, around 0.1 ms.
(Line 127 in b2f0e23)
I tried running the function in pure PyTorch code, and it was much slower than using Numba, around 8 seconds. As far as I understand, the decorator `@numba.jit(nopython=True, parallel=True)` makes use of the CPU cores, right? Do you know whether Numba can also work on the GPU, and if yes, what would the syntax be? As I said, I care about the speed of the algorithm because it should run as fast as possible for my real-time application.
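From the Numba documentation, my understanding is that GPU execution uses the separate `numba.cuda` API rather than `@numba.jit`. A minimal sketch of the kernel and launch syntax, assuming `[x1, y1, x2, y2]` boxes (this is only my own illustration, not code from this repo):

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def pairwise_iou_kernel(boxes, query_boxes, out):
    # One GPU thread per (box, query_box) pair.
    n, k = cuda.grid(2)
    if n < boxes.shape[0] and k < query_boxes.shape[0]:
        iw = min(boxes[n, 2], query_boxes[k, 2]) - max(boxes[n, 0], query_boxes[k, 0])
        ih = min(boxes[n, 3], query_boxes[k, 3]) - max(boxes[n, 1], query_boxes[k, 1])
        if iw > 0 and ih > 0:
            area_n = (boxes[n, 2] - boxes[n, 0]) * (boxes[n, 3] - boxes[n, 1])
            area_k = (query_boxes[k, 2] - query_boxes[k, 0]) * (query_boxes[k, 3] - query_boxes[k, 1])
            out[n, k] = iw * ih / (area_n + area_k - iw * ih)

boxes = np.random.rand(512, 4).astype(np.float32)
query_boxes = np.random.rand(128, 4).astype(np.float32)
d_out = cuda.to_device(np.zeros((boxes.shape[0], query_boxes.shape[0]), dtype=np.float32))

threads = (16, 16)
blocks = (math.ceil(boxes.shape[0] / threads[0]),
          math.ceil(query_boxes.shape[0] / threads[1]))
pairwise_iou_kernel[blocks, threads](cuda.to_device(boxes), cuda.to_device(query_boxes), d_out)
overlaps = d_out.copy_to_host()
```

Whether this beats the parallel CPU version would depend on the array sizes and the host-device transfer overhead, so it is worth benchmarking both.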
Hello @xavidzo, thank you very much for the updates, it looks like FastFocalLoss has better performance in BEV mAP.
@xavidzo Thank you so much for sharing your results and comprehensive analysis!
No, the 10 ms delay alone comes from the steps before I build the input tensors inside the function def prepare_input_tensor(). In the function def prepare_fusion_inputs(), before I get the inference results from the 3d_detector, it also takes some time to read and parse the 2d_detections data to produce the list of top_predictions for each class; I measured that this needs around 30 ms. But I don't care much about this time now, since for my application I will not store the 2d_data exactly in KITTI format, but already in a format suitable for passing directly downstream in the pipeline, with few or no extra operations, so hopefully this shouldn't be a big issue. Moreover, regarding the loop where I build the tensors in def prepare_input_tensor():
I tried to include the operations below in the Numba function def build_stage2_training(), still in NumPy code, and then cast the outputs non_empty_iou_test_tensor and non_empty_tensor_index_tensor to torch.FloatTensor() and torch.LongTensor(), respectively.
I tried what I just described, and building the 3 input tensors took less than a millisecond in total. However, when I run inference with the CLOCs heads, the inference time increases to more than 100 ms. Weird behavior; I don't know why this happens. Anyway, if you have some ideas to further optimize my code, I would be glad to hear any suggestions, since I believe your skills in PyTorch and Python are better than mine. I researched the inference time of other state-of-the-art multi-modality sensor fusion works; the majority run at 15 Hz or much slower, so my approach with your method is quite similar, I suppose: in total, CenterPoint + CLOCs = 60 ms ≈ 15 Hz, but my thesis supervisor wants it even faster. It's true the tiny CLOCs fusion layer itself is pretty fast to run; only the pre-processing functions before it take "a long time". I was thinking about how to turn CLOCs from a late-fusion method into an early-fusion or intermediate-fusion one, but I guess in terms of speed it would not make a difference. I also wanted to ask whether you could kindly share some of the code you used in the experiments presented in your paper regarding the improvement of CLOCs at various distance ranges? I would like to do some analysis like that as well.
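One thing I still need to rule out is measurement skew from asynchronous CUDA execution: PyTorch launches GPU kernels asynchronously, so timing without explicit synchronization can attribute GPU work to the wrong code section. A sketch of how I am timing now (`module` here is just a placeholder for the fusion head):

```python
import time
import torch

def timed_forward(module: torch.nn.Module, inputs: torch.Tensor) -> float:
    """Forward-pass time in milliseconds, with explicit CUDA syncs so that
    previously queued kernels are not billed to this measurement."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        module(inputs)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0
```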
@xavidzo Could you share the steps you took to build the real-time inference, please?
Hi @urbansound8K, I think it's pretty understandable from the code snippet I posted.
Hello @pangsu0613, do you already have some feedback on the matter of speed, please? Any ideas on how to make it faster?
Hello @xavidzo, sorry for the late response, I am swamped with many tasks. One piece of advice I have for speeding things up is to reduce the number of 3D and 2D detection candidates (especially for the SECOND 3D detector). Currently there are 70400 3D detection candidates from SECOND for each frame, but most of them (probably more than 90%) are of very low quality (with very low scores, such as 0.01) and do not contribute much to the fusion and the final output. I suggest setting a score threshold of 0.1, 0.15 or 0.2; this will reduce the number of detection candidates to only hundreds (even fewer than 100 for many frames) without affecting the final performance too much. The same goes for the 2D detection candidates. I believe this could speed things up. I'll let you know if I have more feedback.
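As a rough illustration of the thresholding (a sketch only; the tensor layout is assumed, not the exact structure in this repo):

```python
import torch

SCORE_THRESHOLD = 0.1  # 0.1-0.2 keeps only hundreds of candidates per frame

def filter_candidates(boxes: torch.Tensor, scores: torch.Tensor,
                      threshold: float = SCORE_THRESHOLD):
    """Drop low-score detection candidates before building the fusion tensors.
    boxes: (N, 7) 3D boxes or (N, 4) 2D boxes; scores: (N,) in [0, 1]."""
    keep = scores >= threshold
    return boxes[keep], scores[keep]
```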
Hello @pangsu0613, I just wanted to let you know that your CLOCs method does indeed work for real-time inference in combination with a super-fast detector like YOLOv4. I tested this in my thesis with the classes Car, Pedestrian, Cyclist, and Van in KITTI. The accuracy was improved for all classes, although the accuracy of CenterPoint with a PointPillars backbone is poor in KITTI when trained on all classes; tianweiy, the author of CenterPoint, himself doesn't know why. Qualitatively speaking, yes, one can see a reduced number of false positives and missed detections. I read that you are testing CenterPoint on nuScenes. Can you already confirm that CLOCs also improves the evaluation metrics of CenterPoint on nuScenes?
Hello @xavidzo, that's wonderful! Thank you for keeping me posted! May I know which 3D detector you used for KITTI? Did you set a score threshold for the 3D detection candidates?
Yes, as I said, I trained a CenterPoint model with a PointPillars backbone on KITTI, but the accuracy is not that great, at least judging from the evaluation results. I didn't set any score threshold for the 3D detection candidates; I took the raw output of CenterPoint, which means I assigned each class to one detection head, and the size of each detection head is 248 x 216 = 53568. These numbers depend on the point cloud range, the voxel size, and the output stride of the neck in the network configuration, but I don't know the formula for calculating them beforehand (see my guess after this comment). I understand that the evaluation metrics are different / stricter on KITTI than on nuScenes, but that does not explain why the CenterPoint performance is not that great. I think PointPillars performs better on KITTI, so CenterPoint lags a little behind the PointPillars baseline (more noticeably for the Car class). Could you please tell me how you handled the different losses for all classes with CLOCs in nuScenes? Also, will your new paper have the same name it currently has on arXiv? Otherwise, can you please tell me the new title?
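On the head sizes above: my best guess at the relation, though I haven't verified it against the config files, is head size = (point-cloud range / voxel size) / output stride, per axis. With the usual KITTI PointPillars settings (an assumption on my part) this reproduces 248 x 216:

```python
# Assumed config: x range [0, 69.12] m, y range [-39.68, 39.68] m,
# voxel size 0.16 m, output stride (out_size_factor) 2.
def head_size(range_min: float, range_max: float, voxel: float, stride: int) -> int:
    return int((range_max - range_min) / voxel / stride)

print(head_size(-39.68, 39.68, 0.16, 2))  # 248 (y axis)
print(head_size(0.0, 69.12, 0.16, 2))     # 216 (x axis)
```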
Hello @xavidzo, sorry for the late response. From the loss function perspective, what I did is very straightforward: I have one loss function for each class, because I trained the different classes separately (the disadvantage is that it is a little time-consuming, because there are 10 classes in nuScenes rather than 3 in KITTI; the good thing is that training CLOCs is fast, because CLOCs is a small network). I used the same SigmoidFocalClassificationLoss provided in this repo (originally from the SECOND codebase). I should have tried training multiple classes simultaneously, but I don't have enough time. I think it is definitely feasible to train them together, though, since SigmoidFocalClassificationLoss can also handle multiple classes. I didn't use the FastFocalLoss you suggested due to the limited time I have, sorry about that.
Hello @pangsu0613, just to be clear: so you trained separate CLOCs networks for every class in nuScenes, and then for the evaluation (or inference) you group all the CLOCs networks together and add them to a single CenterPoint model (and also load a different checkpoint for every CLOCs instance specialized in one class)? Or did you also train different CenterPoint models for every class?
Hello @xavidzo, sorry for the late response.
Yes, I have one CenterPoint model and multiple CLOCs networks; each CLOCs network is for one class in nuScenes, and I have multiple checkpoints for the different CLOCs networks. nuScenes has a different evaluation metric based on center distance (not 3D IoU), which I think is not as strict as 3D IoU, so I set lower thresholds for all the classes: for Car, Bus, Truck and other large objects I set 0.5, and for Pedestrian, Bicycle and other smaller objects I set 0.25.
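Spelled out as a config sketch (the keys are the standard nuScenes detection classes; the exact large/small grouping beyond the classes named above is my assumption):

```python
# Per-class positive-target thresholds (center-distance based) for training
# CLOCs on nuScenes, as described above; grouping partly assumed.
POSITIVE_TARGET_THRESHOLD = {
    # large objects
    "car": 0.5, "truck": 0.5, "bus": 0.5, "trailer": 0.5, "construction_vehicle": 0.5,
    # smaller objects
    "pedestrian": 0.25, "bicycle": 0.25, "motorcycle": 0.25, "barrier": 0.25, "traffic_cone": 0.25,
}
```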
Hello @pangsu0613, first of all, congratulations on your improvement over the baseline, I mean Fast-CLOCs. Well done, sir!
Hello @pangsu0613, a few more questions:

a) Could you please explain in words the idea behind the algorithm that finds the overlaps between the projected 3D boxes (called 'boxes' in the code) and the 2D boxes (called 'query_boxes' in the code)? My current understanding of a standard pairwise overlap routine is sketched right below; I would like to confirm whether yours follows the same idea.
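A minimal sketch of what I mean, in the same Numba style (this is my own guess at the idea, not the repo's exact code; boxes are `[x1, y1, x2, y2]` in float32):

```python
import numba
import numpy as np

@numba.jit(nopython=True)
def pairwise_overlap(boxes, query_boxes):
    """Pairwise 2D IoU between boxes (N, 4) and query_boxes (K, 4)."""
    N, K = boxes.shape[0], query_boxes.shape[0]
    overlaps = np.zeros((N, K), dtype=np.float32)
    for k in range(K):
        q_area = (query_boxes[k, 2] - query_boxes[k, 0]) * (query_boxes[k, 3] - query_boxes[k, 1])
        for n in range(N):
            # Width of the intersection rectangle; skip the pair if it is empty.
            iw = min(boxes[n, 2], query_boxes[k, 2]) - max(boxes[n, 0], query_boxes[k, 0])
            if iw > 0:
                ih = min(boxes[n, 3], query_boxes[k, 3]) - max(boxes[n, 1], query_boxes[k, 1])
                if ih > 0:
                    union = ((boxes[n, 2] - boxes[n, 0]) * (boxes[n, 3] - boxes[n, 1])
                             + q_area - iw * ih)
                    overlaps[n, k] = iw * ih / union
    return overlaps
```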
b) Here, when you calculate the feature 'distance_to_the_lidar', why do you divide by 82.0?
(`CLOCs/second/pytorch/models/voxelnet.py`, line 497 in b2f0e23)
c) Also, I don't understand why the output scores of the fusion network 'cls_pred' are in raw log format (logits) even though the input 3D and 2D scores were in sigmoid format. Could you please tell me the reason?
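My guess, though I am not sure, is that the training loss consumes raw logits for numerical stability, so the sigmoid is applied only at inference time. Roughly (a hypothetical illustration, not this repo's code):

```python
import torch
import torch.nn.functional as F

cls_pred = torch.randn(1, 1, 70400)        # stand-in for the fusion network's raw output
target = torch.zeros_like(cls_pred)        # dummy targets, just for the example
loss = F.binary_cross_entropy_with_logits(cls_pred, target)  # training works on logits
probs = torch.sigmoid(cls_pred)            # probabilities recovered at inference
```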