
Lidong Yu

Computer Vision Researcher

My Experience

I am currently looking for full-time engineering and PhD positions. I have just finished a year as a research intern at the UISEE (Nanjing) AI Center under the supervision of Prof. Jianbo Shi, where my work mainly focused on high-level vision tasks such as tracking (SOT and MOT) and relative navigation. Before that, I received my Master's degree in Computer Vision from the Beijing Institute of Technology, advised by Prof. Yunde Jia and Prof. Yuwei Wu. My master's thesis focused on low-level vision tasks such as stereo matching and monocular depth estimation. I received my Bachelor's degree in Computer Science from the Ocean University of China.

Publications

1. Deep Stereo Matching with Explicit Cost Aggregation Sub-Architecture.

Lidong Yu, Yucheng Wang, Yuwei Wu, Yunde Jia

32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.

2. FoggyStereo: Stereo Matching with Fog Volume Representation.

Chengtang Yao, Lidong Yu

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Research Projects

1. Deep Stereo Matching with Explicit Cost Aggregation

Key Words:

Stereo Matching, Cost Aggregation, Interpretable Guidance

Problem:

Current deep stereo matching methods lack an interpretable cost aggregation step.

Solution:

We propose to use the low-level feature map, which contains rich geometry information such as edges and silhouettes, to guide the cost aggregation.

Results:

Our method outperforms the baseline GC-Net by 2.4% on EPE, reached the top 3 on the KITTI 2015 benchmark in 2017, and surpasses it by 0.7 on foreground regions. The final disparity map recovers finer details and keeps sharper edges.
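To make the guidance idea concrete, here is a minimal PyTorch sketch of cost aggregation modulated by weights predicted from a low-level feature map. The module names, channel sizes, and the single 3D convolution are simplifications of my own, not the exact AAAI architecture.

```python
# A minimal sketch of guided cost aggregation (hypothetical shapes and names).
import torch
import torch.nn as nn

class GuidedCostAggregation(nn.Module):
    """Aggregate a matching cost volume under guidance from a low-level
    feature map that still carries edges and silhouettes."""
    def __init__(self, cost_channels=32, guide_channels=16):
        super().__init__()
        # Guidance branch: turn the low-level feature map into per-pixel weights.
        self.guide = nn.Sequential(
            nn.Conv2d(guide_channels, cost_channels, 3, padding=1),
            nn.Sigmoid(),
        )
        # Plain 3D aggregation over (disparity, height, width).
        self.agg = nn.Conv3d(cost_channels, cost_channels, 3, padding=1)

    def forward(self, cost, low_feat):
        # cost: (B, C, D, H, W) cost volume; low_feat: (B, guide_channels, H, W)
        w = self.guide(low_feat).unsqueeze(2)   # (B, C, 1, H, W) guidance weights
        return self.agg(cost * w)               # guided aggregation

# Usage on random tensors
cost = torch.randn(1, 32, 48, 64, 128)
low_feat = torch.randn(1, 16, 64, 128)
print(GuidedCostAggregation()(cost, low_feat).shape)  # torch.Size([1, 32, 48, 64, 128])
```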

2. Explicit Context Mapping for Stereo Matching

Key Words:

Stereo Matching, Up-Sampling, Bottleneck Module

Problem:

The sub-sampling operation gives the features a large receptive field, but it also loses details such as silhouettes. Skip-connections can preserve the details, but it is hard to interpret how the high-level features refine the low-level matching result.

Solution:

We explicitly model the mapping relation through the similarity between high-resolution (HR) and low-resolution (LR) feature maps. The mapping relation is represented by a weight matrix in which each element is the similarity between the corresponding HR and LR feature vectors. To map the LR matching result back to HR space, we use this computed matrix rather than the usual up-sampling layers such as bilinear interpolation or deconvolution.

Results:

Our method achieves a state-of-the-art result of 0.59 EPE on the SceneFlow dataset, where the baseline PSMNet only reaches 1.09 EPE. The results on KITTI are merely comparable, because the sparse supervision is not sufficient to learn the mapping matrix through back-propagation.
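A toy sketch of the explicit mapping, assuming a global similarity matrix between HR and LR feature vectors; the softmax normalisation and the disparity rescaling by the resolution ratio are my assumptions for illustration.

```python
# Map an LR result back to HR space with an explicit similarity matrix instead of
# bilinear/deconvolution up-sampling. Sizes are illustrative only.
import torch
import torch.nn.functional as F

def context_mapping(hr_feat, lr_feat, lr_disp):
    # hr_feat: (B, C, H, W), lr_feat: (B, C, h, w), lr_disp: (B, 1, h, w)
    B, C, H, W = hr_feat.shape
    _, _, h, w = lr_feat.shape
    hr = hr_feat.flatten(2).transpose(1, 2)              # (B, H*W, C)
    lr = lr_feat.flatten(2)                              # (B, C, h*w)
    sim = torch.bmm(hr, lr) / C ** 0.5                   # (B, H*W, h*w) similarities
    weight = F.softmax(sim, dim=-1)                      # row-normalised mapping matrix
    disp = torch.bmm(weight, lr_disp.flatten(2).transpose(1, 2))  # (B, H*W, 1)
    # scale disparities by the resolution ratio when moving to HR space
    return disp.transpose(1, 2).reshape(B, 1, H, W) * (W / w)

hr_feat, lr_feat = torch.randn(1, 32, 64, 128), torch.randn(1, 32, 16, 32)
lr_disp = torch.rand(1, 1, 16, 32) * 32
print(context_mapping(hr_feat, lr_feat, lr_disp).shape)  # torch.Size([1, 1, 64, 128])
```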

3. Interpretable Stereo Matching with Region Support

Key Words:

Stereo Matching, Interpretable Network, Regional Guidance, Guided Learning

Problem:

Current deep stereo matching methods are mostly built as black boxes and lack interpretability. To build a reliable stereo matching pipeline, we argue that it is essential to design the network in an explainable manner.

Solution:

We propose region support knowledge as guidance to regularize deep stereo matching. The region support knowledge, P, consists of four kinds of prior knowledge: P1 is the discriminative region supports, P2 is the representative pixel sets, P3 is the object-level segmentation, and P4 is the plane-level segmentation.

Results:

By applying the region support to all three steps, the final stereo matching method outperforms the baseline GC-Net on the SceneFlow and KITTI datasets. On SceneFlow, it is four times faster than the baseline and improves EPE by 0.2. On KITTI 2015, it reaches 2.34 all-pixel error, 0.5 better than GC-Net.
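For illustration only, a minimal container for the four priors could look as follows; the field names and tensor shapes are my own shorthand, not the project's actual interface.

```python
# A small data-structure sketch of the region support knowledge P = {P1..P4}.
from dataclasses import dataclass
import torch

@dataclass
class RegionSupport:
    discriminative_regions: torch.Tensor  # P1: (H, W) mask of discriminative region supports
    representative_pixels: torch.Tensor   # P2: (N, 2) coordinates of representative pixels
    object_segmentation: torch.Tensor     # P3: (H, W) object-level segment labels
    plane_segmentation: torch.Tensor      # P4: (H, W) plane-level segment labels

support = RegionSupport(
    discriminative_regions=torch.zeros(256, 512, dtype=torch.bool),
    representative_pixels=torch.zeros(100, 2, dtype=torch.long),
    object_segmentation=torch.zeros(256, 512, dtype=torch.long),
    plane_segmentation=torch.zeros(256, 512, dtype=torch.long),
)
print(support.object_segmentation.shape)
```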

4. Monocular Depth Estimation with Regional Guidance

Key Words:

Monocular Depth Estimation, Auxiliary Loss, Multi-Task Learning

Problem:

Monocular depth estimation mostly relies on monocular cues such as the scale ratio and the feature variance of objects, but these cues are too ambiguous to guide depth inference on their own. Regional guidance, such as semantic segmentation, helps resolve the ambiguity, but in many situations the semantic guidance conflicts with the depth distribution.

Solution:

We propose a new depth estimation method that uses region support as the inference guidance and design a region support network to realize the depth inference.

Results:

The region support network runs in an end-to-end manner by seamlessly integrating the two modules. With the region support, our method reaches state-of-the-art performance on the NYU dataset, outperforming the baseline by 0.1 (coarse) and 0.3 (fine) on RMSE-linear.
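As a hedged sketch of the auxiliary-loss idea named in the key words, the snippet below combines a depth term with a regional (segmentation) term; the loss choices and the weighting are illustrative assumptions, not the project's actual objective.

```python
# A multi-task objective: main depth loss plus an auxiliary regional loss.
import torch
import torch.nn.functional as F

def multitask_loss(pred_depth, gt_depth, pred_seg, gt_seg, aux_weight=0.4):
    depth_loss = F.l1_loss(pred_depth, gt_depth)   # main depth term
    seg_loss = F.cross_entropy(pred_seg, gt_seg)   # auxiliary regional guidance term
    return depth_loss + aux_weight * seg_loss

pred_depth, gt_depth = torch.rand(2, 1, 60, 80), torch.rand(2, 1, 60, 80)
pred_seg, gt_seg = torch.randn(2, 13, 60, 80), torch.randint(0, 13, (2, 60, 80))
print(multitask_loss(pred_depth, gt_depth, pred_seg, gt_seg))
```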

5. FoggyStereo: Stereo Matching with Fog Volume Representation

Key Words:

Stereo Matching, Defog Rendering, Iterative Refinement, Noise as Feature

Problem:

Stereo matching in foggy scenes is challenging, as the scattering effect of fog blurs the image and makes the matching ambiguous. Prior methods treat the fog as noise and discard it before matching.

Solution:

Different from prior work, we explore depth hints contained in the fog and use them to improve stereo matching. The hints are obtained from a rendering perspective: we reverse the atmospheric scattering process and remove the fog within a selected depth range, so the quality of the rendered image reflects the correctness of the selected depth, because the closer it is to the real depth, the clearer the rendered image becomes. We introduce a fog volume representation to collect these depth hints: the fog volume is built by stacking images rendered with the depths computed from the disparity candidates that also build the cost volume, and it is fused with the cost volume to rectify the ambiguous matching caused by fog.

Results:

Experiments show that our fog volume representation significantly improves the state of the art on foggy scenes by 10% to 30% while maintaining comparable performance in clear scenes.
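A small numerical sketch of the fog volume construction, assuming the standard atmospheric scattering model I = J*exp(-beta*d) + A*(1 - exp(-beta*d)); the scattering coefficient beta, the airlight A, and the disparity-to-depth conversion are placeholder values rather than the paper's settings.

```python
# Build a fog volume: for each disparity candidate, assume the implied depth,
# invert the scattering model to render a de-fogged image, and stack the results.
import torch

def build_fog_volume(image, disparities, focal_baseline=400.0, beta=0.05, airlight=0.8):
    # image: (B, 3, H, W) foggy input; disparities: iterable of candidate disparities
    slices = []
    for disp in disparities:
        depth = focal_baseline / disp               # depth implied by this candidate
        t = torch.exp(torch.tensor(-beta * depth))  # transmission for that depth
        defogged = (image - airlight * (1 - t)) / t # reverse the scattering model
        slices.append(defogged)
    return torch.stack(slices, dim=2)               # (B, 3, D, H, W) fog volume

image = torch.rand(1, 3, 64, 128)
fog_volume = build_fog_volume(image, disparities=range(1, 49))
print(fog_volume.shape)  # torch.Size([1, 3, 48, 64, 128])
```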

Engineering Projects

1. Imitation Learning for Relative Navigation

Key Words:

Imitation Learning, Relative Navigation, Action Prediction, Cross-modal Inference

Target:

Build a learnable controller for an autonomous driving system and realize relative navigation for cars.

Solution:

We leverage imitation learning and learn the driving behavior directly from collected data. We implement the world model, GAIL, and InfoGAIL as baselines and add a prediction module to obtain a more stable result.

Results:

Our method achieves a driving score of 970 on the Gym platform, which surpasses most reinforcement learning methods. The action prediction module has been successfully deployed on real cars, which navigate on closed roads.
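A minimal behaviour-cloning sketch in the spirit of learning driving behavior directly from collected data, with an extra head that predicts the next action; the network sizes, state representation, and loss weighting are assumptions, and the world model / GAIL / InfoGAIL baselines are not shown.

```python
# Behaviour cloning with an auxiliary next-action prediction head (toy data).
import torch
import torch.nn as nn

class DrivingPolicy(nn.Module):
    def __init__(self, state_dim=64, action_dim=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.action_head = nn.Linear(128, action_dim)   # current action (e.g. steer, throttle)
        self.predict_head = nn.Linear(128, action_dim)  # predicted next action

    def forward(self, state):
        h = self.backbone(state)
        return self.action_head(h), self.predict_head(h)

policy = DrivingPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
states = torch.randn(256, 64)                    # collected observations
actions = torch.randn(256, 2)                    # expert actions at time t
next_actions = torch.roll(actions, -1, dims=0)   # expert actions at time t+1
for _ in range(10):
    act, pred = policy(states)
    loss = nn.functional.mse_loss(act, actions) + 0.5 * nn.functional.mse_loss(pred, next_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```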

2. Single Object Tracking by Feature Disentanglement

Key Words:

Single Object Tracking, Feature Disentanglement, Capsule, GAN

Problem:

Correlation-based matching is effective for SOT, but the extracted features are fused with background information, which makes them less discriminative for matching.

Solution:

We propose to disentangle the foreground information from the extracted feature and use it for the correlation operation.

Results:

We obtain results comparable to PySOT, with an accuracy of 0.389 on OTB100 and 0.549 on VOT. Our method is more stable on long-term videos and is robust to camera vibration.
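A toy sketch of correlation with a foreground-weighted template, in the spirit of the disentanglement described above; the mask head and all sizes are illustrative, not the project's actual design.

```python
# Cross-correlation tracking where a learned foreground mask re-weights the
# template feature before it is used as the correlation kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForegroundCorrelation(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Predict a per-pixel foreground weight for the template feature.
        self.mask_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, template_feat, search_feat):
        # template_feat: (1, C, k, k), search_feat: (1, C, H, W)
        fg = self.mask_head(template_feat)     # (1, 1, k, k) foreground weights
        kernel = template_feat * fg            # down-weight background locations
        return F.conv2d(search_feat, kernel)   # (1, 1, H-k+1, W-k+1) response map

template = torch.randn(1, 64, 7, 7)
search = torch.randn(1, 64, 31, 31)
print(ForegroundCorrelation()(template, search).shape)  # torch.Size([1, 1, 25, 25])
```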

3. Multiple Object Tracking with Deformation Prediction

Key Words:

Multiple Object Tracking, Deformable Convolution

Target:

We aim to use deformable convolution to learn the relative deformation between two frames. We warp the detection results from frame T-1 to frame T with the predicted deformation and associate them with the detection results on frame T.

Solution:

We propose to use a deformable convolution network to predict the deformation between two frames and connect the detection results.

Results:

On KITTI, we obtain a much higher accuracy for vehicles, 0.752 compared with 0.419 for the baseline (without video aggregation), and 0.817 compared with 0.356 for pedestrians. With video aggregation, we only detect 60% of the bounding boxes of the baseline; although the recall drops, we can increase the number of sampled frames and still connect most of the frames.
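A small sketch of the association step: warp frame T-1 boxes to frame T and match them to frame-T detections by IoU. The per-box translation here stands in for the deformation predicted by deformable convolution, and the IoU threshold is an assumption.

```python
# Warp previous-frame boxes with a predicted shift, then associate by IoU.
import torch
from torchvision.ops import box_iou

def associate(prev_boxes, shifts, curr_boxes, iou_thresh=0.5):
    # prev_boxes: (N, 4), curr_boxes: (M, 4) in (x1, y1, x2, y2); shifts: (N, 2)
    warped = prev_boxes.clone()
    warped[:, [0, 2]] += shifts[:, 0:1]          # shift x coordinates
    warped[:, [1, 3]] += shifts[:, 1:2]          # shift y coordinates
    iou = box_iou(warped, curr_boxes)            # (N, M) overlaps
    best_iou, best_idx = iou.max(dim=1)
    return [(i, int(j)) for i, (j, v) in enumerate(zip(best_idx, best_iou)) if v > iou_thresh]

prev_boxes = torch.tensor([[10., 10., 50., 50.], [100., 80., 140., 120.]])
shifts = torch.tensor([[5., 0.], [-3., 2.]])
curr_boxes = torch.tensor([[14., 9., 56., 51.], [96., 82., 138., 123.]])
print(associate(prev_boxes, shifts, curr_boxes))  # [(0, 0), (1, 1)]
```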

4. Video Object Detection by Temporal Aggregation

Key Words:

Video Object Detection, Deformable Convolution, Temporal Aggregation

Problem:

Due to occlusion and blurring, objects are hard to detect because their features are degraded. We aim to bring supportive information from related frames to enhance the features of the current frame.

Solution:

We use a deformable convolution network to predict the deformation between neighboring frames and use it to aggregate supportive features from related frames into the current frame.

Results:

Our final model uses RepPoints as the baseline and adds a mask on the deformable convolution layer. On KITTI, our AP is 0.628, which is 12% higher than the baseline's 0.560. On Waymo, our method achieves an AP of 0.663 compared to the baseline's 0.626.
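A hedged sketch of aligning a support-frame feature to the current frame with modulated deformable convolution (torchvision's deform_conv2d, which accepts a mask) and averaging the two; all module sizes and the simple averaging are assumptions rather than the project's exact configuration.

```python
# Align the T-1 feature to frame T with masked deformable convolution, then aggregate.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class TemporalAggregation(nn.Module):
    def __init__(self, channels=64, k=3):
        super().__init__()
        self.k = k
        # Predict offsets (2*k*k channels) and modulation mask (k*k channels).
        self.offset_mask = nn.Conv2d(2 * channels, 3 * k * k, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)

    def forward(self, curr_feat, ref_feat):
        # curr_feat, ref_feat: (B, C, H, W) features from frames T and T-1
        om = self.offset_mask(torch.cat([curr_feat, ref_feat], dim=1))
        offset = om[:, : 2 * self.k * self.k]
        mask = torch.sigmoid(om[:, 2 * self.k * self.k :])
        aligned = deform_conv2d(ref_feat, offset, self.weight, padding=1, mask=mask)
        return 0.5 * (curr_feat + aligned)   # aggregate the aligned support feature

curr, ref = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(TemporalAggregation()(curr, ref).shape)  # torch.Size([1, 64, 32, 32])
```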