In recent years, many effective models have been proposed for the joint classification of hyperspectral and LiDAR data, most of which are built on supervised learning methods such as convolutional neural networks. Their classification performance depends heavily on the quantity and quality of training samples. However, as the distribution of ground objects grows more complex and the resolution of remote sensing images increases, it becomes difficult to obtain high-quality labels with limited cost and manpower. To address this, many researchers have sought to learn features directly from unlabeled samples; for instance, autoencoders have been applied to multimodal joint classification with satisfactory results. Although such reconstruction-based methods reduce the dependence on labeled information to a certain extent, several problems remain. In particular, they focus on reconstructing the data and cannot guarantee that the extracted features are sufficiently discriminative, which limits joint classification performance. To address this issue, this paper proposes an effective model, Joint Classification of Hyperspectral and LiDAR Data Based on Inter-Modality Match Learning. Unlike reconstruction-based feature extraction models, the proposed model compares the matching relationship between samples from different modalities, thereby enhancing the discriminative ability of the learned features. Specifically, it consists of an inter-modality matching learning network and a multimodal joint classification network. The former learns to identify whether an input patch pair of hyperspectral image and LiDAR data matches, so reasonable construction of matching labels is essential. To this end, the spatial positions of the center pixels of the cropped patches and the KMeans clustering method are employed.
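As a rough illustration only (not the authors' implementation), the matching-label construction described above can be sketched as follows. The function names, the minimal NumPy KMeans, and the pairing strategy (same center pixel → matched; centers in different spectral clusters → mismatched) are assumptions made for this sketch:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal KMeans: returns a cluster index for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def build_match_pairs(hsi, coords, k=3, seed=0):
    """Build self-supervised matching labels for HSI/LiDAR patch pairs.

    A pair (i, j) means: HSI patch centered at coords[i] paired with the
    LiDAR patch centered at coords[j]. Label 1 when i == j (same spatial
    position -> matched); label 0 when the two center pixels fall into
    different spectral clusters (-> mismatched). No manual labels are used.
    """
    spectra = np.stack([hsi[r, c] for r, c in coords])
    clusters = kmeans(spectra, k, seed=seed)
    rng = np.random.default_rng(seed)
    pairs, labels = [], []
    for i in range(len(coords)):
        pairs.append((i, i))                            # same center: match
        diff = np.flatnonzero(clusters != clusters[i])  # other cluster: mismatch
        if len(diff):
            pairs.append((i, int(rng.choice(diff))))
            labels.append(1)
            labels.append(0)
        else:
            labels.append(1)
    return pairs, labels
```

The resulting (pair, label) tuples index into `coords` for cropping both modalities, providing supervision for the matching network without any manual annotation.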
The constructed labels and the patch pairs are then combined to train the network. Notably, this process uses no manually labeled information and can extract features directly from abundant unlabeled samples. Furthermore, in the joint classification stage, the structure and trained parameters of the matching learning network are transferred, and a small number of manually labeled training samples are used to fine-tune the model parameters. Finally, extensive experiments were conducted on two widely used datasets, Houston and MUUFL, to verify the effectiveness of the proposed model, including comparisons with several state-of-the-art models, hyper-parameter analyses, and ablation studies. In the comparison experiments, against models such as CNN, EMFNet, AE_H, AE_HL, CAE_H, CAE_HL, IP-CNN, and PToP CNN, the proposed model achieves higher performance on both datasets, with OAs of 88.39% and 81.46%, respectively. In summary, our model reduces the dependence on manually labeled data and improves joint classification accuracy when training samples are limited. In the future, better model structures and more test datasets will be explored for further improvement.
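The transfer-and-fine-tune stage can be illustrated schematically. In this sketch (again an assumption, not the paper's network), a fixed feature extractor stands in for the transferred matching-network encoder, and a softmax classification head is fine-tuned on a small set of labeled samples:

```python
import numpy as np

def finetune_head(features, labels, n_classes, lr=0.1, epochs=200, seed=0):
    """Fine-tune a softmax head on features from a frozen, transferred encoder.

    `features` stands in for the output of the pretrained matching-network
    encoder; only the head (W, b) is trained here, with plain gradient
    descent on the cross-entropy loss.
    """
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((features.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]          # one-hot targets
    for _ in range(epochs):
        logits = features @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)  # softmax probabilities
        grad = (p - Y) / len(features)     # d(mean CE)/d(logits)
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

In practice the encoder would also be updated with a small learning rate during fine-tuning; freezing it here keeps the sketch short while showing how few labeled samples suffice once the representation is transferred.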