Geo-objects in high-resolution remote sensing images (HRSIs) have clear category attributes and rich semantic information, and with the support of artificial intelligence their spatial relationships can be recognized automatically by a computer. At present, semantic understanding of HRSIs relies mainly on image caption models that generate sentences from global features. However, such coarse-grained features easily cause an object's category attribute to be mispredicted during sentence generation. Taking the geo-object as the basic unit of semantic understanding is, in fact, more consistent with how people cognize geographic space. To obtain more accurate sentences, this study constructs an Object-based Geo-spatial Relation Image Understanding Dataset (OGRIUD) and proposes a dual-LSTM-driven semantic understanding method. The proposed dataset is object-based, and each sentence description includes the category and location of the geo-object, compensating for the absence of such information in current remote sensing semantic understanding work. The proposed method uses an object detection model to identify salient objects in the image and feeds the object features into the language model, alleviating the problem of incorrectly predicted categories in the generated descriptions. Furthermore, to exploit HRSI scene information, we fuse the global and regional features and use a dual LSTM to predict the attention distribution over each geo-object. We compare the global-feature-based approach with the object-feature-based approach proposed in this paper. Quantitative results show that the proposed method raises the exact-match accuracy from 53.5% to 62.33%.
Visual analysis shows that the spatial-relation descriptions generated by the proposed method are also richer. The method enables the language model to focus on objects with actual semantics, so the generated descriptions match the content of the remote sensing images more closely. This correspondence between visual objects and descriptions improves the interpretability of remote sensing image understanding.
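As a rough illustration of the dual-LSTM design described above, the following is a minimal PyTorch sketch. All module names, dimensions, and the specific soft-attention form are assumptions for illustration, not details taken from the paper: one LSTM cell conditions an attention distribution over per-object region features (fused with the global scene feature), and a second LSTM cell consumes the attended object feature to emit the next word.

```python
import torch
import torch.nn as nn

class DualLSTMCaptioner(nn.Module):
    """Hypothetical sketch: an attention LSTM plus a language LSTM that
    attend over per-object region features fused with a global feature."""

    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Attention LSTM input: global feature + language-LSTM state + word embedding
        self.attn_lstm = nn.LSTMCell(feat_dim + hidden_dim + embed_dim, hidden_dim)
        # Language LSTM input: attended object feature + attention-LSTM state
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.att_v = nn.Linear(feat_dim, hidden_dim)
        self.att_h = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.logits = nn.Linear(hidden_dim, vocab_size)

    def step(self, regions, global_feat, word, state):
        # regions: (B, K, feat_dim) per-geo-object features from a detector
        # global_feat: (B, feat_dim) scene-level feature
        (h1, c1), (h2, c2) = state
        x1 = torch.cat([global_feat, h2, self.embed(word)], dim=1)
        h1, c1 = self.attn_lstm(x1, (h1, c1))
        # Additive soft attention over geo-object regions, conditioned on h1
        e = self.att_out(torch.tanh(self.att_v(regions) + self.att_h(h1).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # (B, K, 1): one weight per object
        attended = (alpha * regions).sum(dim=1)  # (B, feat_dim): fused object feature
        h2, c2 = self.lang_lstm(torch.cat([attended, h1], dim=1), (h2, c2))
        return self.logits(h2), alpha.squeeze(-1), ((h1, c1), (h2, c2))
```

The returned `alpha` makes the model's focus inspectable: at each decoding step it gives an explicit attention weight per detected geo-object, which is what allows the generated words to be traced back to specific objects in the image.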