In this paper, we propose to learn a Deep Semantic Space (DSS) for cross-modal event retrieval by jointly exploiting deep learning models to extract semantic features from images and textual articles. More specifically, a VGG network is used to transfer deep semantic knowledge from a large-scale image dataset to the target image dataset. Simultaneously, a fully-connected network is designed to model semantic representations from textual features (e.g., TF-IDF, LDA). The obtained deep semantic representations of images and text are then mapped into a common high-level semantic space, in which the distance between data samples can be measured directly for cross-modal event retrieval. In addition, we collect a dataset, called the Wiki-Flickr event dataset, for cross-modal event retrieval, in which the data are only weakly aligned, unlike the image-text pairs in existing cross-modal retrieval datasets. Extensive experiments on both the Pascal Sentence dataset and our Wiki-Flickr event dataset show that DSS outperforms state-of-the-art approaches.
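The retrieval pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the softmax projection, and the randomly initialized weights (which would in practice be the trained final fully-connected layers of each branch) are all assumptions made for the example. It shows only the core idea of projecting VGG image features and TF-IDF/LDA text features into one shared semantic space and ranking by distance there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's exact settings):
# 4096-d VGG fc7 image features, 600-d concatenated TF-IDF + LDA text
# features, and a 10-d shared semantic space (e.g. one dim per category).
IMG_DIM, TXT_DIM, SEM_DIM = 4096, 600, 10

# Placeholder projection weights; in a trained model these would be the
# last fully-connected layers of the image and text branches.
W_img = rng.normal(scale=0.01, size=(IMG_DIM, SEM_DIM))
W_txt = rng.normal(scale=0.01, size=(TXT_DIM, SEM_DIM))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def embed_image(x):
    """Project a VGG image feature into the shared semantic space."""
    return softmax(x @ W_img)

def embed_text(t):
    """Project a TF-IDF/LDA text feature into the same semantic space."""
    return softmax(t @ W_txt)

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Cross-modal retrieval: rank text articles by distance to an image query.
img_query = rng.normal(size=IMG_DIM)
articles = rng.normal(size=(5, TXT_DIM))          # dummy text features

q = embed_image(img_query)
dists = [cosine_distance(q, embed_text(t)) for t in articles]
ranking = np.argsort(dists)                       # closest article first
print(ranking.tolist())
```

Because both modalities land in the same space, the same distance function serves image-to-text and text-to-image queries; only the choice of embedding function per side changes.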