Photo-sharing social media sites provide new ways for users to share their experiences and interests on the Web, aggregating large amounts of multimedia resources associated with a wide variety of real-world events of different types and scales. In this work, we tackle social event detection in such large image collections by devising a semi-supervised multimodal clustering algorithm, denoted SSMC, which exploits label signals to guide the fusion of multimodal features. In particular, SSMC represents each image by the distribution of its similarities to a small amount of labeled data, fusing multiple heterogeneous features seamlessly. As a result, SSMC has low computational complexity when processing multimodal features in both the initial and updating stages. Experiments conducted on the MediaEval Social Event Detection challenge show that our approach outperforms the baseline algorithms.
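The core idea described above, representing each image by its similarities to a small labeled subset so that heterogeneous modalities map into a common space, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the cosine similarity choice, and the concatenation-based fusion are all assumptions made for the sake of the example.

```python
import numpy as np

def cosine_sim(A, B):
    # Pairwise cosine similarity between rows of A and rows of B.
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    B = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-12)
    return A @ B.T

def similarity_representation(modalities, labeled_idx):
    """Represent each image by its similarities to a small labeled subset,
    computed per modality and concatenated (illustrative fusion scheme)."""
    reps = []
    for X in modalities:              # X: (n_images, d_m) features of one modality
        anchors = X[labeled_idx]      # labeled images act as reference anchors
        reps.append(cosine_sim(X, anchors))
    return np.hstack(reps)            # fused: (n_images, n_labeled * n_modalities)

# Toy data: two heterogeneous modalities with different dimensionalities.
rng = np.random.default_rng(0)
visual = rng.normal(size=(100, 64))   # hypothetical visual features
textual = rng.normal(size=(100, 32))  # hypothetical textual (e.g. tag) features
labeled_idx = np.arange(10)           # small labeled subset
Z = similarity_representation([visual, textual], labeled_idx)
print(Z.shape)                        # → (100, 20)
```

Because the fused representation has one column per labeled example rather than per raw feature dimension, its size depends only on the (small) labeled set, which is consistent with the low computational cost claimed for both the initial and updating stages.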