Summarizing a document is a complex task that requires a person to multitask between reading and writing. Because a person's cognitive load during reading or writing is known to depend on the level of comprehension or the difficulty of the article, it should be possible to gain insight into the difficulty level by analyzing the user's cognitive process during the task, as evidenced by their eye gaze and typing features. In this paper, we divide the summary writing process into phases and extract gaze and typing features from each phase according to the characteristics of eye-gaze behavior and typing dynamics. Combining these multimodal features, we build a classifier that achieves an accuracy of 91.0% for difficulty level detection, an improvement of around 55% over the baseline and at least 15% over models built on a single modality. We also investigate possible reasons for the superior performance of our multimodal features.