Early prognosis of radiotherapy-related esophageal fistula is of great significance for personalized stratification and optimal treatment planning in esophageal cancer (EC) patients. Effectively fusing multi-level radiographic visual descriptors under the guidance of diagnostic considerations remains a challenging task. We propose an end-to-end, clinical-knowledge-enhanced model for multi-level cross-channel feature extraction and aggregation. First, clinical attention is represented by the contextual CT, the segmented tumor, and the anatomical surroundings, each taken from nine plane views. Then, for each view, we propose a Cross-Channel-Atten Network comprising CNN blocks for multi-level feature extraction, a cross-channel convolution module that embeds multi-domain clinical knowledge at the same feature level, and an attention mechanism for the final adaptive fusion of multi-level cross-domain radiographic features. Experimental results and an ablation study on 558 EC patients show that our model outperforms the compared methods, including those with or without multi-view input, multi-domain knowledge, and multi-level attentional features. Visual analysis of the attention maps shows that the network learns to focus on the tumor and organs of interest, including the esophagus, trachea, and mediastinal connective tissues.
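
To make the final fusion step concrete, the following minimal sketch shows one generic way to adaptively fuse per-level feature vectors with softmax attention weights. It is an illustration under stated assumptions, not the model described above: the abstract does not specify the form of the attention mechanism, and the names `attention_fuse` and `scoring_vector` are hypothetical. In a trained network the scoring vector would be learned; here it is passed in by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(level_features, scoring_vector):
    """Fuse equal-length feature vectors (one per feature level).

    Assumption: each level is scored by a dot product with a scoring
    vector, and the fused feature is the softmax-weighted sum of levels.
    This is a generic attention scheme, not the paper's exact module.
    """
    # One scalar relevance score per level.
    scores = [
        sum(f * w for f, w in zip(feat, scoring_vector))
        for feat in level_features
    ]
    # Normalize scores into attention weights that sum to 1.
    weights = softmax(scores)
    dim = len(level_features[0])
    # Weighted sum across levels, computed per feature dimension.
    fused = [
        sum(wt * feat[i] for wt, feat in zip(weights, level_features))
        for i in range(dim)
    ]
    return fused, weights


# Toy example: three levels with two-dimensional features.
levels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = attention_fuse(levels, scoring_vector=[0.0, 0.0])
```

With a zero scoring vector every level receives the same weight, so the fused vector reduces to the plain average of the levels; a non-zero scoring vector shifts weight toward the levels it scores highest.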