TY - GEN
T1 - Sentence Boundary Detection of Financial Data with Domain Knowledge Enhancement and Cross-lingual Training
AU - Wan, Mingyu
AU - Xiang, Rong
AU - Chersoni, Emmanuele
AU - Klyueva, Natalia
AU - Ahrens, Kathleen
AU - Miao, Bin
AU - Broadstock, David Clive
AU - Kang, Jian
AU - Yung, Hing Wah
AU - Huang, Chu-ren
PY - 2019/8
Y1 - 2019/8
N2 - Sentence Boundary Detection (SBD) is a basic requirement in Natural Language Processing and remains a challenge in language processing for specific purposes, especially with noisy source documents. In this paper, we deal with the processing of scanned financial prospectuses with a feature-oriented and knowledge-enriched approach. Feature engineering and knowledge enrichment are conducted with the participation of domain experts for the detection of sentence boundaries in both English and French. Two versions of the detection system are implemented, one with a Random Forest Classifier and one with a Neural Network. We engineer a fused feature set of punctuation, digit, capitalization, acronym, letter and POS tag features for model fitting. For knowledge enhancement, we implement a rule-based validation by extracting a keyword dictionary from the out-of-vocabulary sequences in FinSBD’s datasets. Bilingual training on both the English and French training sets is conducted to ensure the multilingual robustness of the system and to extend the relatively small training data. Without using any extra data, our system achieves fair results on both tracks in the shared task. Our results (English: F1-Mean = 0.835; French: F1-Mean = 0.86), as well as a quick post-task improvement with self-adaptive knowledge enhancement based on testing data, demonstrate the effectiveness and robustness of bilingual training with multi-feature mining and knowledge enhancement for domain-specific SBD tasks.
M3 - Conference article published in proceeding or book
BT - Proceedings of The First Workshop on Financial Technology and Natural Language Processing: The FinSBD Shared Task
ER -