Abstract
This study investigates the impact of continued pre-training of transformer-based deep learning models on historical corpora, focusing on BERT, RoBERTa, XLNet, and GPT-2. By extracting word representations from different layers, we compute gender bias embedding scores and analyze their correlation with human bias scores and real-world occupation participation differences. Our results show that BERT, an encoder-only model, achieves the most substantial improvement in capturing human-like lexical semantics and world knowledge, outperforming traditional static word vectors such as Word2Vec. Continued pre-training on historical data significantly enhances BERT's performance, especially in the lower-middle layers. When historical human biases are difficult to quantify due to data scarcity, continued pre-training BERT on historical corpora and averaging lexical representations up to the 6th layer provides an accurate reflection of gender-related historical biases and world knowledge.
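A minimal sketch of the layer-averaging idea described above: a word's representation is taken as the mean of BERT's hidden states up to the 6th layer, and a gender bias score is computed from cosine similarities to male- and female-associated attribute words. The model name, attribute word lists, and the exact bias formula are illustrative assumptions, not the paper's reported implementation.

```python
# Illustrative sketch (assumed setup, not the authors' exact pipeline):
# average BERT hidden states over layers 1-6 and score gender bias via
# the difference in mean cosine similarity to male vs. female word sets.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def word_vector(word: str, max_layer: int = 6) -> torch.Tensor:
    """Average a word's subword embeddings over hidden-state layers 1..max_layer."""
    inputs = tokenizer(word, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[0] is the embedding layer; 1..max_layer are transformer layers
    layers = torch.stack(outputs.hidden_states[1 : max_layer + 1])  # (L, 1, T, H)
    return layers.mean(dim=(0, 2)).squeeze(0)                       # average layers and subwords

def bias_score(target: str, male_words: list[str], female_words: list[str]) -> float:
    """Mean cosine similarity to male words minus mean similarity to female words."""
    t = word_vector(target)
    cos = torch.nn.functional.cosine_similarity
    male = torch.stack([cos(t, word_vector(w), dim=0) for w in male_words]).mean()
    female = torch.stack([cos(t, word_vector(w), dim=0) for w in female_words]).mean()
    return (male - female).item()

print(bias_score("nurse", ["he", "man", "father"], ["she", "woman", "mother"]))
```

A score above zero indicates the target word sits closer to the male attribute set in embedding space; such per-word scores can then be correlated with human bias ratings or occupation participation statistics, as the abstract describes.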
Original language | English |
---|---|
Title of host publication | Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation |
Editors | Nathaniel Oco, Shirley N. Dita, Ariane Macalinga Borlongan, Jong-Bok Kim |
Publisher | Tokyo University of Foreign Studies |
Pages | 1316-1331 |
Publication status | Published - Dec 2024 |
Event | The 38th Pacific Asia Conference on Language, Information and Computation [PACLIC-38] - Tokyo University of Foreign Studies, Tokyo, Japan. Duration: 7 Dec 2024 → 9 Dec 2024 |
Conference
Conference | The 38th Pacific Asia Conference on Language, Information and Computation [PACLIC-38] |
---|---|
Country/Territory | Japan |
City | Tokyo |
Period | 7/12/24 → 9/12/24 |