TY - GEN
T1 - Linear-time self attention with codeword histogram for efficient recommendation
AU - Wu, Yongji
AU - Lian, Defu
AU - Gong, Neil Zhenqiang
AU - Yin, Lu
AU - Yin, Mingyang
AU - Zhou, Jingren
AU - Yang, Hongxia
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/6/3
Y1 - 2021/6/3
N2 - Self-attention has become increasingly popular in a variety of sequence modeling tasks, from natural language processing to recommendation, due to its effectiveness. However, self-attention suffers from quadratic computational and memory complexities, prohibiting its application to long sequences. Existing approaches that address this issue mainly rely on a sparse attention context, using either a local window or a permuted bucket obtained by locality-sensitive hashing (LSH) or sorting, at the risk of losing crucial information. Inspired by the idea of vector quantization, which uses cluster centroids to approximate items, we propose LISA (LInear-time Self Attention), which enjoys both the effectiveness of vanilla self-attention and the efficiency of sparse attention. LISA scales linearly with the sequence length while enabling full contextual attention via computing differentiable histograms of codeword distributions. Meanwhile, unlike some efficient attention methods, our method poses no restriction on causal masking or sequence length. We evaluate our method on four real-world datasets for sequential recommendation. The results show that LISA outperforms state-of-the-art efficient attention methods in both performance and speed, and it is up to 57x faster and 78x more memory efficient than vanilla self-attention.
AB - Self-attention has become increasingly popular in a variety of sequence modeling tasks, from natural language processing to recommendation, due to its effectiveness. However, self-attention suffers from quadratic computational and memory complexities, prohibiting its application to long sequences. Existing approaches that address this issue mainly rely on a sparse attention context, using either a local window or a permuted bucket obtained by locality-sensitive hashing (LSH) or sorting, at the risk of losing crucial information. Inspired by the idea of vector quantization, which uses cluster centroids to approximate items, we propose LISA (LInear-time Self Attention), which enjoys both the effectiveness of vanilla self-attention and the efficiency of sparse attention. LISA scales linearly with the sequence length while enabling full contextual attention via computing differentiable histograms of codeword distributions. Meanwhile, unlike some efficient attention methods, our method poses no restriction on causal masking or sequence length. We evaluate our method on four real-world datasets for sequential recommendation. The results show that LISA outperforms state-of-the-art efficient attention methods in both performance and speed, and it is up to 57x faster and 78x more memory efficient than vanilla self-attention.
KW - Efficient attention
KW - Quantization
KW - Self-attention
KW - Sequential recommendation
UR - https://www.scopus.com/pages/publications/85107995132
U2 - 10.1145/3442381.3449946
DO - 10.1145/3442381.3449946
M3 - Conference article published in proceeding or book
AN - SCOPUS:85107995132
T3 - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
SP - 1262
EP - 1273
BT - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
PB - Association for Computing Machinery, Inc
T2 - 30th World Wide Web Conference, WWW 2021
Y2 - 19 April 2021 through 23 April 2021
ER -