Take input attention masks to support left-padded sequences

#1
by hiyouga - opened

The previous implementation does not accept attention masks as input, which leads to unexpected behaviour during batched inference (where left-padding is commonly used). I therefore reimplemented the ALiBi encodings to take the attention mask from the user inputs. Note that this implementation largely depends on [1].

[1] https://github.com/huggingface/transformers/blob/main/src/transformers/models/bloom/modeling_bloom.py

Of course, the above implementation requires recomputing the ALiBi tensors at every inference step; cached tensors cannot be reused once the input attention masks are taken into account. As a result, inference efficiency will be slightly worse than in the original version.
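For reference, below is a minimal sketch of the BLOOM-style construction this PR follows, where the ALiBi bias is derived from the attention mask itself so that left-padding tokens do not shift the positions of real tokens. The helper name `build_alibi_tensor` mirrors the function in [1], but this version is a simplified assumption (it skips the slope interpolation BLOOM uses when the head count is not a power of two), not the exact code in this PR.

```python
import math
import torch

def build_alibi_tensor(attention_mask: torch.Tensor, num_heads: int, dtype: torch.dtype) -> torch.Tensor:
    # Simplified sketch assuming num_heads is a power of two; the BLOOM code
    # additionally extends the slopes when it is not.
    batch_size, seq_length = attention_mask.shape
    base = 2 ** (-(2 ** -(math.log2(num_heads) - 3)))
    slopes = torch.pow(
        torch.tensor(base, device=attention_mask.device, dtype=torch.float32),
        torch.arange(1, 1 + num_heads, device=attention_mask.device, dtype=torch.float32),
    )
    # Positions computed from the mask: left-padding tokens stay at 0 and the
    # first real token of every sequence starts at position 0.
    position_ids = (attention_mask.cumsum(dim=-1) - 1).clamp(min=0) * attention_mask
    alibi = slopes[None, :, None] * position_ids[:, None, :].to(torch.float32)
    # Shape (batch_size * num_heads, 1, seq_length): a per-head bias added to
    # the attention scores before the softmax.
    return alibi.reshape(batch_size * num_heads, 1, seq_length).to(dtype)
```

Because the bias depends on `attention_mask`, it has to be rebuilt for every forward pass, which is exactly the caching trade-off described above.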

Baichuan Intelligent Technology org

Could ALiBi be fused with the expanded mask so that the causal mask no longer needs to be handled separately? Since the ALiBi mask, like the causal mask, is lower-triangular?
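One way this fusion could look (not part of the PR, just an illustrative sketch with the hypothetical helper `fuse_alibi_with_mask`) is to fold the ALiBi bias, the causal constraint, and the padding constraint into a single additive bias of shape (batch, heads, query, key) that is added to the raw attention scores, so no separate causal mask is applied afterwards:

```python
import torch

def fuse_alibi_with_mask(alibi: torch.Tensor, attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # alibi:          (batch * num_heads, 1, seq_length) additive ALiBi bias
    # attention_mask: (batch, seq_length), 1 for real tokens, 0 for padding
    # dtype is assumed to be a floating-point type (for torch.finfo).
    batch_size, seq_length = attention_mask.shape
    num_heads = alibi.shape[0] // batch_size
    min_value = torch.finfo(dtype).min

    # Causal part: a query may not attend to future key positions.
    causal = torch.tril(
        torch.ones(seq_length, seq_length, dtype=torch.bool, device=attention_mask.device)
    )
    # Padding part: no query may attend to padded key positions.
    padding = attention_mask[:, None, None, :].bool()
    allowed = causal[None, None, :, :] & padding  # (batch, 1, seq, seq)

    # Broadcast the ALiBi bias over query positions, then overwrite the
    # forbidden entries with a large negative value.
    bias = (
        alibi.view(batch_size, num_heads, 1, seq_length)
        .to(dtype)
        .expand(batch_size, num_heads, seq_length, seq_length)
        .clone()
    )
    bias = bias.masked_fill(~allowed, min_value)
    return bias  # added directly to the attention scores before softmax
```

Whether this is preferable in practice depends on memory: the fused bias is a full (batch, heads, query, key) tensor, whereas the separate ALiBi tensor above stays at (batch * heads, 1, key).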

hiyouga changed pull request status to closed
