Activation function
#56
by aboros98 - opened
Hello!
Should the model use exact GELU or approximate (tanh) GELU as the gating function in the MLP?
I'm asking because the PyTorch and HF implementations use exact GELU, while Keras and JAX use the tanh approximation.
Thanks!
Hi @aboros98
It should be the approximate (tanh) GELU, I think, see: https://twitter.com/danielhanchen/status/1763613620909580505
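For reference, the two variants differ only slightly in practice. A minimal sketch of both formulas (hypothetical helper names, plain Python rather than any framework's API) to compare them numerically:

```python
import math

def gelu_exact(x):
    # exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  "
          f"tanh={gelu_tanh(x):+.6f}  diff={abs(gelu_exact(x) - gelu_tanh(x)):.2e}")
```

The outputs agree to roughly 1e-3 or better, so both will usually run, but checkpoints trained with one variant can show small logit differences when evaluated with the other.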