Missing SWA implementation?
#3 opened by hell0ks
Hello,
I'm currently adding support for the Trillion architecture to llama.cpp.
However, during testing I found that the model is unstable at long context. Through trial and error, it looks like the model was trained with SWA (sliding window attention) at a window size of 4096, as the model card says, but the corresponding implementation is missing from the transformers modeling code.
Can you confirm this is correct? Thanks.
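For reference, this is a minimal sketch of the kind of sliding-window causal mask I would expect, assuming the 4096-token window stated on the model card; `sliding_window_causal_mask` is an illustrative helper name, not taken from the actual modeling code:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    """Boolean mask: True where query position i may attend to key position j."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    # Causal: cannot attend to future tokens; windowed: only the last `window` tokens.
    return (j <= i) & (i - j < window)
```

Without the `(i - j < window)` term this reduces to a plain full causal mask, which would be consistent with the instability I'm seeing once the context grows past 4096 tokens.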
hell0ks changed discussion status to closed