Missing SWA implementation?

#3
by hell0ks - opened

Hello,

I'm currently implementing Trillion architecture support for llama.cpp.

However, during testing I found that the model is unstable at long context. Through trial and error, it looks like the model was trained with sliding-window attention (SWA) at a window size of 4096, as the model card says, but the corresponding implementation is missing from the transformers modeling code.
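For reference, here is a minimal sketch of the kind of mask I would expect the modeling code to apply when SWA is enabled. The window size 4096 comes from the model card; the function name and shape convention are my own for illustration, not taken from the Trillion code:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    # Position i may attend to position j only if j <= i (causal)
    # and i - j < window (sliding window). window=4096 per the model card.
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]            # j <= i
    in_window = (idx[:, None] - idx[None, :]) < window  # i - j < window
    return causal & in_window  # True = attend, False = mask out

# e.g. mask = sliding_window_causal_mask(8, window=4)
# row i has at most 4 True entries, ending at column i
```

Without a constraint like this, every token attends to the full causal prefix, which would explain the instability once the context exceeds the trained window.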

Can you confirm this is correct? Thanks.

I see you are working on this too, and that SWA support is indeed required. Closing.

hell0ks changed discussion status to closed
