Replies: 1 comment
-
Agreed. We have ongoing discussions within the team about this, but it may take some time for us to figure out a better structure. You can check the discussion in the closed PR.
-
Hello World!
Upfront disclaimer: I'm no LLM researcher.
Looking at the additional heads, I'm wondering whether the model could benefit from a residual connection from head N to head N+1. Since token N+1 strongly depends on token N, I'd expect accuracy to improve, especially as the number of heads increases.
In its easiest form:
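A minimal sketch of one way to read this idea, not the original snippet. It assumes each extra head is a small residual block applied to the base model's last hidden state `h`, and that the "residual connection" simply adds head N's features to head N+1's features. The names `res_block`, `independent_heads`, and `chained_heads` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def silu(x):
    """Elementwise SiLU activation: v * sigmoid(v)."""
    return [v / (1.0 + math.exp(-v)) for v in x]


def res_block(W, x):
    """Assumed head body: x + SiLU(W x)."""
    return [a + b for a, b in zip(x, silu(matvec(W, x)))]


def independent_heads(h, weights):
    """Baseline: every head sees only the shared hidden state h."""
    return [res_block(W, h) for W in weights]


def chained_heads(h, weights):
    """Proposed variant: head N+1 also receives head N's features.

    The first head is unchanged; each later head adds the previous
    head's output to its own (a skip connection between heads).
    """
    feats = []
    prev = None
    for W in weights:
        feat = res_block(W, h)
        if prev is not None:
            feat = [a + b for a, b in zip(feat, prev)]
        feats.append(feat)
        prev = feat
    return feats
```

In a real model each head's features would then go through a vocabulary projection to get logits. Whether the skip should be a plain sum, a learned gate, or an input to the next head's block rather than an addition to its output is an open design choice; the sum is just the simplest variant to try.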