Fix: Add missing attributes to hugging face conversion functions #116
Conversation
Hi @danieldjohnson, I submitted this PR from a fork and it looks like the CI checks haven't been triggered yet.
Thanks for the PR! Left a few comments.
Also, I notice that the PR is currently not passing our CI formatting checks. Mind fixing this? If you run `uv run pyink penzai tests` it should reformat everything automatically.
(See here for the full set of checks that gets run in CI.)
Thanks! I realized something else about the design here and left a few comments.
Also, looks like somehow there's a uv versioning issue that is messing with our CI. It looks like you may have used a version of uv that is newer than the version GitHub is currently configured with. Would you mind reverting your changes to `uv.lock`? I don't think you've changed anything that requires updating the lockfile in this case.
@@ -595,7 +599,7 @@ def llamalike_from_huggingface_model(
     mlp_hidden_dim=hf_config.intermediate_size,
     num_decoder_blocks=hf_config.num_hidden_layers,
     vocab_size=hf_config.vocab_size,
-    mlp_variant="swiglu",
+    mlp_variant=hf_config.hidden_act,
Looking at this again, I realized there is a subtlety that makes this a bit complicated, and we probably need to handle it in a different way.
Specifically, in the current `llamalike_common` implementation, the `mlp_variant` is named based on the overall MLP design, e.g. "geglu" or "swiglu", and not the activation used inside that MLP, which would be "gelu" or "silu". These are related but not the same: a geglu MLP takes the product of a gelu and a linear layer, whereas a gelu MLP traditionally refers to just a normal MLP with a gelu activation and no separate linear layer.
On the other hand, in the HuggingFace transformers ecosystem, the `hidden_act` appears to refer specifically to the activation function used. Thus, when the model uses a geglu MLP, the `hidden_act` would be "gelu", and it's just assumed from context that there will be a separate linear layer.
Eventually it might make sense to try and switch conventions, but for now it seems simplest to keep the convention the same and avoid doing a bunch of refactoring, since in practice I think most of today's llama-like models use either geglu/gelu or swiglu/silu and not any other alternatives.
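To make the structural difference concrete, here is a minimal sketch in plain JAX (not Penzai's actual layer code; the function and parameter names are illustrative):

```python
import jax

def gated_mlp(x, w_gate, w_value, w_out, act=jax.nn.gelu):
  # "geglu"/"swiglu"-style MLP: the activated projection is multiplied
  # elementwise by a second, purely linear projection before the output layer.
  return (act(x @ w_gate) * (x @ w_value)) @ w_out

def plain_mlp(x, w_in, w_out, act=jax.nn.gelu):
  # Traditional "gelu"-style MLP: a single hidden projection, an activation,
  # and an output projection, with no separate gating branch.
  return act(x @ w_in) @ w_out
```

Swapping `act` between `jax.nn.gelu` and `jax.nn.silu` turns geglu into swiglu; the structural difference from a plain MLP is the extra gating projection.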
Would you mind making the following changes?
- In the type for the `mlp_variant` arg, make it `Literal["geglu_exact", "geglu_approx", "swiglu"]` (so "geglu_exact" instead of "gelu_exact", and no silu/relu),
- In `build_llamalike_feedforward`, map the "geglu_exact" MLP variant to the `functools.partial(jax.nn.gelu, approximate=False)` activation function, and don't allow anything other than "geglu_exact", "geglu_approx", "swiglu",
- In `llamalike_from_huggingface_model`, have a separate mapping from HuggingFace's `hidden_act` to the Penzai `mlp_variant`, which would be something like `{"silu": "swiglu", "gelu": "geglu_exact", "gelu_new": "geglu_approx"}[hf_config.hidden_act]`, as sketched below. (Note also that "gelu" needs to be mapped to the "geglu_exact" codepath, because "gelu" in HuggingFace refers to the non-approximate version, whereas `jax.nn.gelu` is the approximate version by default.)
Thanks, and sorry for the complexity here!
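Here is a rough sketch of what those mappings could look like (illustrative only; the helper name `mlp_variant_from_hidden_act` and the module-level dictionaries are placeholders, not Penzai's actual code):

```python
import functools
import jax

# Inside build_llamalike_feedforward: resolve the Penzai mlp_variant name to an
# activation function, rejecting anything outside the supported set.
_MLP_VARIANT_TO_ACTIVATION = {
    "geglu_exact": functools.partial(jax.nn.gelu, approximate=False),
    "geglu_approx": functools.partial(jax.nn.gelu, approximate=True),
    "swiglu": jax.nn.silu,
}

# Inside llamalike_from_huggingface_model: translate HuggingFace's hidden_act
# into the Penzai mlp_variant. "gelu" maps to "geglu_exact" because HuggingFace's
# "gelu" is the exact form, while jax.nn.gelu is approximate by default.
_HIDDEN_ACT_TO_MLP_VARIANT = {
    "silu": "swiglu",
    "gelu": "geglu_exact",
    "gelu_new": "geglu_approx",
}

def mlp_variant_from_hidden_act(hidden_act: str) -> str:
  if hidden_act not in _HIDDEN_ACT_TO_MLP_VARIANT:
    raise ValueError(f"Unsupported hidden_act: {hidden_act!r}")
  return _HIDDEN_ACT_TO_MLP_VARIANT[hidden_act]
```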
name_or_path="hf-internal-testing/tiny-random-LlamaForCausalLM",
vocab_size=11,
hidden_size=64,
intermediate_size=256,
num_hidden_layers=3,
num_attention_heads=num_attention_heads,
num_key_value_heads=num_key_value_heads,
attention_bias=False,
attention_dropout=0.0,
bos_token_id=0,
eos_token_id=1,
hidden_act="silu",
initializer_range=0.02,
max_position_embeddings=2048,
mlp_bias=False,
model_type="llama",
pad_token_id=-1,
pretraining_tp=1,
rms_norm_eps=1e-06,
rope_scaling=None,
rope_theta=10000.0,
tie_word_embeddings=False,
torch_dtype="float32",
transformers_version="4.44.2",
use_cache=True,
nit: mind adding a comment that says where these config settings came from? e.g. `# This config is based on pretrained model "..."`
(Is it the config for "hf-internal-testing/tiny-random-LlamaForCausalLM"? I'm curious whether that's representative of the configs people actually use; I wonder if we could take the config args from e.g. `meta-llama/Llama-3.1-8B` instead, but with a smaller number of layers / hidden size.)
Same comment also applies to the other modified tests below.
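For reference, the settings quoted above look like keyword arguments for a HuggingFace `LlamaConfig`. A minimal, hypothetical sketch of how such a tiny config and model might be constructed in a test (the real test code may differ; the head counts below are placeholders):

```python
import transformers

# Hypothetical reconstruction of a tiny Llama config for a consistency test;
# values mirror the excerpt above, with placeholder head counts.
hf_config = transformers.LlamaConfig(
    name_or_path="hf-internal-testing/tiny-random-LlamaForCausalLM",
    vocab_size=11,
    hidden_size=64,
    intermediate_size=256,
    num_hidden_layers=3,
    num_attention_heads=4,   # placeholder; the test parameterizes this
    num_key_value_heads=2,   # placeholder; the test parameterizes this
    hidden_act="silu",
    rms_norm_eps=1e-06,
    rope_theta=10000.0,
    tie_word_embeddings=False,
)
hf_model = transformers.LlamaForCausalLM(hf_config)
```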
Changes

This PR adds missing attributes to the lists of handled/ignored configuration attributes in the model conversion functions for:

- `llama_from_huggingface_model`
- `mistral_from_huggingface_model`
- `gpt_neox_from_huggingface_model`

Problem
When converting HuggingFace models to Penzai models, the conversion would fail if the model configuration contained certain non-critical attributes like `pad_token_id` and `_name_or_path`. These attributes are present in many HuggingFace models (including test models) but don't affect the model's functionality.

Solution
Added attributes to the `handled_or_ignored_attributes` set in the llama/mistral/gpt_neox model conversion functions, allowing the conversion to proceed while ignoring these non-critical configuration values.
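As a rough illustration of the pattern (the set name comes from the conversion code, but the check shown here is a simplified stand-in for what the real functions do):

```python
# Simplified illustration of the handled-or-ignored check; the real conversion
# functions translate most of these attributes into Penzai config fields.
handled_or_ignored_attributes = {
    "vocab_size",
    "hidden_size",
    "intermediate_size",
    "num_hidden_layers",
    # ... other existing entries ...
    # Newly ignored non-critical attributes added by this PR:
    "pad_token_id",
    "_name_or_path",
}

unexpected = set(hf_config.to_dict()) - handled_or_ignored_attributes
if unexpected:
  raise ValueError(
      f"Conversion does not understand config attributes: {sorted(unexpected)}"
  )
```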
Testing

The fix has been tested with:

- `transformer_consistency_test.py`, with new tests based on tiny HuggingFace test configurations (`hf-internal-testing/tiny-random-[Llama/Mistral/GPTNeoX]ForCausalLM`). These new tests fail without the updates in this PR.

Related Issue
Fixes #115