
Fix: Add missing attributes to hugging face conversion functions #116


Open · wants to merge 6 commits into main

Conversation

@ealt commented Apr 22, 2025

Changes

This PR adds missing attributes to the lists of handled/ignored configuration attributes in the model conversion functions for:

  • Llama models (llama_from_huggingface_model)
  • Mistral models (mistral_from_huggingface_model)
  • GPT-NeoX models (gpt_neox_from_huggingface_model)

Problem

When converting HuggingFace models to Penzai, the conversion would fail if the model configuration contained certain non-critical attributes such as pad_token_id and _name_or_path. These attributes are present in many HuggingFace models (including the test models) but do not affect the model's functionality.

Solution

Added attributes to the handled_or_ignored_attributes set in llama/mistral/gpt_neox model conversion functions, allowing the conversion to proceed while ignoring these non-critical configuration values.
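For illustration, a minimal sketch of the change (the surrounding Penzai conversion code is paraphrased; only the attribute names come from this PR):

# Attributes that the conversion either maps explicitly or can safely ignore.
handled_or_ignored_attributes = {
    # ... attributes already handled by the conversion ...
    # Non-critical attributes added in this PR:
    "pad_token_id",
    "_name_or_path",
}
# Config attributes outside this set still cause the conversion to fail,
# so genuinely unexpected settings are not silently dropped.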

Testing

The fix has been tested with:

  • Existing test cases in transformer_consistency_test.py
  • Newly implemented test cases that load tiny models from huggingface (hf-internal-testing/tiny-random-[Llama/Mistral/GPTNeoX]ForCausalLM). These new tests fail without the updates in this PR; a sketch of the setup follows below.
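As a hedged sketch, the new tests boil down to loading a tiny pretrained model and converting it (import paths here are assumptions and may differ from the actual test code in transformer_consistency_test.py):

import transformers
# Import path assumed for illustration; the real test may import differently.
from penzai.models.transformer.variants import llama

hf_model = transformers.LlamaForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-LlamaForCausalLM"
)
# Before this PR, conversion raised because the config carries attributes
# such as pad_token_id and _name_or_path that were not in the
# handled/ignored set.
pz_model = llama.llama_from_huggingface_model(hf_model)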

Related Issue

Fixes #115

@ealt (Author) commented Apr 23, 2025

Hi @danieldjohnson, I submitted this PR from a fork and it looks like the CI checks haven’t been triggered yet.
Could you please approve or manually run the unittest-job workflow so the checks can proceed?
Thanks!

@danieldjohnson (Collaborator) left a comment


Thanks for the PR! Left a few comments.

Also, I notice that the PR is currently not passing our CI formatting checks. Mind fixing this? If you run uv run pyink penzai tests it should reformat everything automatically.

(See here for the full set of checks that gets run in CI.)

@danieldjohnson (Collaborator) left a comment


Thanks! I realized something else about the design here and left a few comments.

Also, looks like somehow there's a uv versioning issue that is messing with our CI. It looks like you may have used a version of uv that is newer than the version GitHub is currently configured with. Would you mind reverting your changes to uv.lock? I don't think you've changed anything that requires updating the lockfile in this case.

@@ -595,7 +599,7 @@ def llamalike_from_huggingface_model(
     mlp_hidden_dim=hf_config.intermediate_size,
     num_decoder_blocks=hf_config.num_hidden_layers,
     vocab_size=hf_config.vocab_size,
-    mlp_variant="swiglu",
+    mlp_variant=hf_config.hidden_act,
@danieldjohnson (Collaborator) commented:

Looking at this again, I realized there is a subtlety that makes this a bit complicated, and we probably need to handle it in a different way.

Specifically, in the current llamalike_common implementation, the mlp_variant is named based on the overall MLP design, e.g. "geglu" or "swiglu", and not the activation used inside that MLP, which would be "gelu" or "silu". These are related but not the same: a geglu MLP takes the product of a gelu and a linear layer, whereas a gelu MLP traditionally refers to just a normal MLP with a gelu activation and no separate linear layer.

On the other hand, in the HuggingFace transformers ecosystem, the hidden_act appears to refer specifically to the activation function used. Thus, when the model uses a geglu MLP, the hidden_act would be "gelu", and it's just assumed from context that there will be a separate linear layer.

Eventually it might make sense to try and switch conventions, but for now it seems simplest to keep the convention the same and avoid doing a bunch of refactoring, since in practice I think most of today's llama-like models use either geglu/gelu or swiglu/silu and not any other alternatives.

Would you mind making the following changes?

  • In the type for the mlp_variant arg, make it Literal["geglu_exact", "geglu_approx", "swiglu"] (so "geglu_exact" instead of "gelu_exact", and no silu/relu),
  • In build_llamalike_feedforward, map the "geglu_exact" MLP variant to the functools.partial(jax.nn.gelu, approximate=False) activation function, and don't allow anything other than "geglu_exact", "geglu_approx", "swiglu",
  • In llamalike_from_huggingface_model, have a separate mapping from HuggingFace's hidden_act to the Penzai mlp_variant, which would be something like
    {"silu": "swiglu", "gelu": "geglu_exact", "gelu_new": "geglu_approx"}[hf_config.hidden_act]
    
    (Note also that "gelu" needs to be mapped to the "geglu_exact" codepath because "gelu" in HuggingFace refers to the non-approximate version, whereas jax.nn.gelu is the approximate version by default.)

Thanks, and sorry for the complexity here!
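For concreteness, a rough sketch of the requested mappings (names are taken from this thread; the surrounding Penzai code is assumed rather than quoted, and the helper function name is hypothetical):

import functools
from typing import Literal

import jax

# Requested narrower type for the mlp_variant argument.
MLPVariant = Literal["geglu_exact", "geglu_approx", "swiglu"]

# In build_llamalike_feedforward: pick the gating activation from the MLP
# variant. HuggingFace's "gelu" means the exact version, while jax.nn.gelu
# is approximate by default, hence approximate=False for "geglu_exact".
ACTIVATIONS = {
    "geglu_exact": functools.partial(jax.nn.gelu, approximate=False),
    "geglu_approx": functools.partial(jax.nn.gelu, approximate=True),
    "swiglu": jax.nn.silu,
}

def mlp_variant_from_hidden_act(hidden_act: str) -> MLPVariant:
    # In llamalike_from_huggingface_model: translate HuggingFace's hidden_act
    # (an activation name) into Penzai's mlp_variant (an MLP design name).
    return {
        "silu": "swiglu",
        "gelu": "geglu_exact",
        "gelu_new": "geglu_approx",
    }[hidden_act]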

Comment on lines +39 to +63
name_or_path="hf-internal-testing/tiny-random-LlamaForCausalLM",
vocab_size=11,
hidden_size=64,
intermediate_size=256,
num_hidden_layers=3,
num_attention_heads=num_attention_heads,
num_key_value_heads=num_key_value_heads,
attention_bias=False,
attention_dropout=0.0,
bos_token_id=0,
eos_token_id=1,
hidden_act="silu",
initializer_range=0.02,
max_position_embeddings=2048,
mlp_bias=False,
model_type="llama",
pad_token_id=-1,
pretraining_tp=1,
rms_norm_eps=1e-06,
rope_scaling=None,
rope_theta=10000.0,
tie_word_embeddings=False,
torch_dtype="float32",
transformers_version="4.44.2",
use_cache=True,
@danieldjohnson (Collaborator) commented:

nit: mind adding a comment that says where these config settings came from? e.g. # This config is based on pretrained model "..."

(Is it the config for "hf-internal-testing/tiny-random-LlamaForCausalLM"? I'm curious whether that's representative of the configs people actually use, I wonder if we could take the config args from e.g. meta-llama/Llama-3.1-8B instead but with a smaller number of layers / hidden size.)

Same comment also applies to the other modified tests below.
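For instance, the requested provenance note could look something like this (the exact structure of the test is an assumption):

import transformers

# This config is based on the pretrained model
# "hf-internal-testing/tiny-random-LlamaForCausalLM".
config = transformers.LlamaConfig(
    vocab_size=11,
    hidden_size=64,
    # ... remaining attributes as listed in the diff above ...
)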

Linked issue that may be closed by this PR: Conversion from pretrained HuggingFace models (#115)