
Fix: Add missing attributes to hugging face conversion functions #116


Open · wants to merge 6 commits into main

Conversation

@ealt commented Apr 22, 2025

Changes

This PR adds missing attributes to the lists of handled/ignored configuration attributes in the model conversion functions for:

  • Llama models (llama_from_huggingface_model)
  • Mistral models (mistral_from_huggingface_model)
  • GPT-NeoX models (gpt_neox_from_huggingface_model)

Problem

When converting HuggingFace models to Penzai, the conversion would fail if the model configuration contained certain non-critical attributes such as pad_token_id and _name_or_path. These attributes are present in many HuggingFace models (including the test models) but do not affect the model's functionality.

Solution

Added attributes to the handled_or_ignored_attributes set in llama/mistral/gpt_neox model conversion functions, allowing the conversion to proceed while ignoring these non-critical configuration values.
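For illustration, a minimal sketch of the change (the surrounding Penzai conversion code is paraphrased; only the attribute names come from this PR):

# Attributes that the conversion either maps explicitly or can safely ignore.
handled_or_ignored_attributes = {
    # ... attributes already handled by the conversion ...
    # Non-critical attributes added in this PR:
    "pad_token_id",
    "_name_or_path",
}
# Config attributes outside this set still cause the conversion to fail,
# so genuinely unexpected settings are not silently dropped.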

Testing

The fix has been tested with:

  • Existing test cases in transformer_consistency_test.py
  • Newly implemented test cases that load tiny models from huggingface (hf-internal-testing/tiny-random-[Llama/Mistral/GPTNeoX]ForCausalLM). These new tests fail without the updates in this PR; a sketch of the setup follows below.
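As a hedged sketch, the new tests boil down to loading a tiny pretrained model and converting it (import paths here are assumptions and may differ from the actual test code in transformer_consistency_test.py):

import transformers
# Import path assumed for illustration; the real test may import differently.
from penzai.models.transformer.variants import llama

hf_model = transformers.LlamaForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-LlamaForCausalLM"
)
# Before this PR, conversion raised because the config carries attributes
# such as pad_token_id and _name_or_path that were not in the
# handled/ignored set.
pz_model = llama.llama_from_huggingface_model(hf_model)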

Related Issue

Fixes #115

@ealt (Author) commented Apr 23, 2025

Hi @danieldjohnson, I submitted this PR from a fork and it looks like the CI checks haven’t been triggered yet.
Could you please approve or manually run the unittest-job workflow so the checks can proceed?
Thanks!

@danieldjohnson (Collaborator) left a comment


Thanks for the PR! Left a few comments.

Also, I notice that the PR is currently not passing our CI formatting checks. Mind fixing this? If you run uv run pyink penzai tests it should reformat everything automatically.

(See here for the full set of checks that gets run in CI.)

@danieldjohnson (Collaborator) left a comment


Thanks! I realized something else about the design here and left a few comments.

Also, looks like somehow there's a uv versioning issue that is messing with our CI. It looks like you may have used a version of uv that is newer than the version GitHub is currently configured with. Would you mind reverting your changes to uv.lock? I don't think you've changed anything that requires updating the lockfile in this case.

@@ -595,7 +599,7 @@ def llamalike_from_huggingface_model(
     mlp_hidden_dim=hf_config.intermediate_size,
     num_decoder_blocks=hf_config.num_hidden_layers,
     vocab_size=hf_config.vocab_size,
-    mlp_variant="swiglu",
+    mlp_variant=hf_config.hidden_act,
@danieldjohnson (Collaborator) commented:

Looking at this again, I realized there is a subtlety that makes this a bit complicated, and we probably need to handle it in a different way.

Specifically, in the current llamalike_common implementation, the mlp_variant is named based on the overall MLP design, e.g. "geglu" or "swiglu", and not the activation used inside that MLP, which would be "gelu" or "silu". These are related but not the same: a geglu MLP takes the product of a gelu and a linear layer, whereas a gelu MLP traditionally refers to just a normal MLP with a gelu activation and no separate linear layer.

On the other hand, in the HuggingFace transformers ecosystem, the hidden_act appears to refer specifically to the activation function used. Thus, when the model uses a geglu MLP, the hidden_act would be "gelu", and it's just assumed from context that there will be a separate linear layer.

Eventually it might make sense to try and switch conventions, but for now it seems simplest to keep the convention the same and avoid doing a bunch of refactoring, since in practice I think most of today's llama-like models use either geglu/gelu or swiglu/silu and not any other alternatives.

Would you mind making the following changes?

  • In the type for the mlp_variant arg, make it Literal["geglu_exact", "geglu_approx", "swiglu"] (so "geglu_exact" instead of "gelu_exact", and no silu/relu),
  • In build_llamalike_feedforward, map the "geglu_exact" MLP variant to the functools.partial(jax.nn.gelu, approximate=False) activation function, and don't allow anything other than "geglu_exact", "geglu_approx", "swiglu",
  • In llamalike_from_huggingface_model, have a separate mapping from HuggingFace's hidden_act to the Penzai mlp_variant, which would be something like
    {"silu": "swiglu", "gelu": "geglu_exact", "gelu_new": "geglu_approx"}[hf_config.hidden_act]
    
    (Note also that "gelu" needs to be mapped to the "geglu_exact" codepath because "gelu" in HuggingFace refers to the non-approximate version, whereas jax.nn.gelu is the approximate version by default.)

Thanks, and sorry for the complexity here!
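For concreteness, a rough sketch of the requested mappings (names are taken from this thread; the surrounding Penzai code is assumed rather than quoted, and the helper function name is hypothetical):

import functools
from typing import Literal

import jax

# Requested narrower type for the mlp_variant argument.
MLPVariant = Literal["geglu_exact", "geglu_approx", "swiglu"]

# In build_llamalike_feedforward: pick the gating activation from the MLP
# variant. HuggingFace's "gelu" means the exact version, while jax.nn.gelu
# is approximate by default, hence approximate=False for "geglu_exact".
ACTIVATIONS = {
    "geglu_exact": functools.partial(jax.nn.gelu, approximate=False),
    "geglu_approx": functools.partial(jax.nn.gelu, approximate=True),
    "swiglu": jax.nn.silu,
}

def mlp_variant_from_hidden_act(hidden_act: str) -> MLPVariant:
    # In llamalike_from_huggingface_model: translate HuggingFace's hidden_act
    # (an activation name) into Penzai's mlp_variant (an MLP design name).
    return {
        "silu": "swiglu",
        "gelu": "geglu_exact",
        "gelu_new": "geglu_approx",
    }[hidden_act]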

Comment on lines +39 to +63
name_or_path="hf-internal-testing/tiny-random-LlamaForCausalLM",
vocab_size=11,
hidden_size=64,
intermediate_size=256,
num_hidden_layers=3,
num_attention_heads=num_attention_heads,
num_key_value_heads=num_key_value_heads,
attention_bias=False,
attention_dropout=0.0,
bos_token_id=0,
eos_token_id=1,
hidden_act="silu",
initializer_range=0.02,
max_position_embeddings=2048,
mlp_bias=False,
model_type="llama",
pad_token_id=-1,
pretraining_tp=1,
rms_norm_eps=1e-06,
rope_scaling=None,
rope_theta=10000.0,
tie_word_embeddings=False,
torch_dtype="float32",
transformers_version="4.44.2",
use_cache=True,
@danieldjohnson (Collaborator) commented:

nit: mind adding a comment that says where these config settings came from? e.g. # This config is based on pretrained model "..."

(Is it the config for "hf-internal-testing/tiny-random-LlamaForCausalLM"? I'm curious whether that's representative of the configs people actually use, I wonder if we could take the config args from e.g. meta-llama/Llama-3.1-8B instead but with a smaller number of layers / hidden size.)

Same comment also applies to the other modified tests below.
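For instance, the requested provenance note could look something like this (the exact structure of the test is an assumption):

import transformers

# This config is based on the pretrained model
# "hf-internal-testing/tiny-random-LlamaForCausalLM".
config = transformers.LlamaConfig(
    vocab_size=11,
    hidden_size=64,
    # ... remaining attributes as listed in the diff above ...
)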

Linked issue that may be closed by this PR: Conversion from pretrained HuggingFace models (#115)