Merge qss-poc branch for HCC back to main #1059

Open
wants to merge 66 commits into base: main

Conversation

danjuan-81 (Collaborator) commented Apr 4, 2025

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking
/kind bug
/kind cleanup
/kind documentation
/kind enhancement
/kind new quick start solution for HyperCompute Cluster

What this PR does / Why we need it:
Adds a new folder, hcc, under applications and updates cloudbuild.xml with new test cases for the new quick start solution (qss).

Which issue(s) this PR fixes:

Closes #

Special notes for your reviewer:

ACW101 and others added 30 commits December 6, 2024 23:51
…consumption model. Rename reservation as reservation_name
* Remove checkpoint bucket input

* Fix llama7b config and incorrect recipe enum values

* Remove region input and add logic to look up region by zone
* Add missing GCS module

* Small fixes.
ACW101 and others added 23 commits February 24, 2025 16:41
* Fix reservation toggle for A3U

* Modify output based on UI requirement

* Enable HOST_MAINTENANCE=PERIODIC for A3 Mega

* Fix nccl test workload delete failure due to dependency with cluster module
add readme for scripts to update zones and regions
* Update UI to temporarily remove GPU recipes and update 'Suggested next steps' section

* Remove schedulingGate from A3U NCCL test
* add kueue for nccl tests

* clean up code for kueue

* cleanup ultra test

* solve comments

* rename kueue
* Make reservation name variable required

* Add a3ultra Llama3.1-7b recipe
Update the network to vpc1...9

* Add a3ultra Llama3.1-70b recipe

* Remove unused cluster name from Nemo module.
…n >2 nodes (#1007)

* add kueue to nemo

* enable nccl tests on >2 nodes

* fix nemo with kueue

* cleanup

* resolve conflict

* modify reservation for a3u
* Add Llama3.1 70B using MaxText

* Add mixtral8 70b NeMo recipe

* Add A3Mega Mixtral8-7b NeMo recipe and A3Ultra Mixtral8-7b maxtext recipe

* Use the maxtext docker image provided by the GPU recipe team

* Update MaxText recipes using 16 nodes
* fix nemo

* fix
* Update image

* Fix Mixtral model parsing error.
Use new image for MaxText
* add a3m placement not null validation

* fix permission problem

* fix api issue

* future verify: api enable
* initial cloud build for hcc

* change project for cloudbuild

* disable mega cluster

* cleanup

* cleanup

* modify deploy name

* finalize cloudbuild

* remove reservation for gke only

* change cloudbuild name

* cleanup cloudbuild
#1044)

Make reservation name a top property and set the default value of the consumption option to reservation
#1058)

Add subtext for reservation name and modify the subtext for consumption options.
danjuan-81 and others added 5 commits April 4, 2025 10:55
* fix cicd pipeline

* fix cicd pipeline
* add cloudbuild scripts to update regions and zones

* add cloudbuild scripts to update regions and zones

* fix path for python script

* modify readme
* add cloudbuild scripts to update regions and zones

* add cloudbuild scripts to update regions and zones

* fix path for python script

* modify readme

* fix: a3mega validation error (the cluster creation fails when placement policy is null)

* fix: a3ultra nccl test rename

* fix: reformat metadata.display.yaml
* feat: add maintenance exclusions in a3mega and update name of maintenance exclusions in a3ultra

* feat: add cluster-director identification label to cluster

* fix: remove quote around 'gke_product_type'
4 participants