Merge qss-poc branch for HCC back to main #1059
Open
danjuan-81 wants to merge 66 commits into main from qss-poc
…tion blocks inputs
…consumption model. Rename reservation as reservation_name
* Fix reservation toggle for A3U * Modify output based on UI requirement * Enable HOST_MAINTENANCE=PERIODIC for A3 Mega * Fix nccl test workload delete failure due to dependency with cluster module
add readme for scripts to update zones and regions
* Update UI to temporarily remove GPU recipes and update 'Suggested next steps' section * Remove schedulingGate from A3U NCCL test
* add kueue for nccl tests * clean up codes for kueue * cleanup ultra test * solve comments * rename kueue
* Make reservation name variable required * Add a3ultra Llama3.1-7b recipe Update the network to vpc1...9 * Add a3ultra Llama3.1-70b recipe * Remove unused cluster name from Nemo module.
…n >2 nodes (#1007) * add kueue to nemo * enable nccl tests on >2 nodes * fix nemo with kueue * cleanup * resolve conflict * modify reservation for a3u
* Add Llama3.1 70B using MaxText * Add mixtral8 70b NeMo recipe * Add A3Mega Mixtral8-7b NeMo recipe and A3Ultra Mixtral8-7b maxtext recipe * Use the maxtext docker image provided by the GPU recipe team * Update MaxText recipes using 16 nodes
fix nemo
* fix nemo * fix
* Update image * Fix Mixtral model parsing error.
Use new image for MaxText
basic validation test
* add a3m placement not null validation * fix permission problem * fix api issue * future verify: api enable
* initial cloud build for hcc * change project for cloudbuild * disable mega cluster * cleanup * cleanup * modify deploy name * finalize cloudbuild * remove reservation for gke only * change cloudbuild name * cleanup cloudbuild
#1044) Make reservation name as top properties, set the default value of consumption option to reservation
#1058) Add subtext for reservation name and modify the subtext for consumption options.
* fix cicd pipeline * fix cicd pipeline
* add cloudbuild scripts to update regions and zones * add cloudbuild scripts to update regions and zones * fix path for python script * modify readme
* add cloudbuild scripts to update regions and zones * add cloudbuild scripts to update regions and zones * fix path for python script * modify readme * fix: a3mega validation error (the cluster creation fails when placement policy is null) * fix: a3ultra nccl test rename * fix: reformat metadata.display.yaml
* feat: add maintenance exclusions in a3mega and update name of maintenance exclusions in a3ultra * feat: add cluster-director identification label to cluster * fix: remove quote around 'gke_product_type'
What type of PR is this?
What this PR does / Why we need it:
Add a new folder hcc under applications and update the cloudbuild.xml by adding new test cases for the new qss.
Which issue(s) this PR fixes:
Closes #
Special notes for your reviewer: