Add google_dataproc_job resource #253
Conversation
(force-pushed from 2d632d2 to bb3f879)
Thanks again for helping out with dataproc, @nickithewatt!
google/resource_dataproc_job.go (Outdated)
    },

    Schema: map[string]*schema.Schema{
(for this particular case I don't mind you top-leveling `project` so we can easily use the `getProject` helper)
google/resource_dataproc_job.go (Outdated)
    Schema: map[string]*schema.Schema{
        "cluster": {
Do you think people might want access to clusterUuid also? If so let's just have a block for https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#JobPlacement
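For illustration, a minimal sketch of what such a `placement` block could look like, mirroring the JobPlacement fields (`clusterName` settable, `clusterUuid` output-only); the snake_case attribute names here are assumptions, not the merged code:

```go
"placement": {
	Type:     schema.TypeList,
	Required: true,
	MaxItems: 1,
	ForceNew: true,
	Elem: &schema.Resource{
		Schema: map[string]*schema.Schema{
			// Maps to JobPlacement.clusterName (required by the API).
			"cluster_name": {
				Type:     schema.TypeString,
				Required: true,
				ForceNew: true,
			},
			// Maps to JobPlacement.clusterUuid (output only).
			"cluster_uuid": {
				Type:     schema.TypeString,
				Computed: true,
			},
		},
	},
},
```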
google/resource_dataproc_job.go (Outdated)
        Elem: &schema.Schema{Type: schema.TypeString},
    },

    "status": {
Do you think it would also make sense to include the rest of the status fields in https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#JobStatus?
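For reference, JobStatus carries `state`, `details`, `stateStartTime`, and `substate`, all output-only, so exposing them is just a handful of computed attributes; a sketch (the snake_case names are assumed):

```go
"status": {
	Type:     schema.TypeList,
	Computed: true,
	Elem: &schema.Resource{
		Schema: map[string]*schema.Schema{
			// All output-only fields from the API's JobStatus object.
			"state":            {Type: schema.TypeString, Computed: true},
			"details":          {Type: schema.TypeString, Computed: true},
			"state_start_time": {Type: schema.TypeString, Computed: true},
			"substate":         {Type: schema.TypeString, Computed: true},
		},
	},
},
```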
google/resource_dataproc_job.go (Outdated)
        Computed: true,
    },

    "outputUri": {
Want to just call this driver_output_resource_uri to match the API in case another type of output URI comes along in the future?
google/resource_dataproc_job.go (Outdated)
        Optional: true,
        Computed: true,
    },
Not sure if it's super necessary but it probably would be very little effort to also add driver_control_files_uri
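Both fields exist on the API's Job object (`driverOutputResourceUri`, `driverControlFilesUri`), so this would just be two computed strings; a sketch:

```go
// Output-only URIs populated by the Dataproc service once the job runs.
"driver_output_resource_uri": {
	Type:     schema.TypeString,
	Computed: true,
},
"driver_control_files_uri": {
	Type:     schema.TypeString,
	Computed: true,
},
```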
google/resource_dataproc_job.go (Outdated)
    }
}

func getPySparkJob(config map[string]interface{}) *dataproc.PySparkJob {
usually we call these functions `expand[object]`, so this would be called `expandPySparkJob`
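For anyone following along, `expand*` is the inverse of `flatten*`: it turns the Terraform config map back into the API struct. A minimal sketch of the renamed function, assuming the provider's `convertStringArr` helper and with only illustrative fields:

```go
// expandPySparkJob builds the API object from the pyspark_config block.
func expandPySparkJob(config map[string]interface{}) *dataproc.PySparkJob {
	job := &dataproc.PySparkJob{}
	if v, ok := config["main_python_file_uri"]; ok {
		job.MainPythonFileUri = v.(string)
	}
	if v, ok := config["args"]; ok {
		job.Args = convertStringArr(v.([]interface{}))
	}
	return job
}
```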
google/resource_dataproc_job.go (Outdated)
    // ---- Spark Job ----

    func sparkTFSchema() *schema.Schema {
same comments as above apply here and for the rest of the types of jobs
google/resource_dataproc_job.go (Outdated)
    Type:          schema.TypeString,
    Optional:      true,
    ForceNew:      true,
    ConflictsWith: []string{"hadoop_config.main_jar"},
because of the way lists are represented in state, this should (I believe) be `hadoop_config.0.main_jar`
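In other words, because `hadoop_config` is stored in state as a single-element list, references into it need the element index. A sketch of the corrected attribute (the attribute name here is a stand-in):

```go
"main_class": {
	Type:     schema.TypeString,
	Optional: true,
	ForceNew: true,
	// ".0." addresses the single element of the hadoop_config list in state.
	ConflictsWith: []string{"hadoop_config.0.main_jar"},
},
```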
Manages a job resource within a Dataproc cluster within GCE. For more information see
[the official dataproc documentation](https://cloud.google.com/dataproc/).

!> **Note:** This resource does not really support 'update' functionality. Once created
can you be more definitive in these statements? "This resource does not support 'update' and changing any attributes will cause the creation …"
* `labels` - (Optional) The list of labels (key/value pairs) to add to the job.

The **pyspark_config** supports:
The syntax we've been standardizing on for these would be: "The `pyspark_config` block supports:"
Thanks, aiming to try and get round to these changes some time in the next few days.
Hey @nickithewatt, think you'll have time for this one? I'm happy to take it over, I understand how much of a pain I was during the other review (though I'm quite confident this one will be less painful :) )

Hi @danawillow, no worries, appreciate the time taken by you too to review from your side :) Happy to try and get this one going again.

@danawillow have a few other items on my plate but hope to be able to get to this in just over a week or so.

Hi @danawillow, have also done the changes for

@danawillow Do you want me to squash and rebase?

Only if you want - I usually squash PRs when they get merged anyway, so it doesn't make a difference to me. Thanks @nickithewatt! I'll make sure to get to this this week.
Thanks @nickithewatt! Looks good overall, just a few things I had comments on.
google/provider_test.go (Outdated)
    @@ -84,6 +85,21 @@ func TestProvider_getRegionFromZone(t *testing.T) {
        }
    }

    func TestConvertStringMap(t *testing.T) {
this should probably be moved to utils_test.go
google/resource_dataproc_job.go (Outdated)
    Type:        schema.TypeInt,
    Description: "Maximum number of times per hour a driver may be restarted as a result of driver terminating with non-zero code before job is reported failed.",
    Optional:    true,
    ForceNew:    true,
Let's add a validation.IntAtMost(10) for this
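With the SDK's `helper/validation` package this is a one-liner; a sketch of the attribute with the validator attached (attribute name assumed for illustration):

```go
"max_failures_per_hour": {
	Type:        schema.TypeInt,
	Description: "Maximum number of times per hour a driver may be restarted as a result of driver terminating with non-zero code before job is reported failed.",
	Optional:    true,
	ForceNew:    true,
	// The API caps driver restarts at ten per hour.
	ValidateFunc: validation.IntAtMost(10),
},
```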
google/resource_dataproc_job.go (Outdated)
    "placement": {
        Type:     schema.TypeList,
        Optional: true,
Is this actually optional?
no, changed ...
google/resource_dataproc_job.go (Outdated)
    Type:     schema.TypeBool,
    Default:  false,
    Optional: true,
    ForceNew: true,
I wonder whether it makes sense to make this updatable, in case the user doesn't realize the attribute exists (or forgets to set it) and then later realizes they want to force delete the job. What do you think?
fair enough, added
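Since the flag never reaches the API at create time and only changes delete-time behavior, making it updatable amounts to dropping `ForceNew` and giving the resource a trivial `Update`; a sketch (attribute and function names assumed):

```go
"force_delete": {
	Type:     schema.TypeBool,
	Default:  false,
	Optional: true, // no ForceNew: changing it shouldn't recreate the job
},

// The flag only affects Delete, so Update has nothing to send to the API;
// refreshing state is enough.
func resourceDataprocJobUpdate(d *schema.ResourceData, meta interface{}) error {
	return resourceDataprocJobRead(d, meta)
}
```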
google/resource_dataproc_job.go (Outdated)
    },
}

func flattenJobReference(r *dataproc.JobReference) []map[string]interface{} {
Let's move all the functions that apply to all dataproc jobs into one area- these seem to have found themselves in the middle of the PySpark section.
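For context, these generic flatteners all share one shape — API struct in, single-element `[]map[string]interface{}` out — which is why grouping them in one section reads well. The pattern, sketched (field handling illustrative):

```go
// flattenJobReference converts the API JobReference into the
// single-element list form Terraform uses for nested blocks.
func flattenJobReference(r *dataproc.JobReference) []map[string]interface{} {
	if r == nil {
		return nil
	}
	return []map[string]interface{}{
		{
			"project_id": r.ProjectId,
			"job_id":     r.JobId,
		},
	}
}
```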
google/resource_dataproc_job_test.go (Outdated)
    jobCompleteTimeoutMins := 3
    waitErr := dataprocJobOperationWait(config, region, project, job.Reference.JobId,
        "Awaiting Dataproc job completion of failure", jobCompleteTimeoutMins, 1)
should that be "completion or failure"?
google/resource_dataproc_job_test.go (Outdated)
    config := testAccProvider.Meta().(*Config)
    jobId := s.RootModule().Resources[n].Primary.ID
    found, err := config.clientDataproc.Projects.Regions.Jobs.Get(
        config.Project, rs.Primary.Attributes["region"], jobId).Do()
any reason to use `config.Project` here but `getTestProject` above? I'm fine with either one, but may as well be consistent
Manages a job resource within a Dataproc cluster within GCE. For more information see
[the official dataproc documentation](https://cloud.google.com/dataproc/).

!> **Note:** This resource does not support 'update' and changing any attributes will cause the
how does "changing any attributes will cause the resource to be recreated" sound?
In addition to the arguments listed above, the following computed attributes are
exported:

* `reference.cluster_uuid` - A cluster UUID generated by the Cloud Dataproc service when the job is submitted.
Because of how nested objects work, this is actually `reference.0.cluster_uuid` (likewise for `status`)
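The indexed paths are also what test assertions have to use; a sketch of an acceptance-test check (resource address hypothetical):

```go
// Computed nested attributes are asserted via their indexed state paths.
var checkComputedAttrs = resource.ComposeTestCheckFunc(
	resource.TestCheckResourceAttrSet("google_dataproc_job.pyspark", "reference.0.cluster_uuid"),
	resource.TestCheckResourceAttrSet("google_dataproc_job.pyspark", "status.0.state"),
)
```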
    MaxItems: 1,
    Elem: &schema.Resource{
        Schema: map[string]*schema.Schema{
            "job_id": {
I'm not going to block on this, but I'd love to see a ValidateFunc here (I think the regex you want is `^[a-zA-Z0-9_-]{1,100}$`)
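A sketch of a hand-rolled ValidateFunc using that regex (written inline in case the vendored SDK predates a matching validation helper):

```go
"job_id": {
	Type:     schema.TypeString,
	Optional: true,
	ForceNew: true,
	ValidateFunc: func(v interface{}, k string) (ws []string, errs []error) {
		// Job IDs: letters, digits, underscores, hyphens; 1-100 characters.
		if !regexp.MustCompile(`^[a-zA-Z0-9_-]{1,100}$`).MatchString(v.(string)) {
			errs = append(errs, fmt.Errorf(
				"%q must match ^[a-zA-Z0-9_-]{1,100}$, got %q", k, v.(string)))
		}
		return
	},
},
```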
(force-pushed from 9133cb3 to 578952b)
@danawillow changes made, think I have addressed them all. Did a squash and rebase as well.

Looks good @nickithewatt! I pushed a few small changes to your branch so we didn't have to do another back-and-forth for small things, hope that's all right with you! Merging now.

Thanks @danawillow
* Add google_dataproc_job resource
* Correct state ref in docs
* make tests parallel
* cleanup, mostly whitespace related
* docs fmt
<!-- This change is generated by MagicModules. --> /cc @chrisst
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉, please reach out to my human friends 👉 [email protected]. Thanks!
@danawillow As requested ...
This is PR 2 of 2 (splitting #231 up) and looking to address #31 (Add support for Google Cloud Dataproc) specifically adding the google_dataproc_job resource.
!! NOTE: This PR is dependent on (rebased on) PR #252 !!
To recap:

The jobs are pretty much fleshed out. There is one `google_dataproc_job` resource with different `xxx_config` blocks for the different job types:

* `google_dataproc_job`
  * `pyspark_config` - support for PySpark jobs
  * `spark_config` - support for Spark jobs
  * `sparksql_config` - support for Spark-SQL jobs
  * `hadoop_config` - support for Hadoop jobs
  * `hive_config` - support for Hive jobs
  * `pig_config` - support for Pig jobs

Specifically for `google_dataproc_job`, create essentially submits a job to the cluster to let it run. It does not wait for it to finish. Updating of jobs doesn't really make much sense, and for delete I am genuinely deleting the job. Under normal circumstances, Dataproc won't let you delete active jobs. I have therefore added a `force_destroy` option which, if true, will first cancel the job before deleting (see the sketch below).
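A sketch of that cancel-then-delete flow (helper wiring such as the wait step is elided, and the flag name follows the description above; this is illustrative, not the merged code):

```go
func resourceDataprocJobDelete(d *schema.ResourceData, meta interface{}) error {
	config := meta.(*Config)
	project := d.Get("project").(string)
	region := d.Get("region").(string)

	// Dataproc refuses to delete an active job, so optionally cancel it first.
	if d.Get("force_destroy").(bool) {
		if _, err := config.clientDataproc.Projects.Regions.Jobs.Cancel(
			project, region, d.Id(), &dataproc.CancelJobRequest{}).Do(); err != nil {
			return err
		}
		// (the real code would wait here for the job to leave the active state)
	}

	if _, err := config.clientDataproc.Projects.Regions.Jobs.Delete(
		project, region, d.Id()).Do(); err != nil {
		return err
	}
	d.SetId("")
	return nil
}
```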