Add google_dataproc_job resource #253


Merged: 5 commits, Nov 21, 2017

Conversation

@nickithewatt (Contributor) commented Jul 26, 2017:

@danawillow As requested ...
This is PR 2 of 2 (splitting #231 up), looking to address #31 (Add support for Google Cloud Dataproc) by adding the google_dataproc_job resource specifically.

!! NOTE: This PR is dependent on (rebased on) PR #252 !!

To recap:
The jobs are pretty much fleshed out. There is one google_dataproc_job resource, with different xxx_config blocks for the different job types:

  • google_dataproc_job
    • support for PySpark jobs via pyspark_config
    • support for Spark jobs via spark_config
    • support for Spark SQL jobs via sparksql_config
    • support for Hadoop jobs via hadoop_config
    • support for Hive jobs via hive_config
    • support for Pig jobs via pig_config

Specifically for google_dataproc_job, create essentially submits a job to the cluster and lets it run; it does not wait for the job to finish. Updating jobs doesn't really make much sense, and for delete I am genuinely deleting the job. Under normal circumstances, Dataproc won't let you delete active jobs, so I have added a force_destroy option which, if true, will first cancel the job before deleting it.
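For illustration, here is a minimal sketch of that delete flow using the Cancel and Delete calls from the dataproc/v1 Go client. This is not necessarily the PR's exact code: the function name and field names are assumptions, and it assumes the provider's usual Config/meta plumbing.

// Sketch only: if force_destroy is set, cancel the active job first,
// since Dataproc refuses to delete jobs that are still running.
func resourceDataprocJobDelete(d *schema.ResourceData, meta interface{}) error {
	config := meta.(*Config)
	project := d.Get("project").(string)
	region := d.Get("region").(string)
	jobId := d.Id()

	if d.Get("force_destroy").(bool) {
		// A real implementation would also wait for the cancellation
		// to take effect before attempting the delete.
		_, err := config.clientDataproc.Projects.Regions.Jobs.Cancel(
			project, region, jobId, &dataproc.CancelJobRequest{}).Do()
		if err != nil {
			return err
		}
	}

	_, err := config.clientDataproc.Projects.Regions.Jobs.Delete(
		project, region, jobId).Do()
	return err
}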

@nickithewatt force-pushed the dataproc_job branch 2 times, most recently from 2d632d2 to bb3f879 on August 3, 2017, 22:30.
@danawillow (Contributor) left a comment:

Thanks again for helping out with dataproc, @nickithewatt!

},

Schema: map[string]*schema.Schema{

@danawillow:

(for this particular case I don't mind you top-leveling project so we can easily use the getProject helper)
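For reference, this is the pattern that helper enables once project is a top-level field (a sketch of the provider's standard usage):

// With "project" top-level, the CRUD functions can resolve it
// consistently from either the resource or the provider config:
project, err := getProject(d, config)
if err != nil {
	return err
}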


Schema: map[string]*schema.Schema{

"cluster": {
@danawillow:

Do you think people might want access to clusterUuid also? If so let's just have a block for https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#JobPlacement
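One possible shape for such a block, mirroring the JobPlacement message from the linked API page (a sketch; the Required/ForceNew choices here are assumptions):

"placement": {
	Type:     schema.TypeList,
	Required: true,
	MaxItems: 1,
	ForceNew: true,
	Elem: &schema.Resource{
		Schema: map[string]*schema.Schema{
			// JobPlacement.clusterName: the cluster to run the job on.
			"cluster_name": {
				Type:     schema.TypeString,
				Required: true,
				ForceNew: true,
			},
			// JobPlacement.clusterUuid: output only.
			"cluster_uuid": {
				Type:     schema.TypeString,
				Computed: true,
			},
		},
	},
},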

Elem: &schema.Schema{Type: schema.TypeString},
},

"status": {
@danawillow:

Do you think it would also make sense to include the rest of the status fields in https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#JobStatus?
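For example, a computed block covering the remaining JobStatus fields could look like this (a sketch; the field list is taken from the linked API page):

"status": {
	Type:     schema.TypeList,
	Computed: true,
	Elem: &schema.Resource{
		Schema: map[string]*schema.Schema{
			// Mirrors JobStatus.state, .details, .stateStartTime
			// and .substate from the REST API.
			"state":            {Type: schema.TypeString, Computed: true},
			"details":          {Type: schema.TypeString, Computed: true},
			"state_start_time": {Type: schema.TypeString, Computed: true},
			"substate":         {Type: schema.TypeString, Computed: true},
		},
	},
},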

Computed: true,
},

"outputUri": {
@danawillow:

Want to just call this driver_output_resource_uri to match the API in case another type of output URI comes along in the future?

Optional: true,
Computed: true,
},

@danawillow:

Not sure if it's super necessary but it probably would be very little effort to also add driver_control_files_uri

}
}

func getPySparkJob(config map[string]interface{}) *dataproc.PySparkJob {
@danawillow:

usually we call these functions expand[object], so this would be called expandPySparkJob
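The renamed function would look something like this (a sketch; the schema key shown is an assumption, and the body is meant to match whatever getPySparkJob does today):

// Provider convention: expand[Object] builds the API object
// from the flattened schema data.
func expandPySparkJob(config map[string]interface{}) *dataproc.PySparkJob {
	job := &dataproc.PySparkJob{
		MainPythonFileUri: config["main_python_file_uri"].(string),
	}
	// ... populate the remaining fields as getPySparkJob does ...
	return job
}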


// ---- Spark Job ----

func sparkTFSchema() *schema.Schema {
@danawillow:

same comments as above apply here and for the rest of the types of jobs

Type: schema.TypeString,
Optional: true,
ForceNew: true,
ConflictsWith: []string{"hadoop_config.main_jar"},
@danawillow:

because of the way lists are represented in state, this should (I believe) be hadoop_config.0.main_jar
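That is, the corrected reference would be:

// Nested list attributes are addressed by index in state,
// hence the extra ".0." segment:
ConflictsWith: []string{"hadoop_config.0.main_jar"},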

Manages a job resource within a Dataproc cluster within GCE. For more information see
[the official dataproc documentation](https://cloud.google.com/dataproc/).

!> **Note:** This resource does not really support 'update' functionality. Once created
@danawillow:

can you be more definitive in these statements? This resource does not support 'update' and changing any attributes will cause the creation of a new resource


* `labels` - (Optional) The list of labels (key/value pairs) to add to the job.

The **pyspark_config** supports:
@danawillow:

The syntax we've been standardizing on for these would be:

The `pyspark_config` block supports:

@nickithewatt (Contributor, Author) commented Aug 11, 2017 via email.

@danawillow (Contributor):

Hey @nickithewatt, think you'll have time for this one? I'm happy to take it over, I understand how much of a pain I was during the other review (though I'm quite confident this one will be less painful :) )

@nickithewatt (Contributor, Author):

Hi @danawillow, no worries, appreciate the time taken by you too to review from your side :) Happy to try and get this one going again.

@nickithewatt (Contributor, Author):

@danawillow I have a few other items on my plate but hope to be able to get to this in just over a week or so.

@nickithewatt (Contributor, Author):

Hi @danawillow, I have now done the changes for dataproc_job as well. Thanks

@nickithewatt (Contributor, Author):

@danawillow Do you want me to squash and rebase?

@danawillow (Contributor):

Only if you want; I usually squash PRs when they get merged anyway, so it doesn't make a difference to me. Thanks @nickithewatt! I'll make sure to get to this this week.

@danawillow (Contributor) left a comment:

Thanks @nickithewatt! Looks good overall, just a few things I had comments on.

@@ -84,6 +85,21 @@ func TestProvider_getRegionFromZone(t *testing.T) {
}
}

func TestConvertStringMap(t *testing.T) {
@danawillow:

this should probably be moved to utils_test.go

Type: schema.TypeInt,
Description: "Maximum number of times per hour a driver may be restarted as a result of driver terminating with non-zero code before job is reported failed.",
Optional: true,
ForceNew: true,
@danawillow:

Let's add a validation.IntAtMost(10) for this
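With that suggestion applied, the field would read roughly as follows (a sketch; validation.IntAtMost comes from the SDK's helper/validation package, and the attribute name shown is an assumption):

"max_failures_per_hour": {
	Type:        schema.TypeInt,
	Description: "Maximum number of times per hour a driver may be restarted as a result of driver terminating with non-zero code before job is reported failed.",
	Optional:    true,
	ForceNew:    true,
	// The API caps this value at 10; reject anything larger at plan time.
	ValidateFunc: validation.IntAtMost(10),
},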


"placement": {
Type: schema.TypeList,
Optional: true,
@danawillow:

Is this actually optional?

@nickithewatt:

no, changed ...

Type: schema.TypeBool,
Default: false,
Optional: true,
ForceNew: true,
@danawillow:

I wonder whether it makes sense to make this updatable, in case the user doesn't realize the attribute exists (or forgets to set it) and then later realizes they want to force delete the job. What do you think?

@nickithewatt:

fair enough, added
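For illustration, one cheap way to support this (a sketch; function names are assumptions): since force_destroy only changes provider-side delete behavior, the Update function needs no API call at all.

// Sketch: only force_destroy is updatable (everything else stays
// ForceNew), and it is consumed client-side at delete time, so
// Update can simply refresh state.
func resourceDataprocJobUpdate(d *schema.ResourceData, meta interface{}) error {
	return resourceDataprocJobRead(d, meta)
}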

},
}

func flattenJobReference(r *dataproc.JobReference) []map[string]interface{} {
@danawillow:

Let's move all the functions that apply to all dataproc jobs into one area; these seem to have found themselves in the middle of the PySpark section.


jobCompleteTimeoutMins := 3
waitErr := dataprocJobOperationWait(config, region, project, job.Reference.JobId,
"Awaiting Dataproc job completion of failure", jobCompleteTimeoutMins, 1)
@danawillow:

should that be "completion or failure"?

config := testAccProvider.Meta().(*Config)
jobId := s.RootModule().Resources[n].Primary.ID
found, err := config.clientDataproc.Projects.Regions.Jobs.Get(
config.Project, rs.Primary.Attributes["region"], jobId).Do()
@danawillow:

any reason to use config.Project here but getTestProject above? I'm fine with either one, but may as well be consistent

Manages a job resource within a Dataproc cluster within GCE. For more information see
[the official dataproc documentation](https://cloud.google.com/dataproc/).

!> **Note:** This resource does not support 'update' and changing any attributes will cause the
@danawillow:

how does "changing any attributes will cause the resource to be recreated" sound?

In addition to the arguments listed above, the following computed attributes are
exported:

* `reference.cluster_uuid` - A cluster UUID generated by the Cloud Dataproc service when the job is submitted.
@danawillow:

Because of how nested objects work, this is actually reference.0.cluster_uuid (likewise for status)

MaxItems: 1,
Elem: &schema.Resource{
Schema: map[string]*schema.Schema{
"job_id": {
@danawillow:

I'm not going to block on this, but I'd love to see a ValidateFunc here (I think the regex you want is ^[a-zA-Z0-9_-]{1,100}$)
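A sketch of such a ValidateFunc using that regex (the surrounding schema flags are assumptions, and the fragment assumes the file imports fmt and regexp):

"job_id": {
	Type:     schema.TypeString,
	Optional: true,
	Computed: true,
	ForceNew: true,
	// Enforce Dataproc's job ID format at plan time rather than
	// failing at apply time.
	ValidateFunc: func(v interface{}, k string) (ws []string, errors []error) {
		if value := v.(string); !regexp.MustCompile(`^[a-zA-Z0-9_-]{1,100}$`).MatchString(value) {
			errors = append(errors, fmt.Errorf(
				"%q must match ^[a-zA-Z0-9_-]{1,100}$, got %q", k, value))
		}
		return
	},
},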

@nickithewatt (Contributor, Author):

@danawillow changes made; I think I have addressed them all. I did a squash and rebase as well.

@danawillow (Contributor):

Looks good @nickithewatt! I pushed a few small changes to your branch so we didn't have to do another back-and-forth for small things, hope that's all right with you! Merging now.

@danawillow merged commit 68f7d77 into hashicorp:master on Nov 21, 2017.
@nickithewatt (Contributor, Author):

Thanks @danawillow

chrisst pushed a commit to chrisst/terraform-provider-google that referenced this pull request Nov 9, 2018
* Add google_dataproc_job resource

* Correct state ref in docs

* make tests parallel

* cleanup, mostly whitespace related

* docs fmt
luis-silva pushed a commit to luis-silva/terraform-provider-google that referenced this pull request May 21, 2019
/cc @chrisst
@ghost commented Mar 30, 2020:

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost locked and limited the conversation to collaborators on Mar 30, 2020.