cannot create dataflow jobs with the enableStreamingEngine boolean set #8649

Closed
@n-oden

Description

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

$ terraform -v
Terraform v0.13.4
+ provider registry.terraform.io/hashicorp/google v3.59.0
+ provider registry.terraform.io/hashicorp/google-beta v3.59.0

Affected Resource(s)

  • google_dataflow_job

Terraform Configuration Files

resource "google_pubsub_topic" "test-topic" {
  name = "test-metrics-sink"
}

resource "google_pubsub_subscription" "test-sub" {
  name  = "test-metrics-source"
  topic = "projects/myproject/topics/sourcetopic" // this should be a topic with some traffic
  expiration_policy { ttl = "86400s" }
  message_retention_duration = "600s"
}

resource "google_dataflow_job" "test_job" {
  name              = "test-ps2ps-tf"
  template_gcs_path = "gs://dataflow-templates/2021-02-15-00_RC00/Cloud_PubSub_to_Cloud_PubSub"
  temp_gcs_location = "gs://mybucket/temp"
  zone              = "us-east1-b"
  max_workers       = 2
  machine_type      = "n1-standard-2"
  on_delete         = "drain"
  additional_experiments = [
    "enable_windmill_service",
    "enable_streaming_engine",
  ]

  labels = {
    # These labels get auto-magically set in Dataflow when it detects you're using a template
    # that the gcloud team wrote. If you don't manually specify them, Terraform thinks you've
    # removed them and redeploys the job on every apply, regardless of whether you changed anything.
    goog-dataflow-provided-template-name    = "cloud_pubsub_to_cloud_pubsub"
    goog-dataflow-provided-template-version = "2021-02-15-00_rc00"
  }

  parameters = {
    inputSubscription = google_pubsub_subscription.test-sub.id
    outputTopic       = google_pubsub_topic.test-topic.id
  }
}

Debug Output

https://gist.github.com/n-oden/d5fd36c7b54fb68a50afce095a9a591b

Expected Behavior

Terraform should launch a job using the Google Pub/Sub-to-Pub/Sub template, and the Streaming Engine feature should be enabled for the job.

It's not so much that Terraform is misbehaving per se: the API request it makes to dataflow.googleapis.com is correct given the configuration above. The problem is that there is no way to set an important boolean in the JSON document that gets POSTed to /v1b3/projects/myproject/locations/us-east1/templates. Read on below:

Actual Behavior

The job created by Terraform does not have Streaming Engine enabled and, worse yet, does not actually process any data.

The issue here appears to be that Streaming Engine can no longer be enabled via the additional_experiments list: there is now a first-class configuration option in the environment section of the JSON document that is POSTed to Google to create a new job.

If you create a Dataflow job from a Google-provided template with the gcloud CLI, the --enable-streaming-engine flag causes an enableStreamingEngine key to be added to the environment object in the POST data.

There is presently no way to do this with Terraform: there is no enable_streaming_engine argument on the google_dataflow_job resource, and, as noted above, passing enable_streaming_engine as a string inside the additional_experiments list produces a broken job.
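
For illustration only, here's a sketch of what a first-class argument could look like. Note that enable_streaming_engine is hypothetical here; no such argument exists in the provider as of v3.59.0:

resource "google_dataflow_job" "test_job" {
  name              = "test-ps2ps-tf"
  template_gcs_path = "gs://dataflow-templates/2021-02-15-00_RC00/Cloud_PubSub_to_Cloud_PubSub"
  temp_gcs_location = "gs://mybucket/temp"

  # Hypothetical argument: this would map to "enableStreamingEngine": true
  # in the environment object of the POST body, as shown in the
  # gcloud --log-http output under "Steps to Reproduce" below.
  enable_streaming_engine = true
}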

Steps to Reproduce

  1. terraform apply

To see what should happen, you can use the gcloud CLI:

gcloud --log-http dataflow jobs run test-ps2ps \
  --enable-streaming-engine \
  --gcs-location gs://dataflow-templates/latest/Cloud_PubSub_to_Cloud_PubSub \
  --parameters=inputSubscription=projects/myproject/subscriptions/test-metrics-source,outputTopic=projects/myproject/topics/test-metrics-sink \
  --staging-location=gs://mybucket/staging/

You'll see in the --log-http output that the CLI makes the following API call:

==== request start ====
uri: https://dataflow.googleapis.com/v1b3/projects/myproject/locations/us-central1/templates?alt=json
method: POST
== headers start ==
accept: application/json
accept-encoding: gzip, deflate
authorization: --- Token Redacted ---
content-length: 385
content-type: application/json
== headers end ==
== body start ==
{
  "environment": {
    "enableStreamingEngine": true,
    "tempLocation": "gs://mybucket/staging/"
  },
  "gcsPath": "gs://dataflow-templates/latest/Cloud_PubSub_to_Cloud_PubSub",
  "jobName": "test-ps2ps2",
  "location": "us-central1",
  "parameters": {
    "inputSubscription": "projects/myproject/subscriptions/test-metrics-source",
    "outputTopic": "projects/myproject/topics/test-metrics-sink"
  }
}
== body end ==

Important Factoids

To my intense aggravation, the enableStreamingEngine key is not documented in Google's official docs for the environment object (https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#environment), but the gcloud tool is absolutely using it. :(
