[pulsar-spark] added option for configuring Pulsar client #14

Open
atezs82 wants to merge 1 commit into master from pulsar_spark_add_client_config

Conversation

@atezs82 atezs82 commented May 26, 2021

Motivation

It is currently not possible to configure the Pulsar client when creating a SparkStreamingPulsarReceiver (only the service URL, the authentication plugin, and the consumer configuration can be set). This renders the receiver unusable when, for example, the Pulsar cluster is configured to use TLS and custom TLS certificates need to be set.

Modifications

  • Added an optional Map for configuring the client to the constructor of the receiver (see the sketch after this list). Existing constructors remain usable; they set this parameter to an empty map to maintain backward compatibility, and when the map is empty, loadConf on the client builder is not called.
  • If specified, this configuration overrides other client settings (this is how ClientBuilder works).
  • Added a simple integration test around the functionality (it only checks whether serviceUrl can be overridden by this feature).
  • Fixed some unit tests around SparkStreamingPulsarReceiver (they seemed to use an older interface). This issue was similar to Integration tests fail #12 (but not exactly the same).
  • If the modifications look good, I also plan to update the documentation accordingly.
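
A minimal sketch of how the optional map could be applied through ClientBuilder.loadConf, as described above. The class and helper names below are illustrative only and not necessarily the exact code in this PR; the configuration keys are common TLS-related Pulsar client options:

import java.util.HashMap;
import java.util.Map;

import org.apache.pulsar.client.api.ClientBuilder;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;

public class ClientConfigSketch {

    // Hypothetical helper mirroring how the receiver could build its client.
    static PulsarClient buildClient(String serviceUrl, Map<String, Object> clientConfig)
            throws PulsarClientException {
        ClientBuilder builder = PulsarClient.builder().serviceUrl(serviceUrl);
        if (!clientConfig.isEmpty()) {
            // loadConf overrides values already set on the builder,
            // which is why the supplied map takes precedence.
            builder.loadConf(clientConfig);
        }
        return builder.build();
    }

    public static void main(String[] args) throws PulsarClientException {
        Map<String, Object> clientConfig = new HashMap<>();
        // Example TLS-related options; paths and URLs are placeholders.
        clientConfig.put("useTls", true);
        clientConfig.put("tlsTrustCertsFilePath", "/etc/pulsar/certs/ca.cert.pem");

        PulsarClient client = buildClient("pulsar+ssl://broker.example.com:6651", clientConfig);
        client.close();
    }
}

The empty-map check mirrors the behavior described above, where loadConf is skipped when no client configuration is supplied.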

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • Added a simple integration test around the functionality (it only checks whether serviceUrl can be overridden by this feature).
  • Existing integration tests for the receiver were also fixed and observed to be working during a local run.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): no
  • The public API: yes (ability to set client configuration when creating receiver)
  • The schema: don't know
  • The default values of configurations: no
  • The wire protocol: no
  • The rest endpoints: no
  • The admin cli options: no
  • Anything that affects deployment: don't know

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? will do docs and Javadocs once the feature looks good in this PR

@atezs82 atezs82 force-pushed the pulsar_spark_add_client_config branch from 5e238c8 to 2ffe7c4 on May 26, 2021 08:10
@eolivelli
Contributor

@atezs82 can you please rebase this patch?

@atezs82 atezs82 force-pushed the pulsar_spark_add_client_config branch from 2ffe7c4 to cfec027 on June 19, 2021 19:22
@atezs82
Author

atezs82 commented Jun 19, 2021

@eolivelli Rebased on top of master, but I might have some additional work to do, since tests are still failing locally for Apache Pulsar :: Tests :: Pulsar Kafka Compat Client Tests. Please let me know what you think about this change; I can work on fixing the local issues in the meantime.

@eolivelli
Contributor

eolivelli commented Jun 19, 2021

We are already working on fixing the integration tests.

As you are touching only the Spark adapter, you can ignore the Kafka stuff.

Thanks

Contributor

@eolivelli eolivelli left a comment

The changes look good to me.
I left a couple of minor comments.
PTAL

@codelipenghui you are going to cut a release soon.
What about including this patch?

@atezs82 atezs82 force-pushed the pulsar_spark_add_client_config branch from cfec027 to 0ecd8e5 on June 19, 2021 20:00
@atezs82
Author

atezs82 commented Jun 19, 2021

Thanks for the comments; I removed all non-Spark-related changes from the PR. Due to some issues, I still do not have the tests working. I'm not sure whether this is a local or a general problem, but I get package org.apache.pulsar.tests.integration.suites does not exist during the test run; I will need to look into it a bit later.

Contributor

@eolivelli eolivelli left a comment

LGTM

please remove the star import

@lhotari
Member

lhotari commented Jun 20, 2021

> @eolivelli Rebased on top of master, but I might have some additional work to do, since tests are still failing locally for Apache Pulsar :: Tests :: Pulsar Kafka Compat Client Tests. Please let me know what you think about this change; I can work on fixing the local issues in the meantime.

That is due to the fact that the integration test library hasn't been published to Maven Central. The workaround is to compile and install it locally. Here are the commands used in CI:

- name: install org.apache.pulsar.tests:integration:jar:tests:2.8.0
  if: ${{ steps.check_changes.outputs.docs_only != 'true' }}
  run: |
    cd ~
    git clone --depth 50 --single-branch --branch v2.8.0 https://github.com/apache/pulsar
    cd pulsar
    mvn -B -ntp -f tests/pom.xml -pl org.apache.pulsar.tests:tests-parent,org.apache.pulsar.tests:integration install
- name: build apachepulsar/pulsar-test-latest-version:latest
  if: ${{ steps.check_changes.outputs.docs_only != 'true' }}
  run: |
    docker pull apachepulsar/pulsar-all:2.8.0
    docker pull apachepulsar/pulsar:2.8.0
    docker tag apachepulsar/pulsar-all:2.8.0 apachepulsar/pulsar-all:latest
    docker tag apachepulsar/pulsar:2.8.0 apachepulsar/pulsar:latest
    cd ~/pulsar
    mvn -B -ntp -f tests/docker-images/pom.xml install -pl org.apache.pulsar.tests:latest-version-image -am -Pdocker,-main -DskipTests

The commands also cover building apachepulsar/pulsar-test-latest-version:latest, which is used by the integration tests, since the one on Docker Hub isn't up to date.

@atezs82 atezs82 force-pushed the pulsar_spark_add_client_config branch from 0ecd8e5 to 0aea629 on June 20, 2021 18:38
@atezs82
Author

atezs82 commented Jun 20, 2021

@lhotari Thanks for the workaround; I have managed to run all the tests for the PR successfully locally.

@eolivelli Sorry for the mess, I have corrected the import statement.

@eolivelli
Contributor

@codelipenghui @sijie PTAL

this(StorageLevel.MEMORY_AND_DISK_2(), serviceUrl, clientConfig, consumerConfig, authentication);
}

public SparkStreamingPulsarReceiver(StorageLevel storageLevel,
Member

We are introducing a new parameter, storageLevel, but it is not used.

Author

@atezs82 atezs82 Jun 24, 2021

Thanks for the finding, corrected this in the upcoming patchset.
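
A hypothetical sketch of what the corrected constructor could look like, assuming SparkStreamingPulsarReceiver extends Spark's Receiver<byte[]> and should forward the storage level to the base class (field assignments elided; this is not the exact code in the PR):

public SparkStreamingPulsarReceiver(StorageLevel storageLevel,
                                    String serviceUrl,
                                    Map<String, Object> clientConfig,
                                    ConsumerConfigurationData<byte[]> consumerConfig,
                                    Authentication authentication) {
    super(storageLevel); // previously the parameter was accepted but never used
    // ... store serviceUrl, clientConfig, consumerConfig and authentication ...
}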

this(StorageLevel.MEMORY_AND_DISK_2(), serviceUrl, new HashMap<>(), consumerConfig, authentication);
}

public SparkStreamingPulsarReceiver(
Member

Why are we going to introduce so many new builders?

Author

The idea behind adding two new constructors alongside the two existing ones is that I wanted to allow users of the interface to configure the Pulsar client regardless of whether they specify a storage level or leave it at the default. I also wanted to keep the previous constructors for backward compatibility.

Please let me know what you think about this.

Author

I would be happy to remove, for example, the one with the default storage level:

public SparkStreamingPulsarReceiver(
            String serviceUrl,
            Map<String,Object> clientConfig,
            ConsumerConfigurationData<byte[]> consumerConfig,
            Authentication authentication) {
        this(StorageLevel.MEMORY_AND_DISK_2(), serviceUrl, clientConfig, consumerConfig, authentication);
    }

That way we might have broader functionality (we would just require the user to set the storage level whenever a client configuration is provided).
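
For illustration, under that proposal a call site would always pass the storage level explicitly; the values below are placeholders rather than code from this PR:

// assuming clientConfig, consumerConfig and authentication are already defined
SparkStreamingPulsarReceiver receiver = new SparkStreamingPulsarReceiver(
        StorageLevel.MEMORY_AND_DISK_2(),
        "pulsar+ssl://broker.example.com:6651",
        clientConfig,      // Map<String, Object> of Pulsar client settings
        consumerConfig,    // ConsumerConfigurationData<byte[]>
        authentication);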

Author

@sijie What do you think about this proposal? Shall I change the code like this?

@atezs82 atezs82 force-pushed the pulsar_spark_add_client_config branch from 0aea629 to d0e959d on June 24, 2021 09:09