Google Dataflow

We currently only support Python pipelines. Java support is coming soon.

This installation method builds on Apache Beam's documented mechanism for adding non-Python dependencies; see the official Apache Beam documentation on managing pipeline dependencies for details.

Copy the Dataflow setup file to the machine from which you start your Dataflow jobs, then replace the values of service_name and gprofiler_token (a sketch of the file's shape follows the list below):

  • Replace <token> with the token you got from the gProfiler Performance Studio site.

  • Replace <service name> with the service name you wish to use.
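
The actual gProfiler setup file is the source of truth for its contents; for orientation only, here is a minimal sketch of its shape, assuming the custom-commands pattern from the Apache Beam documentation referenced above. The variable values, class bodies, and the echo placeholder are illustrative, not the real installation logic:

# Sketch of a setup.py in the style of the gProfiler Dataflow setup file.
# The class names `build` and `ProfilerInstallationCommands` match the
# cmdclass entries used later in this guide; the values and the
# installation command itself are placeholders.
import subprocess

import setuptools
from distutils.command.build import build as _build

gprofiler_token = "<token>"        # replace with your gProfiler token
service_name = "<service name>"    # replace with your chosen service name


class build(_build):
    # Chain the profiler installation into the build phase, which Dataflow
    # runs on every worker when the job is started with --setup_file.
    sub_commands = _build.sub_commands + [("ProfilerInstallationCommands", None)]


class ProfilerInstallationCommands(setuptools.Command):
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        # Placeholder: the real setup file fetches and launches the
        # gProfiler agent using gprofiler_token and service_name.
        subprocess.check_call(
            ["echo", f"installing gProfiler for {service_name}"]
        )


setuptools.setup(
    name="gprofiler-dataflow-setup",
    version="0.0.1",
    cmdclass={
        "build": build,
        "ProfilerInstallationCommands": ProfilerInstallationCommands,
    },
)

Because the custom command is chained into the build step, it runs on each worker as part of package installation, before your pipeline code starts.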

Whenever you start a Dataflow job, add the --setup_file /path/to/setup.py flag pointing at your copy of setup.py (please note: the flag is --setup_file, not --setup-file). For example, here's a command that starts an example Apache Beam job with gProfiler:

python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
--region us-central1 \
--runner DataflowRunner \
--project my-project \
--temp_location gs://my-cloud-storage-bucket/temp/ \
--output gs://my-cloud-storage-bucket/output/ \
--setup_file /path/to/setup.py

If you are already using the --setup_file flag for your own setup, you can merge your setup file with the gProfiler one. Copy over all of the code in the gProfiler setup file except the setuptools.setup call, and add the following keyword argument to your setuptools.setup call:

cmdclass={
    "build": build,
    "ProfilerInstallationCommands": ProfilerInstallationCommands,
}

For example:

setuptools.setup(
    name="my_custom_package",
    version="1.5.3",
    author="MyCompany",
    cmdclass={
        "build": build,
        "ProfilerInstallationCommands": ProfilerInstallationCommands,
    },
)
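
Put together, the merged file might look like the following sketch. The gProfiler class bodies are stubbed out here for brevity; in your real file they should be the full classes copied from the gProfiler setup file:

# Sketch of a merged setup.py: your existing packaging metadata plus the
# command classes copied in from the gProfiler setup file.
import setuptools
from distutils.command.build import build as _build


class build(_build):
    # Copied from the gProfiler setup file.
    sub_commands = _build.sub_commands + [("ProfilerInstallationCommands", None)]


class ProfilerInstallationCommands(setuptools.Command):
    # Copied from the gProfiler setup file (body elided in this sketch).
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        pass


setuptools.setup(
    name="my_custom_package",
    version="1.5.3",
    author="MyCompany",
    cmdclass={
        "build": build,
        "ProfilerInstallationCommands": ProfilerInstallationCommands,
    },
)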
