Environment
- Python 3.9.9
- EMR Serverless: 6.13
- TensorFlow: 2.11
Reference
- https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html
- https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html
- https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python.html
I had to jump through a few hoops to get a PySpark application running on EMR Serverless. Below are the steps I followed, along with the final working configuration; at the bottom of this post are a few errors I encountered along the way.
Steps
1. Setup Build Environment
For a packaged application to work, it must be built in an environment very similar to that of EMR Serverless; specifically, Amazon Linux 2. Plenty of mention is made online about using platform=linux/arm64 amazonlinux:2
to achieve this in Docker. I could not get this to work (when attempting to output the image, it would just hang forever; I suspect because I’m on OSX), so I ended up spinning up an EC2 instance for my build environment, based on the Amazon Linux 2 image. A sketch of launching one follows.
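If you go the EC2 route, launching the build box with the AWS CLI looks roughly like the following. This is a sketch only; the AMI ID, key pair, security group, and instance type are placeholders, not the exact values I used.
# Placeholders throughout: substitute your own Amazon Linux 2 AMI ID, key pair, and security group
aws ec2 run-instances \
  --image-id <amazon_linux_2_ami_id> \
  --instance-type t3.medium \
  --key-name <your_key_pair> \
  --security-group-ids <sg_allowing_ssh_from_your_ip>
# Then SSH in as ec2-user and run the build script from step 2
ssh -i <your_key_pair>.pem ec2-user@<instance_public_dns>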
2. Setup and Run Build Script
Mine was almost identical to the one found here, just with a few tweaks (a sanity check on the resulting archive follows the script). Place your project’s requirements.txt
in your build environment’s working directory and run:
sudo yum install -y gcc openssl-devel bzip2-devel libffi-devel tar gzip wget make xz-devel lzma
wget https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tgz && \
tar xzf Python-3.9.9.tgz && \
cd Python-3.9.9 && \
./configure --enable-optimizations && \
sudo make altinstall
sudo yum install -y python3-pip
# Create python venv with Python 3.9.9
python3.9 -m venv pyspark_venv_python_3.9.9 --copies
# copy the freshly built Python 3.9.9 standard library into the venv
cp -r /usr/local/lib/python3.9/* ./pyspark_venv_python_3.9.9/lib/python3.9/
# Install venv-pack and the project dependencies into the venv
source pyspark_venv_python_3.9.9/bin/activate && \
pip install venv-pack && \
pip install -r requirements.txt
# Create the "environment" directory the python binary gets copied into
sudo mkdir -p /home/hadoop/environment
# Package the venv into an archive.
# **Note** that you have to supply the --python-prefix option
# to make sure python starts with the path where your
# copied libraries are present.
source pyspark_venv_python_3.9.9/bin/activate && \
venv-pack -f -o pyspark_venv_python_3.9.9.tar.gz --python-prefix /home/hadoop/environment
# You'll need to reference this path/file in your EMR Serverless job config
aws s3 cp pyspark_venv_python_3.9.9.tar.gz s3://<path_to>/<project_artifacts>/
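Before wiring the archive into your job config, it’s worth a quick sanity check that the relocated interpreter and your dependencies actually made it into the tarball. This is illustrative; swap urllib3 for any package in your requirements.txt.
# Confirm the relocated python binary is present
tar -tzf pyspark_venv_python_3.9.9.tar.gz | grep 'bin/python'
# Spot-check that a dependency from requirements.txt was packed
tar -tzf pyspark_venv_python_3.9.9.tar.gz | grep 'site-packages/urllib3' | head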
3. Align Python Lib with EMR Requirements
If you’re smart, you started out with EMR-compatible library versions and worked backward from there. If, like me, you were handed a project where this was not the case, you’ll likely have to back off dependency versions to make them compatible with EMR Serverless; an illustrative requirements.txt excerpt is below.
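The only pin I can vouch for from my own troubleshooting is urllib3 (see the Errors table at the bottom); the other entry here is a placeholder for whatever your project needs.
# requirements.txt (excerpt)
# urllib3 2.x requires OpenSSL 1.1.1+, which triggered the ImportError noted in the Errors table
urllib3==1.26.6
# <your_other_dependency>==<emr_compatible_version>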
4. Zip and Upload Custom Python Modules
- From the directory containing your application code (a quick check of the zip layout follows below):
zip -r my_custom_modules.zip my_custom_modules/
aws s3 cp my_custom_modules.zip s3://<path_to>/<project_artifacts>/my_custom_modules.zip
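For the import to resolve on the cluster, the my_custom_modules package (with its __init__.py files) needs to sit at the top level of the zip, not nested under an extra parent directory. A quick way to confirm:
# Top-level entries should read my_custom_modules/...
unzip -l my_custom_modules.zip | head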
5. Configure EMR Serverless
- In EMR Studio, create an application. I allowed it to create and use a default IAM role.
- Upload your entry point script to S3 and define it under ‘Script location’.
- Add any script arguments you need to pass (and your app is prepared to parse).
- I landed on the following Spark properties in order to get the job to run (an equivalent CLI submission is sketched after this list):
--conf spark.archives=s3://<path_to>/<project_artifacts>/pyspark_venv_python_3.9.9.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.submit.pyFiles=s3://<path_to>/<project_artifacts>/my_custom_modules.zip
--conf spark.files=s3://<path_to>/<project_artifacts>/some_other_file.yml,s3://<path_to>/<project_artifacts>/second_other_file.yml
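I configured all of this through EMR Studio, but for reference, submitting the same job from the AWS CLI looks roughly like this. It’s a sketch: the application ID, role ARN, entry point script name, and arguments are placeholders, and the Spark properties are the ones listed above (abbreviated here).
aws emr-serverless start-job-run \
  --application-id <application_id> \
  --execution-role-arn arn:aws:iam::<account_id>:role/<emr_serverless_job_role> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<path_to>/<project_artifacts>/<entry_point_script>.py",
      "entryPointArguments": ["<your_script_args>"],
      "sparkSubmitParameters": "--conf spark.archives=s3://<path_to>/<project_artifacts>/pyspark_venv_python_3.9.9.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.submit.pyFiles=s3://<path_to>/<project_artifacts>/my_custom_modules.zip"
    }
  }'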
Errors
Encountered along the way.
Error | Resolution |
---|---|
Traceback (most recent call last): File "/home/hadoop/environment/lib/python3.9/site-packages/fastavro/read.py", line 2, in <module> from . import _read File "fastavro/_read.pyx", line 11, in init fastavro._read File "/home/hadoop/environment/lib/python3.9/lzma.py", line 27, in <module> from _lzma import * ModuleNotFoundError: No module named '_lzma' | In the build environment: sudo yum install lzma |
ModuleNotFoundError: No module named '<my custom modules>' | These steps, from the base of your application code: * zip -r my_custom_modules.zip my_custom_modules/ * upload zip file to s3 bucket * add to your job spark properties: --conf spark.submit.pyFiles=s3://<path_to>/my_custom_modules.zip |
ImportError: urllib3 v2.0 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with... | Downgraded to urllib3==1.26.6 |
ImportError: cannot import name 'builder' from 'google.protobuf.internal' (/home/hadoop/environment/lib/python3.9/site-packages/google/protobuf/internal/__init__.py) | Fetch the latest version – e.g., from a fully-updated installation – of protobuf’s ‘builder.py’ and copy it into your project’s Python packages at Lib/site-packages/google/protobuf/internal. See here for details. |
Docker build env: “copying <n> files…” never finishes. | I had to abandon a Dockerized Amazon Linux 2 build environment; I suspect it had something to do with my Apple silicon. I ended up spinning up a VM on AWS using their Amazon Linux 2 image. |