Tag: aws
Custom Python App on EMR Serverless
Environment
- Python 3.9.9
- EMR Serverless: 6.13
- TensorFlow: 2.11
Reference
- https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html
- https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html
- https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python.html
I had to jump through a few hoops to get a PySpark application running on EMR Serverless. Below are the steps I followed, along with the final functioning configuration; at the bottom of this post are a few errors I encountered along the way.
Steps
1. Set Up the Build Environment
For a packaged application to work, it must be built in an environment very similar to that of EMR Serverless; specifically, Amazon Linux 2. Plenty of mention is made online about using --platform linux/arm64 with the amazonlinux:2 image to achieve this in Docker. I could not get this to work (when attempting to output the image it would just hang forever, which I suspect is because I'm on OSX), so I ended up spinning up an EC2 instance for my build environment, based on an Amazon Linux 2 image.
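For what it's worth, launching the build instance from the CLI is straightforward; in the sketch below the instance type, key pair, security group, and subnet are placeholders, and the SSM parameter resolves the current Amazon Linux 2 AMI for your region:
# Look up the latest Amazon Linux 2 AMI ID
AMI_ID=$(aws ssm get-parameters \
  --names /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2 \
  --query 'Parameters[0].Value' --output text)
# Launch the build instance
aws ec2 run-instances --image-id "$AMI_ID" --instance-type t3.large \
  --key-name <your_key_pair> --security-group-ids <sg_id> --subnet-id <subnet_id>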
2. Set Up and Run the Build Script
Mine was almost identical to the one found here, just with a few tweaks. Place your project's requirements.txt in your build environment's working directory and run:
sudo yum install -y gcc openssl-devel bzip2-devel libffi-devel tar gzip wget make xz-devel lzma
wget https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tgz && \
tar xzf Python-3.9.9.tgz && \
cd Python-3.9.9 && \
./configure --enable-optimizations && \
sudo make altinstall
sudo yum install -y python3-pip
# Create python venv with Python 3.9.9
python3.9 -m venv pyspark_venv_python_3.9.9 --copies
# copy system python3 libraries to venv
cp -r /usr/local/lib/python3.9/* ./pyspark_venv_python_3.9.9/lib/python3.9/
# package venv to archive.
# **Note** that you have to supply --python-prefix option
# to make sure python starts with the path where your
# copied libraries are present.
# Copying the python binary to the "environment" directory.
source pyspark_venv_python_3.9.9/bin/activate && \
pip install venv-pack && \
pip install -r requirements.txt
sudo mkdir -p /home/hadoop/environment
source pyspark_venv_python_3.9.9/bin/activate && \
venv-pack -f -o pyspark_venv_python_3.9.9.tar.gz --python-prefix /home/hadoop/environment
# You'll need to reference this path/file in your EMR Serverless job config
aws s3 cp pyspark_venv_python_3.9.9.tar.gz s3://<path_to>/<project_artifacts>/
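If you want to sanity-check the archive before pointing a job at it, listing its contents should show the relocated interpreter and your site-packages (archive name as above):
# Expect entries like bin/python and lib/python3.9/site-packages/...
tar tzf pyspark_venv_python_3.9.9.tar.gz | grep -E 'bin/python|site-packages/' | head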
3. Align Python Library Versions with EMR Requirements
If you're smart, you started out with EMR-compatible library versions and worked backward from there. If, like me, you were handed a project where this was not the case, you'll likely have to back off dependency versions to make them compatible with EMR Serverless.
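For illustration only (the TensorFlow pin comes from the environment above and the urllib3 pin from the errors table below; your other pins will differ), the kind of explicit pinning I mean in requirements.txt looks like:
# requirements.txt (illustrative pins)
tensorflow==2.11.*
urllib3==1.26.6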
4. Zip and Upload Custom Python Modules
- From the directory containing your application code:
zip -r my_custom_modules.zip my_custom_modules/
aws s3 cp my_custom_modules.zip s3://<path_to>/<project_artifacts>/my_custom_modules.zip
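The important detail is that the package directory itself sits at the top level of the archive, so imports resolve once Spark adds the zip to the Python path. A quick way to confirm:
# Expect paths like my_custom_modules/__init__.py, not an extra wrapping directory
unzip -l my_custom_modules.zip | head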
5. Configure EMR Serverless
- In EMR Studio, create an application. I allowed it to create and use a default IAM role.
- Upload your entry point script to S3 and define it under ‘Script location’.
- Add any script arguments you need to pass (and your app is prepared to parse).
- I landed on the following Spark properties in order to get the job to run (a CLI submission sketch follows the list):
--conf spark.archives=s3://<path_to>/<project_artifacts>/pyspark_venv_python_3.9.9.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.submit.pyFiles=s3://<path_to>/<project_artifacts>/my_custom_modules.zip
--conf spark.files=s3://<path_to>/<project_artifacts>/some_other_file.yml,s3://<path_to>/<project_artifacts>/second_other_file.yml
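If you prefer submitting from the CLI rather than EMR Studio, the same configuration maps onto aws emr-serverless start-job-run roughly as follows (the application ID, role ARN, entry point script, and arguments are placeholders; add the remaining --conf entries from the list above as needed):
aws emr-serverless start-job-run \
  --application-id <application_id> \
  --execution-role-arn arn:aws:iam::<account_id>:role/<job_execution_role> \
  --name my-pyspark-job \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<path_to>/<project_artifacts>/entry_point.py",
      "entryPointArguments": ["--some-arg", "some-value"],
      "sparkSubmitParameters": "--conf spark.archives=s3://<path_to>/<project_artifacts>/pyspark_venv_python_3.9.9.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.submit.pyFiles=s3://<path_to>/<project_artifacts>/my_custom_modules.zip"
    }
  }'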
Errors
Encountered along the way.
Error | Resolution |
---|---|
Traceback (most recent call last): File "/home/hadoop/environment/lib/python3.9/site-packages/fastavro/read.py", line 2, in <module> from . import _read File "fastavro/_read.pyx", line 11, in init fastavro._read File "/home/hadoop/environment/lib/python3.9/lzma.py", line 27, in <module> from _lzma import * ModuleNotFoundError: No module named '_lzma' | In build environment: sudo yum install lzma |
ModuleNotFoundError: No module named '<my custom modules>' | These steps, from the base of your application code: * zip -r my_custom_modules.zip my_custom_modules/ * upload zip file to s3 bucket * add to your job spark properties: --conf spark.submit.pyFiles=s3://<path_to>/my_custom_modules.zip |
ImportError: urllib3 v2.0 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with... | Downgraded to urllib3=1.26.6 |
ImportError: cannot import name 'builder' from 'google.protobuf.internal' (/home/hadoop/environment/lib/python3.9/site-packages/google/protobuf/internal/__init__.py) | Fetch the latest version – e.g., from a fully-updated installation – of protobuf's 'builder.py' into your project's Python packages at Lib/site-packages/google/protobuf/internal. See here for details. |
Docker build env: “copying <n> files…” Never finishes. | I had to abandon a Dockerized Amazon Linux 2 build environment; I suspect it had something to do with my Apple silicon. I ended up spinning up a VM on AWS and using their Amazon Linux 2 image. |
Pass HTTP Headers with Non Proxy Lambda Integration in AWS API Gateway
I set out to pass an HTTP header through API Gateway by mapping it in the method and integration request configurations (specifically, using a Serverless Framework template), based on various documentation I found online indicating I should do so. While troubleshooting, I at one point removed the mappings entirely and noticed that it *just worked*.
I.e., with no configuration in the method or integration request mappings, the HTTP header of interest (in this case, Authorization) was passed through API Gateway to my Lambda and accessible in the event object at event['headers']['Authorization']. I have seen no mention of this online, but perhaps it was quietly added by AWS at some point.
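As a quick sanity check (the endpoint, stage, and token below are placeholders), you can send the header with curl and log event['headers'] in the Lambda to confirm it arrives untouched:
# The Lambda sees this value at event['headers']['Authorization']
curl -H "Authorization: Bearer <token>" \
  https://<api_id>.execute-api.<region>.amazonaws.com/<stage>/<resource>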
Not sure if anyone else has run into this…
Limiting User to SFTP for Uploading Web Content
I required the following:
- A system user that can upload content to a directory under the web root (default root: /var/www/html)
- Prevent the user from interactive SSH logins
- Restrict the user from other areas of the OS
Specifically, I am working within the AWS distribution on a hosted EC2 instance.
I found posts online that accomplished part of what I needed. The steps I took to achieve all of it were:
- Create the user. In my case, user webpub. This creates an entry in /etc/passwd as well as a home directory under /home:
sudo useradd webpub
- These next few steps I found here. Create a 'jail' directory to which we will constrain the user. I created it in /var.
sudo mkdir /var/jail
- An important note is that the jail directory, and every directory in the path leading to it, must be owned by root (and not writable by any other user) in order for the ChrootDirectory declaration to work. If you get set up and notice that you are correctly authenticating but then the connection immediately drops, this could be your problem. Now create a sub-directory that will serve as the access point for the user to the web content:
sudo mkdir /var/jail/www
- The directory created above can also be owned by root. Create a sub-directory under the web content root that we will restrict this user to; in this case, it has the same name as the user:
sudo mkdir /var/www/html/webpub
- The directory created above can also be owned by root. Now create the link between the jail and the content directory by binding the two:
sudo mount -o bind /var/www/html/webpub /var/jail/www
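- Note that a plain bind mount does not survive a reboot; if you need the mapping to persist, an /etc/fstab entry along these lines (appended once) should do it:
echo '/var/www/html/webpub /var/jail/www none bind 0 0' | sudo tee -a /etc/fstab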
- In /etc/passwd, update the user webpub's home directory (where they will land upon logging in) to /var/jail/www.
- Update /etc/ssh/sshd_config to jail the user upon logging in. Start by commenting out the line Subsystem sftp /usr/libexec/openssh/sftp-server and then adding configuration for the internal-sftp sub-system. When done, it will look like this (commented line and all):
#Subsystem sftp /usr/libexec/openssh/sftp-server
Subsystem sftp internal-sftp

Match User webpub
    ChrootDirectory /var/jail
    ForceCommand internal-sftp
    X11Forwarding no
    AllowTcpForwarding no
- The ChrootDirectory jails the user, while ForceCommand internal-sftp limits the user to logging in via SFTP only. Now restart openssh:
sudo /etc/init.d/sshd restart
- In my setup, I have password authentication disabled, so the last step is to create a private/public key pair and install it on the client and server sides (a minimal sketch follows below). Remember that authorized_keys (and its parent directory .ssh) must reside in the home directory for webpub, which we set earlier as /var/jail/www. Since that directory is bound to /var/www/html/webpub, though, these artifacts reside in the latter directory.
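A sketch of that key setup, assuming an ed25519 key and that the public key has been copied up to the server (filenames and the key comment are placeholders):
# On the client: generate the key pair
ssh-keygen -t ed25519 -f ~/.ssh/webpub_sftp -C "webpub sftp"
# On the server: install the public key under the bound content directory,
# owned by webpub with the usual restrictive permissions
sudo mkdir -p /var/www/html/webpub/.ssh
sudo cp webpub_sftp.pub /var/www/html/webpub/.ssh/authorized_keys
sudo chown -R webpub:webpub /var/www/html/webpub/.ssh
sudo chmod 700 /var/www/html/webpub/.ssh
sudo chmod 600 /var/www/html/webpub/.ssh/authorized_keys
You should then be able to connect with sftp -i ~/.ssh/webpub_sftp webpub@<host> and land in /var/jail/www.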