Category: Software
Custom Python App on EMR Serverless
Environment
- Python 3.9.9
- EMR Serverless: 6.13
- TensorFlow: 2.11
Reference
- https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html
- https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html
- https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python.html
I had to jump through a few hoops to get a PySpark application running on EMR Serverless. Below are the steps I followed, along with the final working configuration; at the bottom of this post are a few errors I encountered along the way.
Steps
1. Setup Build Environment
For a packaged application to work, it must be built in an environment very similar to that of EMR Serverless; specifically, Amazon Linux 2. Plenty of mention is made online of using platform=linux/arm64 amazonlinux:2 to achieve this in Docker. I could not get this to work (when attempting to output the image, it would just hang forever), I suspect because I'm on OSX, so I ended up spinning up an EC2 instance based on the Amazon Linux 2 image to serve as my build environment.
2. Setup and Run Build Script
Mine was almost identical to the one found here, just with a few tweaks. Place your project requirements.txt in your build environment working directory and run:
sudo yum install -y gcc openssl-devel bzip2-devel libffi-devel tar gzip wget make xz-devel lzma
wget https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tgz && \
tar xzf Python-3.9.9.tgz && \
cd Python-3.9.9 && \
./configure --enable-optimizations && \
sudo make altinstall
sudo yum install -y python3-pip
# Create python venv with Python 3.9.9
python3.9 -m venv pyspark_venv_python_3.9.9 --copies

# Copy system python3 libraries to the venv
cp -r /usr/local/lib/python3.9/* ./pyspark_venv_python_3.9.9/lib/python3.9/

# Install venv-pack and the project dependencies into the venv
source pyspark_venv_python_3.9.9/bin/activate && \
pip install venv-pack && \
pip install -r requirements.txt

# Package the venv into an archive.
# **Note** that you have to supply the --python-prefix option
# to make sure python starts with the path (the "environment"
# directory) where your copied libraries are present.
sudo mkdir -p /home/hadoop/environment
source pyspark_venv_python_3.9.9/bin/activate && \
venv-pack -f -o pyspark_venv_python_3.9.9.tar.gz --python-prefix /home/hadoop/environment

# You'll need to reference this path/file in your EMR Serverless job config
aws s3 cp pyspark_venv_python_3.9.9.tar.gz s3://<path_to>/<project_artifacts>/
3. Align Python Lib with EMR Requirements
If you're smart, you started out with EMR-compatible library versions and worked backward from there. If, like me, you were handed a project where this was not the case, you'll likely have to back off dependency versions to make them compatible with EMR Serverless. A quick way to see exactly which versions ended up in your build venv is sketched below.
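This is a minimal sketch using the standard library's importlib.metadata to dump what is installed in the active venv, so you can compare it against the versions EMR Serverless supports:

from importlib import metadata

# Print name==version for every package installed in the active venv,
# to compare against the versions EMR Serverless supports
for dist in sorted(metadata.distributions(), key=lambda d: d.metadata["Name"].lower()):
    print(f"{dist.metadata['Name']}=={dist.version}")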
4. Zip and Upload Custom Python Modules
- From the directory containing your application code:
zip -r my_custom_modules.zip my_custom_modules/
aws s3 cp my_custom_modules.zip s3://<path_to>/<project_artifacts>/my_custom_modules.zip
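Once this zip is shipped to the job via spark.submit.pyFiles (configured in the next step), your entry point script can import from it like any locally installed package. A minimal entry point sketch, where my_custom_modules.some_module and build_dataframe are hypothetical placeholders:

from pyspark.sql import SparkSession

# Hypothetical import from the zipped package distributed via spark.submit.pyFiles
from my_custom_modules import some_module

if __name__ == "__main__":
    spark = SparkSession.builder.appName("my_emr_serverless_job").getOrCreate()
    # Placeholder call into the custom module; replace with your application logic
    df = some_module.build_dataframe(spark)
    df.show()
    spark.stop()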
5. Configure EMR Serverless
- In EMR Studio, create an application. I allowed it to create and use a default IAM role.
- Upload your entry point script to S3 and define it under ‘Script location’.
- Add any script arguments you need to pass (and your app is prepared to parse).
- I landed on the following Spark properties in order to get the job to run:
--conf spark.archives=s3://<path_to>/<project_artifacts>/pyspark_venv_python_3.9.9.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.submit.pyFiles=s3://<path_to>/<project_artifacts>/my_custom_modules.zip
--conf spark.files=s3://<path_to>/<project_artifacts>/some_other_file.yml,s3://<path_to>/<project_artifacts>/second_other_file.yml
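If you prefer submitting the job outside the console, the same configuration can be expressed via boto3; a minimal sketch, in which the application ID, role ARN, entry point, and S3 paths are all placeholders:

import boto3

# All IDs, ARNs, and S3 paths below are placeholders
client = boto3.client("emr-serverless")

spark_submit_parameters = " ".join([
    "--conf spark.archives=s3://<path_to>/<project_artifacts>/pyspark_venv_python_3.9.9.tar.gz#environment",
    "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python",
    "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python",
    "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python",
    "--conf spark.submit.pyFiles=s3://<path_to>/<project_artifacts>/my_custom_modules.zip",
])

response = client.start_job_run(
    applicationId="<application_id>",
    executionRoleArn="<job_execution_role_arn>",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://<path_to>/<project_artifacts>/entry_point.py",
            "entryPointArguments": ["--my-arg", "my-value"],
            "sparkSubmitParameters": spark_submit_parameters,
        }
    },
)
print(response["jobRunId"])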
Errors
Encountered along the way.
Error | Resolution
---|---
Traceback (most recent call last): File "/home/hadoop/environment/lib/python3.9/site-packages/fastavro/read.py", line 2, in <module> from . import _read File "fastavro/_read.pyx", line 11, in init fastavro._read File "/home/hadoop/environment/lib/python3.9/lzma.py", line 27, in <module> from _lzma import * ModuleNotFoundError: No module named '_lzma' | In the build environment: sudo yum install lzma
ModuleNotFoundError: No module named '<my custom modules>' | From the base of your application code: (1) zip -r my_custom_modules.zip my_custom_modules/; (2) upload the zip file to an S3 bucket; (3) add to your job's Spark properties: --conf spark.submit.pyFiles=s3://<path_to>/my_custom_modules.zip
ImportError: urllib3 v2.0 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with... | Downgraded to urllib3==1.26.6
ImportError: cannot import name 'builder' from 'google.protobuf.internal' (/home/hadoop/environment/lib/python3.9/site-packages/google/protobuf/internal/__init__.py) | Fetch the latest version of protobuf's builder.py (e.g., from a fully updated installation) into your project's Python packages at Lib/site-packages/google/protobuf/internal. See here for details.
Docker build env: "copying <n> files…" never finishes. | I had to abandon a Dockerized Amazon Linux 2 build environment; I suspect it had something to do with my Apple silicon. I ended up spinning up a VM on AWS and using their Amazon Linux 2 image.
ICellRendererAngularComp + ICellRendererParams: DOMException: Failed to execute ‘removeChild’ on ‘Node’: The node to be removed is no longer a child of this node. Perhaps it was moved in a ‘blur’ event handler?
I recently encountered a collision between Angular and ag-grid, where two attempts were being made to remove the same child node each time a cell renderer underwent a change: one attempt made by ag-grid, and a second, failing attempt made by an Angular change-detection event that attempted a re-render via DefaultDomRenderer2. Specifically, these competing events occurred when calling ICellRendererParams.setValue() from one of my ICellRendererAngularComp methods. Even more specifically, it only occurred for rows 12 and beyond of the grid; a change to rows 1-11 did not trigger the re-render by Angular.
To resolve this, I implemented ICellRendererAngularComp.refresh(), like so:
refresh(params: ICellRendererParams): boolean {
this.params = params;
return false;
}
to have ag-grid perform a refresh of the cell renderer instead of a destroy/re-init, and to rely on Angular alone to perform the removal of the child node. I then invoke the 'refresh' method immediately after calling ICellRendererParams.setValue().
This was not straightforward to track down.
Snowflake SDK Configuration: CertificateError
I had to wrestle with getting the Snowflake SDK working, kept encountering the error:
250001: Could not connect to Snowflake backend after 0 attempt(s).Aborting.
The underlying error was:
SSLError(CertificateError("hostname 'sk78217.us-east-2.snowflakecomputing.com' doesn't match either of '*.us-west-2.snowflakecomputing.com', '*.us-west-2.aws.snowflakecomputing.com', '*.global.snowflakecomputing.com', '*.snowflakecomputing.com', '*.prod1.us-west-2.aws.snowflakecomputing.com', '*.prod2.us-west-2.aws.snowflakecomputing.com'"))
I read much on the tubes about how the region needs to be set, how the account name should be configured, some places indicating the region and cloud provider should be passed with the account name, etc. If you haven't already, it helps to read:
https://docs.snowflake.com/en/user-guide/admin-account-identifier.html
But the following is what worked for me:
- In the SnowSQL config, set region to 'us-west-2' (I did this even though my account is in 'us-east-2').
- For the account name, pass '<org>-<account>'. The URL becomes '<org>-<account>.snowflakecomputing.com'.
So with my config file having:
region = us-west-2
username = <username>
password = <password>
dbname = demo
schemaname = public
warehousename = compute_wh
the following was able to work:
√ ~/.snowsql 18:22:53 % snowsql -a <org>-<account>
* SnowSQL * v1.2.23
Type SQL statements or !help
<username>#COMPUTE_WH@DEMO.PUBLIC>
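The same '<org>-<account>' identifier form works from Python as well; a minimal sketch using the snowflake-connector-python package, with all connection values as placeholders:

import snowflake.connector

# All values are placeholders; account uses the <org>-<account> identifier form
conn = snowflake.connector.connect(
    account="<org>-<account>",
    user="<username>",
    password="<password>",
    database="demo",
    schema="public",
    warehouse="compute_wh",
)

cur = conn.cursor()
cur.execute("SELECT CURRENT_VERSION()")
print(cur.fetchone())
conn.close()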
Pass HTTP Headers with Non Proxy Lambda Integration in AWS API Gateway
I set out to pass an HTTP header through API Gateway by mapping it in the method and integration request configurations (specifically using Serverless framework/template), based on various documentation I found online indicating I should do so. While troubleshooting, I at one point removed the mappings entirely and noticed that it *just worked*.
That is, with no configuration in the method or integration request mappings, the HTTP header of interest (in this case, Authorization) was passed through API Gateway to my Lambda and was accessible in the event object at event['headers']['Authorization']. I have seen no mention of this online, but perhaps it was quietly added by AWS at some point.
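For illustration, a minimal handler sketch reading the passed-through header (the handler and response shape are my own placeholders, not from the original setup):

import json

def lambda_handler(event, context):
    # With no header mappings configured, the Authorization header
    # still arrived in the event payload
    auth_header = event.get("headers", {}).get("Authorization")
    return {
        "statusCode": 200,
        "body": json.dumps({"authorization_present": auth_header is not None}),
    }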
Not sure if anyone else has run into this…
Default Argument Value Does Not Refresh Between Function Calls
Something struck me as unexpected today while working in Python. I had a function to take a datetime object and convert it into epoch milliseconds:
import datetime
import time

import pytz

this_tz = 'US/Eastern'

def get_epch_ms(dttm=datetime.datetime.now(pytz.timezone(this_tz))):
    # Returns milliseconds since epoch for the datetime object passed.
    # If no argument is passed, uses *now* as the time basis.
    # DOES NOT APPEAR TO REFRESH 'dttm' BETWEEN EXECUTIONS.
    return int(time.mktime(dttm.astimezone(pytz.timezone(this_tz)).timetuple()) * 1000.0
               + round(dttm.microsecond / 1000.0))
This function works fine: call it with get_epch_ms() and the epoch millisecond value for *now* is returned. However, I noticed during subsequent calls to the function within the same execution of the broader application that the value of dttm did not update each time. That is, the expression used to populate the default value, dttm=datetime.datetime.now(pytz.timezone(this_tz)), was evaluated only during the first call to the function, and that same value was used for subsequent calls. This turns out to be standard Python behavior: default argument expressions are evaluated once, when the def statement executes, not on each call. It took me a bit to track this down; it's not something I'd come up against before.
The fix is simple enough, though involved a couple of additional lines of code:
import datetime
import time

import pytz

this_tz = 'US/Eastern'

def get_epch_ms(dttm=None):
    # Returns milliseconds since epoch for the datetime object passed.
    # If no argument is passed, uses *now* as the time basis.
    # Refreshes 'dttm' between calls to this function.
    if dttm is None:
        dttm = datetime.datetime.now(pytz.timezone(this_tz))
    return int(time.mktime(dttm.astimezone(pytz.timezone(this_tz)).timetuple()) * 1000.0
               + round(dttm.microsecond / 1000.0))
The updated function provides a fresh timestamp at each invocation when called as get_epch_ms().
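A quick sanity check of the updated behavior (a sketch; the one-second sleep is arbitrary):

import time

t1 = get_epch_ms()
time.sleep(1)
t2 = get_epch_ms()
# With the original default-argument version, t1 == t2 here; now t2 > t1
assert t2 > t1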
Web Browser Cookies Between Sessions (IE, Firefox, Chrome)
Was looking into this for a client, and I’ve come to the following conclusion based on various reading across the ‘tubes:
How cookies are handled between browser instances varies between web browsers. Why do we care? Well, various web applications are going to get wonky if you try opening multiple instances of them when those instances share cookies. And by "wonky" I mean it's just not going to work. So isolating browser instances allows us to have multiple sessions of that web application open simultaneously.
Internet Explorer
IE7 does *not* share cookies if you start another instance of it (e.g., double-clicking on the icon when an instance is already open), but it will share them across tabs or if you use "New Window" to open a new window.
IE8 *does* share cookies between instances by default, but it can be made to not do this by either:
- Going to File > New Session
- Starting IE8 with "iexplore -nomerge" (custom shortcut)
Mozilla Firefox
It would appear Firefox shares cookies between tabs and windows if those windows are created under the same Firefox profile. If you’re like me (and using Firefox), you probably only have one Firefox profile setup for yourself. You can force Firefox to use a different profile by creating a custom shortcut that looks like this:
firefox.exe -no-remote -p "myProfile2"
where myProfile2 is the name of the profile you want Firefox to use. If the profile does not exist, Mozilla will bring up the profile management tool, which will let you create it. From then on you can open two instances of Firefox, running under two different profiles, which will *not* share cookies and, thus, will allow you to run two simultaneous sessions of your favorite web application (I know what mine is).
Chrome
Allegedly, Chrome shares cookies between instances unless you use its Incognito feature by clicking on the wrench and going to "New Incognito Window" (Ctrl-Shift-N).
PDF: Windows vs Linux File Size
I’ve recently switched to Linux (Ubuntu 8.10) as my main operating system. I find it’s a more effective workspace for most of my tasks. Check it out if you haven’t already; Linux really is growing up. I do keep Windows around for a couple tasks, mainly gaming, but Linux is closing the gap on that, too, through the latest implementations of Wine.
One thing I’ve noticed, though, that I haven’t been able to pin down a reason for, is that PDF file sizes in Linux seem high compared to those generated in Windows. I know, this is a somewhat generic statement given the fact that, Linux or Windows, the process is dependent on the software doing the compression. Yet there seems to be a consistent discrepancy between the two operating systems when it comes to PDF file sizes. Looking around online, my observations seem to be somewhat validated. A popular solution on forums is to use the DjVu compression scheme, but I’d prefer sticking with the fairly universal PDF file format. To its credit, DjVu seems to match or better PDF when it comes to black-and-white documents, but it falls behind in grayscale.
So I ran a little test, scanning the front page of my offer letter for my new job. It consists of a company logo at the top and a full page of text. It is somewhat indicative of what I archive. All scans were done in black-and-white or grayscale. Results (file size in bytes):
18474 150dpiLinuxDjVu-BW.djvu
241812 150dpiLinuxDjVu-Gray.djvu
55298 150dpiLinuxLZW-BW.pdf
813876 150dpiLinuxLZW-Gray.pdf
50213 150dpiWin-BW.pdf
29172 150dpiWinG4-BW.tif
34410 150dpiWinG4-Gray.tif
58947 150dpiWin-Gray.pdf
47280 150dpiWinLZW-BW.tif
1304736 150dpiWinLZW-Gray.tif
29229 300dpiLinuxDjVu-BW.djvu
688967 300dpiLinuxDjVu-Gray.djvu
113726 300dpiLinuxLZW-BW.pdf
2670089 300dpiLinuxLZW-Gray.pdf
81978 300dpiWin-BW.pdf
59188 300dpiWinG4-BW.tif
73842 300dpiWinG4-Gray.tif
114967 300dpiWin-Gray.pdf
5024631 300dpiWin-Gray-300dpiPDF.pdf
5024632 300dpiWin-Gray-600dpiPDF.pdf
5040863 300dpiWin-GrayThenPDF.pdf
8955576 300dpiWin-Gray.tif
132170 300dpiWinLZW-BW.tif
5577814 300dpiWinLZW-Gray.tif
759067 CNNLinux.pdf
237794 CNNWin600dpi.pdf
In order of size:
18474 150dpiLinuxDjVu-BW.djvu
29172 150dpiWinG4-BW.tif
29229 300dpiLinuxDjVu-BW.djvu
34410 150dpiWinG4-Gray.tif
47280 150dpiWinLZW-BW.tif
50213 150dpiWin-BW.pdf
55298 150dpiLinuxLZW-BW.pdf
58947 150dpiWin-Gray.pdf
59188 300dpiWinG4-BW.tif
73842 300dpiWinG4-Gray.tif
81978 300dpiWin-BW.pdf
113726 300dpiLinuxLZW-BW.pdf
114967 300dpiWin-Gray.pdf
132170 300dpiWinLZW-BW.tif
237794 CNNWin600dpi.pdf
241812 150dpiLinuxDjVu-Gray.djvu
688967 300dpiLinuxDjVu-Gray.djvu
759067 CNNLinux.pdf
813876 150dpiLinuxLZW-Gray.pdf
1304736 150dpiWinLZW-Gray.tif
2670089 300dpiLinuxLZW-Gray.pdf
5024631 300dpiWin-Gray-300dpiPDF.pdf
5024632 300dpiWin-Gray-600dpiPDF.pdf
5040863 300dpiWin-GrayThenPDF.pdf
5577814 300dpiWinLZW-Gray.tif
8955576 300dpiWin-Gray.tif
Make note of the file extensions; there are actually three different file types in those listings. The file names lead with resolution, with the exception of the two starting with "CNN." Those two were PDFs created by printing cnn.com's cover page to PDF in Linux and Windows (using PDF Creator). The cover page contained slightly different content in each case, but not enough to explain the file size difference. After the resolution in the file name comes the operating system, followed by the compression algorithm where applicable. Immediately after the hyphen is the grayscale/black-and-white indicator, and in those cases where there is a second hyphen, it indicates the file was post-processed with a PDF printer at the stated resolution.
For Windows, where a compression algorithm is not listed, I used the software included with my Canon LiDE 50 scanner, which saves directly to PDF. In Linux, I used the popular gscan2pdf GUI. Having OCR on or off did not seem to make much of a difference, as far as file size. For gscan2pdf, the file was also processed with Unpaper, which should optimize the file further (it also creates blockiness in the document’s whitespace that is undesirable to me, but it’s fine for archiving documents).
So there you go. The difference is significant. One would have to dig into the underpinnings of the software, I think, to expose the reason for this, but I'm definitely curious. Again, DjVu pulls close to and surpasses PDF when it comes to black-and-white scanning, but even it falls short when using grayscale (which happens to be my method of choice). I'll admit I don't relish the idea of booting into Windows simply to archive documents.
Windows Explorer Folder Shortcuts
I sometimes like to make shortcuts to various folders on my Windows machine. I am annoyed, though, that when executed, this shortcut brings up an explorer window without the folder tree on the left. I found the solution to this here:
In short, the command line in your shortcut should read
%SystemRoot%\EXPLORER.EXE /n,/e,d:\
where "d:\" should be replaced by the path to the folder.
Microsoft Office Documents Opening In Internet Explorer
Ever click on a hyperlink to an MS Office document and watch it open awkwardly in Internet Explorer, with all sorts of wonky results? I have. As kind as Microsoft is, they have a knowledge base entry addressing the problem:
http://support.microsoft.com/kb/q162059/
Hyperlink to Specific Page of PDF
I was recently posed with the question of whether or not you could hyperlink (yeah, I’m using the term as a verb) to a specific page of a PDF. In looking around, I am under the impression this can be done if the PDF resides on a web server by adding “#page=2” to the hyperlink (for Page 2, that is).
http://foo.com/file.pdf#page=2
nule mentioned this might be browser-specific, but I have not run into that.
The situation becomes more complicated if the hyperlink points to a PDF residing on a mapped network drive, etc. I have read of solutions involving VBA scripts for this particular case, though I did not delve into it (nor do I intend to).