Custom Python App on EMR Serverless


  • Python 3.9.9
  • EMR Serverless: 6.13
  • TensorFlow: 2.11



I had to jump through a few hoops to get a PySpark application running on EMR Serverless. Below are the steps I followed, along with the final functioning configuration; at the bottom of this post are a few errors I encountered along the way.


1. Setup Build Environment

For a packaged application to work, it must be built in an environment very similar to that of EMR Serverless; specifically, Amazon Linux 2. Plenty of mention is made online about using platform=linux/arm64 amazonlinux:2 to achieve this in Docker. I could not get this to work: when attempting to output the image, it would just hang forever. I suspect this is because I’m on OSX, so I ended up spinning up an EC2 instance for my build environment, based on the Amazon Linux 2 image.

2. Setup and Run Build Script

Mine was almost identical to the one found here, just with a few tweaks. Place your project requirements.txt in your build environment working directory and:

sudo yum install -y gcc openssl-devel bzip2-devel libffi-devel tar gzip wget make xz-devel lzma

wget && \
    tar xzf Python-3.9.9.tgz && \
    cd Python-3.9.9  && \
    ./configure --enable-optimizations && \
    sudo make altinstall

sudo yum install -y python3-pip

# Create python venv with Python 3.9.9
python3.9 -m venv pyspark_venv_python_3.9.9 --copies

# copy system python3 libraries to venv
cp -r /usr/local/lib/python3.9/* ./pyspark_venv_python_3.9.9/lib/python3.9/

# install venv-pack and the project requirements into the venv
source pyspark_venv_python_3.9.9/bin/activate && \
    pip install venv-pack && \
    pip install -r requirements.txt

# package venv to archive.
# **Note** that you have to supply --python-prefix option
# to make sure python starts with the path where your
# copied libraries are present.
# Copying the python binary to the "environment" directory.
sudo mkdir -p /home/hadoop/environment
source pyspark_venv_python_3.9.9/bin/activate &&  \
    venv-pack -f -o pyspark_venv_python_3.9.9.tar.gz --python-prefix /home/hadoop/environment

# You'll need to reference this path/file in your EMR Serverless job config
aws s3 cp pyspark_venv_python_3.9.9.tar.gz s3://<path_to>/<project_artifacts>/

3. Align Python Lib with EMR Requirements

If you’re smart, you started out with EMR-capable lib versions and worked backward from there. If, like me, you were handed a project where this was not the case, you’ll likely have to back off dependency versions to make them compatible with EMR Serverless.

4. Zip and Upload Custom Python Modules

  • From directory containing your application code: zip -r my_custom_modules my_custom_modules/
  • aws s3 cp s3://<path_to>/<project_artifacts>/
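
For anyone scripting this step instead of running zip by hand, the same archive can be produced from Python. This is a minimal sketch, assuming the package directory is called my_custom_modules as in the example above; the function name zip_modules is hypothetical, and the upload to S3 is still left to aws s3 cp.

```python
import shutil
from pathlib import Path


def zip_modules(app_root: str, package_dir: str = "my_custom_modules") -> str:
    """Zip <app_root>/<package_dir>, keeping the package folder as the
    top-level entry inside the archive (the layout `zip -r` produces)."""
    root = Path(app_root).resolve()
    if not (root / package_dir).is_dir():
        raise FileNotFoundError(f"{package_dir} not found under {root}")
    # base_name is the archive path without ".zip"; base_dir keeps the
    # package folder itself in the archive rather than only its contents.
    return shutil.make_archive(
        base_name=str(root / package_dir),
        format="zip",
        root_dir=str(root),
        base_dir=package_dir,
    )
```

Calling zip_modules(".") from the directory containing your application code returns the path of the resulting my_custom_modules.zip.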

5. Configure EMR Serverless

  • In EMR Studio, create an application. I allowed it to create and use a default IAM role.
  • Upload your entry point script to S3 and define it under ‘Script location’.
  • Add any script arguments you need to pass (and your app is prepared to parse).
  • I landed on the following Spark properties in order to get the job to run:
    --conf spark.archives=s3://<path_to>/<project_artifacts>/pyspark_venv_python_3.9.9.tar.gz#environment
    --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
    --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
    --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
    --conf spark.submit.pyFiles=s3://<path_to>/<project_artifacts>/
    --conf spark.files=s3://<path_to>/<project_artifacts>/some_other_file.yml,s3://<path_to>/<project_artifacts>/second_other_file.yml
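
If you submit the job programmatically rather than through the EMR Studio form, these properties end up in a single spark-submit parameter string. The sketch below assembles that string from a dict. The helper name build_spark_submit_parameters and the example bucket path are placeholders of mine, not part of my actual setup. A property holding several S3 paths has to be comma-separated with no spaces, since a space would start a new argument.

```python
def build_spark_submit_parameters(conf: dict) -> str:
    """Render Spark properties as a '--conf key=value ...' string."""
    parts = []
    for key, value in conf.items():
        # Accept a list of paths and join it with commas, no spaces.
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        if " " in value:
            raise ValueError(f"{key} contains a space: {value!r}")
        parts.append(f"--conf {key}={value}")
    return " ".join(parts)


artifacts = "s3://example-bucket/project_artifacts"  # hypothetical location
params = build_spark_submit_parameters({
    "spark.archives": f"{artifacts}/pyspark_venv_python_3.9.9.tar.gz#environment",
    "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": "./environment/bin/python",
    "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": "./environment/bin/python",
    "spark.executorEnv.PYSPARK_PYTHON": "./environment/bin/python",
    "spark.files": [
        f"{artifacts}/some_other_file.yml",
        f"{artifacts}/second_other_file.yml",
    ],
})
print(params)
```

As far as I can tell, this is the string that boto3's emr-serverless start_job_run call takes as sparkSubmitParameters under jobDriver/sparkSubmit, but I only ran my job through the console.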


Errors encountered along the way.

Traceback (most recent call last):
  File "/home/hadoop/environment/lib/python3.9/site-packages/fastavro/", line 2, in <module>
    from . import _read
  File "fastavro/_read.pyx", line 11, in init fastavro._read
  File "/home/hadoop/environment/lib/python3.9/", line 27, in <module>
    from _lzma import *
ModuleNotFoundError: No module named '_lzma'

In build environment:
sudo yum install lzma

ModuleNotFoundError: No module named '<my custom modules>'

These steps, from the base of your application code:
  • zip -r my_custom_modules my_custom_modules/
  • upload zip file to s3 bucket
  • add to your job spark properties: --conf spark.submit.pyFiles=s3://<path_to>/

ImportError: urllib3 v2.0 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with...

Downgraded to urllib3=1.26.6

ImportError: cannot import name 'builder' from 'google.protobuf.internal' (/home/hadoop/environment/lib/python3.9/site-packages/google/protobuf/internal/

Copy the latest version (e.g., from a fully updated installation) of protobuf’s ‘’ to your project’s Python packages at Lib/site-packages/google/protobuf/internal. See here for details.

Docker build env: “copying <n> files…” Never finishes.

I had to abandon a Dockerized Amazon Linux 2 build environment; I suspect it had something to do with my Apple silicon. I ended up spinning up a VM on AWS and using their Amazon Linux 2.
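
Two of the errors above can be caught before packaging by running a quick check with the venv's interpreter in the build environment. A rough sketch: the function name preflight_problems is hypothetical, and the OpenSSL threshold simply mirrors the urllib3 error message quoted above.

```python
import importlib
import ssl


def preflight_problems() -> list:
    """Return human-readable problems found in the running interpreter."""
    problems = []
    # The fastavro traceback came from a Python built without lzma support.
    try:
        importlib.import_module("_lzma")
    except ImportError:
        problems.append("_lzma is missing: install lzma in the build environment")
    # urllib3 v2 refuses to import when ssl is compiled against OpenSSL < 1.1.1.
    if ssl.OPENSSL_VERSION_INFO[:3] < (1, 1, 1):
        problems.append(f"old OpenSSL ({ssl.OPENSSL_VERSION}): pin urllib3 below 2.0")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print("WARNING:", problem)
```

An empty list means neither of these two issues applies; it says nothing about the protobuf or custom-module errors.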


ICellRendererAngularComp + ICellRendererParams: DOMException: Failed to execute ‘removeChild’ on ‘Node’: The node to be removed is no longer a child of this node. Perhaps it was moved in a ‘blur’ event handler?

I recently encountered a collision between Angular and ag-grid, where two attempts were being made to remove the same child node each time a cell renderer underwent a change: one attempt by ag-grid, and a second, failing attempt by an Angular change detection event that tried to re-render via DefaultDomRenderer2. Specifically, these competing events occurred when calling ICellRendererParams.setValue() from one of my ICellRendererAngularComp methods. Even more specifically, it only occurred for rows 12 and beyond of the grid; a change to rows 1-11 did not trigger the re-render by Angular.

To resolve this, I implemented ICellRendererAngularComp.refresh(), like so:

  refresh(params: ICellRendererParams): boolean {
    this.params = params;
    return false;
  }

to have ag-grid perform a refresh of the cell renderer instead of a destroy/re-init, and rely on Angular alone to perform the removal of the child node. I then invoke the ‘refresh’ method immediately after calling ICellRendererParams.setValue().

This was not straightforward to track down.


Snowflake SDK Configuration: CertificateError

I had to wrestle with getting the Snowflake SDK working; I kept encountering the error:

250001: Could not connect to Snowflake backend after 0 attempt(s).Aborting.

The underlying error was:

SSLError(CertificateError("hostname '' doesn't match either of '*', '*', '*', '*', '*', '*'"))

I read much on the tubes about how region needs to be set, the account name configured, some places indicating region and cloud provider should be passed with account name, etc., etc. If you haven’t already, it helps to read:

But the following is what worked for me:

  • In the SnowSQL config file, set region to ‘us-west-2’ (I did this even though my account is in ‘us-east-2’)
  • For account name, pass ‘<org>-<account>’. The URL becomes ‘<org>-<account>’.

So with my config file having:

region = us-west-2
username = <username>
password = <password>
dbname = demo
schemaname = public
warehousename = compute_wh

the following was able to work:

√ ~/.snowsql 18:22:53 % snowsql -a <org>-<account>
* SnowSQL * v1.2.23
Type SQL statements or !help


Pass HTTP Headers with Non Proxy Lambda Integration in AWS API Gateway

I set out to pass an HTTP header through API Gateway by mapping it in the method and integration request configurations (specifically using Serverless framework/template), based on various documentation I found online indicating I should do so. While troubleshooting, I at one point removed the mappings entirely and noticed that it *just worked*.

I.e., with no configuration in the method or integration request mappings, the HTTP header of interest (in this case, Authorization) was passed through API Gateway to my Lambda and was accessible in the event object at event['headers']['Authorization']. I have seen no mention of this online, but perhaps it was quietly added by AWS at some point.
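
For reference, reading the header on the Lambda side looks roughly like the sketch below. It assumes the event carries a headers mapping, as it did in my case. The helper get_header and the case-insensitive lookup are my own additions (HTTP header names are case-insensitive, so a client may well send 'authorization').

```python
def get_header(event: dict, name: str, default=None):
    """Look up an HTTP header in the event, ignoring case."""
    headers = event.get("headers") or {}
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return default


def handler(event, context):
    # 'Authorization' arrived here without any explicit request mapping.
    token = get_header(event, "Authorization")
    return {"has_authorization": token is not None}
```

Looking the header up through a helper also keeps the handler from raising a KeyError when the header is absent.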

Not sure if anyone else has run into this…


Satisfying Email SPF Spam Checks (Emphasis: Gmail)

I ran into an issue where mail from my server/domain ended up in Gmail users’ spam folders and so endeavored to resolve it. I didn’t find anything in my online searching where someone was seeing exactly what I was seeing.

This link was helpful for general information around SPF:

For purposes of this documentation, configuration/values used:

Server IP :
Server hostname : (yes, different than domain)
SMTP HELO (exim4) name :

Personal address:
Gmail address:

As for the symptoms, most importantly, email was ending up in Gmail users’ spam folders. In viewing the email header, I could see the reason was a softfail:

Received: from ( [2001:61f5:41:82c::235]) by with ESMTPS id u26si4316688wrd.422.2017. for <> (version=TLS1_2 cipher=AES128-SHA bits=128/128); Thu, 26 Oct 2017 12:05:14 -0700 (PDT)
Authentication-Results: ; spf=softfail ( domain of transitioning does not designate 2001:61f5:41:82c::235 as permitted sender)

Now, I never did find out (neither from my hosting provider nor in my online searching) why Google prints what appears to be an IPv6 (hexadecimal) address in the log it adds to the header (generally, you should see the IPv4 address of the server instead), but I did get the gist that the host was not explicitly authorized to send mail on behalf of

I’m not going to go into detail about SPF usage (see link at beginning of post), but when sending email from a domain, an SPF DNS entry is required for that domain so that receiving servers can validate email by linking the sending machine to the domain (the SPF entry defines which machines can send mail on behalf of the domain). A single entry may contain multiple rules, separated by spaces, which are read sequentially until one is satisfied. If none are satisfied, the “all” rule at the end, with its qualifier prefix (e.g., “~all” or “-all”), directs the receiving server whether and how to fail the email message.
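
To make the left-to-right evaluation concrete, here is a deliberately simplified sketch in Python. It only understands ip4/ip6 rules and the trailing all rule (no a, mx, include, or DNS lookups), and the record and addresses are documentation-range examples, not my real values.

```python
import ipaddress

# Qualifier prefixes and the result each produces when its rule matches.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def evaluate_spf(record: str, sender_ip: str) -> str:
    """Walk the rules in order and return the result of the first match."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"  # not an SPF record
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"  # a rule without a prefix means pass
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        mechanism, _, value = term.partition(":")
        mechanism = mechanism.lower()
        if mechanism == "all":
            return QUALIFIERS[qualifier]
        if mechanism in ("ip4", "ip6") and value:
            network = ipaddress.ip_network(value, strict=False)
            if ip.version == network.version and ip in network:
                return QUALIFIERS[qualifier]
        # Any other rule type (a, mx, include, ...) is skipped in this sketch.
    return "neutral"  # nothing matched and there was no "all" rule


record = "v=spf1 ip4:192.0.2.10 ~all"
print(evaluate_spf(record, "192.0.2.10"))   # matched by the ip4 rule
print(evaluate_spf(record, "2001:db8::1"))  # no rule matches; falls to ~all
```

The second call shows that an ip4 rule can never match a sender that connects over IPv6, so such a message lands on whatever the all rule says.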

At the time I was encountering this issue, the SPF entry for my domain looked like: IN TXT "v=spf1 ip4: ~all"

My impression was that specifying the server IP address here (from which email will be sent) would satisfy ISP checks: all emails sent from would be caught and validated by the rule “ip4:,” and both (HELO name) and (host name) resolve to Evidently, this was not the case. I viewed the addition of “,” seen above, as extraneous but worth a shot (read: desperation). Still, no dice. Note: the ‘~’ part of ‘~all’ directs a server to softfail a message if no rules in the entry are satisfied, as opposed to ‘-all’, which directs a server to hardfail.

I initially suspected Google was performing a HELO/EHLO (hereon just “HELO”) check that was failing due to no SPF entry for Indeed, to satisfy servers which employ 100% HELO checks (or those scenarios where Mail from is empty in the message), a separate SPF entry is required for the HELO name itself (this is a best practice, though I’m not sure how often this check is employed). And so I added an appropriate DNS entry: IN TXT "v=spf1 a -all"

In other words, explicitly allow email from this HELO FQDN (via the “a” rule). This resulted in no change; I was still encountering the softfail.

Next I homed in on the fact that, in addition to reporting the HELO name, Google was printing my actual hostname in the email header:

Received: from ( [2001:61f5:41:82c::235]) by with ESMTPS id u26si4316688wrd.422.2017. for <> (version=TLS1_2 cipher=AES128-SHA bits=128/128); Thu, 26 Oct 2017 12:05:14 -0700 (PDT)

See that That’s the hostname of my server. More specifically, it’s the name associated (via rDNS) with IP Google must be performing a reverse DNS lookup to retrieve that hostname. I wondered if it was then performing an SPF check based on that name, so I added a discrete rule for it in my SPF entry: IN TXT "v=spf1 ip4: ~all"

This fixed it:

Received: from ( [2001:61f5:41:82c::235]) by with ESMTPS id p19si797419wrf.42.2017. for <> (version=TLS1_2 cipher=AES128-SHA bits=128/128); Wed, 01 Nov 2017 08:37:26 -0700 (PDT)
Authentication-Results: ; spf=pass ( domain of designates 2001:61f5:41:82c::235 as permitted sender)

Success! Still not sure what’s up with the IPv6 address, but success, nonetheless.

Upon further investigation (i.e., empirical testing), it appears that Google was not using a HELO check at all in this scenario, which I suppose isn’t to say it never does or never will.

So for Gmail, it looks like the most important thing is for that domain name resolving via rDNS (in my case, to be present as explicitly ‘allowed’ in the SPF entry instead of relying on the IPv4 rule alone. Whether this would be required if the hostname and HELO name are the same, I don’t know. And whether that IPv6 address getting returned instead of IPv4 has anything to do with it, also not sure, but I’m eyeing that warily and have reached out to my hosting provider.

In summary:
– To satisfy the spam checks of Google (and other ISPs performing a check via rDNS), add a rule for the server hostname (more specifically, the FQDN returned from rDNS of the server IP) as an explicitly allowable sender in the SPF record for any and all domains from which mail will be sent.
– Additional best practice (not sure how often, if ever, this is implemented by an email server): add a new SPF entry for the HELO name (e.g., “v=spf1 a -all”).


Limiting User to SFTP for Uploading Web Content

I required the following:

  • System user that could upload content to a directory in root web directory (default root: /var/www/html)
  • Limit user from interactive SSH
  • Limit user from other areas of OS

Specifically, I am working within the AWS distribution on a hosted EC2 instance.

I found posts online that accomplished part of what I needed. But my steps to achieving this were:

  1. Create the user. In my case, user webpub. This creates an entry in /etc/passwd as well as a home directory under /home: 
    sudo useradd webpub
  2. These next few steps I found here. Create a ‘jail’ directory to which we will constrain the user. I created it in /var.
    sudo mkdir /var/jail
  3. An important note is that the jail directory, and every directory above it in the path, must be owned by user root (and not be writable by group or others) in order for the ChrootDirectory declaration to work. If you get set up and notice that you are correctly authenticating but then the connection immediately drops, this could be your problem. Now create a sub-directory that will serve as the access point for the user to the web content:
    sudo mkdir /var/jail/www
  4. The directory created above can also be owned by root. Create a sub-directory under web content root that we will restrict this user to. In this case, the same name as the user:
    sudo mkdir /var/www/html/webpub
  5. The directory created above can also be owned by root. Now create the link between the jail and the content directory by binding the two:
    sudo mount -o bind /var/www/html/webpub /var/jail/www
  6. In /etc/passwd, update the user webpub’s home directory (where they will land upon logging in) to /var/jail/www.
  7. Update /etc/ssh/sshd_config to jail the user upon logging in. Start by commenting the line Subsystem sftp /usr/libexec/openssh/sftp-server and then adding configuration for the internal-sftp sub-system. When done, it will look like (commented line and all): 
    #Subsystem sftp /usr/libexec/openssh/sftp-server
    Subsystem sftp internal-sftp
    Match User webpub
            ChrootDirectory /var/jail
            ForceCommand internal-sftp
            X11Forwarding no
            AllowTcpForwarding no
  8. The ChrootDirectory jails the user while ForceCommand internal-sftp limits the user to logging in via SFTP only. Now restart openssh:
    sudo /etc/init.d/sshd restart
  9. In my setup, I have password authentication disabled, so the last step is to create a private/public key pair and install it client/server side. Remember that authorized_keys (and its parent directory .ssh) must reside in the home directory for webpub, which we set earlier as /var/jail/www. Since that directory is bound to /var/www/html/webpub, though, these artifacts reside in the latter directory.
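
Since the authenticate-then-drop symptom is easy to hit, a small check of the jail's ownership can save some time. The sketch below is a helper of my own (the name chroot_problems is hypothetical). It walks from the chroot directory up to / and reports any component that is not owned by root or is writable by group or others, which is my understanding of what sshd checks at session startup.

```python
import os
import stat


def chroot_problems(chroot_dir: str) -> list:
    """Check every path component from chroot_dir up to '/'."""
    problems = []
    path = os.path.abspath(chroot_dir)
    while True:
        info = os.stat(path)
        if info.st_uid != 0:
            problems.append(f"{path}: owned by uid {info.st_uid}, not root")
        if info.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            problems.append(f"{path}: writable by group or others")
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return problems


if __name__ == "__main__":
    jail = "/var/jail"  # the ChrootDirectory used in the steps above
    if os.path.isdir(jail):
        for problem in chroot_problems(jail):
            print(problem)
```

An empty result for /var/jail does not prove the rest of the setup is right, but a non-empty one is worth fixing before debugging anything else.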


Ubuntu Yielding Noisy Black/White Scans

I did a fresh install of Ubuntu 14.10 today with Cinnamon as a desktop and am pleased with the interface.

I noticed something when scanning some documents in lineart mode, though: the resulting images had a ton of noise, noise that I did not see in scans prior to my upgrade. After snooping around the various options in the gscan2pdf application, I stumbled upon this one which, when toggled, causes the noise to disappear: Disable dynamic lineart. After checking that box, my scans seem to be noise free.


Default Argument Value Does Not Refresh Between Function Calls

Something struck me as unexpected today while working in Python. I had a function to take a datetime object and convert it into epoch milliseconds:

import datetime
import time

import pytz

this_tz = 'US/Eastern'

def get_epch_ms(dttm=datetime.datetime.now()):
    # Returns milliseconds since epoch for datetime object passed.
    # If no argument is passed, uses *now* as time basis.

    return int(time.mktime(dttm.astimezone(pytz.timezone(this_tz)).timetuple()) * 1000.0 + round(dttm.microsecond / 1000.0))

This function works fine: call it with get_epch_ms() and the epoch millisecond value for *now* is returned. However, I noticed during subsequent calls to the function within the same execution of the broader application that the value of dttm did not update each time. I.e., the expression used to populate the default value, datetime.datetime.now(), was executed only once, and that same value was used for every subsequent call. This is because Python evaluates default argument values when the function is defined, not each time it is called. It took me a bit to track this down; not sure if it’s just something I’ve never come up against before.

The fix is simple enough, though it involved a couple of additional lines of code:

import datetime
import time

import pytz

this_tz = 'US/Eastern'

def get_epch_ms(dttm=None):
    # Returns milliseconds since epoch for datetime object passed.
    # If no argument is passed, uses *now* as time basis.
    # Refreshes 'dttm' between calls to this function.

    if dttm is None:
        dttm = datetime.datetime.now()

    return int(time.mktime(dttm.astimezone(pytz.timezone(this_tz)).timetuple()) * 1000.0 + round(dttm.microsecond / 1000.0))

The updated function properly provides an updated timestamp at each invocation, when called as get_epch_ms().
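
The behaviour is easy to reproduce without involving the clock. In the sketch below, a counter (a stand-in of mine, not part of the original code) shows that a default expression runs once, when the function is defined, while the None-sentinel version runs it on every call.

```python
import itertools

_counter = itertools.count(1)


def next_value():
    """Stand-in for a call such as 'get the current time'."""
    return next(_counter)


def frozen(value=next_value()):
    # The default was computed once, when 'def' executed.
    return value


def fresh(value=None):
    # The None sentinel defers the computation to call time.
    if value is None:
        value = next_value()
    return value


print(frozen(), frozen())  # the same value both times
print(fresh(), fresh())    # a new value on each call
```

The same pattern applies to any default that should be recomputed per call, such as timestamps or mutable containers.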


Right and Wrong, Politically Speaking

A friend recently advanced the notion that one of our political parties is more “right” than the other when it comes to economic policy. As an admitted layman in economics, I disagree:


Interesting that you’d specifically mention macroeconomic policy, as it may be considered particularly confounding as the subject of an exercise seeking to discern “right” from “wrong.” Approaches and proposals – along with underlying principles – vary between the two major political parties, sure, but to unequivocally deem one as altogether more economically sound or, dare I say it, *enlightened* than the other seems disingenuous.

From the 2008 economic stimulus to recent quantitative easing, I could line up for you an equivalent number of economics doctoral degrees and professional accolades on either of two polarized viewpoints. “The amount of the stimulus should be doubled.” “There should be no stimulus at all.” “QE is critical in loosening credit markets.” “QE encourages risky investment at exactly the wrong time.” No statement above is correct, none is incorrect; each has sound economic theory which can be (and has been) cited in its favor.

More to the point, if there were instilled in me a personal bias, I could line up for you a greater number of economics doctoral degrees and professional accolades on either side of two polarized viewpoints, the viewpoint of my choosing. This is convenient for my political agenda; I can leverage the sheer complexity and, really, nuance attached to (macro)economics to form in the shroud a convincing argument that serves my purpose. It is not crucial for my agenda that my argument be “right;” it is more important that it be polarizing, feigning a bright line where none exists.

Economics is fodder for this, as it can be so difficult to quantify. Compounding the matter is the fact that meaningful retrospection is tough because causality is so elusive. As for “right” and “wrong,” though, neither is neither. The “whole point” I originally mentioned (somewhat in passing, wasn’t it?) alludes to the fact that we are constructed (politically) so that powers (i.e., parties) – neither more correct than the other – gnash teeth and thump chests, fighting with equal conviction to accomplish their respective myopic visions and, in doing so, arrive at something in between. Neither party was meant to succeed entirely, nor would we want them to; even the staunchest partisan would find him or herself regretting the unilateral success of his or her own party.


Gaming System Builds (~$500 and ~$1000)

Recently, a couple of friends have tapped me (or did I volunteer?) to spec out parts for a new gaming rig. The first friend was looking in the $500-600 range in order to get his League of Legends on; the second wants to replace his aging PC before the WoW expansion drops in a week or two. I figured I would capture here what I came up with.

The $500 (oh, okay, “sub-$600”) gaming rig.

This did prove a little challenging. The price point is low enough that some serious consideration has to be given to where to cut corners and still outfit what can be considered a complete PC. Admittedly, I assembled this list a few months ago, so prices may have dropped and “best value” components shifted a bit since then (gotta love technology).

Motherboard: ASUS M5A97 R2.0 Socket AM3+ ATX ($90)

CPU: AMD FX-6300 Vishera 6-Core 3.5GHz (4.1GHz Turbo) Socket AM3+ ($110)

Video Card: EVGA 02G-P4-2742-KR GeForce GT 740 Superclocked 2GB 128-Bit DDR3 ($90)

Memory: CORSAIR Vengeance 8GB 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800) ($80)

Power Supply: CORSAIR CX series CX600 600W ATX12V v2.3 ($80)

Hard Drive: Seagate Barracuda ST1000DM003 1TB 7200 RPM 64MB Cache SATA ($55)

Optical Drive: Asus or Samsung ($20)

Case: Antec Three Hundred ($65)

Total cost: $590


Upgrades that could be made to the above:

Video Card: EVGA 03G-P4-2667-KR G-SYNC Support GeForce GTX 660 FTW Signature 2 3GB 192-bit GDDR5 (+$90)

CPU/Motherboard: Upgrade to Intel i5 (+$150)

New total cost: $830


Downgrades that could be made to above:

Motherboard: Asus to MSI (-$20)

PSU: 600w to 500w PSU (-$15)

Memory: 8GB to 4GB (-$40)

New total cost: $515


The $1k gaming rig.

A little more breathing room here, but (and it’s a big ‘but’) this particular friend has his sights set on a Core i7. There goes about a third of the budget.

He was also looking at a pre-built (some great values to be had here) system, the ASUS M32AD-US032S Desktop PC, selling for $969, which comes with the following specs:

Intel Core i7 4790 (3.6GHz)

Chipset: Intel H81


Windows 8.1 64-Bit

NVIDIA GeForce GT 740 4 GB

300W PSU

I sought to come up with a similarly priced alternative that might be more tuned to the discerning builder/gamer. A few notes driving my decision-making:

  • The box above is put together by Asus, and Asus knows what it’s doing. I’m fairly certain it’s going to run your games just fine, and right out of the box, no less. That said…
  • 300w struck me as borderline. Again, I’m sure the PC is going to run fine, but how about a little overhead for those future upgrades?
  • I couldn’t find much information on the specific components actually used…I’m going to go ahead and venture they’ll be mainly Asus, but who knows. When *I* build a system, though, I _do_ know.
  • There is some real value here to those who need a Windows license, which is running north of $100 a pop right now. I disregard such a license in my builds, but if you need one, that’s $100 right off the bat.
  • Input devices. It’s not much of a consideration for me – I like to latch onto my own – but the Asus prebuilt comes with keyboard and mouse.

My answer to the Asus pre-built:

CPU: Intel Core i7-4790 Haswell Quad-Core 3.6GHz LGA 1150 ($310)

Video Card: EVGA 03G-P4-2667-KR G-SYNC Support GeForce GTX 660 FTW Signature 2 3GB 192-bit GDDR5 ($180)

Motherboard: ASUS Z97-A LGA 1150 Intel Z97 ($150)

Memory: CORSAIR Vengeance 8GB 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800) ($80)

Power Supply: CORSAIR CX series CX600 600W ATX12V v2.3 ($80)

Hard Drive: Seagate Hybrid Drive ST1000DX001 1TB MLC/8GB 64MB Cache SATA ($80)

Case: Antec Nine Hundred ($95) or Antec Three Hundred ($65)

Total cost: $975

The above gets you into features offered by the Z97 chipset that the H81 does not have. I also give the video card a pretty serious bump. I cut corners with memory, going from 16GB to 8GB. Some people will scream about this, but 8GB is going to be fine right now and RAM is a straightforward upgrade down the line. I spec a robust power supply with room to grow, and an accompanying big, cool, quiet Antec Nine Hundred. I sacrificed some storage in exchange for the speed benefits of Hybrid. If you’re hoarding media, that might prove unpalatable, but I might also recommend going out and getting a giant, slow(er) drive for such things (unless you’re doing a bunch of editing of said media, etc., in which case you’re peripheral to my target audience, anyway).


As usual, I’m amazed at the caliber of hardware that can be gotten for a reasonable price. I put a rig together about four years ago, in that ~$1k range, and it still goes strong with WoW (the only game I still really play, on occasion) cranked. The more demanding games, running at higher resolutions (1080p widescreen, etc.) than what I’m running would make it sweat, I’m sure, but my general point is that you can reasonably expect to get some quality time from a system in this range. Even the “sub-$600” system offers some upgrade paths that will keep you chugging for a bit.
