Python 3 on Spark - Return of the PYTHONHASHSEED

If you’re anything like me, you’ve been stuck using Python 2 for the last 10 years, and for 8 of them you’ve been trying to switch to 3. As of Spark 1.4, PySpark supports Python 3 - fantastic! But then the omen of doom appears, like Death cackling at your final moments:

Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED

Try in vain you may to set PYTHONHASHSEED: export it on the master? Set it in spark-env.sh? Possibly pssh -h slaves 'export PYTHONHASHSEED=0'?

None of these will work, and Spark will chuckle silently as it drinks the tears rolling down your cheek and slipping through the cracks of your keyboard. But do not lose hope, valiant warrior! With this simple incantation you may distinct(), reduceByKey(), and join() to your heart’s content!
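For the curious, here is why Spark cares at all. Since Python 3.3, string hashes are randomized per interpreter process unless PYTHONHASHSEED pins them, so two workers can disagree about which partition a key belongs to. The following is a minimal sketch that runs without Spark; the helper name string_hash_in_child is made up for illustration and is not part of PySpark.

```python
import os
import subprocess
import sys


def string_hash_in_child(text, seed=None):
    """Return hash(text) as computed by a fresh Python interpreter.

    Hypothetical helper for illustration only.
    seed=None removes PYTHONHASHSEED from the child's environment;
    any other value is passed to the child as PYTHONHASHSEED.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out.decode().strip())


if __name__ == "__main__":
    # Unseeded: each interpreter picks its own random seed,
    # so the two numbers will almost certainly differ.
    print(string_hash_in_child("spark"), string_hash_in_child("spark"))
    # Seeded with 0: randomization is off, so every interpreter agrees.
    print(string_hash_in_child("spark", 0), string_hash_in_child("spark", 0))
```

Each call starts a separate interpreter, which stands in for a Python worker on a different machine. The variable has to be in the environment before the interpreter starts, and that is why it has to reach every worker process and not just the shell you happen to be typing in.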

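The solution is simple: put the export in /root/.bashrc on every machine, so that the worker daemons inherit it when they are restarted. After launching the cluster, execute the following on your master: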

# Set PYTHONHASHSEED locally
echo "export PYTHONHASHSEED=0" >> /root/.bashrc
source /root/.bashrc

# Set PYTHONHASHSEED on all slaves
pssh -h /root/spark-ec2/slaves 'echo "export PYTHONHASHSEED=0" >> /root/.bashrc'

# Restart all slaves
sh /root/spark/sbin/stop-slaves.sh
sh /root/spark/sbin/start-slaves.sh
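
If you want to confirm that the incantation took, one option is to ask the workers what they see. This is only a sketch: it assumes pyspark is importable on the driver, the app name, helper names, and partition counts are arbitrary choices of mine, and spreading the job over many small partitions makes it likely, but not guaranteed, that every slave runs at least one task.

```python
import os


def worker_seed(_):
    """Runs on an executor: report the PYTHONHASHSEED that worker process sees."""
    return os.environ.get("PYTHONHASHSEED")


def seeds_are_pinned(seeds, expected="0"):
    """True if at least one seed was reported and all of them equal `expected`."""
    seeds = list(seeds)
    return bool(seeds) and all(s == expected for s in seeds)


if __name__ == "__main__":
    # Assumption: run this on the master with pyspark on the Python path.
    from pyspark import SparkContext

    sc = SparkContext(appName="hashseed-check")
    # map() and collect() do not hash keys, so this runs even on a broken cluster.
    seeds = sc.parallelize(range(1000), 100).map(worker_seed).collect()
    if seeds_are_pinned(seeds):
        print("PYTHONHASHSEED is pinned on the workers that answered")
    else:
        print("NOT pinned, workers reported: %r" % set(seeds))
    sc.stop()
```

A None in the reported set means at least one worker process started without the variable, which usually points to a slave that was not restarted.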