Thought Vector Blog

Python 3 on Spark - Return of the PYTHONHASHSEED

Posted September 9, 2015

If you’re anything like me, you’ve been stuck using Python 2 for the last 10 years, and for 8 of them you’ve been trying to switch to 3. As of the 1.4 release, Spark and PySpark support Python 3 - fantastic! But then appears the omen of doom, like Death cackling at your final moments:

Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED

Try in vain you may to set PYTHONHASHSEED: export it on the master? Set it in spark-env.sh? Possibly pssh -h slaves 'export PYTHONHASHSEED=0'?

None will work, and Spark will chuckle silently as it drinks the tears that roll down your cheek and slip through the cracks of your keyboard. But do not lose hope, valiant warrior! With this simple incantation you may distinct(), reduceByKey(), and join() to your heart’s content!

The solution is simple: after launching the cluster, execute the following on your master (the paths below follow the spark-ec2 layout, so adjust them if your cluster differs):

# Set PYTHONHASHSEED locally
echo "export PYTHONHASHSEED=0" >> /root/.bashrc
source /root/.bashrc

# Set PYTHONHASHSEED on all slaves
pssh -h /root/spark-ec2/slaves 'echo "export PYTHONHASHSEED=0" >> /root/.bashrc'

# Restart all slaves
sh /root/spark/sbin/stop-slaves.sh
sh /root/spark/sbin/start-slaves.sh
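
Why does a fixed seed matter? Python 3.3 and later randomize string hashes per interpreter process by default, so two workers can disagree about hash("spark") and send the same key to different partitions. Here is a minimal sketch of that behaviour using plain Python subprocesses rather than a Spark cluster. The helper names hash_in_fresh_interpreter and partition_for are made up for illustration and are not part of PySpark, and the modulo step is only a stand-in for hash partitioning, not Spark's actual code.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, seed=None):
    """Return hash(text) as computed by a newly started Python process.

    seed=None leaves PYTHONHASHSEED unset (randomized hashing);
    any other value is exported to the child as PYTHONHASHSEED.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from a clean slate
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


def partition_for(text, num_partitions, seed=None):
    """Hypothetical stand-in for hash partitioning: hash modulo partition count."""
    return hash_in_fresh_interpreter(text, seed) % num_partitions


if __name__ == "__main__":
    # Each call below starts a separate interpreter, like separate workers.
    print("unseeded:", [hash_in_fresh_interpreter("spark") for _ in range(3)])
    print("seed 0:  ", [hash_in_fresh_interpreter("spark", seed=0) for _ in range(3)])
```

With the seed unset, the three values will almost certainly differ from one another; with PYTHONHASHSEED=0 they match, which is what every worker needs to agree on before it can shuffle keys.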


By Stuart Axelbrooke, who does data science and text analytics.