If you’re anything like me, you’ve been stuck using Python 2 for the last 10 years, and for 8 of them you’ve been trying to switch to 3. Since the 1.4 release, Spark (and PySpark) has supported Python 3 - fantastic! But then appears the omen of doom, like Death cackling at your final moments:
Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
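The root cause: Python 3 randomizes string hashes per interpreter process by default, so two Spark workers can compute different hashes for the same key unless every process shares the same seed. A quick local demonstration of the behavior, no Spark required (the helper name here is my own, not part of any API):

```python
import os
import subprocess
import sys

def string_hash(seed=None):
    """Run a fresh interpreter and return hash("spark") from that process."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", 'print(hash("spark"))'],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

# With hash randomization (the Python 3 default), each fresh interpreter
# process can report a different hash for the same string.
print(string_hash(), string_hash())

# With PYTHONHASHSEED pinned, every process agrees -- which is exactly
# what Spark needs when hashing keys across executors.
print(string_hash(0), string_hash(0))
```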
Try in vain you may to set PYTHONHASHSEED: export it on the master? Push it to every slave with pssh -h slaves 'export PYTHONHASHSEED=0'?
None will work, and Spark will chuckle silently as it drinks the tears slowly rolling down your cheek as they slip through the cracks of your keyboard. But do not lose hope, valiant warrior! With this simple incantation you may
join() to your heart's content!
The solution is simple: after launching the cluster, execute the following on your master:
# Set PYTHONHASHSEED locally
echo "export PYTHONHASHSEED=0" >> /root/.bashrc
source /root/.bashrc

# Set PYTHONHASHSEED on all slaves
pssh -h /root/spark-ec2/slaves 'echo "export PYTHONHASHSEED=0" >> /root/.bashrc'

# Restart all slaves
sh /root/spark/sbin/stop-slaves.sh
sh /root/spark/sbin/start-slaves.sh
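If you want to sanity-check the append-and-source pattern before touching your cluster, here is the same idea against a throwaway file instead of /root/.bashrc (the temp file is a stand-in for illustration only):

```shell
# Append the export to a file, source it, and confirm a child
# Python process actually inherits the variable.
rc=$(mktemp)
echo "export PYTHONHASHSEED=0" >> "$rc"
. "$rc"
seed=$(python3 -c 'import os; print(os.environ["PYTHONHASHSEED"])')
echo "PYTHONHASHSEED is $seed"
rm -f "$rc"
```

A restart of the slaves is still required on the real cluster, because already-running worker JVMs spawned their Python daemons with the old environment.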
By Stuart Axelbrooke, who does data science and text analytics. You should follow him on Twitter.