After reading this tutorial, you will be able to run Python files and Jupyter notebooks that execute Apache Spark code in your local environment. It applies to macOS and Linux systems, and assumes you are already familiar with Python and a console environment.
1. Download Apache Spark
We will download the latest version available at the time of writing, 3.0.1, from the official website.
Download it and extract it on your computer. The path I'll be using for this tutorial is /Users/myuser/bigdata/spark
This folder will contain all the Spark files.
Now, I will edit the .bashrc
file, located in your user's home directory.
Then we will update our environment variables so we can execute Spark programs and our Python environment can locate the Spark libraries.
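The exact lines were not preserved here, but a minimal sketch of what the .bashrc additions typically look like is below, assuming Spark was extracted to /Users/myuser/bigdata/spark as above (adjust SPARK_HOME to your own path):

```shell
# Point SPARK_HOME at the extracted Spark folder (path is an example)
export SPARK_HOME=/Users/myuser/bigdata/spark
# Make the spark-shell, spark-submit and pyspark launchers available
export PATH="$SPARK_HOME/bin:$PATH"
# Let Python find the Spark libraries
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"
```

With PATH set this way, `spark-shell` can be launched from any directory.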
Save the file and load the changes by executing $ source ~/.bashrc
. If this worked, you will be able to open a Spark shell.
We are now done installing Spark.
2. Install Visual Studio Code
One of the good things about this IDE is that it allows us to run Jupyter notebooks within itself. Follow the set-up instructions and then install Python and the VS Code Python extension.
Then, open a new terminal and install the pyspark package via pip: $ pip install pyspark
. Note: depending on your installation, the command changes to pip3
.
3. Run your pyspark code
Create a new file or notebook in VS Code, and you should be able to execute the Pi example provided by the library itself and get some results.
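A minimal sketch of such a Pi estimation script, loosely based on the Monte Carlo example that ships with Spark (the function names, app name, and sample count here are my own choices, not the library's):

```python
import random

def in_circle(x, y):
    """True if the point (x, y) lies inside the unit circle."""
    return x * x + y * y <= 1.0

def estimate_pi(samples=1_000_000):
    # pyspark is imported lazily so the helper above works even without Spark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PiExample").getOrCreate()
    count = (
        spark.sparkContext.parallelize(range(samples))
        .map(lambda _: in_circle(random.random(), random.random()))
        .sum()
    )
    spark.stop()
    # The fraction of random points inside the quarter circle approximates pi / 4
    return 4.0 * count / samples
```

Calling estimate_pi() in a notebook cell should return a value close to 3.14; the exact figure varies from run to run.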
Troubleshoot
If you are on a distribution that installs python3 by default (e.g. Ubuntu 20.04), pyspark will most likely fail with an error message like env: 'python': No such file or directory
.
The first option to fix it is to add the following content to your .profile
or .bashrc
file.
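The original lines were not preserved here, but a common fix of this kind (an assumption on my part, using pyspark's standard PYSPARK_PYTHON variables) is to point Spark at the python3 binary explicitly:

```shell
# Assumed fix: tell pyspark which interpreter to use for workers and driver
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
```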
Remember to always reload the configuration via $ source ~/.bashrc
In this case, the solution worked when I executed pyspark from the command line, but not from VSCode's notebook. Since I am using a distribution based on Debian, installing the following package fixed it:
sudo apt-get install python-is-python3