Python
Python is an interpreted, general-purpose scripting language.[1]
Python 2 End of Life
Python 2 reached its end of life on January 1st, 2020.[2][3] The statement from Guido van Rossum is reproduced here:
Let's not play games with semantics. The way I see the situation for 2.7 is that EOL is January 1st, 2020, and there will be no updates, not even source-only security patches, after that date. Support (from the core devs, the PSF, and python.org) stops completely on that date. If you want support for 2.7 beyond that day you will have to pay a commercial vendor. Of course it's open source so people are also welcome to fork it. But the core devs have toiled long enough, and the 2020 EOL date (an extension from the originally announced 2015 EOL!) was announced with sufficient lead time and fanfare that I don't feel bad about stopping to support it at all.
Picotte
RHEL 8 comes with Python 3.6, providing the commands python and python3.
Recommendation
We recommend using one of the locally-compiled Pythons, e.g. python/gcc/3.10, and then creating virtual environments (venvs).[4][5] Venvs also have the advantage that they can easily be used by multiple people, e.g. all members of a research group. For an example, see Installing TensorFlow 2.11.0 using pip and venv.
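A minimal sketch, using the python/gcc/3.10 modulefile listed below (the venv path ~/venvs/myproject is hypothetical):
module load python/gcc/3.10
python3 -m venv ~/venvs/myproject
source ~/venvs/myproject/bin/activate
python3 -m pip install numpy
A venv intended for a whole research group should be created in a group-accessible directory rather than your home directory.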
We do not recommend Anaconda because it distributes its own copies of various libraries. Some of these libraries, specifically GPU-related CUDA libraries, may be incompatible with the drivers installed on Picotte's GPU nodes, resulting in either job crashes or even node crashes. Additionally, trying to share a single Anaconda installation between multiple people is error-prone.
RHEL 8 Python
In addition to Python 3.6, RHEL 8 also provides:
- Python 2.7.16, with the command python2
Bright-Provided Python
Bright Cluster Manager provides:
- Python 3.7, with the modulefiles python3 and python37 (these refer to the same module)
Locally-Installed Versions
There are several versions installed:
- Anaconda Python - modulefile python/anaconda3
- Python 3.8 - modulefile python/gcc/3.8.6
- Python 3.9 - modulefile python/gcc/3.9.1
- Python 3.10 - modulefile python/gcc/3.10
- Python 3.11 - modulefile python/gcc/3.11
- Intel Python 3.7.7 - modulefile python/intel/2020.4.912
Anaconda Python
To use Anaconda Python, it is not enough to load the modulefile. Additional setup needs to be done, which will modify your login script and provide Anaconda Python for all future login sessions. See: Anaconda#URCF-Installed Anaconda
Other Versions
For the others, just load the listed modulefile.
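For example, to use the locally-compiled Python 3.10 and confirm which interpreter you now have:
module load python/gcc/3.10
python3 --version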
Intel® Distribution for Python[6] is a distribution of Python including various packages which are optimized for performance on Intel processors.
Jupyter
See: Jupyter
QIIME
See: QIIME
Intel-Optimized Versions
Intel has Intel-optimized (using MKL, etc.) versions of Python 2.7, 3.5, and 3.6 available. There is no charge for the "community-supported" version. Download at:
https://registrationcenter.intel.com/en/forms/?productid=2810
Python Virtualenvs (venv)
You can make use of an existing version of Python on Picotte to install a virtualenv (venv) containing some set of packages/modules that you require. This can be shared with other members of your group. For an example, see: TensorFlow#Installing a Private TensorFlow for Your Research Group with venv
Conda Environments
Conda environments[16] are similar in concept to the virtual environments (venvs) described above. They allow easy switching between sets of Python modules that make up one "application", avoiding the possibility that one module's dependencies clash with another's.
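For example, to create and use a conda environment (the environment name "myenv" is hypothetical):
conda create -n myenv python=3.10
conda activate myenv
conda install numpy
conda deactivate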
On the GPU Nodes
The GPU nodes have an anaconda3 module which provides several different conda environments with different versions of Python. Use the modulefile:
python/anaconda3
This gives Python 3.6.
Some Python frameworks have requirements that clash with one another; these frameworks are separated into their own conda environments:
- caffe
- caffe2
- python27
- python37 -- this contains PyCUDA and PyOpenCL; see NVIDIA CUDA#PyCUDA
- pytorch
- tensorflow -- see Tensorflow
To activate an environment named "caffe":
conda activate caffe
Once you have completed working in that environment, you can deactivate it to return to the default Anaconda Python with its modules:
conda deactivate
Your own Anaconda installations
If you have your own Anaconda installation, you must set it up in your job scripts before running any conda commands.
In your job script, you must have the line:
export PATH=/my/path/anaconda3/bin:$PATH
before you have any "conda activate ..." commands, or even anything that uses the Python provided by your installation of Anaconda.
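A minimal job script fragment (the Anaconda path matches the placeholder above; "myenv" and "myscript.py" are hypothetical names):
#!/bin/bash
# ... scheduler directives go here ...
# Put your own Anaconda first on the PATH before any conda commands
export PATH=/my/path/anaconda3/bin:$PATH
conda activate myenv     # hypothetical environment name
python myscript.py       # hypothetical script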
Jupyter
Please see: JupyterLab
Installing Your Own Python
This should be your last resort.
The first two below are popular Python distributions. Both use Intel Math Kernel Library (MKL)[17] for accelerated performance on Intel CPUs (but not AMD). Both have their own Python module management system; both also have pip for installing modules not provided by their repositories. The third, Intel Distribution for Python, is an Anaconda-based distribution that also provides optimized libraries for Intel hardware.
- Continuum Analytics' Anaconda - This may break the modulefiles system on Proteus due to differing versions of the Tcl scripting language. Note also that the Anaconda distribution is large, and may exceed the 15 GB quota set on Proteus home directories. (Picotte home directories have a larger quota.)
- Enthought Canopy
- Intel Distribution for Python
You may also compile your own copy. See: Compiling Python
Notes about Anaconda
General
Anaconda assumes, by default, that it is installed for a single user. Modifications must be made to the user's login scripts (using conda init), which make that installation of Anaconda active on every login. This makes it awkward to use more than one Anaconda installation (switching between installations on demand).
Picotte
With the increased default storage allocation in home directories (64 GiB), you should be able to install Anaconda to your home directory rather than using Miniconda.
However, note that each conda environment you create will consume space, the amount depending on exactly what you install in each conda env.
Setup for Job Scripts
If you do install your own Python, you will likely have to modify your job script. Add the following line after the module load commands:
. ~/.bashrc
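For example (the module shown is just an illustration):
module load someapp/1.0    # your usual module load commands (hypothetical module)
. ~/.bashrc                # then source your login script to pick up your own Python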
Installing Your Own Python Modules
NOTE: before installing your own Python modules, check whether the modules you need are already installed in one of the system-wide Pythons. Some modules need modifications to the standard installation procedure, and a simple "pip install" will not result in a correct installation.
Without using the Software Collections, you can install Python modules for yourself using pip. E.g.
[juser@proteusi01 ~]$ python3 -m pip install --user scikit-learn
If you use one of the locally-compiled Pythons, e.g. you did "module load python/gcc/3.10", the command is still the same. All locally-compiled Pythons provide the pip command.
You may also want to investigate Python Virtual Environments (virtualenv).
If you get warnings like:
InsecurePlatformWarning: A true SSLContext object is not available.
just install the following:
[juser@proteusa01 ~]$ python3 -m pip install --user "requests[security]"
If the package you want is not listed on PyPI, you can do the following in the directory containing all the package files that you downloaded and expanded (it looks for "setup.py"):
[juser@proteusa01 somepkg]$ python3 -m pip install --user .
Python in the Cloud
Google Colaboratory
Google Colaboratory is a cloud-hosted Python environment based on Jupyter. Jupyter Notebooks are stored on your existing Google Drive. Python 3 is available (Python 2 is no longer supported). The hosted instances can run on GPUs.
- Google Colaboratory - Welcome Notebook
- Towards Data Science blog - Getting Started With Google Colab
Speeding Up Python
- Cython provides a relatively easy way of generating fast compiled modules for Python.
Accelerating Python with Numba
Numba is an open source Just-In-Time (JIT) compiler that translates a subset of Python and NumPy code into fast machine code.[18] It also allows use of GPUs.
For a brief introduction, see this YouTube video: Make Python code 1000x Faster with Numba
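A minimal sketch of Numba's JIT decorator (assumes the numba package is installed alongside NumPy):
import numpy as np
from numba import njit

@njit  # compile this function to machine code on first call
def sum_of_squares(arr):
    total = 0.0
    for x in arr:
        total += x * x
    return total

a = np.arange(1_000_000, dtype=np.float64)
print(sum_of_squares(a))  # the first call includes compilation time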
Parallel Execution
Python, in general, does not do parallelism across physical nodes. There are ways to do it, but your software must specifically state that it can, and it requires one or more Python packages which provide this facility, e.g. MPI for Python (mpi4py); a minimal sketch follows.
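A minimal MPI for Python sketch (assumes the mpi4py package is installed; the script is typically launched with mpirun):
from mpi4py import MPI

comm = MPI.COMM_WORLD   # communicator spanning all ranks in the run
rank = comm.Get_rank()  # this process's rank (0 .. size-1)
size = comm.Get_size()  # total number of MPI processes
print(f"Hello from rank {rank} of {size}")
Run with, e.g., "mpirun -np 4 python hello_mpi.py" (the file name is hypothetical).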
If the software you use does not specifically state that it can do parallelism across physical nodes, it is at best multithreaded. Please consult your software documentation for details.
For multithreaded code, use the shm PE. See Writing Job Scripts for details.
joblib.Parallel
It is NOT RECOMMENDED to use joblib.Parallel. Instead, write a job array. See: Writing Job Scripts#Array Jobs
Or, use concurrent.futures.ProcessPoolExecutor[19] (to bypass the Global Interpreter Lock) or the ThreadPoolExecutor.
Python Multiprocessing
It is NOT RECOMMENDED to use multiprocessing. Instead, write a job array: Writing Job Scripts#Array Jobs
Or, use concurrent.futures.ProcessPoolExecutor[20] (to bypass the Global Interpreter Lock) or the ThreadPoolExecutor.
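If you do use ProcessPoolExecutor within a single node, a minimal sketch that limits workers to the job's allocated cores:
import os
from concurrent.futures import ProcessPoolExecutor

def f(x):
    return x * x

if __name__ == "__main__":
    # Use only the CPU cores actually allocated to this job
    ncores = len(os.sched_getaffinity(0))
    with ProcessPoolExecutor(max_workers=ncores) as ex:
        print(list(ex.map(f, [1, 2, 3, 4, 5])))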
Python has a multiprocessing module (also available in Python 2.7), which provides process-based parallel execution.
However, multiprocessing ignores the cgroups system, which restricts jobs to using only the number of slots requested. By default, the multiprocessing module uses all available CPU cores on the node. This happens even if there are other jobs running on that node, leading to an overload condition where more threads/processes are running than there are CPU cores.
The workaround is to request "full" nodes, i.e. to request all cores on a node.
- If you want to run on AMDs with 64 cores:
#$ -pe shm 64
- If you want to run on 16-core Intels (the "ua=sandybridge" request is to avoid the 20-core Haswell nodes)
#$ -pe shm 16
#$ -l ua=sandybridge
- If you want to run on 20-core Intels (n.b. there are only 4 of these in Proteus, so the wait time may be much longer)
#$ -pe shm 20
#$ -l ua=haswell
In particular, the gensim models.ldamulticore object uses multiprocessing and suffers from this issue. See also: Gensim
Another workaround, which depends on your being able to control the use of the multiprocessing package, is to count the number of CPU cores actually available to your program:
import multiprocessing as mp
import os

def f(x):
    return x * x

if __name__ == "__main__":
    # os.sched_getaffinity(0) returns only the cores allocated to this job
    ncores = len(os.sched_getaffinity(0))
    with mp.Pool(ncores) as p:
        print(p.map(f, [1, 2, 3, 4, 5]))
References
[2] Python.org - Sunsetting Python 2
[3] Python 2 End-of-Life Statement by Guido van Rossum
[4] Python Documentation » The Python Standard Library » Software Packaging and Distribution » venv
[5] Real Python - Python Virtual Environments: A Primer
[6] Intel® Distribution for Python
[7] Numpy website
[8] Scipy website
[10] Pandas website
[11] BioPython website
[12] SciKit-bio website
[13] SciKit-Learn website
[14] What is Anaconda?
[16] Conda User Guide: Managing Environments
[17] Intel MKL product website
[18] Numba website
[19] The Python Standard Library - concurrent.futures - ProcessPoolExecutor, https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor
[20] The Python Standard Library - concurrent.futures - ProcessPoolExecutor, https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor