Running Python in RStudio using reticulate

Introduction

Since I first started using Python about 2 years ago, one of the things I found most frustrating was that no IDE matched the aesthetics and functionality of RStudio and RMarkdown. I’ve tried a bunch of them (Spyder, PyCharm, Jupyter Notebooks, etc.) and they just don’t do it for me. Plus, most of my Python workflows could be drastically improved by incorporating the likes of dplyr, stringr, and other tidyverse packages into my data processing.

I had heard about reticulate a bunch - the package that can magically run Python code in RStudio - but every time I tried to get it up and running I seemed to hit some kind of issue. A bunch of things kept tripping me up: for example, being unsure which calls needed to be run in Python itself vs. via reticulate (to install modules into your Python build, you do this via reticulate’s py_install() function - this took time to figure out and caused me headaches along the way!).

On my (I think) 5th attempt, I have now got a basic setup working, and am writing this post to give others the condensed version of how to do this smoothly. A couple of things to note:

I am running a MacBook Pro with an Apple M1 processor
I am running RStudio 2022.02.1 Build 461
I am running Python version 3.8.13
I am making this guide using RMarkdown, since it makes for the easiest transitions between R and Python (i.e., by using code chunks)

Setting up Python in RStudio

Step 1: Install reticulate

OK, so let’s get into it. The first thing you need is, of course, to install reticulate - the R package that makes the magic happen. Do so by running this code:

install.packages("reticulate")
library(reticulate)

Step 2: Install or locate Python

So far, so good. R now has a package installed that can help RStudio speak in Python. But before it can run Python code, it also needs to be told where Python is stored on the machine! There are two options here: either we can create a fresh install of Python, or we can point R to an existing directory on your machine where Python is installed. Personally, I prefer to work with an existing install of Python.

If you don’t have a Python install already on your machine, you can get it from here. If you do, then you need to tell reticulate where to find it. Do this by passing the path to your Python executable (not just the folder it lives in) to use_python(), in R:

use_python("add_python_directory_location_here")  # replace with the path to your Python executable

Once you’ve done this, you should be able to start actually running Python code in RStudio already! There are other things you can faff around with, like installing virtual environments with custom installs of Python, but to be honest as someone whose “native” coding language is R, this was the rabbit hole that confused me on previous tries with reticulate. It’s not necessary to use these virtual environments - they’re just nice to have if you want to make your project a little more modular.
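A quick sanity check at this point is to ask Python itself which install it is running from. The sketch below is my own addition rather than part of reticulate: it uses only the standard library, and the helper name interpreter_summary is hypothetical. Run it in a Python chunk and compare the result with the install you pointed use_python() at.

```python
import sys
import platform


def interpreter_summary():
    """Return where the running Python lives and which version it is.

    Hypothetical helper for illustration; not a reticulate function.
    """
    return {
        "prefix": sys.prefix,                  # root folder of the active install or environment
        "version": platform.python_version(),  # e.g. "3.8.13"
    }


summary = interpreter_summary()
print(summary["prefix"])
print(summary["version"])
```

If the folder or version printed here is not the one you expected, R is most likely still pointed at a different Python install.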

Step 2.5: Run (or fail to run) Python code

So, let’s run some Python code. To do this, you create a Markdown chunk just as you do with R. However, instead of writing “r” between the curly brackets of the chunk header (like ```{r}), you write ```{python}.

Let’s do this and try to run some very simple Python code:

# import numpy and pandas
import numpy as np 
import pandas as pd

# create a Numpy array and a pandas dataframe
array = np.random.binomial(10, .1, 100)
df = pd.DataFrame(array)

# print the array and dataframe
array, df

## (array([0, 0, 2, 1, 0, 0, 1, 1, 2, 2, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1,
##        1, 2, 2, 0, 1, 2, 0, 0, 1, 2, 0, 0, 3, 2, 0, 0, 1, 0, 0, 0, 0, 2,
##        0, 0, 0, 1, 3, 0, 2, 1, 1, 1, 2, 0, 0, 2, 0, 2, 2, 2, 1, 1, 3, 1,
##        0, 0, 1, 1, 2, 1, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 2, 2, 1, 0, 0, 0,
##        1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0]),     0
## 0   0
## 1   0
## 2   2
## 3   1
## 4   0
## .. ..
## 95  1
## 96  1
## 97  0
## 98  1
## 99  0
## 
## [100 rows x 1 columns])

Step 3: Celebrate! (Or fix errors)

Eureka! Simple as that, the code is working. Now, if you tried to run the exact code above, you will get an error if you don’t have the Python modules numpy or pandas installed (in the same way that you’d get an error if you tried to run a tidyverse function without tidyverse installed). In a typical Python setup, we would install these packages from the terminal using pip, Python’s package installer.

With reticulate, we don’t need to get bogged down in that. We can use calls in R to install packages directly to our Python environment. For example, if we want to install pandas or numpy, we can feed a vector of Python package names to the py_install() function from reticulate:

py_install(c("pandas", "numpy"))

If you didn’t have the packages installed before, you will now, and the code chunk above should work!
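If you would rather check what is missing before installing anything, a Python chunk can do that with the standard library alone. This is a minimal sketch under my own assumptions: the helper name missing_modules is hypothetical, and it only handles top-level module names such as "pandas" (not dotted names such as "pandas.io").

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found by this Python."""
    # find_spec returns None when no module of that name can be located
    return [name for name in names if find_spec(name) is None]


# An empty list means both modules are available to this Python
print(missing_modules(["pandas", "numpy"]))
```

Anything that shows up in the printed list is a candidate to pass to py_install() from R.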

Where do Python objects go?

One thing you may notice is that when you start running Python code, the objects you create are not saved in the R environment. Instead, they are saved to a separate environment (namely, the Python environment). You can switch between viewing the R and Python environments in RStudio’s “Environment” pane.

This means you can have a variable x in the R environment assigned to one value, and a variable with the same name x with a different value, datatype, etc. within the Python environment.
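You can also peek at the Python side from code. The sketch below is an illustration of mine, not a reticulate feature: the helper user_variables is hypothetical, and it assumes your Python chunks all share one top-level namespace, which is what globals() returns when called from a chunk. It simply filters out modules, functions, and underscore-prefixed names so that only data-like variables remain.

```python
import types


def user_variables(namespace):
    """Return the names in a namespace that look like user-created data."""
    return sorted(
        name
        for name, value in namespace.items()
        if not name.startswith("_")                     # skip private and dunder names
        and not isinstance(value, types.ModuleType)     # skip imported modules
        and not callable(value)                         # skip functions and classes
    )


x = "a Python string"  # this x is independent of any x defined in R
print(user_variables(globals()))
```

Whatever this prints lives only in Python; an x defined in R will not appear here, and reassigning x in Python leaves the R variable untouched.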

Accessing R objects in Python and Python objects in R

So, the question now becomes: how can we pass data from Python to R (and vice versa)? This is relatively simple to do, but the method differs if you’re using standard scripts vs. RMarkdown. I will show how to use the .Rmd approach. For the standard scripts approach, check out the reticulate documentation.

First, let’s create some variables in R and Python. R:

x <- 5
y <- 6 

z <- x * y

Python:

a = 10
b = 20

c = a + b

To pass data from Python to R, we need to tell R to look in the Python environment. We do this using the $ operator on the object py, which is the Python environment!

# get product of z and c

z_c_product <- z * py$c

py$c

## [1] 30

z_c_product

## [1] 900

To do the opposite (i.e., pass data from R to Python) we use the exact same logic, but in Python code. To refer to the R environment in Python, we use . on the r object (think of all features of the R environment as being attributes of that environment).

# get sum of z and c
z_c_sum = r.z + c

r.z, z_c_sum

## (30.0, 60.0)
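Notice that r.z came through as 30.0 rather than 30: R stores plain numbers as doubles, so they arrive in Python as floats. If you need a true integer on the Python side (for indexing, say), you can convert it yourself. The sketch below uses a hypothetical stand-in for reticulate’s r object so that it runs outside RStudio too; inside a Python chunk you would delete the stand-in line and use the real r.

```python
from types import SimpleNamespace

# Hypothetical stand-in for reticulate's `r` object, so the example runs anywhere.
# In an RMarkdown Python chunk, remove this line and use the real `r`.
r = SimpleNamespace(z=30.0)
c = 30

print(type(r.z).__name__)  # float

# Only convert when the float is a whole number, to avoid silently truncating
if r.z.is_integer():
    z_int = int(r.z)
else:
    raise ValueError("r.z is not a whole number")

z_c_sum_int = z_int + c
print(z_c_sum_int)  # 60
```

Alternatively, defining the value as an integer in R in the first place (for example with the L suffix, as in 30L) should hand Python an int directly.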

Conclusion

Voilà! You now have the basic building blocks needed for integrating your R and Python workflows. For data structures like vectors, matrices, arrays, and dataframes (assuming you’re importing pandas), you should be able to trivially move these between the two environments, manipulate them, and move them back again.

Data structures like tensors are less trivial to move around (particularly tensors created within Python’s TensorFlow). I’ll write another post in the future discussing optimal workflows for these once I’ve figured them out myself. But in the meantime, hopefully this post has given you one less excuse to avoid using Python!