Fastest way to download from WRDS

Two typical ways to download data from WRDS are (1) Web and (2) PostgresSQL. To download from Web, one simply logins the WRDS site and clicks the mouses. To download from the PostgreSQL database, a little programming is needed. In R, one uses dbSendQuery(); In Python, one can use conn.get_table().

I don’t encourage Web downloading because:

  • It’s error-prone and not programable. Say you need to update your data every week. Each time you need to go through the mouse-clicking procedure again. You may forget to include one variable, you may choose the wrong date formate, or the browser simply crashes. Either way, you need to start over again.

  • It’s slow. Even if you have 1Gbps Internet connection, you may still be restricted to less than 5MB/s by WRDS.

Downloading from the PostgreSQL database is much more recommended. In WRDS, all data is stored into two copies: SAS and PostgreSQL. When you query from the Web, the default database being queried is SAS. But as a legacy from the 60s, SAS is too old to work smoothly with modern programming tools. WRDS knows that, so they created a second copy of all the data and stored it in a PostgreSQL database. By doing that, you can now use familiar languages such as R or Python to directly query from WRDS.

While bing flexible and easy to use, PostgreSQL has a pain point: It’s still too slow. Image how long it will take if you want to download the whole RavenPack database (~2.2T)!

Rsync is a command line-based tool for file syncing and transferring. When we use Rsync, it still try to download the SAS dataset (the .sas7bdat file, to be specific). But compared with directly query the Web, Rsync has the following merits.

  • It’s command line-based, meaning you can easily incorporate the download command line into your R or Python program.

  • It’s super fast! Rsync has a -z option for compressing. You may already find that many datasets have sparse columns: columns that most of the observations are none. In SAS (.sas7bdat), null and not-null values take the same space. That is, the number 10000000000 and a null value occupies the same number of bytes. By enabling the -z option, Rsync can drastically reduce the file size before it’s transferred, and hence significantly boost speed.

  • Rsync only downloads when files are changed. If a file is not modified, Rsync will not download it however many times you run the command. Even if a file is modified, only the portion being modified are transferred, which is call the “delta sync.”

But I don’t want to mislead you here. While Rsync is my favorite, it may not suit you because of the following reasons:

  • It requires a working knowledge of command line and SSH. Yes, you need to first SSHs to WRDS cloud before using Rsync. This may be a bit scary if you have no experience with the command line.
  • The downloaded data is in SAS. Personally, I keep a backend program running to automatically convert any .sas7bdat files into arrow:feather (or RDS if you use R).

As I mentioned above, we need to SSH to WRDS to use Rsync. Depending on whether you use password or SSH key as the authentication method, there’re two ways to connect. Either works.

Say we want to download the funda (Annual fundamentals) dataset from the Compustat database.

First, install the sshpass tool. On a linux machine, you can use sudo apt install sshpass. After successful installation, run the following command:

bash

sshpass -p '123456' rsync -hrztvP yuz@wrds-cloud.wharton.upenn.edu:/wrds/comp/sasdata/d_na/funda.sas7bdat ~/funda.sas7bdat

In the above command, user “yuz” connects to the WRDS Cloud using password “123456”. He downloads the dataset located at /wrds/comp/sasdata/d_na/funda.sas7bdat and saves it to the local path ~/funda.sas7bdat.

To check where the dataset is located on the WRDS Cloud, you can go to WRDS website:

WRDS site

You can also check using the web-based SAS Studio (https://wrds-cloud.wharton.upenn.edu/SASStudio/index?locale=en_US):

SAS Studio

The second method is un-officially documented. It doesn’t require you to install sshpass, but you need to get a copy of your SSH private key. The SSH key is located in your user folder (~/.ssh) when you login the WRDS Cloud. Download it and save it to, for example, ~/.ssh/id_rsa on your local machine.

Then run the following command

bash

rsync -hrztvP -e 'ssh -i ~/.ssh/id_rsa' yuzhu@wrds-cloud.wharton.upenn.edu:/wrds/comp/d_na/funda.sas7bdat

The ssh -i ~/.ssh/id_rsa part tells Rsync to read the SSH key from your local disk. By providing the key, you don’t need to type the password. There’re many other benefits to connect to WRDS Cloud using SSH key. But I can’t say too much here (legal issues concerned!).