Fastest way to download from WRDS
1 Why you may not want to download using Web or PostgreSQL
Two typical ways to download data from WRDS are (1) Web and (2) PostgresSQL. To download from Web, one simply logins the WRDS site and clicks the mouses. To download from the PostgreSQL database, a little programming is needed. In R, one uses dbSendQuery()
; In Python, one can use conn.get_table()
.
I don’t encourage Web downloading because:
It’s error-prone and not programable. Say you need to update your data every week. Each time you need to go through the mouse-clicking procedure again. You may forget to include one variable, you may choose the wrong date formate, or the browser simply crashes. Either way, you need to start over again.
It’s slow. Even if you have 1Gbps Internet connection, you may still be restricted to less than 5MB/s by WRDS.
Downloading from the PostgreSQL database is much more recommended. In WRDS, all data is stored into two copies: SAS and PostgreSQL. When you query from the Web, the default database being queried is SAS. But as a legacy from the 60s, SAS is too old to work smoothly with modern programming tools. WRDS knows that, so they created a second copy of all the data and stored it in a PostgreSQL database. By doing that, you can now use familiar languages such as R or Python to directly query from WRDS.
While bing flexible and easy to use, PostgreSQL has a pain point: It’s still too slow. Image how long it will take if you want to download the whole RavenPack database (~2.2T)!
2 Why I recommend Rsync
Rsync is a command line-based tool for file syncing and transferring. When we use Rsync, it still try to download the SAS dataset (the .sas7bdat
file, to be specific). But compared with directly query the Web, Rsync has the following merits.
It’s command line-based, meaning you can easily incorporate the download command line into your R or Python program.
It’s super fast! Rsync has a
-z
option for compressing. You may already find that many datasets have sparse columns: columns that most of the observations are none. In SAS (.sas7bdat
), null and not-null values take the same space. That is, the number10000000000
and a null value occupies the same number of bytes. By enabling the-z
option, Rsync can drastically reduce the file size before it’s transferred, and hence significantly boost speed.Rsync only downloads when files are changed. If a file is not modified, Rsync will not download it however many times you run the command. Even if a file is modified, only the portion being modified are transferred, which is call the “delta sync.”
But I don’t want to mislead you here. While Rsync is my favorite, it may not suit you because of the following reasons:
- It requires a working knowledge of command line and SSH. Yes, you need to first SSHs to WRDS cloud before using Rsync. This may be a bit scary if you have no experience with the command line.
- The downloaded data is in SAS. Personally, I keep a backend program running to automatically convert any
.sas7bdat
files intoarrow:feather
(orRDS
if you use R).
3 Examples: Use Rsync
to download funda
As I mentioned above, we need to SSH to WRDS to use Rsync. Depending on whether you use password or SSH key as the authentication method, there’re two ways to connect. Either works.
Say we want to download the funda
(Annual fundamentals) dataset from the Compustat database.
3.1 (Method 1: Easier) username/password authentication
First, install the sshpass
tool. On a linux machine, you can use sudo apt install sshpass
. After successful installation, run the following command:
sshpass -p '123456' rsync -hrztvP yuz@wrds-cloud.wharton.upenn.edu:/wrds/comp/sasdata/d_na/funda.sas7bdat ~/funda.sas7bdat
In the above command, user “yuz” connects to the WRDS Cloud using password “123456”. He downloads the dataset located at /wrds/comp/sasdata/d_na/funda.sas7bdat
and saves it to the local path ~/funda.sas7bdat
.
To check where the dataset is located on the WRDS Cloud, you can go to WRDS website:
You can also check using the web-based SAS Studio (https://wrds-cloud.wharton.upenn.edu/SASStudio/index?locale=en_US):
3.2 (Method 2: Hacker) public/private Key authentication
The second method is un-officially documented. It doesn’t require you to install sshpass
, but you need to get a copy of your SSH private key. The SSH key is located in your user folder (~/.ssh
) when you login the WRDS Cloud. Download it and save it to, for example, ~/.ssh/id_rsa
on your local machine.
Then run the following command
rsync -hrztvP -e 'ssh -i ~/.ssh/id_rsa' yuzhu@wrds-cloud.wharton.upenn.edu:/wrds/comp/d_na/funda.sas7bdat
The ssh -i ~/.ssh/id_rsa
part tells Rsync to read the SSH key from your local disk. By providing the key, you don’t need to type the password. There’re many other benefits to connect to WRDS Cloud using SSH key. But I can’t say too much here (legal issues concerned!).