NAICS Download
A Simple Crawler for NAICS Code
1 Update (2024-1-15)
I found that there is a NAICS API on GitHub with 91 stars. If you are comfortable working with APIs, it may suit you. Still, I think my solution offers the following benefits:
- It offers a ready-to-download table that you can use even if you don’t know what an API is.
- The GitHub repository seems to be stale (the last commit was 11 years ago) and lacks the 2022 update. My solution covers all three versions (2012, 2017, and 2022).
Let me know if there are other good NAICS solutions!
2 TL;DR
While NAICS (North American Industry Classification System) data is public on its own website and the US Census Bureau’s website, using it is tedious: it is presented as HTML pages, so users have to convert it into a 2D table themselves.
I wrote a simple Scrapy crawler to collect all the NAICS classifications. It covers three versions of NAICS: 2012, 2017, and 2022. The source code is hosted on my GitHub.
How to download:
- Go to the GitHub repository
- Find the results folder and download the naics_complete.feather file.
3 How to Read the Data with R/Python?
Feather is a file format from the Apache Arrow project that can be read in both R and Python without conversion.
If you use R (first install arrow):
library(arrow)
df = read_feather('results/naics_complete.feather')
If you use Python (first install pyarrow, plus pandas, which read_feather uses to return a DataFrame):
from pyarrow.feather import read_feather
df = read_feather('results/naics_complete.feather')
4 How I Organize the Result
As you can see, there are four columns:
- year: 2012, 2017, or 2022
- code: the NAICS code
- desc: description of the industry
- level: one of
  - "2 digits": the highest level (about 20 industries)
  - "4 digits": the next level (about 300 industries)
  - "6 digits": the finest level (about 1000 industries)
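Once the table is loaded, the most common task is probably turning a code into its description. The sketch below shows one way to do that in plain Python; it is not part of the repository. The three rows are illustrative samples shaped like the four columns above, and build_lookup, describe, and roll_up are hypothetical helper names. It assumes codes can be treated as strings of digits.

```python
# Illustrative rows only, shaped like the four columns described above.
# With the real file, rows of this shape can be obtained from the loaded
# DataFrame via df.to_dict("records").
rows = [
    {"year": 2022, "code": "53", "desc": "Real Estate and Rental and Leasing", "level": "2 digits"},
    {"year": 2022, "code": "5311", "desc": "Lessors of Real Estate", "level": "4 digits"},
    {"year": 2022, "code": "531110", "desc": "Lessors of Residential Buildings and Dwellings", "level": "6 digits"},
]


def build_lookup(rows):
    """Map (year, code) -> desc, normalising year to int and code to str."""
    return {(int(r["year"]), str(r["code"])): r["desc"] for r in rows}


def describe(lookup, year, code):
    """Return the description for a code in a given NAICS year, or None."""
    return lookup.get((int(year), str(code)))


def roll_up(code, digits):
    """Truncate a code to a coarser level, e.g. 531110 -> 5311 -> 53."""
    code = str(code)
    if digits > len(code):
        raise ValueError(f"cannot roll {code!r} up to {digits} digits")
    return code[:digits]


lookup = build_lookup(rows)
print(describe(lookup, 2022, 5311))                  # Lessors of Real Estate
print(describe(lookup, 2022, roll_up("531110", 2)))  # Real Estate and Rental and Leasing
```

Keying on (year, code) matters because the same code can appear in more than one NAICS version. One caveat on roll_up: some top-level sectors span a range of two-digit codes (Manufacturing is 31-33, for example), so a plain two-digit prefix may not match how those sectors are stored in the file. Check the actual values in the code column before relying on it.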
5 How to Run the Crawler
From the repository’s root directory, simply run scrapy crawl naics. You need to install Scrapy first; see its documentation.
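For readers curious about what "converting an HTML page into a 2D table" involves, here is a minimal sketch of the idea. It is not the repository's Scrapy spider: it uses the standard library's html.parser instead, and the markup in SAMPLE_HTML is made up for illustration rather than copied from the real NAICS pages. CodeTableParser and to_records are hypothetical names.

```python
from html.parser import HTMLParser

# Made-up markup for illustration; the real pages are structured differently.
SAMPLE_HTML = """
<table>
  <tr><td>53</td><td>Real Estate and Rental and Leasing</td></tr>
  <tr><td>5311</td><td>Lessors of Real Estate</td></tr>
  <tr><td>531110</td><td>Lessors of Residential Buildings and Dwellings</td></tr>
</table>
"""


class CodeTableParser(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def to_records(html, year):
    """Turn (code, description) rows into records with the four columns."""
    parser = CodeTableParser()
    parser.feed(html)
    parser.close()
    records = []
    for row in parser.rows:
        if len(row) != 2:
            continue  # skip rows that are not a code/description pair
        code, desc = row
        if len(code) in (2, 4, 6):  # keep only the three levels in the table
            records.append(
                {"year": year, "code": code, "desc": desc, "level": f"{len(code)} digits"}
            )
    return records


records = to_records(SAMPLE_HTML, 2022)
for record in records:
    print(record)
```

The level label is derived from the length of the code, which is one plausible way to fill that column; the actual crawler may do it differently.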
6 Fun Facts about NAICS and WRDS
NAICS is one of the most popular industry classification systems for North American companies and, in my view, the go-to one. Quoting WRDS:
NAICS is designed to replace the old SIC system. It was developed jointly by the U.S. Economic Classification Policy Committee (ECPC), Statistics Canada, and Mexico’s Instituto Nacional de Estadistica y Geografia.
Many databases in WRDS do offer NAICS codes but, to the best of my knowledge, not the textual descriptions. So what’s the use of a NAICS code, say 5311, if nobody tells you that it means “Lessors of Real Estate”? To get this description, we have to go to the official NAICS website, which only shows the data as HTML, not as a downloadable table: