NAICS Download

A Simple Crawler for NAICS Code

I found that there’s a NAICS API on GitHub with 91 stars. If you’re comfortable with APIs, it may be for you. But I still feel my solution offers the following benefits:

  • I offer a ready-to-download table that you can use even if you don’t know what an API is.

  • The GitHub repository seems to be stale (the last commit was 11 years ago) and is missing the 2022 update. My solution covers all three updates (2012, 2017, and 2022).

Let me know if there are other good NAICS solutions!

While NAICS (North American Industry Classification System) data is public on its own website and the US Census website, using it is tedious: it is presented as HTML pages, and users have to convert it into a 2D table themselves.

I wrote a simple Scrapy crawler to collect all the NAICS classification results. (There are three versions of NAICS: 2012, 2017, and 2022.) The source code is hosted on my GitHub.

How to download:

  • Go to the GitHub repository
  • Find the results folder and download the naics_complete.feather file.

Feather is an amazing format from the Apache Arrow project that can be read from both R and Python without any conversion.

If you use R (first install arrow):

r

library(arrow)

df = read_feather('results/naics_complete.feather')

If you use Python (first install pyarrow):

python

from pyarrow.feather import read_feather

df = read_feather('results/naics_complete.feather')
Preview of the results

As you can see, there are four columns:

  • year: 2012, 2017, or 2022

  • code: the NAICS code

  • desc: description of the industry

  • level:

    • "2 digits": the highest level (about 20 industries)

    • "4 digits": the next level (about 300 industries)

    • "6 digits": the finest level (about 1000 industries)
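
Because the level is tied to the number of digits, the hierarchy can be walked with plain string prefixes. The snippet below is only a minimal sketch: it assumes codes are stored as strings (wrap them in str() if they come back as integers), and the helper names level_of and ancestors are made up for illustration. Sectors that span several 2-digit codes (Manufacturing, for example, covers 31 to 33) may be written differently in the table, so check those rows before relying on this.

```python
from __future__ import annotations

# Levels present in the table, per the column description above.
LEVELS = {2: "2 digits", 4: "4 digits", 6: "6 digits"}


def level_of(code: str) -> str:
    """Return the level label implied by the length of a NAICS code."""
    try:
        return LEVELS[len(code)]
    except KeyError:
        raise ValueError(f"unexpected NAICS code length: {code!r}") from None


def ancestors(code: str) -> list[str]:
    """Return the coarser codes a code rolls up to, broadest first."""
    return [code[:n] for n in sorted(LEVELS) if n < len(code)]


print(level_of("5311"))     # 4 digits
print(ancestors("531110"))  # ['53', '5311']
```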

To run the crawler yourself, simply run scrapy crawl naics from the root directory of the repository. Of course, you need to install Scrapy first; see its documentation.

NAICS is one of the most popular industry classification systems and, in my view, the go-to one for North American companies. Quoting WRDS:

Quote
The three most popular industry classifications systems are SIC (Standard Industrial Classification), NAICS (North American Industrial Classification System), and GICS (Global Industry Classification Standard).

NAICS is designed to replace the old SIC system. It was developed jointly by the U.S. Economic Classification Policy Committee (ECPC), Statistics Canada, and Mexico’s Instituto Nacional de Estadistica y Geografia.

Many databases in WRDS do offer NAICS codes, but to the best of my knowledge, they don’t offer textual descriptions. So what’s the use of a NAICS code, say 5311, if nobody tells you that it means “Lessors of Real Estate”? To get this textual description, we have to go to NAICS’s official website, which only shows the data in HTML, not as a downloadable spreadsheet:

Screenshot of the NAICS website
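
With the downloadable table, attaching a description to a code becomes a dictionary lookup. Here is a rough sketch of that idea, not a polished tool. The rows below are hand-typed stand-ins using the year, code, desc, and level columns; in practice they could come from the table itself, for example via df.to_dict("records") in pandas. The firm records, the naics field, and the helpers build_index and describe are all hypothetical names used only for this example.

```python
# Hand-typed stand-ins for rows of the table; real rows come from the file.
rows = [
    {"year": 2022, "code": "53", "desc": "Real Estate and Rental and Leasing", "level": "2 digits"},
    {"year": 2022, "code": "5311", "desc": "Lessors of Real Estate", "level": "4 digits"},
]


def build_index(rows):
    """Map (year, code) to the description, normalizing the key types."""
    return {(int(r["year"]), str(r["code"])): r["desc"] for r in rows}


def describe(index, year, code, default=None):
    """Look up a description; return `default` when the code is unknown."""
    return index.get((int(year), str(code)), default)


# Hypothetical records that carry only a NAICS code, as many databases do.
firms = [
    {"name": "Firm A", "naics": "5311"},
    {"name": "Firm B", "naics": "0000"},  # deliberately not a real code
]

index = build_index(rows)
for firm in firms:
    firm["naics_desc"] = describe(index, 2022, firm["naics"])

print(firms[0]["naics_desc"])  # Lessors of Real Estate
print(firms[1]["naics_desc"])  # None
```

Keying on both year and code matters because the same code can be defined differently across the 2012, 2017, and 2022 versions.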