A common task when parsing 10K/Q files is to extract “items” (sections) from the filing. Typically, a 10K filing has the following items:
Business
Risk factors
Selected financial data
Management’s discussion and analysis
Financial statements and supplementary data
In finance research, the “Management’s discussion” item receives special attention. However, there are no universal markers for each item, so researchers have to develop their own text extraction rules. An example is as follows:
This article provides example code to extract all major items from 10K and 10Q filings.
2 Where can I find “cleaned” 10K/Q files?
Before parsing, we first need to clean the 10X filings, since the original EDGAR files contain a lot of noisy HTML tags and special characters. Luckily, we don’t have to do this ourselves, because cleaned versions of 10X filings are available from the sources below.
2.1 Option 1 (free): From Loughran-McDonald’s website
If you have ever studied the literature on company filings, you will know Loughran and McDonald. They provide cleaned 10K/Q filings on their website. The cleaning details can be found here.
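Once the cleaned files are downloaded, a practical first step is to index them by year so they can be processed in batches. The snippet below is a minimal sketch, not the layout prescribed by Loughran-McDonald: it assumes a hypothetical local folder with one sub-folder per year (for example root/2019/...), and the function name index_filings_by_year is made up for illustration. Adjust it to however your download is organized.

```python
from collections import defaultdict
from pathlib import Path


def index_filings_by_year(root: str) -> dict[int, list[str]]:
    """Map each year to the sorted list of .txt paths found under root/<year>/.

    Assumption (hypothetical layout): one sub-folder per year, e.g. root/2019/.
    """
    paths_by_year: dict[int, list[str]] = defaultdict(list)
    for year_dir in sorted(Path(root).iterdir()):
        name = year_dir.name
        # Skip anything that is not a four-digit year folder
        if not (year_dir.is_dir() and len(name) == 4 and name.isdecimal()):
            continue
        # rglob also picks up files in nested folders such as quarters
        for path in sorted(year_dir.rglob("*.txt")):
            paths_by_year[int(name)].append(str(path))
    return dict(paths_by_year)


# Example call (the folder name is a placeholder):
# paths_by_year = index_filings_by_year("cleaned_10x")
```

The result has the shape dict[int, list[str]], where the key is the year and the value is the list of file paths for that year.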
2.2 Option 2 (paid): WRDS’s SEC suite
WRDS also provides its own version of cleaned 10X filings. It’s not free, since your school must subscribe to the SEC suite before you can access the data. But the WRDS data has several advantages:
Not limited to 10X. In addition to 10K and 10Q, WRDS provides cleaned versions of ALL filings on EDGAR. These cleaned filings come in txt format and total over 2TB.
Frequent updates. WRDS cleans and updates the filings on a daily basis. So, in theory, if a company files something on EDGAR, you can get the cleaned version the next day.
Value-added products. WRDS also provides value-added products based on the cleaned SEC filings, such as Ngram and sentiment.
2.3 Option 3 (paid): sec-api.io
sec-api.io is a paid service that provides fully parsed SEC filings. By fully parsed, I mean you can directly query parsed items from it!
For example, if you want to get item 1A (Risk Factors) in clear text from Tesla’s recent 10-K filing, you can use the following http query:
Sounds too good to be true, right? But wait, it’s not cheap. The monthly fee ranges from 50 to 240, depending on whether you’re an individual or a commercial entity. But the real deal breaker for me is that it caps monthly data usage at 15GB! Since even ten years of 10K filings run well over 15GB, this service is useless to me. A minor issue is that, unlike WRDS or Loughran-McDonald, the parsing method of sec-api.io is not open source, so you can’t verify the results.
3 Python code to extract items
Tip
The key idea is to use RegEx
There are no perfect regex rules. My current version has a failure rate below 0.5%
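To make the tip concrete, here is a small sketch of how one regex rule copes with different spellings of the same heading, and where it gives up. The pattern is modeled on the ITEM 7 start rule used in this article; the heading strings are made up for illustration and are not taken from real filings.

```python
import re

# Modeled on the ITEM 7 start rule: the word "item", the number 7,
# optional punctuation, then a word that starts with "M" (Management).
ITEM7_START = re.compile(r"item\s*7\s*[.;:\-_]*\s*\bM", re.IGNORECASE)

# Hypothetical heading variants, made up for illustration
headings = [
    "ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS",
    "Item 7 - Management's Discussion",
    "Item7:Management's Discussion",
    "Item 7A. Quantitative and Qualitative Disclosures",  # a different item
    "Item 7. Discussion by Management",  # unusual wording
]

# True if the rule finds the heading, False otherwise
matches = {h: bool(ITEM7_START.search(h)) for h in headings}
for heading, matched in matches.items():
    print(f"{matched!s:5}  {heading}")
```

The rule accepts the first three spellings and correctly skips Item 7A, but it also misses the last heading because the wording after the item number is unusual. Cases like this are why no rule set reaches a zero failure rate.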
The following code assumes you’re using Loughran-McDonald’s version of cleaned filings. The two functions, get_itemized_10k and get_itemized_10q, extract items from 10K and 10Q filings.
import re

# get file path as dict[int, list[str]] where
# key is the year and value is the list of file paths

# break the text into itemized sections
def extract_text(text, item_start, item_end):
    '''Return the longest text span between a start match and the next end match.

    Args:
        text: 10K/10Q filing text
        item_start: compiled regex pattern
        item_end: compiled regex pattern

    Returns:
        The item text, or None if no valid start/end pair is found.
    '''
    starts = [i.start() for i in item_start.finditer(text)]
    ends = [i.start() for i in item_end.finditer(text)]

    # if no matches, return None
    if len(starts) == 0 or len(ends) == 0:
        return None

    # get possible start/end positions
    # we may end up with multiple start/end positions, and we'll choose the longest
    # item text.
    positions = list()
    for s in starts:
        for e in ends:
            if s < e:
                # pair each start with the first end that follows it
                positions.append([s, e])
                break

    # every end marker precedes every start marker
    if not positions:
        return None

    # get the longest item text
    item_length = 0
    item_position = positions[0]
    for p in positions:
        if (p[1] - p[0]) > item_length:
            item_length = p[1] - p[0]
            item_position = p

    return text[item_position[0]:item_position[1]]


def get_itemized_10k(fname, sections: list[str] = ['business', 'risk', 'mda', '7a']):
    '''Extract ITEM from 10k filing text.

    Args:
        fname: str, the file name (ends with .txt)
        sections: list of sections to extract

    Returns:
        itemized_text: dict[str, str], where key is the section name and value is the text
    '''
    with open(fname, encoding='utf-8') as f:
        text = f.read()

    # extract text for each section
    results = {}

    for section in sections:

        # ITEM 1: Business
        # if there's no ITEM 1A then it ends at ITEM 2
        if section == 'business':
            try:
                item1_start = re.compile(r"i\s?tem[s]?\s*[1I]\s*[\.\;\:\-\_]*\s*\b", re.IGNORECASE)
                item1_end = re.compile(r"item\s*1a\s*[\.\;\:\-\_]*\s*Risk|item\s*2\s*[\.\,\;\:\-\_]*\s*(Desc|Prop)", re.IGNORECASE)
                results['business'] = extract_text(text, item1_start, item1_end)
            except Exception:
                print(f'Error extracting ITEM 1: Business for {fname}')

        # ITEM 1A: Risk Factors
        # it ends at ITEM 2
        if section == 'risk':
            try:
                item1a_start = re.compile(r"(?<!,\s)item\s*1a[\.\;\:\-\_]*\s*Risk", re.IGNORECASE)
                item1a_end = re.compile(r"item\s*2\s*[\.\;\:\-\_]*\s*(Desc|Prop)|item\s*[1I]\s*[\.\;\:\-\_]*\s*\b", re.IGNORECASE)
                results['risk'] = extract_text(text, item1a_start, item1a_end)
            except Exception:
                print(f'Error extracting ITEM 1A: Risk Factors for {fname}')

        # ITEM 7: Management's Discussion and Analysis of Financial Condition and Results of Operations
        # it ends at ITEM 7A (if it exists) or ITEM 8
        if section == 'mda':
            try:
                item7_start = re.compile(r"item\s*7\s*[\.\;\:\-\_]*\s*\bM", re.IGNORECASE)
                item7_end = re.compile(r"item\s*7a\s*[\.\;\:\-\_]*[\s\n]*Quanti|item\s*8\s*[\.\,\;\:\-\_]*\s*Finan", re.IGNORECASE)
                results['mda'] = extract_text(text, item7_start, item7_end)
            except Exception:
                print(f'Error extracting ITEM 7: MD&A for {fname}')

        # ITEM 7A: Quantitative and Qualitative Disclosures About Market Risk
        # it ends at ITEM 8
        if section == '7a':
            try:
                item7a_start = re.compile(r"item\s*7a\s*[\.\;\:\-\_]*[\s\n]*Quanti", re.IGNORECASE)
                item7a_end = re.compile(r"item\s*8\s*[\.\,\;\:\-\_]*\s*Finan", re.IGNORECASE)
                results['7a'] = extract_text(text, item7a_start, item7a_end)
            except Exception:
                print(f'Error extracting ITEM 7A: for {fname}')

    return results


def get_itemized_10q(fname, sections: list[str] = ['mda']):
    '''Extract ITEM from 10q filing text.

    Args:
        fname: str, the file name (ends with .txt)
        sections: list of sections to extract

    Returns:
        itemized_text: dict[str, str], where key is the section name and value is the text
    '''
    with open(fname, encoding='utf-8') as f:
        text = f.read()

    # extract text for each section
    results = {}

    for section in sections:

        # ITEM 2: Management's Discussion and Analysis of Financial Condition and Results of Operations
        # it ends at ITEM 3 (Quantitative and Qualitative Disclosures About Market Risk)
        if section == 'mda':
            try:
                item2_start = re.compile(r"item\s*2\s*[\.\;\:\-\_]*[\s\n]*Man", re.IGNORECASE)
                item2_end = re.compile(r"item\s*3\s*[\.\;\:\-\_]*[\s\n]*Quanti", re.IGNORECASE)
                results['mda'] = extract_text(text, item2_start, item2_end)
            except Exception:
                print(f'Error extracting ITEM 2: MD&A for {fname}')

    return results
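The code above keeps the longest candidate span when a start pattern matches more than once. A likely reason is the table of contents, which repeats every item heading before the real section begins. The sketch below illustrates that behavior on a synthetic filing made up for this example. The helper longest_span is a hypothetical, condensed restatement of the same pairing logic (not a replacement for it), and the patterns are simplified versions of the ITEM 7 rules.

```python
import re


def longest_span(text, start_pat, end_pat):
    """Pair each start match with the first end match after it; return the longest span."""
    starts = [m.start() for m in start_pat.finditer(text)]
    ends = [m.start() for m in end_pat.finditer(text)]
    spans = []
    for s in starts:
        # First end position that lies after this start, if any
        e = next((e for e in ends if e > s), None)
        if e is not None:
            spans.append((s, e))
    if not spans:
        return None
    s, e = max(spans, key=lambda span: span[1] - span[0])
    return text[s:e]


# Synthetic filing (made up): a table of contents followed by the body
filing = (
    "TABLE OF CONTENTS\n"
    "Item 7. Management's Discussion and Analysis 25\n"
    "Item 7A. Quantitative and Qualitative Disclosures 40\n"
    "Item 8. Financial Statements 42\n"
    "\n"
    "Item 7. Management's Discussion and Analysis\n"
    "Revenue grew because we shipped more units this year.\n"
    "Item 7A. Quantitative and Qualitative Disclosures About Market Risk\n"
    "We are exposed to interest rate risk.\n"
    "Item 8. Financial Statements and Supplementary Data\n"
)

start = re.compile(r"item\s*7\s*[.;:\-_]*\s*\bM", re.IGNORECASE)
end = re.compile(
    r"item\s*7a\s*[.;:\-_]*\s*Quanti|item\s*8\s*[.,;:\-_]*\s*Finan",
    re.IGNORECASE,
)

mda = longest_span(filing, start, end)
print(mda)
```

Both the table-of-contents entry and the body heading match the start rule, but the body span is longer, so it wins. The heuristic can still pick the wrong span when the real section is very short, for instance when the item is only incorporated by reference, which is one source of the remaining failures.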