Once you have downloaded the zip file, you can easily explore the data just with a standard texteditor. But it is more fun to do it with Python.
Let me explain to you how to do this with just a few lines of code
import pandas as pd
import zipfile
from pathlib import Path
We will use pandas to explore the data. Moreover, since the content is in a zip-file, we will use zipfile. Path from pathlib helps to access the file in a simple way.
Inside the zip-file are three files:
For our examples, we will use the daily file from the 28th October, 2022. This was the day when also Apple published its annual report.
First, we define a few constants:
example_file = "daily/20221028.zip"
SUB_TXT = "sub.txt"
NUM_TXT = "num.txt"
PRE_TXT = "pre.txt"
Next we write a simple function that extracts the content of a single file from the zip-file directly as a pandas dataframe:
def read_df_from_file_in_zip(zip_file: str, file_to_extract: str, dtype = None, usecols=None) -> pd.DataFrame:
with zipfile.ZipFile(zip_file, "r") as zf:
file = Path(file_to_extract).name
return pd.read_csv(zf.open(file), header=0, delimiter="\t", dtype = dtype, usecols=usecols)
Let us read the content of the sub.txt, num.txt, pre.txt directly into panda dataframes:
sub_df = read_df_from_file_in_zip(zip_file=example_file, file_to_extract=SUB_TXT)
num_df = read_df_from_file_in_zip(zip_file=example_file, file_to_extract=NUM_TXT)
pre_df = read_df_from_file_in_zip(zip_file=example_file, file_to_extract=PRE_TXT)
Now, we are ready to explore the data.
sub_df.head(5)
displays the first 5 rows of the file:
adsh | cik | name | sic | fye | form | period | filed | accepted | fy_real | fy | fp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0000950170-22-020366 | 1000229 | CORE LABORATORIES N V | NaN | 1231 | 10-Q | 20220930 | 20221028 | NaN | 2022 | 2022 | Q3 |
1 | 0000100122-22-000037 | 100122 | TUCSON ELECTRIC POWER CO | NaN | 1231 | 10-Q | 20220930 | 20221028 | NaN | 2022 | 2022 | Q3 |
2 | 0001493152-22-029844 | 1006028 | PURE BIOSCIENCE, INC. | NaN | 731 | 10-K | 20220731 | 20221028 | NaN | 2022 | 2022 | FY |
3 | 0001493152-22-029857 | 1008653 | ECOMAX, INC | NaN | 630 | 10-Q | 20220930 | 20221028 | NaN | 2023 | 2023 | Q1 |
4 | 0000950170-22-020458 | 1015739 | AWARE INC /MA/ | NaN | 1231 | 10-Q | 20220930 | 20221028 | NaN | 2022 | 2022 | Q3 |
You find the details about the columns in this document: Financial Statements Data Sets
(Note: the daily-zip files contain only a subset of the attributes).
The most important attributes are:
In order to find the CIK for a company, you can go to EDGAR Company Filings. For instance, if you search for “apple” you find the CIK “320193” (Note: you don’t need the leading zeros.
So lets find the submission entry for apple:
apple_sub_df = sub_df[sub_df.cik==320193][:1]
apple_sub_df
adsh | cik | name | sic | fye | form | period | filed | accepted | fy_real | fy | fp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
98 | 0000320193-22-000108 | 320193 | APPLE INC. | NaN | 930 | 10-K | 20220930 | 20221028 | NaN | 2022 | 2022 | FY |
Let us store the adsh id of the annual report from apple of 2022 in its own variable:
apple_adsh = apple_sub_df.adsh.values[0]
The num.txt files contains the raw numbers of data inside the report and the pre.txt file tells us in which table this value appears.
Again, it is important to understand the basic attributes of these two tables, so let’s have a look at the document Financial Statements Data Sets again
Main attributes of the num.txt.
Main attributes of the pre.txt:
This will become more clear with the following example.
We are going to reproduce the information in the Balance Sheet of apple’s 10-k report for 2022. If you go to APPLE 10-K 2022 you’ll see the filed report of apple published at the sec website. Open the link, press on “Sections” right next to “Menu” on the toolbar, Open “Financial Statements” and choose “CONSOLIDATED BALANCE SHEETS”. This will directly show you the page of the report containing the balance sheet:
Every numeric entry in the table is actually a value that has to be present in the num.txt file. For instance if you click on the value of “Cash and Cash Equivalents” in the the column for September, 24th 2022, A pop up window appears that shows you the details of this entry (Tag: “CashAndCashEquivalentsAtCarryingValue”, Fact (value): 23,646,000,000, Period (ddate): As of 09/24/2022)
So let us find out what data for “CashAndCashEquivalentsAtCarryingValue” is available in the num.txt file:
num_df[(num_df.adsh==apple_adsh) & (num_df.tag=='CashAndCashEquivalentsAtCarryingValue')]
adsh | tag | version | ddate | qtrs | coreg | uom_ext | uom | value | decimals | footnote | |
---|---|---|---|---|---|---|---|---|---|---|---|
40281 | 0000320193-22-000108 | CashAndCashEquivalentsAtCarryingValue | us-gaap/2022 | 20210930 | 0 | NaN | USD | 34940000000.0 | 6.0 | NaN | |
40282 | 0000320193-22-000108 | CashAndCashEquivalentsAtCarryingValue | us-gaap/2022 | 20220930 | 0 | NaN | USD | 23646000000.0 | 6.0 | NaN |
Indeed, we find the two entries, one for the end of September of 2022 and one for the end of September of 2021. Note: the SEC does “round” the dates to the end of the month, so that is the reason why the “ddate” is 20220930, resp. 20210930.
Next, we look at the entries for “CashAndCashEquivalentsAtCarryingValue” in the pre.txt file:
pre_df[(pre_df.adsh==apple_adsh) & (pre_df.tag=='CashAndCashEquivalentsAtCarryingValue')]
adsh tag version report line stmt negating inpth rfile
18056 0000320193-22-000108 CashAndCashEquivalentsAtCarryingValue us-gaap/2022 4 3 BS False 0 -
adsh | tag | version | report | line | stmt | negating | inpth | rfile | |
---|---|---|---|---|---|---|---|---|---|
18056 | 0000320193-22-000108 | CashAndCashEquivalentsAtCarryingValue | us-gaap/2022 | 4 | 3 | BS | False | 0 |
The “stmt” is BS which stands for balance sheet. Furthermore, the “line” is 3. If you have a look at the report itself, you see that in the first row of the balance sheet is the text “ASSETS:”. In the second row the text “Current assets:”. The third row finally contains the “Cash and cash equivalents”.
Next we want to merge the data from num.txt and pre.txt together. As you can see in the description of the Financial Statements Data Sets, the keys that define the relation between them are “adsh”, “tag”, and “version”.
First, let us create dataframes that only contain apple data and only the columns that we are interested in to “rebuild” the balance sheet:
apple_num_df = num_df[num_df.adsh==apple_adsh]
apple_pre_df = pre_df[pre_df.adsh==apple_adsh]
apple_num_short_df = apple_num_df[['adsh', 'tag', 'version', 'ddate', 'uom', 'value']]
apple_pre_short_df = apple_pre_df[['adsh', 'tag', 'version', 'report', 'line', 'stmt','inpth', 'negating' ]]
Next we use a simple pandas.merge() to merge the two datasets together:
apple_num_pre_merged_df = pd.merge(apple_num_short_df, apple_pre_short_df, on=['adsh', 'tag', 'version'])
apple_num_pre_merged_df.head(5) #display the first 5 rows of the result
idx | adsh | tag | version | ddate | uom | value | report | line | stmt | inpth | negating |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0000320193-22-000108 | CommonStockDividendsPerShareDeclared | us-gaap/2022 | 20220930 | USD | 0.9 | 6 | 21 | EQ | 0 | False |
1 | 0000320193-22-000108 | CommonStockDividendsPerShareDeclared | us-gaap/2022 | 20210930 | USD | 0.85 | 6 | 21 | EQ | 0 | False |
2 | 0000320193-22-000108 | CommonStockDividendsPerShareDeclared | us-gaap/2022 | 20200930 | USD | 0.795 | 6 | 21 | EQ | 0 | False |
3 | 0000320193-22-000108 | CommonStockParOrStatedValuePerShare | us-gaap/2022 | 20220930 | USD | 0.0 | 5 | 1 | BS | 1 | False |
4 | 0000320193-22-000108 | CommonStockParOrStatedValuePerShare | us-gaap/2022 | 20210930 | USD | 0.0 | 5 | 1 | BS | 1 | False |
If we look at the last table, we see that we have different dates in the ddate column for the same tag. That was to expected, since every annual or quarterly report always contains data from the previous periods. Furthermore, the financial statements (like the balance sheet shown above) usually also contains the values of the previous period.
What we want to do now, is to pivot the table, so that we have the different ddates as columns.
apple_num_pre_merged_pivot_df = apple_num_pre_merged_df.pivot_table(index=["adsh", "tag", "version", "stmt", "report", "line", "uom", "inpth"] ,columns="ddate", values="value")
adsh |
tag |
version |
stmt |
report |
line |
uom |
negating |
ddate inpth |
20190930 |
20200930 |
20210930 |
20220331 |
20220930 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0000320193-22-000108 | AccountsPayableCurrent | us-gaap/2022 | BS | 4 | 18 | USD | False | 0 | NaN | NaN | 5.476300e+10 | NaN | 6.411500e+10 |
^ | AccountsReceivableNetCurrent | us-gaap/2022 | BS | 4 | 5 | USD | False | 0 | NaN | NaN | 2.627800e+10 | NaN | 2.818400e+10 |
^ | AccumulatedOtherComprehensiveIncomeLossNetOfTax | us-gaap/2022 | BS | 4 | 33 | USD | False | 0 | NaN | NaN | 1.630000e+08 | NaN | -1.110900e+10 |
^ | Assets | us-gaap/2022 | BS | 4 | 15 | USD | False | 0 | NaN | NaN | 3.510020e+11 | NaN | 3.527550e+11 |
^ | AssetsCurrent | us-gaap/2022 | BS | 4 | 9 | USD | False | 0 | NaN | NaN | 1.348360e+11 | NaN | 1.354050e+11 |
Note: after pivoting, the columns adsh, tag, version, stmt, report, line, uom, negating, and inpth are now in the idx. Furthermore, the columns themself also have an index “ddate”. Next, the column order is not really appropriate, and we also don’t really need the multilevel index, so let’s clean that up:
apple_num_pre_merged_pivot_df.rename_axis(None, axis=1, inplace=True)
apple_num_pre_merged_pivot_df.sort_values(['stmt', 'report', 'line', 'inpth'], inplace=True)
apple_num_pre_merged_pivot_df.reset_index(drop=False, inplace=True)
adsh | tag | version | stmt | report | line | uom | negating | inpth | 20190930 | 20200930 | 20210930 | 20220331 | 20220930 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0000320193-22-000108 | CashAndCashEquivalentsAtCarryingValue | us-gaap/2022 | BS | 4 | 3 | USD | False | 0 | NaN | NaN | 3.494000e+10 | NaN | 2.364600e+10 |
1 | 0000320193-22-000108 | MarketableSecuritiesCurrent | us-gaap/2022 | BS | 4 | 4 | USD | False | 0 | NaN | NaN | 2.769900e+10 | NaN | 2.465800e+10 |
2 | 0000320193-22-000108 | AccountsReceivableNetCurrent | us-gaap/2022 | BS | 4 | 5 | USD | False | 0 | NaN | NaN | 2.627800e+10 | NaN | 2.818400e+10 |
3 | 0000320193-22-000108 | InventoryNet | us-gaap/2022 | BS | 4 | 6 | USD | False | 0 | NaN | NaN | 6.580000e+09 | NaN | 4.946000e+09 |
4 | 0000320193-22-000108 | NontradeReceivablesCurrent | us-gaap/2022 | BS | 4 | 7 | USD | False | 0 | NaN | NaN | 2.522800e+10 | NaN | 3.274800e+10 |
Finally, we are going to produce the balance sheet as it appears in the report above.
First, we only select the columns we need. Since our dataset only contain the information of one report, we can get rid of the adsh. The version is also the same for all the tags, so we also get rid of it. Finally, the report only shows the data for September 2022 and 2021, we only filter for these columns:
apple_data_2021_2022_df = apple_num_pre_merged_pivot_df[["tag", "stmt", "report", "line", "uom", "inpth", "negating", 20220930,20210930]]
Note: the “dates” 20220930 and 20210930 were loaded as pure numbers, since we didn’t specify a schema for the data.
Now, we can just display the data for the balance sheet:
apple_data_2021_2022_df[apple_data_2021_2022_df.stmt == "BS"]
tag | stmt | report | line | uom | inpth | negating | 20220930 | 20210930 | |
---|---|---|---|---|---|---|---|---|---|
0 | CashAndCashEquivalentsAtCarryingValue | BS | 4 | 3 | USD | 0 | False | 2.364600e+10 | 3.494000e+10 |
1 | MarketableSecuritiesCurrent | BS | 4 | 4 | USD | 0 | False | 2.465800e+10 | 2.769900e+10 |
2 | AccountsReceivableNetCurrent | BS | 4 | 5 | USD | 0 | False | 2.818400e+10 | 2.627800e+10 |
3 | InventoryNet | BS | 4 | 6 | USD | 0 | False | 4.946000e+09 | 6.580000e+09 |
4 | NontradeReceivablesCurrent | BS | 4 | 7 | USD | 0 | False | 3.274800e+10 | 2.522800e+10 |
5 | OtherAssetsCurrent | BS | 4 | 8 | USD | 0 | False | 2.122300e+10 | 1.411100e+10 |
6 | AssetsCurrent | BS | 4 | 9 | USD | 0 | False | 1.354050e+11 | 1.348360e+11 |
7 | MarketableSecuritiesNoncurrent | BS | 4 | 11 | USD | 0 | False | 1.208050e+11 | 1.278770e+11 |
8 | PropertyPlantAndEquipmentNet | BS | 4 | 12 | USD | 0 | False | 4.211700e+10 | 3.944000e+10 |
9 | OtherAssetsNoncurrent | BS | 4 | 13 | USD | 0 | False | 5.442800e+10 | 4.884900e+10 |
10 | AssetsNoncurrent | BS | 4 | 14 | USD | 0 | False | 2.173500e+11 | 2.161660e+11 |
11 | Assets | BS | 4 | 15 | USD | 0 | False | 3.527550e+11 | 3.510020e+11 |
12 | AccountsPayableCurrent | BS | 4 | 18 | USD | 0 | False | 6.411500e+10 | 5.476300e+10 |
13 | OtherLiabilitiesCurrent | BS | 4 | 19 | USD | 0 | False | 6.084500e+10 | 4.749300e+10 |
14 | ContractWithCustomerLiabilityCurrent | BS | 4 | 20 | USD | 0 | False | 7.912000e+09 | 7.612000e+09 |
15 | CommercialPaper | BS | 4 | 21 | USD | 0 | False | 9.982000e+09 | 6.000000e+09 |
16 | LongTermDebtCurrent | BS | 4 | 22 | USD | 0 | False | 1.112800e+10 | 9.613000e+09 |
17 | LiabilitiesCurrent | BS | 4 | 23 | USD | 0 | False | 1.539820e+11 | 1.254810e+11 |
18 | LongTermDebtNoncurrent | BS | 4 | 25 | USD | 0 | False | 9.895900e+10 | 1.091060e+11 |
19 | OtherLiabilitiesNoncurrent | BS | 4 | 26 | USD | 0 | False | 4.914200e+10 | 5.332500e+10 |
20 | LiabilitiesNoncurrent | BS | 4 | 27 | USD | 0 | False | 1.481010e+11 | 1.624310e+11 |
21 | Liabilities | BS | 4 | 28 | USD | 0 | False | 3.020830e+11 | 2.879120e+11 |
22 | CommonStocksIncludingAdditionalPaidInCapital | BS | 4 | 31 | USD | 0 | False | 6.484900e+10 | 5.736500e+10 |
23 | RetainedEarningsAccumulatedDeficit | BS | 4 | 32 | USD | 0 | False | -3.068000e+09 | 5.562000e+09 |
24 | AccumulatedOtherComprehensiveIncomeLossNetOfTax | BS | 4 | 33 | USD | 0 | False | -1.110900e+10 | 1.630000e+08 |
25 | StockholdersEquity | BS | 4 | 34 | USD | 0 | False | 5.067200e+10 | 6.309000e+10 |
26 | LiabilitiesAndStockholdersEquity | BS | 4 | 35 | USD | 0 | False | 3.527550e+11 | 3.510020e+11 |
27 | CommonStockParOrStatedValuePerShare | BS | 5 | 1 | USD | 1 | False | 0.000000e+00 | 0.000000e+00 |
28 | CommonStockSharesAuthorized | BS | 5 | 2 | shares | 1 | False | 5.040000e+10 | 5.040000e+10 |
29 | CommonStockSharesIssued | BS | 5 | 3 | shares | 1 | False | 1.594342e+10 | 1.642679e+10 |
30 | CommonStockSharesOutstanding | BS | 5 | 4 | shares | 1 | False | 1.594342e+10 | 1.642679e+10 |