Daily-Sec-Financial-Statement-Dataset

GRATIS CON POSSIBILITร€ DI UPGRADE
Da Hansjรถrg Wingeier | Aggiornamento hace 2 meses | Finance
Popolaritร 

8.5 / 10

Latenza

1,496ms

Livello di servizio

96%

Health Check

N/A

Torna a tutti i tutorial (2)

Consuming the data

Consuming the data

Introduction

Once you have downloaded the zip file, you can easily explore the data just with a standard texteditor. But it is more fun to do it with Python.

Let me explain to you how to do this with just a few lines of code

Imports you need

import pandas as pd
import zipfile
from pathlib import Path

We will use pandas to explore the data. Moreover, since the content is in a zip-file, we will use zipfile. Path from pathlib helps to access the file in a simple way.

Reading the content of the zip file

Inside the zip-file are three files:

  • The sub.txt, which contains all the submissions, meaning all the reports that were filed at that day
  • The num.txt, which contains the all the detail data (numbers) of every report
  • The pre.txt, which contains the presentation of the data, meaning mainly in which table the data was presented (eg. BalanceSheet, CashFlow statement, โ€ฆ)

For our examples, we will use the daily file from the 28th October, 2022. This was the day when also Apple published its annual report.

First, we define a few constants:

example_file = "daily/20221028.zip"
SUB_TXT = "sub.txt"
NUM_TXT = "num.txt"
PRE_TXT = "pre.txt"

Next we write a simple function that extracts the content of a single file from the zip-file directly as a pandas dataframe:

def read_df_from_file_in_zip(zip_file: str, file_to_extract: str, dtype = None,  usecols=None) -> pd.DataFrame:
    with zipfile.ZipFile(zip_file, "r") as zf:
        file = Path(file_to_extract).name
        return pd.read_csv(zf.open(file), header=0, delimiter="\t", dtype = dtype, usecols=usecols)

Let us read the content of the sub.txt, num.txt, pre.txt directly into panda dataframes:

sub_df = read_df_from_file_in_zip(zip_file=example_file, file_to_extract=SUB_TXT)
num_df = read_df_from_file_in_zip(zip_file=example_file, file_to_extract=NUM_TXT)
pre_df = read_df_from_file_in_zip(zip_file=example_file, file_to_extract=PRE_TXT)

Now, we are ready to explore the data.

The submission file

sub_df.head(5) displays the first 5 rows of the file:

adsh cik name sic fye form period filed accepted fy_real fy fp
0 0000950170-22-020366 1000229 CORE LABORATORIES N V NaN 1231 10-Q 20220930 20221028 NaN 2022 2022 Q3
1 0000100122-22-000037 100122 TUCSON ELECTRIC POWER CO NaN 1231 10-Q 20220930 20221028 NaN 2022 2022 Q3
2 0001493152-22-029844 1006028 PURE BIOSCIENCE, INC. NaN 731 10-K 20220731 20221028 NaN 2022 2022 FY
3 0001493152-22-029857 1008653 ECOMAX, INC NaN 630 10-Q 20220930 20221028 NaN 2023 2023 Q1
4 0000950170-22-020458 1015739 AWARE INC /MA/ NaN 1231 10-Q 20220930 20221028 NaN 2022 2022 Q3

You find the details about the columns in this document: Financial Statements Data Sets
(Note: the daily-zip files contain only a subset of the attributes).

The most important attributes are:

  • adsh: the โ€œaccession numberโ€ - every report receives its unique id
  • cik: the โ€œcentral index keyโ€ - every company that has to file reports with the SEC has its own cik
  • form: the โ€œsubmission type of the fillingโ€ - e.g, 10-k is an annual report, 10-q is a quarterly report
  • period: the โ€œbalance sheet dateโ€ - the date of the balance sheet
  • filed: the โ€œfiling dateโ€ - the date when the report was filed with the sec

In order to find the CIK for a company, you can go to EDGAR Company Filings. For instance, if you search for โ€œappleโ€ you find the CIK โ€œ320193โ€ (Note: you donโ€™t need the leading zeros.

So lets find the submission entry for apple:

apple_sub_df = sub_df[sub_df.cik==320193][:1]
apple_sub_df
adsh cik name sic fye form period filed accepted fy_real fy fp
98 0000320193-22-000108 320193 APPLE INC. NaN 930 10-K 20220930 20221028 NaN 2022 2022 FY

Let us store the adsh id of the annual report from apple of 2022 in its own variable:

apple_adsh = apple_sub_df.adsh.values[0]

Basics about num.txt and pre.txt

The num.txt files contains the raw numbers of data inside the report and the pre.txt file tells us in which table this value appears.

Again, it is important to understand the basic attributes of these two tables, so letโ€™s have a look at the document Financial Statements Data Sets again

Main attributes of the num.txt.

  • adsh, tag, version, and ddate: this 4 attributes identify a certain value within the whole num.txt file. The adsh defines to which report the entry belongs, the tag defines what it is (like โ€œCurrent Assetsโ€), the version defines in which xbrl taxonomy this tag is defined, and the ddate tells you for which date the value was produced for
  • uom: the unit of measure - either the currency or the unit of the value, like shares
  • value: the value itself

Main attributes of the pre.txt:

  • adsh, tag, version: identify which tag is presented. Again, the adsh is the unique number of the report, the tag tells what is displayed and the version defines the taxonomy where the tag is defined
  • stmt: this attribute defines the (financial) statement to which the value belongs to (BS=Balance Sheet, IS=Income Statement, CF=Cash Flow, EQ = Equity, โ€ฆ)
  • report: is used to group the information of a single table together
  • line: defines at which line of the table the tag is presented
  • inpth: define if the values of the tag are presented within parentheses in the text column of the table.
  • negating: shows if a number is displayed as a negative number (for instance, if a number appears in parentheses like โ€œ(203.30)โ€ the value of the corresponding entry in the num.txt file could actually be -203.30, or it could be 203.30 but with the negating flag set to True

This will become more clear with the following example.

We are going to reproduce the information in the Balance Sheet of appleโ€™s 10-k report for 2022. If you go to APPLE 10-K 2022 youโ€™ll see the filed report of apple published at the sec website. Open the link, press on โ€œSectionsโ€ right next to โ€œMenuโ€ on the toolbar, Open โ€œFinancial Statementsโ€ and choose โ€œCONSOLIDATED BALANCE SHEETSโ€. This will directly show you the page of the report containing the balance sheet:

Balance Sheet of Apple 2022

Every numeric entry in the table is actually a value that has to be present in the num.txt file. For instance if you click on the value of โ€œCash and Cash Equivalentsโ€ in the the column for September, 24th 2022, A pop up window appears that shows you the details of this entry (Tag: โ€œCashAndCashEquivalentsAtCarryingValueโ€, Fact (value): 23,646,000,000, Period (ddate): As of 09/24/2022)

So let us find out what data for โ€œCashAndCashEquivalentsAtCarryingValueโ€ is available in the num.txt file:

 num_df[(num_df.adsh==apple_adsh) & (num_df.tag=='CashAndCashEquivalentsAtCarryingValue')]
adsh tag version ddate qtrs coreg uom_ext uom value decimals footnote
40281 0000320193-22-000108 CashAndCashEquivalentsAtCarryingValue us-gaap/2022 20210930 0 NaN USD 34940000000.0 6.0 NaN
40282 0000320193-22-000108 CashAndCashEquivalentsAtCarryingValue us-gaap/2022 20220930 0 NaN USD 23646000000.0 6.0 NaN

Indeed, we find the two entries, one for the end of September of 2022 and one for the end of September of 2021. Note: the SEC does โ€œroundโ€ the dates to the end of the month, so that is the reason why the โ€œddateโ€ is 20220930, resp. 20210930.

Next, we look at the entries for โ€œCashAndCashEquivalentsAtCarryingValueโ€ in the pre.txt file:

pre_df[(pre_df.adsh==apple_adsh) & (pre_df.tag=='CashAndCashEquivalentsAtCarryingValue')]
adsh	tag	version	report	line	stmt	negating	inpth	rfile

18056 0000320193-22-000108 CashAndCashEquivalentsAtCarryingValue us-gaap/2022 4 3 BS False 0 -

adsh tag version report line stmt negating inpth rfile
18056 0000320193-22-000108 CashAndCashEquivalentsAtCarryingValue us-gaap/2022 4 3 BS False 0

The โ€œstmtโ€ is BS which stands for balance sheet. Furthermore, the โ€œlineโ€ is 3. If you have a look at the report itself, you see that in the first row of the balance sheet is the text โ€œASSETS:โ€. In the second row the text โ€œCurrent assets:โ€. The third row finally contains the โ€œCash and cash equivalentsโ€.

Merging the data from num.txt and pre.txt into a single dataframe

Next we want to merge the data from num.txt and pre.txt together. As you can see in the description of the Financial Statements Data Sets, the keys that define the relation between them are โ€œadshโ€, โ€œtagโ€, and โ€œversionโ€.

First, let us create dataframes that only contain apple data and only the columns that we are interested in to โ€œrebuildโ€ the balance sheet:

apple_num_df = num_df[num_df.adsh==apple_adsh]
apple_pre_df = pre_df[pre_df.adsh==apple_adsh]
apple_num_short_df = apple_num_df[['adsh', 'tag', 'version', 'ddate', 'uom', 'value']]
apple_pre_short_df = apple_pre_df[['adsh', 'tag', 'version', 'report', 'line', 'stmt','inpth', 'negating' ]]

Next we use a simple pandas.merge() to merge the two datasets together:

apple_num_pre_merged_df = pd.merge(apple_num_short_df, apple_pre_short_df, on=['adsh', 'tag', 'version'])
apple_num_pre_merged_df.head(5) #display the first 5 rows of the result
idx adsh tag version ddate uom value report line stmt inpth negating
0 0000320193-22-000108 CommonStockDividendsPerShareDeclared us-gaap/2022 20220930 USD 0.9 6 21 EQ 0 False
1 0000320193-22-000108 CommonStockDividendsPerShareDeclared us-gaap/2022 20210930 USD 0.85 6 21 EQ 0 False
2 0000320193-22-000108 CommonStockDividendsPerShareDeclared us-gaap/2022 20200930 USD 0.795 6 21 EQ 0 False
3 0000320193-22-000108 CommonStockParOrStatedValuePerShare us-gaap/2022 20220930 USD 0.0 5 1 BS 1 False
4 0000320193-22-000108 CommonStockParOrStatedValuePerShare us-gaap/2022 20210930 USD 0.0 5 1 BS 1 False

Pivot the table to have the ddate values as columns

If we look at the last table, we see that we have different dates in the ddate column for the same tag. That was to expected, since every annual or quarterly report always contains data from the previous periods. Furthermore, the financial statements (like the balance sheet shown above) usually also contains the values of the previous period.

What we want to do now, is to pivot the table, so that we have the different ddates as columns.

apple_num_pre_merged_pivot_df = apple_num_pre_merged_df.pivot_table(index=["adsh", "tag", "version", "stmt", "report", "line", "uom", "inpth"] ,columns="ddate", values="value")

adsh

tag

version

stmt

report

line

uom

negating
ddate
inpth
20190930
ย 
20200930
ย 
20210930
ย 
20220331
ย 
20220930
ย 
0000320193-22-000108 AccountsPayableCurrent us-gaap/2022 BS 4 18 USD False 0 NaN NaN 5.476300e+10 NaN 6.411500e+10
^ AccountsReceivableNetCurrent us-gaap/2022 BS 4 5 USD False 0 NaN NaN 2.627800e+10 NaN 2.818400e+10
^ AccumulatedOtherComprehensiveIncomeLossNetOfTax us-gaap/2022 BS 4 33 USD False 0 NaN NaN 1.630000e+08 NaN -1.110900e+10
^ Assets us-gaap/2022 BS 4 15 USD False 0 NaN NaN 3.510020e+11 NaN 3.527550e+11
^ AssetsCurrent us-gaap/2022 BS 4 9 USD False 0 NaN NaN 1.348360e+11 NaN 1.354050e+11

Note: after pivoting, the columns adsh, tag, version, stmt, report, line, uom, negating, and inpth are now in the idx. Furthermore, the columns themself also have an index โ€œddateโ€. Next, the column order is not really appropriate, and we also donโ€™t really need the multilevel index, so letโ€™s clean that up:

apple_num_pre_merged_pivot_df.rename_axis(None, axis=1, inplace=True)
apple_num_pre_merged_pivot_df.sort_values(['stmt', 'report', 'line', 'inpth'], inplace=True)
apple_num_pre_merged_pivot_df.reset_index(drop=False, inplace=True)
adsh tag version stmt report line uom negating inpth 20190930 20200930 20210930 20220331 20220930
0 0000320193-22-000108 CashAndCashEquivalentsAtCarryingValue us-gaap/2022 BS 4 3 USD False 0 NaN NaN 3.494000e+10 NaN 2.364600e+10
1 0000320193-22-000108 MarketableSecuritiesCurrent us-gaap/2022 BS 4 4 USD False 0 NaN NaN 2.769900e+10 NaN 2.465800e+10
2 0000320193-22-000108 AccountsReceivableNetCurrent us-gaap/2022 BS 4 5 USD False 0 NaN NaN 2.627800e+10 NaN 2.818400e+10
3 0000320193-22-000108 InventoryNet us-gaap/2022 BS 4 6 USD False 0 NaN NaN 6.580000e+09 NaN 4.946000e+09
4 0000320193-22-000108 NontradeReceivablesCurrent us-gaap/2022 BS 4 7 USD False 0 NaN NaN 2.522800e+10 NaN 3.274800e+10

Filtering for the balance sheet

Finally, we are going to produce the balance sheet as it appears in the report above.

First, we only select the columns we need. Since our dataset only contain the information of one report, we can get rid of the adsh. The version is also the same for all the tags, so we also get rid of it. Finally, the report only shows the data for September 2022 and 2021, we only filter for these columns:

apple_data_2021_2022_df = apple_num_pre_merged_pivot_df[["tag", "stmt", "report", "line", "uom", "inpth", "negating", 20220930,20210930]]

Note: the โ€œdatesโ€ 20220930 and 20210930 were loaded as pure numbers, since we didnโ€™t specify a schema for the data.

Now, we can just display the data for the balance sheet:

apple_data_2021_2022_df[apple_data_2021_2022_df.stmt == "BS"]
tag stmt report line uom inpth negating 20220930 20210930
0 CashAndCashEquivalentsAtCarryingValue BS 4 3 USD 0 False 2.364600e+10 3.494000e+10
1 MarketableSecuritiesCurrent BS 4 4 USD 0 False 2.465800e+10 2.769900e+10
2 AccountsReceivableNetCurrent BS 4 5 USD 0 False 2.818400e+10 2.627800e+10
3 InventoryNet BS 4 6 USD 0 False 4.946000e+09 6.580000e+09
4 NontradeReceivablesCurrent BS 4 7 USD 0 False 3.274800e+10 2.522800e+10
5 OtherAssetsCurrent BS 4 8 USD 0 False 2.122300e+10 1.411100e+10
6 AssetsCurrent BS 4 9 USD 0 False 1.354050e+11 1.348360e+11
7 MarketableSecuritiesNoncurrent BS 4 11 USD 0 False 1.208050e+11 1.278770e+11
8 PropertyPlantAndEquipmentNet BS 4 12 USD 0 False 4.211700e+10 3.944000e+10
9 OtherAssetsNoncurrent BS 4 13 USD 0 False 5.442800e+10 4.884900e+10
10 AssetsNoncurrent BS 4 14 USD 0 False 2.173500e+11 2.161660e+11
11 Assets BS 4 15 USD 0 False 3.527550e+11 3.510020e+11
12 AccountsPayableCurrent BS 4 18 USD 0 False 6.411500e+10 5.476300e+10
13 OtherLiabilitiesCurrent BS 4 19 USD 0 False 6.084500e+10 4.749300e+10
14 ContractWithCustomerLiabilityCurrent BS 4 20 USD 0 False 7.912000e+09 7.612000e+09
15 CommercialPaper BS 4 21 USD 0 False 9.982000e+09 6.000000e+09
16 LongTermDebtCurrent BS 4 22 USD 0 False 1.112800e+10 9.613000e+09
17 LiabilitiesCurrent BS 4 23 USD 0 False 1.539820e+11 1.254810e+11
18 LongTermDebtNoncurrent BS 4 25 USD 0 False 9.895900e+10 1.091060e+11
19 OtherLiabilitiesNoncurrent BS 4 26 USD 0 False 4.914200e+10 5.332500e+10
20 LiabilitiesNoncurrent BS 4 27 USD 0 False 1.481010e+11 1.624310e+11
21 Liabilities BS 4 28 USD 0 False 3.020830e+11 2.879120e+11
22 CommonStocksIncludingAdditionalPaidInCapital BS 4 31 USD 0 False 6.484900e+10 5.736500e+10
23 RetainedEarningsAccumulatedDeficit BS 4 32 USD 0 False -3.068000e+09 5.562000e+09
24 AccumulatedOtherComprehensiveIncomeLossNetOfTax BS 4 33 USD 0 False -1.110900e+10 1.630000e+08
25 StockholdersEquity BS 4 34 USD 0 False 5.067200e+10 6.309000e+10
26 LiabilitiesAndStockholdersEquity BS 4 35 USD 0 False 3.527550e+11 3.510020e+11
27 CommonStockParOrStatedValuePerShare BS 5 1 USD 1 False 0.000000e+00 0.000000e+00
28 CommonStockSharesAuthorized BS 5 2 shares 1 False 5.040000e+10 5.040000e+10
29 CommonStockSharesIssued BS 5 3 shares 1 False 1.594342e+10 1.642679e+10
30 CommonStockSharesOutstanding BS 5 4 shares 1 False 1.594342e+10 1.642679e+10