Daily-Sec-Financial-Statement-Dataset एपीआई मुक्त ट्यूटोरियल ( hansjoerg.wingeier )

Daily-Sec-Financial-Statement-Dataset

फ्रीमियम

द्वारा Hansjörg Wingeier | अपडेट किया गया 2 месяца назад | Finance

लोकप्रियता

8.5 / 10

लेटेंसी

1,496ms

सेवा का स्तर

96%

Health Check

N/A

सभी ट्यूटोरियल पर वापस जाएं (2)

Consuming the data

Introduction

Once you have downloaded the zip file, you can easily explore the data just with a standard texteditor. But it is more fun to do it with Python.

Let me explain to you how to do this with just a few lines of code

Imports you need

import pandas as pd
import zipfile
from pathlib import Path

We will use pandas to explore the data. Moreover, since the content is in a zip-file, we will use zipfile. Path from pathlib helps to access the file in a simple way.

Reading the content of the zip file

Inside the zip-file are three files:

The sub.txt, which contains all the submissions, meaning all the reports that were filed at that day
The num.txt, which contains the all the detail data (numbers) of every report
The pre.txt, which contains the presentation of the data, meaning mainly in which table the data was presented (eg. BalanceSheet, CashFlow statement, …)

For our examples, we will use the daily file from the 28th October, 2022. This was the day when also Apple published its annual report.

First, we define a few constants:

example_file = "daily/20221028.zip"
SUB_TXT = "sub.txt"
NUM_TXT = "num.txt"
PRE_TXT = "pre.txt"

Next we write a simple function that extracts the content of a single file from the zip-file directly as a pandas dataframe:

def read_df_from_file_in_zip(zip_file: str, file_to_extract: str, dtype = None,  usecols=None) -> pd.DataFrame:
    with zipfile.ZipFile(zip_file, "r") as zf:
        file = Path(file_to_extract).name
        return pd.read_csv(zf.open(file), header=0, delimiter="\t", dtype = dtype, usecols=usecols)

Let us read the content of the sub.txt, num.txt, pre.txt directly into panda dataframes:

sub_df = read_df_from_file_in_zip(zip_file=example_file, file_to_extract=SUB_TXT)
num_df = read_df_from_file_in_zip(zip_file=example_file, file_to_extract=NUM_TXT)
pre_df = read_df_from_file_in_zip(zip_file=example_file, file_to_extract=PRE_TXT)

Now, we are ready to explore the data.

The submission file

sub_df.head(5) displays the first 5 rows of the file:

	adsh	cik	name	sic	fye	form	period	filed	accepted	fy_real	fy	fp
0	0000950170-22-020366	1000229	CORE LABORATORIES N V	NaN	1231	10-Q	20220930	20221028	NaN	2022	2022	Q3
1	0000100122-22-000037	100122	TUCSON ELECTRIC POWER CO	NaN	1231	10-Q	20220930	20221028	NaN	2022	2022	Q3
2	0001493152-22-029844	1006028	PURE BIOSCIENCE, INC.	NaN	731	10-K	20220731	20221028	NaN	2022	2022	FY
3	0001493152-22-029857	1008653	ECOMAX, INC	NaN	630	10-Q	20220930	20221028	NaN	2023	2023	Q1
4	0000950170-22-020458	1015739	AWARE INC /MA/	NaN	1231	10-Q	20220930	20221028	NaN	2022	2022	Q3

You find the details about the columns in this document: Financial Statements Data Sets
(Note: the daily-zip files contain only a subset of the attributes).

The most important attributes are:

adsh: the “accession number” - every report receives its unique id
cik: the “central index key” - every company that has to file reports with the SEC has its own cik
form: the “submission type of the filling” - e.g, 10-k is an annual report, 10-q is a quarterly report
period: the “balance sheet date” - the date of the balance sheet
filed: the “filing date” - the date when the report was filed with the sec

In order to find the CIK for a company, you can go to EDGAR Company Filings. For instance, if you search for “apple” you find the CIK “320193” (Note: you don’t need the leading zeros.

So lets find the submission entry for apple:

apple_sub_df = sub_df[sub_df.cik==320193][:1]
apple_sub_df

	adsh	cik	name	sic	fye	form	period	filed	accepted	fy_real	fy	fp
98	0000320193-22-000108	320193	APPLE INC.	NaN	930	10-K	20220930	20221028	NaN	2022	2022	FY

Let us store the adsh id of the annual report from apple of 2022 in its own variable:

apple_adsh = apple_sub_df.adsh.values[0]

Basics about num.txt and pre.txt

The num.txt files contains the raw numbers of data inside the report and the pre.txt file tells us in which table this value appears.

Again, it is important to understand the basic attributes of these two tables, so let’s have a look at the document Financial Statements Data Sets again

Main attributes of the num.txt.

adsh, tag, version, and ddate: this 4 attributes identify a certain value within the whole num.txt file. The adsh defines to which report the entry belongs, the tag defines what it is (like “Current Assets”), the version defines in which xbrl taxonomy this tag is defined, and the ddate tells you for which date the value was produced for
uom: the unit of measure - either the currency or the unit of the value, like shares
value: the value itself

Main attributes of the pre.txt:

adsh, tag, version: identify which tag is presented. Again, the adsh is the unique number of the report, the tag tells what is displayed and the version defines the taxonomy where the tag is defined
stmt: this attribute defines the (financial) statement to which the value belongs to (BS=Balance Sheet, IS=Income Statement, CF=Cash Flow, EQ = Equity, …)
report: is used to group the information of a single table together
line: defines at which line of the table the tag is presented
inpth: define if the values of the tag are presented within parentheses in the text column of the table.
negating: shows if a number is displayed as a negative number (for instance, if a number appears in parentheses like “(203.30)” the value of the corresponding entry in the num.txt file could actually be -203.30, or it could be 203.30 but with the negating flag set to True

This will become more clear with the following example.

We are going to reproduce the information in the Balance Sheet of apple’s 10-k report for 2022. If you go to APPLE 10-K 2022 you’ll see the filed report of apple published at the sec website. Open the link, press on “Sections” right next to “Menu” on the toolbar, Open “Financial Statements” and choose “CONSOLIDATED BALANCE SHEETS”. This will directly show you the page of the report containing the balance sheet:

Balance Sheet of Apple 2022

Every numeric entry in the table is actually a value that has to be present in the num.txt file. For instance if you click on the value of “Cash and Cash Equivalents” in the the column for September, 24th 2022, A pop up window appears that shows you the details of this entry (Tag: “CashAndCashEquivalentsAtCarryingValue”, Fact (value): 23,646,000,000, Period (ddate): As of 09/24/2022)

So let us find out what data for “CashAndCashEquivalentsAtCarryingValue” is available in the num.txt file:

 num_df[(num_df.adsh==apple_adsh) & (num_df.tag=='CashAndCashEquivalentsAtCarryingValue')]

	adsh	tag	version	ddate	qtrs	coreg	uom_ext	uom	value	decimals	footnote
40281	0000320193-22-000108	CashAndCashEquivalentsAtCarryingValue	us-gaap/2022	20210930	0	NaN	USD	34940000000.0	6.0	NaN
40282	0000320193-22-000108	CashAndCashEquivalentsAtCarryingValue	us-gaap/2022	20220930	0	NaN	USD	23646000000.0	6.0	NaN

Indeed, we find the two entries, one for the end of September of 2022 and one for the end of September of 2021. Note: the SEC does “round” the dates to the end of the month, so that is the reason why the “ddate” is 20220930, resp. 20210930.

Next, we look at the entries for “CashAndCashEquivalentsAtCarryingValue” in the pre.txt file:

pre_df[(pre_df.adsh==apple_adsh) & (pre_df.tag=='CashAndCashEquivalentsAtCarryingValue')]

adsh	tag	version	report	line	stmt	negating	inpth	rfile

18056 0000320193-22-000108 CashAndCashEquivalentsAtCarryingValue us-gaap/2022 4 3 BS False 0 -

	adsh	tag	version	report	line	stmt	negating	inpth	rfile
18056	0000320193-22-000108	CashAndCashEquivalentsAtCarryingValue	us-gaap/2022	4	3	BS	False	0

The “stmt” is BS which stands for balance sheet. Furthermore, the “line” is 3. If you have a look at the report itself, you see that in the first row of the balance sheet is the text “ASSETS:”. In the second row the text “Current assets:”. The third row finally contains the “Cash and cash equivalents”.

Merging the data from num.txt and pre.txt into a single dataframe

Next we want to merge the data from num.txt and pre.txt together. As you can see in the description of the Financial Statements Data Sets, the keys that define the relation between them are “adsh”, “tag”, and “version”.

First, let us create dataframes that only contain apple data and only the columns that we are interested in to “rebuild” the balance sheet:

apple_num_df = num_df[num_df.adsh==apple_adsh]
apple_pre_df = pre_df[pre_df.adsh==apple_adsh]
apple_num_short_df = apple_num_df[['adsh', 'tag', 'version', 'ddate', 'uom', 'value']]
apple_pre_short_df = apple_pre_df[['adsh', 'tag', 'version', 'report', 'line', 'stmt','inpth', 'negating' ]]

Next we use a simple pandas.merge() to merge the two datasets together:

apple_num_pre_merged_df = pd.merge(apple_num_short_df, apple_pre_short_df, on=['adsh', 'tag', 'version'])
apple_num_pre_merged_df.head(5) #display the first 5 rows of the result

idx	adsh	tag	version	ddate	uom	value	report	line	stmt	inpth	negating
0	0000320193-22-000108	CommonStockDividendsPerShareDeclared	us-gaap/2022	20220930	USD	0.9	6	21	EQ	0	False
1	0000320193-22-000108	CommonStockDividendsPerShareDeclared	us-gaap/2022	20210930	USD	0.85	6	21	EQ	0	False
2	0000320193-22-000108	CommonStockDividendsPerShareDeclared	us-gaap/2022	20200930	USD	0.795	6	21	EQ	0	False
3	0000320193-22-000108	CommonStockParOrStatedValuePerShare	us-gaap/2022	20220930	USD	0.0	5	1	BS	1	False
4	0000320193-22-000108	CommonStockParOrStatedValuePerShare	us-gaap/2022	20210930	USD	0.0	5	1	BS	1	False

Pivot the table to have the ddate values as columns

If we look at the last table, we see that we have different dates in the ddate column for the same tag. That was to expected, since every annual or quarterly report always contains data from the previous periods. Furthermore, the financial statements (like the balance sheet shown above) usually also contains the values of the previous period.

What we want to do now, is to pivot the table, so that we have the different ddates as columns.

apple_num_pre_merged_pivot_df = apple_num_pre_merged_df.pivot_table(index=["adsh", "tag", "version", "stmt", "report", "line", "uom", "inpth"] ,columns="ddate", values="value")

adsh	tag	version	stmt	report	line	uom	negating	20190930	20200930	20210930	20220331	20220930
0000320193-22-000108	AccountsPayableCurrent	us-gaap/2022	BS	4	18	USD	False	NaN	NaN	5.476300e+10	NaN	6.411500e+10
^	AccountsReceivableNetCurrent	us-gaap/2022	BS	4	5	USD	False	NaN	NaN	2.627800e+10	NaN	2.818400e+10
^	AccumulatedOtherComprehensiveIncomeLossNetOfTax	us-gaap/2022	BS	4	33	USD	False	NaN	NaN	1.630000e+08	NaN	-1.110900e+10
^	Assets	us-gaap/2022	BS	4	15	USD	False	NaN	NaN	3.510020e+11	NaN	3.527550e+11
^	AssetsCurrent	us-gaap/2022	BS	4	9	USD	False	NaN	NaN	1.348360e+11	NaN	1.354050e+11

Note: after pivoting, the columns adsh, tag, version, stmt, report, line, uom, negating, and inpth are now in the idx. Furthermore, the columns themself also have an index “ddate”. Next, the column order is not really appropriate, and we also don’t really need the multilevel index, so let’s clean that up:

apple_num_pre_merged_pivot_df.rename_axis(None, axis=1, inplace=True)
apple_num_pre_merged_pivot_df.sort_values(['stmt', 'report', 'line', 'inpth'], inplace=True)
apple_num_pre_merged_pivot_df.reset_index(drop=False, inplace=True)

	adsh	tag	version	stmt	report	line	uom	negating	20190930	20200930	20210930	20220331	20220930
0	0000320193-22-000108	CashAndCashEquivalentsAtCarryingValue	us-gaap/2022	BS	4	3	USD	False	NaN	NaN	3.494000e+10	NaN	2.364600e+10
1	0000320193-22-000108	MarketableSecuritiesCurrent	us-gaap/2022	BS	4	4	USD	False	NaN	NaN	2.769900e+10	NaN	2.465800e+10
2	0000320193-22-000108	AccountsReceivableNetCurrent	us-gaap/2022	BS	4	5	USD	False	NaN	NaN	2.627800e+10	NaN	2.818400e+10
3	0000320193-22-000108	InventoryNet	us-gaap/2022	BS	4	6	USD	False	NaN	NaN	6.580000e+09	NaN	4.946000e+09
4	0000320193-22-000108	NontradeReceivablesCurrent	us-gaap/2022	BS	4	7	USD	False	NaN	NaN	2.522800e+10	NaN	3.274800e+10

Filtering for the balance sheet

Finally, we are going to produce the balance sheet as it appears in the report above.

First, we only select the columns we need. Since our dataset only contain the information of one report, we can get rid of the adsh. The version is also the same for all the tags, so we also get rid of it. Finally, the report only shows the data for September 2022 and 2021, we only filter for these columns:

apple_data_2021_2022_df = apple_num_pre_merged_pivot_df[["tag", "stmt", "report", "line", "uom", "inpth", "negating", 20220930,20210930]]

Note: the “dates” 20220930 and 20210930 were loaded as pure numbers, since we didn’t specify a schema for the data.

Now, we can just display the data for the balance sheet:

apple_data_2021_2022_df[apple_data_2021_2022_df.stmt == "BS"]

	tag	stmt	report	line	uom	inpth	negating	20220930	20210930
0	CashAndCashEquivalentsAtCarryingValue	BS	4	3	USD	0	False	2.364600e+10	3.494000e+10
1	MarketableSecuritiesCurrent	BS	4	4	USD	0	False	2.465800e+10	2.769900e+10
2	AccountsReceivableNetCurrent	BS	4	5	USD	0	False	2.818400e+10	2.627800e+10
3	InventoryNet	BS	4	6	USD	0	False	4.946000e+09	6.580000e+09
4	NontradeReceivablesCurrent	BS	4	7	USD	0	False	3.274800e+10	2.522800e+10
5	OtherAssetsCurrent	BS	4	8	USD	0	False	2.122300e+10	1.411100e+10
6	AssetsCurrent	BS	4	9	USD	0	False	1.354050e+11	1.348360e+11
7	MarketableSecuritiesNoncurrent	BS	4	11	USD	0	False	1.208050e+11	1.278770e+11
8	PropertyPlantAndEquipmentNet	BS	4	12	USD	0	False	4.211700e+10	3.944000e+10
9	OtherAssetsNoncurrent	BS	4	13	USD	0	False	5.442800e+10	4.884900e+10
10	AssetsNoncurrent	BS	4	14	USD	0	False	2.173500e+11	2.161660e+11
11	Assets	BS	4	15	USD	0	False	3.527550e+11	3.510020e+11
12	AccountsPayableCurrent	BS	4	18	USD	0	False	6.411500e+10	5.476300e+10
13	OtherLiabilitiesCurrent	BS	4	19	USD	0	False	6.084500e+10	4.749300e+10
14	ContractWithCustomerLiabilityCurrent	BS	4	20	USD	0	False	7.912000e+09	7.612000e+09
15	CommercialPaper	BS	4	21	USD	0	False	9.982000e+09	6.000000e+09
16	LongTermDebtCurrent	BS	4	22	USD	0	False	1.112800e+10	9.613000e+09
17	LiabilitiesCurrent	BS	4	23	USD	0	False	1.539820e+11	1.254810e+11
18	LongTermDebtNoncurrent	BS	4	25	USD	0	False	9.895900e+10	1.091060e+11
19	OtherLiabilitiesNoncurrent	BS	4	26	USD	0	False	4.914200e+10	5.332500e+10
20	LiabilitiesNoncurrent	BS	4	27	USD	0	False	1.481010e+11	1.624310e+11
21	Liabilities	BS	4	28	USD	0	False	3.020830e+11	2.879120e+11
22	CommonStocksIncludingAdditionalPaidInCapital	BS	4	31	USD	0	False	6.484900e+10	5.736500e+10
23	RetainedEarningsAccumulatedDeficit	BS	4	32	USD	0	False	-3.068000e+09	5.562000e+09
24	AccumulatedOtherComprehensiveIncomeLossNetOfTax	BS	4	33	USD	0	False	-1.110900e+10	1.630000e+08
25	StockholdersEquity	BS	4	34	USD	0	False	5.067200e+10	6.309000e+10
26	LiabilitiesAndStockholdersEquity	BS	4	35	USD	0	False	3.527550e+11	3.510020e+11
27	CommonStockParOrStatedValuePerShare	BS	5	1	USD	1	False	0.000000e+00	0.000000e+00
28	CommonStockSharesAuthorized	BS	5	2	shares	1	False	5.040000e+10	5.040000e+10
29	CommonStockSharesIssued	BS	5	3	shares	1	False	1.594342e+10	1.642679e+10
30	CommonStockSharesOutstanding	BS	5	4	shares	1	False	1.594342e+10	1.642679e+10

पिछला Get your first dataset