EDGAR tickers

EDGAR is the Electronic Data Gathering, Analysis, and Retrieval system used by the SEC. Companies submit regulatory filings (such as quarterly reports) to the system, and those filings are then made public for us to download. These filings are the primary focus of my research. In order to study the market reaction to them, I need to match each filing against a ticker.

CIK to Ticker

The SEC identifies both companies and individuals with a 10-digit central index key (CIK). I need to map that CIK to the appropriate time series in my market data (from Kibot), which are identified by ticker.

The starting point is the current mapping provided by the SEC: tickers.txt. That aligns fairly nicely with the Kibot ticker list. Most of the effort past this point goes into dead tickers, which I need to include in order to reduce survivorship bias.
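As a rough sketch of loading that starting map, assuming the SEC file holds one ticker and one CIK per line separated by whitespace (the layout, the function name parse_ticker_file, and the sample rows are all assumptions for illustration):

```python
from collections import defaultdict

def parse_ticker_file(text):
    """Map zero-padded 10-digit CIK -> set of upper-case tickers.

    Assumes each line looks like "<ticker><whitespace><cik>".
    """
    cik_to_tickers = defaultdict(set)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        ticker, cik = parts
        if not cik.isdigit():
            continue  # skip rows whose second field is not a numeric CIK
        cik_to_tickers[cik.zfill(10)].add(ticker.upper())
    return dict(cik_to_tickers)

# Made-up rows, not real SEC data.
sample = "aaa\t1234\naaab\t1234\nzzz\t987654\n"
mapping = parse_ticker_file(sample)
```

Keying on the zero-padded CIK keeps the map consistent with the 10-digit form, and storing a set per CIK leaves room for multiple share classes.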

Ticker extraction from filings

I augmented my mapping by extracting tickers directly from filings where I could. Certain types of filings, 10-K and 10-Q, sometimes contain a “TradingSymbol” buried in the XBRL. But the richest sources of tickers are the ownership filings, forms 3, 4, and 5. Unfortunately, these are also the noisiest sources, since the ticker is entered by hand. Here is an example – all the unique tickers entered for NORTHWEST INDIANA BANCORP:

NIWB(OB)
NIWIB(OB)
NIWIN(OB)
NWIB (OB)
NWIB(OB)
NWIBOB)
NWIIN(OB)
NWIN
NWIN (OB)
NWIN()B)
NWIN(OB
NWIN(OB)
NWIN)OB)
NWIN.OB
NWINOB
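One possible way to tame entries like these is to keep only the leading run of letters and treat the result as a candidate ticker. This is a hedged sketch, not the cleanup actually used here; clean_ticker is a hypothetical helper, and the rule would also truncate legitimate share-class suffixes (anything after a dot or dash).

```python
import re

def clean_ticker(raw):
    """Return the leading run of letters from a hand-entered ticker, or None."""
    match = re.match(r"[A-Z]+", raw.strip().upper())
    return match.group(0) if match else None

# A few of the hand-entered variants listed above.
raw_tickers = ["NIWB(OB)", "NWIB (OB)", "NWIN", "NWIN()B)",
               "NWIN(OB", "NWIN.OB", "NWINOB"]

# Distinct candidates after cleanup.
candidates = sorted({t for t in map(clean_ticker, raw_tickers) if t})
```

Variants with no separator, such as NWINOB, survive as their own candidates. That is tolerable if candidates are later filtered against the tickers that actually have market data.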

Company name matching

Name matching is really tricky to get right. I decided to make some attempt at it, but limited it to only a few hours of effort. It would take a crazy amount of work to research thousands of company histories to accurately match name variations. I’ve limited my matching to names that are identical after minor cleanup, and also names that are extremely close, but only after a cursory manual review.
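The post does not spell out what "minor cleanup" covers, so the following is only a guess at a reasonable version: upper-case the name, turn punctuation into spaces (keeping "&"), and collapse whitespace. The function name normalize_name is hypothetical.

```python
import re

def normalize_name(name):
    """Upper-case, replace punctuation with spaces, collapse whitespace."""
    name = name.upper()
    name = re.sub(r"[^A-Z0-9& ]+", " ", name)  # anything else becomes a space
    return re.sub(r" +", " ", name).strip()

# Two spellings that differ only in case, punctuation, and spacing.
a = normalize_name("  Rex Energy,  Corp. ")
b = normalize_name("REX ENERGY CORP")
```

Names that compare equal after this step can be matched with a plain dictionary lookup, leaving the fuzzy comparison for whatever remains.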

Levenshtein edit distance

The Levenshtein distance is the minimum number of edits required to turn one string into another, where an edit is a character insertion, deletion, or substitution. I used it for fuzzy name matching. The edit distance can be computed using dynamic programming. Happily, a fast C-based implementation is readily available in Python:

conda install python-levenshtein
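For reference, here is a minimal pure-Python sketch of the dynamic program itself. It is far slower than the C library and is shown only to illustrate the recurrence; the function name levenshtein is my own choice for the example.

```python
def levenshtein(a, b):
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

d = levenshtein("EXCELON CORP", "EXELON CORP")
```

Each cell depends only on the row above and the cell to its left, which is why two rows are enough.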

For my company name matching, I looked at each EDGAR company name for which I had not yet found a ticker in the Kibot data, and I compared it to each company name listed by Kibot. I kept any pair that matched with no more than one edit. This left me with about 450 matches to review manually. I didn’t do any research on the names, but rather just went off intuition. Here are some names I think probably do match. One criterion I used was whether I thought one company would sue the other over too similar a name:

MAINSTREET BANKSHARES INC | MAINSTREET BANCSHARES INC
ENDWAVE CORP | ENWAVE CORP
UREX ENERGY CORP | REX ENERGY CORP
CASH SYSTEMS INC | CASA SYSTEMS INC
TRINSIC INC | ATRINSIC INC
IBIZ TECHNOLOGY CORP | IBIS TECHNOLOGY CORP
BIDVILLE INC | KIDVILLE INC
JAZZ TECHNOLOGIES INC | JZZ TECHNOLOGIES INC
INYX INC | INX INC
VERIDIAN CORP | VERIDIEN CORP
CENTREX INC | CEMTREX INC
EXCELON CORP | EXELON CORP
TETON ADVISORS INC | TETON ADVISORRS INC
TELS CORP | TELUS CORP
TELOS CORP | TELUS CORP
310 HOLDINGS INC | P10 HOLDINGS INC
VANSEN PHARMA INC | VENSEN PHARMA INC
SOTECH INC | SOFTECH INC
SONTERRA RESOURCES INC | BONTERRA RESOURCES INC
FORTUNET INC | FORTINET INC
PANEX RESOURCES INC | PAREX RESOURCES INC
VIVOS INC | VIVUS INC
ORTEC INTERNATIONAL INC | OPTEC INTERNATIONAL INC
DFP HEALTHCARE ACQUISITIONS CORP | DFB HEALTHCARE ACQUISITIONS CORP
ZORO MINING CORP | CORO MINING CORP
INVESTNET INC | ENVESTNET INC
LINK ENERGY LLC | LINN ENERGY LLC
LASON INC | LAWSON INC
VALCOM INC | VOLCOM INC
REIGN SAPPHIRE CORP | REIGN SAPPPHIRE CORP
Company name pairs I guessed DO match

And here are some very similar names I think probably do not match, or at least not with a likelihood that makes me comfortable:

L3 CORP | LG CORP
K2 INC | K12 INC
RANCON REALTY FUND IV | RANCON REALTY FUND V
DCB FINANCIAL CORP | HCB FINANCIAL CORP
BLOX INC | BOX INC
UVIC INC | UBIC INC
OMEGA BRANDS INC | MEGA BRANDS INC
ROME BANCORP INC | HOME BANCORP INC
MILLS CORP | HILLS CORP
VINCERA INC | FINCERA INC
JPS INDUSTRIES INC | GPS INDUSTRIES INC
STEVIA CORP | STEVVA CORP
INNOVEX INC | INNOVET INC
IDENTIX INC | IDENTIV INC
VANS INC | EVANS INC
BANTA CORP | BAETA CORP
META GROUP INC | MEGA GROUP INC
LANGER INC | HANGER INC
SLAP INC | SNAP INC
COREL CORP | CORVEL CORP
SPECTRX INC | SPECTRA INC
ZEC INC | ZEP INC
CTD HOLDINGS INC | CT HOLDINGS INC
SOMO INC | DOMO INC
SBT BANCORP INC | S&T BANCORP INC
CARDAX INC | CARMAX INC
SEEC INC | STEC INC
NETRO CORP | NEVRO CORP
GENUS INC | AGENUS INC
REON HOLDINGS INC | AEON HOLDINGS INC
Company name pairs which I’m NOT sure are a match

There were a lot of borderline cases, but I hope my manual review added some value. The more I match correctly, the less likely I’ll have survivorship bias. But the more I match incorrectly, the more noise I’ll introduce into my research.
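The candidate search that produced those review lists (each unmatched EDGAR name against each Kibot name, keeping pairs within one edit) might be sketched as follows. Because the cutoff is a single edit, this sketch swaps in a linear-time "within one edit" check instead of computing the full Levenshtein distance; within_one_edit and candidate_pairs are hypothetical names, and the sample names are taken from the lists above.

```python
def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra character in b at position i

def candidate_pairs(unmatched_edgar, kibot_names):
    """All (EDGAR name, Kibot name) pairs within one edit, for manual review."""
    return [(e, k) for e in unmatched_edgar for k in kibot_names
            if within_one_edit(e, k)]

pairs = candidate_pairs(["EXCELON CORP", "VANS INC"],
                        ["EXELON CORP", "EVANS INC", "CARMAX INC"])
```

The all-pairs loop is quadratic in the number of names, so the length check at the top does most of the pruning.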

Mapping by date

Tickers change over time, so my ticker map is also keyed by date. I ended up doing something pretty simple. For each issuer CIK, I collect all distinct tickers, both from the official ticker mapping and from the filing extractions. Then, for each date, I look at which of those tickers, if any, I have market data for, and I pick the one with the most volume on the day prior to that date. I use the day before to avoid look-ahead bias: high-volume days tend to be associated with positive returns, so selecting on same-day volume would leak information about that day's return. Note the implied assumption that I would want to trade the most liquid share class, even if the filing involved an insider transaction for a different share class.
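A minimal sketch of that selection rule, under a few assumptions of mine: volume data is held as a dict of ticker to {date: volume}, and "the day prior" means the most recent date before the target on which any candidate traded. The function name pick_ticker, the tickers, and the volumes are all made up for illustration.

```python
from datetime import date

def pick_ticker(candidates, volume, as_of):
    """Pick the candidate with the most volume on the last trading day before as_of.

    volume: {ticker: {date: shares traded}}. Returns None if nothing traded earlier.
    """
    prior_days = [d for t in candidates for d in volume.get(t, {}) if d < as_of]
    if not prior_days:
        return None
    prior = max(prior_days)  # strictly before as_of, so no same-day information
    traded = {t: volume[t][prior] for t in candidates
              if prior in volume.get(t, {})}
    return max(traded, key=traded.get)

# Hypothetical volumes for two share classes of one issuer.
volume = {
    "XYZA": {date(2020, 1, 2): 1000, date(2020, 1, 3): 5000},
    "XYZB": {date(2020, 1, 2): 3000},
}
choice = pick_ticker(["XYZA", "XYZB", "XYZC"], volume, date(2020, 1, 3))
```

Here the filing date is January 3, so only January 2 volume is consulted and XYZB wins even though XYZA trades more on the filing date itself.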

Conclusion

My conclusion is that I should probably have searched out a company identifier database to buy, ideally one that included daily prices and corporate actions. Years ago, I used the Compustat dataset, which had CIKs. But that was about $25k/year even back then. Once I start making money, I’ll be better able to justify buying higher quality data.