
EDGAR is the Electronic Data Gathering, Analysis, and Retrieval system used by the SEC. Companies submit regulatory filings (such as quarterly reports) into the system, which are then made public for us to download. These filings are the primary focus of my research. In order to study the market reaction to them, I need to match each filing against a ticker.
CIK to Ticker
The SEC identifies both companies and individuals with a 10-digit central index key (CIK). I need to map that CIK to the appropriate time series in my market data (from Kibot), which are identified by ticker.
The starting point is the current mapping provided by the SEC: tickers.txt. That aligns fairly nicely with the Kibot ticker list. Most of the effort past this point is dealing with dead tickers, which are important for reducing survivorship bias.
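As a rough sketch of loading that mapping, the snippet below assumes the SEC file is plain text with one tab-separated ticker and CIK per line. That layout, the function name `load_cik_to_tickers`, and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict


def load_cik_to_tickers(text):
    """Parse 'ticker<TAB>cik' lines into {10-digit CIK string: set of tickers}."""
    mapping = defaultdict(set)
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        ticker, cik = parts
        # Zero-pad so the key matches the 10-digit CIK form.
        mapping[cik.zfill(10)].add(ticker.upper())
    return dict(mapping)


# Hypothetical sample rows, not real SEC data.
sample = "abc\t1234\nabc-b\t1234\nxyz\t987654\n"
cik_map = load_cik_to_tickers(sample)
```

Keeping a set of tickers per CIK leaves room for multiple share classes and for tickers added later from other sources.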
Ticker extraction from filings
I augmented my mapping by extracting tickers directly from filings where I could. Certain types of filings, 10-K and 10-Q, sometimes contain a “TradingSymbol” buried in the XBRL. But the richest sources of tickers are the ownership filings, forms 3, 4, and 5. Unfortunately, these are also the noisiest sources, since the ticker is entered by hand. Here is an example – all the unique tickers entered for NORTHWEST INDIANA BANCORP:
NIWB(OB)
NIWIB(OB)
NIWIN(OB)
NWIB (OB)
NWIB(OB)
NWIBOB)
NWIIN(OB)
NWIN
NWIN (OB)
NWIN()B)
NWIN(OB
NWIN(OB)
NWIN)OB)
NWIN.OB
NWINOB
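One possible way to tame a list like this, offered as a sketch rather than as what I actually did, is to keep only the leading run of letters from each entry and then count votes. The helper name `clean_ticker` is hypothetical.

```python
import re
from collections import Counter

# The hand-entered variants listed above.
RAW = [
    "NIWB(OB)", "NIWIB(OB)", "NIWIN(OB)", "NWIB (OB)", "NWIB(OB)",
    "NWIBOB)", "NWIIN(OB)", "NWIN", "NWIN (OB)", "NWIN()B)",
    "NWIN(OB", "NWIN(OB)", "NWIN)OB)", "NWIN.OB", "NWINOB",
]


def clean_ticker(raw):
    """Keep only the leading run of letters, e.g. 'NWIN (OB)' -> 'NWIN'."""
    match = re.match(r"[A-Z]+", raw.strip().upper())
    return match.group(0) if match else None


# Count how many raw variants collapse to each cleaned ticker.
counts = Counter(t for t in map(clean_ticker, RAW) if t)
best_guess, votes = counts.most_common(1)[0]
```

On this list, 7 of the 15 variants collapse to NWIN. Entries with a fused suffix (NWINOB) or a typo in the symbol itself (NIWB) survive the cleanup, so checking the candidates against tickers that actually exist in the market data is still needed.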
Company name matching
Name matching is really tricky to get right. I decided to make some attempt at it, but limited it to only a few hours of effort. It would take a crazy amount of work to research thousands of company histories to accurately match name variations. I’ve limited my matching to names that are identical after minor cleanup, and to names that are extremely close, which I accepted only after a cursory manual review.
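To give a flavor of what "identical after minor cleanup" can mean, here is a minimal sketch. The specific cleanup rules (uppercasing, dropping periods, commas, and apostrophes, collapsing other punctuation and whitespace) and the names `clean_name` and `match_exact` are assumptions for illustration, not an exact record of my rules.

```python
import re


def clean_name(name):
    """Uppercase, drop . , ' and collapse other punctuation/whitespace to one space."""
    name = re.sub(r"[.,']", "", name.upper())
    # Anything that is not a letter, digit, or '&' becomes a single space.
    name = re.sub(r"[^A-Z0-9&]+", " ", name)
    return name.strip()


def match_exact(edgar_names, kibot_names):
    """Return {edgar_name: kibot_name} for names that are equal after cleanup."""
    by_clean = {clean_name(k): k for k in kibot_names}
    return {e: by_clean[clean_name(e)]
            for e in edgar_names if clean_name(e) in by_clean}


# Hypothetical spellings of the same company name.
matches = match_exact(["Northwest Indiana Bancorp, Inc."],
                      ["NORTHWEST INDIANA BANCORP INC"])
```

Names that fail this exact comparison are the ones that fall through to the fuzzy matching below.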
Levenshtein edit distance
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, and substitutions) required to turn one string into another. I used it for fuzzy name matching. The edit distance can be computed using dynamic programming. Happily, a fast C-based implementation is readily available in Python:
conda install python-levenshtein
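For reference, the dynamic program itself is short. The pure-Python sketch below is only an illustration of the technique (the function name `levenshtein` is my own); the installed package's `Levenshtein.distance(a, b)` should return the same numbers much faster.

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    # prev[j] = distance between the first i-1 chars of a and first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # turning i chars of a into an empty string takes i deletions
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if chars are equal
            ))
        prev = cur
    return prev[-1]


d = levenshtein("ENDWAVE CORP", "ENWAVE CORP")  # one deleted 'D'
```

The run time is proportional to the product of the two string lengths, which is why a C implementation matters when comparing thousands of names against thousands of others.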
For my company name matching, I looked at each EDGAR company name for which I had not yet found a ticker in the Kibot data, and I compared it to each company name listed by Kibot. I kept any pair that matched with no more than one edit. This left me with about 450 matches to review manually. I didn’t do any research on the names, but rather just went off intuition. Here are some names I think probably do match. One criterion I used was whether I thought one company would sue the other over too similar a name:
MAINSTREET BANKSHARES INC | MAINSTREET BANCSHARES INC |
ENDWAVE CORP | ENWAVE CORP |
UREX ENERGY CORP | REX ENERGY CORP |
CASH SYSTEMS INC | CASA SYSTEMS INC |
TRINSIC INC | ATRINSIC INC |
IBIZ TECHNOLOGY CORP | IBIS TECHNOLOGY CORP |
BIDVILLE INC | KIDVILLE INC |
JAZZ TECHNOLOGIES INC | JZZ TECHNOLOGIES INC |
INYX INC | INX INC |
VERIDIAN CORP | VERIDIEN CORP |
CENTREX INC | CEMTREX INC |
EXCELON CORP | EXELON CORP |
TETON ADVISORS INC | TETON ADVISORRS INC |
TELS CORP | TELUS CORP |
TELOS CORP | TELUS CORP |
310 HOLDINGS INC | P10 HOLDINGS INC |
VANSEN PHARMA INC | VENSEN PHARMA INC |
SOTECH INC | SOFTECH INC |
SONTERRA RESOURCES INC | BONTERRA RESOURCES INC |
FORTUNET INC | FORTINET INC |
PANEX RESOURCES INC | PAREX RESOURCES INC |
VIVOS INC | VIVUS INC |
ORTEC INTERNATIONAL INC | OPTEC INTERNATIONAL INC |
DFP HEALTHCARE ACQUISITIONS CORP | DFB HEALTHCARE ACQUISITIONS CORP |
ZORO MINING CORP | CORO MINING CORP |
INVESTNET INC | ENVESTNET INC |
LINK ENERGY LLC | LINN ENERGY LLC |
LASON INC | LAWSON INC |
VALCOM INC | VOLCOM INC |
REIGN SAPPHIRE CORP | REIGN SAPPPHIRE CORP |
And here are some very similar names I think probably do not match, or at least not with a likelihood that makes me comfortable:
L3 CORP | LG CORP |
K2 INC | K12 INC |
RANCON REALTY FUND IV | RANCON REALTY FUND V |
DCB FINANCIAL CORP | HCB FINANCIAL CORP |
BLOX INC | BOX INC |
UVIC INC | UBIC INC |
OMEGA BRANDS INC | MEGA BRANDS INC |
ROME BANCORP INC | HOME BANCORP INC |
MILLS CORP | HILLS CORP |
VINCERA INC | FINCERA INC |
JPS INDUSTRIES INC | GPS INDUSTRIES INC |
STEVIA CORP | STEVVA CORP |
INNOVEX INC | INNOVET INC |
IDENTIX INC | IDENTIV INC |
VANS INC | EVANS INC |
BANTA CORP | BAETA CORP |
META GROUP INC | MEGA GROUP INC |
LANGER INC | HANGER INC |
SLAP INC | SNAP INC |
COREL CORP | CORVEL CORP |
SPECTRX INC | SPECTRA INC |
ZEC INC | ZEP INC |
CTD HOLDINGS INC | CT HOLDINGS INC |
SOMO INC | DOMO INC |
SBT BANCORP INC | S&T BANCORP INC |
CARDAX INC | CARMAX INC |
SEEC INC | STEC INC |
NETRO CORP | NEVRO CORP |
GENUS INC | AGENUS INC |
REON HOLDINGS INC | AEON HOLDINGS INC |
There were a lot of borderline cases, but I hope my manual review added some value. The more I match correctly, the less likely I’ll have survivorship bias. But the more I match incorrectly, the more noise I’ll introduce into my research.
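The one-edit filter that produced the candidates above can be sketched as follows. Since only distances of 0 or 1 matter, this sketch swaps the general Levenshtein computation for a direct "at most one edit" check; the function names and the brute-force double loop are illustrative assumptions, not my exact code.

```python
def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    # Find the first position where the strings disagree.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # substitution: the rest must match
    return a[i:] == b[i + 1:]          # insertion into a: skip one char of b


def candidate_pairs(unmatched_edgar, kibot_names):
    """All (EDGAR name, Kibot name) pairs within one edit, for manual review."""
    return [(e, k) for e in unmatched_edgar for k in kibot_names
            if within_one_edit(e, k)]


edgar = ["UREX ENERGY CORP", "K2 INC", "ACME WIDGETS INC"]  # last name is made up
kibot = ["REX ENERGY CORP", "K12 INC", "SNAP INC"]
pairs = candidate_pairs(edgar, kibot)
```

The output is only a review queue: as the two lists above show, a single-character difference is just as often a different company as a spelling variant.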
Mapping by date
Tickers change over time, so my ticker map is also keyed by date. I ended up doing something pretty simple. For each issuer CIK, I collect all distinct tickers, both from the official ticker mapping and from the filing extractions. Then, for each date, I look at which, if any, of those tickers I have market data for, and I pick the one with the most volume on the day prior to that date. I use the day before to avoid look-ahead bias: high-volume days tend to be associated with positive returns, so selecting on same-day volume would tilt the map toward tickers that went up that day. Note the implied assumption that I would want to trade the most liquid share class, even if the filing involved an insider transaction for a different share class.
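A minimal sketch of that selection rule is below. The data layout (a dict of per-ticker daily volumes), the function name `pick_ticker`, and the sample tickers are assumptions for illustration. It also simplifies "the day prior" to the latest earlier date on which any candidate has data, where a real version would use a trading calendar.

```python
from datetime import date


def pick_ticker(candidates, volume, as_of):
    """Pick the candidate with the most volume on the last data day before as_of.

    volume maps ticker -> {date: shares traded}. Returns None if no candidate
    has any data strictly before as_of.
    """
    prior_days = [d for t in candidates for d in volume.get(t, {}) if d < as_of]
    if not prior_days:
        return None
    prior = max(prior_days)  # strictly before as_of, so no same-day information
    traded = {t: volume[t][prior] for t in candidates
              if prior in volume.get(t, {})}
    return max(traded, key=traded.get)


# Hypothetical two-share-class issuer.
volume = {
    "AAA":   {date(2020, 1, 2): 1000, date(2020, 1, 3): 50},
    "AAA.B": {date(2020, 1, 2): 200,  date(2020, 1, 3): 900},
}
choice = pick_ticker(["AAA", "AAA.B", "NODATA"], volume, date(2020, 1, 3))
```

Here the choice for January 3 is driven by January 2 volume, so the class that happens to trade heavily on the filing date itself does not influence the pick.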
Conclusion
My conclusion is that I should probably have searched out a company identifier database to buy, ideally one that included daily prices and corporate actions. Years ago, I used the Compustat dataset, which included CIKs, but that cost about $25k/year even back then. Once I start making money, I’ll be better able to justify buying higher quality data.