I need precise timestamps in order to study the market reaction to news. Sadly, the SEC has not joined the exchanges in providing nanosecond timestamps from GPS-synced rubidium atomic clocks. Rather, it looks like the best EDGAR timestamps I can get from the SEC are only accurate to within a couple of minutes. Here are three ways to get the timestamps:
- Filing header XML
- “Oldloads” archives
- Continuous polling
Filing header XML
The typical way I’ve downloaded filings in bulk is through feed archives. These are .tar.gz files that can be decompressed into a folder to get all the individual filings for the day. The filings have XML headers but, inexplicably, the headers do not contain the SEC’s timestamp. Instead, you can follow these three steps to get the timestamps:
- Download the daily-index files. These give you a list of all filings for the day, including the issuer CIKs and the accession numbers.
- For each filing, build a URL out of the CIK with leading zeros removed, the accession number with dashes removed, and then the full accession number, such as this:
- The filing at that URL will contain the critical header line with the timestamp:
The acceptance timestamp falls some time after the file is uploaded and some time before it is made available for public dissemination. It has one-second precision, is written in a human-readable form without delimiting characters, and is in U.S. Eastern time.
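The URL construction and timestamp parsing described above can be sketched roughly as follows. This is a minimal sketch, not the exact code I ran: the base URL, the `.txt` suffix, and the `<ACCEPTANCE-DATETIME>` tag name with a 14-digit value are assumptions, and the CIK and accession number in the usage note are made up.

```python
import re
from datetime import datetime
from typing import Optional

# Assumption: base path for individual filings on sec.gov.
ARCHIVES_BASE = "https://www.sec.gov/Archives/edgar/data"

# Assumption: the header line is a tag followed by 14 digits (date then time).
ACCEPTANCE_RE = re.compile(r"<ACCEPTANCE-DATETIME>(\d{14})")


def filing_url(cik: str, accession: str) -> str:
    """Build a filing URL from a CIK and a dashed accession number."""
    cik_part = str(int(cik))             # CIK with leading zeros removed
    folder = accession.replace("-", "")  # accession number with dashes removed
    return f"{ARCHIVES_BASE}/{cik_part}/{folder}/{accession}.txt"


def parse_acceptance(line: str) -> Optional[datetime]:
    """Return the acceptance time as a naive datetime (Eastern wall-clock time)."""
    match = ACCEPTANCE_RE.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
```

For example, `filing_url("0000012345", "0000012345-20-000001")` builds a path from the three pieces in the order given in the steps above. The parsed datetime is deliberately left without a timezone, so it still needs to be localized to Eastern time before comparing it with exchange timestamps.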
Acceptance timestamps first became available in 2002, as seen in this chart:
And yes, I did download each filing (almost 12 million) this way, just for that one piece of information. It took a very long time. I sped things up by opening each URL as a stream, and closing the connection as soon as I’d read the acceptance header line. An alternative is to pull just the full header from the SEC:
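The early-exit trick looks roughly like this. It is a sketch under assumptions rather than my original script: the tag name is assumed, `fetch_acceptance_line` is a hypothetical helper, and the `User-Agent` string is whatever you choose to identify yourself with.

```python
from typing import Iterable, Optional

ACCEPTANCE_TAG = "<ACCEPTANCE-DATETIME>"  # assumed header tag name


def first_line_with(lines: Iterable[str], marker: str) -> Optional[str]:
    """Return the first line containing `marker`, reading no further than needed."""
    for line in lines:
        if marker in line:
            return line.strip()
    return None


def fetch_acceptance_line(url: str, user_agent: str, timeout: float = 10.0) -> Optional[str]:
    """Stream a filing and stop as soon as the acceptance line has been read."""
    import requests  # third-party; imported here so the helper above stays dependency-free

    headers = {"User-Agent": user_agent}
    # stream=True defers downloading the body; leaving the `with` block
    # closes the connection even though most of the filing is unread.
    with requests.get(url, headers=headers, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        # iter_lines() yields bytes here, so decode each line before matching.
        lines = (raw.decode("latin-1") for raw in resp.iter_lines())
        return first_line_with(lines, ACCEPTANCE_TAG)
```

Because the generator is consumed lazily, only the lines up to and including the acceptance header are pulled off the wire, plus whatever the HTTP library has already buffered.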
“Oldloads” archives
I only recently learned about Oldloads archives when I stumbled upon a developer page on the SEC website. I’m not sure why they exist. The archives include all filings for the day run together in one massive text file. The good news is that they include a special header, described in the EDGAR Public Dissemination Subsystem Technical Specification section 2.1, which contains timestamps. The first line of each filing is a 266-byte control block that contains the accession number, the acceptance timestamp, and another timestamp called the “build time.” The build time is typically the same as the acceptance time, but it can also be several seconds later. I believe the build time is closer than the acceptance time to the true time the filing became public. This is the way to go.
Continuous polling
The best way to get timestamps is always to capture them yourself. I tried this for a day. I continuously hit the page where the SEC links to the latest filings:
or actually the RSS feed of the same:
The SEC developer page lays out the maximum rate you are allowed to hit this feed:
To ensure that everyone has equitable access to SEC EDGAR content, please use efficient scripting, downloading only what you need and please moderate requests to minimize server load. Current guidelines limit each user to a total of no more than 10 requests per second, regardless of the number of machines used to submit requests. (https://www.sec.gov/developer)
I followed that constraint very carefully. I built a sophisticated multi-threaded scraper that always waited precisely 101 milliseconds between requests. I dropped DNS lookups and went straight to a list of IP addresses hosted by Akamai to disseminate the filings. I even logged all my actions with timestamps to be sure I was doing it right. And yet I still got banned!
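For reference, the kind of pacing described above can be sketched like this. It is a simplified illustration, not my actual scraper: the `Pacer` class and its parameters are hypothetical names, and it only controls when my own code starts a request, not whatever the HTTP library does underneath.

```python
import threading
import time


class Pacer:
    """Enforce a minimum gap between the starts of successive requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock          # injectable so the pacer can be tested
        self._sleep = sleep
        self._next_allowed = None    # earliest time the next request may start
        self._lock = threading.Lock()

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        with self._lock:  # one shared pacer serializes all worker threads
            now = self._clock()
            if self._next_allowed is not None and now < self._next_allowed:
                self._sleep(self._next_allowed - now)
                now = self._clock()  # re-read in case the sleep overshot
            self._next_allowed = now + self.min_interval_s
```

Each worker thread would call `pacer.wait()` on a shared `Pacer(0.101)` immediately before issuing a request, which keeps the starts of requests at least 101 milliseconds apart across all threads.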
I emailed the SEC webmaster and got back this response:
Your IP was labeled as a denial of service threat. Your access should gradually restore if you go lighter on the rates. Staff said they are seeing some 200 codes so some requests appear to be going through.
Well, my requests did NOT appear to be going through. So, I re-ran my script the next day on my laptop, tethered to my cell phone in order to get a different IP address. I reduced my hit rate by increasing the time between requests to 201 milliseconds. And everything went fine. I was able to collect a day’s worth of timestamps before I burned through my monthly allotment of mobile data.
By the end of the week, I still couldn’t reach the SEC’s website for any reason, so I emailed the webmaster again. I got back this response:
Akamai observed large amounts of traffic at average of 380 requests per second at 8/3/20 06:39, causing the IP address to be temporarily banned. As of 8/7/20, the ban has been automatically lifted.
So, I’m happy my access has been completely restored, but I’m clueless as to how they thought I was hitting their site 380 times a second! I guess I’ll have to pull out Wireshark and see what Python’s requests package is actually doing. But that still wouldn’t explain how cutting my rate in half kept me under the limit. The real lesson I learned is to always run on a cloud machine with a throwaway IP address.
Anyway, here is the plot that matters most:
After some cropping, this histogram shows that most filings become available for public download within a minute or two of the acceptance timestamp. This is corroborated by the SEC’s own statement:
What is the lag time between the filing acceptance time from the EDGAR Filer System and availability of the documents on sec.gov? (https://www.sec.gov/about/webmaster-faq.htm#lag)
Filings are often available on sec.gov within 1-3 minutes of the EDGAR system timestamp. The lag time can increase significantly with high server load. We don’t guarantee and cannot predict this lag.
In conclusion, use the Oldloads archives for timestamps. But be very careful about modeling bias. If you (like me) get your timestamps from the SEC, then there will be some small number of times your backtest will assume you could trade after a filing came out, but in reality the filing had not yet made it through public dissemination. So you would simulate trading on an event that hadn’t happened yet, giving you lookahead bias.
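One simple way to reduce that bias in a backtest is to push the assumed event time back by a safety margin. The sketch below is only an illustration: the three-minute default is an assumption borrowed from the upper end of the SEC's "1-3 minutes" statement, not a measured bound, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: a margin at the upper end of the SEC's stated 1-3 minute lag.
# The SEC gives no guarantee, so this shrinks lookahead bias without removing it.
DEFAULT_BUFFER = timedelta(minutes=3)


def earliest_trade_time(accepted: datetime, buffer: timedelta = DEFAULT_BUFFER) -> datetime:
    """Earliest time a backtest should treat the filing as publicly known."""
    return accepted + buffer


def lag_seconds(accepted: datetime, first_seen: datetime) -> float:
    """Observed dissemination lag: time first seen publicly minus acceptance time."""
    return (first_seen - accepted).total_seconds()
```

If you have polled timestamps of your own for some sample of filings, `lag_seconds` gives the distribution from which to choose a less arbitrary buffer.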