Well I mostly meant how do you supply the server resources and how do you crawl so much of the net so quickly. :)
I thought about it many times but never did it at that scale, plus I was never paid to do so and really didn't want my static IP banned. So if you ever write on that and publish it on HN you'd find a very enthusiastic audience in me.
That was pretty boring too! The "script" was just a few hundred lines of C# code triggering Selenium via its SDK. The requirement was simply to load a set of URLs with two different browsers, an "old" one and a "new" one that included a (potentially) breaking change to cookie handling that the customer needed to check for across all sites. I didn't need to fully crawl the sites, I just had to load the main page of each distinct "web app" twice, but I had to process JavaScript and handle cookies.
I did this in two phases:
Phase #1 was to collect "top-level" URLs, which I did via Certificate Transparency (CT). There are online databases that can return all valid certs for domains with a given suffix. I used about a dozen known suffixes for the state government, which resulted in about 11K hits from the CT database. I dumped these into a SQL table as the starting point. I also added in distinct domains from load balancer configs provided by the customer. This provided another few thousand sites that are child domains under a wildcard record and hence not easily discoverable via CT. All of this was semi-manual and done mostly with PowerShell scripts and Excel.
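For anyone wanting to try the CT step themselves: aggregators like crt.sh expose a JSON endpoint where each record's "name_value" field holds the cert's newline-separated names. A minimal Python sketch of the dedup step (my actual pipeline was PowerShell + Excel; the domain names below are made up for illustration):

```python
import json

def distinct_domains(crtsh_json: str) -> set[str]:
    """Extract distinct hostnames from crt.sh-style JSON records.

    Each record's "name_value" field holds newline-separated names.
    Wildcard entries like "*.agency.gov.example" are skipped: a wildcard
    cert doesn't enumerate its child domains, which is exactly why the
    load balancer configs were needed as a second source.
    """
    domains = set()
    for record in json.loads(crtsh_json):
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name and not name.startswith("*."):
                domains.add(name)
    return domains

# Hypothetical sample in the shape crt.sh returns for a suffix query:
sample = json.dumps([
    {"common_name": "portal.health.gov.example",
     "name_value": "portal.health.gov.example\nwww.portal.health.gov.example"},
    {"common_name": "apps.transport.gov.example",
     "name_value": "*.apps.transport.gov.example\napps.transport.gov.example"},
])

print(sorted(distinct_domains(sample)))
```

The resulting set is what went into the SQL table as the work list.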
Phase #2 was the fun bit. I installed two bespoke builds of Chromium side-by-side on the 120-core box, pointed Selenium at both, and had them trawl through the list of URLs in headless mode. Everything was logged to a SQL database. The final output was any difference between the two Chromium builds: JS console log entries that differed, cookies that didn't match, and so on.
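The comparison itself is conceptually simple once both builds' results are captured per URL. A rough Python sketch of the diff step (the real thing was C# writing to SQL; the field and function names here are my own invention):

```python
def diff_results(old: dict, new: dict) -> dict:
    """Compare per-URL results captured from the two Chromium builds.

    `old` and `new` map URL -> {"cookies": {name: value}, "console": [lines]}.
    Returns only the URLs where the builds disagreed, and roughly how.
    """
    diffs = {}
    for url in old.keys() & new.keys():
        a, b = old[url], new[url]
        delta = {}
        if a["cookies"] != b["cookies"]:
            delta["cookies_only_in_old"] = sorted(a["cookies"].keys() - b["cookies"].keys())
            delta["cookies_only_in_new"] = sorted(b["cookies"].keys() - a["cookies"].keys())
        if a["console"] != b["console"]:
            delta["console_changed"] = True
        if delta:
            diffs[url] = delta
    return diffs

# Hypothetical captures from the "old" and "new" builds:
old = {"https://portal.gov.example": {
    "cookies": {"session": "abc", "legacy": "1"}, "console": []}}
new = {"https://portal.gov.example": {
    "cookies": {"session": "abc"}, "console": ["Cookie 'legacy' rejected"]}}

print(diff_results(old, new))
```

Any URL that shows up in the output is a site the breaking change would actually affect, which is the report the customer ultimately wanted.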
All of this was related to a proposed change to the Public Suffix List (PSL), which has a bunch of effects on DNS domain handling, cookies, CORS, DMARC, and various other things. Because the PSL is baked into browser EXEs, the only way to test a proposed change ahead of time is to produce your own custom-built browser and test with that to see what would happen. In a sense, there's no "non-production Internet", so these lab tests are the only way.
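To illustrate why a PSL change matters for cookies specifically: browsers refuse to set a cookie whose Domain attribute is a public suffix, so adding an entry to the list silently breaks cookies shared across sibling subdomains. A toy sketch of that rule (real browsers apply the full PSL with wildcard and exception rules; this hard-codes a tiny made-up suffix set):

```python
def cookie_domain_allowed(host: str, cookie_domain: str, psl: set[str]) -> bool:
    """A cookie with Domain=cookie_domain set by `host` is accepted only if
    host equals it or is a subdomain of it, AND it isn't a public suffix."""
    cookie_domain = cookie_domain.lstrip(".")
    if cookie_domain in psl:  # public suffixes can never hold cookies
        return False
    return host == cookie_domain or host.endswith("." + cookie_domain)

psl_before = {"com", "gov.example"}
psl_after = psl_before | {"agency.gov.example"}  # the proposed new entry

host = "app.agency.gov.example"
# Before the change, the app can set a cookie shared across agency.gov.example:
print(cookie_domain_allowed(host, "agency.gov.example", psl_before))
# After the change, that same cookie is rejected -- the kind of breakage
# the two-build scan was designed to surface:
print(cookie_domain_allowed(host, "agency.gov.example", psl_after))
```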
Actually, the most compute-intensive part was producing the custom Chromium builds! Those took about an hour each on the same huge server.
By far the most challenging aspect was... the icon. I needed to hand over the custom builds to web devs so that they could double-check the sites they were responsible for, and they were also needed for internal-only web app testing. The hiccup was that the two builds looked identical and ended up with overlapping Windows taskbar icons! Making them "different enough" that they don't share profiles and have distinct toolbar icons was weirdly difficult, especially the icon.
It was a fun project, but the most hilarious part was that it was considered to be such a large-scale thing that they farmed out various major groups of domains to several consultancies to split up the work effort. I just scanned everything because it was literally simpler. They kept telling me I had "exceeded the scope", and for the life of me I couldn't explain to them that treating all domains uniformly is less work than trying to determine which domain belongs to which agency.
I only get a "fun" project like this once every year or two.
Selling this kind of thing is basically impossible. You can't convince anyone that you have an ability that they don't even understand, at some fundamental level.
At best, you can incidentally use your full set of skills opportunistically, but that's only possible for unusual projects. Deploying a single VM for some boring app is always going to be a trivial project that anyone can do.
With this project, even after it was delivered, the customer didn't really understand what I did or what they got out of it. I really did try to explain, but it's just beyond the understanding of non-technical-background executives who think only in terms of procurement paperwork and scopes of work.