Ten years in the wrong regex lane
Today I want to share the story of a couple of regular expressions — a tiny inaccuracy in them ended up costing the world more than 50 million hours of CPU time on iOS devices.
Disclaimer: this post is packed with technical details and might be a tough read if you don’t know how to code.
How content blockers work in Safari
First, let me explain how ad blockers in Safari actually work. The fact is, AdGuard on iOS relies on Safari’s built-in mechanism called Safari Content Blockers.
These days, there’s also support for an alternative approach — declarativeNetRequest (DNR), partially compatible with Chrome and Firefox. Interestingly enough, in Safari’s case, DNR rules are silently translated into Safari Content Blocker rules under the hood.
The very first version of AdGuard for iOS was released in October 2015, and right away we ran into a nasty problem. Historically, ad blockers have always used their own filtering rule syntax. It’s powerful and specifically designed for web filtering. Apple, however, introduced their own take on it, which was quite different from what the community was used to.
To jump ahead a little: when Chrome introduced declarativeNetRequest, they also went their own way. At least when it came to URL matching, their syntax was much closer to what we were familiar with.
Here’s an example of how a standard AdGuard rule gets converted into syntax supported by Safari:
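A minimal sketch of such a conversion, assuming a hypothetical rule `||example.org^`. The JSON shape (`trigger.url-filter` plus `action.type`) follows Safari's content blocker format. The regex is hand-written for illustration using the patterns discussed later in this post; it is not actual converter output.

```python
import json

# Hypothetical AdGuard rule: block requests to example.org and its subdomains.
adguard_rule = "||example.org^"

# Hand-written Safari equivalent (illustrative only, not real converter output).
safari_rule = {
    "trigger": {"url-filter": r"^[^:]+://+([^:/]+\.)?example\.org[/:]"},
    "action": {"type": "block"},
}

# Safari expects a JSON array of such rule objects.
content_blocker_json = json.dumps([safari_rule], indent=2)
print(adguard_rule, "->")
print(content_blocker_json)
```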
Notice what happens to the URL pattern. In AdGuard filters, we use a special wildcard-like syntax tailored specifically to URLs. The reason is simple: traditional regular expressions are too slow for this job.
Regular expressions
Safari Content Blockers, on the other hand, rely on regular expressions — although in a very stripped-down form, so that they can still be compiled into a structure that speeds up matching.
If you’re curious about the internals: Safari builds a DFA bytecode out of regex patterns, which is then executed by a custom interpreter: DFABytecodeInterpreter.cpp.
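To give a feel for why a DFA makes matching cheap, here is a toy table-driven matcher. It is only a sketch of the general idea, not WebKit's bytecode format or interpreter: the transition table is written by hand for one hypothetical pattern (any input containing `ads`), and every name in it is made up for this example.

```python
# Transition table for a DFA that accepts any input containing "ads".
# Missing entries fall back to state 0 (start over).
TRANSITIONS = {
    0: {"a": 1},
    1: {"a": 1, "d": 2},
    2: {"a": 1, "s": 3},
}
ACCEPTING = 3  # reached once "ads" has been seen


def dfa_contains_ads(text: str) -> bool:
    """Run the DFA over text: one table lookup per character, no backtracking."""
    state = 0
    for ch in text:
        if state == ACCEPTING:
            return True
        state = TRANSITIONS[state].get(ch, 0)
    return state == ACCEPTING


print(dfa_contains_ads("https://example.org/ads/banner.png"))  # True
print(dfa_contains_ads("https://example.org/about"))           # False
```

Each character is looked at exactly once, so matching time grows linearly with the URL length regardless of how the original pattern was written. The expensive part is building the table, which is where compilation cost comes from.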
Regexps are more versatile than the standard ad blocker syntax. Unfortunately, that flexibility isn’t really relevant for web content filtering. What we get instead is very slow rule compilation that eats up resources, plus strict limits on how many rules filters can contain. We’ve written before about Safari’s issues, and most of them are still around.
Pattern conversion v1.0
So, back in 2015, we had to figure out how to convert AdGuard’s URL patterns into regexps that Safari would accept.
We faced two major tasks at the time.
First, we needed to cut down the total number of rules in the final set. Back then, Safari had a hard limit of 50,000 rules. (We described how we tackled this in our post about ad blocking in Safari.)
By the way, the current limit has been raised to 150,000. However, due to process memory limits, you can only use around 60–80K in practice. We reported this to Apple a couple of times (Apple Feedback Assistant reports: FB19728743, FB13282146), but to no avail.
The second task was to ensure the regexps we generated were efficient enough to run fast and lightweight enough for iOS to compile them. In those days, we sometimes saw the system kill the com.apple.Safari.ContentBlockerLoader process because it consumed too many resources.
After a lot of manual testing, we settled on what seemed like the optimal conversion rules:
- The symbol || (“start of URL”) became ^[htpsw]+://([a-z0-9-]+\.)?
- The symbol ^ (“separator”) became [/:&?]
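The two rules above can be sketched as a small converter. This is a simplified illustration, not AdGuard's actual implementation: the function name is hypothetical, it handles only `||` and `^` and treats everything else as literal text, and Python's `re` stands in for Safari's regex engine. The subdomain group is written as `([a-z0-9-]+\.)?`, with a `+` so that it covers one whole domain label, which matches the single-level behaviour described later in this post.

```python
import re

# Prefix and separator from the v1.0 conversion rules.
V1_START = r"^[htpsw]+://([a-z0-9-]+\.)?"
V1_SEPARATOR = r"[/:&?]"


def convert_v1(pattern: str) -> str:
    """Convert a simple AdGuard URL pattern into a v1.0-style regex."""
    anchored = pattern.startswith("||")
    if anchored:
        pattern = pattern[2:]
    parts = []
    for ch in pattern:
        # "^" is the AdGuard separator; every other character is literal.
        parts.append(V1_SEPARATOR if ch == "^" else re.escape(ch))
    return (V1_START if anchored else "") + "".join(parts)


regex = convert_v1("||example.org^")
print(regex)  # ^[htpsw]+://([a-z0-9-]+\.)?example\.org[/:&?]
print(bool(re.search(regex, "https://example.org/ad.js")))      # True
print(bool(re.search(regex, "https://notexample.org/ad.js")))   # False
```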
We felt confident we’d done our homework, so we stopped worrying about it and left things as they were — for nearly ten years.
We were wrong
It all started with another bug report. The issue was that our standard conversion method slightly changed the semantics of the special || symbol. On iOS, it ended up matching only a single subdomain level, while in every other version of AdGuard it matched across all levels.
The simple fix was to use the regular expression originally suggested by the WebKit developers back in 2015 — the one we had dismissed as “non-optimal” at the time. But we were so sure of our choice back then that we didn’t even bother to recheck until recently.
The changes we should have made were dead simple:
- Replace || with ^[^:]+://+([^:/]+\.)?
- Replace ^ in most cases with [/:]
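The semantic difference between the two prefixes is easy to see side by side. The sketch below uses Python's `re` as a stand-in for Safari's engine and a hypothetical domain, `example.org`. To isolate the effect of the `||` prefix, both variants use the same `[/:]` separator. The old prefix is written with a `+` inside the subdomain group, on the assumption that it was meant to cover one whole label.

```python
import re

OLD_START = r"^[htpsw]+://([a-z0-9-]+\.)?"  # v1.0 prefix: at most one label
NEW_START = r"^[^:]+://+([^:/]+\.)?"        # new prefix: any number of labels


def matches(prefix: str, url: str) -> bool:
    """Check a URL against prefix + example.org + the [/:] separator."""
    return re.search(prefix + r"example\.org[/:]", url) is not None


urls = [
    "https://example.org/ad.js",        # no subdomain
    "https://cdn.example.org/ad.js",    # one subdomain level
    "https://a.cdn.example.org/ad.js",  # two subdomain levels
]
results = {u: (matches(OLD_START, u), matches(NEW_START, u)) for u in urls}
for url, (old, new) in results.items():
    print(f"{url}: old={old} new={new}")
# https://example.org/ad.js: old=True new=True
# https://cdn.example.org/ad.js: old=True new=True
# https://a.cdn.example.org/ad.js: old=False new=True
```

The old character class `[a-z0-9-]` cannot contain a dot, so the optional group can absorb only one label. The new class `[^:/]` allows dots, so it covers subdomains of any depth.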
But was there really that much of a difference? Turns out — oh yes, there was.
Oh boy, how wrong we were
After swapping in the new regexps, we ran a couple of quick tests and the results blew us away. Rule loading speed in Safari didn’t just improve a little — it skyrocketed.
To put it in numbers: compiling the Tracking Protection filter in Safari became 5.5 times faster, and compiling the Base filter became 2.8 times faster.
And to make it more tangible, just take a look at the video below:
When you add up the number of AdGuard users over the past decade, and how many times the app had to recompile filters, the wasted CPU time comes out to at least 50 million extra hours on iOS devices.
I honestly feel ashamed about this mistake, especially knowing that the correct solution was staring us in the face the whole time. In hindsight, these new regexps were always going to compile and run faster than the ones we had chosen.
So what exactly was our mistake back then? I think it all comes down to a flawed testing methodology. We tried to judge “by eye,” instead of:
- Defining a clear set of criteria: memory usage, speed, and actual content blocker performance in the browser.
- Most importantly: learning how to measure those things precisely. Not by eyeballing Activity Monitor or top, but by using a proper profiler.
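As a rough illustration of what repeatable measurement can look like, here is a small harness that records compile time and peak memory over several runs. It is a stand-in, not a real profiler: it measures Python's `re` compiler rather than Safari's, so its numbers say nothing about Safari itself. The function name and the generated rule set are hypothetical.

```python
import re
import statistics
import time
import tracemalloc


def measure_compile(patterns, repeats=5):
    """Return median wall time (s) and peak traced memory (bytes) for compiling patterns."""
    timings = []
    peak_bytes = 0
    for _ in range(repeats):
        re.purge()  # clear Python's regex cache so each run compiles from scratch
        tracemalloc.start()
        started = time.perf_counter()
        for pattern in patterns:
            re.compile(pattern)
        timings.append(time.perf_counter() - started)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peak_bytes = max(peak_bytes, peak)
    return {"median_seconds": statistics.median(timings), "peak_bytes": peak_bytes}


# Hypothetical rule set: 200 generated domains, converted both ways.
domains = [f"example{i}.org" for i in range(200)]
old_patterns = [r"^[htpsw]+://([a-z0-9-]+\.)?" + re.escape(d) + r"[/:&?]" for d in domains]
new_patterns = [r"^[^:]+://+([^:/]+\.)?" + re.escape(d) + r"[/:]" for d in domains]

old_stats = measure_compile(old_patterns)
new_stats = measure_compile(new_patterns)
print("old:", old_stats)
print("new:", new_stats)
```

Memory tracing adds overhead to the timings, so the numbers are only comparable with each other when both variants are measured the same way. The point is the method: fixed inputs, several runs, and recorded metrics instead of a glance at a process monitor.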
The good news is, this problem is now fixed. And I really hope we’ve learned the lessons we needed to, so we won’t repeat mistakes like this in the future.