AdGuard for Windows October incident report: Post Mortem
Recently, some AdGuard for Windows users have been encountering problems with their app. After releasing version 7.22, we discovered that the update caused occasional page loading issues in browsers. While the issue was quickly mitigated for Chrome, it persisted longer in Firefox, since the bug proved rare and difficult to reproduce, stemming from an unusual combination of factors that were not detected during testing.
As of this publication, we’ve released a hotfix, AdGuard for Windows v7.22.1, that fully resolves the issue. Please make sure your app is updated to the latest version for a smooth experience. If you use AdGuard on other platforms, there’s no need to take any action: everything should be working as expected.
We sincerely apologize for the inconvenience and hope this incident does not affect your overall experience with AdGuard. This was a one-time occurrence that will not happen again. Beyond addressing the technical issue, we also reviewed our internal processes and have already implemented several improvements to prevent similar situations in the future.
Below is the timeline of events that led to the incident, after which we will provide more technical details and list the measures we are taking to avoid anything like this in the future.
Events timeline
October 2
We released AdGuard for Windows v7.22.
October 4
A user opened a GitHub issue reporting page loading problems and, eventually, a Timed out error after updating to version 7.22.
October 6
Our QA team began investigating the issue but was unable to reproduce it. The team escalated the problem to developers the same day, and internal discussions began.
October 16
As more users joined the GitHub discussion, reports and upvotes accumulated. Unfortunately, at this stage, the QA team did not follow internal escalation guidelines for issues of this level. We faced severe difficulties when trying to reproduce the bug, partly because of the inconsistent nature of the user reports. This led the team to assume that the bug had been introduced a couple of versions back and therefore didn’t require a hotfix. As a result, we postponed the fix to the next scheduled version, and, although the task was marked P1: Critical, it was not properly escalated, and valuable time was lost.
October 30
After 24 days, the issue resurfaced internally, revealing the true extent of its impact on users. Once we successfully reproduced the problem, we were able to estimate the overall damage more accurately. Work immediately began on a hotfix (v7.22.1). However, much of the time and effort went into reliably reproducing the problem and understanding its main cause. Several hypotheses were tested and ruled out during this process.
During the investigation, we also discovered a Firefox bug related to QUIC connections, which further slowed page loading and complicated debugging. Additionally, an issue was found with the lack of QUIC connection filtering when AdGuard was running alongside AdGuard VPN in compatibility mode with Wintun disabled.
November 1
We partially identified the cause of the issue and found a way to reproduce it. A fix was implemented in DnsLibs, our DNS filtering engine, and a nightly build was released for testing.
November 5
We pinpointed the main cause — a bug in the routing loop detection component of CoreLibs, AdGuard’s filtering engine. The issue was promptly fixed, and a second nightly build was released.
November 6
Testing of the nightly builds showed that the initial fixes resolved most problems with TCP connections, but some UDP-related issues remained. To minimize user impact, we decided to roll out the fixes in two phases. First, a third nightly build addressed these additional issues, including final library fixes and updates to driver instructions. We then prepared and thoroughly tested the v7.22.1 patch, which was released the same day.
Technical details
The bug
The bug in AdGuard for Windows v7.22 caused occasional, unpredictable freezes when loading certain pages in Firefox. It did not affect Chrome users as much. The issue began after CoreLibs v1.19 introduced protection against routing loops, a mechanism designed to prevent traffic from looping back into itself.
The bug itself had two main causes. First, a programming error excluded a port from the routing loop check, which meant that some legitimate filtered connections, such as browser requests, could be blocked. This was triggered by some service requests originating from AdGuard, such as OCSP requests. After such requests, the next browser connection would be interrupted.
Second, the check was applied in the wrong place to already established connections, resulting in unnecessary connection blocks.
Why do we need protection against routing loops?
A routing loop occurs when traffic loops back to its source application. Such loops can cause slower connections and high CPU usage. Normally, these situations do not occur, but due to interactions with other software, routing loops can still happen. To prevent them, AdGuard tracks outgoing connections by their source address and terminates any that return to the same address.
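The following is a minimal sketch of the tracking idea described above, and of how dropping the port from the comparison produces false positives. It is not CoreLibs code: the class name LoopDetector, the match_port flag, and the sample addresses are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LoopDetector:
    """Toy routing loop detector (hypothetical, not the CoreLibs implementation)."""
    match_port: bool = True
    outgoing: set = field(default_factory=set)

    def _key(self, ip: str, port: int):
        # With match_port=False the port is ignored, mimicking the programming error.
        return (ip, port) if self.match_port else (ip,)

    def track_outgoing(self, ip: str, port: int) -> None:
        """Remember the local source endpoint of a connection the filter opened."""
        self.outgoing.add(self._key(ip, port))

    def is_loop(self, ip: str, port: int) -> bool:
        """An intercepted connection is a loop if its source is one the filter opened."""
        return self._key(ip, port) in self.outgoing


buggy = LoopDetector(match_port=False)
fixed = LoopDetector(match_port=True)
for detector in (buggy, fixed):
    # The filter opens its own service connection (for example, an OCSP request).
    detector.track_outgoing("192.168.1.10", 50001)

# A browser connection from the same machine, but from a different source port.
print(buggy.is_loop("192.168.1.10", 50002))  # True: false positive
print(fixed.is_loop("192.168.1.10", 50002))  # False
print(fixed.is_loop("192.168.1.10", 50001))  # True: a genuine loop
```

When only the address is compared, every connection from the same machine looks like a returning one, which is why an ordinary browser request could be terminated right after a service request.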
Why did it only slightly affect Chrome?
Because only the connection immediately following a service one was broken, Chrome users were much less likely to notice the issue. This is because Chrome automatically attempts to perform the same network request again after a connection is reset.
Firefox users, however, were affected in a more direct way — the browser’s design does not include automatic retry attempts after any kind of network error, which made the issue more apparent for them.
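To illustrate why a single broken connection can go unnoticed in one client and be visible in another, here is a small simulation. It does not model real browser internals: FlakyNetwork and fetch are hypothetical names, and the only assumption is the behavior described above, where exactly one connection is reset and the next one succeeds.

```python
class FlakyNetwork:
    """Resets exactly the first connection attempt, then works normally."""

    def __init__(self):
        self.attempts = 0

    def connect(self) -> str:
        self.attempts += 1
        if self.attempts == 1:
            raise ConnectionResetError("connection reset")
        return "200 OK"


def fetch(network: FlakyNetwork, retries: int) -> str:
    """Try to connect, retrying up to `retries` times after a reset."""
    for attempt in range(retries + 1):
        try:
            return network.connect()
        except ConnectionResetError:
            if attempt == retries:
                return "error shown to user"


print(fetch(FlakyNetwork(), retries=1))  # 200 OK
print(fetch(FlakyNetwork(), retries=0))  # error shown to user
```

A client that retries once hides the reset completely, while a client that never retries surfaces every occurrence to the user.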
What about AdGuard for Linux and Android?
The problem was not detected during the CLI stage with version 1.19, even though the newest functionality is always introduced in AdGuard CLI first, so that major issues can be found and fixed before integration into UI-based applications. In this case, the problem occurred only in auto mode on Linux, and mostly in Firefox. Very few users met these criteria, so no user reports were received.
The issue also did not appear in AdGuard for Android v4.12, which uses the same CoreLibs 1.19, because the incoming and outgoing connection addresses differed, preventing the same conditions that triggered the bug elsewhere.
The diagnosis
Diagnosing the issue was particularly difficult. AdGuard opens relatively few service connections, and repeated attempts to reproduce the problem often used cached OCSP requests, preventing the bug from appearing. Additionally, the problem affected only a single downstream connection at a time. Chrome users were mostly unaffected because the browser automatically retries failed connections, whereas Firefox users were directly impacted, as it does not attempt reconnections when a network error occurs.
A related Firefox bug discovered along the way
While we were looking for this hard-to-find bug, we noticed that half of the reports described other symptoms, and that the problems they described also occurred on past versions of AdGuard. That’s how we also discovered a separate bug in Firefox for Windows affecting HTTP/3 connections.
In short, Firefox attempts HTTP/3 connections immediately when a site advertises support, even if the connection is not yet available. AdGuard currently does not filter HTTP/3 by default, so HTTP/3 is blocked for applications with HTTPS filtering enabled.
Browsers usually include an algorithm like Happy Eyeballs that chooses whichever protocol works best. Firefox for Windows, however, tries to establish an HTTP/3 connection as soon as it learns that a site supports HTTP/3 (for instance, from an HTTPS-type DNS record), and assigns requests to that HTTP/3 connection before it is established, even when a live HTTP/2 connection is present. If HTTP/3 is unavailable, site loading pauses for 20–30 seconds, after which the requests are "reassigned" to the live HTTP/2 connection.
A bug report regarding this behavior has been filed. As a preventive measure, AdGuard has introduced a modification of HTTPS-type DNS records to exclude the h3 ALPN parameter when HTTP/3 filtering is disabled. This hides the fact that HTTP/3 is available from the browser in cases where it would be blocked by AdGuard anyway.
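The record modification can be pictured with the sketch below. It is only an illustration under simplifying assumptions: the HTTPS record parameters are represented as a plain dictionary, and the function name hide_http3 and its arguments are hypothetical rather than part of DnsLibs.

```python
def hide_http3(svc_params: dict, http3_filtering_enabled: bool) -> dict:
    """Return a copy of HTTPS-record parameters without the "h3" ALPN ID.

    The record is left untouched when HTTP/3 filtering is enabled, because
    in that case HTTP/3 is not blocked and there is nothing to hide.
    """
    result = dict(svc_params)
    if http3_filtering_enabled or "alpn" not in result:
        return result
    result["alpn"] = [proto for proto in result["alpn"] if proto != "h3"]
    return result


record = {"alpn": ["h3", "h2"], "port": 443}
print(hide_http3(record, http3_filtering_enabled=False))  # {'alpn': ['h2'], 'port': 443}
print(hide_http3(record, http3_filtering_enabled=True))   # {'alpn': ['h3', 'h2'], 'port': 443}
```

With h3 removed from the advertised protocols, the browser has no hint that HTTP/3 exists for the site and goes straight to HTTP/2.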
The fix
The fix was applied in two phases.
The first phase corrected the connection-matching logic to properly include the port, which resolved most of the problem, though some false positives persisted due to Windows reusing ports from recently closed connections. A nightly build released on the evening of November 5 helped most users.
However, traces of the issue could still be seen in AdGuard logs. It turned out that the problem was only partially resolved — fixing the connection matching algorithm alone wasn’t enough, because Windows can reuse the port of an outgoing connection within a second after the socket is released. This caused another type of false positive, when unrelated connections with the same address and port were mistakenly identified as loops.
There were no clear new incidents (everyone reported that things were working fine), but potentially many users could still be affected. And that’s how we came to the second phase: it addressed these cases, resulting in the final hotfix, v7.22.1, which fully resolved the issue.
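The port reuse problem described above can be reproduced in a toy model. The sketch below shows one plausible way to avoid it, which is to forget an endpoint as soon as its socket is closed. This is an assumption made for illustration: the actual change in the hotfix is not described here, and EndpointTracker, forget_on_close, and simulate are hypothetical names.

```python
class EndpointTracker:
    """Tracks (address, port) pairs of connections opened by the filter."""

    def __init__(self, forget_on_close: bool):
        self.forget_on_close = forget_on_close
        self._endpoints = set()

    def on_open(self, ip: str, port: int) -> None:
        self._endpoints.add((ip, port))

    def on_close(self, ip: str, port: int) -> None:
        # Without this cleanup, a stale entry outlives its socket.
        if self.forget_on_close:
            self._endpoints.discard((ip, port))

    def is_loop(self, ip: str, port: int) -> bool:
        return (ip, port) in self._endpoints


def simulate(tracker: EndpointTracker) -> bool:
    tracker.on_open("192.168.1.10", 50001)   # the filter's own service connection
    tracker.on_close("192.168.1.10", 50001)  # the socket is released
    # The OS hands the same port to an unrelated browser connection.
    return tracker.is_loop("192.168.1.10", 50001)


print(simulate(EndpointTracker(forget_on_close=False)))  # True: false positive
print(simulate(EndpointTracker(forget_on_close=True)))   # False
```

Matching on address and port is correct only while the tracked socket is alive, so the lifetime of each entry matters as much as the key it is stored under.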
Prevention measures
This issue was not caught earlier primarily due to the difficulty of reproducing it under normal testing conditions, combined with insufficient attention to user feedback.
We are currently updating our QA and development processes, with a particular focus on combining stricter process monitoring, enhanced automation, and more rigorous testing and communication within teams. With that, we aim to prevent similar incidents in the future and ensure the reliability of AdGuard for all users.
Changing the QA team’s workflow
The QA team will start paying closer attention to the number of comments and upvotes on GitHub Issues. To minimize the human factor, this monitoring will not rely solely on manual review — automation will be introduced to track activity and the number of upvotes in public issues. If this approach proves effective, it can be scaled and applied to all QA teams across all AdGuard projects.
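As a rough illustration of what such automation might check, the sketch below flags issues whose activity crosses a threshold. The thresholds, field names, and sample issues are invented for the example and do not reflect the actual tooling or the Triage Guidelines.

```python
def needs_escalation(issue: dict, min_upvotes: int = 10, min_comments: int = 15) -> bool:
    """Flag an issue when either its upvotes or its comments reach a threshold."""
    return issue["upvotes"] >= min_upvotes or issue["comments"] >= min_comments


# Made-up sample data; a real script would read these counts from the issue tracker.
issues = [
    {"id": 1, "upvotes": 2, "comments": 3},
    {"id": 2, "upvotes": 25, "comments": 40},
]

flagged = [issue["id"] for issue in issues if needs_escalation(issue)]
print(flagged)  # [2]
```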
An additional briefing will be conducted based on our Triage Guidelines, and a mandatory practice will be introduced for internal issue assessment within Jira. This process will require a justification based on the structured internal Triage Guidelines, helping ensure consistency and transparency in prioritization decisions.
The team will also prepare a list of diagnostic questions for users, which will make it easier to identify and analyze filtering-related problems based on user feedback.
Finally, several new automated tests will be implemented to prevent similar issues in the future. A benchmark testing script is being developed to evaluate page filtering speed across different browsers. Once a performance baseline is defined, all subsequent releases will be measured against it. Currently, the automated tests are used only in Google Chrome, but we plan to extend them to Firefox as well. In addition, tests will be written to measure the loading speed of a defined set of “problematic” pages in these browsers. This list will be based on known problematic websites and will continue to expand, starting with those identified in the scope of this issue’s investigation — for example, discord.com.
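A baseline comparison of this kind could look roughly like the sketch below. It assumes that load times are collected elsewhere and passed in as numbers of seconds; the function name find_regressions, the 25% tolerance, and all timings are hypothetical values chosen for the example.

```python
from statistics import median


def find_regressions(baseline: dict, runs: dict, tolerance: float = 0.25) -> dict:
    """Return pages whose median load time exceeds the baseline by more than `tolerance`."""
    regressions = {}
    for page, samples in runs.items():
        current = median(samples)
        limit = baseline[page] * (1 + tolerance)
        if current > limit:
            regressions[page] = current
    return regressions


# Made-up timings in seconds, for illustration only.
baseline = {"discord.com": 2.0, "example.org": 1.0}
runs = {
    "discord.com": [2.1, 24.0, 25.5],
    "example.org": [0.9, 1.1, 1.0],
}
print(find_regressions(baseline, runs))  # {'discord.com': 24.0}
```

Using the median of several runs keeps a single slow sample from failing a release, while a sustained pause of tens of seconds still stands out clearly.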
Changing the development team’s workflow
The development teams will devote more attention to adding new tests — including integration tests — for new functionality whenever possible, in order to reduce the likelihood of errors appearing in future releases. Greater emphasis will also be placed on providing detailed technical documentation for new features and ensuring that all involved teams are properly informed about the need to test new functionality for potential issues and corner cases.
When integrating new versions of CoreLibs into products, the teams will now wait for explicit approval from the CoreLibs team before proceeding. At the moment, this integration process is carried out somewhat “in isolation,” which increases the risk of missed issues.
Conclusion
We would like to apologize once again to all users affected by this incident and sincerely thank everyone who provided feedback and helped us navigate this challenging situation. Going forward, we will also be faster and more transparent in communicating with our users about any critical issues that could have a significant impact.