Precision, recall and why you shouldn’t crank up the warnings to 11
14 thoughts on “Precision, recall and why you shouldn’t crank up the warnings to 11”
I presume there’s also a trade-off between how concise a warning is versus providing all the valid reasons you might encounter such an error – such as the example you gave.
I see a similar issue with code style checkers; recently I feel the industry has adopted quite a dogmatic approach – the more the better.
On one of our projects we have black, isort, flake… I had quite a few commits rejected over things like long lines containing SQL queries. Having one tool that is integrated with IDEs makes sense, but having three that I have to run through Docker…
But regarding the initial situation in the blog post: I think it’s fine to write something like
# noqa: this is safe because blabla
which solves the problem of multiple people dealing with the same warning. Of course, if you have a million of these, your point still stands.
Having a static analyzer in your IDE from the very beginning of the project, tuned for that project, has low overhead and excellent ROI. Maintaining near-zero warnings is easy.
Checking an existing code base is often counterproductive. Even trying to come up with one set of personal fine-tuned warning settings doesn’t really work. Every project tends to have its own kind of typical false warnings, and you don’t want to turn them all off, because cumulatively they have a good chance of catching some real issues.
Many languages also have an “ignore warning X for this region of code” pragma/attribute. That way you can get to zero without torturing your code to satisfy some warning.
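For example, with GCC or Clang in C, a sketch of this region-suppression technique looks like this (the handler is just an illustration):

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-parameter"
/* signal handlers must accept the parameter even when it is unused */
void handle_signal(int signal_number) {}
#pragma GCC diagnostic pop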
These “ignore warning X for this region of code” techniques are complex and non-portable. They require a lot of extra code, and relatively complicated code at that.
I noticed that they were moving in that direction and, based on my own experience with the security-and-quality checks, thought it was a bad idea to offer additional checks by default, but I did not say anything.
Last year, I arranged for OpenZFS to start running CodeQL on every commit, and it rarely complained, but we had a number of bugs that were being caught that I felt a somewhat more aggressive version would catch. That resulted in the following experimental branch, where I have been slowly disabling buggy / pointless queries while filing bug reports about them:
https://github.com/ryao/zfs/blob/codeql-experiments/.github/codeql.yml
Not that long ago, it caught a bug in a patch that reviewers missed, which vindicated my view that a more aggressive analyzer would be useful. However, it is still not quite ready for me to open a PR against the main repository, since there is still a fair amount of noise. Some of it comes from tests, while the rest comes from our Python bindings. Unfortunately, I am not yet well versed enough in Python to make judgment calls on those, but I imagine I will be eventually.
That said, I often find making code changes to silence false positive reports from static analyzers to be worthwhile, since the result is often significantly cleaner code than what was there originally. This is true of the Clang static analyzer’s false positives, which I also often find to be opportunities to add missing assertions. Unfortunately, false positives from CodeQL’s security-and-quality checks are an exception to this trend, since I find that those checks rarely identify opportunities to improve code cleanliness. I often look at the false positive reports from CodeQL’s security-and-quality checks and think “the reviewers will think I have gone insane if I submit a patch based on this”.
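For instance, here is a hypothetical sketch (not actual OpenZFS code) of silencing a possible-null-dereference false positive by adding a missing assertion:

#include <assert.h>
#include <stddef.h>

struct node { struct node *next; int value; };

/* The analyzer cannot see that callers guarantee a list of at least
 * two elements; the assertion documents that invariant and silences
 * the possible-null-dereference report. */
int second_value(struct node *head) {
    assert(head != NULL && head->next != NULL);
    return head->next->value;
}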
There is one saving grace to GitHub’s move. GitHub’s code scanning only reports new findings in PRs, so even if you ignore the initial deluge of reports, you can still get useful information from a noisier static analyzer; this could be less terrible than it seems. However, getting people into the habit of ignoring reports is not good practice, so I highly suggest that anyone deploying those checks do some tuning, turning off the checks with the worst signal-to-noise ratio first.
People could even use what I have done in my codeql-experiments branch as a basis for that, although I suggest at least spot-checking the reports from a check before turning it off, since by the time you actually do this, CodeQL may have fixed some of the checkers. I plan to separate the disabled checks into two groups: completely pointless checks that we should never enable, and broken checks. After I have finished getting that branch ready and it is merged into OpenZFS, I plan to re-check the broken checks periodically, maybe once or twice a year, so that when the checks are fixed, we can start using them. That is probably something that others who adopt the security-and-quality checks should do too.
In hindsight, I should have written:
“we had a number of bugs that were being caught outside of CodeQL that I felt a somewhat more aggressive version would catch.”
Great read as usual. While you talk about false positives, here’s one more (and IMO bigger) issue to consider: checks that blindly flag certain patterns/functions without even attempting to check whether the usage was correct.
For example, clang-tidy by default will flag any memcpy call (regardless of whether it is correct) as “unsafe” and recommend the non-portable memcpy_s function instead.
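To illustrate, here is a sketch of a correct, clearly bounded copy that such a check still flags:

#include <string.h>

void clone_block(unsigned char dst[16], const unsigned char src[16]) {
    /* The length matches both buffers, yet the call is flagged as
     * "unsafe" purely because it is memcpy. */
    memcpy(dst, src, 16);
}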
This is significantly more harmful (especially when enabled by default) because not only does it fail to catch (or even attempt to catch) any actual bugs, it also tricks newbies who don’t know any better into writing non-portable code with a false sense of security.
This trend is very unfortunate and concerning because I regularly use clang-tidy (and cppcheck) and even recommend them to others, as they have caught real bugs in the past.
I opened an issue with LLVM to request more specific checks, and that issue has morphed into a more general complaint about the checker:
https://github.com/llvm/llvm-project/issues/61098
Hopefully, they will drop that check in favor of more intelligent checks in the future.
That said, there is likely room for someone to open an issue asking for that check to be disabled in clang-tidy by default. It is deeply flawed.
Great to see this frank feedback, and the discussion of the precision/recall trade-off is spot on. I like the use of the F1 score.
The problems of precision and recall are well recognised at GitHub, where I work, which is why there are different suites of queries for different precision/recall appetites.
There is also the ability to filter out specific problematic queries (such as the one you describe), modify them or even write your own.
CodeQL is opt-in for repos and can be configured quite a lot – you can change which “suite” is used (e.g. security-extended) and you can use “query filters” to further include or exclude specific queries. The GitHub docs on configuring Code Scanning with CodeQL are here.
If you find security-and-quality too noisy, you can move to security-extended instead, or go back down to the default set, which is tuned for low false positives – high precision – versus the quality and extended sets, which increase recall at the expense of precision. The default set is included in the extended set, and the extended set is included in the quality set.
You can really crank it up to 11 with the security-experimental suite, but that’s for auditing, or for when you really need a rule that hasn’t yet been tested enough to meet the standards of the other suites – you can just pick and choose one rule from a suite.
I work in “Field” at GitHub, rather than directly on the product, but a big part of my role is helping get this balance right for customers using Advanced Security (the paid version of what open source gets for free). We even have our own set of queries that we maintain (and move into the product, when they are ready) to help customers and demonstrate what’s possible.
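For instance, a minimal code scanning configuration file that selects a suite and excludes one noisy query might look something like this sketch (the excluded query id is only a placeholder):

name: "CodeQL config"
queries:
  - uses: security-extended
query-filters:
  - exclude:
      id: cpp/some-noisy-query   # placeholder – substitute the query you want to exclude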
If you get in touch with me by email or on Fosstodon I’d be happy to take a look at the repo you’re contributing to and see what can be done about the level of false positives.
Thanks. My blog post wasn’t a criticism of CodeQL per se.
I removed the security-and-quality setting. Please be mindful that once you promote some level of static analysis, this tends to propagate: “I just added your code to my project and CodeQL has found all these bugs”.
And now, if you vary your “β” parameter, how does the optimal number of warnings change? The world is moving towards hordes of not-so-well-trained programmers, because that model seems to scale better and has a faster response time. And in that world, issuing tons of warnings is actually a good thing.
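(For reference, the generalized F-score being referred to is

F_β = (1 + β²) · precision · recall / (β² · precision + recall)

which weights recall β times as heavily as precision; with β > 1, a noisier, higher-recall analyzer scores better.)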
The world is moving towards hordes of not-so-well-trained programmers
Is that true though?
Back when I worked on Windows Vista, the Windows team introduced static analysis tools that operated in conjunction with source code annotations. The vast majority of flagged issues were false positives, but the problem wasn’t just the time wasted investigating non-issues.
Some manager had the brilliant idea of outsourcing all the “trivial fixes” for issues flagged by static analysis to a large IT contractor in India. You can probably guess how well that went. Novice programmers completely unfamiliar with one of the world’s most complex codebases introduced so many bugs (I wish I had statistics) – which the Windows developers then had to fix – that I’m sure it would have been cheaper to leave the investigation and fixes to the original developers. The original “bugs” were mostly illusory, but the bugs introduced in the “fixes” certainly were not.
(Not that I have anything against static analysis: the Vista codebase was far more robust than XP as a result. But this was definitely the wrong way to implement it.)