For the last few months I have been spending time with the semgrep tool. It has plenty of features, but it is still a growing tool and needs a bit of handholding. Here I will quickly explain how to hack the semgrep code base so that it scans files in your language even when their file extensions are not the standard ones.
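TL;DR
Semgrep decides which files to scan from a per-language list of extensions, so files named .php7 or .server are skipped when the rule language is [php]. Adding those extensions to target_manager_extensions.py in the installed package makes semgrep scan them.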
Background
Semgrep is a static code analysis tool that relies on a customizable set of rules to scan source code and identify various issues. The issues can range from general coding standard deviations to security issues. I have been using semgrep to find security bugs. As part of this process, I needed to run semgrep on many files and folders, and this is exactly where I ran into my problem.
Let’s assume, hypothetically, that all of our PHP files are named with a .php7 or .server extension. In the current setup, semgrep will not run any rule on those files if the rule specifies the language as [php].
Sample Rule
rules:
- id: get
  patterns:
  - pattern-either:
    - pattern: $_GET[$KEY]
  languages: [php]
  mode: search
  message: Look for get
  severity: WARNING
Execution
As you can see, the source folder has 3 files, but semgrep ran the rule on only one of them.
This is where I started digging into what I could do, and I stumbled upon a few options.
Option 1: Use paths
rules:
- id: get_agent
  patterns:
  - pattern-either:
    - pattern: $_GET[$KEY]
  paths:
    include:
    - "*.php"
    - "*.server"
    - "*.php7"
  languages: [php]
  mode: search
  message: Look for user agent
  severity: WARNING
However, this didn’t work, and that led me to further reading and to asking questions on the semgrep Slack, which initially pointed to 2 distinct suggestions.
a. Use generic as the language
b. This is a known bug, and someone has already filed an issue for it
The first option didn’t work because, with generic, the standard language-level parsers don’t come into the picture. The second option led me down a rabbit hole which eventually brought me to this discussion. That discussion clarified that, given the way semgrep operates, we can solve this by directly specifying a file, or by using a regular expression hack that lists all the files needed.
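The "list all the files needed" workaround can be sketched roughly as follows. This is only a minimal illustration: `collect_targets`, `build_command`, the `src` folder and `rule.yaml` are made-up names, and it assumes that files passed to semgrep explicitly are scanned with the rule's language, which may depend on your semgrep version. The only semgrep option used is `--config`.

```python
from pathlib import Path

# Hypothetical helper, not part of semgrep: gather files by extension so
# they can be handed to semgrep as explicit targets.
def collect_targets(root, extensions):
    """Return sorted paths under root whose suffix is in extensions."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )

# Hypothetical helper: build the argument list for one semgrep run.
def build_command(rule_file, targets):
    """Return a semgrep command line with every target listed explicitly."""
    return ["semgrep", "--config", rule_file, *targets]

if __name__ == "__main__":
    # "src" and "rule.yaml" are placeholder names for this sketch.
    targets = collect_targets("src", [".php", ".php7", ".server"])
    print(" ".join(build_command("rule.yaml", targets)))
```

The script only prints the command instead of running it, so you can check the file list before handing it to semgrep.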
Also, while going through all of this I realized that paths include works differently than I expected. The last line of this comment and this [entire issue](https://github.com/returntocorp/semgrep/issues/1099) helped me piece together how it works. Effectively, the files are first picked from a folder based on the language, and then the include or exclude filters are run on that list. The final list is then sent to semgrep for processing.
This is when I switched my search and found a hack which works for now.
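To make that ordering concrete, here is a small model of the two stages as I understood them. It is not semgrep's actual code: `LANGUAGE_EXTENSIONS` and `select_targets` are illustrative names, and the matching rules are simplified to a suffix check followed by glob matching on the file name.

```python
from fnmatch import fnmatch
from pathlib import PurePath

# Illustrative stand-in for semgrep's per-language extension list; the
# real list lives in semgrep's source and may look different.
LANGUAGE_EXTENSIONS = {"php": [".php"]}

def select_targets(files, language, include=None):
    """Model the two-stage selection: extension filter, then include."""
    extensions = LANGUAGE_EXTENSIONS[language]
    # Stage 1: keep only files whose extension belongs to the language.
    candidates = [f for f in files if PurePath(f).suffix in extensions]
    # Stage 2: include patterns can only narrow the candidate list.
    if include:
        candidates = [
            f
            for f in candidates
            if any(fnmatch(PurePath(f).name, pattern) for pattern in include)
        ]
    return candidates

files = ["src/index.php", "src/login.php7", "src/api.server"]
print(select_targets(files, "php"))
print(select_targets(files, "php", include=["*.php", "*.server", "*.php7"]))
```

Both calls print `['src/index.php']`. In this model the include patterns never see the .php7 and .server files, because those were already dropped in the first stage.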
Solution
The way semgrep behaves clearly shows the presence of a list that maps each language tag to its file extensions. A bit of searching in the semgrep code base led me to target_manager_extensions.py. This file holds the extensions corresponding to each language. Now I needed to test whether editing this file would get the tool to work. So here is how I tested it.
- I needed the location of the semgrep Python files on my system. You can do a full search of the system or, if you know where your site-packages folder is, go directly to it. I did a slightly different hack.
$ pip3 install -U semgrep
This command does not reinstall anything. It prints a message saying the requirement is already satisfied, and that message conveniently tells me the path of the package
Requirement already satisfied: semgrep in /usr/local/lib/python3.9/site-packages (0.51.0)
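If you would rather not read the path off the pip output, the same lookup can be done from Python. This is a sketch using only the standard library: `find_package_file` is a made-up helper name, and it assumes semgrep is installed for the interpreter you run it with and that the file still exists under that name in your version.

```python
import importlib.util
from pathlib import Path

# Hypothetical helper, not part of semgrep or pip.
def find_package_file(package, filename):
    """Return the path of filename inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # package not installed, or not a regular package
    for location in spec.submodule_search_locations:
        # rglob also covers the case where the file sits in a subpackage.
        for match in Path(location).rglob(filename):
            return match
    return None

if __name__ == "__main__":
    # Prints None when semgrep is not installed for this interpreter.
    print(find_package_file("semgrep", "target_manager_extensions.py"))
```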
So, to test my hypothesis, I edited the file and reran the semgrep tool.
As you can see here, the result was that all 3 files were scanned. Once this was confirmed, I raised an issue on the semgrep repository, and I will link this blog post to it so that anyone who wants to pick up that issue knows what to look for.
So now you know how to scan unusual extensions via semgrep even if it doesn’t directly support them right now. Remember that a semgrep upgrade will overwrite the edits, so put them back manually until semgrep makes this a feature.
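Since an upgrade silently drops the edit, a quick check after each upgrade can save some confusion. The sketch below is only an assumption-laden helper: `missing_extensions` is a made-up name, and it assumes the extensions were added to the file as quoted string literals such as ".php7". It does a plain text search and does not import or modify semgrep.

```python
import sys
from pathlib import Path

# Hypothetical helper: report which custom extensions are no longer
# present in the extensions file, for example after a semgrep upgrade.
def missing_extensions(extensions_file, wanted):
    """Return the entries of wanted that do not appear quoted in the file."""
    text = Path(extensions_file).read_text(encoding="utf-8")
    return [
        ext
        for ext in wanted
        if f'"{ext}"' not in text and f"'{ext}'" not in text
    ]

if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of target_manager_extensions.py as the first argument.
    print(missing_extensions(sys.argv[1], [".php7", ".server"]))
```

An empty list means the edit is still in place. Anything else lists the extensions that need to be added back.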
I hope this blog post helps people understand how the conclusion was reached and how the debugging around this specific issue was performed.