-
Notifications
You must be signed in to change notification settings - Fork 7.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up check for suspicious content #5302
Conversation
This replaces the code that scanned a script block for suspicious strings. The previous implementation: * Tokenized input (generating many strings for garbage collection) * Use multiple threads This approach is based on Rubin-Karp and does not allocate any memory other than a small array to hold the running hash values. I tested the new and old approaches on 2200 files in the PowerShell repo. The old code ran in about 1.8-2.1s (ignoring time spent reading files) The new code runs in about 0.6s and is more stable due to no garbage.
"UnderlyingSystemType", | ||
// If the character isn't in any of our patterns, | ||
// don't bother hashing and reset the running length. | ||
if (!((h >= 'a' && h <= 'z') || h == '-')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use here "else if".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - I manually inlined the first if
and didn't notice.
/// <returns>The string matching the hash, or null.</returns> | ||
public static string Match(string text) | ||
{ | ||
// The longest pattern is 29 characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the 29 character limit was chosen to accommodate the current set of suspicious patterns. But what if a new pattern is added that is greater than 29 characters? It seems the generator method (HashNewPattern) should check length and throw an error if greater than current longest pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll add a check.
int longestPossiblePattern = 0; | ||
for (int i = 0; i < text.Length; i++) | ||
{ | ||
uint h = text[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change 'h' variable to 'c' to be consistent with HasNewPattern which uses 'c' as character variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consistency is over rated. Here h
is not a character, but the hash value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint h = text[i]; 'h' is not a character from the text to be tested? I though the hashes were stored in 'runningHash'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's both, but I don't need 2 local variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And hashes are stored in runningHash
, but h
is being used to compute the hash. One could say it is partially hashed. In previous iterations of the code, I had intermediate steps, so there were multiple assignments to h
as part of hashing.
// Doing things with System.Runtime.InteropServices | ||
"InteropServices", "Marshal", "AllocHGlobal", "PtrToStructure", "StructureToPtr", | ||
"FreeHGlobal", "IntPtr", | ||
for (int j = Math.Min(i, runningHash.Length) - 1; j > 0; j--) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must be missing something because I don't understand this loop. It looks like computing the next hash (based on HashNewPattern below) simply takes the previous hash, multiples by LCG, and adds next character. Why are previous hashes re-computed (j-1, j=(i-1) -> 1)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to improve the comment. I'll try again I guess.
Briefly - omitting many other steps:
Iteration n: compute hash on Emi
(len 3)
Iteration n+1: we compute hash on Emit
(len 4) using hash from previous iteration (j-1
)
Iteration n+1: compute hash on mit
(len 3) (overwriting previous iteration, so that's why we go from longest to shortest.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I understand the progression now and the loop makes sense. This makes the algorithm much more clear (at least to me). Can you add this to your comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor further comments. Otherwise LGTM.
With your tests it looks like false positives are very rare.
This replaces the code that scanned a script block for suspicious strings.
The previous implementation:
This approach is based on Rubin-Karp and does not allocate any memory
other than a small array to hold the running hash values.
I tested the new and old approaches on 2200 files in the PowerShell repo.
The old code ran in about 1.8-2.1s (ignoring time spent reading files)
The new code runs in about 0.6s and is more stable due to no garbage.