Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up check for suspicious content #5302

Merged
merged 4 commits into from
Nov 3, 2017

Conversation

lzybkr
Copy link
Member

@lzybkr lzybkr commented Nov 2, 2017

This replaces the code that scanned a script block for suspicious strings.
The previous implementation:

  • Tokenized input (generating many strings for garbage collection)
  • Use multiple threads

This approach is based on Rubin-Karp and does not allocate any memory
other than a small array to hold the running hash values.

I tested the new and old approaches on 2200 files in the PowerShell repo.
The old code ran in about 1.8-2.1s (ignoring time spent reading files)
The new code runs in about 0.6s and is more stable due to no garbage.

This replaces the code that scanned a script block for suspicious strings.
The previous implementation:
* Tokenized input (generating many strings for garbage collection)
* Use multiple threads

This approach is based on Rubin-Karp and does not allocate any memory
other than a small array to hold the running hash values.

I tested the new and old approaches on 2200 files in the PowerShell repo.
The old code ran in about 1.8-2.1s (ignoring time spent reading files)
The new code runs in about 0.6s and is more stable due to no garbage.
"UnderlyingSystemType",
// If the character isn't in any of our patterns,
// don't bother hashing and reset the running length.
if (!((h >= 'a' && h <= 'z') || h == '-'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use here "else if".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice - I manually inlined the first if and didn't notice.

/// <returns>The string matching the hash, or null.</returns>
public static string Match(string text)
{
// The longest pattern is 29 characters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the 29 character limit was chosen to accommodate the current set of suspicious patterns. But what if a new pattern is added that is greater than 29 characters? It seems the generator method (HashNewPattern) should check length and throw an error if greater than current longest pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll add a check.

int longestPossiblePattern = 0;
for (int i = 0; i < text.Length; i++)
{
uint h = text[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change 'h' variable to 'c' to be consistent with HasNewPattern which uses 'c' as character variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency is over rated. Here h is not a character, but the hash value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint h = text[i]; 'h' is not a character from the text to be tested? I though the hashes were stored in 'runningHash'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's both, but I don't need 2 local variables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And hashes are stored in runningHash, but h is being used to compute the hash. One could say it is partially hashed. In previous iterations of the code, I had intermediate steps, so there were multiple assignments to h as part of hashing.

// Doing things with System.Runtime.InteropServices
"InteropServices", "Marshal", "AllocHGlobal", "PtrToStructure", "StructureToPtr",
"FreeHGlobal", "IntPtr",
for (int j = Math.Min(i, runningHash.Length) - 1; j > 0; j--)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must be missing something because I don't understand this loop. It looks like computing the next hash (based on HashNewPattern below) simply takes the previous hash, multiples by LCG, and adds next character. Why are previous hashes re-computed (j-1, j=(i-1) -> 1)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to improve the comment. I'll try again I guess.

Briefly - omitting many other steps:

Iteration n: compute hash on Emi (len 3)
Iteration n+1: we compute hash on Emit (len 4) using hash from previous iteration (j-1)
Iteration n+1: compute hash on mit (len 3) (overwriting previous iteration, so that's why we go from longest to shortest.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I understand the progression now and the loop makes sense. This makes the algorithm much more clear (at least to me). Can you add this to your comment?

Copy link
Contributor

@PaulHigin PaulHigin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor further comments. Otherwise LGTM.
With your tests it looks like false positives are very rare.

@lzybkr lzybkr merged commit 51c9854 into PowerShell:master Nov 3, 2017
@lzybkr lzybkr deleted the faster_suspicious_check branch November 3, 2017 00:59
@PaulHigin PaulHigin added the Backport-5.1-Consider Consider to backport to Windows PowerShell 5.1 due to impact label Nov 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Backport-5.1-Consider Consider to backport to Windows PowerShell 5.1 due to impact
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants