Speed up check for suspicious content #5302

lzybkr · 2017-11-02T03:43:43Z

This replaces the code that scanned a script block for suspicious strings.
The previous implementation:

* Tokenized the input (generating many strings for garbage collection)
* Used multiple threads

This approach is based on Rabin-Karp and does not allocate any memory
other than a small array to hold the running hash values.

I tested the new and old approaches on 2200 files in the PowerShell repo.
The old code ran in about 1.8-2.1s (ignoring time spent reading files).
The new code runs in about 0.6s and is more stable because it generates no garbage.

iSazonov · 2017-11-02T12:34:11Z

src/System.Management.Automation/engine/runtime/CompiledScriptBlock.cs

-            "UnderlyingSystemType",
+                    // If the character isn't in any of our patterns,
+                    // don't bother hashing and reset the running length.
+                    if (!((h >= 'a' && h <= 'z') || h == '-'))


We could use "else if" here.

Nice - I manually inlined the first if and didn't notice.

PaulHigin · 2017-11-02T18:20:52Z

src/System.Management.Automation/engine/runtime/CompiledScriptBlock.cs

+            /// <returns>The string matching the hash, or null.</returns>
+            public static string Match(string text)
+            {
+                // The longest pattern is 29 characters.


I assume the 29 character limit was chosen to accommodate the current set of suspicious patterns. But what if a new pattern is added that is greater than 29 characters? It seems the generator method (HashNewPattern) should check length and throw an error if greater than current longest pattern.

Sure, I'll add a check.
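
A minimal sketch of what such a check might look like. It is not the code from the PR: the method name is borrowed from this thread, the multiplier is a hypothetical stand-in for the LCG constant, and the class name and exception type are assumptions for illustration. Only the limit of 29 comes from the comment under review.

```csharp
using System;

static class PatternHashSketch
{
    // Hypothetical multiplier; the real constant is not shown in this thread.
    const uint Multiplier = 31;

    // The longest pattern is 29 characters (from the comment under review).
    const int LongestPattern = 29;

    // Hash a pattern one character at a time:
    // previous hash times the multiplier, plus the next character.
    public static uint HashNewPattern(string pattern)
    {
        if (pattern.Length > LongestPattern)
        {
            // Fail loudly so a longer pattern cannot be added
            // without also growing the running hash array.
            throw new ArgumentException(
                "Pattern is longer than " + LongestPattern + " characters.",
                nameof(pattern));
        }

        uint hash = 0;
        foreach (char c in pattern)
        {
            hash = unchecked(hash * Multiplier + c);
        }
        return hash;
    }
}
```

With a check like this, adding a 30-character pattern fails when the hash table is generated rather than silently never matching.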

PaulHigin · 2017-11-02T18:21:56Z

src/System.Management.Automation/engine/runtime/CompiledScriptBlock.cs

+                int longestPossiblePattern = 0;
+                for (int i = 0; i < text.Length; i++)
+                {
+                    uint h = text[i];


Please change the 'h' variable to 'c' to be consistent with HashNewPattern, which uses 'c' as the character variable.

Consistency is overrated. Here h is not a character, but the hash value.

uint h = text[i]; Isn't 'h' a character from the text being tested? I thought the hashes were stored in 'runningHash'.

Yes, it's both, but I don't need 2 local variables.

And hashes are stored in runningHash, but h is being used to compute the hash. One could say it is partially hashed. In previous iterations of the code, I had intermediate steps, so there were multiple assignments to h as part of hashing.

PaulHigin · 2017-11-02T18:30:48Z

src/System.Management.Automation/engine/runtime/CompiledScriptBlock.cs

-            // Doing things with System.Runtime.InteropServices
-            "InteropServices", "Marshal", "AllocHGlobal", "PtrToStructure", "StructureToPtr",
-            "FreeHGlobal", "IntPtr",
+                    for (int j = Math.Min(i, runningHash.Length) - 1; j > 0; j--)


I must be missing something because I don't understand this loop. It looks like computing the next hash (based on HashNewPattern below) simply takes the previous hash, multiplies by LCG, and adds the next character. Why are previous hashes re-computed (j-1, j=(i-1) -> 1)?

I'm not sure how to improve the comment. I'll try again I guess.

Briefly - omitting many other steps:

Iteration n: compute hash on Emi (len 3)
Iteration n+1: we compute hash on Emit (len 4) using hash from previous iteration (j-1)
Iteration n+1: compute hash on mit (len 3) (overwriting previous iteration, so that's why we go from longest to shortest.)

Ok, I understand the progression now and the loop makes sense. This makes the algorithm much clearer (at least to me). Can you add this to your comment?
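
The progression described above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the merged code: the multiplier, the class name, the pattern list parameter, and the per-call dictionary are all hypothetical (the dictionary allocates, which the real code avoids). The final string comparison is an extra safeguard added for this sketch so that a hash collision cannot report a false match. Patterns are assumed to contain only lowercase ASCII letters and '-'.

```csharp
using System;
using System.Collections.Generic;

static class RunningHashSketch
{
    const uint Multiplier = 31;      // hypothetical constant
    const int LongestPattern = 29;   // limit mentioned in the review

    static uint Hash(string s)
    {
        uint hash = 0;
        foreach (char c in s) hash = unchecked(hash * Multiplier + c);
        return hash;
    }

    // Returns the first pattern found in text, or null.
    public static string Match(string text, string[] patterns)
    {
        // Simplification: built per call here; assumed precomputed in real code.
        var known = new Dictionary<uint, string>();
        foreach (string p in patterns) known[Hash(p)] = p;

        // runningHash[j] holds the hash of the last j + 1 characters.
        var runningHash = new uint[LongestPattern];
        int runLength = 0;

        for (int i = 0; i < text.Length; i++)
        {
            uint h = text[i];
            if (h >= 'A' && h <= 'Z') h += 32;   // ASCII lowercase (assumption)

            // Characters outside the pattern alphabet reset the run.
            if (!((h >= 'a' && h <= 'z') || h == '-'))
            {
                runLength = 0;
                continue;
            }
            if (runLength < LongestPattern) runLength++;

            // Longest to shortest: slot j is rebuilt from slot j - 1,
            // which still holds the value from the previous iteration.
            for (int j = runLength - 1; j >= 0; j--)
            {
                runningHash[j] = j == 0
                    ? h
                    : unchecked(runningHash[j - 1] * Multiplier + h);

                if (known.TryGetValue(runningHash[j], out string hit)
                    && hit.Length == j + 1
                    && string.Compare(text, i - j, hit, 0, hit.Length,
                                      StringComparison.OrdinalIgnoreCase) == 0)
                {
                    return hit;
                }
            }
        }
        return null;
    }
}
```

For example, `RunningHashSketch.Match("[Reflection.Emit.OpCodes]", new[] { "emit" })` extends the hash of "emi" to "emit" when it reaches the "t", then overwrites the length-3 slot with the hash of "mit", which is why the inner loop runs from longest to shortest.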

PaulHigin

Just some minor further comments. Otherwise LGTM.
With your tests it looks like false positives are very rare.

lzybkr requested review from BrucePay and daxian-dbw as code owners November 2, 2017 03:43

lzybkr requested review from SteveL-MSFT and PaulHigin November 2, 2017 04:05

iSazonov reviewed Nov 2, 2017

Address review feedback

357e985

PaulHigin suggested changes Nov 2, 2017

Address code review feedback

91de31e

PaulHigin approved these changes Nov 2, 2017

Make loop comment clearer

829fa44

lzybkr merged commit 51c9854 into PowerShell:master Nov 3, 2017

lzybkr deleted the faster_suspicious_check branch November 3, 2017 00:59

PaulHigin added the Backport-5.1-Consider label (Consider to backport to Windows PowerShell 5.1 due to impact) Nov 3, 2017
