-
Notifications
You must be signed in to change notification settings - Fork 7.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Binary Parsing Support & Refactor TryGetNumberValue & ScanNumberHelper #7993
Conversation
These are the benchmarks @HumanEquivalentUnit and I performed whilst working on this binary parser. Top notch work from him, frankly astonishing. The two intermediary methods were some of the better solutions of the 10 or 20 iterations of this code that we examined. https://gist.github.com/vexx32/852cd1d0ddc3e1a93f77b90ab23cfe3d Note that all cases were tested within ranges that Convert.ToIntX() methods are available, but the current method (labelled |
@vexx32 I'd prefer to see small PR - with |
I understand your viewpoint, but in order to permit I think the alternative would leave you guys with a lot more to review, only to possibly pull it out later. Also, it would make any further PRs significantly more complex. I felt it best to simplify the code paths as much as possible, in order to prevent the tokenizer becoming a complex morass of unmaintainable code. The only parts I could really strip out for you guys to simplify them is the underscore literal support, and that is... not a lot. :( |
Underscore and I suffix is not approved in related issue. So we have to wait. |
3946b0d
to
5f36d13
Compare
Rebased to pick up CI fixes. 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll continue after my computer rebooted :-)
case 32: // int | ||
case 64: // long | ||
case 96: // decimal | ||
case int n when n >= 128: // BigInteger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is clear logic for sign but I think we need more feedback whether this is a good UX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is consistent with the existing patterns for things like hexadecimal, but I would be inclined to agree that there could be improvements made here, should we want to.
{ | ||
if (c < 128) | ||
{ | ||
return (s_traits[c] & CharTraits.BinaryDigit) != 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about:
return c & 0x30 || c & 0x31;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no reason we can't do this, but I don't see anything else in the file that behaves this way and felt it best to continue the prevalent pattern here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+30% performance :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the experiment below my suggestion should be faster :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you meant c == 0x30 || c == 0x31
? &
does not return a straight true
/false
or 0
/ 1
(we can only do this in the binary parse because we also & 0b1111
or strip the high bits by cast to (byte)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return (s_traits[c] & CharTraits.BinaryDigit) != 0; | |
internal static bool IsBinaryDigit(this char c) => c == '0' || c == '1'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iSazonov honestly a lot of these could also be reformatted like this regardless. For example:
internal static bool IsDecimalDigit(this char c)
{
if (c < 128)
{
return (s_traits[c] & CharTraits.Digit) != 0;
}
return false;
}
becomes
internal static bool IsDecimalDigit(this char c) => c < 128 && (s_traits[c] & CharTraits.Digit) != 0;
Is this a worthwhile change to make to the common pattern of these methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we can do this.
Pattern from CoreFX repo:
internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1;
internal static bool IsDecimalDigit(this char c) => (uint)(c - '0') <= 9;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! :)
test/powershell/Modules/Microsoft.PowerShell.Utility/Get-Random.Tests.ps1
Outdated
Show resolved
Hide resolved
// Calculate number of 8-bit bytes needed to hold the input, rounded up to next whole number. | ||
// This value will also be used as the indexer for our array. | ||
int outputByteWalker = (digits.Length + 7) / 8; | ||
Span<byte> outputBytes = new byte[outputByteWalker--]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we use auto-decrement here? I'd prefer do "-1" in previous init line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just felt clever to use autodecrement to change the variable from "size of array" to "last index of array".
It is not very clear, and could be outputByteWalker -= 1
in the following line, if that's more readable. Or like this?
Span<byte> outputBytes = new byte[(digits.Length + 7) / 8];
int outputByteWalker = outputBytes.Length - 1;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use more informational name like outputByteWalkerLastIndex.
@@ -202,7 +202,7 @@ internal static BigInteger ParseBinary(ReadOnlySpan<char> digits, bool unsigned) | |||
// Calculate number of 8-bit bytes needed to hold the input, rounded up to next whole number. | |||
// This value will also be used as the indexer for our array. | |||
int outputByteWalker = (digits.Length + 7) / 8; | |||
Span<byte> outputBytes = new byte[outputByteWalker--]; | |||
Span<byte> outputBytes = outputByteWalker <= 512 ? stackalloc byte[outputByteWalker--] : new byte[outputByteWalker--]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use outputByteWalker--
here - very bad readability is way to bugs.
I'd add new local variable - one for size/length, second for last index if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take care of this momentarily. 😄
a5e01c5
to
053262c
Compare
1d83a13
to
7d4ff02
Compare
case 16: // short | ||
case 32: // int | ||
case 64: // long | ||
case 96: // decimal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder - is this constants the same for all hardware platforms? Can we catch a problem on some ARMs? Or more general question - do we sure that PowerShell signed constants (binary and hex) is portable?
@mklement0 have you any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge, there aren't any supported platforms where these numbers will be different. However, if there is a programmatic way to get expected sign bit lengths, I would be more than open to checking against this instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vexx32 It is not low level problem because we say about digits
- it can be PowerShell language propblem if ARM users expect sign bit on another position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aye, that could definitely be a concern. I could always refactor this to incorporate this, but I'd need to know what to incorporate here to account for it. Not sure where to find this information at the moment. 🙁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found nothing useful. I hope @mklement0 help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vexx32: The always-positive logic is fairly easy to implement, if I understand things correctly:
From a hex string: prepend a 0
:
// 0xff as 255; without the leading 0, the result would be -1
BigInteger.Parse("0ff",NumberStyles.AllowHexSpecifier)
From a byte array - prepend a 0x
byte (with Big-Endian ordering):
new BigInteger(new byte[] { 0x0, 0xff }, isBigEndian: true)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even easier, actually. Hex strings are currently given a leading zero by default with my current code. I am then stripping it out in cases where it appears appropriate to allow a sign bit. I would simply remove the additional logic.
And with binary, it would simply always mean invoking the BigInteger constructor with a value of true
for the isUnsigned
parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking further, I could even trim out the extra leading 0 if we eventually opt for parsing as unsigned, as it will be possible to simply instruct the BigInteger constructor to always read it as unsigned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, on the topic of the breaking change. I know it's desirable to change it, but it seems a bit risky to make such a change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured that would probably be the case. It is a fairly straightforward follow up if we find reason to change it later.
I could even put it in as an experimental feature later on, I think.
| (digits[byteWalker - 6] << 6) | ||
| (digits[byteWalker - 5] << 5) | ||
| (digits[byteWalker - 4] << 4)) | ||
| (((digits[byteWalker - 3] << 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we use the extra parentheses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It lets us make (8x zero + 7x OR) operations down to (1x zeroing + 8x OR), and is faster.
The brackets are:
(byte) (
((7)|(6)|(5)|(4)) | ( ((3)|(2)|(1)|0) & 0b1111 )
)
This works because the bit-patterns in the input (chars '0', '1') are 00110000
and 00110001
. That is a coincidence of them being ASCII 48, ASCII 49. We can take advantage of those bit patterns, and do less work. We could zero the 0011
on every character, like this more readable slower code, and then shift the least significant bit, and OR them all together:
(byte)(
((digits[byteWalker - 7] - '0') << 7) |
((digits[byteWalker - 6] - '0') << 6) |
((digits[byteWalker - 5] - '0') << 5) |
((digits[byteWalker - 4] - '0') << 4) |
((digits[byteWalker - 3] - '0') << 3) |
((digits[byteWalker - 2] - '0') << 2) |
((digits[byteWalker - 1] - '0') << 1) |
((digits[byteWalker - 0] - '0') )
)
But, for the 7,6,5,4 shifts, they will push the 0011
part far out of the way. When we cast to (byte) they get dropped. Saves 4x zeroing operations if we treat those as a group.
The 3,2,1,0 shifts will combine with each other, but the high 0011
bits would clash if we weren't careful. So we merge these as a group, then do a single & 0b1111
of this group, and save 3x zeroing.
Then +1x OR to merge the two halves. ((7)|(6)|(5)|(4)) | ( ((3)|(2)|(1)|0) & 0b1111 )
This is quite hard to explain, I hope this is readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not ( (7)|(6)|(5)|(4)|(3)|(2)|(1)|(0) ) & 0b1111
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would not give the right answer, it would make the high bits from inputs 3,2,1,0 overwrite the data from 7,6,5,4 (then & 0b1111 would leave only 0-15 output values).
String input: "10101011"
index from right: 76543210
Bit patterns of shifted input characters,
we want just the single far right bit from each line,
all merged down:
0011000|7
001100|06
00110|005
0011|0004
|-------possible overlap when combining
001|1000|3
00|1100|02
0|0110|001
|0011|0000
^ removed by &0b1111 to prevent overlap
only in the 3,2,1,0 section
^ far left overlap is OK, dropped by (byte) cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarification!
I think we need reformat the code and add the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would appreciate some suggestions on how to approach the reformatting. The actual behaviour of the code is documented in the comments (although perhaps that can also be improved upon).
I suspect it will be necessary to skirt StyleCop rules heavily to make this much more readable than it is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- We could add the well-formatted sample
( (7)|(6)|(5)|(4) ) | ( ( (3)|(2)|(1)|(0) ) & 0b1111 )
in your comment. - We could add white spaces in code for better readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
I think this might annoy StyleCop, but... such is the price of readability. 😄
{ | ||
if (c < 128) | ||
{ | ||
return (s_traits[c] & CharTraits.BinaryDigit) != 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return (s_traits[c] & CharTraits.BinaryDigit) != 0; | |
internal static bool IsBinaryDigit(this char c) => c == '0' || c == '1'; |
test/powershell/Modules/Microsoft.PowerShell.Utility/Get-Random.Tests.ps1
Outdated
Show resolved
Hide resolved
@vexx32 I have no more ideas than to occupy your time :-) - thanks for great work! We are waiting PowerShell Committee. |
Appreciate you taking the time to help us make it better, Ilya! 😄 I might rebase once more to tidy the commit history a little if I get a moment, but otherwise I don't mind waiting till the Committee makes a decision here. |
Please don't rebase until maintainers ask you. |
This PR has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs within 10 days. |
@SteveL-MSFT @TravisEz13 @daxian-dbw Please review the PR. |
@TravisEz13 are we just pending Dongbo to get back from his well-deserved vacation and review? Any chance this will make it for 6.2, or should I expect it to wait till after that is released? :slight_smile: |
@vexx32 I'd like to get someone to review this that has some experience in the area. Or at least one more person from the PowerShell team. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. Left a minor comment, but not a blocker
return false; | ||
} | ||
// Return true if the character is a binary digit. | ||
internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is true, it seems more straightforward to just have => c == '0' || c == '1';
. It took me a minute or so to realise why this isn't a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Perhaps the coercion + comparison has a performance advantage?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjmholt I was following @iSazonov's lead on that one, he apparently pulled that from the CoreFX code somewhere or other. Knowing nothing about how these checks would be performed on a hardware level means I can only speculate on potential ways this might be better than the ultimately far more readable version you propose.
Should I change it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjmholt The micro optimization come from CoreFX - 2 vs 3 operations - this gives huge perf win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add a comment to this effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we get the question the comment can help code readers.
result = dmValue; | ||
return true; | ||
} | ||
else if (Utils.TryCast(bigValue, out double d)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for the else
here; following the branch above always returns
@TravisEz13 Seems like CI got stuck here, sorry; could you kick it for me when you get some time? Cheers! 😊 |
@vexx32 The jobs never started. I cannot restart them without closing the PR and re-opening it. I've seen in a few cases where you have |
…hell into Tokenizer/Refactor
Because this is a breaking change, I'm scheduling it to be merged first thing during 6.3. |
@TravisEz13 Do you guys need anything more from me before you merge? 🙂 |
Perhaps @daxian-dbw could find something :-) |
@vexx32 Thanks for the great improvement! |
…elper (PowerShell#7993) Fixes PowerShell#7557 * Adds support for binary parsing in format echoing hex: `0b11010110` * Works with all existing type suffixes and multipliers. * Supports arbitrary length parsing with `n` suffix using BigInteger; details below. * Adds `NumberFormat` enum to specify hex/binary/base 10 for the tokenizer, replacing old `bool hex`. * Adds `n` suffix for all numeric literals to support returning value as a `BigInteger` if requested. This bypasses the issue of large literals losing accuracy when they cast through `double`. * Adds tests for all new behaviours. --- ### Binary / Hex Parsing Implementation * Mimics old sign bit behaviour for int and long types. Sign bits accepted for 8 or 16-bit Hex parsing, and 8, 16, 32, 64 for binary. * i.e., `0xFFFFFFFF -eq ([int]-1)` and `0xFFFFFFFFFFFFFFFF -eq ([long]-1)`, but suffixing `u` creates `int.MaxValue` and `long.MaxValue`, respectively, instead. * Sign bits higher than this are accepted for bigint-suffixed numerals: * Hex: Bigint-suffixed hex treats the high bit of any literal with a length multiple of 8 as the sign bit * Binary: Bigint-suffixed binary accepts sign bits at 96 and 128 chars, and from there on every 8 characters. * Prefixing the literal with a 0 will bypass this and be treated as unsigned, e.g. `0b011111111` * Specifying an `u`nsigned suffix (or combination suffix that includes `u`) ignores sign bits, similar to how parsing a hex string using `[Convert]::ToUint32()` would do so. * Supports negating literals using `-` prefix. This can result in positive numbers due to sign bits being permitted, just like hex literals. --- ### Refactored numeric tokenizer parsing **New flow:** 1. Check for `real` (`.01`, `0.0`, or `0e0` syntaxes) 1. If the decimal suffix is present, TryParse directly into decimal. If the TryParse fails, TryGetNumberValue returns `false`. 2. TryParse as `Double`, and apply multiplier to value. If the TryParse fails, TryGetNumberValue returns `false`. 1. Check type suffixes and attempt to cast into appropriate type. This will return `false` if the value exceeds the specified type's bounds. 2. Default to parsing as `double` where no suffix has been applied. 2. Check number format. * If binary, manually parse into BigInteger using optimized helper function to directly construct the BigInteger bytes from the string. * If hex, TryParse into `BigInteger` using some special casing to retain original behaviours in int/long ranges. * If neither binary nor hex, TryParse normally as a `BigInteger`. 3. Apply multiplier value before attempting any casts to ensure type bounds can be appropriately checked without overflows. 4. Check type suffixes. * If a specific type suffix is used, check type bounds and attempt to parse into that type. * If the value exceeds the type's available values, the parse fails. Otherwise, a straight cast is performed. 5. If no suffix is used, the following types are bounds-checked, in order, resulting in the first successful test determining the type of the number. * `int` * `long` * `decimal` (base-10 literals only) * `double` (base-10 literals only) * ~~`BigInteger` for binary or hex literals.~~ If the value is outside `long` range (for hex and binary) or `double` range (for base 10), the parse will fail; higher values must be explicitly requested using the `n`/`N` BigInteger suffix. --- *This is a breaking change* as binary literals are now read as numbers instead of generic tokens which could potentially have been used as function / cmdlet names or file names. Notes: * Binary literal support was approved by the committee in PowerShell#7557 * ~~The same issue is still under further discussion for underscore support in numeric literals and whether BigInteger parsing ought to be exposed to the user at all.~~ * ~~Supporting underscore literals is a further breaking change causing some generic tokens like `1_000_000` to be read as numerals instead.~~ Per @SteveL-MSFT's [comment](PowerShell#7993 (comment)) this proposal was rejected. * ~~Removing underscore support or preventing standard parsing from accepting BigInteger ranges is a relatively trivial matter. It is my personal opinion that there is no particular reason *not* to hand the user a BigInteger when they enter a sufficiently large literal, but I will defer to the PowerShell Committee's judgement on this.~~
PR Summary
Fixes #7557
0b11010110
n
suffix using BigInteger; details below.NumberFormat
enum to specify hex/binary/base 10 for the tokenizer, replacing oldbool hex
.n
suffix for all numeric literals to support returning value as aBigInteger
if requested. This bypasses the issue of large literals losing accuracy when they cast throughdouble
.Binary / Hex Parsing Implementation
0xFFFFFFFF -eq ([int]-1)
and0xFFFFFFFFFFFFFFFF -eq ([long]-1)
, but suffixingu
createsint.MaxValue
andlong.MaxValue
, respectively, instead.0b011111111
u
nsigned suffix (or combination suffix that includesu
) ignores sign bits, similar to how parsing a hex string using[Convert]::ToUint32()
would do so.-
prefix. This can result in positive numbers due to sign bits being permitted, just like hex literals.Refactored numeric tokenizer parsing
New flow:
real
(.01
,0.0
, or0e0
syntaxes)false
.Double
, and apply multiplier to value. If the TryParse fails, TryGetNumberValue returnsfalse
.false
if the value exceeds the specified type's bounds.double
where no suffix has been applied.BigInteger
using some special casing to retain original behaviours in int/long ranges.BigInteger
.int
long
decimal
(base-10 literals only)double
(base-10 literals only)If the value is outsideBigInteger
for binary or hex literals.long
range (for hex and binary) ordouble
range (for base 10), the parse will fail; higher values must be explicitly requested using then
/N
BigInteger suffix.This is a breaking change as binary literals are now read as numbers instead of generic tokens which could potentially have been used as function / cmdlet names or file names.
Notes:
The same issue is still under further discussion for underscore support in numeric literals and whether BigInteger parsing ought to be exposed to the user at all.Supporting underscore literals is a further breaking change causing some generic tokens likePer @SteveL-MSFT's comment this proposal was rejected.1_000_000
to be read as numerals instead.Removing underscore support or preventing standard parsing from accepting BigInteger ranges is a relatively trivial matter. It is my personal opinion that there is no particular reason not to hand the user a BigInteger when they enter a sufficiently large literal, but I will defer to the PowerShell Committee's judgement on this.PR Checklist
[Change is not breaking](https://github.com/PowerShell/PowerShell/blob/master/.github/CONTRIBUTING.md#making-breaking-changes).h
,.cpp
,.cs
,.ps1
and.psm1
files have the correct copyright headerWIP:
to the beginning of the title and remove the prefix when the PR is ready.[feature]
if the change is significant or affects feature tests/cc @iSazonov , @SteveL-MSFT as it pertains to the tagged-for-further-discussion #7557