Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Binary Parsing Support & Refactor TryGetNumberValue & ScanNumberHelper #7993

Merged
merged 54 commits into from
Apr 3, 2019

Conversation

vexx32
Copy link
Collaborator

@vexx32 vexx32 commented Oct 11, 2018

PR Summary

Fixes #7557

  • Adds support for binary parsing in format echoing hex: 0b11010110
    • Works with all existing type suffixes and multipliers.
    • Supports arbitrary length parsing with n suffix using BigInteger; details below.
  • Adds NumberFormat enum to specify hex/binary/base 10 for the tokenizer, replacing old bool hex.
  • Adds n suffix for all numeric literals to support returning value as a BigInteger if requested. This bypasses the issue of large literals losing accuracy when they cast through double.
  • Adds tests for all new behaviours.

Binary / Hex Parsing Implementation

  • Mimics old sign bit behaviour for int and long types. Sign bits accepted for 8 or 16-bit Hex parsing, and 8, 16, 32, 64 for binary.
    • i.e., 0xFFFFFFFF -eq ([int]-1) and 0xFFFFFFFFFFFFFFFF -eq ([long]-1), but suffixing u creates int.MaxValue and long.MaxValue, respectively, instead.
  • Sign bits higher than this are accepted for bigint-suffixed numerals:
    • Hex: Bigint-suffixed hex treats the high bit of any literal with a length multiple of 8 as the sign bit
    • Binary: Bigint-suffixed binary accepts sign bits at 96 and 128 chars, and from there on every 8 characters.
    • Prefixing the literal with a 0 will bypass this and be treated as unsigned, e.g. 0b011111111
  • Specifying an unsigned suffix (or combination suffix that includes u) ignores sign bits, similar to how parsing a hex string using [Convert]::ToUint32() would do so.
  • Supports negating literals using - prefix. This can result in positive numbers due to sign bits being permitted, just like hex literals.

Refactored numeric tokenizer parsing

New flow:

  1. Check for real (.01, 0.0, or 0e0 syntaxes)
    1. If the decimal suffix is present, TryParse directly into decimal. If the TryParse fails, TryGetNumberValue returns false.
    2. TryParse as Double, and apply multiplier to value. If the TryParse fails, TryGetNumberValue returns false.
      1. Check type suffixes and attempt to cast into appropriate type. This will return false if the value exceeds the specified type's bounds.
      2. Default to parsing as double where no suffix has been applied.
  2. Check number format.
    • If binary, manually parse into BigInteger using optimized helper function to directly construct the BigInteger bytes from the string.
    • If hex, TryParse into BigInteger using some special casing to retain original behaviours in int/long ranges.
    • If neither binary nor hex, TryParse normally as a BigInteger.
  3. Apply multiplier value before attempting any casts to ensure type bounds can be appropriately checked without overflows.
  4. Check type suffixes.
    • If a specific type suffix is used, check type bounds and attempt to parse into that type.
      • If the value exceeds the type's available values, the parse fails. Otherwise, a straight cast is performed.
  5. If no suffix is used, the following types are bounds-checked, in order, resulting in the first successful test determining the type of the number.
    • int
    • long
    • decimal (base-10 literals only)
    • double (base-10 literals only)
    • BigInteger for binary or hex literals. If the value is outside long range (for hex and binary) or double range (for base 10), the parse will fail; higher values must be explicitly requested using the n/N BigInteger suffix.

This is a breaking change as binary literals are now read as numbers instead of generic tokens which could potentially have been used as function / cmdlet names or file names.

Notes:

  • Binary literal support was approved by the committee in Allow binary numbers, byte literals, and BigInteger? #7557
  • The same issue is still under further discussion for underscore support in numeric literals and whether BigInteger parsing ought to be exposed to the user at all.
    • Supporting underscore literals is a further breaking change causing some generic tokens like 1_000_000 to be read as numerals instead. Per @SteveL-MSFT's comment this proposal was rejected.
    • Removing underscore support or preventing standard parsing from accepting BigInteger ranges is a relatively trivial matter. It is my personal opinion that there is no particular reason not to hand the user a BigInteger when they enter a sufficiently large literal, but I will defer to the PowerShell Committee's judgement on this.

PR Checklist

/cc @iSazonov , @SteveL-MSFT as it pertains to the tagged-for-further-discussion #7557

@vexx32
Copy link
Collaborator Author

vexx32 commented Oct 11, 2018

These are the benchmarks @HumanEquivalentUnit and I performed whilst working on this binary parser. Top notch work from him, frankly astonishing. The two intermediary methods were some of the better solutions of the 10 or 20 iterations of this code that we examined.

https://gist.github.com/vexx32/852cd1d0ddc3e1a93f77b90ab23cfe3d

Note that all cases were tested within ranges that Convert.ToIntX() methods are available, but the current method (labelled BigEndianBinaryOr in this table) generalises seamlessly to arbitrary-length binary strings.

@iSazonov
Copy link
Collaborator

@vexx32 I'd prefer to see small PR - with 0b only. We could more fast review.

@vexx32
Copy link
Collaborator Author

vexx32 commented Oct 11, 2018

I understand your viewpoint, but in order to permit 0b parsing a majority of these changes are required -- or a significant amount of redundant code would need to be in place, because TryParse() methods are not capable of parsing binary.

I think the alternative would leave you guys with a lot more to review, only to possibly pull it out later. Also, it would make any further PRs significantly more complex. I felt it best to simplify the code paths as much as possible, in order to prevent the tokenizer becoming a complex morass of unmaintainable code.

The only parts I could really strip out for you guys to simplify them is the underscore literal support, and that is... not a lot. :(

@iSazonov
Copy link
Collaborator

Underscore and I suffix is not approved in related issue. So we have to wait.

@vexx32
Copy link
Collaborator Author

vexx32 commented Oct 12, 2018

Rebased to pick up CI fixes. 😄

Copy link
Collaborator

@iSazonov iSazonov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll continue after my computer rebooted :-)

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
case 32: // int
case 64: // long
case 96: // decimal
case int n when n >= 128: // BigInteger
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is clear logic for sign but I think we need more feedback whether this is a good UX.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is consistent with the existing patterns for things like hexadecimal, but I would be inclined to agree that there could be improvements made here, should we want to.

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
{
if (c < 128)
{
return (s_traits[c] & CharTraits.BinaryDigit) != 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about:

return c & 0x30 || c & 0x31;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no reason we can't do this, but I don't see anything else in the file that behaves this way and felt it best to continue the prevalent pattern here?

Copy link
Collaborator

@iSazonov iSazonov Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+30% performance :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the experiment below my suggestion should be faster :-)

Copy link
Collaborator Author

@vexx32 vexx32 Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you meant c == 0x30 || c == 0x31? & does not return a straight true/false or 0 / 1
(we can only do this in the binary parse because we also & 0b1111 or strip the high bits by cast to (byte))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return (s_traits[c] & CharTraits.BinaryDigit) != 0;
internal static bool IsBinaryDigit(this char c) => c == '0' || c == '1';

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iSazonov honestly a lot of these could also be reformatted like this regardless. For example:

internal static bool IsDecimalDigit(this char c)
{
    if (c < 128)
    {
        return (s_traits[c] & CharTraits.Digit) != 0;
    }

    return false;
}

becomes

internal static bool IsDecimalDigit(this char c) => c < 128 && (s_traits[c] & CharTraits.Digit) != 0;

Is this a worthwhile change to make to the common pattern of these methods?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we can do this.
Pattern from CoreFX repo:

internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1;
internal static bool IsDecimalDigit(this char c) => (uint)(c - '0') <= 9;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! :)

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
// Calculate number of 8-bit bytes needed to hold the input, rounded up to next whole number.
// This value will also be used as the indexer for our array.
int outputByteWalker = (digits.Length + 7) / 8;
Span<byte> outputBytes = new byte[outputByteWalker--];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we use auto-decrement here? I'd prefer do "-1" in previous init line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just felt clever to use autodecrement to change the variable from "size of array" to "last index of array".
It is not very clear, and could be outputByteWalker -= 1 in the following line, if that's more readable. Or like this?

            Span<byte> outputBytes = new byte[(digits.Length + 7) / 8];
            int outputByteWalker = outputBytes.Length - 1;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use more informational name like outputByteWalkerLastIndex.

src/System.Management.Automation/engine/Utils.cs Outdated Show resolved Hide resolved
@@ -202,7 +202,7 @@ internal static BigInteger ParseBinary(ReadOnlySpan<char> digits, bool unsigned)
// Calculate number of 8-bit bytes needed to hold the input, rounded up to next whole number.
// This value will also be used as the indexer for our array.
int outputByteWalker = (digits.Length + 7) / 8;
Span<byte> outputBytes = new byte[outputByteWalker--];
Span<byte> outputBytes = outputByteWalker <= 512 ? stackalloc byte[outputByteWalker--] : new byte[outputByteWalker--];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't use outputByteWalker-- here - very bad readability is way to bugs.
I'd add new local variable - one for size/length, second for last index if needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take care of this momentarily. 😄

case 16: // short
case 32: // int
case 64: // long
case 96: // decimal
Copy link
Collaborator

@iSazonov iSazonov Oct 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder - is this constants the same for all hardware platforms? Can we catch a problem on some ARMs? Or more general question - do we sure that PowerShell signed constants (binary and hex) is portable?

@mklement0 have you any thoughts?

Copy link
Collaborator Author

@vexx32 vexx32 Oct 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge, there aren't any supported platforms where these numbers will be different. However, if there is a programmatic way to get expected sign bit lengths, I would be more than open to checking against this instead.

Copy link
Collaborator

@iSazonov iSazonov Oct 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vexx32 It is not low level problem because we say about digits - it can be PowerShell language propblem if ARM users expect sign bit on another position.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye, that could definitely be a concern. I could always refactor this to incorporate this, but I'd need to know what to incorporate here to account for it. Not sure where to find this information at the moment. 🙁

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found nothing useful. I hope @mklement0 help.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vexx32: The always-positive logic is fairly easy to implement, if I understand things correctly:

From a hex string: prepend a 0:

// 0xff as 255; without the leading 0, the result would be -1
BigInteger.Parse("0ff",NumberStyles.AllowHexSpecifier)

From a byte array - prepend a 0x byte (with Big-Endian ordering):

new BigInteger(new byte[] { 0x0, 0xff }, isBigEndian: true)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even easier, actually. Hex strings are currently given a leading zero by default with my current code. I am then stripping it out in cases where it appears appropriate to allow a sign bit. I would simply remove the additional logic.

And with binary, it would simply always mean invoking the BigInteger constructor with a value of true for the isUnsigned parameter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking further, I could even trim out the extra leading 0 if we eventually opt for parsing as unsigned, as it will be possible to simply instruct the BigInteger constructor to always read it as unsigned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, on the topic of the breaking change. I know it's desirable to change it, but it seems a bit risky to make such a change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured that would probably be the case. It is a fairly straightforward follow up if we find reason to change it later.

I could even put it in as an experimental feature later on, I think.

| (digits[byteWalker - 6] << 6)
| (digits[byteWalker - 5] << 5)
| (digits[byteWalker - 4] << 4))
| (((digits[byteWalker - 3] << 3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we use the extra parentheses?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It lets us make (8x zero + 7x OR) operations down to (1x zeroing + 8x OR), and is faster.

The brackets are:

(byte) (
    ((7)|(6)|(5)|(4))   |  ( ((3)|(2)|(1)|0) & 0b1111 )
)

This works because the bit-patterns in the input (chars '0', '1') are 00110000 and 00110001. That is a coincidence of them being ASCII 48, ASCII 49. We can take advantage of those bit patterns, and do less work. We could zero the 0011 on every character, like this more readable slower code, and then shift the least significant bit, and OR them all together:

(byte)(
      ((digits[byteWalker - 7] - '0') << 7) |
      ((digits[byteWalker - 6] - '0') << 6) |
      ((digits[byteWalker - 5] - '0') << 5) |
      ((digits[byteWalker - 4] - '0') << 4) |
      ((digits[byteWalker - 3] - '0') << 3) |
      ((digits[byteWalker - 2] - '0') << 2) |
      ((digits[byteWalker - 1] - '0') << 1) |
      ((digits[byteWalker - 0] - '0')     )
      )

But, for the 7,6,5,4 shifts, they will push the 0011 part far out of the way. When we cast to (byte) they get dropped. Saves 4x zeroing operations if we treat those as a group.

The 3,2,1,0 shifts will combine with each other, but the high 0011 bits would clash if we weren't careful. So we merge these as a group, then do a single & 0b1111 of this group, and save 3x zeroing.

Then +1x OR to merge the two halves. ((7)|(6)|(5)|(4)) | ( ((3)|(2)|(1)|0) & 0b1111 )

This is quite hard to explain, I hope this is readable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not ( (7)|(6)|(5)|(4)|(3)|(2)|(1)|(0) ) & 0b1111?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would not give the right answer, it would make the high bits from inputs 3,2,1,0 overwrite the data from 7,6,5,4 (then & 0b1111 would leave only 0-15 output values).

String input:        "10101011"
index from right:     76543210

Bit patterns of shifted input characters, 
we want just the single far right bit from each line, 
all merged down:

    0011000|7
     001100|06
      00110|005   
       0011|0004
             |-------possible overlap when combining
        001|1000|3
         00|1100|02
          0|0110|001
           |0011|0000

             ^ removed by &0b1111 to prevent overlap
                only in the 3,2,1,0 section

      ^ far left overlap is OK, dropped by (byte) cast

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarification!

I think we need reformat the code and add the comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would appreciate some suggestions on how to approach the reformatting. The actual behaviour of the code is documented in the comments (although perhaps that can also be improved upon).

I suspect it will be necessary to skirt StyleCop rules heavily to make this much more readable than it is?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We could add the well-formatted sample ( (7)|(6)|(5)|(4) ) | ( ( (3)|(2)|(1)|(0) ) & 0b1111 ) in your comment.
  2. We could add white spaces in code for better readability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

I think this might annoy StyleCop, but... such is the price of readability. 😄

{
if (c < 128)
{
return (s_traits[c] & CharTraits.BinaryDigit) != 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return (s_traits[c] & CharTraits.BinaryDigit) != 0;
internal static bool IsBinaryDigit(this char c) => c == '0' || c == '1';

@iSazonov iSazonov added Review - Committee The PR/Issue needs a review from the PowerShell Committee Breaking-Change breaking change that may affect users Documentation Needed in this repo Documentation is needed in this repo labels Oct 17, 2018
@iSazonov
Copy link
Collaborator

@vexx32 I have no more ideas than to occupy your time :-) - thanks for great work! We are waiting PowerShell Committee.

@vexx32
Copy link
Collaborator Author

vexx32 commented Oct 17, 2018

Appreciate you taking the time to help us make it better, Ilya! 😄

I might rebase once more to tidy the commit history a little if I get a moment, but otherwise I don't mind waiting till the Committee makes a decision here.

@iSazonov
Copy link
Collaborator

Please don't rebase until maintainers ask you.

@stale
Copy link

stale bot commented Nov 23, 2018

This PR has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs within 10 days.
Thank you for your contributions.
Community members are welcome to grab these works.

@stale stale bot added the Stale label Nov 23, 2018
@iSazonov
Copy link
Collaborator

iSazonov commented Feb 1, 2019

@SteveL-MSFT @TravisEz13 @daxian-dbw Please review the PR.

@vexx32
Copy link
Collaborator Author

vexx32 commented Feb 11, 2019

@TravisEz13 are we just pending Dongbo to get back from his well-deserved vacation and review?

Any chance this will make it for 6.2, or should I expect it to wait till after that is released? :slight_smile:

@TravisEz13
Copy link
Member

@vexx32 I'd like to get someone to review this that has some experience in the area. Or at least one more person from the PowerShell team.

Copy link
Collaborator

@rjmholt rjmholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Left a minor comment, but not a blocker

return false;
}
// Return true if the character is a binary digit.
internal static bool IsBinaryDigit(this char c) => (uint)(c - '0') <= 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is true, it seems more straightforward to just have => c == '0' || c == '1';. It took me a minute or so to realise why this isn't a problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Perhaps the coercion + comparison has a performance advantage?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjmholt I was following @iSazonov's lead on that one, he apparently pulled that from the CoreFX code somewhere or other. Knowing nothing about how these checks would be performed on a hardware level means I can only speculate on potential ways this might be better than the ultimately far more readable version you propose.

Should I change it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjmholt The micro optimization come from CoreFX - 2 vs 3 operations - this gives huge perf win.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I add a comment to this effect?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we get the question the comment can help code readers.

result = dmValue;
return true;
}
else if (Utils.TryCast(bigValue, out double d))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for the else here; following the branch above always returns

@vexx32
Copy link
Collaborator Author

vexx32 commented Feb 12, 2019

@TravisEz13 Seems like CI got stuck here, sorry; could you kick it for me when you get some time? Cheers! 😊

@TravisEz13
Copy link
Member

@vexx32 The jobs never started. I cannot restart them without closing the PR and re-opening it. I've seen in a few cases where you have force-pushed that the PR cannot be reopened. The easiest way is to force push again. For example, make a whitespace change to one of your commit messages and push.

@TravisEz13 TravisEz13 added this to the 6.3.0-preview.1 milestone Feb 13, 2019
@TravisEz13
Copy link
Member

Because this is a breaking change, I'm scheduling it to be merged first thing during 6.3.

@vexx32
Copy link
Collaborator Author

vexx32 commented Apr 2, 2019

@TravisEz13 Do you guys need anything more from me before you merge? 🙂

@iSazonov
Copy link
Collaborator

iSazonov commented Apr 3, 2019

Perhaps @daxian-dbw could find something :-)

@iSazonov iSazonov added the CL-Engine Indicates that a PR should be marked as an engine change in the Change Log label Apr 3, 2019
@TravisEz13 TravisEz13 merged commit 3f52adb into PowerShell:master Apr 3, 2019
@iSazonov
Copy link
Collaborator

iSazonov commented Apr 4, 2019

@vexx32 Thanks for the great improvement!

@vexx32 vexx32 deleted the Tokenizer/Refactor branch April 4, 2019 20:53
adityapatwardhan pushed a commit to adityapatwardhan/PowerShell that referenced this pull request Apr 4, 2019
…elper (PowerShell#7993)

Fixes PowerShell#7557 

* Adds support for binary parsing in format echoing hex: `0b11010110`
  * Works with all existing type suffixes and multipliers.
  * Supports arbitrary length parsing with `n` suffix using BigInteger; details below.
* Adds `NumberFormat` enum to specify hex/binary/base 10 for the tokenizer, replacing old `bool hex`.
* Adds `n` suffix for all numeric literals to support returning value as a `BigInteger` if requested. This bypasses the issue of large literals losing accuracy when they cast through `double`.
* Adds tests for all new behaviours.

---

### Binary / Hex Parsing Implementation

* Mimics old sign bit behaviour for int and long types. Sign bits accepted for 8 or 16-bit Hex parsing, and 8, 16, 32, 64 for binary.
  * i.e., `0xFFFFFFFF -eq ([int]-1)` and `0xFFFFFFFFFFFFFFFF -eq ([long]-1)`, but suffixing `u` creates `int.MaxValue` and `long.MaxValue`, respectively, instead.
* Sign bits higher than this are accepted for bigint-suffixed numerals:
    * Hex: Bigint-suffixed hex treats the high bit of any literal with a length multiple of 8 as the sign bit
    * Binary: Bigint-suffixed binary accepts sign bits at 96 and 128 chars, and from there on every 8 characters.
    * Prefixing the literal with a 0 will bypass this and be treated as unsigned, e.g. `0b011111111`
* Specifying an `u`nsigned suffix (or combination suffix that includes `u`) ignores sign bits, similar to how parsing a hex string using `[Convert]::ToUint32()` would do so.
* Supports negating literals using `-` prefix. This can result in positive numbers due to sign bits being permitted, just like hex literals.

---

### Refactored numeric tokenizer parsing

**New flow:**

1. Check for `real` (`.01`, `0.0`, or `0e0` syntaxes)
    1. If the decimal suffix is present, TryParse directly into decimal. If the TryParse fails, TryGetNumberValue returns `false`.
    2. TryParse as `Double`, and apply multiplier to value. If the TryParse fails, TryGetNumberValue returns `false`.
        1. Check type suffixes and attempt to cast into appropriate type. This will return `false` if the value exceeds the specified type's bounds.
        2. Default to parsing as `double` where no suffix has been applied.
2. Check number format.
    * If binary, manually parse into BigInteger using optimized helper function to directly construct the BigInteger bytes from the string.
    * If hex, TryParse into `BigInteger` using some special casing to retain original behaviours in int/long ranges.
    * If neither binary nor hex, TryParse normally as a `BigInteger`.
3. Apply multiplier value before attempting any casts to ensure type bounds can be appropriately checked without overflows.
4. Check type suffixes.
    * If a specific type suffix is used, check type bounds and attempt to parse into that type.
      * If the value exceeds the type's available values, the parse fails. Otherwise, a straight cast is performed.
5. If no suffix is used, the following types are bounds-checked, in order, resulting in the first successful test determining the type of the number. 
    * `int`
    * `long`
    * `decimal` (base-10 literals only)
    * `double` (base-10 literals only)
    * ~~`BigInteger` for binary or hex literals.~~ If the value is outside `long` range (for hex and binary) or `double` range (for base 10), the parse will fail; higher values must be explicitly requested using the `n`/`N` BigInteger suffix.

---

*This is a breaking change* as binary literals are now read as numbers instead of generic tokens which could potentially have been used as function / cmdlet names or file names.

Notes:
* Binary literal support was approved by the committee in PowerShell#7557 
* ~~The same issue is still under further discussion for underscore support in numeric literals and whether BigInteger parsing ought to be exposed to the user at all.~~
    * ~~Supporting underscore literals is a further breaking change causing some generic tokens like `1_000_000` to be read as numerals instead.~~ Per @SteveL-MSFT's [comment](PowerShell#7993 (comment)) this proposal was rejected.
    * ~~Removing underscore support or preventing standard parsing from accepting BigInteger ranges is a relatively trivial matter. It is my personal opinion that there is no particular reason *not* to hand the user a BigInteger when they enter a sufficiently large literal, but I will defer to the PowerShell Committee's judgement on this.~~
@vexx32 vexx32 mentioned this pull request Mar 8, 2021
@mklement0 mklement0 mentioned this pull request Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking-Change breaking change that may affect users CL-Engine Indicates that a PR should be marked as an engine change in the Change Log Committee-Reviewed PS-Committee has reviewed this and made a decision Documentation Needed in this repo Documentation is needed in this repo
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow binary numbers, byte literals, and BigInteger?
7 participants