
[Bug][move-compiler-v2] move-compiler-v2/tests/v1.matched tests have diverged from v1 #12712

Open
brmataptos opened this issue Mar 27, 2024 · 7 comments
Labels
bug Something isn't working compiler-v2 stale-exempt Prevents issues from being automatically marked and closed as stale

Comments

@brmataptos
Contributor

brmataptos commented Mar 27, 2024

🐛 Bug

Several of the move-compiler-v2 test files that are allegedly equivalent to v1 tests have diverged from the originals. To accurately assess divergence between V1 and V2, we need the same input code for both.

% diff ./third_party/move/move-compiler-v2/tests/reference-safety/v1-tests/call_mutual_borrows_invalid.move ./third_party/move/move-compiler/tests/move_check/borrows/call_mutual_borrows_invalid.move
21c21
<         id_mut(&mut s1.f); *f;
---
>         id_mut(&mut s1.f); f;
29c29
<         *s;
---
>         s;

Someone needs to diff all the "v1.matched" files to see if they really are matched.

If a modified test is required to exhibit functionality given our new pass/checking ordering, we should include that test as "_modified" or something, but also include the original test file under v2.

We should also go through the test outputs again and make sure they're equivalent, or that bugs are filed for the differences.

@brmataptos brmataptos added the bug Something isn't working label Mar 27, 2024
@brmataptos brmataptos changed the title [Bug][move-compiler-v2] v1 tests have diverged from v1 [Bug][move-compiler-v2] move-compiler-v2/tests/v1.matched tests have diverged from v1 Mar 27, 2024
@wrwg
Contributor

wrwg commented Mar 27, 2024

We discussed this multiple times, so I'm not sure why we are opening a bug again, but fine.

"Someone needs to diff all the "v1.matched" files to see if they really are matched." -- maybe the someone is you? I created already the infra for this diff in move-compiler-v2/tools.

@wrwg
Contributor

wrwg commented Mar 27, 2024

BTW, the diff you are mentioning here is required because of differences in borrow analysis, which are expected. Each of those tests has already been manually audited, and the changes that were made are certainly not arbitrary.

I don't think the team has the resources right now to compare those 400+ tests again, and judging whether a change is needed or not is rather difficult, so I'm downgrading this bug to medium because of infeasibility (and because the changes have already been audited).

@brmataptos
Contributor Author

The diff is required because we don't have good processes. If test changes are required to exhibit the functionality we want, then we should add a new (modified) test that exhibits the functionality of interest, while leaving the original test input unchanged. Changing the test input until your code passes the test isn't typical software engineering practice.

@wrwg
Contributor

wrwg commented Mar 28, 2024

  1. Regarding process: the process is well defined and, as far as I'm concerned, I followed it (modulo possibly some smaller mistakes, which always happen):
  • Every individual test is manually copied over from the v1 directory and run with the v2 compiler, and then the v1 and v2 outputs are compared.
  • It is important that the tests are not bulk copied, because you then lose track of which tests have been audited and which have not.
  • Together with the testdiff tool, which generates the list of all tests that have not been ported yet, this process ensures that you have (a) no unverified tests in the v2 test tree and (b) tracking, via the output of testdiff, of which tests have not been ported yet.
  • This is frankly the best of all the processes I can imagine here. It has also been discussed by the team multiple times during the last year; one more extensive discussion last fall is what led to the creation of testdiff.
  2. Regarding re-verification: the comparison is a large amount of work, and right now we cannot justify the resource investment. Here is why:
  • The errors reported by v1 and v2 have a rather different structure (different text, more or fewer additional notes and labels, etc.), and the order in which errors are generated differs as well. A textual diff is therefore basically useless. Instead, the manual comparison entails: (a) going one by one over each error in the v1 .exp file and finding the error at the same line number in the v2 .exp file; (b) deciding whether the errors are compatible; (c) if an error is missing, reasoning about whether it is semantically represented by a different one (the same logical error can be reported at different locations in the file, an artifact of type inference); (d) checking all v2 errors to make sure none are obsolete.

  • v1 and v2 have a different phase structure and order of analysis. For example, many files in the v1 test suite do not fully compile, and don't need to, because they hit the tested blocking error before a secondary one is produced. But in the other compiler (whether v1 or v2), the secondary error may become the primary one, masking the tested error. For example, in v1 reference analysis errors mask ability analysis errors, but in v2 it's the other way around. This is why some test sources needed to be modified.

  • In the case of reference safety, comparison becomes even more difficult because the semantics of the two differ. For example, v1 enforces borrow rules on x in a statement like x; (refer to x and forget its value). Not so in v2: one must actually use the variable (as in *x;); see the sketch below. Those differences have been discussed in multiple team meetings and are by design.
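
To make this concrete, here is a minimal Move sketch of the pattern shown in the diff at the top of this issue. This is a hypothetical module written for illustration, not the actual call_mutual_borrows_invalid.move test; the definitions of S and id_mut are assumed. Both compilers are expected to reject it for conflicting mutable borrows, which is the point of the test.

    module 0x42::example {
        struct S { f: u64 }

        // identity function on a mutable reference, as assumed from the diff above
        fun id_mut(r: &mut u64): &mut u64 { r }

        fun conflicting_borrows(s1: &mut S) {
            let f = id_mut(&mut s1.f);  // `f` mutably borrows through `s1.f`
            id_mut(&mut s1.f); *f;      // v2 copy of the test: `*f` is an actual use, so the
                                        // second `&mut s1.f` conflicts with the live borrow `f`
            // The original v1 input ends the line with a bare `f;` instead of `*f;`.
            // v1 already treats `f;` as a use of the borrow and reports the conflict;
            // v2 does not, so the v2 copy dereferences `f` to preserve the intended error.
        }
    }

Under the semantics described above, v1 reports the conflict even with a bare `f;`, while v2 only reports it when `f` is actually used, as with `*f;` here.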

@wrwg
Contributor

wrwg commented Mar 28, 2024

we should add a new (modified) test that exhibits the functionality of interest, while leaving the original test input unchanged.

This does not work. Those tests, left unchanged, make no sense in v2 because they produce random results.

Rather, we need to maintain the identity of the v1 and v2 tests and manually compare whether they are compatible, as I described above. Making copies of tests will only diffuse the relation between the tests, making additional verification harder. Additional verification happens every now and then on a case-by-case basis, when it is needed to check changes in v2.

@brmataptos
Contributor Author

This does not work. Those tests, left unchanged, make no sense in v2 because they produce random results.

All the more reason to include them at the time the test is customized. Otherwise, before release, someone might ask "Do all the V1 tests pass with V2?" and you will have to answer "I don't know." If you include the original test, see that a different error is shown, and keep that test but add a modified test that fixes the shadowing, revealing the interesting error result, then you can later say "We don't generate identical errors, due to the different checking order of V2, but (a) we show some valid errors (shadowing the errors V1 shows first) on the original test, and (b) when those are fixed, we would show something equivalent to the original errors."

This is only true if we check all test outputs along the way to make sure that the errors are maintained by each PR.

This is easy enough to do as long as we do adequate reviews for each PR, but when I point out such problems you push back and say I'm wasting time. That is not saving us time; it means we have to do a complete re-check later, because we are not taking care along the way.

Those differences have been discussed in multiple team meetings and are by design.

The code (and tests) should stand alone. If it is by design, then the design should either be self-explanatory or be documented.

It is important that the tests are not bulk copied, because you then lose track of which tests have been audited and which have not.

We are losing track anyway because we make mistakes, and reviews that find those mistakes lead to hissy fits.

Look, there's no need to fight about this. As I said, if you want a looser PR process, that's fine; we just need to audit again before release, as described by this issue.

@wrwg
Contributor

wrwg commented Mar 28, 2024

This does not work. Those tests, left unchanged, make no sense in v2 because they produce random results.

All the more reason to include them at the time the test is customized. Otherwise, before release, someone might ask "Do all the V1 tests pass with V2?" and you will have to answer "I don't know."

The test is not "customized". If you have a pair of matching (v1, v2) tests, the v2 test is adapted until it is equivalent to the v1 test in the v2 setting. Please read what I wrote above about how the comparison and equivalence of tests work. Above I also wrote: "Together with the testdiff tool, which generates the list of all tests that have not been ported yet, this process ensures that you have (a) no unverified tests in the v2 test tree and (b) tracking, via the output of testdiff, of which tests have not been ported yet." So the process we follow exists precisely to ensure that we know, at any time, that all tests in the v2 tree have been audited. Bulk copying tests and forking them, as you are suggesting, is in fact the methodology that would completely diffuse our test process and quality.

This is only true if we check all test outputs along the way to make sure that the errors are maintained by each PR.

Not true. Once you do the initial porting of a test, subsequent changes don't need to be compared against v1 again, only against the previous v2 result.

You also seem to continue to believe that you could somehow compare the v1/v2 tests again with every change. Can you please go back in this thread and read what I wrote above about how test comparison works in practice? It is not a textual diff.

We are losing track anyway because we make mistakes, and reviews that find those mistakes lead to hissy fits.

Cannot confirm this. Any datapoints?

Look, there's no need to fight about this. As I said, if you want a looser PR process, that's fine; we just need to audit again before release, as described by this issue.

This is not where we disagree. I don't want a looser process at all. Rather, some of the things you are bringing up here, bulk copying and forking tests, are what would make the process loose and diffuse the test comparison.

@sausagee sausagee added the stale-exempt Prevents issues from being automatically marked and closed as stale label Mar 28, 2024