
Behavior regression CI #1642 (Open)

3Rafal wants to merge 3 commits into master

Conversation

@3Rafal (Collaborator) commented Jun 28, 2023

As in #1635

@3Rafal marked this pull request as ready for review June 28, 2023 16:57
@3Rafal changed the title from "WIP: error-regression" to "Behavior regression CI" Jun 28, 2023
@pitag-ha (Member) left a comment

To me, this looks very nice, @3Rafal! @voodoos, what do you think?

A couple of general thoughts (that we can think about after merging this):

Even though this CI is more resource-efficient than the General Behavior CI will be, it will still be resource-heavy once we make it run on a significant sample set:

  • Increase the number of samples per file (see one of the comments below)
  • Add a larger/different code-base than only Irmin to act on (see the matrix sketch after this list).
  • Add more Merlin queries to merl-an.
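
As a rough sketch of scaling to more code-bases, the workflow could run over a job matrix of pinned projects. Everything below other than the Irmin repo and the commit already pinned in this PR is a placeholder:

jobs:
  regression:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        project:
          # The Irmin entry reuses the commit pinned in this PR.
          - { repo: 'mirage/irmin', ref: '8da4d16e7cc8beddfc8a824feca325426bae08a9' }
          # Hypothetical second code-base; repo choice and pin are placeholders.
          - { repo: 'ocaml-ppx/ppxlib', ref: '<pinned-commit>' }
    steps:
      - run: |
          git clone https://github.com/${{ matrix.project.repo }}.git project
          git -C project checkout ${{ matrix.project.ref }}
      # ... build the project and run merl-an on it, as in this PR ...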

When to run the CI?

Given what I've just said, would you run this CI on every PR? Or would you enable it on concrete PRs via a GitHub flag or similar?
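
For the latter option, label-gating the job could look roughly like this (sketch only; the run-regression label name is just a placeholder):

on:
  pull_request:
    types: [opened, synchronize, reopened, labeled]
jobs:
  regression:
    # Run only when the PR carries the (hypothetical) run-regression label.
    if: contains(github.event.pull_request.labels.*.name, 'run-regression')
    runs-on: ubuntu-latest
    steps:
      # ... build Irmin and run merl-an, as in this PR ...
      - run: echo "regression run goes here"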

Which code base to act on?

About the point "Add a larger/different code-base than only Irmin to act on" above: So far, we've chosen the Irmin code-base to act on. The reasoning is that it seems an interesting bode-base because of deep functor depth and wide general module type usage. So that's particularly interesting wrt Merlin performance, but also wrt Merlin behavior. What other OCaml features are interesting to test Merlin behavior and which projects do we know that use them? E.g.

  • Classes? Which projects use classes? The first ones that come to my mind are eio and ppxlib.
  • Interesting type stuff? Which projects could we use to test that? My OCaml code-base knowledge is quite limited. I only know ppxlib's Version module with "interesting type stuff". Maybe we can use Sherlocode to find something?
  • Something else?

Which Merlin queries to test?

About the point "Add more Merlin queries to merl-an" above: The queries merl-an supports so far were chosen with performance in mind. For this CI tha ttests whether Merlin gives a successful response, I can also think of destruct and construct as interesting. What do you two think?

Review threads: .github/workflows/regression.yml (resolved), regression/dune (resolved), .github/workflows/regression.yml (outdated, resolved), regression/run.t (outdated, resolved)

On the Irmin build steps:
cd irmin
git checkout 8da4d16e7cc8beddfc8a824feca325426bae08a9
sudo apt install -y gnuplot-x11 libgmp-dev pkg-config libffi-dev
opam install . --deps-only --with-test --no-checksums -y
@pitag-ha (Member):

Suggested change:
- opam install . --deps-only --with-test --no-checksums -y
+ opam install . --deps-only --with-test -y

?

@3Rafal (Collaborator, Author):

One of the Irmin deps (duff) fails on checksum, so I had to make it this way.

Comment on regression/dune, lines +1 to +2:
(env (_
  (binaries build-irmin)))
@pitag-ha (Member):

Do you happen to know if the build-irmin result gets cached? It takes about 5 min, so it's not a huge deal if it doesn't. If it's easy to do (or already the case), it would still be nice though, among other things to reduce the cost of adding more source code than Irmin to test on.

@3Rafal (Collaborator, Author):

No, it doesn't. Big parts of our pipelines aren't cached. I don't know of an easy way to do this, but I agree it would be nice.

@pitag-ha (Member):

> I don't know of an easy way to do this, but I agree it would be nice.

If we had the script be called by the GH action rather than by Dune, could we make use of GH action caching? (It's a genuine question. I don't know myself either.)

@3Rafal (Collaborator, Author) commented Jun 29, 2023:

Running this script via GH actions / dune seems orthogonal to caching. I run it via dune, because checking out the Irmin code in the test directory makes dune test run the Irmin tests as well. This could probably be fixed, but I didn't want to spend too much time on it.

Caching build dependencies can be done via a GH action, but I'm not sure about an actual implementation. It would be a nice optimisation, but realistically we would shave a couple of minutes off a build job that currently takes ~40 minutes. I don't think it's the most valuable thing to spend time on at this moment.
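
For reference, a minimal sketch of what that could look like with actions/cache; the cached path and the key are assumptions (the commit hash is the one pinned in this PR):

- name: Cache the Irmin build
  uses: actions/cache@v3
  with:
    # Assumed location of the build-irmin output; adjust to the real path.
    path: irmin/_build
    # Keyed on the pinned Irmin commit, so the cache survives across runs.
    key: irmin-build-8da4d16e7cc8beddfc8a824feca325426bae08a9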


$ build-irmin >> /dev/null 2>&1

$ merl-an error-regression -s 1 --data=test-data 2> /dev/null
@pitag-ha (Member):

Suggested change:
- $ merl-an error-regression -s 1 --data=test-data 2> /dev/null
+ $ merl-an error-regression --data=test-data 2> /dev/null

One sample per file is very little. merl-an's default is 30. It'd be interesting to see how long the run takes with 30 samples per file.

(Side-note: this has reminded me that merl-an should distribute the samples better throughout the project by making the number of samples per file depend on the size of the file/AST. I've just opened an issue on merl-an to remember: pitag-ha/merl-an#28)

@3Rafal (Collaborator, Author):

Can we merge it like this, and later use higher numbers? I think it's easier to review a smaller file. You can test bigger runs on different branches.

Also, I think that more than 2.5k tests isn't small. I feel like we will quickly hit diminishing returns.

@3Rafal (Collaborator, Author) commented Jun 29, 2023:

Also, -s 30 means that I have to wait pretty long to get results, which slows down my dev process. It's easier to iterate on something smaller, especially in the early stages.

@pitag-ha (Member) commented Jun 29, 2023:

> Also, I think that more than 2.5k tests isn't small.

I agree that even with one sample per file, this isn't small. In fact, that's why I suggested making it a bit more realistic: that way, we could find out whether this approach scales or whether we should consider a different one.

> You can test bigger runs on different branches.

That sounds good. Have you done that already?

If we merge it like this without investigating more, we just need to be aware that we might end up removing this and implementing a different approach later. I have a tendency to first investigate things and then merge them, but I know that both you and Thibaut like doing it the other way around and that's ok as well.

@3Rafal (Collaborator, Author):

I have tested this on my machine:

Results:
machine: laptop with an 11th-gen Intel i7
sample size: 30 (default)

time:
real 203m8.524s
user 0m35.541s
sys  0m17.571s

The CI results and my dev-env results are comparable, because we use a single thread.

> If we merge it like this without investigating more, we just need to be aware that we might end up removing this and implementing a different approach later.

I'm surprised that we are considering a different approach at this point. I thought we had already discussed this CI and agreed on its implementation. I don't see why we would need to remove it instead of simply adding some tweaks.

> I have a tendency to first investigate things and then merge them, but I know that both you and Thibaut like doing it the other way around and that's ok as well.

It's hard for me to agree with this remark. I think I investigated a lot before creating this PR. What I'm suggesting is that this PR already provides value, so we can merge it and then improve it according to specific needs.

@pitag-ha (Member):

Oh, I'm sorry, I wasn't aware you had already tested it with sample-size 30. How big is the test file when you do that?

If you've already investigated enough on your side and are convinced that this approach is the right way to go, I'm happy to believe you and stop the thinking/discussion around the approach itself to focus only on details/concrete improvements. Is that what you were saying?

@3Rafal (Collaborator, Author):

Artifact size:
$ ls -l --block-size=M
total 11M
-rwxr-xr-x 1 opam opam 8M Jun 25 23:43 commands.json
-rwxr-xr-x 1 opam opam 1M Jun 25 23:43 logs.json
-rwxr-xr-x 1 opam opam 2M Jun 25 23:43 results.json

It was measured without cmds, so it could be bigger (say, 2x).

I think this approach looks promising. I can't guarantee that it will scale perfectly to all requirements. It's something we can ask @voodoos about, too.

@voodoos (Collaborator) left a comment

It would be better to put the test in a subfolder of the tests folder: tests/regression

pull_request:
push:
branches:
- master
@voodoos (Collaborator):

We should restrict the CI's runs to changes made to .ml / .mli files.
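
For example, a paths filter on the pull_request trigger (a sketch; the exact glob patterns would need checking):

on:
  pull_request:
    paths:
      - '**.ml'
      - '**.mli'
  push:
    branches:
      - master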

@voodoos (Collaborator) commented Jun 30, 2023

I think you can restrict the sample testing to type-enclosing, locate and occurrences.

Also, it looks like it would be painful to promote the results manually (and this happens quite a lot). 

Since the CI only checks errors, and >30 min is quite a lot of time, maybe we should have it run only after PRs are merged, and revert them afterwards if it fails?
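
A sketch of that variant: trigger only on pushes to master and flag failures for a manual revert. The issue-creation step is hypothetical:

on:
  push:
    branches:
      - master
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      # ... build Irmin and run the merl-an regression suite here ...
      - name: Flag a failed post-merge run (hypothetical step)
        if: failure()
        uses: actions/github-script@v6
        with:
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `Behavior regression after ${context.sha}`,
              body: 'The post-merge regression run failed; consider reverting.',
            });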
