
Behavior regression CI #1642 (Open)

3Rafal wants to merge 3 commits into master

Conversation

@3Rafal (Collaborator) commented Jun 28, 2023

As in #1635

@3Rafal marked this pull request as ready for review June 28, 2023 16:57
@3Rafal changed the title from "WIP: error-regression" to "Behavior regression CI" Jun 28, 2023
@pitag-ha (Member) left a comment

To me, this looks very nice, @3Rafal! @voodoos, what do you think?

A couple of general thoughts (that we can think about after merging this):

Even though this CI is more resource-efficient than the General Behavior CI will be, it will still be resource-heavy once we make it run on a significant sample set:

  • Increase the number of samples per file (see one of the comments below)
  • Add a larger/different code-base than only Irmin to act on (see the matrix sketch after this list).
  • Add more Merlin queries to merl-an.
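
As a rough sketch of scaling to more code-bases, the workflow could run over a job matrix of pinned projects. Everything below other than the Irmin repo and the commit already pinned in this PR is a placeholder:

jobs:
  regression:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        project:
          # The Irmin entry reuses the commit pinned in this PR.
          - { repo: 'mirage/irmin', ref: '8da4d16e7cc8beddfc8a824feca325426bae08a9' }
          # Hypothetical second code-base; repo choice and pin are placeholders.
          - { repo: 'ocaml-ppx/ppxlib', ref: '<pinned-commit>' }
    steps:
      - run: |
          git clone https://github.com/${{ matrix.project.repo }}.git project
          git -C project checkout ${{ matrix.project.ref }}
      # ... build the project and run merl-an on it, as in this PR ...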

When to run the CI?

Given what I've just said, would you run this CI on every PR? Or would you enable it on concrete PRs via a GitHub flag or similar?
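
For the latter option, label-gating the job could look roughly like this (sketch only; the run-regression label name is just a placeholder):

on:
  pull_request:
    types: [opened, synchronize, reopened, labeled]
jobs:
  regression:
    # Run only when the PR carries the (hypothetical) run-regression label.
    if: contains(github.event.pull_request.labels.*.name, 'run-regression')
    runs-on: ubuntu-latest
    steps:
      # ... build Irmin and run merl-an, as in this PR ...
      - run: echo "regression run goes here"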

Which code base to act on?

About the point "Add a larger/different code-base than only Irmin to act on" above: So far, we've chosen the Irmin code-base to act on. The reasoning is that it seems an interesting bode-base because of deep functor depth and wide general module type usage. So that's particularly interesting wrt Merlin performance, but also wrt Merlin behavior. What other OCaml features are interesting to test Merlin behavior and which projects do we know that use them? E.g.

  • Classes? Which projects use classes? The first ones that come to my mind are eio and ppxlib.
  • Interesting type stuff? Which projects could we use to test that? My OCaml code-base knowledge is quite limited. I only know ppxlib's Version module with "interesting type stuff". Maybe we can use Sherlocode to find something?
  • Something else?

Which Merlin queries to test?

About the point "Add more Merlin queries to merl-an" above: The queries merl-an supports so far were chosen with performance in mind. For this CI tha ttests whether Merlin gives a successful response, I can also think of destruct and construct as interesting. What do you two think?

Review threads: .github/workflows/regression.yml (resolved), regression/dune (resolved), .github/workflows/regression.yml (outdated, resolved), regression/run.t (outdated, resolved)

On the Irmin build steps:
cd irmin
git checkout 8da4d16e7cc8beddfc8a824feca325426bae08a9
sudo apt install -y gnuplot-x11 libgmp-dev pkg-config libffi-dev
opam install . --deps-only --with-test --no-checksums -y
@pitag-ha (Member):

Suggested change:
- opam install . --deps-only --with-test --no-checksums -y
+ opam install . --deps-only --with-test -y

?

@3Rafal (Collaborator, Author):

One of the Irmin deps (duff) fails on checksum, so I had to make it this way.

Comment on regression/dune, lines +1 to +2:
(env (_
  (binaries build-irmin)))
@pitag-ha (Member):

Do you happen to know if the build-irmin result gets cached? It takes about 5 min, so it's not a huge deal if it doesn't. If it's easy to do (or already the case), it would still be nice though, among other things to reduce the cost of adding more source code than Irmin to test on.

@3Rafal (Collaborator, Author):

No, it doesn't. Big parts of our pipelines aren't cached. I don't know of an easy way to do this, but I agree it would be nice.

@pitag-ha (Member):

> I don't know of an easy way to do this, but I agree it would be nice.

If we had the script be called by the GH action rather than by Dune, could we make use of GH action caching? (It's a genuine question. I don't know myself either.)

@3Rafal (Collaborator, Author) commented Jun 29, 2023:

Running this script via GH actions / dune seems orthogonal to caching. I run it via dune, because checking out the Irmin code in the test directory makes dune test run the Irmin tests as well. This could probably be fixed, but I didn't want to spend too much time on it.

Caching build dependencies can be done via a GH action, but I'm not sure about an actual implementation. It would be a nice optimisation, but realistically we would shave a couple of minutes off a build job that currently takes ~40 minutes. I don't think it's the most valuable thing to spend time on at this moment.
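
For reference, a minimal sketch of what that could look like with actions/cache; the cached path and the key are assumptions (the commit hash is the one pinned in this PR):

- name: Cache the Irmin build
  uses: actions/cache@v3
  with:
    # Assumed location of the build-irmin output; adjust to the real path.
    path: irmin/_build
    # Keyed on the pinned Irmin commit, so the cache survives across runs.
    key: irmin-build-8da4d16e7cc8beddfc8a824feca325426bae08a9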


$ build-irmin >> /dev/null 2>&1

$ merl-an error-regression -s 1 --data=test-data 2> /dev/null
@pitag-ha (Member):

Suggested change:
- $ merl-an error-regression -s 1 --data=test-data 2> /dev/null
+ $ merl-an error-regression --data=test-data 2> /dev/null

One sample per file is very little. merl-an's default is 30. It'd be interesting to see how long the run takes with 30 samples per file.

(Side-note: this has reminded me that merl-an should distribute the samples better throughout the project by making the number of samples per file depend on the size of the file/AST. I've just opened an issue on merl-an to remember: pitag-ha/merl-an#28)

@3Rafal (Collaborator, Author):

Can we merge it like this, and later use higher numbers? I think it's easier to review a smaller file. You can test bigger runs on different branches.

Also, I think that more than 2.5k tests isn't small. I feel like we will quickly hit diminishing returns.

@3Rafal (Collaborator, Author) commented Jun 29, 2023:

Also, -s 30 means that I have to wait pretty long to get results, which slows down my dev process. It's easier to iterate on something smaller, especially in the early stages.

@pitag-ha (Member) commented Jun 29, 2023:

> Also, I think that more than 2.5k tests isn't small.

I agree that even with one sample per file, this isn't small. In fact, that's why I suggested making it a bit more realistic: that way, we could find out whether this approach scales or whether we should consider a different one.

> You can test bigger runs on different branches.

That sounds good. Have you done that already?

If we merge it like this without investigating more, we just need to be aware that we might end up removing this and implementing a different approach later. I have a tendency to first investigate things and then merge them, but I know that both you and Thibaut like doing it the other way around and that's ok as well.

@3Rafal (Collaborator, Author):

I have tested this on my machine:

Results:
machine: laptop with an 11th-gen Intel i7
sample size: 30 (default)

time:
real 203m8.524s
user 0m35.541s
sys  0m17.571s

The CI results and my dev-env results are comparable, because we use a single thread.

> If we merge it like this without investigating more, we just need to be aware that we might end up removing this and implementing a different approach later.

I'm surprised that we are considering a different approach at this point. I thought we had already discussed this CI and agreed on its implementation. I don't see why we would need to remove it instead of simply adding some tweaks.

> I have a tendency to first investigate things and then merge them, but I know that both you and Thibaut like doing it the other way around and that's ok as well.

It's hard for me to agree with this remark. I think I investigated a lot before creating this PR. What I'm suggesting is that this PR already provides value, so we can merge it and then improve it according to specific needs.

@pitag-ha (Member):

Oh, I'm sorry, I wasn't aware you had already tested it with sample-size 30. How big is the test file when you do that?

If you've already investigated enough on your side and are convinced that this approach is the right way to go, I'm happy to believe you and stop the thinking/discussion around the approach itself to focus only on details/concrete improvements. Is that what you were saying?

@3Rafal (Collaborator, Author):

Artifact size:
$ ls -l --block-size=M
total 11M
-rwxr-xr-x 1 opam opam 8M Jun 25 23:43 commands.json
-rwxr-xr-x 1 opam opam 1M Jun 25 23:43 logs.json
-rwxr-xr-x 1 opam opam 2M Jun 25 23:43 results.json

It was measured without cmds, so it could be bigger (say, 2x).

I think this approach looks promising. I can't guarantee that it will scale perfectly to all requirements. It's something we can ask @voodoos about, too.

@voodoos (Collaborator) left a comment

It would be better to put the test in a subfolder of the tests folder: tests/regression

pull_request:
push:
branches:
- master
@voodoos (Collaborator):

We should restrict the CI's runs to changes made to .ml / .mli files.
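
For example, a paths filter on the pull_request trigger (a sketch; the exact glob patterns would need checking):

on:
  pull_request:
    paths:
      - '**.ml'
      - '**.mli'
  push:
    branches:
      - master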

@voodoos (Collaborator) commented Jun 30, 2023

I think you can restrict the sample testing to type-enclosing, locate and occurrences.

Also, it looks like it would be painful to promote the results manually (and this happens quite a lot). 

Since the CI only checks errors, and >30 min is quite a lot of time, maybe we should have it run only after PRs are merged, and revert them afterwards if it fails?
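
A sketch of that variant: trigger only on pushes to master and flag failures for a manual revert. The issue-creation step is hypothetical:

on:
  push:
    branches:
      - master
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      # ... build Irmin and run the merl-an regression suite here ...
      - name: Flag a failed post-merge run (hypothetical step)
        if: failure()
        uses: actions/github-script@v6
        with:
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `Behavior regression after ${context.sha}`,
              body: 'The post-merge regression run failed; consider reverting.',
            });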
