HTML between <script> and </script> should be ignored #844
Comments
The feed itself does not contain any article contents. Do you mean that FreshRSS uses some of the Sanitize methods on contents it fetches on its own? Could you provide a minimal code sample that reproduces the issue?
Yeah, I think it's something like that. It fetches the HTML from the link in the feed and then tries to sanitize and extract the contents based on a CSS selector. And I guess it's using SimplePie for that step.
Maybe. It's been years since I've worked with PHP, but I'll see what I can come up with.
Test update for simplepie#844
I tried the fairly simple change of just adding some HTML inside of a script tag in an existing test, but that test continued passing. (Although the linter had a lot of complaints about things I hadn't touched.) So, apparently the issue is more complex :/
Hi, I initially filed this at FreshRSS/FreshRSS#5588 but it sounds like this might be a better place to report it.
This is similar to #697, but it seemed different enough to merit its own report.
The HTML parser appears to become confused and includes part of the JavaScript code as if it were part of the article text on a website with a <script> tag inline with an article's contents, e.g.
To Reproduce (In FreshRSS)
#TGN_site_Article_body
Expected behavior
Everything from <script> to </script> should be excluded from the scraped content.
Screenshots
Environment information:
SimplePie 1.5.8, running in FreshRSS 1.21.0, on PHP 8.2.8, on the linuxserver.io FreshRSS Docker Image
Additional context
I was able to work around the problem by setting the selector to #TGN_site_Article_body>p, #TGN_site_Article_body iframe, but figured you all would still want to know about the issue.
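To make the expected behavior concrete, here is a rough sketch in Python (standard library only). It is not SimplePie or FreshRSS code, and it says nothing about how either one parses HTML internally. The sample markup, the ArticleTextExtractor class and the extract_article_text helper are all hypothetical; only the TGN_site_Article_body id comes from this report. The idea is simply that text is collected from the selected element while anything between <script> and </script> is dropped, even when the script body contains strings that look like markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical page fragment: an article body with an inline script whose
# JavaScript contains a string that looks like HTML.
SAMPLE = """
<div id="TGN_site_Article_body">
  <p>First paragraph.</p>
  <script>document.write("<p>injected markup</p>");</script>
  <p>Second paragraph.</p>
</div>
<p>Footer outside the article.</p>
"""


class ArticleTextExtractor(HTMLParser):
    """Collect text inside the element with a given id, skipping scripts."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0        # nesting depth inside the target (0 = outside)
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Script bodies arrive here as raw text; ignore them entirely.
        if self.depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_article_text(html, target_id):
    parser = ArticleTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return parser.chunks


print(extract_article_text(SAMPLE, "TGN_site_Article_body"))
# ['First paragraph.', 'Second paragraph.']
```

The workaround selector above gets a similar result by only picking the p and iframe children, but that also drops any other legitimate child elements, which is why excluding script content in the parser itself seems preferable.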