HTML between <script> and </script> should be ignored #844

Open
nfriedly opened this issue Aug 15, 2023 · 3 comments

Comments

@nfriedly

Hi, I initially filed this at FreshRSS/FreshRSS#5588 but it sounds like this might be a better place to report it.

This is similar to #697, but it seemed different enough to merit its own report.


When a website has a <script> tag inline with an article's contents, the HTML parser appears to become confused and includes part of the JavaScript code as if it were part of the article text, e.g.

<div id="TGN_site_Article_body" class="TGN_site_Article_body">
  <!-- ... -->
  <script type="text/javascript">
    // ...
    $('.InlineImageGalleryText').html("<div class='IIGcountIcon'><i class='galleryicon icon-gallery_2'></i></div><div class='IIGcount'>" + (mainIIG + 1) + "&nbsp;&nbsp;of&nbsp;&nbsp;" + listOfObjects.length + "</div><span class='IIGheadline'>" + capt01 + " <span class='IIGheadline2'>" + capt02 + "</span></span>");
    // ...
  </script>
  <p><!-- article text here -->

To Reproduce (In FreshRSS)

  1. Add the feed https://www.gatesnotes.com/RSS
  2. Set 'Article CSS selector on original website' to #TGN_site_Article_body
  3. Click the "eyeball button" to show a preview of the scraped content
  4. Observe that it begins with a bunch of JavaScript code

Expected behavior
Everything from <script> to </script> should be excluded from the scraped content
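
To make the expected behavior concrete, here is a minimal sketch in Python (not PHP, and not SimplePie's or FreshRSS's actual code) that collects visible text while ignoring everything between <script> and </script>. The class name `TextWithoutScripts`, the helper `visible_text`, and the shortened sample markup are all hypothetical, made up for illustration. It relies on Python's standard `html.parser`, which treats the body of a script element as raw text, so the `<div>` inside the JavaScript string is never parsed as a real tag.

```python
from html.parser import HTMLParser


class TextWithoutScripts(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside any skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the text content of `html` with script and style bodies removed."""
    parser = TextWithoutScripts()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Shortened stand-in for the markup in this report (not the real page).
sample = """
<div id="TGN_site_Article_body">
  <script type="text/javascript">
    $('.InlineImageGalleryText').html("<div class='IIGcount'>" + count + "</div>");
  </script>
  <p>Article text here</p>
</div>
"""

print(visible_text(sample))
```

With this sample, only the paragraph text should survive; none of the JavaScript, including the HTML inside its string literal, should appear in the output.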

Screenshots

Screenshot 2023-08-14 at 2 53 29 PM Screenshot 2023-08-14 at 2 53 44 PM

Environment information:

SimplePie 1.5.8, running in FreshRSS 1.21.0, on PHP 8.2.8, on the linuxserver.io FreshRSS Docker Image

Additional context

I was able to work around the problem by setting the selector to #TGN_site_Article_body>p, #TGN_site_Article_body iframe, but figured you all would still want to know about the issue.

@jtojnar
Contributor

jtojnar commented Aug 15, 2023

The feed itself does not contain any article contents. Do you mean that FreshRSS uses some Sanitize methods on contents it fetches on its own? Could you provide minimal code that reproduces the issue?

@nfriedly
Author

> The feed itself does not contain any article contents. Do you mean that FreshRSS uses some Sanitize methods on contents it fetches on its own?

Yeah, I think it's something like that. It fetches the HTML from the link in the feed and then tries to sanitize and extract the contents based on a CSS selector. And I guess it's using SimplePie for that step.

> Could you provide minimal code that reproduces the issue?

Maybe. It's been years since I've worked with PHP, but I'll see what I can come up with.

nfriedly added a commit to nfriedly/simplepie that referenced this issue Aug 16, 2023
@nfriedly
Author

I tried the fairly simple change of just adding some HTML inside a script tag in an existing test, but that test continued passing. (Although the linter had a lot of complaints about things I hadn't touched.) So apparently the issue is more complex :/
