gpc-optmeowt: Improve performance of Do Not Sell link detection

There are a few instances in the validation set where we go wrong on identifying Do Not Sell links. @OliverWang13, can you fix those? (If you can, please chime in here, @stanleymarkman and/or @kalicki1, to support @OliverWang13.)

Most upvoted comments

@kalicki1 If you want, feel free to shoot me a message on teams and we can schedule a Zoom or meet at the lab.

@SebastianZimmeck is correct that imposing a size limit had the best results.

Basically, you just close the filter with filter.disconnect() if the stream exceeds an arbitrary length (we chose 200k bytes).

Implementation

  // `details` comes from the surrounding webRequest listener, and `request` is
  // the Request object created earlier in that listener (not shown in this
  // snippet). MAX_BYTE_LEN and the decoder were also defined outside the
  // snippet; they are included here so the excerpt is self-contained.
  const MAX_BYTE_LEN = 200000 // arbitrary size cutoff (~200k bytes)
  const decoder = new TextDecoder("utf-8")

  const filter = browser.webRequest.filterResponseData(details.requestId)

  let responseByteLength = 0
  let abort = false
  const httpResponseStrArr = []

  filter.ondata = (event) => {
    if (!abort) {
      // pass the chunk through unmodified so the page still loads normally
      filter.write(event.data)
      responseByteLength += event.data.byteLength
      if (responseByteLength > MAX_BYTE_LEN) {
        // the response exceeds the cutoff; stop filtering and skip analysis
        filter.disconnect()
        abort = true
      } else {
        // accumulate the decoded chunk for later analysis
        const str = decoder.decode(event.data, { stream: true })
        httpResponseStrArr.push(str)
      }
    }
  }

  filter.onerror = (event) => (request.error = filter.error)

  // When the filter stops, close it, add the data from httpResponseStrArr to
  // the Request created earlier, and send it to resolveBuffer (below).
  filter.onstop = async (event) => {
    if (!abort) {
      filter.close()
      request.responseData = httpResponseStrArr.join("") // join, not toString(), to avoid comma-separated chunks
      resolveBuffer(request.id, request.responseData) // the original snippet passed an undefined `data` here
    }
  }

@kalicki1 and I worked on this together and were able to implement a link finder for dynamically loaded links. This new method works on 3 of the 4 sites we were unable to catch, with gap.com being the exception. As of now, the old DNS link finder is not running, but this is likely due to messy code from when I brought the branch closer to main.
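The actual implementation lives in contentScripts.js on main; purely as an illustration of the general idea (not the merged code), a finder for dynamically loaded links could hang off a MutationObserver that re-checks nodes as they are inserted into the DOM:

  // Re-run the Do Not Sell phrasing check whenever new nodes are added, so
  // links injected after the initial page load are still caught.
  const doNotSellPhrasing = /(Do.?Not|Don.?t).?Sell.?(My)?.?(Personal)?.?(Information|Info)/gmi

  const observer = new MutationObserver((mutations) => {
    for (const mutation of mutations) {
      for (const node of mutation.addedNodes) {
        if (node.nodeType !== Node.ELEMENT_NODE) continue
        doNotSellPhrasing.lastIndex = 0 // reset; the regex uses the global flag
        if (doNotSellPhrasing.test(node.innerHTML)) {
          console.log("Found it")
          observer.disconnect()
          return
        }
      }
    }
  })

  observer.observe(document.documentElement, { childList: true, subtree: true })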

I have removed the old do not sell link finder in main and will be closing this issue.

@SebastianZimmeck, I took a look at our regex for capturing “Do Not Sell My Info” links from your second point:

We probably want to have “Do Not Sell My Info”

The new expression I have pushed to main is:

const doNotSellPhrasing = /(Do.?Not|Don.?t).?Sell.?(My)?.?(Personal)?.?(Information|Info)/gmi

This will hopefully solve the issue of our false positive on weebly.com, since the new regex requires all of the following, irrespective of the characters in between:

  1. do not OR don’t
  2. sell
  3. information OR info

This would catch “Do Not Sell My Info” or “Don’t sell Info”, but it would reject “Do not sell my” since it doesn’t also have “info” or “information” attached to it. I think we can run one final test on our validation set with this regex to see if we accidentally introduce new false negatives; if everything goes okay, we keep this as our final expression.
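As a quick sanity check (a standalone sketch, not code from the extension), the cases described above can be verified in a browser or Node console:

  const doNotSellPhrasing = /(Do.?Not|Don.?t).?Sell.?(My)?.?(Personal)?.?(Information|Info)/gmi

  const samples = [
    "Do Not Sell My Info",                    // should match
    "Don't sell Info",                        // should match
    "CA Do Not Sell My Personal Information", // should match
    "Do not sell my",                         // should NOT match: no "info"/"information"
  ]

  for (const s of samples) {
    doNotSellPhrasing.lastIndex = 0 // reset; the regex uses the global flag
    console.log(s, "=>", doNotSellPhrasing.test(s))
  }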

I managed to get the Gap site logging this week without requiring user input on the site 🎉

The webRequest filtering API we use to pull the data was not implemented entirely correctly, so it did not catch everything we needed. I refactored this code so that the filter itself does not close before everything is analyzed, and this solved the issue. I now have a filter.onstop event that listens for the page to stop loading before processing any of the responses.

@SebastianZimmeck I can speak to your first question:

(1) We think that the Gap site still won’t show in either detection method because, we believe, the DNS link isn’t sent until the bottom of the page is requested. This means that, unless we find some API to help us (my thinking is that this is unlikely), we would still have to automate some sort of scrolling to get it to show; a rough sketch of that follows after the next point.

About your last two points: (2) yes, we should test on the entire validation set after we get it fully running, and (3) we have not yet tested whether the combination of the old and new methods is better than each on its own, but we should.
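For reference, here is a minimal sketch of what such automated scrolling from a content script could look like; the step size, delay, and step cap are placeholder values, not something we have tested:

  // Scroll toward the bottom of the page in steps so lazily loaded footers
  // (like the one we suspect gap.com uses) get a chance to render.
  async function scrollToBottom(stepPx = 1000, delayMs = 250, maxSteps = 100) {
    for (let i = 0; i < maxSteps; i++) {
      window.scrollBy(0, stepPx)
      // give the page time to request and render newly loaded content
      await new Promise((resolve) => setTimeout(resolve, delayMs))
      // stop once we have reached (or passed) the bottom of the document
      if (window.scrollY + window.innerHeight >= document.body.scrollHeight) break
    }
  }

  // usage: run scrollToBottom() before the link finder so the footer has rendered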

With the new regex, I will work my way through the link spreadsheet once again. The header code that was put in is not currently working; it is mostly an example that @kalicki1 and I found. I will continue to experiment with it and see if I can get it working.

Yes, the regex can be simplified to var phrasing = /Do.Not.Sell.My|Don.t.Sell.My/gmi. As for the two possibilities, the first seems like it could work in a batch-analysis sense, but it would be cumbersome for the casual user who is running analysis mode. With that in mind, I am leaning towards the second possibility.

On my computer, I run git checkout issue-223, then npm run start from the optmeowt folder, and refresh the extension. From the chrome or firefox extension page, you may then have to navigate to the chrome or firefox folder, respectively. If this is not working for whatever reason, all added code is in contentScripts.js and could likely be added manually without much difficulty.

The regex I am using should only be compared with the innerHTML in the tag. I am getting the tags to look at using getElementsByTagName(), which does not seem to have trouble on other sites when an additional class is involved. I tried using setTimeout, which worked on gap.com, but only when I took a few seconds to quickly scroll down and load the link, which I would not call a suitable workaround. Other than that, setTimeout did not seem to be effective.
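For context, here is a stripped-down sketch of that kind of tag scan; it is not the exact code in contentScripts.js, and the tag name argument and onload hook are illustrative:

  const doNotSellPhrasing = /(Do.?Not|Don.?t).?Sell.?(My)?.?(Personal)?.?(Information|Info)/gmi

  // Look through all elements of a given tag name (e.g. "a", "button",
  // "footer", or "*" for every tag) and report the first one whose
  // innerHTML matches the phrasing regex.
  function findDoNotSellLink(tagName) {
    const elements = document.getElementsByTagName(tagName)
    for (const el of elements) {
      doNotSellPhrasing.lastIndex = 0 // reset; the regex uses the global flag
      if (doNotSellPhrasing.test(el.innerHTML)) {
        console.log("Found it")
        return el
      }
    }
    return null
  }

  // runs once the page fires its load event (see the window.onload caveat below)
  window.onload = () => findDoNotSellLink("a")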

@SebastianZimmeck I was more thinking that you could see if you find a Do Not Sell link manually, to confirm that it has one (as it may be a link that can only be seen from California). The identification code finds the link, but I can not confirm that the link exists. However, if you would like the code it is currently in main (in contentScripts.js, more specifically), and it prints the message “Found it” to console if it finds the link.

I have tested our code on the new sites I added to the validation set. Below are notes about certain sites. An issue I think we may be having is the site triggering the window.onload function before the page is actually fully loaded, causing us to miss searching through some tags. Here is the link to the validation set: https://docs.google.com/spreadsheets/d/19Wi2PaPsEOfiSdeaOEdSNg5QCzJbRmc7f5rHhXYzO6E/edit?usp=sharing

https://www.gap.com/: when searching for <a>, <button>, or <footer> tags, the link does not appear, even though it shows as a button tag when inspected. When searching all tags (<*>), it is found.

https://www.rakuten.com/: Is not found even when searching for <*>. The regex should capture “Don’t sell my info”, as confirmed on regex101.com.

https://www.businessinsider.com/: Is not found even when searching for <*>. The regex should capture “CA Do Not Sell My Info”, as confirmed on regex101.com. This page is made for lengthy scrolling, so maybe the footer does not actually “exist” on the page until a user scrolls to it.

https://www.theatlantic.com/: Very similarly to Business Insider, the site scrolls for a long time, which may result in the same issue.

https://slate.com/: @SebastianZimmeck could you check this site from California? The link finder finds a link, but I am unable to.

@kalicki1 If you wouldn’t mind quickly looking at the sites above and seeing if there is anything obvious I may have missed, I would appreciate it.

Overall, 78/82 sites working seems pretty functional, and I am unsure whether the link finder can be improved to reach 82/82. I would be OK with leaving it as is, assuming Kuba is unable to find something I may have missed.