gpc-optmeowt: Improve performance of Do Not Sell link detection
There are a few instances in the validation set where identifying Do Not Sell links goes wrong. @OliverWang13, can you fix those? (If you can, please chime in here, @stanleymarkman and/or @kalicki1, supporting @OliverWang13.)
- https://www.gap.com/: The DNS link is in a `<button>` but somehow not found. @OliverWang13: The code can only find it as a `<script>`, `<body>`, or `<html>`. @SebastianZimmeck: Any idea why button detection does not work?
- https://www.rakuten.com/: @OliverWang13: Not sure why this doesn’t work. @SebastianZimmeck: Was the page fully loaded when you analyzed it? Is it maybe dynamically loaded and only in the DOM but not in the page source code itself (assuming that is what you are relying on; when you right-click and inspect with the developer tools, do you see what you expect? Or try checking the source code of the site.)?
- https://www.adobe.com/: @OliverWang13: Sometimes it works, sometimes it doesn’t. @SebastianZimmeck: Any thoughts on why? Can you look into it?
- https://www.theguardian.com/us: @OliverWang13: Has a pop-up when you enter the site with a link to click there, which we do not detect. Then, past the pop-up, there is a link saying “California resident - Do Not Sell”. The current regex does not identify that link. @SebastianZimmeck: Can we modify the regex to capture it?
- https://www.businessinsider.com/: @OliverWang13: Not sure why this doesn’t work. @SebastianZimmeck: Any hints?
- https://www.condenast.com/: @OliverWang13: Seems to return a false positive? Not sure why. Perhaps the link is only shown in California but still exists on the site. @SebastianZimmeck: Yes, I can confirm. There is a link when accessed with a California IP address. It is probably suppressed from being displayed to non-California IP addresses but still in the code of the page. Those instances we should actually count, i.e., they are true positives. This brings up the larger point that I should probably load all pages with a California IP once you have finalized tuning to confirm that I get the same results.
- https://www.theatlantic.com/: @OliverWang13: Not sure why this doesn’t work. @SebastianZimmeck: Any hints?
- https://www.newyorker.com/, https://nypost.com/: @SebastianZimmeck: Can confirm that this works in California.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 31 (26 by maintainers)
Commits related to this issue
- Changed the regex (Issue #223) — committed to privacy-tech-lab/gpc-optmeowt by OliverWang13 3 years ago
- Modified regex and added nonfunctional code to search headers for injected DNS links (Issue #223) — committed to privacy-tech-lab/gpc-optmeowt by OliverWang13 3 years ago
- Brought issue further up to date with main, implemented dns link finder that identifies by searching header streams (Issue #223) — committed to privacy-tech-lab/gpc-optmeowt by OliverWang13 3 years ago
- Added webRequstFiltering to onBeforeSendHeaders listener (issue #223) — committed to privacy-tech-lab/gpc-optmeowt by kalicki1 3 years ago
- Fixed issue w/ recognizing DNS link on `gap.com`, started analysis data parser in logData, started adding CSV saver (issue #223) — committed to privacy-tech-lab/gpc-optmeowt by kalicki1 3 years ago
- Updated DNS link finder regex (issue #223) — committed to privacy-tech-lab/gpc-optmeowt by kalicki1 3 years ago
- Removed old DNS link finder (Issue #223) — committed to privacy-tech-lab/gpc-optmeowt by OliverWang13 3 years ago
- Updated script injection from `contentScript.js` (issue #223) — committed to privacy-tech-lab/gpc-optmeowt by kalicki1 3 years ago
@kalicki1 If you want, feel free to shoot me a message on Teams and we can schedule a Zoom or meet at the lab.
@SebastianZimmeck is correct that imposing a size limit had the best results.
Basically, you just close the filter with `filter.disconnect()` if the stream exceeds an arbitrary length (we chose 200k bytes).

**Implementation**
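A rough sketch of that size cap, using Firefox’s `browser.webRequest.filterResponseData` API (this is not the exact extension code; the function names and callback wiring are illustrative, and the byte check is factored into a pure helper so it can be exercised outside a browser):

```javascript
// Pure helper: has the stream exceeded the cap once this chunk is counted?
// The 200000-byte default matches the limit mentioned above.
function exceedsSizeLimit(bytesSoFar, chunkLength, limit = 200000) {
  return bytesSoFar + chunkLength > limit;
}

// Browser wiring (Firefox-only API; a sketch, not the actual implementation).
function attachSizeLimitedFilter(requestId, onChunk) {
  const filter = browser.webRequest.filterResponseData(requestId);
  let bytesSoFar = 0;
  filter.ondata = (event) => {
    filter.write(event.data); // always pass the response through unmodified
    if (exceedsSizeLimit(bytesSoFar, event.data.byteLength)) {
      filter.disconnect();    // stop filtering; the browser takes over the stream
      return;
    }
    bytesSoFar += event.data.byteLength;
    onChunk(event.data);
  };
  filter.onstop = () => filter.disconnect();
}
```

Disconnecting early like this keeps huge responses (video, large bundles) from being buffered just to search for a footer link.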
@kalicki1 and I worked on this together and were able to implement a link finder for dynamically loaded links. This new method works on 3/4 of the sites we were previously unable to catch, with gap.com being the exception. As of now, the old DNS link finder is not running, but this is likely due to messy code from when I brought the branch closer to main.
I have removed the old `do not sell` link finder in main and will be closing this issue.

@SebastianZimmeck, I took a look at our regex for capturing “Do Not Sell My Info” links from your second point.
The new expression I have pushed to main is:

`const doNotSellPhrasing = /(Do.?Not|Don.?t).?Sell.?(My)?.?(Personal)?.?(Information|Info)/gmi`

This will hopefully solve the issue of our false positive on weebly.com, since the new regex requires “Info” or “Information” after the “Sell” part, irrespective of the characters in between. This would catch “Do Not Sell My Info” or “Don’t sell Info”, but it would reject “Do not sell my” since it doesn’t have “info” or “information” attached to it. I think we can run one final test on our validation set with this regex to see if we accidentally introduce new false negatives and, if everything goes okay, keep this as our final expression.
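As a quick sanity check, the pattern’s behavior on those examples can be verified in plain JavaScript (the wrapper below is illustrative; it uses `String.prototype.match`, which resets the `g` flag’s `lastIndex`, so repeated calls stay stateless):

```javascript
const doNotSellPhrasing = /(Do.?Not|Don.?t).?Sell.?(My)?.?(Personal)?.?(Information|Info)/gmi;

// True when the text contains a Do Not Sell phrase per the regex above.
function hasDoNotSellPhrase(text) {
  return text.match(doNotSellPhrasing) !== null;
}
```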
I managed to get the Gap site logging this week without requiring user input on the site 🎉
The `webRequestFiltering` API we use to pull the data was not implemented fully correctly, so it did not catch everything we needed. I refactored this code so that the filter itself does not close before everything is analyzed, which solved the issue. I now have a `filter.onstop` event that listens for the page to stop loading before processing any of the responses.

@SebastianZimmeck, I can speak to your first question:
(1) We think that the Gap site still won’t show in either detection method because, we believe, the DNS link isn’t sent until the bottom of the page is requested. This means that, unless we find some API to help us (my thinking is this is unlikely), we would still have to automate some sort of scrolling to get it to show.
About your last two points: (2) Yes, we should test on the entire validation set after we get it fully running, and (3) we have not yet, but should, test whether the combination of old and new is better than each on its own.
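The `filter.onstop` approach described earlier can be sketched roughly as follows (again Firefox’s `browser.webRequest.filterResponseData`; the function and parameter names are illustrative, and the chunk-joining helper is factored out so it can be tested outside a browser):

```javascript
// Pure helper: concatenate response chunks into one string once the
// stream has finished.
function joinChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const merged = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) { merged.set(c, offset); offset += c.length; }
  return new TextDecoder("utf-8").decode(merged);
}

// Browser wiring (sketch only): buffer every chunk and defer analysis until
// onstop fires, so the filter is not torn down before the whole response
// has arrived.
function analyzeWhenLoaded(requestId, analyze) {
  const filter = browser.webRequest.filterResponseData(requestId);
  const chunks = [];
  filter.ondata = (event) => {
    filter.write(event.data);                 // pass bytes through untouched
    chunks.push(new Uint8Array(event.data));  // keep a copy for analysis
  };
  filter.onstop = () => {
    filter.disconnect();                      // hand the stream back to the browser
    analyze(joinChunks(chunks));              // now it is safe to process everything
  };
}
```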
With the new regex, I will work my way through the link spreadsheet once again. The header code that was put in is not currently working; it is mostly an example that @kalicki1 and I found. I will continue to experiment with it and see if I can get it working.
Yes, the regex can be simplified to:

`var phrasing = /Do.Not.Sell.My|Don.t.Sell.My/gmi`

As for the two possibilities, the first seems like it could work in a batch-analysis sense, but it would be cumbersome for the casual user who is using analysis mode. With that in mind, I am leaning towards the second possibility.

On my computer, I run `git checkout issue-223` and then `npm run start` from the optmeowt folder and refresh the extension. From the Chrome or Firefox extension page, you may then have to navigate to the chrome or firefox folder, respectively. If this is not working for whatever reason, all added code is in contentScripts.js and could likely be manually added without much difficulty.

The regex I am using should only be compared with the `innerHTML` in the tag. I am getting the tags to look at using `getElementsByTagName()`, which does not seem to have trouble on other sites when an additional class is involved. I tried using `setTimeout`, which worked on gap.com but only when I took a few seconds to quickly scroll down and load the link, which I would not call a suitable workaround. Other than that, `setTimeout` did not seem to be effective.

@SebastianZimmeck, I was more thinking that you could see if you find a Do Not Sell link manually, to confirm that the site has one (as it may be a link that can only be seen from California). The identification code finds the link, but I cannot confirm that the link exists. However, if you would like the code, it is currently in main (in contentScripts.js, more specifically), and it prints the message “Found it” to the console if it finds the link.
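The `getElementsByTagName()` plus `innerHTML` comparison can be sketched as a small pure function. Here the elements are plain `{ innerHTML }` objects so the matching logic can be tested outside a browser; in the extension the input would be `document.getElementsByTagName("*")` (the function name is illustrative, not the actual code in contentScripts.js):

```javascript
const doNotSellPhrasing = /(Do.?Not|Don.?t).?Sell.?(My)?.?(Personal)?.?(Information|Info)/gmi;

// Return the subset of elements whose innerHTML contains a Do Not Sell phrase.
// `elements` may be a live HTMLCollection in the browser or an array of
// { innerHTML } objects in a test.
function findDoNotSellElements(elements) {
  return Array.from(elements).filter(
    (el) => typeof el.innerHTML === "string" &&
            el.innerHTML.match(doNotSellPhrasing) !== null
  );
}
```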
I have tested our code on the new sites I added to the validation set. Below are notes about certain sites. An issue I think we may be having is the site triggering the `window.onload` function before the page is actually fully loaded, causing us to miss searching through some tags. Here is the link to the validation set: https://docs.google.com/spreadsheets/d/19Wi2PaPsEOfiSdeaOEdSNg5QCzJbRmc7f5rHhXYzO6E/edit?usp=sharing
https://www.gap.com/: When searching for `<a>`, `<button>`, or `<footer>`, it does not appear. When inspected, it appears as a button tag. When searching for all tags (`<*>`), it is found.

https://www.rakuten.com/: Is not found even when searching for `<*>`. The regex should capture “Don’t sell my info”, as confirmed on regex101.com.

https://www.businessinsider.com/: Is not found even when searching for `<*>`. The regex should capture “CA Do Not Sell My Info”, as confirmed on regex101.com. This page is made for lengthy scrolling, so maybe the footer does not actually “exist” on the page until a user scrolls to it.

https://www.theatlantic.com/: Very similar to Business Insider, the site scrolls for a long time, which may result in the same issue.
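If these lazily rendered footers do require scrolling, one rough approach is to step the viewport down until the bottom is reached. In this sketch, `win` is a window-like object with hypothetical `scrollY`, `innerHeight`, and `documentHeight` fields standing in for the real `window`/`document.body.scrollHeight` properties, so the stepping logic can be exercised outside a browser:

```javascript
// Step a window-like object down the page until the bottom is visible,
// returning the number of scroll steps taken. In the extension, `win`
// would wrap the real window object.
function scrollToBottom(win, step = 500, maxSteps = 100) {
  let steps = 0;
  while (steps < maxSteps && win.scrollY + win.innerHeight < win.documentHeight) {
    win.scrollTo(0, win.scrollY + step);
    steps += 1;
  }
  return steps;
}
```

In practice each step would probably need to be spaced out with `setTimeout` so the lazily loaded footer has time to render before the next scan.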
https://slate.com/: @SebastianZimmeck could you check this site from California? The link finder finds a link, but I am unable to.
@kalicki1 If you wouldn’t mind quickly looking at the sites above and seeing if there is anything obvious I may have missed, I would appreciate it.
Overall, 78/82 sites working seems pretty functional, and I am unsure if the link finder can improve to get to 82/82. I would be okay with leaving it as is, assuming that Kuba is unable to find something I may have missed.