gpc-optmeowt: Improve performance of Do Not Sell link detection
There are a few instances in the validation set where identifying Do Not Sell links goes wrong. @OliverWang13, can you fix those? (If you can, please chime in here, @stanleymarkman and/or @kalicki1, supporting @OliverWang13.)
- https://www.gap.com/: The DNS link is in a `<button>` but somehow not found. @OliverWang13: The code can only find it as a `<script>`, `<body>`, or `<html>`. @SebastianZimmeck: Any idea why button detection does not work?
- https://www.rakuten.com/: @OliverWang13: Not sure why this doesn’t work. @SebastianZimmeck: Was the page fully loaded when you analyzed it? Is it maybe dynamically loaded and only in the DOM but not in the page source code itself (assuming that is what you are relying on; when you right-click and inspect with the developer tools, do you see what you expect? Or try checking the source code of the site.)?
- https://www.adobe.com/: @OliverWang13: Sometimes it works, sometimes it doesn’t. @SebastianZimmeck: Any thoughts on why? Can you look into it?
- https://www.theguardian.com/us: @OliverWang13: Has a pop-up when you enter the site with a link to click there, which we do not detect. Then, past the pop-up, there is a link saying “California resident - Do Not Sell”. The current regex does not identify that link. @SebastianZimmeck: Can we modify the regex to capture it?
- https://www.businessinsider.com/: @OliverWang13: Not sure why this doesn’t work. @SebastianZimmeck: Any hints?
- https://www.condenast.com/: @OliverWang13: Seems to return a false positive? Not sure why. Perhaps the link is only shown in California but still exists on the site. @SebastianZimmeck: Yes, I can confirm. There is a link when accessed with a California IP address. It is probably suppressed from being displayed to non-California IP addresses but still in the code of the page. Those instances we should actually count, i.e., they are true positives. This brings up the larger point that I should probably load all pages with a California IP once you have finalized tuning to confirm that I get the same results.
- https://www.theatlantic.com/: @OliverWang13: Not sure why this doesn’t work. @SebastianZimmeck: Any hints?
- https://www.newyorker.com/, https://nypost.com/: @SebastianZimmeck: Can confirm that this works in California.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 31 (26 by maintainers)
Commits related to this issue
- Changed the regex (Issue #223) — committed to privacy-tech-lab/gpc-optmeowt by OliverWang13 3 years ago
- Modified regex and added nonfunctional code to search headers for injected DNS links (Issue #223) — committed to privacy-tech-lab/gpc-optmeowt by OliverWang13 3 years ago
- Brought issue further up to date with main, implemented dns link finder that identifies by searching header streams (Issue #223) — committed to privacy-tech-lab/gpc-optmeowt by OliverWang13 3 years ago
- Added webRequstFiltering to onBeforeSendHeaders listener (issue #223) — committed to privacy-tech-lab/gpc-optmeowt by kalicki1 3 years ago
- Fixed issue w/ recognizing DNS link on `gap.com`, started analysis data parser in logData, started adding CSV saver (issue #223) — committed to privacy-tech-lab/gpc-optmeowt by kalicki1 3 years ago
- Updated DNS link finder regex (issue #223) — committed to privacy-tech-lab/gpc-optmeowt by kalicki1 3 years ago
- Removed old DNS link finder (Issue #223) — committed to privacy-tech-lab/gpc-optmeowt by OliverWang13 3 years ago
- Updated script injection from `contentScript.js` (issue #223) — committed to privacy-tech-lab/gpc-optmeowt by kalicki1 3 years ago
@kalicki1 If you want, feel free to shoot me a message on Teams and we can schedule a Zoom or meet at the lab.
@SebastianZimmeck is correct that imposing a size limit had the best results.
Basically, you just close the filter with `filter.disconnect()` if the stream exceeds an arbitrary length (we chose 200k bytes).

**Implementation**
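A rough sketch of that size cap, using Firefox’s `browser.webRequest.filterResponseData` API (this is not the exact extension code; the function names and callback wiring are illustrative, and the byte check is factored into a pure helper so it can be exercised outside a browser):

```javascript
// Pure helper: has the stream exceeded the cap once this chunk is counted?
// The 200000-byte default matches the limit mentioned above.
function exceedsSizeLimit(bytesSoFar, chunkLength, limit = 200000) {
  return bytesSoFar + chunkLength > limit;
}

// Browser wiring (Firefox-only API; a sketch, not the actual implementation).
function attachSizeLimitedFilter(requestId, onChunk) {
  const filter = browser.webRequest.filterResponseData(requestId);
  let bytesSoFar = 0;
  filter.ondata = (event) => {
    filter.write(event.data); // always pass the response through unmodified
    if (exceedsSizeLimit(bytesSoFar, event.data.byteLength)) {
      filter.disconnect();    // stop filtering; the browser takes over the stream
      return;
    }
    bytesSoFar += event.data.byteLength;
    onChunk(event.data);
  };
  filter.onstop = () => filter.disconnect();
}
```

Disconnecting early like this keeps huge responses (video, large bundles) from being buffered just to search for a footer link.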
@kalicki1 and I worked on this together and were able to implement a link finder for dynamically loaded links. This new method works on 3/4 of the sites we were previously unable to catch, with gap.com being the exception. As of now, the old DNS link finder is not running, but this is likely due to messy code from when I brought the branch closer to main.
I have removed the old `do not sell` link finder in main and will be closing this issue.

@SebastianZimmeck, I took a look at our regex for capturing “Do Not Sell My Info” links from your second point.
The new expression I have pushed to main is:

`const doNotSellPhrasing = /(Do.?Not|Don.?t).?Sell.?(My)?.?(Personal)?.?(Information|Info)/gmi`

This will hopefully solve the issue of our false positive on weebly.com, since the new regex requires “Info” or “Information” after the “Sell” part, irrespective of the characters in between. This would catch “Do Not Sell My Info” or “Don’t sell Info”, but it would reject “Do not sell my” since it doesn’t have “info” or “information” attached to it. I think we can run one final test on our validation set with this regex to see if we accidentally introduce new false negatives and, if everything goes okay, keep this as our final expression.
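As a quick sanity check, the pattern’s behavior on those examples can be verified in plain JavaScript (the wrapper below is illustrative; it uses `String.prototype.match`, which resets the `g` flag’s `lastIndex`, so repeated calls stay stateless):

```javascript
const doNotSellPhrasing = /(Do.?Not|Don.?t).?Sell.?(My)?.?(Personal)?.?(Information|Info)/gmi;

// True when the text contains a Do Not Sell phrase per the regex above.
function hasDoNotSellPhrase(text) {
  return text.match(doNotSellPhrasing) !== null;
}
```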
I managed to get the Gap site logging this week without requiring user input on the site 🎉
The `webRequestFiltering` API we use to pull the data was not implemented fully correctly, so it did not catch everything we needed. I refactored this code so that the filter itself does not close before everything is analyzed, which solved the issue. I now have a `filter.onstop` event that listens for the page to stop loading before processing any of the responses.

@SebastianZimmeck, I can speak to your first question:
(1) We think that the Gap site still won’t show in either detection method because, we believe, the DNS link isn’t sent until the bottom of the page is requested. This means that, unless we find some API to help us (my thinking is this is unlikely), we would still have to automate some sort of scrolling to get it to show.
About your last two points: (2) Yes, we should test on the entire validation set after we get it fully running, and (3) we have not yet, but should, test whether the combination of old and new is better than each on its own.
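The `filter.onstop` approach described earlier can be sketched roughly as follows (again Firefox’s `browser.webRequest.filterResponseData`; the function and parameter names are illustrative, and the chunk-joining helper is factored out so it can be tested outside a browser):

```javascript
// Pure helper: concatenate response chunks into one string once the
// stream has finished.
function joinChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const merged = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) { merged.set(c, offset); offset += c.length; }
  return new TextDecoder("utf-8").decode(merged);
}

// Browser wiring (sketch only): buffer every chunk and defer analysis until
// onstop fires, so the filter is not torn down before the whole response
// has arrived.
function analyzeWhenLoaded(requestId, analyze) {
  const filter = browser.webRequest.filterResponseData(requestId);
  const chunks = [];
  filter.ondata = (event) => {
    filter.write(event.data);                 // pass bytes through untouched
    chunks.push(new Uint8Array(event.data));  // keep a copy for analysis
  };
  filter.onstop = () => {
    filter.disconnect();                      // hand the stream back to the browser
    analyze(joinChunks(chunks));              // now it is safe to process everything
  };
}
```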
With the new regex, I will work my way through the link spreadsheet once again. The header code that was put in is not currently working; it is mostly an example that @kalicki1 and I found. I will continue to experiment with it and see if I can get it working.
Yes, the regex can be simplified to:

`var phrasing = /Do.Not.Sell.My|Don.t.Sell.My/gmi`

As for the two possibilities, the first seems like it could work in a batch-analysis sense, but it would be cumbersome for the casual user who is using analysis mode. With that in mind, I am leaning towards the second possibility.

On my computer, I run `git checkout issue-223` and then `npm run start` from the optmeowt folder and refresh the extension. From the Chrome or Firefox extension page, you may then have to navigate to the chrome or firefox folder, respectively. If this is not working for whatever reason, all added code is in contentScripts.js and could likely be manually added without much difficulty.

The regex I am using should only be compared with the `innerHTML` in the tag. I am getting the tags to look at using `getElementsByTagName()`, which does not seem to have trouble on other sites when an additional class is involved. I tried using `setTimeout`, which worked on gap.com but only when I took a few seconds to quickly scroll down and load the link, which I would not call a suitable workaround. Other than that, `setTimeout` did not seem to be effective.

@SebastianZimmeck, I was more thinking that you could see if you find a Do Not Sell link manually, to confirm that the site has one (as it may be a link that can only be seen from California). The identification code finds the link, but I cannot confirm that the link exists. However, if you would like the code, it is currently in main (in contentScripts.js, more specifically), and it prints the message “Found it” to the console if it finds the link.
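The `getElementsByTagName()` plus `innerHTML` comparison can be sketched as a small pure function. Here the elements are plain `{ innerHTML }` objects so the matching logic can be tested outside a browser; in the extension the input would be `document.getElementsByTagName("*")` (the function name is illustrative, not the actual code in contentScripts.js):

```javascript
const doNotSellPhrasing = /(Do.?Not|Don.?t).?Sell.?(My)?.?(Personal)?.?(Information|Info)/gmi;

// Return the subset of elements whose innerHTML contains a Do Not Sell phrase.
// `elements` may be a live HTMLCollection in the browser or an array of
// { innerHTML } objects in a test.
function findDoNotSellElements(elements) {
  return Array.from(elements).filter(
    (el) => typeof el.innerHTML === "string" &&
            el.innerHTML.match(doNotSellPhrasing) !== null
  );
}
```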
I have tested our code on the new sites I added to the validation set. Below are notes about certain sites. An issue I think we may be having is the site triggering the `window.onload` function before the page is actually fully loaded, causing us to miss searching through some tags. Here is the link to the validation set: https://docs.google.com/spreadsheets/d/19Wi2PaPsEOfiSdeaOEdSNg5QCzJbRmc7f5rHhXYzO6E/edit?usp=sharing
https://www.gap.com/: When searching for `<a>`, `<button>`, or `<footer>`, it does not appear. When inspected, it appears as a button tag. When searching for all tags (`<*>`), it is found.

https://www.rakuten.com/: Is not found even when searching for `<*>`. The regex should capture “Don’t sell my info”, as confirmed on regex101.com.

https://www.businessinsider.com/: Is not found even when searching for `<*>`. The regex should capture “CA Do Not Sell My Info”, as confirmed on regex101.com. This page is made for lengthy scrolling, so maybe the footer does not actually “exist” on the page until a user scrolls to it.

https://www.theatlantic.com/: Very similar to Business Insider, the site scrolls for a long time, which may result in the same issue.
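If these lazily rendered footers do require scrolling, one rough approach is to step the viewport down until the bottom is reached. In this sketch, `win` is a window-like object with hypothetical `scrollY`, `innerHeight`, and `documentHeight` fields standing in for the real `window`/`document.body.scrollHeight` properties, so the stepping logic can be exercised outside a browser:

```javascript
// Step a window-like object down the page until the bottom is visible,
// returning the number of scroll steps taken. In the extension, `win`
// would wrap the real window object.
function scrollToBottom(win, step = 500, maxSteps = 100) {
  let steps = 0;
  while (steps < maxSteps && win.scrollY + win.innerHeight < win.documentHeight) {
    win.scrollTo(0, win.scrollY + step);
    steps += 1;
  }
  return steps;
}
```

In practice each step would probably need to be spaced out with `setTimeout` so the lazily loaded footer has time to render before the next scan.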
https://slate.com/: @SebastianZimmeck could you check this site from California? The link finder finds a link, but I am unable to.
@kalicki1 If you wouldn’t mind quickly looking at the sites above and seeing if there is anything obvious I may have missed, I would appreciate it.
Overall, 78/82 sites working seems pretty functional, and I am unsure if the link finder can improve to get to 82/82. I would be okay with leaving it as is, assuming that Kuba is unable to find something I may have missed.