almanac.httparchive.org: Wappalyzer Technologies table has unexpected entries
I am sometimes seeing entries that are not in the Wappalyzer apps.json file: https://github.com/WPO-Foundation/Wappalyzer/blob/master/src/apps.json
For example, under the eCommerce category, there are duplicate entries (with and without spaces):
- SAP Commerce Cloud
- SAPCommerceCloud
- Salesforce Commerce Cloud
- SalesforceCommerceCloud
- Cart Functionality
- CartFunctionality
You can see this in the output of the following query:
SELECT DISTINCT app
FROM httparchive.technologies.2020_10_01_mobile
WHERE category = 'Ecommerce'
ORDER BY app
I am not sure why this is happening. It results in slight overcounting in some queries (for example, the total number of eCommerce platforms analyzed).
We should check why this is happening. The impact on the 2020 chapter is minimal, so I am not spending time getting to the bottom of it for now; I am just raising an issue so that we can look into it later.
Also, for the site https://jelly-pop.com/, the technologies table shows the app as ‘SalesforceCommerceCloud’, but the Wappalyzer Chrome extension does not show this technology for that site. I am not sure why it appears in the technologies table.
I also noticed this under the ‘Analytics’ category, with entries like:
- GoogleTagManager
- Google Tag Manager
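One way to spot such pairs in a query result is to group the app names by a key with all whitespace removed. The following is a minimal sketch in Python, assuming the distinct app names have already been exported to a list. The function name `find_whitespace_duplicates` is hypothetical, and the sample list reuses the names above plus one illustrative entry ("Shopify") that has no duplicate.

```python
from collections import defaultdict


def find_whitespace_duplicates(apps):
    """Group app names that are identical once all whitespace is removed."""
    groups = defaultdict(set)
    for app in apps:
        key = "".join(app.split())  # drop every whitespace character
        groups[key].add(app)
    # Keep only keys that map to more than one distinct spelling.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Sample names from the lists above, plus one illustrative non-duplicate.
apps = [
    "SAP Commerce Cloud",
    "SAPCommerceCloud",
    "Salesforce Commerce Cloud",
    "SalesforceCommerceCloud",
    "Cart Functionality",
    "CartFunctionality",
    "Google Tag Manager",
    "GoogleTagManager",
    "Shopify",
]

duplicates = find_whitespace_duplicates(apps)
for key, names in duplicates.items():
    print(key, "->", names)
```

Counting the keys of the grouped result instead of the raw names would avoid the overcounting, on the assumption that no two genuinely different technologies differ only by whitespace.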
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 22 (22 by maintainers)
Commits related to this issue
- Add the meta tags to the Wappalyzer detection For https://github.com/HTTPArchive/almanac.httparchive.org/issues/1843 — committed to catchpoint/WebPageTest.agent by pmeenan 3 years ago
- Fixed Wappalyzer JS and DOM detections. For https://github.com/HTTPArchive/almanac.httparchive.org/issues/1843 — committed to catchpoint/WebPageTest.agent by pmeenan 3 years ago
Ahh, it looks like the pages override string.trim() so that it removes all of the whitespace. Since the Wappalyzer definitions don’t have any trailing whitespace, I can just remove the trim operations.
Should be fixed now (well, over the next hour as the agent update rolls out).
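The effect of such an override can be modeled outside the browser. The sketch below is written in Python rather than JavaScript and only imitates the behavior described above. It is not the agent's code, and `standard_trim` and `overridden_trim` are hypothetical names.

```python
def standard_trim(value):
    """Behaves like a normal trim: strips leading and trailing whitespace only."""
    return value.strip()


def overridden_trim(value):
    """Imitates a page-supplied trim that removes all whitespace."""
    return "".join(value.split())


name = "Google Tag Manager"
print(standard_trim(name))    # unchanged: there is no whitespace at either end
print(overridden_trim(name))  # the inner spaces are removed as well
```

Because the detected names have no surrounding whitespace to begin with, the standard trim leaves them unchanged, which is consistent with dropping the trim calls altogether.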
Whew. That was somewhat more painful than I expected. I had to rewrite the JS variable detection part, which changed pretty significantly when the engine changed a few months back (I also added support for the DOM detections).
Here is an updated test. Change is rolling out to prod (and HA) over the next hour.
Possibly the serialized DOM. It serializes the HTML but not the DOM. Taking a look now.
Confirmed as all fixed in the May crawl. The same query as above gives 0 results.
Why, why would anyone do this? You get all sorts when you look at 7.5 million web pages…
Good work nailing it down.