Missing reports from large sites #134
Comments
My gut intuition says this is because the crawler is finding items on the internet that it shouldn't be passing to Lighthouse (but is). However, it could also be the size of the queue or something else entirely; I'm not sure where to look for the root cause of this.
I'm unable to reproduce this error. When running auto-lighthouse on ...
Sorry for the confusion.
As you mentioned in PR #133, it seems possible that this issue is the result of this upstream Lighthouse issue. (Or maybe I'm misunderstanding?) Having discovered this, I'm curious what the implications are for the current architecture of this tool. It sounds like we're not supposed to run Lighthouse a bunch of times inside a single process.
From my understanding, and this comment from Patrick, it sounds like running Lighthouse in parallel is a valid use case if you're okay with a loss of accuracy in the performance metrics. I don't know your or your company's use case, but if you're using Lighthouse to audit the other metrics, maybe I can create some way to handle that. I'd have to do some timing tests to see how fast Lighthouse can run when only auditing the performance category, though, to justify my first thought at a solution. For context, I'm thinking of a parallel run of the categories that aren't performance-based, then a sequential run of the performance category. However, that means running Lighthouse four times on each page, which is why I'd need a quick check of how fast the auditing can be done with different categories.
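As a rough illustration of that timing check (not auto-lighthouse's actual code), here is a minimal sketch using Lighthouse's programmatic API with the onlyCategories flag. It assumes the lighthouse and chrome-launcher npm packages; the URL and the category split are placeholders:

```js
const lighthouse = require('lighthouse');
const chromeLauncher = require('chrome-launcher');

// Time a single audit restricted to the given categories.
async function timeAudit(url, categories) {
  const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless'] });
  const start = Date.now();
  await lighthouse(url, { port: chrome.port, onlyCategories: categories });
  await chrome.kill();
  return Date.now() - start;
}

(async () => {
  const url = 'https://example.com'; // placeholder from the repro steps
  const nonPerf = ['accessibility', 'best-practices', 'seo', 'pwa'];
  console.log('non-performance:', await timeAudit(url, nonPerf), 'ms');
  console.log('performance only:', await timeAudit(url, ['performance']), 'ms');
})();
```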
I support the idea of just adding the child process. One way to offset the potential inaccuracy (resulting from resource limitations) might be to add an option controlling the amount of concurrency, so the user can choose the balance between accuracy and speed. If you set the concurrency to 1, I don't think we should expect to hit resource limits, because that is the exact use case the tool was designed for. Users with more horsepower, or less concern for accuracy, could turn up the concurrency to run more tests in parallel. Based on my limited understanding of the relevant code, both of these seem pretty straightforward to do.
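To make the concurrency idea concrete, here is a minimal sketch of a promise-pool limiter, not the actual auto-lighthouse implementation; runLighthouseFor stands in for whatever launches a single audit (e.g. a child process, per the comment above), and concurrency for a hypothetical CLI option:

```js
// Run one audit per URL, with at most `concurrency` audits in flight.
async function auditAll(urls, runLighthouseFor, concurrency = 1) {
  const queue = [...urls];
  const worker = async () => {
    while (queue.length > 0) {
      await runLighthouseFor(queue.shift());
    }
  };
  // concurrency = 1 degenerates to the fully sequential run the tool
  // was originally designed for; higher values trade accuracy for speed.
  await Promise.all(Array.from({ length: concurrency }, worker));
}
```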
Issue is looking mighty stale
Describe the bug
I'm trying to generate reports for a large site (~6760 pages) and it is only producing 437 report files.
To Reproduce
Steps to reproduce the behavior:
git clone ...
cd ./auto-lighthouse
npm install
npm run start -- -u https://example.com --format=csv --respectRobots=false
Expected behavior
I expect the crawler to find ~6760 pages and then generate 13522 report files (two per page, plus two aggregated reports).
Instead, I find ~437 report files and an error in the console.
The console printed "Pushed: ..." 6760 times, then "Generating 13522 reports!" (so far so good), then "Wrote ..." 437 times, followed by an error. It appears that the script is choking on something before it finishes writing all the files. It may have something to do with the race condition mentioned in this unmerged PR.
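For what it's worth, one speculative way such a race could lose files is firing off report writes without awaiting them, so a failure partway through drops the rest. A minimal sketch of the awaited alternative, with reports and its fields entirely hypothetical:

```js
const fs = require('fs').promises;

// Await every write so a failure surfaces before the process moves on.
async function writeReports(reports) {
  await Promise.all(
    reports.map(({ filePath, html }) =>
      fs.writeFile(filePath, html).then(() => console.log('Wrote:', filePath))
    )
  );
}
```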
Here's an abridged version of the full transcript: