Try using headless Google Chrome from Node.js
Hello.
I'm Mandai, in charge of Wild on the development team.
I previously wrote an article titled "Headless mode is now standard in Google Chrome 59, so let's try it out" on the Beyond Co., Ltd. blog. Google Chrome updates itself automatically in the background, so changes like this are easy to miss, which is why I try to check on them regularly.
This time, I would like to start Google Chrome in headless mode from Node.js and control it via the chrome-remote-interface module.
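As a quick smoke test of the two modules before the full script in the next section, the following minimal sketch launches a headless Chrome with chrome-launcher and asks it for its version over the DevTools protocol with chrome-remote-interface. The flags and port handling here are just one reasonable choice, not something from the original article:

// Minimal smoke test: launch headless Chrome, ask for its version
// over the DevTools protocol, then shut it down again.
const ChromeLauncher = require('chrome-launcher');
const CDP = require('chrome-remote-interface');

ChromeLauncher.launch({
  chromeFlags: ['--headless'],
}).then(chrome => {
  CDP.Version({ port: chrome.port })
    .then(info => {
      // Prints something like "HeadlessChrome/59.0.xxxx.xx"
      console.log(info.Browser);
    })
    .catch(err => console.error(err))
    .then(() => chrome.kill()); // Always stop the Chrome we launched
});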
Sample program doesn't work!
In the previous article we looked at starting Google Chrome in headless mode from the command line and capturing the screen, but a shell script alone is hard to integrate with other systems, so this time let's launch it from Node.js and fetch the page content that way.
I basically followed the sample in Headless Chrome | Web | Google Developers, but it didn't work as-is, so I used that code as a base and modified it a little. The source is below.
const ChromeLauncher = require('chrome-launcher');
const CDP = require('chrome-remote-interface');

const url = 'https://beyondjapan.com';

var launcherKill = (client, launcher, args = null) => {
  launcher.kill();
  console.log(args);
};

ChromeLauncher.launch({
  port: 9222,
  chromeFlags: [
    // '--disable-gpu', // Enable if necessary
    '--headless',
    // '--no-sandbox', // Enable if necessary
  ],
}).then(launcher => {
  CDP(client => {
    const {Page, DOM} = client;
    Promise.all([
      Page.enable(),
      DOM.enable(),
    ]).then(() => {
      Page.navigate({url : url});
      // Wait for the load event before touching the DOM
      Page.loadEventFired(() => {
        DOM.getDocument((error, params) => {
          if (error) {
            console.log(params);
            launcher.kill();
            return;
          }
          const opts = {
            nodeId : params.root.nodeId,
            selector : 'a'
          };
          DOM.querySelectorAll(opts, (error, params) => {
            if (error) {
              console.log(params);
              launcher.kill();
              return;
            }
            var promises = [];
            // Wrap the asynchronous getAttributes call in a Promise per node
            var getDomAttribute = (DOM, nodeId, count) => {
              return new Promise((resolve, reject) => {
                const opts = { nodeId : nodeId };
                DOM.getAttributes(opts, (error, params) => {
                  if (!error) console.log(count, params.attributes[1]);
                  resolve();
                });
              });
            };
            console.log(params.nodeIds.length);
            params.nodeIds.forEach(elm => {
              promises.push(getDomAttribute(DOM, elm, promises.length));
            });
            // Only stop Chrome after every attribute read has finished
            Promise.all(promises).then(() => {
              launcher.kill();
            });
          });
        });
      });
    });
  }).on('error', error => {
    console.error(error);
    launcher.kill();
  });
});
The process flow is as follows:
- Launch Google Chrome using the chrome-launcher module
- Once started, connect to the DevTools protocol using the chrome-remote-interface module.
- Navigate to the specified URL through the DevTools client object and retrieve the page
- Get all a tags from the DOM and print their href values to standard output
That is the overall flow.
I haven't been able to investigate this thoroughly, but the DOM.getAttributes method (around line 49) appears to run asynchronously, and if Google Chrome is stopped right after the forEach loop finishes, the contents of the DOM can no longer be read.
So this time, each read is wrapped in a Promise and Google Chrome is only stopped once all of them have resolved.
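For reference, the same flow can also be written against chrome-remote-interface's promise interface with async/await, which makes the ordering explicit: every DOM.getAttributes call completes before Chrome is killed. This is only a sketch of the same logic, not the version tested above, and it assumes a chrome-remote-interface release recent enough to return promises from commands:

const ChromeLauncher = require('chrome-launcher');
const CDP = require('chrome-remote-interface');

const url = 'https://beyondjapan.com';

(async () => {
  const launcher = await ChromeLauncher.launch({
    port: 9222,
    chromeFlags: ['--headless'],
  });
  const client = await CDP({ port: 9222 });
  const { Page, DOM } = client;
  try {
    await Promise.all([Page.enable(), DOM.enable()]);

    // Navigate and wait for the load event before reading the DOM
    await Page.navigate({ url: url });
    await new Promise(resolve => Page.loadEventFired(resolve));

    const { root } = await DOM.getDocument();
    const { nodeIds } = await DOM.querySelectorAll({
      nodeId: root.nodeId,
      selector: 'a',
    });
    console.log(nodeIds.length);

    // getAttributes returns a flat [name, value, name, value, ...] array,
    // so look up the position of href rather than assuming it comes first
    for (const nodeId of nodeIds) {
      const { attributes } = await DOM.getAttributes({ nodeId });
      const hrefIndex = attributes.indexOf('href');
      if (hrefIndex !== -1) console.log(attributes[hrefIndex + 1]);
    }
  } finally {
    // Every read has completed by the time we get here
    await client.close();
    launcher.kill();
  }
})();

Looking up the position of href in the flat attributes array is a little more robust than assuming it is always the second element, but otherwise the behavior is the same as the callback version.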
An error occurred
If you get an error like the one below when you start the script, you may need to tweak Google Chrome's startup options a bit.
No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/master/docs/linux_suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.
There seem to be versions of the library used for Google Chrome's sandbox environment in which the path to the Google Chrome executable is hard-coded, so the sandbox cannot be used.
To properly avoid this error, it appears that you need to change the hard-coded path in sandbox/linux/suid/sandbox.cc and rebuild.
Alternatively, if it's an emergency or you only need it temporarily, you can get Chrome to start by uncommenting the "--no-sandbox" startup option.
Please note that in this case, you will be starting Google Chrome in an environment where the sandbox is not used, which increases the security risk.
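Concretely, the temporary workaround just means passing that flag at launch time, along the lines of the following sketch (again, not something to leave enabled longer than necessary):

const ChromeLauncher = require('chrome-launcher');

// Temporary workaround only: without the sandbox, a compromised page has
// fewer barriers between it and the rest of the system.
ChromeLauncher.launch({
  port: 9222,
  chromeFlags: [
    '--headless',
    '--no-sandbox',
  ],
}).then(launcher => {
  console.log('Chrome is listening on port 9222');
  // ...connect with chrome-remote-interface as in the script above...
  launcher.kill();
});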
Now, what do you use it for? (summary)
Since the DevTools protocol lets you drive Google Chrome directly, it should be possible to do more realistic scraping than before.
Depending on how you set things up, it should also be possible to test the behavior of JavaScript, so being able to run such tests automatically in the background from code would be convenient.
And depending on how a site is built, it also looks feasible to monitor whether content that cannot be checked with curl and the like is rendering normally; a rough sketch of such a check follows.
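As a hypothetical example of that kind of monitoring, the Runtime domain can evaluate JavaScript inside the rendered page, so a check can assert on content that only exists after client-side scripts have run. The expression and pass condition below are made up for illustration:

const ChromeLauncher = require('chrome-launcher');
const CDP = require('chrome-remote-interface');

(async () => {
  const chrome = await ChromeLauncher.launch({ chromeFlags: ['--headless'] });
  const client = await CDP({ port: chrome.port });
  const { Page, Runtime } = client;
  try {
    await Promise.all([Page.enable(), Runtime.enable()]);
    await Page.navigate({ url: 'https://beyondjapan.com' });
    await new Promise(resolve => Page.loadEventFired(resolve));

    // Hypothetical check: does the rendered page contain at least one link?
    // Swap the expression for whatever your site is supposed to render.
    const { result } = await Runtime.evaluate({
      expression: 'document.querySelectorAll("a").length',
      returnByValue: true,
    });
    const ok = result.value > 0;
    console.log(ok ? 'OK' : 'NG');
    process.exitCode = ok ? 0 : 1;
  } finally {
    await client.close();
    chrome.kill();
  }
})();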
That's it.