Try using headless Google Chrome from Node.js

Hello.
I'm Mandai, in charge of Wild on the development team.

I previously wrote an article titled "Google Chrome 59 now comes with headless mode, so I'm going to try it out" | Beyond Co., Ltd. Google Chrome updates automatically in the background, so you might not notice the changes, but I'd like to keep checking on them regularly.

This time, I will start Google Chrome in headless mode from Node.js and control it via the chrome-remote-interface module.

The sample program doesn't work!

In the previous article, we looked at how to start Google Chrome in headless mode from the command line and capture a screenshot. However, it is difficult to integrate with other systems using just a shell script, so this time we will start it from Node.js and get the content directly.

I basically used "headless Chrome | Web | Google Developers" as a reference, but I just couldn't get its sample to work, so I modified that code a bit and came up with the following source code.

const ChromeLauncher = require('chrome-launcher');
const CDP = require('chrome-remote-interface');
const url = 'https://beyondjapan.com';

// Helper that stops Chrome and logs a value (defined but not used below)
var launcherKill = (client, launcher, args = null) => {
  launcher.kill();
  console.log(args);
}

ChromeLauncher.launch({
  port: 9222,
  chromeFlags: [
    // '--disable-gpu', // Enable if needed
    '--headless',
    // '--no-sandbox', // Enable if needed
  ],
}).then(launcher => {
  // Connect to the DevTools protocol of the Chrome instance launched above
  CDP(client => {
    const {Page, DOM} = client;
    Promise.all([
      Page.enable(),
      DOM.enable(),
    ]).then(() => {
      Page.navigate({url : url});
      Page.loadEventFired(() => {
        DOM.getDocument((error, params) => {
          if (error){
            console.log(params); // on failure, params carries the error details
            launcher.kill();
            return;
          }
          const opts = {
            nodeId : params.root.nodeId,
            selector : 'a'
          };
          DOM.querySelectorAll(opts, (error, params) => {
            if (error){
              console.log(params);
              launcher.kill();
              return;
            }
            var promises = [];
            // Wrap the asynchronous DOM.getAttributes call in a Promise
            var getDomAttribute = (DOM, nodeId, count) => {
              return new Promise((resolve, reject)=>{
                const opts = {
                  nodeId : nodeId
                };
                DOM.getAttributes(opts, (error, params) => {
                  // attributes[1] is the value of the node's first attribute
                  if (!error) console.log(count, params.attributes[1]);
                  resolve();
                })
              })
            };
            console.log(params.nodeIds.length);
            params.nodeIds.forEach(elm => {
              promises.push(getDomAttribute(DOM, elm, promises.length));
            });
            // Stop Chrome only after every attribute read has finished
            Promise.all(promises).then(()=>{
              launcher.kill();
            })
          })
        })
      })
    })
  }).on('error', error => {
    console.error(error)
    launcher.kill();
  })
})

 

The processing flow is as follows:

  1. Launch Google Chrome using the chrome-launcher module
  2. Once launched, use the chrome-remote-interface module to connect to the DevTools protocol
  3. From the DevTools client object, navigate to the specified URL and retrieve data
  4. Get all A tags from the DOM and print their href values to standard output
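
One caveat about step 4: DOM.getAttributes returns attributes as a flat array of alternating names and values, so reading index 1 only yields the href when href happens to be the first attribute on the tag. Below is a minimal sketch of looking up the value by name instead. The findAttribute helper and the sample data are made up for illustration and are not part of chrome-remote-interface.

```javascript
// Hypothetical helper: look up one attribute in the flat
// [name1, value1, name2, value2, ...] array returned by DOM.getAttributes.
function findAttribute(attributes, name) {
  for (let i = 0; i + 1 < attributes.length; i += 2) {
    if (attributes[i] === name) {
      return attributes[i + 1];
    }
  }
  return undefined; // the attribute is not present on this node
}

// Sample data shaped like a DOM.getAttributes result (invented for this sketch).
const sample = { attributes: ['class', 'nav-link', 'href', '/company/'] };

console.log(findAttribute(sample.attributes, 'href')); // looked up by name
console.log(sample.attributes[1]);                     // first attribute's value
```

With this sample, the lookup by name prints the href value, while index 1 prints the class value.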

It seems that the DOM.getAttributes method runs asynchronously, so if Google Chrome is stopped immediately after the forEach loop, the DOM contents cannot be read.
This time, I'm using Promises to wait until every DOM read has finished before stopping Google Chrome.
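
To show the waiting pattern on its own, here is a minimal sketch that runs without Chrome. fakeGetAttributes is a hypothetical stand-in for a callback-style call such as DOM.getAttributes, and the node IDs and paths are invented.

```javascript
// Hypothetical stand-in for an asynchronous, callback-style call.
// It answers on a later tick, like a real DevTools request would.
function fakeGetAttributes(nodeId, callback) {
  setTimeout(() => {
    callback(false, { attributes: ['href', '/page/' + nodeId] });
  }, 10);
}

// Wrap the callback API in a Promise so completion can be waited on.
function readHref(nodeId) {
  return new Promise((resolve) => {
    fakeGetAttributes(nodeId, (error, params) => {
      resolve(error ? null : params.attributes[1]);
    });
  });
}

// Resolves only after every node has been read, in input order.
function readAllHrefs(nodeIds) {
  return Promise.all(nodeIds.map(readHref));
}

readAllHrefs([1, 2, 3]).then((hrefs) => {
  console.log(hrefs);
  // In the real script, this is the point where launcher.kill() is safe.
});
```

If the cleanup were placed directly after the loop that starts the reads, it would run before any callback had fired, which is the failure described above.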

 

An error occurred

If you get the following error when you launch the script, you'll need to adjust Google Chrome's startup options:

No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/master/docs/linux_suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.

 

It seems that the library for using Google Chrome's sandbox has the path to the Google Chrome executable hard-coded, which prevents the sandbox from being used.
To avoid this error properly, it seems necessary to change the hard-coded path in sandbox/linux/suid/sandbox.cc and rebuild.

Alternatively, in an emergency or as a temporary workaround, you can get Chrome to start by uncommenting the "--no-sandbox" startup option.

Please note that in this case Google Chrome runs without the sandbox, which increases the security risk.
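
One way to keep the workaround from becoming permanent is to make it an explicit opt-in rather than an edit to the source. The following is only a sketch: buildChromeFlags is a hypothetical helper, and CHROME_NO_SANDBOX is an assumed environment variable name, not something chrome-launcher defines.

```javascript
// Hypothetical helper: build the chromeFlags array, adding --no-sandbox
// only when the caller explicitly opts in.
function buildChromeFlags(options = {}) {
  const flags = ['--headless'];
  if (options.disableGpu) {
    flags.push('--disable-gpu');
  }
  if (options.noSandbox) {
    console.warn('Warning: starting Chrome without the sandbox');
    flags.push('--no-sandbox');
  }
  return flags;
}

// CHROME_NO_SANDBOX is an assumed variable name chosen for this sketch.
const flags = buildChromeFlags({
  noSandbox: process.env.CHROME_NO_SANDBOX === '1',
});
console.log(flags);
```

The resulting array could then be passed as the chromeFlags option of ChromeLauncher.launch in place of the hand-edited list.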

 

So, what is it used for? (Summary)

Since you can interact with Google Chrome directly through the DevTools protocol, it looks like scraping can get much closer to what a real browser sees than before.

It also seems possible to test JavaScript behavior, so it would be convenient to run such tests automatically in the background from code.

For websites whose structure means that curl and similar tools cannot fully check the content, it may also be possible to build a system that monitors whether that content is working normally.

That's it.


About the author

Yoichi Bandai

My main job is developing web APIs for social games, but I'm also fortunate to be able to do a lot of other work, including marketing.
Also, within Beyond, my portrait rights are treated as CC0.