[Osaka/Yokohama/Tokushima] Looking for infrastructure/server side engineers!

[Deployed by over 500 companies] AWS construction, operation, maintenance, and monitoring services

[Successor to CentOS] AlmaLinux OS server construction/migration service

[For WordPress only] Cloud server “Web Speed”

[Cheap] Website security automatic diagnosis “Quick Scanner”

[Reservation system development] EDISONE customization development service

[Registration of 100 URLs is 0 yen] Website monitoring service “Appmill”

[Compatible with over 200 countries] Global eSIM “Beyond SIM”

[If you are traveling, business trip, or stationed in China] Chinese SIM service “Choco SIM”

[Global exclusive service] Beyond's MSP in North America and China

[YouTube] Beyond official channel “Biyomaru Channel”

Try using headless Google Chrome from Node.js

Hello.
I'm Mandai, in charge of Wild on the development team.

I previously wrote an article titled "Headless mode is now standard in Google Chrome 59, so let's try it out".
Google Chrome updates are applied automatically in the background, so it is easy not to notice them, and I would like to keep checking on them regularly.

This time, I would like to start Google Chrome in headless mode from Node.js and control it via the chrome-remote-interface module.

Sample program doesn't work!

In the previous article, we looked at how to start Google Chrome in headless mode from the command line and capture the screen. Since it is difficult to integrate with other systems using only a shell script, this time let's start it from Node.js and try retrieving the page content itself.

Basically I followed the sample in "Headless Chrome | Web | Google Developers", but it did not work as written, so I used that code as a base and modified it a little. The source is below.

const ChromeLauncher = require('chrome-launcher');
const CDP = require('chrome-remote-interface');
const url = 'https://beyondjapan.com';

var launcherKill = (client, launcher, args = null) => {
  launcher.kill();
  console.log(args);
}

// Start Google Chrome in headless mode on the DevTools port
ChromeLauncher.launch({
  port: 9222,
  chromeFlags: [
    // '--disable-gpu', // Enable if necessary
    '--headless',
    // '--no-sandbox', // Enable if necessary
  ],
}).then(launcher => {
  // Connect to the DevTools protocol
  CDP(client => {
    const {Page, DOM} = client;

    Promise.all([
      Page.enable(),
      DOM.enable(),
    ]).then(() => {
      Page.navigate({url : url});

      Page.loadEventFired(() => {
        DOM.getDocument((error, params) => {
          if (error){
            console.log(params);
            launcher.kill();
            return;
          }

          const opts = {
            nodeId : params.root.nodeId,
            selector : 'a'
          };

          // Get the node IDs of all a tags
          DOM.querySelectorAll(opts, (error, params) => {
            if (error){
              console.log(params);
              launcher.kill();
              return;
            }

            var promises = [];
            // Wrap the asynchronous DOM.getAttributes call in a Promise
            var getDomAttribute = (DOM, nodeId, count) => {
              return new Promise((resolve, reject)=>{
                const opts = {
                  nodeId : nodeId
                };
                DOM.getAttributes(opts, (error, params) => {
                  if (!error)
                    console.log(count, params.attributes[1]);

                  resolve();
                })
              })
            };

            console.log(params.nodeIds.length);
            params.nodeIds.forEach(elm => {
              promises.push(getDomAttribute(DOM, elm, promises.length));
            });

            // Stop Google Chrome only after every attribute has been read
            Promise.all(promises).then(()=>{
              launcher.kill();
            })
          })
        })
      })
    })
  }).on('error', error => {
    console.error(error)
    launcher.kill();
  })
})

 

The process flow is as follows:

  1. Launch Google Chrome using the chrome-launcher module
  2. Once started, connect to the DevTools protocol using the chrome-remote-interface module.
  3. Move to the specified URL from the DevTools client object and get the data
  4. Get all A tags from the DOM and display the href value on standard output
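The four steps above can also be written with the promise-based style that chrome-remote-interface offers when no callback is passed. The following is only a minimal sketch under that assumption, not the code used in this article. The names collectLinkHrefs, hrefFromAttributes, and run are hypothetical, and the client is passed in as an argument so that the middle part can be tried with a stub instead of a real browser.

```javascript
// Hypothetical helper: DOM.getAttributes returns a flat array of
// [name1, value1, name2, value2, ...], so look the value up by name
// instead of assuming href is always the first attribute.
function hrefFromAttributes(attributes) {
  for (let i = 0; i + 1 < attributes.length; i += 2) {
    if (attributes[i] === 'href') return attributes[i + 1];
  }
  return null;
}

// Steps 3 and 4: navigate, wait for the load event, then read every a tag.
// `client` is assumed to be a connected chrome-remote-interface client.
async function collectLinkHrefs(client, url) {
  const { Page, DOM } = client;
  await Promise.all([Page.enable(), DOM.enable()]);
  // Register the load listener before navigating so the event is not missed.
  const loaded = Page.loadEventFired();
  await Page.navigate({ url });
  await loaded;

  const { root } = await DOM.getDocument();
  const { nodeIds } = await DOM.querySelectorAll({ nodeId: root.nodeId, selector: 'a' });
  const results = await Promise.all(nodeIds.map((nodeId) => DOM.getAttributes({ nodeId })));
  return results.map(({ attributes }) => hrefFromAttributes(attributes));
}

// Steps 1 and 2: launch headless Chrome and connect to it.
// The modules are required lazily so the helpers above load without them.
async function run(url) {
  const chromeLauncher = require('chrome-launcher');
  const CDP = require('chrome-remote-interface');
  const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless'] });
  let client;
  try {
    client = await CDP({ port: chrome.port });
    return await collectLinkHrefs(client, url);
  } finally {
    // Always close the connection and stop Chrome, even after an error.
    if (client) await client.close();
    await chrome.kill();
  }
}
```

With both modules installed, calling run('https://beyondjapan.com').then(console.log) should print the collected href values. Because every step is awaited, Chrome is stopped only after all attributes have been read.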

I have not investigated this thoroughly, but the DOM.getAttributes calls made inside getDomAttribute appear to run asynchronously, so if Google Chrome is stopped right after the forEach loop, the contents of the DOM cannot be read.
For that reason, I use Promises to wait until every attribute has been read before stopping Google Chrome.
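This timing problem can be reproduced without a browser. The sketch below uses a hypothetical fakeDom object whose getAttributes calls back on a later tick, as a stand-in for the real DOM domain. The helper names getAttributesAsync and readAll are also made up for illustration. It shows that nothing has been read immediately after the loop, and that everything is available once Promise.all resolves.

```javascript
// Stand-in for the DOM domain (an assumption for illustration): it calls back
// on a later tick, the way a response from the browser would.
const fakeDom = {
  getAttributes({ nodeId }, callback) {
    setTimeout(() => callback(false, { attributes: ['href', `/page/${nodeId}`] }), 10);
  },
};

// Wrap one callback-style call in a Promise so it can be waited on.
function getAttributesAsync(dom, nodeId) {
  return new Promise((resolve, reject) => {
    dom.getAttributes({ nodeId }, (error, params) => {
      if (error) reject(new Error(`getAttributes failed for node ${nodeId}`));
      else resolve(params.attributes);
    });
  });
}

// Resolve only after every node has answered; the caller can then stop Chrome.
function readAll(dom, nodeIds) {
  return Promise.all(nodeIds.map((nodeId) => getAttributesAsync(dom, nodeId)));
}

// Demonstration: no result has arrived right after the calls are issued,
// and all of them have arrived once the combined promise resolves.
const seen = [];
const done = readAll(fakeDom, [1, 2, 3]).then((all) => {
  seen.push(...all);
  return all;
});
console.log('right after the loop:', seen.length); // 0 at this point
done.then((all) => console.log('after Promise.all:', all.length)); // 3
```

Unlike the sample in this article, this version rejects on error instead of resolving silently, so a failed read is not mistaken for a finished one.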

 

An error occurred

If you get an error like the one below when you start the script, you may need to tweak Google Chrome's startup options a bit.

No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/master/docs/linux_suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.

 

It seems that some versions of the library that provides Google Chrome's sandbox environment have the path to the Google Chrome executable hard-coded, and in that case the sandbox cannot be used.
To avoid this error properly, it appears that you need to change the hard-coded path in sandbox/linux/suid/sandbox.cc and rebuild.

Alternatively, in an emergency or when you only need it temporarily, you can get Chrome to start by uncommenting the "--no-sandbox" startup option.

Please note that in this case, you will be starting Google Chrome in an environment where the sandbox is not used, which increases the security risk.
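One way to keep that risk visible is to make "--no-sandbox" an explicit opt-in rather than something left uncommented in the source. The following is a small sketch of that idea. buildChromeFlags is a hypothetical helper, and CHROME_NO_SANDBOX is an assumed environment variable name, not one that Chrome or chrome-launcher defines.

```javascript
// Hypothetical helper: build the chromeFlags array for chrome-launcher.
// --no-sandbox is added only when explicitly requested, so the less safe
// mode never becomes the silent default.
function buildChromeFlags({ noSandbox = false, disableGpu = false } = {}) {
  const flags = ['--headless'];
  if (disableGpu) flags.push('--disable-gpu');
  if (noSandbox) {
    console.warn('Warning: starting Chrome with --no-sandbox');
    flags.push('--no-sandbox');
  }
  return flags;
}

// CHROME_NO_SANDBOX is an assumed variable name used only for this example.
const flags = buildChromeFlags({ noSandbox: process.env.CHROME_NO_SANDBOX === '1' });
console.log(flags);
```

The resulting array can be passed as chromeFlags to ChromeLauncher.launch in place of the hand-edited list.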

 

So, what can it be used for? (Summary)

Since the DevTools protocol can control Google Chrome directly, it should be possible to do scraping that is closer to what a real browser sees than before.

Depending on how it is used, it may also be possible to test JavaScript behavior, and it would be convenient to run such tests automatically in the background from code.

It also seems possible to build a system that monitors whether content that, because of how the site is built, cannot be checked with curl and similar tools is working normally.
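As a rough idea of what such a check could look like, the sketch below loads a page and evaluates a JavaScript expression inside it through the Runtime domain of the DevTools protocol. checkRendered is a hypothetical function name, and the client is assumed to be an already connected chrome-remote-interface client. Note that content rendered some time after the load event may need extra waiting, which this sketch does not handle.

```javascript
// Hypothetical check: load a page and evaluate an expression inside it.
// `client` is assumed to be a connected chrome-remote-interface client.
async function checkRendered(client, url, expression) {
  const { Page, Runtime } = client;
  await Page.enable();
  // Register the load listener before navigating so the event is not missed.
  const loaded = Page.loadEventFired();
  await Page.navigate({ url });
  await loaded;

  // returnByValue asks for the plain value instead of a remote object handle.
  const { result, exceptionDetails } = await Runtime.evaluate({
    expression,
    returnByValue: true,
  });
  if (exceptionDetails) return { ok: false, reason: 'expression threw' };
  if (result.value === true) return { ok: true };
  return { ok: false, reason: 'check returned ' + String(result.value) };
}
```

For example, checkRendered(client, 'https://beyondjapan.com', "document.querySelectorAll('a').length > 0") would report whether any links exist in the rendered DOM, which is something a plain curl request cannot tell you when the links are generated by JavaScript.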

That's it.

[2025.6.30 Amazon Linux 2 support ended] Amazon Linux server migration solution

About the author

Yoichi Bandai

My main job is developing web APIs for social games, but I am also fortunate to be able to do a lot of other work, including marketing.
Incidentally, within Beyond my portrait rights are treated as CC0.