[Introduction to scraping] Getting table data from a website with Python
Nice to meet you.
The System Solutions Department is totally addicted to Kirby Discovery! It's so cute.
It is already getting hot even though it is still spring, and on my commute I often see people who look like new employees (gazing off into the distance).
This time, I am going to write about scraping with Python, something that may be a little useful for new employees in their work.
What is scraping?
Data analysis has been attracting attention in recent years, and scraping is one of its fundamental techniques: put simply, it is a way of extracting the data you want from websites.
The name apparently comes from the original meaning of the word "scrape", that is, "to scrape something up".
In this article, we will use the programming language Python to automatically retrieve the table information on a web page.
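To give a quick taste of what this looks like in code (the library needed is installed in the next section), here is a minimal sketch that parses a small, made-up HTML snippet with BeautifulSoup; the table contents are purely illustrative and not taken from any real page.

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a real web page
sample_html = """
<table>
  <tr><th>Domain</th><th>Port</th></tr>
  <tr><td>example.com</td><td>443</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
for row in soup.findAll("tr"):
    # Collect the text of every cell (th or td) in the row
    print([cell.get_text() for cell in row.findAll(["td", "th"])])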
Preparation
If Python is not installed, please download and install the package for your OS from https://www.python.org/downloads/ first.
Next, install the library we will use this time, "BeautifulSoup4". (The "html.parser" that appears later is included in Python's standard library, so it does not need to be installed separately.)
This article assumes a Windows environment, so open a command prompt from the search window, or by pressing the [Windows] + [R] keys → entering [cmd] → confirming, and then execute the command below to start the installation.
pip install beautifulsoup4
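After installing, you can confirm that BeautifulSoup4 is available with a quick check like the following; if it prints a version number without an error, the installation succeeded (the exact number will differ by environment).

# Quick sanity check: importing bs4 and printing its version
import bs4
print(bs4.__version__)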
This time, I would like to try automatically retrieving information such as the domains and IP addresses used by Office 365 from the following Microsoft page and exporting it to a CSV file (it would be quite a hassle to collect by hand).
"Office 365 URLs and IP Address Ranges" > "Microsoft 365 Common and Office Online"
https://docs.microsoft.com/ja-jp/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide
Operating environment and full code
OS used: Microsoft Windows 10 Pro
Python version: 3.10
from bs4 import BeautifulSoup
import csv
from urllib.request import urlopen, Request

# User agent information (here, Firefox) sent along with the request
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}

url = "https://docs.microsoft.com/ja-jp/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide"
html = urlopen(Request(url, headers=headers))
bsObj = BeautifulSoup(html, "html.parser")

# Pick out the target table (index 4, i.e. the fifth <table> on the page) and its rows
table = bsObj.findAll("table")[4]
rows = table.findAll("tr")

# Write the text of every cell (td / th) in every row to a CSV file
with open(r"C:\Users\python\Desktop\python\2022\microsoft.csv", "w", encoding="cp932", newline="") as file:
    writer = csv.writer(file)
    for row in rows:
        csvRow = []
        for cell in row.findAll(["td", "th"]):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
Code explanation
from bs4 import BeautifulSoup
import csv
from urllib.request import urlopen, Request
→ Import each library: BeautifulSoup for parsing the HTML, csv for writing the output file, and urlopen / Request for downloading the page.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"} html = urlopen("https://docs.microsoft.com/ja-jp/ microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide") bsObj = BeautifulSoup(html, "html.parser")
→ Add user agent information (in this case, Firefox).
Declare here to specify the page you want to open with urlopen and read it with BeautifulSoup.
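As a side note, if the request fails (for example, the page has moved or the network is unavailable), urlopen raises an exception. The following is only a rough sketch of how that could be handled with urllib's standard HTTPError / URLError classes; the code in this article does not include it.

from urllib.request import urlopen, Request
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

url = "https://docs.microsoft.com/ja-jp/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}

try:
    html = urlopen(Request(url, headers=headers))
except HTTPError as e:
    # The server answered, but with an error status (404, 500, ...)
    print("HTTP error:", e.code)
except URLError as e:
    # The server could not be reached at all
    print("Connection error:", e.reason)
else:
    bsObj = BeautifulSoup(html, "html.parser")
    print(bsObj.title.get_text())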
table = bsObj.findAll("table")[4]
rows = table.findAll("tr")
→ After checking the page's HTML structure with your browser's developer tools, specify index [4] to pick out the target table (indexing starts at 0, so [4] is the fifth table element on the page). Then use findAll again to collect its "tr" (row) tags.
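If you are not sure which index the table you want has, one possible approach (just a sketch, not part of the code above) is to print the first row of every table on the page and look for the one whose headers match:

# List every table together with its first row so you can see
# which index corresponds to the table you are after
for i, t in enumerate(bsObj.findAll("table")):
    table_rows = t.findAll("tr")
    if table_rows:
        print(i, [cell.get_text() for cell in table_rows[0].findAll(["td", "th"])])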
with open(r"C:\Users\python\Desktop\python\2022\microsoft.csv", "w", encoding="cp932", newline="") as file: writer = csv.writer(file) for row in rows: csvRow = [] for cell in row.findAll(["td", "th"]): csvRow.append(cell.get_text()) writer.writerow(csvRow)
→ Specify the character code, etc. with any path and file name (if the file does not exist, it will be created in the specified path).
It is possible to write with "w" and write the information obtained with "newline="" with a new line for each column.
Search for td and th in the rows (tr tag) specified in the above column, get the value of that column in a loop and write it to the csv file.
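As a small optional variation (my own addition, not something the original code does), you can strip surrounding whitespace and line breaks from each cell with get_text(strip=True), and switch the encoding to "utf-8-sig" if the table ever contains characters that cp932 cannot represent:

with open(r"C:\Users\python\Desktop\python\2022\microsoft.csv", "w", encoding="utf-8-sig", newline="") as file:
    writer = csv.writer(file)
    for row in rows:
        # strip=True removes leading/trailing whitespace from each cell's text
        writer.writerow([cell.get_text(strip=True) for cell in row.findAll(["td", "th"])])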
Output result
This is the result. This time we only retrieved a handful of items, but the more information there is to collect, the more time this kind of automation saves.
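If you want to check the contents without opening the file in Excel, a quick read-back like the following also works (assuming the same path and encoding used above):

import csv

# Print the first few rows of the generated CSV file
with open(r"C:\Users\python\Desktop\python\2022\microsoft.csv", encoding="cp932", newline="") as file:
    for i, row in enumerate(csv.reader(file)):
        print(row)
        if i >= 4:  # only show the first five rows
            break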
If I get another chance, I would like to write another article that is useful to someone. See you soon!