[Introduction to scraping] Getting table data from a website with Python
Nice to meet you.
The System Solutions Department is totally addicted to Kirby Discovery! It's so cute.
It is already getting hot even though it is still spring, and on my commute I often see people who look like new employees (gazing off into the distance).
This time, I am going to write about scraping with Python, something that may be a little useful for new employees in their work.
What is scraping?
Data analysis has been attracting attention in recent years, and scraping is one of its fundamental techniques: put simply, it is a way of extracting the data you want from websites.
The name apparently comes from the original meaning of the word "scrape", that is, "to scrape something up".
In this article, we will use the programming language Python to automatically retrieve the table information on a web page.
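To give a quick taste of what this looks like in code (the library needed is installed in the next section), here is a minimal sketch that parses a small, made-up HTML snippet with BeautifulSoup; the table contents are purely illustrative and not taken from any real page.

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a real web page
sample_html = """
<table>
  <tr><th>Domain</th><th>Port</th></tr>
  <tr><td>example.com</td><td>443</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
for row in soup.findAll("tr"):
    # Collect the text of every cell (th or td) in the row
    print([cell.get_text() for cell in row.findAll(["td", "th"])])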
Preparation
If Python is not installed, please download and install the package for your OS from https://www.python.org/downloads/ first.
Next, install the library we will use this time, "BeautifulSoup4". (The "html.parser" that appears later is included in Python's standard library, so it does not need to be installed separately.)
This article assumes a Windows environment, so open a command prompt from the search window, or by pressing the [Windows] + [R] keys → entering [cmd] → confirming, and then execute the command below to start the installation.
pip install beautifulsoup4
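After installing, you can confirm that BeautifulSoup4 is available with a quick check like the following; if it prints a version number without an error, the installation succeeded (the exact number will differ by environment).

# Quick sanity check: importing bs4 and printing its version
import bs4
print(bs4.__version__)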
This time, I would like to try automatically retrieving information such as the domains and IP addresses used by Office 365 from the following Microsoft page and exporting it to a CSV file (it would be quite a hassle to collect by hand).
"Office 365 URLs and IP Address Ranges" > "Microsoft 365 Common and Office Online"
https://docs.microsoft.com/ja-jp/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide
Operating environment and full code
OS used: Microsoft Windows 10 Pro
Python version: 3.10
from bs4 import BeautifulSoup
import csv
from urllib.request import urlopen, Request

# User agent information (here, Firefox) sent along with the request
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}

url = "https://docs.microsoft.com/ja-jp/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide"
html = urlopen(Request(url, headers=headers))
bsObj = BeautifulSoup(html, "html.parser")

# Pick out the target table (index 4, i.e. the fifth <table> on the page) and its rows
table = bsObj.findAll("table")[4]
rows = table.findAll("tr")

# Write the text of every cell (td / th) in every row to a CSV file
with open(r"C:\Users\python\Desktop\python\2022\microsoft.csv", "w", encoding="cp932", newline="") as file:
    writer = csv.writer(file)
    for row in rows:
        csvRow = []
        for cell in row.findAll(["td", "th"]):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
Code explanation
from bs4 import BeautifulSoup
import csv
from urllib.request import urlopen, Request
→ Import each library: BeautifulSoup for parsing the HTML, csv for writing the output file, and urlopen / Request for downloading the page.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"} html = urlopen("https://docs.microsoft.com/ja-jp/ microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide") bsObj = BeautifulSoup(html, "html.parser")
→ Add user agent information (in this case, Firefox).
Declare here to specify the page you want to open with urlopen and read it with BeautifulSoup.
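As a side note, if the request fails (for example, the page has moved or the network is unavailable), urlopen raises an exception. The following is only a rough sketch of how that could be handled with urllib's standard HTTPError / URLError classes; the code in this article does not include it.

from urllib.request import urlopen, Request
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

url = "https://docs.microsoft.com/ja-jp/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}

try:
    html = urlopen(Request(url, headers=headers))
except HTTPError as e:
    # The server answered, but with an error status (404, 500, ...)
    print("HTTP error:", e.code)
except URLError as e:
    # The server could not be reached at all
    print("Connection error:", e.reason)
else:
    bsObj = BeautifulSoup(html, "html.parser")
    print(bsObj.title.get_text())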
table = bsObj.findAll("table")[4]
rows = table.findAll("tr")
→ After checking the page's HTML structure with your browser's developer tools, specify index [4] to pick out the target table (indexing starts at 0, so [4] is the fifth table element on the page). Then use findAll again to collect its "tr" (row) tags.
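If you are not sure which index the table you want has, one possible approach (just a sketch, not part of the code above) is to print the first row of every table on the page and look for the one whose headers match:

# List every table together with its first row so you can see
# which index corresponds to the table you are after
for i, t in enumerate(bsObj.findAll("table")):
    table_rows = t.findAll("tr")
    if table_rows:
        print(i, [cell.get_text() for cell in table_rows[0].findAll(["td", "th"])])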
with open(r"C:\Users\python\Desktop\python\2022\microsoft.csv", "w", encoding="cp932", newline="") as file: writer = csv.writer(file) for row in rows: csvRow = [] for cell in row.findAll(["td", "th"]): csvRow.append(cell.get_text()) writer.writerow(csvRow)
→ Specify the character code, etc. with any path and file name (if the file does not exist, it will be created in the specified path).
It is possible to write with "w" and write the information obtained with "newline="" with a new line for each column.
Search for td and th in the rows (tr tag) specified in the above column, get the value of that column in a loop and write it to the csv file.
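As a small optional variation (my own addition, not something the original code does), you can strip surrounding whitespace and line breaks from each cell with get_text(strip=True), and switch the encoding to "utf-8-sig" if the table ever contains characters that cp932 cannot represent:

with open(r"C:\Users\python\Desktop\python\2022\microsoft.csv", "w", encoding="utf-8-sig", newline="") as file:
    writer = csv.writer(file)
    for row in rows:
        # strip=True removes leading/trailing whitespace from each cell's text
        writer.writerow([cell.get_text(strip=True) for cell in row.findAll(["td", "th"])])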
Output result
This is the result. This time we only retrieved a handful of items, but the more information there is to collect, the more time this kind of automation saves.
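If you want to check the contents without opening the file in Excel, a quick read-back like the following also works (assuming the same path and encoding used above):

import csv

# Print the first few rows of the generated CSV file
with open(r"C:\Users\python\Desktop\python\2022\microsoft.csv", encoding="cp932", newline="") as file:
    for i, row in enumerate(csv.reader(file)):
        print(row)
        if i >= 4:  # only show the first five rows
            break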
If I get another chance, I would like to write another article that is useful to someone. See you soon!