Mechanize doesn't work! At times like that, get happy with Beautiful Soup ☆

Hello.
My name is Bandai, and I'm the one in charge of being wild on the development team.

This time, I'd like to do some happy scraping using Mechanize, which is often used for crawling in Python, together with Beautiful Soup, a library for parsing HTML.
Beautiful Soup itself doesn't solve Mechanize's problems, so the title is a bit of an exaggeration, but never mind that.

What is Beautiful Soup?

Beautiful Soup is a Python library that provides HT(X)ML parsing functionality.
Since it only provides a parser, some things can be handled with Mechanize alone, but
the beauty of Beautiful Soup is that it can also parse HTML that Mechanize cannot.
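
As a tiny illustration (a minimal sketch; the HTML string and its contents here are made up, not taken from any real page), you can hand any HTML string to Beautiful Soup and then navigate the parsed tree:

from bs4 import BeautifulSoup

html = '<html><body><p class="greeting">Hello, soup!</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Navigate the parsed tree and pull out the tag we want
print soup.find("p", class_="greeting").get_text()  # => Hello, soup!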


Here, I'll look at how to combine Beautiful Soup and Mechanize effectively so that HTML Mechanize can't parse is no longer a headache.

This time, we will be working with the Python 2.7 series.

Preparing the necessary libraries

Use pip to get the required libraries.

pip install mechanize beautifulsoup4
# Use sudo if necessary
# Be careful: "pip install beautifulsoup" installs the Beautiful Soup 3 series
# (By the way, in my environment 3 and 4 coexist just about everywhere.)

 

How to analyze HTML source picked up by Mechanize with Beautiful Soup

It is very easy to import the HTML obtained with Mechanize into Beautiful Soup and parse it.

br.open('https://beyondjapan.com')
soup = BeautifulSoup(br.response().read())

 

The code above retrieves the HTML from https://beyondjapan.com and stores the object parsed by Beautiful Soup in soup.
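
For reference, here is what that flow looks like as a self-contained sketch, including the imports and the Browser setup (the set_handle_robots() call and the printed title are illustrative assumptions, not part of the original snippet):

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)  # assumption: only do this if crawling the site is allowed

br.open('https://beyondjapan.com')

# Hand the raw HTML from Mechanize over to Beautiful Soup
# (no parser specified here; see the next section about choosing one explicitly)
soup = BeautifulSoup(br.response().read())

print soup.title.get_text()  # e.g. print the page title to confirm the parse worked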

 

Switch Beautiful Soup's parser engine

Beautiful Soup allows you to switch the parser engine used for analysis depending on the situation.
Beautiful Soup's official page seems to recommend using lxml in terms of speed.
This time we will use lxml for analysis, so we will install the lxml library using pip.

pip install lxml # sudo if necessary

 

The parser engine is specified when calling Beautiful Soup in your code;
if nothing is specified, Beautiful Soup picks the best parser available on the system (falling back to Python's standard HTML parser)
and prints a warning to standard error. If that bothers you,
it is better to specify the parser engine explicitly, even when you want the standard parser.
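
For example, if you want to stay on the standard parser but make that explicit, you can pass "html.parser", the name Beautiful Soup 4 uses for Python's built-in parser (a small sketch, reusing the br object from above):

# Explicitly request Python's built-in parser to silence the warning
soup = BeautifulSoup(br.response().read(), "html.parser")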

# When executed without specifying a parser
soup = BeautifulSoup(br.response().read())

 

Parsing itself works without any problem, but the following warning shows up.

/usr/local/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. To get rid of this warning, change this: BeautifulSoup([your markup]) to this: BeautifulSoup([your markup], "lxml")
  markup_type=markup_type))

 

To have lxml parse the HTML, specify it like this:

soup = BeautifulSoup(br.response().read(), "lxml")

 

Return the HTML parsed by Beautiful Soup to Mechanize and continue crawling

Here, we use Beautiful Soup to parse HTML that Mechanize could not, and then
move on to further pages based on the parsed result.

For example, when you want to submit data through a form,
it is surprisingly common to run into a parse error at a point like the following.

br.select_form(nr=0)
Traceback (most recent call last):
  File "/home/vagrant/workspace/mechanize_test/test3.py", line 13, in <module>
    br.select_form(nr=0)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_mechanize.py", line 499, in select_form
    global_form = self._factory.global_form
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 544, in __getattr__
    self.forms()
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 557, in forms
    self._forms_factory.forms())
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 237, in forms
    _urlunparse=_rfc3986.urlunsplit,
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 844, in ParseResponseEx
    _urlunparse=_urlunparse,
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 981, in _ParseFileEx
    fp.feed(data)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 758, in feed
    _sgmllib_copy.SGMLParser.feed(self, data)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 110, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 144, in goahead
    k = self.parse_starttag(i)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 302, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 347, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 387, in handle_starttag
    method(attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 735, in do_option
    _AbstractFormParser._start_option(self, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 480, in _start_option
    raise ParseError("OPTION outside of SELECT")
mechanize._form.ParseError: OPTION outside of SELECT

 

In cases like this I used to give up on parsing and look for another approach, but instead
let's have Beautiful Soup parse the page, extract and fix up only the form we need, hand it back to Mechanize, and carry on.
(The error itself, OPTION outside of SELECT, means there is an option tag outside a select tag, or no option tag inside the select tag.)

This time, there was no option tag in the select tag, so I'll add an option tag with Beautiful Soup and then return it to Mechanize.

# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(br.response().read(), 'lxml')

# Extract the form tag
f = soup.find('form')

# Generate a Beautiful Soup object for the HTML tag you want to add
o = BeautifulSoup('<option value="hogehoge" selected>fugafuga</option>', 'lxml')

# Extract the option tag
option = o.option

# Append the option tag to the select tag inside the form
f.find(id='target_select').append(option)

# Create a Mechanize response object and register it with Mechanize
response = mechanize.make_response(str(f), [("Content-Type", "text/html")], br.geturl(), 200, "OK")
br.set_response(response)

# Try selecting the form with Mechanize again
br.select_form(nr=0)
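
From there, Mechanize can continue as usual. As a rough follow-up sketch (assuming the select control's name attribute is also target_select, which is a hypothetical name here):

# Pick the value we just injected and submit the form
br["target_select"] = ["hogehoge"]
response = br.submit()
print response.geturl()  # URL of the page we landed on after submitting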

 

I had a hard time finding a Japanese site explaining that you can return HTML parsed by Beautiful Soup to Mechanize using mechanize.make_response() and set_response(), but I was able to solve the problem by referring to the site below.

Form Handling With Mechanize And Beautifulsoup · Todd Hayton

It's an English site, but it goes into plenty of other details, mainly about how to use Mechanize, so there is a lot to learn from it.

That's it.


About the author

Yoichi Bandai

My main job is developing web APIs for social games, but I'm also fortunate to be able to do a lot of other work, including marketing.
Also, my portrait rights at Beyond are treated as CC0.