Mechanize doesn't work! When that happens, scrape happily with Beautiful Soup ☆

Hello.
My name is Bandai, and I'm in charge of Wild on the development team.

This time, I'd like to use Mechanize, a popular Python crawling tool, together with Beautiful Soup, a library for parsing HTML, to make scraping a happier experience!
The title is a bit of an exaggeration, since Beautiful Soup itself doesn't solve every problem Mechanize runs into, so please bear with me.

What is Beautiful Soup?

Beautiful Soup is a Python library that provides HTML (and XML) parsing functionality.
Since it provides only a parser, some of what it does can be handled by Mechanize alone,
but the beauty of Beautiful Soup is that it can parse HTML that Mechanize cannot.


In this article, we'll look at how to use Beautiful Soup and Mechanize together to break through HTML that Mechanize can't parse on its own.

This time, we will use Python 2.7.

Scraping Procedure

Preparing the necessary libraries

Use pip to install the required libraries:

```shell
pip install mechanize beautifulsoup4
# Use sudo if necessary
# Note: installing "beautifulsoup" instead would give you Beautiful Soup 3
# In my environment, both 3 and 4 are installed
```

How to parse HTML source code retrieved by Mechanize with Beautiful Soup

It's very easy to take HTML from Mechanize and hand it to Beautiful Soup for parsing:

```python
br.open('https://beyondjapan.com')
soup = BeautifulSoup(br.response().read())
```

The code above retrieves the HTML from https://beyondjapan.com, parses it with Beautiful Soup, and stores the resulting object in soup.
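To see what the resulting soup object gives you without a live connection, here is a minimal sketch that parses a literal HTML string in place of br.response().read() (the markup below is invented purely for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for br.response().read(); any HTML string works
html = ('<html><head><title>Beyond</title></head>'
        '<body><a href="/blog">Blog</a></body></html>')

soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)   # -> Beyond
print(soup.a.get('href'))  # -> /blog
```

Once parsed, the whole document is navigable as attributes (soup.title, soup.a) or via find()/find_all() queries.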

 

Switching Beautiful Soup's parser engine

Beautiful Soup allows you to switch the parser engine used for analysis depending on the situation.
The official Beautiful Soup documentation recommends lxml for speed.
Since we will be using lxml here, install the lxml library with pip:

```shell
pip install lxml  # use sudo if necessary
```

When creating a Beautiful Soup object, you can specify a parser engine.
If you don't, Beautiful Soup picks the best parser available on the system (falling back to Python's standard HTML parser if nothing else is installed)
and prints a warning on standard error, so if that bothers you,
it's a good idea to specify a parser engine explicitly even when you want the standard one.

```python
# When executed without specifying a parser
soup = BeautifulSoup(br.response().read())
```

It parses fine, but the following warning is printed:

```
/usr/local/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly
specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't
a problem, but if you run this code on another system, or in a different virtual environment, it may
use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
```
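If you would rather stay on the standard parser, you can silence the warning just by naming it. A minimal sketch (the markup is a placeholder; with Mechanize you would pass br.response().read() instead):

```python
from bs4 import BeautifulSoup

# 'html.parser' is the parser from Python's standard library;
# naming it explicitly suppresses the UserWarning above
soup = BeautifulSoup('<p>example</p>', 'html.parser')
print(soup.p.text)  # -> example
```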

 

To parse HTML using lxml, use the following:

```python
soup = BeautifulSoup(br.response().read(), "lxml")
```

Returning HTML parsed with Beautiful Soup to Mechanize and continuing the crawl

Sometimes you want to use Beautiful Soup to parse HTML that Mechanize couldn't,
and then navigate to a further page based on the parsed result.

For example, when you want to submit data through a form, it is common
to run into a parse error like the following:

```
br.select_form(nr=0)
Traceback (most recent call last):
  File "/home/vagrant/workspace/mechanize_test/test3.py", line 13, in <module>
    br.select_form(nr=0)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_mechanize.py", line 499, in select_form
    global_form = self._factory.global_form
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 544, in __getattr__
    self.forms()
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 557, in forms
    self._forms_factory.forms())
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 237, in forms
    _urlunparse=_rfc3986.urlunsplit,
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 844, in ParseResponseEx
    _urlunparse=_urlunparse,
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 981, in _ParseFileEx
    fp.feed(data)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 758, in feed
    _sgmllib_copy.SGMLParser.feed(self, data)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 110, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 144, in goahead
    k = self.parse_starttag(i)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 302, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 347, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 387, in handle_starttag
    method(attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 735, in do_option
    _AbstractFormParser._start_option(self, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 480, in _start_option
    raise ParseError("OPTION outside of SELECT")
mechanize._form.ParseError: OPTION outside of SELECT
```

When this happens, you would normally have to give up and look for another solution.
Instead, you can have Beautiful Soup parse the form, extract and fix up only the parts that matter, and then hand the result back to Mechanize, which lets you proceed without any issues.
(The error "OPTION outside of SELECT" itself appears to mean that there is an option tag outside of a select tag, or that the select tag contains no option tag.)

In my case, there was no option tag inside the select tag, so I added one with Beautiful Soup and then handed control back to Mechanize.

```python
# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(br.response().read(), 'lxml')

# Extract the form tag
f = soup.find('form')

# Create a Beautiful Soup object for the HTML tag you want to add
o = BeautifulSoup('<option value="hogehoge" selected>fugafuga</option>', 'lxml')

# Extract just the option tag
option = o.option

# Append the option tag to the select tag inside the form
f.find(id='target_select').append(option)

# Build a Mechanize response object and register it with Mechanize
response = mechanize.make_response(str(f), [("Content-Type", "text/html")],
                                   br.geturl(), 200, "OK")
br.set_response(response)

# Now the form can be selected in Mechanize
br.select_form(nr=0)
```
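Since the snippet above depends on a live Mechanize session, here is a self-contained sketch of just the Beautiful Soup half, run against an invented form (the id, values, and markup are placeholders, and 'html.parser' is used so the sketch has no extra dependencies):

```python
from bs4 import BeautifulSoup

# Hypothetical page whose <select> has no <option> children,
# the situation that made Mechanize raise ParseError above
html = '''
<form action="/submit" method="post">
  <select id="target_select" name="target"></select>
</form>
'''

soup = BeautifulSoup(html, 'html.parser')
f = soup.find('form')

# Build the missing <option> in its own soup, then move it
# into the empty <select>
o = BeautifulSoup('<option value="hogehoge" selected>fugafuga</option>',
                  'html.parser')
f.find(id='target_select').append(o.option)

# str(f) is what you would feed to mechanize.make_response()
print(f.find(id='target_select'))
```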

 

I struggled to find any Japanese site explaining that mechanize.make_response() and set_response() can be used to return HTML parsed with Beautiful Soup to Mechanize, but I managed to solve the problem by referring to this site:

Form Handling With Mechanize And Beautifulsoup · Todd Hayton

Although it's an English-language site, it contains detailed information, mainly on how to use Mechanize, and is very instructive.

That's it.

If you want to consult a development professional

At Beyond, we combine the rich track record, technology, and know-how in system development that we have cultivated to date with OSS and cloud technologies such as AWS to provide contract development of web systems with reliable quality and excellent cost performance.

We also handle server-side/backend development and integration with our own APIs, drawing on our technology and know-how from building and operating web system and app infrastructure for large-scale, high-load games, apps, and digital content.

If you are having trouble with development projects, please visit the website below.

● Web system development
● Server-side development (API/DB)

If you found this article helpful, please give it a like!

About the author

Yoichi Bandai

My main job is developing web APIs for social games, but I'm also fortunate to be able to do a lot of other work, including marketing.
Also, my portrait rights within Beyond are treated as CC0.