Mechanize won't work! In that case, Beautiful Soup is the answer!

Hello.
I'm Bandai, and I'm in charge of development on the Wild team.
This time, we'll use Mechanize, a commonly used Python crawling tool, together with Beautiful Soup, a library for parsing HTML, to do some happy web scraping. Please note that
Beautiful Soup itself doesn't solve Mechanize's problems, so the title is a bit of an exaggeration.
What is Beautiful Soup?
Beautiful Soup is a Python library that provides HTML/XML parsing functionality.
While it only provides parsing capabilities, and some tasks can be handled by Mechanize alone,
Beautiful Soup's strength lies in its ability to parse HTML that Mechanize cannot.
In this article, we'll explore how to combine Beautiful Soup and Mechanize
to efficiently handle HTML that Mechanize cannot parse.
This time, we will use Python 2.7.
Scraping Procedure
Preparing the necessary libraries
Use pip to get the required libraries
pip install mechanize beautifulsoup4  # use sudo if necessary
# Note: "pip install beautifulsoup" installs Beautiful Soup 3.
# In my environment, both 3 and 4 are installed.
How to parse HTML source code retrieved by Mechanize with Beautiful Soup
It's very easy to take HTML from Mechanize and put it into Beautiful Soup for parsing
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.open('https://beyondjapan.com')
soup = BeautifulSoup(br.response().read())

The above code retrieves the HTML from https://beyondjapan.com, parses it with Beautiful Soup, and places the resulting object in soup.
Switching Beautiful Soup's parser engine
Beautiful Soup allows you to switch between parser engines depending on the situation.
The official Beautiful Soup website recommends using lxml for speed reasons.
This time, we will use lxml for analysis, so we will install the lxml library using pip.
pip install lxml # use sudo if necessary
When calling Beautiful Soup in your code, you can specify the parser engine;
if you don't specify one, it seems to fall back to the best parser available on your system.
A warning will appear on standard error in that case, so if this bothers you,
you might want to specify the parser engine explicitly even when using the standard parser.
# When executed without specifying a parser
soup = BeautifulSoup(br.response().read())
It parses fine, but the following warning appears:
/usr/local/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
To parse HTML using lxml, use the following:
soup = BeautifulSoup(br.response().read(), "lxml")
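For example, here is a minimal sketch (assuming only that `beautifulsoup4` is installed; `html.parser` ships with Python, while `lxml` must be installed separately) showing that passing the parser name explicitly suppresses the warning and keeps behavior consistent across environments:

```python
from bs4 import BeautifulSoup

html = "<p>Hello, <b>world</b></p>"

# Naming the parser explicitly ("html.parser" here, or "lxml" if installed)
# avoids the "No parser was explicitly specified" UserWarning.
soup = BeautifulSoup(html, "html.parser")
print(soup.b.get_text())  # -> world
```

The same call works unchanged with `"lxml"` in place of `"html.parser"` once lxml is installed.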
Return the parsed HTML with Beautiful Soup to Mechanize and continue crawling
There are cases where Mechanize can't parse the HTML, so you parse it with Beautiful Soup instead,
and then want to navigate to a later page based on the parsed result.
In those situations, even if you don't fully understand why,
it's common to run into parsing errors like the following:
br.select_form(nr=0)
Traceback (most recent call last):
  File "/home/vagrant/workspace/mechanize_test/test3.py", line 13, in <module>
    br.select_form(nr=0)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_mechanize.py", line 499, in select_form
    global_form = self._factory.global_form
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 544, in __getattr__
    self.forms()
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 557, in forms
    self._forms_factory.forms())
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 237, in forms
    _urlunparse=_rfc3986.urlunsplit,
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 844, in ParseResponseEx
    _urlunparse=_urlunparse,
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 981, in _ParseFileEx
    fp.feed(data)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 758, in feed
    _sgmllib_copy.SGMLParser.feed(self, data)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 110, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 144, in goahead
    k = self.parse_starttag(i)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 302, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 347, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 387, in handle_starttag
    method(attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 735, in do_option
    _AbstractFormParser._start_option(self, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 480, in _start_option
    raise ParseError("OPTION outside of SELECT")
mechanize._form.ParseError: OPTION outside of SELECT
In these cases, parsing usually fails, and you'd be forced to give up and look for other methods.
However, if you first have Beautiful Soup parse the page, extract and fix only the form you need, and then hand it back to Mechanize, you can proceed smoothly.
(The "OPTION outside of SELECT" error itself seems to mean that an option tag appears outside a select tag, or that a select tag contains no option tag.)
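As a small offline sketch of that condition (assuming only `beautifulsoup4`; the page markup and the `target_select` id here are hypothetical), this is the shape of HTML that trips up Mechanize, and how Beautiful Soup can patch it:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page: a select tag with no option children,
# the pattern that makes Mechanize raise "OPTION outside of SELECT".
page = '<form action="/search"><select id="target_select" name="s"></select></form>'
demo = BeautifulSoup(page, "html.parser")

select = demo.find(id="target_select")
print(len(select.find_all("option")))  # -> 0

# Inserting an option makes the select well-formed again.
patch = BeautifulSoup('<option value="hogehoge" selected>fugafuga</option>',
                      "html.parser")
select.append(patch.option)
print(len(select.find_all("option")))  # -> 1
```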
This time, there was no option tag inside the select tag, so I added one using Beautiful Soup and then handed the result back to Mechanize.
# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(br.response().read(), 'lxml')
# Extract the form tag
f = soup.find('form')
# Create a Beautiful Soup object for the HTML tag you want to add
o = BeautifulSoup('<option value="hogehoge" selected>fugafuga</option>', 'lxml')
# Extract the option tag part
option = o.option
# Add the option tag to the select tag in the form
f.find(id='target_select').append(option)
# Create a Mechanize response object and register it with Mechanize
response = mechanize.make_response(str(f), [("Content-Type", "text/html")],
                                   br.geturl(), 200, "OK")
br.set_response(response)
# Try selecting the form in Mechanize again
br.select_form(nr=0)
I struggled to find a Japanese website explaining that mechanize.make_response() and set_response() can be used to return HTML parsed with Beautiful Soup to Mechanize, but I managed to solve the problem by referring to this site:
Form Handling With Mechanize And Beautifulsoup · Todd Hayton
Although it is an English site, it contains detailed information, mainly on how to use Mechanize, and is very educational.
That's all.
If you want to consult with a development professional
At Beyond, we combine our extensive track record, technology, and know-how in system development with OSS and cloud technologies such as AWS to provide contract development of web systems with reliable quality and excellent cost performance.
We also handle server-side/back-end development and custom API integration, making full use of our technology and know-how in building and operating web system/application infrastructure for large-scale, high-load games, applications, and digital content.
If you have any problems with your development project, please visit the following pages.
● Web system development
● Server-side development (API / DB)
