Mechanize doesn't work! When that happens, scrape happily with Beautiful Soup ☆

Hello.
My name is Bandai, and I'm in charge of Wild on the development team.

This time, I'd like to use Mechanize, a popular Python crawling tool, together with Beautiful Soup, a library for parsing HTML, to make scraping a happier experience!
The title is a bit of an exaggeration, since Beautiful Soup itself doesn't solve every problem Mechanize runs into, so please bear with me.

What is Beautiful Soup?

Beautiful Soup is a Python library that provides HTML (and XML) parsing functionality.
Since it provides only a parser, some of what it does can be handled by Mechanize alone,
but the beauty of Beautiful Soup is that it can parse HTML that Mechanize cannot.


In this article, we'll look at how to use Beautiful Soup and Mechanize together to break through HTML that Mechanize can't parse on its own.

This time, we will use Python 2.7.

Scraping Procedure

Preparing the necessary libraries

Use pip to install the required libraries:

```shell
pip install mechanize beautifulsoup4
# Use sudo if necessary
# Note: installing "beautifulsoup" instead would give you Beautiful Soup 3
# In my environment, both 3 and 4 are installed
```

How to parse HTML source code retrieved by Mechanize with Beautiful Soup

It's very easy to take HTML from Mechanize and hand it to Beautiful Soup for parsing:

```python
br.open('https://beyondjapan.com')
soup = BeautifulSoup(br.response().read())
```

The code above retrieves the HTML from https://beyondjapan.com, parses it with Beautiful Soup, and stores the resulting object in soup.
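To see what the resulting soup object gives you without a live connection, here is a minimal sketch that parses a literal HTML string in place of br.response().read() (the markup below is invented purely for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for br.response().read(); any HTML string works
html = ('<html><head><title>Beyond</title></head>'
        '<body><a href="/blog">Blog</a></body></html>')

soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)   # -> Beyond
print(soup.a.get('href'))  # -> /blog
```

Once parsed, the whole document is navigable as attributes (soup.title, soup.a) or via find()/find_all() queries.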

 

Switching Beautiful Soup's parser engine

Beautiful Soup allows you to switch the parser engine used for analysis depending on the situation.
The official Beautiful Soup documentation recommends lxml for speed.
Since we will be using lxml here, install the lxml library with pip:

```shell
pip install lxml  # use sudo if necessary
```

When creating a Beautiful Soup object, you can specify a parser engine.
If you don't, Beautiful Soup picks the best parser available on the system (falling back to Python's standard HTML parser if nothing else is installed)
and prints a warning on standard error, so if that bothers you,
it's a good idea to specify a parser engine explicitly even when you want the standard one.

```python
# When executed without specifying a parser
soup = BeautifulSoup(br.response().read())
```

It parses fine, but the following warning is printed:

```
/usr/local/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly
specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't
a problem, but if you run this code on another system, or in a different virtual environment, it may
use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
```
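If you would rather stay on the standard parser, you can silence the warning just by naming it. A minimal sketch (the markup is a placeholder; with Mechanize you would pass br.response().read() instead):

```python
from bs4 import BeautifulSoup

# 'html.parser' is the parser from Python's standard library;
# naming it explicitly suppresses the UserWarning above
soup = BeautifulSoup('<p>example</p>', 'html.parser')
print(soup.p.text)  # -> example
```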

 

To parse HTML using lxml, use the following:

```python
soup = BeautifulSoup(br.response().read(), "lxml")
```

Returning HTML parsed with Beautiful Soup to Mechanize and continuing the crawl

Sometimes you want to use Beautiful Soup to parse HTML that Mechanize couldn't,
and then navigate to a further page based on the parsed result.

For example, when you want to submit data through a form, it is common
to run into a parse error like the following:

```
br.select_form(nr=0)
Traceback (most recent call last):
  File "/home/vagrant/workspace/mechanize_test/test3.py", line 13, in <module>
    br.select_form(nr=0)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_mechanize.py", line 499, in select_form
    global_form = self._factory.global_form
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 544, in __getattr__
    self.forms()
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 557, in forms
    self._forms_factory.forms())
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 237, in forms
    _urlunparse=_rfc3986.urlunsplit,
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 844, in ParseResponseEx
    _urlunparse=_urlunparse,
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 981, in _ParseFileEx
    fp.feed(data)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 758, in feed
    _sgmllib_copy.SGMLParser.feed(self, data)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 110, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 144, in goahead
    k = self.parse_starttag(i)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 302, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 347, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 387, in handle_starttag
    method(attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 735, in do_option
    _AbstractFormParser._start_option(self, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 480, in _start_option
    raise ParseError("OPTION outside of SELECT")
mechanize._form.ParseError: OPTION outside of SELECT
```

When this happens, you would normally have to give up and look for another solution.
Instead, you can have Beautiful Soup parse the form, extract and fix up only the parts that matter, and then hand the result back to Mechanize, which lets you proceed without any issues.
(The error "OPTION outside of SELECT" itself appears to mean that there is an option tag outside of a select tag, or that the select tag contains no option tag.)

In my case, there was no option tag inside the select tag, so I added one with Beautiful Soup and then handed control back to Mechanize.

```python
# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(br.response().read(), 'lxml')

# Extract the form tag
f = soup.find('form')

# Create a Beautiful Soup object for the HTML tag you want to add
o = BeautifulSoup('<option value="hogehoge" selected>fugafuga</option>', 'lxml')

# Extract just the option tag
option = o.option

# Append the option tag to the select tag inside the form
f.find(id='target_select').append(option)

# Build a Mechanize response object and register it with Mechanize
response = mechanize.make_response(str(f), [("Content-Type", "text/html")],
                                   br.geturl(), 200, "OK")
br.set_response(response)

# Now the form can be selected in Mechanize
br.select_form(nr=0)
```
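Since the snippet above depends on a live Mechanize session, here is a self-contained sketch of just the Beautiful Soup half, run against an invented form (the id, values, and markup are placeholders, and 'html.parser' is used so the sketch has no extra dependencies):

```python
from bs4 import BeautifulSoup

# Hypothetical page whose <select> has no <option> children,
# the situation that made Mechanize raise ParseError above
html = '''
<form action="/submit" method="post">
  <select id="target_select" name="target"></select>
</form>
'''

soup = BeautifulSoup(html, 'html.parser')
f = soup.find('form')

# Build the missing <option> in its own soup, then move it
# into the empty <select>
o = BeautifulSoup('<option value="hogehoge" selected>fugafuga</option>',
                  'html.parser')
f.find(id='target_select').append(o.option)

# str(f) is what you would feed to mechanize.make_response()
print(f.find(id='target_select'))
```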

 

I struggled to find any Japanese site explaining that mechanize.make_response() and set_response() can be used to return HTML parsed with Beautiful Soup to Mechanize, but I managed to solve the problem by referring to this site:

Form Handling With Mechanize And Beautifulsoup · Todd Hayton

Although it's an English-language site, it contains detailed information, mainly on how to use Mechanize, and is very instructive.

That's it.

If you want to consult a development professional

At Beyond, we combine the rich track record, technology, and know-how in system development that we have cultivated to date with OSS and cloud technologies such as AWS to provide contract development of web systems with reliable quality and excellent cost performance.

We also handle server-side/backend development and integration with our own APIs, drawing on our technology and know-how from building and operating web system and app infrastructure for large-scale, high-load games, apps, and digital content.

If you are having trouble with development projects, please visit the website below.

● Web system development
● Server-side development (API/DB)

If you found this article helpful, please give it a like!

About the author

Yoichi Bandai

My main job is developing web APIs for social games, but I'm also fortunate to be able to do a lot of other work, including marketing.
Also, my portrait rights within Beyond are treated as CC0.