Mechanize won't work! In that case, Beautiful Soup is the answer!

Hello.
I'm Bandai, and I'm in charge of development on the Wild team.
This time, we'll use Mechanize, a commonly used Python crawling tool, together with Beautiful Soup, a library for parsing HTML, to do some happy web scraping. Please note that
Beautiful Soup itself doesn't solve Mechanize's problems, so the title is a bit of an exaggeration.
What is Beautiful Soup?
Beautiful Soup is a Python library that provides HTML/XML parsing functionality.
While it only provides parsing capabilities, and some tasks can be handled by Mechanize alone,
Beautiful Soup's strength lies in its ability to parse HTML that Mechanize cannot.
In this article, we'll explore how to combine Beautiful Soup and Mechanize
to efficiently handle HTML that Mechanize cannot parse.
This time, we will use Python 2.7.
Scraping Procedure
Preparing the necessary libraries
Use pip to get the required libraries
pip install mechanize beautifulsoup4  # use sudo if necessary
# Note: "pip install beautifulsoup" installs Beautiful Soup 3.
# In my environment, both 3 and 4 are installed.
How to parse HTML source code retrieved by Mechanize with Beautiful Soup
It's very easy to take HTML from Mechanize and put it into Beautiful Soup for parsing
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.open('https://beyondjapan.com')
soup = BeautifulSoup(br.response().read())

The above code retrieves the HTML from https://beyondjapan.com, parses it with Beautiful Soup, and places the resulting object in soup.
Switching Beautiful Soup's parser engine
Beautiful Soup allows you to switch between parser engines depending on the situation.
The official Beautiful Soup website recommends using lxml for speed reasons.
This time, we will use lxml for analysis, so we will install the lxml library using pip.
pip install lxml # use sudo if necessary
When calling Beautiful Soup in your code, you can specify the parser engine;
if you don't specify one, it seems to fall back to the best parser available on your system.
A warning will appear on standard error in that case, so if this bothers you,
you might want to specify the parser engine explicitly even when using the standard parser.
# When executed without specifying a parser
soup = BeautifulSoup(br.response().read())
It parses fine, but the following warning appears:
/usr/local/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
To parse HTML using lxml, use the following:
soup = BeautifulSoup(br.response().read(), "lxml")
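For example, here is a minimal sketch (assuming only that `beautifulsoup4` is installed; `html.parser` ships with Python, while `lxml` must be installed separately) showing that passing the parser name explicitly suppresses the warning and keeps behavior consistent across environments:

```python
from bs4 import BeautifulSoup

html = "<p>Hello, <b>world</b></p>"

# Naming the parser explicitly ("html.parser" here, or "lxml" if installed)
# avoids the "No parser was explicitly specified" UserWarning.
soup = BeautifulSoup(html, "html.parser")
print(soup.b.get_text())  # -> world
```

The same call works unchanged with `"lxml"` in place of `"html.parser"` once lxml is installed.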
Return the parsed HTML with Beautiful Soup to Mechanize and continue crawling
There are cases where Mechanize can't parse the HTML, so you parse it with Beautiful Soup instead,
and then want to navigate to a later page based on the parsed result.
In those situations, even if you don't fully understand why,
it's common to run into parsing errors like the following:
br.select_form(nr=0)
Traceback (most recent call last):
  File "/home/vagrant/workspace/mechanize_test/test3.py", line 13, in <module>
    br.select_form(nr=0)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_mechanize.py", line 499, in select_form
    global_form = self._factory.global_form
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 544, in __getattr__
    self.forms()
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 557, in forms
    self._forms_factory.forms())
  File "/usr/local/lib/python2.7/site-packages/mechanize/_html.py", line 237, in forms
    _urlunparse=_rfc3986.urlunsplit,
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 844, in ParseResponseEx
    _urlunparse=_urlunparse,
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 981, in _ParseFileEx
    fp.feed(data)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 758, in feed
    _sgmllib_copy.SGMLParser.feed(self, data)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 110, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 144, in goahead
    k = self.parse_starttag(i)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 302, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 347, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_sgmllib_copy.py", line 387, in handle_starttag
    method(attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 735, in do_option
    _AbstractFormParser._start_option(self, attrs)
  File "/usr/local/lib/python2.7/site-packages/mechanize/_form.py", line 480, in _start_option
    raise ParseError("OPTION outside of SELECT")
mechanize._form.ParseError: OPTION outside of SELECT
In these cases, parsing usually fails, and you'd be forced to give up and look for other methods.
However, if you first have Beautiful Soup parse the page, extract and fix only the form you need, and then hand it back to Mechanize, you can proceed smoothly.
(The "OPTION outside of SELECT" error itself seems to mean that an option tag appears outside a select tag, or that a select tag contains no option tag.)
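As a small offline sketch of that condition (assuming only `beautifulsoup4`; the page markup and the `target_select` id here are hypothetical), this is the shape of HTML that trips up Mechanize, and how Beautiful Soup can patch it:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page: a select tag with no option children,
# the pattern that makes Mechanize raise "OPTION outside of SELECT".
page = '<form action="/search"><select id="target_select" name="s"></select></form>'
demo = BeautifulSoup(page, "html.parser")

select = demo.find(id="target_select")
print(len(select.find_all("option")))  # -> 0

# Inserting an option makes the select well-formed again.
patch = BeautifulSoup('<option value="hogehoge" selected>fugafuga</option>',
                      "html.parser")
select.append(patch.option)
print(len(select.find_all("option")))  # -> 1
```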
This time, there was no option tag inside the select tag, so I added one using Beautiful Soup and then handed the result back to Mechanize.
# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(br.response().read(), 'lxml')
# Extract the form tag
f = soup.find('form')
# Create a Beautiful Soup object for the HTML tag you want to add
o = BeautifulSoup('<option value="hogehoge" selected>fugafuga</option>', 'lxml')
# Extract the option tag part
option = o.option
# Add the option tag to the select tag in the form
f.find(id='target_select').append(option)
# Create a Mechanize response object and register it with Mechanize
response = mechanize.make_response(str(f), [("Content-Type", "text/html")],
                                   br.geturl(), 200, "OK")
br.set_response(response)
# Try selecting the form in Mechanize again
br.select_form(nr=0)
I struggled to find a Japanese website explaining that mechanize.make_response() and set_response() can be used to return HTML parsed with Beautiful Soup to Mechanize, but I managed to solve the problem by referring to this site:
Form Handling With Mechanize And Beautifulsoup · Todd Hayton
Although it is an English site, it contains detailed information, mainly on how to use Mechanize, and is very educational.
That's all.
If you want to consult with a development professional
At Beyond, we combine our extensive track record, technology, and know-how in system development with OSS and cloud technologies such as AWS to provide contract development of web systems with reliable quality and excellent cost performance.
We also handle server-side/back-end development and custom API integration, making full use of our technology and know-how in building and operating web system/application infrastructure for large-scale, high-load games, applications, and digital content.
If you have any problems with your development project, please visit the following pages.
● Web system development
● Server-side development (API / DB)
