Introduction to Python Web Crawling
I. What is a Web Crawler?
Web Crawler: a program that automatically retrieves information from the internet, capturing data that is valuable to us.
II. Python Web Crawler Architecture
The architecture of a Python web crawler mainly consists of five parts: the scheduler, the URL manager, the web page downloader, the web page parser, and the application (the valuable data that is ultimately extracted).
Scheduler: It is like the CPU of a computer, mainly responsible for coordinating the work among the URL manager, downloader, and parser.
URL Manager: Keeps track of the URLs to be crawled and the URLs already crawled, preventing duplicate and circular crawling. There are three main ways to implement a URL manager: in memory, in a database, or in a caching database (a minimal in-memory sketch follows this list of components).
Web Page Downloader: Given a URL, downloads the web page and converts it into a string. Downloaders include urllib2 (Python 2's official basic module, which supports pages that require logins, proxies, and cookies) and requests (a third-party package).
Web Page Parser: Parses the web page string to extract the information we need, either by matching on the raw string or by building a DOM tree. There are four common options:
- Regular expressions: intuitive; the page is treated as a string and valuable information is extracted by fuzzy matching, which becomes very difficult when the document is complex.
- html.parser: Python's built-in parser.
- BeautifulSoup: a third-party library that can use either the built-in html.parser or lxml underneath; relatively more powerful than the others.
- lxml: a third-party library that can parse both XML and HTML.
html.parser, BeautifulSoup, and lxml all parse the page as a DOM tree, while regular expressions work directly on the string.
Application: The application built from the valuable data extracted from the web pages.
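As promised above, here is a minimal sketch of the in-memory approach to the URL manager: two sets are enough to prevent duplicate and circular crawling. The class and method names are illustrative, not from any particular library.

```python
# A minimal in-memory URL manager (illustrative sketch; names are made up).
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Ignore URLs we have already seen in either set
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move a URL from the "to crawl" set to the "crawled" set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```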
The scheduler coordinates the components in a loop: it takes a URL from the URL manager, hands it to the downloader, passes the downloaded page to the parser, feeds any newly discovered URLs back into the URL manager, and delivers the extracted data to the application.
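A minimal sketch of that loop, reusing the UrlManager above; download() and parse() are hypothetical stand-ins for the downloader and parser components:

```python
# Sketch of the scheduler loop. download() and parse() are hypothetical
# stand-ins for the downloader and parser described above.
def crawl(root_url):
    urls = UrlManager()
    urls.add_new_url(root_url)
    data = []  # the "application": the valuable data we collect
    while urls.has_new_url():
        url = urls.get_new_url()
        html = download(url)                    # downloader: URL -> page string
        new_urls, page_data = parse(url, html)  # parser: extract links and data
        for new_url in new_urls:
            urls.add_new_url(new_url)
        data.append(page_data)
    return data
```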
III. Three Ways to Download Web Pages with urllib2
```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import cookielib
import urllib2

url = "http://www.baidu.com"

# First method: fetch the URL directly
response1 = urllib2.urlopen(url)
print "First method"
# Get the status code; 200 indicates success
print response1.getcode()
# Get the length of the web page content
print len(response1.read())

print "Second method"
# Second method: build a Request object so headers can be set
request = urllib2.Request(url)
# Simulate the Mozilla browser by setting the User-Agent header
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print "Third method"
# Third method: give urllib2 the ability to handle cookies
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print len(response3.read())
# Print the cookies received from the server
print cookie
```
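Note that urllib2 and cookielib exist only in Python 2. On Python 3, the same functionality lives in urllib.request and http.cookiejar; for reference, a rough equivalent of the three methods (the module split is the only substantive change):

```python
#!/usr/bin/python3
# Python 3 equivalent: urllib2 was folded into urllib.request,
# and cookielib was renamed http.cookiejar.
import http.cookiejar
import urllib.request

url = "http://www.baidu.com"

# First method: fetch the URL directly
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

# Second method: a Request object with a User-Agent header
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

# Third method: an opener that handles cookies
cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(len(response3.read()))
print(cookie)
```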
IV. Installation of the Third-Party Library Beautiful Soup
Beautiful Soup: a third-party Python library used to extract data from XML and HTML documents. Official website: https://www.crummy.com/software/BeautifulSoup/
1. Install Beautiful Soup
Open cmd (Command Prompt), navigate to the Scripts directory of your Python installation (Python 2.7 in this tutorial), and type dir to check whether pip.exe is present. If it is, you can use Python's bundled pip command to install the package:
```
pip install beautifulsoup4
```
2. Test if the installation was successful
Write a Python file and input:
```python
import bs4
print bs4
```
Run the file; if it prints the module information without errors, the installation was successful.
V. Parsing HTML Files with Beautiful Soup
```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup parsing object
# (the parser argument is an assumption; the original line was cut off.
# The built-in 'html.parser' works here; 'lxml' is another option.)
soup = BeautifulSoup(html_doc, 'html.parser')
```
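With the soup object in hand, we can query the DOM tree. As a short sketch of what the parsing object can do (and of why re was imported), here are a few typical BeautifulSoup lookups; find_all, find, and get_text are real BeautifulSoup 4 methods, while the specific queries are illustrative examples:

```python
# Find all <a> tags (the three sister links)
for link in soup.find_all('a'):
    print link['href'], link.get_text()

# Find a single node by its id attribute
print soup.find(id='link1')

# Combine with a regular expression: links whose href contains "lacie"
print soup.find_all('a', href=re.compile(r"lacie"))
```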