Easy Tutorial

Introduction to Python Web Crawling

Category: Programming Techniques

I. What is a Web Crawler

Web crawler: a program that automatically fetches pages from the internet and extracts the data that is valuable to us.

II. Python Web Crawler Architecture

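The architecture of a Python web crawler consists of five main parts: the scheduler, the URL manager, the web page downloader, the web page parser, and the application (which consumes the valuable data that was extracted).

The scheduler coordinates the other four parts: it asks the URL manager for the next URL, has the downloader fetch that page, passes the page to the parser, and then routes newly found URLs back to the URL manager and the extracted data to the application.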
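The coordination between the five parts can be sketched in a few lines. This is only an illustration written for Python 3 with the standard library; the names `UrlManager`, `LinkAndTitleParser`, `parse`, `crawl`, and the in-memory `FAKE_SITE` are hypothetical and stand in for a real downloader so the loop can run without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class UrlManager:
    """URL manager: tracks URLs waiting to be crawled and URLs already seen."""

    def __init__(self):
        self.pending = []
        self.seen = set()

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def has_next(self):
        return bool(self.pending)

    def next(self):
        return self.pending.pop(0)


class LinkAndTitleParser(HTMLParser):
    """Parser: collects href values of <a> tags and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(page_url, html):
    """Return (absolute links, page title) for one downloaded page."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    links = [urljoin(page_url, href) for href in parser.links]
    return links, parser.title


def crawl(start_url, download, max_pages=10):
    """Scheduler: URL manager -> downloader -> parser -> application data."""
    urls = UrlManager()
    urls.add(start_url)
    collected = {}  # the "application" part: url -> page title
    while urls.has_next() and len(collected) < max_pages:
        url = urls.next()
        html = download(url)  # downloader; returns None when the page is missing
        if html is None:
            continue
        links, title = parse(url, html)
        collected[url] = title
        for link in links:
            urls.add(link)
    return collected


# Hypothetical in-memory "site" used in place of a real downloader.
FAKE_SITE = {
    "http://example.com/": '<html><head><title>Home</title></head>'
                           '<body><a href="/a">A</a> <a href="/b">B</a></body></html>',
    "http://example.com/a": '<title>Page A</title><a href="/">home</a>',
    "http://example.com/b": '<title>Page B</title><a href="/a">A</a>',
}

result = crawl("http://example.com/", FAKE_SITE.get)
print(result)
```

In a real crawler the `download` argument would be a function that fetches the page over HTTP, such as one built on the download methods in the next section.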

III. Three Ways to Download Web Pages with urllib2

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import cookielib
import urllib2

url = "http://www.baidu.com"
response1 = urllib2.urlopen(url)
print "First method"
# Get the status code, 200 indicates success
print response1.getcode()
# Get the length of the web page content
print len(response1.read())

print "Second method"
request = urllib2.Request(url)
# Disguise the crawler as a Mozilla browser by setting the User-Agent header
request.add_header("user-agent","Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print "Third method"
cookie = cookielib.CookieJar()
# Give urllib2 the ability to handle cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print len(response3.read())
print cookie
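The listing above targets Python 2. In Python 3, `urllib2` was merged into `urllib.request` and `cookielib` was renamed `http.cookiejar`, so the same three approaches might look like the sketch below. The helper names `make_request`, `make_cookie_opener`, and `fetch` are hypothetical, and the 10-second timeout is an assumption that is not in the original.

```python
import http.cookiejar
import urllib.request

URL = "http://www.baidu.com"


def make_request(url):
    """Second method: a Request carrying a browser-like User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", "Mozilla/5.0")
    return request


def make_cookie_opener():
    """Third method: an opener that stores cookies in a CookieJar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(target, opener=None):
    """Open a URL or Request and return (status code, body length in bytes)."""
    open_fn = opener.open if opener is not None else urllib.request.urlopen
    with open_fn(target, timeout=10) as response:
        return response.status, len(response.read())
```

With network access, `fetch(URL)` corresponds to the first method, `fetch(make_request(URL))` to the second, and `fetch(URL, opener)` with the opener returned by `make_cookie_opener()` to the third; the returned jar can be printed afterwards to inspect the cookies.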

IV. Installation of the Third-Party Library Beautiful Soup

Beautiful Soup: a third-party Python library for extracting data from XML and HTML. Official website: https://www.crummy.com/software/BeautifulSoup/

1. Install Beautiful Soup

Open cmd (Command Prompt) and go to the Scripts directory of your Python installation (Python 2.7 here). Type dir to check whether pip.exe is present. If it is, install Beautiful Soup with Python's bundled pip command:

pip install beautifulsoup4

2. Test if the installation was successful

Create a Python file containing:

import bs4
print bs4

Run the file. If it prints the module information without an error, the installation was successful.
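On Python 3 the `print bs4` statement above is a syntax error, so the check has to be written differently. One possible version, which also reports a missing package instead of raising an exception, is sketched below; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    import bs4
    print("bs4 found, version", bs4.__version__)
else:
    print("bs4 not found; run: pip install beautifulsoup4")
```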

V. Parsing HTML Files with Beautiful Soup

```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup parsing object.
# The source is cut off after `html_doc`; the "html.parser" argument is an assumption.
soup = BeautifulSoup(html_doc, "html.parser")
```
