Introduction to Python Web Crawling
I. What is a Web Crawler?
Web Crawler: a program that automatically retrieves information from the internet, capturing data that is valuable to us.
II. Python Web Crawler Architecture
The architecture of a Python web crawler mainly consists of five parts: the scheduler, the URL manager, the web page downloader, the web page parser, and the application (the valuable data that is ultimately extracted).
Scheduler: It is like the CPU of a computer, mainly responsible for coordinating the work among the URL manager, downloader, and parser.
URL Manager: Keeps track of the URLs to be crawled and the URLs already crawled, preventing duplicate and circular crawling. There are three main ways to implement a URL manager: in memory, in a database, or in a caching database (a minimal in-memory sketch follows this list of components).
Web Page Downloader: Given a URL, downloads the web page and converts it into a string. Downloaders include urllib2 (Python 2's official basic module, which supports pages that require logins, proxies, and cookies) and requests (a third-party package).
Web Page Parser: Parses the web page string to extract the information we need, either by matching on the raw string or by building a DOM tree. There are four common options:
- Regular expressions: intuitive; the page is treated as a string and valuable information is extracted by fuzzy matching, which becomes very difficult when the document is complex.
- html.parser: Python's built-in parser.
- BeautifulSoup: a third-party library that can use either the built-in html.parser or lxml underneath; relatively more powerful than the others.
- lxml: a third-party library that can parse both XML and HTML.
html.parser, BeautifulSoup, and lxml all parse the page as a DOM tree, while regular expressions work directly on the string.
Application: The application built from the valuable data extracted from the web pages.
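As promised above, here is a minimal sketch of the in-memory approach to the URL manager: two sets are enough to prevent duplicate and circular crawling. The class and method names are illustrative, not from any particular library.

```python
# A minimal in-memory URL manager (illustrative sketch; names are made up).
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Ignore URLs we have already seen in either set
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move a URL from the "to crawl" set to the "crawled" set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```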
The scheduler coordinates the components in a loop: it takes a URL from the URL manager, hands it to the downloader, passes the downloaded page to the parser, feeds any newly discovered URLs back into the URL manager, and delivers the extracted data to the application.
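A minimal sketch of that loop, reusing the UrlManager above; download() and parse() are hypothetical stand-ins for the downloader and parser components:

```python
# Sketch of the scheduler loop. download() and parse() are hypothetical
# stand-ins for the downloader and parser described above.
def crawl(root_url):
    urls = UrlManager()
    urls.add_new_url(root_url)
    data = []  # the "application": the valuable data we collect
    while urls.has_new_url():
        url = urls.get_new_url()
        html = download(url)                    # downloader: URL -> page string
        new_urls, page_data = parse(url, html)  # parser: extract links and data
        for new_url in new_urls:
            urls.add_new_url(new_url)
        data.append(page_data)
    return data
```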
III. Three Ways to Download Web Pages with urllib2
```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import cookielib
import urllib2

url = "http://www.baidu.com"

# First method: fetch the URL directly
response1 = urllib2.urlopen(url)
print "First method"
# Get the status code; 200 indicates success
print response1.getcode()
# Get the length of the web page content
print len(response1.read())

print "Second method"
# Second method: build a Request object so headers can be set
request = urllib2.Request(url)
# Simulate the Mozilla browser by setting the User-Agent header
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print "Third method"
# Third method: give urllib2 the ability to handle cookies
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print len(response3.read())
# Print the cookies received from the server
print cookie
```
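Note that urllib2 and cookielib exist only in Python 2. On Python 3, the same functionality lives in urllib.request and http.cookiejar; for reference, a rough equivalent of the three methods (the module split is the only substantive change):

```python
#!/usr/bin/python3
# Python 3 equivalent: urllib2 was folded into urllib.request,
# and cookielib was renamed http.cookiejar.
import http.cookiejar
import urllib.request

url = "http://www.baidu.com"

# First method: fetch the URL directly
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

# Second method: a Request object with a User-Agent header
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

# Third method: an opener that handles cookies
cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(len(response3.read()))
print(cookie)
```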
IV. Installation of the Third-Party Library Beautiful Soup
Beautiful Soup: a third-party Python library used to extract data from XML and HTML documents. Official website: https://www.crummy.com/software/BeautifulSoup/
1. Install Beautiful Soup
Open cmd (Command Prompt), navigate to the Scripts directory of your Python installation (Python 2.7 in this tutorial), and type dir to check whether pip.exe is present. If it is, you can use Python's bundled pip command to install the package:
```
pip install beautifulsoup4
```
2. Test if the installation was successful
Write a Python file and input:
```python
import bs4
print bs4
```
Run the file; if it prints the module information without errors, the installation was successful.
V. Parsing HTML Files with Beautiful Soup
```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup parsing object
# (the parser argument is an assumption; the original line was cut off.
# The built-in 'html.parser' works here; 'lxml' is another option.)
soup = BeautifulSoup(html_doc, 'html.parser')
```
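With the soup object in hand, we can query the DOM tree. As a short sketch of what the parsing object can do (and of why re was imported), here are a few typical BeautifulSoup lookups; find_all, find, and get_text are real BeautifulSoup 4 methods, while the specific queries are illustrative examples:

```python
# Find all <a> tags (the three sister links)
for link in soup.find_all('a'):
    print link['href'], link.get_text()

# Find a single node by its id attribute
print soup.find(id='link1')

# Combine with a regular expression: links whose href contains "lacie"
print soup.find_all('a', href=re.compile(r"lacie"))
```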