Scrapy Beginner's Tutorial
Category Programming Techniques
Scrapy is an application framework written in Python for crawling websites and extracting structured data.
It is commonly used in programs for data mining, information processing, and archiving historical data.
With the Scrapy framework, we can easily implement a crawler that captures the content or images of a specified website.
Scrapy Architecture Diagram (Green lines indicate the direction of data flow)
- Scrapy Engine: responsible for communication among the Spider, Item Pipeline, Downloader, and Scheduler, including signal and data transfer.
- Scheduler: receives the Requests sent by the Engine, arranges and enqueues them in a certain way, and returns them to the Engine when needed.
- Downloader: downloads all Requests sent by the Scrapy Engine and returns the obtained Responses to the Engine, which hands them over to the Spider for processing.
- Spider: processes all Responses, analyzes and extracts data from them to fill the Item fields, and submits the URLs that need to be followed back to the Engine, where they re-enter the Scheduler.
- Item Pipeline: processes the Items obtained by the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).
- Downloader Middlewares: a component you can customize to extend the download functionality (see the sketch after this list).
- Spider Middlewares: a component you can customize to extend and operate on the communication between the Engine and the Spider (for example, Responses entering the Spider and Requests going out from the Spider).
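As an illustration of the kind of customization downloader middleware allows, here is a minimal sketch of a middleware that attaches a fixed User-Agent header to every outgoing request. The class name, header value, and settings entry are made up for the example and are not part of the original tutorial:

```python
# A minimal downloader middleware sketch (illustrative; class name and header value are made up).
# To activate it, it would be registered in settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {"mySpider.middlewares.CustomUserAgentMiddleware": 543}

class CustomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Called for every request before it reaches the Downloader.
        request.headers["User-Agent"] = "Mozilla/5.0 (compatible; MyScrapyBot/1.0)"
        # Returning None tells Scrapy to continue handling the request as usual.
        return None
```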
Scrapy's Operation Process
Once the code is written, the program starts running...
1. Engine: Hi! Spider, which website do you want to process?
2. Spider: The boss wants me to process xxxx.com.
3. Engine: Give me the first URL that needs to be processed.
4. Spider: Here you are, the first URL is xxxxxx.com.
5. Engine: Hi! Scheduler, I have a request here; please help me sort it and put it in the queue.
6. Scheduler: Okay, I'm processing it, wait a moment.
7. Engine: Hi! Scheduler, give me the request you have processed.
8. Scheduler: Here you are, this is the request I have processed.
9. Engine: Hi! Downloader, please download this request according to the boss's downloader middleware settings.
10. Downloader: Okay! Here you are, this is the downloaded content. (If it failed: Sorry, this request failed to download. The Engine then tells the Scheduler: this request failed to download, please record it, and we will download it again later.)
11. Engine: Hi! Spider, this is the downloaded content; it has already been processed according to the boss's downloader middleware. Please handle it yourself. (Note: here the Responses are handed to the def parse() function by default.)
12. Spider: (after processing the data, for URLs that need to be followed) Hi! Engine, I have two results here: this is the URL I need to follow, and this is the Item data I obtained.
13. Engine: Hi! Pipeline, I have an Item here; please help me process it! Scheduler! This is a URL that needs to be followed; please help me process it. Then the cycle starts again from step 4 until all the information the boss needs has been obtained.
14. Pipeline & Scheduler: Okay, we'll do it right away!
Note: the whole program stops only when there are no requests left in the Scheduler (which also means that URLs which failed to download will be downloaded again by Scrapy).
Making a Scrapy Spider requires 4 steps:
1. Create a new project (scrapy startproject xxx): create a new spider project.
2. Define the target (write items.py): define the targets you want to capture (a sketch of items.py follows this list).
3. Make a spider (spiders/xxspider.py): write a spider and start crawling web pages.
4. Store the content (pipelines.py): design a pipeline to store the crawled content.
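As an illustration of step 2, here is a minimal sketch of what mySpider/items.py might look like for the teacher example used later in this tutorial. The exact file contents are an assumption; only the field names (name, title, info) are taken from the parse() code further below:

```python
# mySpider/items.py -- an assumed sketch matching the name/title/info fields used later in this tutorial.
import scrapy

class ItcastItem(scrapy.Item):
    # Each field the spider will fill in is declared with scrapy.Field()
    name = scrapy.Field()   # teacher's name
    title = scrapy.Field()  # teacher's title
    info = scrapy.Field()   # teacher's introduction
```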
Installation
Windows Installation Method
Upgrade the pip version:
pip install --upgrade pip
Install the Scrapy framework through pip:
pip install Scrapy
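To confirm the installation succeeded, you can check that Scrapy is importable from Python; this quick sanity check is not part of the original tutorial:

```python
# Quick sanity check that Scrapy was installed and can be imported.
import scrapy
print(scrapy.__version__)
```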
Ubuntu Installation Method
Install non-Python dependencies:
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
<div class="li_txt">
    <h3> xxx </h3>
    <h4> xxxxx </h4>
    <p> xxxxxxxx </p>
</div>
Isn't it clear at a glance? Let's start extracting data directly with XPath.
With the xpath method, we only need to pass in an XPath expression to locate the corresponding HTML tag nodes. For details, refer to an XPath tutorial.
It doesn't matter if you don't know XPath syntax: Chrome provides a one-click way to get the XPath of an element (Right click -> Inspect -> Copy -> Copy XPath).
Here are some examples of XPath expressions and their corresponding meanings:
- /html/head/title: selects the <title> element inside the <head> tag of the HTML document
- /html/head/title/text(): selects the text of the <title> element above
- //td: selects all <td> elements
- //div[@class="mine"]: selects all <div> elements that have a class="mine" attribute
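If you want to experiment with such expressions before writing a spider, Scrapy's Selector can be applied directly to a piece of HTML. The snippet below is a small illustration; the HTML string is made up for the example:

```python
# Trying XPath expressions on a made-up HTML string with Scrapy's Selector.
from scrapy.selector import Selector

html = "<html><head><title>Example title</title></head><body><div class='mine'>hello</div></body></html>"
sel = Selector(text=html)

print(sel.xpath("/html/head/title/text()").extract_first())  # Example title
print(sel.xpath("//div[@class='mine']/text()").extract())    # ['hello']
```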
For example, let's read the website title of http://www.itcast.cn/ and modify the code in the itcast.py file as follows:
# -*- coding: utf-8 -*-
import scrapy

# The following three lines solve the garbled-output problem under Python 2.x and can be removed under Python 3.x
import sys
reload(sys)
sys.setdefaultencoding("utf-8")


class Opp2Spider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/']

    def parse(self, response):
        # Get the website title node
        context = response.xpath('/html/head/title/text()')
        # Extract the title text
        title = context.extract_first()
        print(title)
Execute the following command:
$ scrapy crawl itcast
...
...
ITcast official website - Good reputation IT training institution, the same education, different quality
...
...
We previously defined an ItcastItem class in mySpider/items.py. Import it here:
from mySpider.items import ItcastItem
Then encapsulate the data we obtained into an ItcastItem object, which can hold the attributes of each teacher:
from mySpider.items import ItcastItem
def parse(self, response):
    # open("teacher.html", "wb").write(response.body).close()

    # A collection of teacher information
    items = []

    for each in response.xpath("//div[@class='li_txt']"):
        # Encapsulate the data we obtained into an `ItcastItem` object
        item = ItcastItem()

        # The extract() method returns a list of unicode strings
        name = each.xpath("h3/text()").extract()
        title = each.xpath("h4/text()").extract()
        info = each.xpath("p/text()").extract()

        # XPath returns a list containing one element
        item['name'] = name[0]
        item['title'] = title[0]
        item['info'] = info[0]

        items.append(item)

    # Directly return the final data
    return items
We will not deal with the pipeline for the time being; it will be introduced in detail later.
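For readers who want a preview of step 4 anyway, a pipelines.py skeleton typically looks roughly like the sketch below. The class name and output file are assumptions made for illustration, and the pipeline would also have to be enabled in settings.py via ITEM_PIPELINES:

```python
# mySpider/pipelines.py -- an illustrative sketch only (class name and output file are assumptions).
# It would be enabled in settings.py, e.g.:
# ITEM_PIPELINES = {"mySpider.pipelines.ItcastJsonPipeline": 300}
import json

class ItcastJsonPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open("teachers_pipeline.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every Item the spider produces; write it out as one JSON line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```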
Save Data
Scrapy mainly provides four simple ways to save the information, using -o to output in a specified format. The commands are as follows:
json format (default Unicode encoding):
scrapy crawl itcast -o teachers.json
json lines format (default Unicode encoding):
scrapy crawl itcast -o teachers.jsonl
csv comma-separated format (can be opened with Excel):
scrapy crawl itcast -o teachers.csv
xml format:
scrapy crawl itcast -o teachers.xml
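As a quick check that the export worked, the JSON file can be read back with the standard library. This is a small illustration that assumes the crawl above produced teachers.json with the name/title/info fields:

```python
# Read the exported file back to verify the crawl
# (assumes the "scrapy crawl itcast -o teachers.json" run above succeeded).
import json

with open("teachers.json", encoding="utf-8") as f:
    teachers = json.load(f)

print(len(teachers))
print(teachers[0]["name"], teachers[0]["title"])
```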
Thinking
If the code is changed to the following form, the result is exactly the same.
Please think about the role of yield here (A Brief Analysis of Python Yield Usage):
# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import ItcastItem

# The following three lines solve the garbled-output problem under Python 2.x and can be removed under Python 3.x
import sys
reload(sys)
sys.setdefaultencoding("utf-8")


class Opp2Spider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ("http://www.itcast.cn/channel/teacher.shtml",)

    def parse(self, response):
        # open("teacher.html", "wb").write(response.body).close()

        for each in response.xpath("//div[@class='li_txt']"):
            # Encapsulate the data we obtained into an `ItcastItem` object
            item = ItcastItem()

            # The extract() method returns a list of unicode strings
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()

            # XPath returns a list containing one element
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            # Hand each item over to the engine one by one instead of collecting them in a list
            yield item
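If the behavior of yield is unfamiliar, the stand-alone snippet below (unrelated to Scrapy itself) illustrates the difference between returning a full list and yielding values one at a time, which is exactly what parse() does above:

```python
# A stand-alone illustration of yield (unrelated to Scrapy itself).
def build_list():
    # Builds the whole list in memory first, then returns it all at once.
    return [n * n for n in range(3)]

def generate():
    # Yields one value at a time; the caller pulls values lazily as it iterates.
    for n in range(3):
        yield n * n

print(build_list())      # [0, 1, 4]
print(list(generate()))  # [0, 1, 4] -- same values, produced one at a time
```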
Reader Notes
I found a solution on Stack Overflow and, after two steps, successfully resolved the error. However, I suspect the first step may be redundant and you can simply perform the second step directly:
- Download the corresponding `.whl` file from [http://www.lfd.uci.edu/~gohlke/pythonlibs/#pywin32](http://www.lfd.uci.edu/~gohlke/pythonlibs/#pywin32) and install it. If you only perform this step, you will encounter a similar new error: "ImportError: DLL load failed: The specified module could not be found."
- Download the appropriate version of pywin32 from [http://sourceforge.net/projects/pywin32/files/pywin32/](http://sourceforge.net/projects/pywin32/files/pywin32/). For example, I downloaded "pywin32-221.win-amd64-py3.6.exe". (A small side note, the installation of "pywin32-220.win-amd64-py3.6.exe" failed)
— Sunny Day, [Reference Address](https://blog.csdn.net/qq_38019321/article/details/77374171)
-
Regarding the issue of saving captured files with Python3:
The code is as follows:
```python
filename = "teacher.html"
with open(filename, 'w', encoding='utf-8') as f:
    f.write(response.body.decode())
```
Here are a few points to note:
- The `open` call must include `encoding='utf-8'`, otherwise an error will occur when writing with `f.write`.
- `response.body` returns bytes, which need to be decoded into a string.
— tOmMy