Python urllib
The Python urllib library is used to work with URLs and to fetch content from web pages.
This article covers the urllib package as it ships with Python 3.
The urllib package includes the following modules:
- urllib.request - Opens and reads URLs.
- urllib.error - Contains the exceptions thrown by urllib.request.
- urllib.parse - Parses URLs.
- urllib.robotparser - Parses robots.txt files.
urllib.request
urllib.request defines functions and classes for opening URLs, including authentication, redirects, and browser cookies.
urllib.request can simulate the process of a browser making a request.
We can use the urlopen method of urllib.request to open a URL, with the following syntax:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- url: URL address.
- data: Additional data object to be sent to the server, default is None.
- timeout: Sets the access timeout duration.
- cafile and capath: cafile is a file containing CA certificates, capath is a directory of CA certificate files; both are used for HTTPS requests.
- cadefault: Deprecated.
- context: ssl.SSLContext type, used to specify SSL settings.
Here is an example:
Example
from urllib.request import urlopen
myURL = urlopen("https://www.tutorialpro.org/")
print(myURL.read())
The above code uses urlopen to open a URL and then uses the read() function to fetch the HTML source of the page.
read() fetches the entire page content; we can also specify how much to read:
Example
from urllib.request import urlopen
myURL = urlopen("https://www.tutorialpro.org/")
print(myURL.read(300))
In addition to the read() function, there are two other functions for reading web content:
readline(): Reads a single line of the file.
from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
print(myURL.readline())  # Reads one line of content
readlines(): Reads all the content of the file and assigns it to a list variable.
from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
lines = myURL.readlines()
for line in lines:
    print(line)
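Note that read(), readline(), and readlines() all return bytes objects rather than str. As a minimal sketch (assuming the page is UTF-8 encoded, which is common but not guaranteed), the content can be decoded to text like this:
Example
from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
raw_bytes = myURL.read()           # bytes object
text = raw_bytes.decode("utf-8")   # decode to str, assuming the page is UTF-8 encoded
print(text[:100])                  # print the first 100 characters of the page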
When scraping web pages, it is often necessary to check if the web page is accessible. We can use the getcode() function to get the status code of the web page. A return of 200 indicates the page is normal, while a return of 404 indicates the page does not exist:
Example
import urllib.request
import urllib.error

myURL1 = urllib.request.urlopen("https://www.tutorialpro.org/")
print(myURL1.getcode())   # 200

try:
    myURL2 = urllib.request.urlopen("https://www.tutorialpro.org/no.html")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(404)   # 404
For more web status codes, refer to: https://www.tutorialpro.org/http/http-status-codes.html.
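Besides HTTPError, urlopen can also raise urllib.error.URLError (for example when the host cannot be reached) and can time out when the timeout parameter described above is exceeded. Here is a hedged sketch of handling both cases; the 5-second timeout value is only an illustration:
Example
import socket
import urllib.request
import urllib.error

try:
    # the 5-second timeout is only an illustration
    response = urllib.request.urlopen("https://www.tutorialpro.org/", timeout=5)
    print(response.getcode())   # 200 if the page is reachable
    data = response.read()      # reading the body may also time out
except urllib.error.HTTPError as e:
    print("HTTP error:", e.code)    # the server returned an error status
except urllib.error.URLError as e:
    print("URL error:", e.reason)   # network problems, DNS failures, or a connect timeout
except socket.timeout:
    print("request timed out")      # a timeout while reading the response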
To save the scraped web page locally, you can use Python's file write() method:
Example
from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
content = myURL.read()   # Read the webpage content

with open("tutorialpro_urllib_test.html", "wb") as f:
    f.write(content)     # Write the content to a local file
Executing the above code will generate a tutorialpro_urllib_test.html file locally, which contains the content of the https://www.tutorialpro.org/ webpage.
For more information on Python file handling, you can refer to: https://www.tutorialpro.org/python3/python-file-methods.html
URL encoding and decoding can be done with the quote() and unquote() functions, which are defined in urllib.parse and also re-exported by urllib.request:
Example
import urllib.request
encode_url = urllib.request.quote("https://www.tutorialpro.org/") # Encoding
print(encode_url)
unencode_url = urllib.request.unquote(encode_url) # Decoding
print(unencode_url)
The output will be:
https%3A//www.tutorialpro.org/
https://www.tutorialpro.org/
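The same functions are more commonly imported from urllib.parse. Here is a small sketch of encoding a non-ASCII keyword and building a full query string; the keyword value is only an illustration:
Example
from urllib.parse import quote, unquote, urlencode

keyword = "Python 教程"            # non-ASCII keyword, used only as an illustration
print(quote(keyword))              # Python%20%E6%95%99%E7%A8%8B
print(unquote(quote(keyword)))     # Python 教程

# urlencode builds a complete query string from a dictionary
print(urlencode({"s": keyword}))   # s=Python+%E6%95%99%E7%A8%8B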
Simulating Header Information
When scraping a webpage, it is usually necessary to send browser-like request headers. This is done with the urllib.request.Request class:
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
- url: The URL address.
- data: Additional data object to be sent to the server, default is None.
- headers: HTTP request headers in dictionary format.
- origin_req_host: The host of the original request, as an IP address or domain name.
- unverifiable: Whether the request is unverifiable (rarely used); default is False.
- method: The request method, such as GET, POST, DELETE, PUT, etc.
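The method parameter is not exercised by the examples below, so here is a minimal hedged sketch of issuing a HEAD request with a custom User-Agent, assuming the server accepts HEAD requests; the header value is only an illustration:
Example
import urllib.request

url = 'https://www.tutorialpro.org/'
request = urllib.request.Request(
    url,
    headers={'User-Agent': 'Mozilla/5.0'},  # illustrative User-Agent value
    method='HEAD',                          # ask the server for headers only, no body
)
response = urllib.request.urlopen(request)
print(response.status)        # e.g. 200
print(len(response.read()))   # 0 - a HEAD response carries no body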
Example - py3_urllib_test.py file code
import urllib.request
import urllib.parse
url = 'https://www.tutorialpro.org/?s=' # tutorialpro.org search page
keyword = 'Python 教程'  # search keyword; contains non-ASCII characters (Chinese for "tutorial")
key_code = urllib.request.quote(keyword)  # Percent-encode the keyword
url_all = url + key_code
header = {
'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
} # Header information
request = urllib.request.Request(url_all, headers=header)
response = urllib.request.urlopen(request).read()
fh = open("./urllib_test_tutorialpro_search.html", "wb") # Writing the file to the current directory
fh.write(response)
fh.close()
Executing the above Python code will generate a urllib_test_tutorialpro_search.html file in the current directory. Opening it in a browser shows the tutorialpro.org search results page for the keyword.
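In the other direction, the response object returned by urlopen also exposes the headers the server sent back. A small self-contained sketch (the User-Agent value is again only an illustration):
Example
import urllib.request

request = urllib.request.Request(
    'https://www.tutorialpro.org/',
    headers={'User-Agent': 'Mozilla/5.0'},   # illustrative User-Agent value
)
response = urllib.request.urlopen(request)

print(response.getheader('Content-Type'))    # one specific response header
for name, value in response.getheaders():    # all response headers as (name, value) pairs
    print(name + ':', value)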
To send form data via POST, we first create a form. The code below uses a PHP script to read the submitted form data (any server-side language could be used):
Example - py3_urllib_test.php file code:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>tutorialpro.org(tutorialpro.org) urllib POST Test</title>
</head>
<body>
<form action="" method="post" name="myForm">
Name: <input type="text" name="name"><br>
Tag: <input type="text" name="tag"><br>
<input type="submit" value="Submit">
</form>
<hr>
<?php
// Using PHP to fetch form submission data, you can replace it with others
if(isset($_POST['name']) && isset($_POST['tag'])) {
echo $_POST["name"] . ', ' . $_POST['tag'];
}
?>
</body>
</html>
Example
import urllib.request
import urllib.parse
url = 'https://www.tutorialpro.org/try/py3/py3_urllib_test.php' # Submit to the form page
data = {'name':'tutorialpro', 'tag' : 'tutorialpro.org'} # Data to be submitted
header = {
'User-Agent':'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
} # Header information
data = urllib.parse.urlencode(data).encode('utf8')  # URL-encode the parameters and convert to bytes
request = urllib.request.Request(url, data, header)  # Build the request
response = urllib.request.urlopen(request).read()  # Read the result
fh = open("./urllib_test_post_tutorialpro.html","wb") # Write the file to the current directory
fh.write(response)
fh.close()
Executing the above code will submit form data to the py3_urllib_test.php file and write the output to the urllib_test_post_tutorialpro.html file.
Opening the urllib_test_post_tutorialpro.html file in a browser shows the form together with the data echoed back by the PHP script, i.e. tutorialpro, tutorialpro.org.
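Note that urllib.parse has no urldecode function; to reverse urlencode, parse_qs() or parse_qsl() can be used to turn an encoded query string back into Python data. A small sketch:
Example
from urllib.parse import urlencode, parse_qs, parse_qsl

data = {'name': 'tutorialpro', 'tag': 'tutorialpro.org'}
encoded = urlencode(data)
print(encoded)             # name=tutorialpro&tag=tutorialpro.org

print(parse_qs(encoded))   # {'name': ['tutorialpro'], 'tag': ['tutorialpro.org']}
print(parse_qsl(encoded))  # [('name', 'tutorialpro'), ('tag', 'tutorialpro.org')]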
urllib.parse
urllib.parse is used to parse URLs, with the following syntax:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
If the allow_fragments parameter is set to False, fragment identifiers are not recognized; they are instead parsed as part of the path, parameters, or query component, and fragment is set to an empty string in the return value (see the sketch after the attribute table below).
Example
from urllib.parse import urlparse
o = urlparse("https://www.tutorialpro.org/?s=python+%E6%95%99%E7%A8%8B")
print(o)
The output of the above example is:
ParseResult(scheme='https', netloc='www.tutorialpro.org', path='/', params='', query='s=python+%E6%95%99%E7%A8%8B', fragment='')
From the result, it can be seen that urlparse returns a ParseResult, a named tuple with 6 components: scheme, netloc, path, params, query, and fragment.
We can directly read the scheme content:
Example
from urllib.parse import urlparse
o = urlparse("https://www.tutorialpro.org/?s=python+%E6%95%99%E7%A8%8B")
print(o.scheme)
The output of the above example is:
https
The complete content is as follows:
Attribute | Index | Value | Value (if not present)
---|---|---|---
scheme | 0 | URL scheme | scheme parameter
netloc | 1 | Network location part | empty string
path | 2 | Hierarchical path | empty string
params | 3 | Parameters for the last path element | empty string
query | 4 | Query component | empty string
fragment | 5 | Fragment identifier | empty string
username | | Username | None
password | | Password | None
hostname | | Hostname (lowercase) | None
port | | Port number as an integer (if present) | None
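To illustrate the allow_fragments parameter described earlier, here is a small sketch; the URL and fragment are made up for illustration:
Example
from urllib.parse import urlparse

url = "https://www.tutorialpro.org/index.html#section1"  # illustrative URL with a fragment

print(urlparse(url).fragment)                         # 'section1'
print(urlparse(url, allow_fragments=False).fragment)  # '' - the fragment is not recognized
print(urlparse(url, allow_fragments=False).path)      # '/index.html#section1'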
urllib.robotparser
urllib.robotparser is used to parse robots.txt files.
robots.txt (all lowercase) is a file located in the root directory of a website that is typically used to inform search engines about the crawling rules for the site.
urllib.robotparser provides the RobotFileParser class, with the following syntax:
class urllib.robotparser.RobotFileParser(url='')
This class offers the following methods to read and parse robots.txt files:
- set_url(url) - Sets the URL pointing to the robots.txt file.
- read() - Reads the robots.txt URL and feeds it into the parser.
- parse(lines) - Parses the lines argument.
- can_fetch(useragent, url) - Returns True if the useragent is allowed to fetch the url according to the rules parsed from the robots.txt file.
- mtime() - Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.
- modified() - Sets the time the robots.txt file was last fetched to the current time.
- crawl_delay(useragent) - Returns the Crawl-delay parameter from robots.txt for the specified useragent. Returns None if this parameter does not exist, is not applicable to the specified useragent, or the robots.txt entry has a syntax error.
- request_rate(useragent) - Returns the Request-rate parameter from robots.txt as the named tuple RequestRate(requests, seconds). Returns None if this parameter does not exist, is not applicable to the specified useragent, or the robots.txt entry has a syntax error.
- site_maps() - Returns the Sitemap parameter from robots.txt as a list(). Returns None if this parameter does not exist or the robots.txt entry has a syntax error.
Example
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
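As a closing sketch, a crawler can combine RobotFileParser with urlopen so that a page is fetched only when robots.txt allows it. The robots.txt URL and its rules are assumptions here; the actual behavior depends on the site:
Example
import urllib.request
import urllib.robotparser

user_agent = "MyCrawler/1.0"    # illustrative user agent
page_url = "https://www.tutorialpro.org/"

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.tutorialpro.org/robots.txt")  # assumed location of the robots.txt file
rp.read()

if rp.can_fetch(user_agent, page_url):
    req = urllib.request.Request(page_url, headers={"User-Agent": user_agent})
    print(urllib.request.urlopen(req).read(100))       # fetch the first 100 bytes
else:
    print("Fetching", page_url, "is disallowed by robots.txt")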