
Python urllib

The Python urllib library is used to manipulate web URLs and scrape content from web pages.

This article covers the Python 3 version of urllib.

The urllib package includes the following modules:

- urllib.request - opens and reads URLs
- urllib.error - contains the exceptions raised by urllib.request
- urllib.parse - parses URLs
- urllib.robotparser - parses robots.txt files

urllib.request

urllib.request defines the functions and classes used to open URLs, with support for authentication, redirects, cookies, and more.

urllib.request can simulate the process of a browser making a request.

We can use the urlopen method of urllib.request to open a URL, with the following syntax:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Here is an example:

Example

from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
print(myURL.read())

The above code uses urlopen to open a URL and then calls read() to fetch the entire HTML source of the page (as bytes).

By default read() fetches the whole page; we can also specify how many bytes to read:

Example

from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
print(myURL.read(300))

In addition to read(), there are two other functions for reading web content, demonstrated in the sketch below:

- readline() - reads a single line of the content
- readlines() - reads all of the content and returns it as a list of lines
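
For example, a minimal sketch using readlines():

Example

from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
lines = myURL.readlines()   # read everything and split it into a list of byte strings
for line in lines[:5]:      # print only the first five lines
    print(line)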

When scraping web pages, it is often necessary to check if the web page is accessible. We can use the getcode() function to get the status code of the web page. A return of 200 indicates the page is normal, while a return of 404 indicates the page does not exist:

Example

import urllib.request
import urllib.error

myURL1 = urllib.request.urlopen("https://www.tutorialpro.org/")
print(myURL1.getcode())   # 200

try:
    myURL2 = urllib.request.urlopen("https://www.tutorialpro.org/no.html")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(404)   # 404

For more web status codes, refer to: https://www.tutorialpro.org/http/http-status-codes.html.
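
Network-level failures (DNS errors, refused connections) raise urllib.error.URLError rather than HTTPError; since HTTPError is a subclass of URLError, catch it first. A minimal sketch, using a deliberately unreachable hostname purely for illustration:

Example

import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("https://invalid.example/", timeout=5)
except urllib.error.HTTPError as e:
    print("HTTP error:", e.code)       # the server responded with an error status
except urllib.error.URLError as e:
    print("Network error:", e.reason)  # DNS failure, refused connection, timeout, etc.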

To save the scraped web page locally, you can use the Python 3 file object's write() method:

Example

from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
with open("tutorialpro_urllib_test.html", "wb") as f:
    f.write(myURL.read())  # read the page content and write it to the file

Executing the above code will generate a tutorialpro_urllib_test.html file locally, which contains the content of the https://www.tutorialpro.org/ webpage.

For more information on Python file handling, you can refer to: https://www.tutorialpro.org/python3/python-file-methods.html

URL encoding and decoding can be done with the quote() and unquote() functions from urllib.parse:

Example

import urllib.parse

encode_url = urllib.parse.quote("https://www.tutorialpro.org/")  # Encoding
print(encode_url)

unencode_url = urllib.parse.unquote(encode_url)  # Decoding
print(unencode_url)

The output will be:

https%3A//www.tutorialpro.org/
https://www.tutorialpro.org/
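
By default quote() treats "/" as safe, which is why the slashes above are left unencoded while the ":" becomes %3A. Passing safe='' encodes the slashes as well; a minimal sketch:

Example

from urllib.parse import quote, unquote

print(quote("https://www.tutorialpro.org/", safe=''))  # encode every reserved character
# https%3A%2F%2Fwww.tutorialpro.org%2F

print(unquote("https%3A%2F%2Fwww.tutorialpro.org%2F"))
# https://www.tutorialpro.org/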

Simulating Header Information

When scraping a web page, it is usually necessary to send browser-like request headers. This requires the urllib.request.Request class:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Example - py3_urllib_test.py file code

import urllib.request
import urllib.parse

url = 'https://www.tutorialpro.org/?s='  # tutorialpro.org search page
keyword = 'Python 教程'  # a non-ASCII search keyword ('Python tutorial' in Chinese)
key_code = urllib.parse.quote(keyword)  # percent-encode the keyword for the URL
url_all = url + key_code
header = {
    'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}  # Header information
request = urllib.request.Request(url_all, headers=header)
response = urllib.request.urlopen(request).read()

fh = open("./urllib_test_tutorialpro_search.html", "wb")  # Writing the file to the current directory
fh.write(response)
fh.close()

Executing the above Python code will generate a urllib_test_tutorialpro_search.html file in the current directory. Opening it in a browser shows the tutorialpro.org search results page for the keyword.

To send form data via a POST request, we first create a form. In the code below, PHP is used on the server side to retrieve the submitted form data:

Example - py3_urllib_test.php file code:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>tutorialpro.org urllib POST Test</title>
</head>
<body>
<form action="" method="post" name="myForm">
    Name: <input type="text" name="name"><br>
    Tag: <input type="text" name="tag"><br>
    <input type="submit" value="Submit">
</form>
<hr>
<?php
// Using PHP to fetch the submitted form data; you can replace this with another server-side language
if (isset($_POST['name']) && isset($_POST['tag'])) {
    echo $_POST['name'] . ', ' . $_POST['tag'];
}
?>
</body>
</html>

Example

import urllib.request
import urllib.parse

url = 'https://www.tutorialpro.org/try/py3/py3_urllib_test.php'  # Submit to the form page
data = {'name':'tutorialpro', 'tag' : 'tutorialpro.org'}   # Data to be submitted
header = {
    'User-Agent':'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}   # Header information
data = urllib.parse.urlencode(data).encode('utf8')  # Encode the parameters as application/x-www-form-urlencoded bytes
request = urllib.request.Request(url, data, header)   # Build the request
response = urllib.request.urlopen(request).read()     # Send it and read the result

fh = open("./urllib_test_post_tutorialpro.html","wb")    # Write the file to the current directory
fh.write(response)
fh.close()

Executing the above code will submit form data to the py3_urllib_test.php file and write the output to the urllib_test_post_tutorialpro.html file.
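
Note that the standard library has no urllib.parse.urldecode; the counterpart to urlencode() for query strings is urllib.parse.parse_qs(). A minimal round-trip sketch:

Example

from urllib.parse import urlencode, parse_qs

data = {'name': 'tutorialpro', 'tag': 'tutorialpro.org'}

encoded = urlencode(data)   # 'name=tutorialpro&tag=tutorialpro.org'
print(encoded)

decoded = parse_qs(encoded) # values come back as lists
print(decoded)              # {'name': ['tutorialpro'], 'tag': ['tutorialpro.org']}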

Open the urllib_test_post_tutorialpro.html file (it can be opened with a browser) to see the result of the form submission.

urllib.parse

urllib.parse is used to parse URLs, with the following syntax:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlstring is the URL to parse, scheme is a default protocol to assume when the URL does not specify one, and allow_fragments controls fragment handling. If allow_fragments is set to False, fragment identifiers are not recognized: they are parsed as part of the path, parameters, or query component, and fragment is set to an empty string in the return value.

Example

from urllib.parse import urlparse

o = urlparse("https://www.tutorialpro.org/?s=python+%E6%95%99%E7%A8%8B")
print(o)

The output of the above example is:

ParseResult(scheme='https', netloc='www.tutorialpro.org', path='/', params='', query='s=python+%E6%95%99%E7%A8%8B', fragment='')

From the result, it can be seen that the return value is a named tuple of 6 components: scheme, netloc, path, params, query, and fragment.
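
To see allow_fragments in action, compare parsing a URL with and without it (the #section1 URL below is a hypothetical example):

Example

from urllib.parse import urlparse

url = "https://www.tutorialpro.org/index.html#section1"

print(urlparse(url).fragment)                         # 'section1'
print(urlparse(url, allow_fragments=False).fragment)  # '' - '#section1' stays in the path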

We can directly read the scheme content:

Example

from urllib.parse import urlparse

o = urlparse("https://www.tutorialpro.org/?s=python+%E6%95%99%E7%A8%8B")
print(o.scheme)

The output of the above example is:

https

The complete set of attributes is as follows:

Attribute   Index   Value                                    Value if not present
scheme      0       URL scheme specifier                     scheme parameter
netloc      1       Network location part                    empty string
path        2       Hierarchical path                        empty string
params      3       Parameters for the last path element     empty string
query       4       Query component                          empty string
fragment    5       Fragment identifier                      empty string
username    -       User name                                None
password    -       Password                                 None
hostname    -       Host name (lowercase)                    None
port        -       Port number as an integer (if present)   None
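
The extra attributes can be read directly from the result; a minimal sketch with a hypothetical URL containing credentials and an explicit port:

Example

from urllib.parse import urlparse

o = urlparse("https://user:secret@www.tutorialpro.org:8080/path?s=python")
print(o.username)   # user
print(o.password)   # secret
print(o.hostname)   # www.tutorialpro.org (always lowercase)
print(o.port)       # 8080 (an int)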

urllib.robotparser

urllib.robotparser is used to parse robots.txt files.

robots.txt (all lowercase) is a file located in the root directory of a website that is typically used to inform search engines about the crawling rules for the site.

urllib.robotparser provides the RobotFileParser class, with the following syntax:

class urllib.robotparser.RobotFileParser(url='')

This class provides several methods to read and parse robots.txt files, including:

- set_url(url) - sets the URL of the robots.txt file
- read() - fetches the robots.txt from the URL and feeds it to the parser
- parse(lines) - parses the given list of lines
- can_fetch(useragent, url) - returns True if the given user agent is allowed to fetch the URL
- crawl_delay(useragent) - returns the Crawl-delay value for the given user agent
- request_rate(useragent) - returns the Request-rate as a named tuple with requests and seconds attributes
- mtime() - returns the time the robots.txt file was last fetched
- modified() - sets the last-fetched time to the current time

Example

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True