Python urllib
The Python urllib library is used to work with URLs and to fetch content from web pages.
This article covers the urllib package as it ships with Python 3.
The urllib package includes the following modules:
- urllib.request - Opens and reads URLs.
- urllib.error - Contains the exceptions thrown by urllib.request.
- urllib.parse - Parses URLs.
- urllib.robotparser - Parses robots.txt files.
urllib.request
urllib.request defines functions and classes for opening URLs, including authentication, redirects, and browser cookies.
urllib.request can simulate the process of a browser making a request.
We can use the urlopen method of urllib.request to open a URL, with the following syntax:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- url: URL address.
- data: Additional data object to be sent to the server, default is None.
- timeout: Sets the access timeout duration.
- cafile and capath: cafile is a file containing CA certificates, capath is a directory of CA certificate files; both are used for HTTPS requests.
- cadefault: Deprecated.
- context: ssl.SSLContext type, used to specify SSL settings.
Here is an example:
Example
from urllib.request import urlopen
myURL = urlopen("https://www.tutorialpro.org/")
print(myURL.read())
The above code uses urlopen to open a URL and then uses the read() function to fetch the HTML source of the page.
read() fetches the entire page content; we can also specify how much to read:
Example
from urllib.request import urlopen
myURL = urlopen("https://www.tutorialpro.org/")
print(myURL.read(300))
In addition to the read() function, there are two other functions for reading web content:
readline(): Reads a single line of the file.
from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
print(myURL.readline())  # Reads one line of content
readlines(): Reads all the content of the file and assigns it to a list variable.
from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
lines = myURL.readlines()
for line in lines:
    print(line)
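Note that read(), readline(), and readlines() all return bytes objects rather than str. As a minimal sketch (assuming the page is UTF-8 encoded, which is common but not guaranteed), the content can be decoded to text like this:
Example
from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
raw_bytes = myURL.read()           # bytes object
text = raw_bytes.decode("utf-8")   # decode to str, assuming the page is UTF-8 encoded
print(text[:100])                  # print the first 100 characters of the page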
When scraping web pages, it is often necessary to check if the web page is accessible. We can use the getcode() function to get the status code of the web page. A return of 200 indicates the page is normal, while a return of 404 indicates the page does not exist:
Example
import urllib.request
import urllib.error

myURL1 = urllib.request.urlopen("https://www.tutorialpro.org/")
print(myURL1.getcode())   # 200

try:
    myURL2 = urllib.request.urlopen("https://www.tutorialpro.org/no.html")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(404)   # 404
For more web status codes, refer to: https://www.tutorialpro.org/http/http-status-codes.html.
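Besides HTTPError, urlopen can also raise urllib.error.URLError (for example when the host cannot be reached) and can time out when the timeout parameter described above is exceeded. Here is a hedged sketch of handling both cases; the 5-second timeout value is only an illustration:
Example
import socket
import urllib.request
import urllib.error

try:
    # the 5-second timeout is only an illustration
    response = urllib.request.urlopen("https://www.tutorialpro.org/", timeout=5)
    print(response.getcode())   # 200 if the page is reachable
    data = response.read()      # reading the body may also time out
except urllib.error.HTTPError as e:
    print("HTTP error:", e.code)    # the server returned an error status
except urllib.error.URLError as e:
    print("URL error:", e.reason)   # network problems, DNS failures, or a connect timeout
except socket.timeout:
    print("request timed out")      # a timeout while reading the response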
To save the scraped web page locally, you can use Python's file write() method:
Example
from urllib.request import urlopen

myURL = urlopen("https://www.tutorialpro.org/")
content = myURL.read()   # Read the webpage content

with open("tutorialpro_urllib_test.html", "wb") as f:
    f.write(content)     # Write the content to a local file
Executing the above code will generate a tutorialpro_urllib_test.html file locally, which contains the content of the https://www.tutorialpro.org/ webpage.
For more information on Python file handling, you can refer to: https://www.tutorialpro.org/python3/python-file-methods.html
URL encoding and decoding can be done with the quote() and unquote() functions, which are defined in urllib.parse and also re-exported by urllib.request:
Example
import urllib.request
encode_url = urllib.request.quote("https://www.tutorialpro.org/") # Encoding
print(encode_url)
unencode_url = urllib.request.unquote(encode_url) # Decoding
print(unencode_url)
The output will be:
https%3A//www.tutorialpro.org/
https://www.tutorialpro.org/
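The same functions are more commonly imported from urllib.parse. Here is a small sketch of encoding a non-ASCII keyword and building a full query string; the keyword value is only an illustration:
Example
from urllib.parse import quote, unquote, urlencode

keyword = "Python 教程"            # non-ASCII keyword, used only as an illustration
print(quote(keyword))              # Python%20%E6%95%99%E7%A8%8B
print(unquote(quote(keyword)))     # Python 教程

# urlencode builds a complete query string from a dictionary
print(urlencode({"s": keyword}))   # s=Python+%E6%95%99%E7%A8%8B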
Simulating Header Information
When scraping a webpage, it is usually necessary to send browser-like request headers. This is done with the urllib.request.Request class:
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
- url: The URL address.
- data: Additional data object to be sent to the server, default is None.
- headers: HTTP request headers in dictionary format.
- origin_req_host: The host of the original request, as an IP address or domain name.
- unverifiable: Whether the request is unverifiable (rarely used); default is False.
- method: The request method, such as GET, POST, DELETE, PUT, etc.
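The method parameter is not exercised by the examples below, so here is a minimal hedged sketch of issuing a HEAD request with a custom User-Agent, assuming the server accepts HEAD requests; the header value is only an illustration:
Example
import urllib.request

url = 'https://www.tutorialpro.org/'
request = urllib.request.Request(
    url,
    headers={'User-Agent': 'Mozilla/5.0'},  # illustrative User-Agent value
    method='HEAD',                          # ask the server for headers only, no body
)
response = urllib.request.urlopen(request)
print(response.status)        # e.g. 200
print(len(response.read()))   # 0 - a HEAD response carries no body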
Example - py3_urllib_test.py file code
import urllib.request
import urllib.parse
url = 'https://www.tutorialpro.org/?s=' # tutorialpro.org search page
keyword = 'Python 教程'  # search keyword; contains non-ASCII characters (Chinese for "tutorial")
key_code = urllib.request.quote(keyword)  # Percent-encode the keyword
url_all = url + key_code
header = {
'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
} # Header information
request = urllib.request.Request(url_all, headers=header)
response = urllib.request.urlopen(request).read()
fh = open("./urllib_test_tutorialpro_search.html", "wb") # Writing the file to the current directory
fh.write(response)
fh.close()
Executing the above Python code will generate a urllib_test_tutorialpro_search.html file in the current directory. Opening it in a browser shows the tutorialpro.org search results page for the keyword.
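In the other direction, the response object returned by urlopen also exposes the headers the server sent back. A small self-contained sketch (the User-Agent value is again only an illustration):
Example
import urllib.request

request = urllib.request.Request(
    'https://www.tutorialpro.org/',
    headers={'User-Agent': 'Mozilla/5.0'},   # illustrative User-Agent value
)
response = urllib.request.urlopen(request)

print(response.getheader('Content-Type'))    # one specific response header
for name, value in response.getheaders():    # all response headers as (name, value) pairs
    print(name + ':', value)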
To send form data via POST, we first create a form. The code below uses a PHP script to read the submitted form data (any server-side language could be used):
Example - py3_urllib_test.php file code:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>tutorialpro.org(tutorialpro.org) urllib POST Test</title>
</head>
<body>
<form action="" method="post" name="myForm">
Name: <input type="text" name="name"><br>
Tag: <input type="text" name="tag"><br>
<input type="submit" value="Submit">
</form>
<hr>
<?php
// Using PHP to fetch form submission data, you can replace it with others
if(isset($_POST['name']) && isset($_POST['tag'])) {
echo $_POST["name"] . ', ' . $_POST['tag'];
}
?>
</body>
</html>
Example
import urllib.request
import urllib.parse
url = 'https://www.tutorialpro.org/try/py3/py3_urllib_test.php' # Submit to the form page
data = {'name':'tutorialpro', 'tag' : 'tutorialpro.org'} # Data to be submitted
header = {
'User-Agent':'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
} # Header information
data = urllib.parse.urlencode(data).encode('utf8')  # URL-encode the parameters and convert to bytes
request = urllib.request.Request(url, data, header)  # Build the request
response = urllib.request.urlopen(request).read()  # Read the result
fh = open("./urllib_test_post_tutorialpro.html","wb") # Write the file to the current directory
fh.write(response)
fh.close()
Executing the above code will submit form data to the py3_urllib_test.php file and write the output to the urllib_test_post_tutorialpro.html file.
Opening the urllib_test_post_tutorialpro.html file in a browser shows the form together with the data echoed back by the PHP script, i.e. tutorialpro, tutorialpro.org.
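Note that urllib.parse has no urldecode function; to reverse urlencode, parse_qs() or parse_qsl() can be used to turn an encoded query string back into Python data. A small sketch:
Example
from urllib.parse import urlencode, parse_qs, parse_qsl

data = {'name': 'tutorialpro', 'tag': 'tutorialpro.org'}
encoded = urlencode(data)
print(encoded)             # name=tutorialpro&tag=tutorialpro.org

print(parse_qs(encoded))   # {'name': ['tutorialpro'], 'tag': ['tutorialpro.org']}
print(parse_qsl(encoded))  # [('name', 'tutorialpro'), ('tag', 'tutorialpro.org')]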
urllib.parse
urllib.parse is used to parse URLs, with the following syntax:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
If the allow_fragments parameter is set to False, fragment identifiers are not recognized; they are instead parsed as part of the path, parameters, or query component, and fragment is set to an empty string in the return value (see the sketch after the attribute table below).
Example
from urllib.parse import urlparse
o = urlparse("https://www.tutorialpro.org/?s=python+%E6%95%99%E7%A8%8B")
print(o)
The output of the above example is:
ParseResult(scheme='https', netloc='www.tutorialpro.org', path='/', params='', query='s=python+%E6%95%99%E7%A8%8B', fragment='')
From the result, it can be seen that urlparse returns a ParseResult, a named tuple with 6 components: scheme, netloc, path, params, query, and fragment.
We can directly read the scheme content:
Example
from urllib.parse import urlparse
o = urlparse("https://www.tutorialpro.org/?s=python+%E6%95%99%E7%A8%8B")
print(o.scheme)
The output of the above example is:
https
The complete content is as follows:
Attribute | Index | Value | Value (if not present)
---|---|---|---
scheme | 0 | URL scheme | scheme parameter
netloc | 1 | Network location part | empty string
path | 2 | Hierarchical path | empty string
params | 3 | Parameters for the last path element | empty string
query | 4 | Query component | empty string
fragment | 5 | Fragment identifier | empty string
username | | Username | None
password | | Password | None
hostname | | Hostname (lowercase) | None
port | | Port number as an integer (if present) | None
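To illustrate the allow_fragments parameter described earlier, here is a small sketch; the URL and fragment are made up for illustration:
Example
from urllib.parse import urlparse

url = "https://www.tutorialpro.org/index.html#section1"  # illustrative URL with a fragment

print(urlparse(url).fragment)                         # 'section1'
print(urlparse(url, allow_fragments=False).fragment)  # '' - the fragment is not recognized
print(urlparse(url, allow_fragments=False).path)      # '/index.html#section1'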
urllib.robotparser
urllib.robotparser is used to parse robots.txt files.
robots.txt (all lowercase) is a file located in the root directory of a website that is typically used to inform search engines about the crawling rules for the site.
urllib.robotparser provides the RobotFileParser class, with the following syntax:
class urllib.robotparser.RobotFileParser(url='')
This class offers the following methods to read and parse robots.txt files:
- set_url(url) - Sets the URL pointing to the robots.txt file.
- read() - Reads the robots.txt URL and feeds it into the parser.
- parse(lines) - Parses the lines argument.
- can_fetch(useragent, url) - Returns True if the useragent is allowed to fetch the url according to the rules parsed from the robots.txt file.
- mtime() - Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.
- modified() - Sets the time the robots.txt file was last fetched to the current time.
- crawl_delay(useragent) - Returns the Crawl-delay parameter from robots.txt for the specified useragent. Returns None if this parameter does not exist, is not applicable to the specified useragent, or the robots.txt entry has a syntax error.
- request_rate(useragent) - Returns the Request-rate parameter from robots.txt as the named tuple RequestRate(requests, seconds). Returns None if this parameter does not exist, is not applicable to the specified useragent, or the robots.txt entry has a syntax error.
- site_maps() - Returns the Sitemap parameter from robots.txt as a list(). Returns None if this parameter does not exist or the robots.txt entry has a syntax error.
Example
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
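As a closing sketch, a crawler can combine RobotFileParser with urlopen so that a page is fetched only when robots.txt allows it. The robots.txt URL and its rules are assumptions here; the actual behavior depends on the site:
Example
import urllib.request
import urllib.robotparser

user_agent = "MyCrawler/1.0"    # illustrative user agent
page_url = "https://www.tutorialpro.org/"

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.tutorialpro.org/robots.txt")  # assumed location of the robots.txt file
rp.read()

if rp.can_fetch(user_agent, page_url):
    req = urllib.request.Request(page_url, headers={"User-Agent": user_agent})
    print(urllib.request.urlopen(req).read(100))       # fetch the first 100 bytes
else:
    print("Fetching", page_url, "is disallowed by robots.txt")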