curldl.curldl module

Interface for PycURL functionality

class curldl.curldl.Curldl(basedir: str | PathLike[str], *, progress: bool = False, verbose: bool = False, user_agent: str = 'curl', retry_attempts: int = 3, retry_wait_sec: int | float = 2, timeout_sec: int | float = 120, max_redirects: int = 5, allowed_protocols_bitmask: int = 0, min_part_bytes: int = 65536, always_keep_part_bytes: int = 67108864, curl_config_callback: Callable[[Curl], None] | None = None)

Bases: object

Interface for downloading functionality of PycURL. Basic usage example:

import os
import curldl
dl = curldl.Curldl(basedir='downloads', progress=True)
dl.get('https://kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz',
       'linux-0.01.tar.gz', size=73091,
       digests={'sha1': '566b6fb6365e25f47b972efa1506932b87d3ca7d'})
assert os.path.exists('downloads/linux-0.01.tar.gz')

For a more in-depth guide, refer to the package documentation.

Initialize a PycURL-based downloader with a single pycurl.Curl instance that is reused and reconfigured for each download. The resulting downloader object should therefore not be shared among several threads.

Parameters:

basedir (str | os.PathLike[str]) – base directory path for downloaded files
progress (bool) – show progress bar on sys.stderr
verbose (bool) – enable verbose logging information from libcurl at DEBUG level
user_agent (str) – User-Agent header for HTTP(S) protocols
retry_attempts (int) – number of download retry attempts in case of failure in DOWNLOAD_RETRY_ERRORS
retry_wait_sec (int | float) – seconds to wait between download retry attempts
timeout_sec (int | float) – timeout seconds for libcurl operation
max_redirects (int) – maximum number of redirects allowed in HTTP(S) protocols
allowed_protocols_bitmask (int) – bitmask of allowed protocols, e.g. pycurl.PROTO_HTTP; the default (0) selects the bitwise OR of the values in DEFAULT_ALLOWED_PROTOCOLS
min_part_bytes (int) – partial downloads below this size are removed after an unsuccessful download attempt; set to 0 to disable removal of unsuccessful partial downloads
always_keep_part_bytes (int) – do not remove partial downloads of this size or larger when resuming a download, even if no size or digest is provided for verification; set to 0 to never remove existing partial downloads
curl_config_callback (Callable[[pycurl.Curl], None] | None) – pass a callback to further configure a pycurl.Curl object

DOWNLOAD_RETRY_ERRORS = {5, 6, 7, 10, 12, 15, 16, 18, 22, 28, 30, 35, 47, 52, 55, 56, 79}

libcurl errors accepted by the download retry policy
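A retry policy driven by such an error set can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `FakeCurlError` and `run_with_retries` are hypothetical names, and the sketch assumes that `retry_attempts` counts retries made after the initial attempt.

```python
import time

# Error codes copied from DOWNLOAD_RETRY_ERRORS above
RETRYABLE = {5, 6, 7, 10, 12, 15, 16, 18, 22, 28, 30, 35, 47, 52, 55, 56, 79}


class FakeCurlError(Exception):
    """Hypothetical stand-in for pycurl.error; args[0] is the libcurl code."""


def run_with_retries(operation, *, retry_attempts=3, retry_wait_sec=2,
                     sleep=time.sleep):
    """Call operation(), retrying on retryable libcurl error codes.

    Assumption: retry_attempts is the number of retries after the
    first attempt, so operation() runs at most retry_attempts + 1 times.
    """
    for attempt in range(retry_attempts + 1):
        try:
            return operation()
        except FakeCurlError as exc:
            code = exc.args[0]
            # Give up on non-retryable codes or when retries are exhausted
            if code not in RETRYABLE or attempt == retry_attempts:
                raise
            sleep(retry_wait_sec)
```

Passing `sleep` as a parameter keeps the waiting behaviour replaceable, which is convenient when exercising the loop without real delays.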

DEFAULT_ALLOWED_PROTOCOLS = {1, 2, 4, 8, 32}

URL schemes allowed by default; can be changed with the allowed_protocols_bitmask constructor parameter
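The set holds individual protocol bits, which are combined with bitwise OR to form a mask. The sketch below uses plain integers so that it runs without PycURL; the names in `PROTO_BITS` follow libcurl's `CURLPROTO_*` constants, and `is_allowed` is a hypothetical helper added for illustration.

```python
from functools import reduce
from operator import or_

# Bit values as defined by libcurl's CURLPROTO_* constants
PROTO_BITS = {"http": 1, "https": 2, "ftp": 4, "ftps": 8, "scp": 16, "sftp": 32}

DEFAULT_ALLOWED_PROTOCOLS = {1, 2, 4, 8, 32}

# Combine the individual bits into a single bitmask
default_mask = reduce(or_, DEFAULT_ALLOWED_PROTOCOLS, 0)


def is_allowed(mask, proto_bit):
    """Return True if the protocol bit is set in the mask."""
    return bool(mask & proto_bit)


# A narrower mask permitting only HTTPS and SFTP
https_sftp_only = PROTO_BITS["https"] | PROTO_BITS["sftp"]
```

With PycURL installed, the same masks would presumably be built from constants such as `pycurl.PROTO_HTTP` and passed as `allowed_protocols_bitmask`.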

RESUME_FROM_SCHEMES = {'file', 'ftp', 'ftps', 'http', 'https'}

URL schemes supported by pycurl.RESUME_FROM. SFTP is not included because its implementation is buggy (the total download size is reduced twice by the initial size). The scheme is extracted via urllib from the initial URL, but there are no security implications since it is only used for removing partial downloads.
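As a rough sketch of how a scheme check like this can decide the initial download offset: the functions `url_scheme` and `resume_offset` below are hypothetical, and returning 0 for unsupported schemes is an assumption (the text above indicates that the library removes the partial file in that case).

```python
import os
from urllib.parse import urlsplit

RESUME_FROM_SCHEMES = {"file", "ftp", "ftps", "http", "https"}


def url_scheme(url):
    """Return the lowercase scheme of a URL, e.g. 'http'."""
    return urlsplit(url).scheme.lower()


def resume_offset(url, part_path):
    """Return the byte offset to resume from, or 0 to start over."""
    if url_scheme(url) not in RESUME_FROM_SCHEMES:
        return 0  # resuming is not supported for this scheme
    try:
        return os.path.getsize(part_path)
    except OSError:
        return 0  # no partial file yet
```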

VERBOSE_LOGGING = {0: 'TEXT', 1: 'IHDR', 2: 'OHDR'}

Info types logged by the DEBUGFUNCTION() callback during verbose logging

_get_configured_curl(url: str, path: str, *, timestamp: int | float | None = None) → tuple[Curl, int]

Reconfigure the pycurl.Curl instance for the requested download and return it. Methods should not work with self._unconfigured_curl directly, only with the instance returned by this method.

Parameters:

url (str) – URL to download
path (str) – resolved download file path
timestamp (int | float | None) – last-modified timestamp of an already downloaded path, if it exists; used for skipping not-modified-since downloads with HTTP(S), FTP(S), FILE and RTSP protocols

Returns:

pycurl.Curl instance configured for requested download and initial download offset (i.e., file size to resume)

Return type:

tuple[Curl, int]

_perform_curl_download(curl: pycurl.Curl, write_stream: BinaryIO, progress_bar: tqdm[NoReturn]) → None

Complete pycurl.Curl configuration and start downloading.

Parameters:

curl (pycurl.Curl) – configured pycurl.Curl instance
write_stream (BinaryIO) – output stream to write to (a file opened in binary write mode)
progress_bar (tqdm[NoReturn]) – progress bar to use; XFERINFOFUNCTION() is configured if enabled

static _get_curl_progress_callback(progress_bar: tqdm[NoReturn]) → Callable[[int, int, int, int], None]

Constructs a progress bar-updating callback for XFERINFOFUNCTION().

Parameters: progress_bar (tqdm[NoReturn]) – progress bar to use, must be enabled
Returns: XFERINFOFUNCTION() callback
Return type: Callable[[int, int, int, int], None]
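The shape of such a callback can be sketched without PycURL or tqdm. libcurl's transfer-info callback receives four integers (download total, bytes downloaded, upload total, bytes uploaded). `FakeBar` is a hypothetical stand-in exposing only the `total`, `n` and `update()` members of a tqdm bar, and `make_progress_callback` is an illustrative name, not the library's method.

```python
class FakeBar:
    """Hypothetical stand-in for a tqdm bar (total, n, update)."""

    def __init__(self):
        self.total = None
        self.n = 0

    def update(self, delta):
        self.n += delta


def make_progress_callback(bar):
    """Build a closure that mirrors download progress onto the bar."""

    def callback(download_total, downloaded, upload_total, uploaded):
        if download_total:
            # Total size becomes known once response headers arrive
            bar.total = download_total
        # tqdm-style bars advance by increments, not absolute positions
        bar.update(downloaded - bar.n)

    return callback
```

The sketch ignores the upload counters and any initial offset of a resumed download; a real implementation would need to account for the latter.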

classmethod _curl_debug_cb(debug_type: int, debug_msg: bytes) → None

Callback for DEBUGFUNCTION() that logs libcurl messages at DEBUG level.

Parameters:

debug_type (int) – pycurl.Curl-supplied info type, e.g. pycurl.INFOTYPE_HEADER_IN
debug_msg (bytes) – pycurl.Curl-supplied debug message
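A debug callback of this kind might look like the following sketch. The mapping is copied from VERBOSE_LOGGING; the logger name, the message format and the decision to decode as UTF-8 with replacement are assumptions made for illustration.

```python
import logging

# Info types 0, 1, 2 correspond to libcurl's text, header-in and header-out
VERBOSE_LOGGING = {0: "TEXT", 1: "IHDR", 2: "OHDR"}

log = logging.getLogger("curldl_sketch")  # hypothetical logger name


def curl_debug_cb(debug_type, debug_msg):
    """Log selected libcurl debug messages at DEBUG level."""
    label = VERBOSE_LOGGING.get(debug_type)
    if label is None:
        return  # skip data and SSL payload info types
    text = debug_msg.decode("utf-8", errors="replace").rstrip()
    log.debug("curl: [%s] %s", label, text)
```

With PycURL, a function like this would be registered through `curl.setopt(pycurl.DEBUGFUNCTION, ...)` together with `pycurl.VERBOSE`.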

get(url: str, rel_path: str, *, size: int | None = None, digests: dict[str, str] | None = None) → None

Download a URL to a basedir-relative path and verify its expected size and digests. Resume a partial download with the .part extension if one exists and the protocol supports resuming, and retry failures according to the retry policy. In case of a size or digest mismatch, the downloaded file is removed and ValueError is raised.

Parameters:

url (str) – URL to download
rel_path (str) – basedir-relative output file path
size (int | None) – expected file size in bytes, or None to ignore
digests (dict[str, str] | None) – mapping of digest algorithms to expected hexadecimal digest strings, or None to ignore (see curldl.util.fs.FileSystem.verify_size_and_digests())

Raises:

ValueError – relative path escapes base directory or is otherwise unsafe (see curldl.util.fs.FileSystem.verify_rel_path_is_safe()), or file size mismatch, or one of digests fails verification
pycurl.error – PycURL error when downloading after retries are exhausted
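The size and digest check can be illustrated with hashlib. This is a sketch of the general technique only: `check_size_and_digests` is a hypothetical function, not curldl.util.fs.FileSystem.verify_size_and_digests(), and it only raises on mismatch without removing the file.

```python
import hashlib
import os


def check_size_and_digests(path, size=None, digests=None):
    """Raise ValueError if the file size or any digest does not match."""
    if size is not None:
        actual = os.path.getsize(path)
        if actual != size:
            raise ValueError(f"size mismatch: expected {size}, got {actual}")
    for algo, expected in (digests or {}).items():
        hasher = hashlib.new(algo)
        with open(path, "rb") as stream:
            # Read in chunks so that large files are not loaded into memory
            for chunk in iter(lambda: stream.read(65536), b""):
                hasher.update(chunk)
        if hasher.hexdigest() != expected.lower():
            raise ValueError(f"{algo} digest mismatch for {path}")
```

The `digests` argument has the same shape as in the usage example at the top of this page, e.g. `{'sha1': '566b6fb6...'}`.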

_download_partial(url: str, path: str, *, timestamp: int | float | None = None, description: str | None = None) → None

Start or resume a partial download of a URL to the resolved path. If the timestamp of an already downloaded file is provided, remove the partial file if the URL content is not more recent than the timestamp. This method should be invoked with a retry policy.

Parameters:

url (str) – URL to download
path (str) – resolved path of a partial download file
timestamp (int | float | None) – last-modified timestamp of an already downloaded path, if it exists
description (str | None) – description string for progress bar (e.g., base name of downloaded file)

Raises:

pycurl.error – PycURL error when downloading, may result in a retry according to policy

_prepare_full_path(rel_path: str) → str

Verify that basedir-relative path is safe and create the required directories.

Parameters: rel_path (str) – basedir-relative path
Returns: resulting complete path
Raises: ValueError – relative path escapes base directory or is otherwise unsafe (see curldl.util.fs.FileSystem.verify_rel_path_is_safe())
Return type: str
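One common way to implement such a check is to resolve the joined path and confirm that it still lies under the base directory. The sketch below assumes this approach; `prepare_full_path` is a hypothetical function, and the library's own safety rules in verify_rel_path_is_safe() may be stricter.

```python
import os


def prepare_full_path(basedir, rel_path):
    """Resolve rel_path under basedir and create its parent directories."""
    base = os.path.realpath(basedir)
    if os.path.isabs(rel_path):
        raise ValueError(f"absolute path not allowed: {rel_path!r}")
    # realpath() collapses '..' components and resolves symlinks
    full = os.path.realpath(os.path.join(base, rel_path))
    if full == base or os.path.commonpath([base, full]) != base:
        raise ValueError(f"path escapes base directory: {rel_path!r}")
    os.makedirs(os.path.dirname(full), exist_ok=True)
    return full
```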

classmethod _get_response_status(curl: Curl, url: str, error: error | None) → str

Format response code and description from cURL with a possible error.

Parameters:

curl (Curl) – pycurl.Curl instance to extract response code from
url (str) – URL to extract the protocol scheme from if pycurl.EFFECTIVE_URL is unavailable
error (error | None) – PycURL exception instance

Returns:

formatted string that includes a response code and its meaning, if available

Return type:

str

static _get_url_scheme(url: str) → str

Return URL scheme (lowercase).

Parameters: url (str) – a URL to extract the scheme part from
Returns: lowercase protocol scheme, e.g. http
Return type: str

_discard_file(path: str, *, force_remove: bool = False) → None

Remove the file if its size is below the min_part_bytes threshold, or unconditionally if force_remove is True.

Parameters:

path (str) – file path to remove if its size is below min_part_bytes
force_remove (bool) – unconditionally remove the file
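The threshold rule can be sketched as follows. `discard_file` is a hypothetical free function that takes the threshold as an argument, and the boolean return value is added for illustration only (the documented method returns None). The sketch assumes the file exists.

```python
import os


def discard_file(path, *, min_part_bytes=65536, force_remove=False):
    """Remove path if it is smaller than min_part_bytes or removal is forced.

    Returns True if the file was removed. With min_part_bytes=0 the size
    test can never succeed, so only force_remove triggers removal.
    """
    if force_remove or os.path.getsize(path) < min_part_bytes:
        os.remove(path)
        return True
    return False
```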