curldl.curldl module

Interface for PycURL functionality

class curldl.curldl.Curldl(basedir: str | PathLike[str], *, progress: bool = False, verbose: bool = False, user_agent: str = 'curl', retry_attempts: int = 3, retry_wait_sec: int | float = 2, timeout_sec: int | float = 120, max_redirects: int = 5, allowed_protocols_bitmask: int = 0, min_part_bytes: int = 65536, always_keep_part_bytes: int = 67108864, curl_config_callback: Callable[[Curl], None] | None = None)

Bases: object

Interface for downloading functionality of PycURL. Basic usage example:

import curldl, os
dl = curldl.Curldl(basedir='downloads', progress=True)
dl.get('https://kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz',
       'linux-0.01.tar.gz', size=73091,
       digests={'sha1': '566b6fb6365e25f47b972efa1506932b87d3ca7d'})
assert os.path.exists('downloads/linux-0.01.tar.gz')

For a more in-depth guide, refer to package documentation.

Initialize a PycURL-based downloader with a single pycurl.Curl instance that is reused and reconfigured for each download. The resulting downloader object should therefore not be shared among several threads.

Parameters:
  • basedir (str | os.PathLike[str]) – base directory path for downloaded files

  • progress (bool) – show progress bar on sys.stderr

  • verbose (bool) – enable verbose logging information from libcurl at DEBUG level

  • user_agent (str) – User-Agent header for HTTP(S) protocols

  • retry_attempts (int) – number of download retry attempts after a failure listed in DOWNLOAD_RETRY_ERRORS

  • retry_wait_sec (int | float) – seconds to wait between download retry attempts

  • timeout_sec (int | float) – timeout in seconds for a libcurl operation

  • max_redirects (int) – maximum number of redirects allowed in HTTP(S) protocols

  • allowed_protocols_bitmask (int) – bitmask of allowed protocols, e.g. pycurl.PROTO_HTTP; the default is the bitwise OR of the values in DEFAULT_ALLOWED_PROTOCOLS

  • min_part_bytes (int) – partial downloads below this size are removed after an unsuccessful download attempt; set to 0 to disable removal of unsuccessful partial downloads

  • always_keep_part_bytes (int) – do not remove partial downloads of this size or larger when resuming a download, even if no size or digest is provided for verification; set to 0 to never remove existing partial downloads

  • curl_config_callback (Callable[[pycurl.Curl], None] | None) – pass a callback to further configure a pycurl.Curl object

DOWNLOAD_RETRY_ERRORS = {5, 6, 7, 10, 12, 15, 16, 18, 22, 28, 30, 35, 47, 52, 55, 56, 79}

libcurl errors accepted by download retry policy
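
A retry policy built on this set could look like the minimal sketch below. It is not curldl's implementation: FakeCurlError and download_with_retries are hypothetical names, FakeCurlError stands in for pycurl.error (whose first argument is the libcurl error code), and treating retry_attempts as the number of retries after the first attempt is an assumption.

```python
import time

DOWNLOAD_RETRY_ERRORS = {5, 6, 7, 10, 12, 15, 16, 18, 22, 28, 30, 35, 47, 52, 55, 56, 79}


class FakeCurlError(Exception):
    """Hypothetical stand-in for pycurl.error; args[0] is the libcurl error code."""


def download_with_retries(download, retry_attempts=3, retry_wait_sec=2, sleep=time.sleep):
    """Call download() and retry only on errors listed in DOWNLOAD_RETRY_ERRORS."""
    for attempt in range(retry_attempts + 1):
        try:
            return download()
        except FakeCurlError as error:
            code = error.args[0]
            # Give up on non-retryable errors or once all retries are used.
            if code not in DOWNLOAD_RETRY_ERRORS or attempt == retry_attempts:
                raise
            sleep(retry_wait_sec)


# Usage: fail twice with error 28 (operation timed out), then succeed.
calls = []


def flaky_download():
    calls.append(1)
    if len(calls) < 3:
        raise FakeCurlError(28, "Timeout was reached")
    return "ok"


# The sleep function is replaced so that the example does not actually wait.
result = download_with_retries(flaky_download, sleep=lambda seconds: None)
```

Errors outside the set are re-raised immediately, so a permanent failure is not retried.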

DEFAULT_ALLOWED_PROTOCOLS = {1, 2, 4, 8, 32}

URL schemes allowed by default; can be changed with the allowed_protocols_bitmask constructor parameter
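
Since each value is a single protocol bit, the default bitmask can be derived by OR-ing the set together. The sketch below only illustrates that arithmetic; default_bitmask and is_allowed are hypothetical names, not part of curldl.

```python
from functools import reduce
from operator import or_

DEFAULT_ALLOWED_PROTOCOLS = {1, 2, 4, 8, 32}

# Bitwise OR of all protocol bits: 1 | 2 | 4 | 8 | 32 == 47
default_bitmask = reduce(or_, DEFAULT_ALLOWED_PROTOCOLS, 0)


def is_allowed(protocol_bit, bitmask=default_bitmask):
    """Return True if the protocol bit is set in the bitmask."""
    return bool(bitmask & protocol_bit)


# A narrower mask keeps only the first two bits (HTTP and HTTPS in libcurl's
# CURLPROTO_* numbering); such a value could be passed as allowed_protocols_bitmask.
http_only_bitmask = 1 | 2
```

In the same numbering, bit 16 (SCP) is absent from the default set, so is_allowed(16) is False.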

RESUME_FROM_SCHEMES = {'file', 'ftp', 'ftps', 'http', 'https'}

URL schemes supported by pycurl.RESUME_FROM. SFTP is not included because its implementation is buggy (total download size is reduced twice by initial size). The scheme is extracted from the initial URL via urllib, but this has no security implications since it is only used for removing partial downloads.
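
A resumability check along these lines can be sketched with urllib.parse. The helper names get_url_scheme and supports_resume are hypothetical, and the exact check curldl performs may differ.

```python
from urllib.parse import urlsplit

RESUME_FROM_SCHEMES = {'file', 'ftp', 'ftps', 'http', 'https'}


def get_url_scheme(url):
    """Return the lowercase scheme of a URL, e.g. 'http'."""
    return urlsplit(url).scheme.lower()


def supports_resume(url):
    """Return True if the URL scheme is listed in RESUME_FROM_SCHEMES."""
    return get_url_scheme(url) in RESUME_FROM_SCHEMES


resumable = supports_resume('HTTPS://kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz')
not_resumable = supports_resume('sftp://example.org/file.bin')
```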

VERBOSE_LOGGING = {0: 'TEXT', 1: 'IHDR', 2: 'OHDR'}

Info types logged by DEBUGFUNCTION() callback during verbose logging
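
A debug callback that filters on this mapping might look like the following sketch. The logger name, message format, and ListHandler helper are assumptions made for illustration; only the mapping itself comes from the module.

```python
import logging

VERBOSE_LOGGING = {0: 'TEXT', 1: 'IHDR', 2: 'OHDR'}
log = logging.getLogger('curldl.sketch')  # hypothetical logger name


def curl_debug_cb(debug_type, debug_msg):
    """Log a libcurl debug message at DEBUG level if its type is listed."""
    label = VERBOSE_LOGGING.get(debug_type)
    if label is None:
        return  # other info types (e.g. transferred data) are skipped
    text = debug_msg.decode('ascii', 'replace').rstrip()
    log.debug('curl: [%s] %s', label, text)


class ListHandler(logging.Handler):
    """Collect formatted messages in a list so the result can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
log.addHandler(handler)
log.setLevel(logging.DEBUG)

curl_debug_cb(2, b'GET / HTTP/1.1\r\n')  # outgoing header: logged
curl_debug_cb(3, b'binary payload')      # type 3 is not in the mapping: ignored
```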

_get_configured_curl(url: str, path: str, *, timestamp: int | float | None = None) → tuple[Curl, int]

Reconfigure pycurl.Curl instance for requested download and return the instance. Methods should not work with self._unconfigured_curl directly, only with instance returned by this method.

Parameters:
  • url (str) – URL to download

  • path (str) – resolved download file path

  • timestamp (int | float | None) – last-modified timestamp of an already downloaded path, if it exists; used for skipping not-modified-since downloads with HTTP(S), FTP(S), FILE and RTSP protocols

Returns:

pycurl.Curl instance configured for the requested download, and the initial download offset (i.e., the size of the partial file to resume from)

Return type:

tuple[Curl, int]

_perform_curl_download(curl: pycurl.Curl, write_stream: BinaryIO, progress_bar: tqdm[NoReturn]) → None

Complete pycurl.Curl configuration and start downloading.

Parameters:
  • curl (pycurl.Curl) – configured pycurl.Curl instance

  • write_stream (BinaryIO) – output stream to write to (a file opened in binary write mode)

  • progress_bar (tqdm[NoReturn]) – progress bar to use; XFERINFOFUNCTION() is configured if enabled

static _get_curl_progress_callback(progress_bar: tqdm[NoReturn]) → Callable[[int, int, int, int], None]

Constructs a progress bar-updating callback for XFERINFOFUNCTION().

Parameters:

progress_bar (tqdm[NoReturn]) – progress bar to use, must be enabled

Returns:

XFERINFOFUNCTION() callback

Return type:

Callable[[int, int, int, int], None]
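
The shape of such a callback can be sketched as a closure over the progress bar. This is an illustration only: FakeBar is a hypothetical stand-in that mimics the total and n attributes and the refresh() method of a tqdm bar, and the real callback may also account for the resume offset.

```python
class FakeBar:
    """Hypothetical stand-in exposing the tqdm attributes used below."""

    def __init__(self):
        self.total = None
        self.n = 0
        self.refreshes = 0

    def refresh(self):
        self.refreshes += 1


def get_progress_callback(progress_bar):
    """Build a four-argument callback that mirrors transfer state on the bar."""

    def curl_progress_cb(download_total, downloaded, upload_total, uploaded):
        # The total is reported as 0 while the download size is still unknown.
        if download_total:
            progress_bar.total = download_total
        progress_bar.n = downloaded
        progress_bar.refresh()

    return curl_progress_cb


bar = FakeBar()
callback = get_progress_callback(bar)
callback(0, 0, 0, 0)         # size not known yet
callback(73091, 1024, 0, 0)  # 1024 of 73091 bytes downloaded
```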

classmethod _curl_debug_cb(debug_type: int, debug_msg: bytes) → None

Callback for DEBUGFUNCTION() that logs libcurl messages at DEBUG level.

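Parameters:
  • debug_type (int) – libcurl debug info type; only the types in VERBOSE_LOGGING are logged

  • debug_msg (bytes) – debug message received from libcurl
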
get(url: str, rel_path: str, *, size: int | None = None, digests: dict[str, str] | None = None) → None

Download a URL to a basedir-relative path and verify its expected size and digests. Resume a partial download (stored with a .part extension) if it exists and resuming is supported by the protocol, and retry failures according to the retry policy. If the size or a digest does not match, the downloaded file is removed and ValueError is raised.

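Parameters:
  • url (str) – URL to download

  • rel_path (str) – basedir-relative path of the downloaded file

  • size (int | None) – expected file size, if known

  • digests (dict[str, str] | None) – expected digests keyed by algorithm name (e.g. sha1), if known

Raises:

ValueError – size or digest mismatch
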
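
The verification step of get() can be approximated as in the sketch below, assuming hexadecimal digests keyed by hashlib algorithm names as in the usage example. verify_download and file_hexdigest are hypothetical helpers and do not reproduce curldl's own code.

```python
import hashlib
import os
import tempfile


def file_hexdigest(path, algorithm):
    """Compute the hexadecimal digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, 'rb') as stream:
        for chunk in iter(lambda: stream.read(65536), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, size=None, digests=None):
    """Remove the file and raise ValueError on size or digest mismatch."""
    problem = None
    if size is not None and os.path.getsize(path) != size:
        problem = 'size mismatch'
    else:
        for algorithm, expected in (digests or {}).items():
            if file_hexdigest(path, algorithm) != expected.lower():
                problem = f'{algorithm} digest mismatch'
                break
    if problem:
        os.remove(path)
        raise ValueError(f'{problem}: {path}')


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'file.bin')
    with open(path, 'wb') as stream:
        stream.write(b'hello')

    # Matching size and digest: the file is kept.
    verify_download(path, size=5, digests={'sha1': hashlib.sha1(b'hello').hexdigest()})
    kept = os.path.exists(path)

    # Wrong size: the file is removed and ValueError is raised.
    removed = False
    try:
        verify_download(path, size=6)
    except ValueError:
        removed = not os.path.exists(path)
```
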
_download_partial(url: str, path: str, *, timestamp: int | float | None = None, description: str | None = None) → None

Start or resume a partial download of a URL to resolved path. If timestamp of an already downloaded file is provided, remove the partial file if the URL content is not more recent than the timestamp. This method should be invoked with a retry policy.

Parameters:
  • url (str) – URL to download

  • path (str) – resolved path of a partial download file

  • timestamp (int | float | None) – last-modified timestamp of an already downloaded path, if it exists

  • description (str | None) – description string for progress bar (e.g., base name of downloaded file)

Raises:

pycurl.error – PycURL error when downloading, may result in a retry according to policy

_prepare_full_path(rel_path: str) → str

Verify that basedir-relative path is safe and create the required directories.

Parameters:

rel_path (str) – basedir-relative path

Returns:

resulting complete path

Raises:

ValueError – relative path escapes base directory or is otherwise unsafe (see curldl.util.fs.FileSystem.verify_rel_path_is_safe())

Return type:

str
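
One way to express this kind of check is sketched below. It is an assumption-laden illustration: prepare_full_path is a hypothetical function, and the actual rules live in curldl.util.fs.FileSystem.verify_rel_path_is_safe() and may be stricter.

```python
import os
import tempfile


def prepare_full_path(basedir, rel_path):
    """Resolve rel_path under basedir, reject escapes, create parent directories."""
    base = os.path.realpath(basedir)
    full = os.path.realpath(os.path.join(base, rel_path))
    # The resolved path must be strictly inside the base directory.
    if full == base or os.path.commonpath([base, full]) != base:
        raise ValueError(f'unsafe relative path: {rel_path!r}')
    os.makedirs(os.path.dirname(full), exist_ok=True)
    return full


with tempfile.TemporaryDirectory() as tmp:
    full_path = prepare_full_path(tmp, os.path.join('a', 'b', 'file.txt'))
    parent_created = os.path.isdir(os.path.dirname(full_path))

    escaped = False
    try:
        prepare_full_path(tmp, os.path.join('..', 'escape.txt'))
    except ValueError:
        escaped = True
```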

classmethod _get_response_status(curl: Curl, url: str, error: error | None) → str

Format response code and description from cURL with a possible error.

Parameters:
  • curl (Curl) – pycurl.Curl instance to extract response code from

  • url (str) – URL from which to extract the protocol scheme if pycurl.EFFECTIVE_URL is unavailable

  • error (error | None) – PycURL exception instance

Returns:

formatted string that includes a response code and its meaning, if available

Return type:

str

static _get_url_scheme(url: str) → str

Return URL scheme (lowercase).

Parameters:

url (str) – a URL to extract URL scheme part from

Returns:

lowercase protocol scheme, e.g. http

Return type:

str

_discard_file(path: str, *, force_remove: bool = False) → None

Remove the file if its size is below min_part_bytes, or unconditionally if force_remove is True.

Parameters:
  • path (str) – file path to remove if its size is below min_part_bytes

  • force_remove (bool) – unconditionally remove the file
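
The removal rule can be sketched as follows. discard_file is a hypothetical free function; unlike the method above, it takes min_part_bytes as an argument and returns whether the file was removed, purely to make the behavior easy to inspect.

```python
import os
import tempfile


def discard_file(path, min_part_bytes=65536, force_remove=False):
    """Remove path if forced or smaller than min_part_bytes; return True if removed."""
    if not os.path.exists(path):
        return False
    if force_remove or os.path.getsize(path) < min_part_bytes:
        os.remove(path)
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, 'small.part')
    large = os.path.join(tmp, 'large.part')
    with open(small, 'wb') as stream:
        stream.write(b'x' * 10)
    with open(large, 'wb') as stream:
        stream.write(b'x' * 200)

    small_removed = discard_file(small, min_part_bytes=100)  # below threshold
    large_removed = discard_file(large, min_part_bytes=100)  # kept
    forced = discard_file(large, min_part_bytes=100, force_remove=True)
```

With min_part_bytes set to 0 the size comparison can never succeed, which matches the documented way to disable removal of unsuccessful partial downloads.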