Kanrisuru

wget

Download a file onto a remote server with wget, from Ruby.
linux manual

Basic Usage

require 'kanrisuru'
host = Kanrisuru::Remote::Host.new(host: '127.0.1.1', username: 'ubuntu', keys: ['~/.ssh/id_rsa'])

result = host.wget('https://kanrisuru.com', directory_prefix: '/home/ubuntu/downloads')
result.success? # => true

Parameters

Field Type Description
url string
Required. The URL to download.
quiet boolean
Turn off Wget's output.
verbose boolean
Turn on verbose output, with all the available data. The default output is verbose.
log_file string
Log all messages to logfile. The messages are normally reported to standard error.
append_log_file string
Append to logfile. This is the same as log_file, only it appends to logfile instead of overwriting the old log file. If logfile does not exist, a new file is created.
Download Options
bind_address string
When making client TCP/IP connections, bind to ADDRESS on the local machine. ADDRESS may be specified as a hostname or IP address. This option can be useful if your machine is bound to multiple IPs.
retries integer
Set number of tries to number. Specify 0 or inf for infinite retrying. The default is to retry 20 times, with the exception of fatal errors like "connection refused" or "not found" (404), which are not retried.
output_document string
Documents are not written to their own files; instead, all of them are concatenated and written to the specified file.
no_clobber boolean
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including no_clobber. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
continue boolean
Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of Wget, or by another program.
server_response boolean
Print the headers sent by HTTP servers and responses sent by FTP servers.
spider boolean
When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
timeout integer
Set the network timeout, in seconds. This is equivalent to specifying dns_timeout, connect_timeout, and read_timeout, all at the same time.
dns_timeout integer
Set the DNS lookup timeout, in seconds. DNS lookups that don't complete within the specified time will fail. By default, there is no timeout on DNS lookups, other than that implemented by system libraries.
connect_timeout integer
Set the connect timeout, in seconds. TCP connections that take longer to establish will be aborted. By default, there is no connect timeout, other than that implemented by system libraries.
read_timeout integer
Set the read (and write) timeout, in seconds. The "time" of this timeout refers to idle time: if, at any point in the download, no data is received for more than the specified number of seconds, reading fails and the download is restarted.
limit_rate integer, string
Limit the download speed to the given number of bytes per second. The amount may be expressed in bytes, kilobytes with the k suffix, or megabytes with the m suffix.
wait integer, string
Wait the specified number of seconds between retrievals. Use of this option is recommended, as it lightens the server load by making the requests less frequent. Instead of seconds, the time can be specified in minutes using the "m" suffix, in hours using the "h" suffix, or in days using the "d" suffix.
waitretry integer
If you don't want Wget to wait between every retrieval, but only between retries of failed downloads, you can use this option. Wget will use linear backoff, waiting 1 second after the first failure on a given file, then waiting 2 seconds after the second failure on that file, up to the maximum number of seconds you specify.
random_wait boolean
Some web sites may perform log analysis to identify retrieval programs such as Wget by looking for statistically significant similarities in the time between requests. This option causes the time between requests to vary between 0.5 and 1.5 * wait seconds, where wait was specified using the wait option, in order to mask Wget's presence from such analysis.
no_proxy boolean
Don't use proxies, even if the appropriate *_proxy environment variable is defined.
no_dns_cache boolean
Turn off caching of DNS lookups. Normally, Wget remembers the IP addresses it looked up from DNS so it doesn't have to repeatedly contact the DNS server for the same (typically small) set of hosts it retrieves from. This cache exists in memory only; a new Wget run will contact DNS again.
quota string, integer
Specify the download quota for automatic retrievals. The value can be specified in bytes (default), kilobytes (with the k suffix), or megabytes (with the m suffix). Setting quota to 0 or to inf removes the download quota.
restrict_file_names string, array
Change which characters found in remote URLs must be escaped during generation of local filenames. The modes are a comma-separated set of text values. The acceptable values are unix, windows, nocontrol, ascii, lowercase, and uppercase.
family string
Force connecting to IPv4 or IPv6 addresses only. The value can be either 'inet' or 'inet6'.
retry_connrefused boolean
Consider "connection refused" a transient error and try again.
user string
Specify the username for both FTP and HTTP file retrieval.
password string
Specify the password for both FTP and HTTP file retrieval.
no_iri boolean
Turn off internationalized URI (IRI) support. Use iri to turn it on. IRI support is activated by default.
local_encoding string
Force Wget to use the given encoding as the default system encoding. That affects how Wget converts URLs specified as arguments from locale to UTF-8 for IRI support.
remote_encoding string
Force Wget to use the given encoding as the default remote server encoding. That affects how Wget converts URIs found in files from remote encoding to UTF-8 during a recursive fetch. This option is only useful for IRI support, for the interpretation of non-ASCII characters.
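The timing rules described for waitretry and random_wait can be sketched in plain Ruby. The helper names below (waitretry_delays, random_wait_range) are hypothetical and are not part of Kanrisuru or Wget; they only illustrate the arithmetic stated in the descriptions above.

```ruby
# Hypothetical helpers illustrating the timing rules described for
# waitretry and random_wait. They are not part of Kanrisuru or Wget.

# Linear backoff: 1 second after the first failure, 2 after the second,
# and so on, capped at max_seconds.
def waitretry_delays(max_seconds, failures)
  (1..failures).map { |n| [n, max_seconds].min }
end

# random_wait varies the pause between 0.5 and 1.5 times the wait value.
def random_wait_range(wait_seconds)
  (0.5 * wait_seconds)..(1.5 * wait_seconds)
end

waitretry_delays(3, 5)  # => [1, 2, 3, 3, 3]
random_wait_range(10)   # => 5.0..15.0
```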
Directory Options
no_directories boolean
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering.
force_directories boolean
Create a hierarchy of directories, even if one would not have been created otherwise. E.g. http://fly.srk.fer.hr/robots.txt will save the downloaded file to fly.srk.fer.hr/robots.txt.
no_host_directories boolean
Disable generation of host-prefixed directories.
protocol_directories boolean
Use the protocol name as a directory component of local file names.
cut_dirs integer
Ignore the given number of directory components. This is useful for getting fine-grained control over the directory where a recursive retrieval will be saved.
directory_prefix string
Set the directory prefix. The directory prefix is the directory where all other files and subdirectories will be saved, i.e. the top of the retrieval tree.
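How the directory options combine is easier to see with a small model. The sketch below approximates the local directory Wget would choose for a URL, assuming a directory hierarchy is being created at all (recursive retrieval or force_directories). The helper local_directory and the example URL are hypothetical; this is an illustration of the documented behaviour, not Wget's or Kanrisuru's implementation.

```ruby
require 'uri'

# Hypothetical helper: approximates the local directory Wget would use for a
# URL when a directory hierarchy is being created. Not part of Kanrisuru.
def local_directory(url, directory_prefix: nil, no_host_directories: false,
                    protocol_directories: false, cut_dirs: 0)
  uri = URI.parse(url)
  path = uri.path.empty? ? '/' : uri.path

  # Directory components of the remote path, without the file name.
  dir_path = path.end_with?('/') ? path : File.dirname(path)
  dirs = dir_path.split('/').reject(&:empty?).drop(cut_dirs)

  parts = []
  parts << directory_prefix if directory_prefix
  parts << uri.scheme if protocol_directories
  parts << uri.host unless no_host_directories
  parts.concat(dirs)

  parts.empty? ? '.' : File.join(parts)
end

url = 'https://example.com/pub/xemacs/file.tar.gz'
local_directory(url)                                        # => "example.com/pub/xemacs"
local_directory(url, no_host_directories: true)             # => "pub/xemacs"
local_directory(url, no_host_directories: true, cut_dirs: 1) # => "xemacs"
local_directory(url, no_host_directories: true, cut_dirs: 2) # => "."
```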
HTTP Options
default_page string
Use the given name as the default file name when it isn't known (i.e., for URLs that end in a slash), instead of index.html.
adjust_extension boolean
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename.
http_user string
Specify the username on an HTTP server.
http_password string
Specify the password on an HTTP server.
load_cookies string
Load cookies from the specified file path before the first HTTP retrieval.
save_cookies string
Save cookies to the specified file path before exiting.
no_http_keep_alive boolean
Turn off the "keep-alive" feature for HTTP downloads. Normally, Wget asks the server to keep the connection open so that, when you download more than one document from the same server, they get transferred over the same TCP connection. This saves time and at the same time reduces the load on the server.
no_cache boolean
Disable server-side cache. In this case, Wget will send the remote server appropriate directives (Cache-Control: no-cache and Pragma: no-cache) to get the file from the remote service, rather than returning the cached version.
no_cookies boolean
Disable the use of cookies. Cookies are a mechanism for maintaining server-side state.
keep_session_cookies boolean
When specified, causes save_cookies to also save session cookies. Session cookies are normally not saved because they are meant to be kept in memory and forgotten when you exit the browser. Saving them is useful on sites that require you to log in or to visit the home page before you can access some pages. With this option, multiple Wget runs are considered a single browser session as far as the site is concerned.
ignore_length boolean
Some HTTP servers (CGI programs, to be more precise) send out bogus "Content-Length" headers, which makes Wget think that not all of the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte. With this option, Wget ignores the "Content-Length" header, as if it never existed.
max_redirect integer
Specifies the maximum number of redirections to follow for a resource. The default is 20, which is usually far more than necessary. However, on those occasions where you want to allow more (or fewer), this is the option to use.
proxy_user string
Specify the username for authentication on a proxy server. Wget will encode the username using the "basic" authentication scheme.
proxy_password string
Specify the password for authentication on a proxy server. Wget will encode the password using the "basic" authentication scheme.
referer string
Include a "Referer: url" header in the HTTP request. Useful for retrieving documents with server-side processing that assume they are always being retrieved by interactive web browsers and only come out properly when Referer is set to one of the pages that point to them.
save_headers boolean
Save the headers sent by the HTTP server to the file, preceding the actual contents, with an empty line as the separator.
user_agent string
Identify as agent-string to the HTTP server.
headers Hash
A key, value hash of HTTP headers.
post_data string
Use POST as the method for all HTTP requests and send the specified data in the request body. Sends string as data.
post_file string
Use POST as the method for all HTTP requests and send the contents of the specified file in the request body.
method string
For the purpose of RESTful scripting, Wget allows sending of other HTTP Methods.
content_disposition boolean
If this is set to on, experimental (not fully-functional) support for "Content-Disposition" headers is enabled. This can currently result in extra round-trips to the server for a "HEAD" request, and is known to suffer from a few bugs, which is why it is not currently enabled by default.
trust_server_names boolean
If this is set, on a redirect, the local file name will be based on the redirection URL. By default the local file name is based on the original URL. When doing recursive retrieving this can be helpful because in many web sites redirected URLs correspond to an underlying file structure, while link URLs do not.
retry_on_host_error boolean
Consider host errors, such as "Temporary failure in name resolution", as non-fatal, transient errors.
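As an illustration of how several HTTP options might be combined, the sketch below wraps host.wget in a small helper. The helper name post_json, the header value, and the chosen max_redirect are hypothetical; the option names (headers, post_data, max_redirect, output_document) come from the table. The host argument is assumed to be a Kanrisuru::Remote::Host created as in Basic Usage.

```ruby
require 'json'

# Hypothetical helper: POSTs a JSON payload with wget on a remote host and
# writes the response body to output_path on that host.
# `host` is assumed to respond to #wget(url, options), as in Basic Usage.
def post_json(host, url, payload, output_path)
  host.wget(
    url,
    headers: { 'Content-Type' => 'application/json' },
    post_data: JSON.generate(payload),
    max_redirect: 5,
    output_document: output_path
  )
end
```

The return value is whatever host.wget returns, so the caller can still check success? on it.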
HTTPS (SSL/TLS) Options
secure_protocol string
Choose the secure protocol to be used. Legal values are auto, SSLv2, SSLv3, TLSv1, TLSv1_1, TLSv1_2, TLSv1_3 and PFS.
no_check_certificate boolean
Don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate.
certificate string
Use the client certificate stored in file. This is needed for servers that are configured to require certificates from the clients that connect to them.
certificate_type string
Specify the type of the client certificate. Legal values are PEM (assumed by default) and DER, also known as ASN1.
private_key string
Read the private key from file. This allows you to provide the private key in a file separate from the certificate.
private_key_type string
Specify the type of the private key. Accepted values are PEM (the default) and DER.
ca_certificate string
Use file as the file with the bundle of certificate authorities ("CA") to verify the peers. The certificates must be in PEM format.
ca_directory string
Specifies directory containing CA certificates in PEM format.
random_file string
Use file as the source of random data for seeding the pseudo-random number generator on systems without /dev/urandom.
egd_file string
Use file as the EGD socket. EGD stands for Entropy Gathering Daemon, a user-space program that collects data from various unpredictable system sources and makes it available to other programs that might need it.
FTP / FTPS Options
ftp_user string
Specify the username for an FTP server.
ftp_password string
Specify the password for an FTP server.
no_remove_listing boolean
Don't remove the temporary .listing files generated by FTP retrievals.
no_glob boolean
Turn off FTP globbing. Globbing refers to the use of shell-like special characters (wildcards), like *, ?, [ and ], to retrieve more than one file from the same directory at once.
no_passive_ftp boolean
Disable the use of the passive FTP transfer mode. Passive FTP mandates that the client connect to the server to establish the data connection rather than the other way around.
retr_symlinks boolean
By default, when retrieving FTP directories recursively and a symbolic link is encountered, the symbolic link is traversed and the pointed-to files are retrieved. Currently, Wget does not traverse symbolic links to directories to download them recursively, though this feature may be added in the future.
preserve_permissions boolean
Preserve remote file permissions instead of permissions set by umask.
ftps_implicit boolean
This option tells Wget to use FTPS implicitly. Implicit FTPS consists of initializing SSL/TLS from the very beginning of the control connection.
no_ftps_resume_ssl boolean
Do not resume the SSL/TLS session in the data channel.
ftps_clear_data_connection boolean
All the data connections will be in plain text.
ftps_fallback_to_ftp boolean
Fall back to FTP if FTPS is not supported by the target server.
https_only boolean
When in recursive mode, only HTTPS links are followed.
Recursive Retrieval Options
recursive boolean
Turn on recursive retrieving. The default maximum depth is 5.
level integer
Set the maximum number of subdirectories that Wget will recurse into to depth.
delete_after boolean
This option tells Wget to delete every single file it downloads, after having done so.
convert_links boolean
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
backup_converted boolean
When converting a file, back up the original version with a .orig suffix.
mirror boolean
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings.
page_requisites boolean
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
strict_comments boolean
Turn on strict parsing of HTML comments. The default is to terminate comments at the first occurrence of -->.
Recursive Accept/Reject Options
accept string, array
Specify a comma-separated string or an array of file name suffixes or patterns to accept.
reject string, array
Specify a comma-separated string or an array of file name suffixes or patterns to reject.
accept_regex string
Specify a regular expression to accept the complete URL.
reject_regex string
Specify a regular expression to reject the complete URL.
regex_type string
Specify the regular expression type. Possible types are posix or pcre. Note that to be able to use pcre type, wget has to be compiled with libpcre support.
domains string, array
Set domains to be followed.
follow_tags string, array
Choose a subset of HTML tag / attributes to consider when looking for linked documents during a recursive retrieval.
ignore_tags string, array
Choose a subset of HTML tag / attributes to skip when looking for linked documents during a recursive retrieval.
include_directories string, array
Specify a comma-separated list of directories you wish to follow when downloading. Elements of list may contain wildcards.
exclude_directories string, array
Specify a comma-separated list of directories you wish to exclude from download. Elements of list may contain wildcards.
follow_ftp boolean
Follow FTP links from HTML documents. Without this option, Wget will ignore all the FTP links.
ignore_case boolean
Ignore case when matching files and directories.
span_hosts boolean
Enable spanning across hosts when doing recursive retrieving.
relative boolean
Follow relative links only.
no_parent boolean
Do not ever ascend to the parent directory when retrieving recursively.
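Several options above (accept, reject, domains, follow_tags and others) take either a string or an array. How Kanrisuru converts these internally is not shown in this document; the helper below, with the hypothetical name to_wget_list, sketches one way such a value could be normalized into the comma-separated form Wget expects.

```ruby
# Hypothetical helper, not Kanrisuru's implementation: turns a string or an
# array of values into a comma-separated list, trimming stray whitespace.
def to_wget_list(value)
  Array(value).flat_map { |item| item.to_s.split(',') }
              .map(&:strip)
              .reject(&:empty?)
              .join(',')
end

to_wget_list(%w[jpg png gif])  # => "jpg,png,gif"
to_wget_list('jpg, png')       # => "jpg,png"
```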

Result

No explicit data structure is returned. Use success?, failure?, and status on the result to check whether the program exited properly.

Exit Status

Code Description
0 Success
1 Generic error code
2 Parse error, for instance, when parsing command-line options, the .wgetrc or .netrc
3 File I/O error
4 Network failure
5 SSL verification failure
6 Username/password authentication failure
7 Protocol errors
8 Server issued an error response
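The exit codes in the table above can be turned into readable messages with a small lookup. This is a minimal sketch: the constant WGET_EXIT_STATUS and the helper describe_wget_status are hypothetical names, and the descriptions are copied from the table.

```ruby
# Descriptions copied from the Exit Status table above.
WGET_EXIT_STATUS = {
  0 => 'Success',
  1 => 'Generic error code',
  2 => 'Parse error',
  3 => 'File I/O error',
  4 => 'Network failure',
  5 => 'SSL verification failure',
  6 => 'Username/password authentication failure',
  7 => 'Protocol errors',
  8 => 'Server issued an error response'
}.freeze

# Hypothetical helper: returns a readable message for a wget exit code.
def describe_wget_status(code)
  WGET_EXIT_STATUS.fetch(code) { "Unknown exit status #{code}" }
end

describe_wget_status(4)   # => "Network failure"
describe_wget_status(42)  # => "Unknown exit status 42"
```

With a Kanrisuru result this could be called as describe_wget_status(result.status), assuming status returns the integer exit code.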

Tested On

  • Ubuntu, Debian, CentOS, Fedora, Red Hat, openSUSE, SLES