wget
Download a file onto a remote server using wget, from Ruby.
linux manual
Basic Usage
```ruby
require 'kanrisuru'

host = Kanrisuru::Remote::Host.new(host: '127.0.1.1', username: 'ubuntu', keys: ['~/.ssh/id_rsa'])
result = host.wget('https://kanrisuru.com', directory_prefix: '/home/ubuntu/downloads')
result.success? # => true
```
Parameters
Field | Type | Description |
---|---|---|
url | string | Required url to download |
quiet | boolean | Turn off Wget's output. |
verbose | boolean | Turn on verbose output, with all the available data. The default output is verbose. |
log_file | string | Log all messages to logfile. The messages are normally reported to standard error. |
append_log_file | string | Append to logfile. This is the same as log_file, only it appends to logfile instead of overwriting the old log file. If logfile does not exist, a new file is created. |
Download Options | ||
bind_address | string | When making client TCP/IP connections, bind to ADDRESS on the local machine. ADDRESS may be specified as a hostname or IP address. This option can be useful if your machine is bound to multiple IPs. |
retries | integer | Set the number of tries. Specify 0 or inf for infinite retrying. The default is to retry 20 times, with the exception of fatal errors like "connection refused" or "not found" (404), which are not retried. |
output_document | string | The documents will not be written to the appropriate files, but all will be concatenated together and written to the given file. |
no_clobber | boolean | If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including no_clobber. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved. With no_clobber set, Wget keeps the existing file and does not download a new copy. |
continue | boolean | Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of Wget, or by another program. |
server_response | boolean | Print the headers sent by HTTP servers and responses sent by FTP servers. |
spider | boolean | When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there. |
timeout | integer | Set the network timeout, in seconds. This is equivalent to specifying dns_timeout, connect_timeout, and read_timeout, all at the same time. |
dns_timeout | integer | Set the DNS lookup timeout, in seconds. DNS lookups that don't complete within the specified time will fail. By default, there is no timeout on DNS lookups, other than that implemented by system libraries. |
connect_timeout | integer | Set the connect timeout, in seconds. TCP connections that take longer to establish will be aborted. By default, there is no connect timeout, other than that implemented by system libraries. |
read_timeout | integer | Set the read (and write) timeout, in seconds. The "time" of this timeout refers to idle time: if, at any point in the download, no data is received for more than the specified number of seconds, reading fails and the download is restarted. |
limit_rate | integer, string | Limit the download speed to the given number of bytes per second. The amount may be expressed in bytes, kilobytes with the k suffix, or megabytes with the m suffix. |
wait | integer, string | Wait the specified number of seconds between the retrievals. Use of this option is recommended, as it lightens the server load by making the requests less frequent. Instead of in seconds, the time can be specified in minutes using the "m" suffix, in hours using "h" suffix, or in days using "d" suffix. |
waitretry | integer | If you don't want Wget to wait between every retrieval, but only between retries of failed downloads, you can use this option. Wget will use linear backoff, waiting 1 second after the first failure on a given file, then waiting 2 seconds after the second failure on that file, up to the maximum number of seconds you specify. |
random_wait | boolean | Some web sites may perform log analysis to identify retrieval programs such as Wget by looking for statistically significant similarities in the time between requests. This option causes the time between requests to vary between 0.5 and 1.5 * wait seconds, where wait was specified using the wait option, in order to mask Wget's presence from such analysis. |
no_proxy | boolean | Don't use proxies, even if the appropriate *_proxy environment variable is defined. |
no_dns_cache | boolean | Turn off caching of DNS lookups. Normally, Wget remembers the IP addresses it looked up from DNS so it doesn't have to repeatedly contact the DNS server for the same (typically small) set of hosts it retrieves from. This cache exists in memory only; a new Wget run will contact DNS again. |
quota | string, integer | Specify download quota for automatic retrievals. The value can be specified in bytes (default), kilobytes (with k suffix), or megabytes (with m suffix). Setting quota to 0 or to inf unlimits the download quota. |
restrict_file_names | string, array | Change which characters found in remote URLs must be escaped during generation of local filenames. The modes are a comma-separated set of text values. The acceptable values are unix, windows, nocontrol, ascii, lowercase, and uppercase. |
family | string | Force connecting to IPv4 or IPv6 addresses only. Option can either be 'inet' or 'inet6'. |
retry_connrefused | boolean | Consider "connection refused" a transient error and try again. |
user | string | Specify the username for both FTP and HTTP file retrieval. |
password | string | Specify the password for both FTP and HTTP file retrieval. |
no_iri | boolean | Turn off internationalized URI (IRI) support. Use iri to turn it on. IRI support is activated by default. |
local_encoding | string | Force Wget to use the given encoding as the default system encoding. That affects how Wget converts URLs specified as arguments from locale to UTF-8 for IRI support. |
remote_encoding | string | Force Wget to use the given encoding as the default remote server encoding. That affects how Wget converts URIs found in files from remote encoding to UTF-8 during a recursive fetch. This option is only useful for IRI support, for the interpretation of non-ASCII characters. |
Directory Options | ||
no_directories | boolean | Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering. |
force_directories | boolean | Create a hierarchy of directories, even if one would not have been created otherwise. E.g. http://fly.srk.fer.hr/robots.txt will save the downloaded file to fly.srk.fer.hr/robots.txt. |
no_host_directories | boolean | Disable generation of host-prefixed directories. |
protocol_directories | boolean | Use the protocol name as a directory component of local file names. |
cut_dirs | integer | Ignore the given number of directory components. This is useful for getting a fine-grained control over the directory where recursive retrieval will be saved. |
directory_prefix | string | Set the directory prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. |
HTTP Options | ||
default_page | string | Use the given name as the default file name when it isn't known (i.e., for URLs that end in a slash), instead of index.html. |
adjust_extension | boolean | If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. |
http_user | string | Specify the username on an HTTP server. |
http_password | string | Specify the password on an HTTP server. |
load_cookies | string | Load cookies from the specified file path before the first HTTP retrieval. |
save_cookies | string | Save cookies to the specified file path before exiting. |
no_http_keep_alive | boolean | Turn off the "keep-alive" feature for HTTP downloads. Normally, Wget asks the server to keep the connection open so that, when you download more than one document from the same server, they get transferred over the same TCP connection. This saves time and at the same time reduces the load on the server. |
no_cache | boolean | Disable server-side cache. In this case, Wget will send the remote server appropriate directives (Cache-Control: no-cache and Pragma: no-cache) to get the file from the remote service, rather than returning the cached version. |
no_cookies | boolean | Disable the use of cookies. Cookies are a mechanism for maintaining server-side state. |
keep_session_cookies | boolean | When specified, causes save_cookies to also save session cookies. Session cookies are normally not saved because they are meant to be kept in memory and forgotten when you exit the browser. Saving them is useful on sites that require you to log in or to visit the home page before you can access some pages. With this option, multiple Wget runs are considered a single browser session as far as the site is concerned. |
ignore_length | boolean | Unfortunately, some HTTP servers (CGI programs, to be more precise) send out bogus "Content-Length" headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte. With this option, Wget ignores the "Content-Length" header. |
max_redirect | integer | Specifies the maximum number of redirections to follow for a resource. The default is 20, which is usually far more than necessary. However, on those occasions where you want to allow more (or fewer), this is the option to use. |
proxy_user | string | Specify the username for authentication on a proxy server. Wget will encode the username using the "basic" authentication scheme. |
proxy_password | string | Specify the password for authentication on a proxy server. Wget will encode the password using the "basic" authentication scheme. |
referer | string | Include a `Referer: url` header in the HTTP request. Useful for retrieving documents with server-side processing that assume they are always being retrieved by interactive web browsers and only come out properly when Referer is set to one of the pages that point to them. |
save_headers | boolean | Save the headers sent by the HTTP server to the file, preceding the actual contents, with an empty line as the separator. |
user_agent | string | Identify as the given agent string to the HTTP server. |
headers | Hash | A key, value hash of HTTP headers. |
post_data | string | Use POST as the method for all HTTP requests and send the specified data in the request body. Sends the given string as data. |
post_file | string | Use POST as the method for all HTTP requests and send the specified data in the request body. Sends the contents of the given file as data. |
method | string | For the purpose of RESTful scripting, Wget allows sending of other HTTP Methods. |
content_disposition | boolean | If this is set to on, experimental (not fully-functional) support for "Content-Disposition" headers is enabled. This can currently result in extra round-trips to the server for a "HEAD" request, and is known to suffer from a few bugs, which is why it is not currently enabled by default. |
trust_server_names | boolean | If this is set, on a redirect, the local file name will be based on the redirection URL. By default the local file name is based on the original URL. When doing recursive retrieving this can be helpful because in many web sites redirected URLs correspond to an underlying file structure, while link URLs do not. |
retry_on_host_error | boolean | Consider host errors, such as "Temporary failure in name resolution", as non-fatal, transient errors. |
HTTPS (SSL/TLS) Options | ||
secure_protocol | string | Choose the secure protocol to be used. Legal values are auto, SSLv2, SSLv3, TLSv1, TLSv1_1, TLSv1_2, TLSv1_3 and PFS. |
no_check_certificate | boolean | Don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate. |
certificate | string | Use the client certificate stored in file. This is needed for servers that are configured to require certificates from the clients that connect to them. |
certificate_type | string | Specify the type of the client certificate. Legal values are PEM (assumed by default) and DER, also known as ASN1. |
private_key | string | Read the private key from file. This allows you to provide the private key in a file separate from the certificate. |
private_key_type | string | Specify the type of the private key. Accepted values are PEM (the default) and DER. |
ca_certificate | string | Use file as the file with the bundle of certificate authorities ("CA") to verify the peers. The certificates must be in PEM format. |
ca_directory | string | Specifies directory containing CA certificates in PEM format. |
random_file | string | Use file as the source of random data for seeding the pseudo-random number generator on systems without /dev/urandom. |
egd_file | string | Use file as the EGD socket. EGD stands for Entropy Gathering Daemon, a user-space program that collects data from various unpredictable system sources and makes it available to other programs that might need it. |
FTP / FTPS Options | ||
ftp_user | string | Specify the username for an FTP server. |
ftp_password | string | Specify the password for an FTP server. |
no_remove_listing | boolean | Don't remove the temporary .listing files generated by FTP retrievals. |
no_glob | boolean | Turn off FTP globbing. Globbing refers to the use of shell-like special characters (wildcards), like *, ?, [ and ] to retrieve more than one file from the same directory at once. |
no_passive_ftp | boolean | Disable the use of the passive FTP transfer mode. Passive FTP mandates that the client connect to the server to establish the data connection rather than the other way around. |
retr_symlinks | boolean | By default, when retrieving FTP directories recursively and a symbolic link is encountered, the symbolic link is traversed and the pointed-to files are retrieved. Currently, Wget does not traverse symbolic links to directories to download them recursively, though this feature may be added in the future. |
preserve_permissions | boolean | Preserve remote file permissions instead of permissions set by umask. |
ftps_implicit | boolean | This option tells Wget to use FTPS implicitly. Implicit FTPS consists of initializing SSL/TLS from the very beginning of the control connection. |
no_ftps_resume_ssl | boolean | Do not resume the SSL/TLS session in the data channel. |
ftps_clear_data_connection | boolean | All the data connections will be in plain text. |
ftps_fallback_to_ftp | boolean | Fall back to FTP if FTPS is not supported by the target server. |
https_only | boolean | When in recursive mode, only HTTPS links are followed. |
Recursive Retrieval Options | ||
recursive | boolean | Turn on recursive retrieving. The default maximum depth is 5. |
level | integer | Set the maximum recursion depth, i.e. the number of subdirectory levels that Wget will recurse into. |
delete_after | boolean | This option tells Wget to delete every single file it downloads, after having done so. |
convert_links | boolean | After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc. |
backup_converted | boolean | When converting a file, back up the original version with a .orig suffix. |
mirror | boolean | Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. |
page_requisites | boolean | This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets. |
strict_comments | boolean | Turn on strict parsing of HTML comments. The default is to terminate comments at the first occurrence of -->. |
Recursive Accept/Reject Options | ||
accept | string, array | Specify comma-separated or array lists of file name suffixes or patterns to accept. |
reject | string, array | Specify comma-separated or array lists of file name suffixes or patterns to reject. |
accept_regex | string | Specify a regular expression to accept the complete URL. |
reject_regex | string | Specify a regular expression to reject the complete URL. |
regex_type | string | Specify the regular expression type. Possible types are posix or pcre. Note that to be able to use pcre type, wget has to be compiled with libpcre support. |
domains | string, array | Set domains to be followed. |
follow_tags | string, array | Choose a subset of HTML tag / attributes to consider when looking for linked documents during a recursive retrieval. |
ignore_tags | string, array | Choose a subset of HTML tag / attributes to skip when looking for linked documents during a recursive retrieval. |
include_directories | string, array | Specify a comma-separated list of directories you wish to follow when downloading. Elements of list may contain wildcards. |
exclude_directories | string, array | Specify a comma-separated list of directories you wish to exclude from download. Elements of list may contain wildcards. |
follow_ftp | boolean | Follow FTP links from HTML documents. Without this option, Wget will ignore all the FTP links. |
ignore_case | boolean | Ignore case when matching files and directories. |
span_hosts | boolean | Enable spanning across hosts when doing recursive retrieving. |
relative | boolean | Follow relative links only. |
no_parent | boolean | Do not ever ascend to the parent directory when retrieving recursively. |
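
The options in the table are passed as keyword arguments after the URL, as in Basic Usage. As a minimal sketch, the helper below gathers a few of them into a hash for a rate-limited, retrying download. `polite_wget_options` is a hypothetical name and not part of Kanrisuru, and the specific values are illustrative assumptions rather than recommended settings.

```ruby
# Hypothetical helper, not part of Kanrisuru: collects a handful of the
# documented wget options into one hash. All values are illustrative.
def polite_wget_options(directory, rate: '500k')
  {
    directory_prefix: directory,         # where downloaded files are saved
    no_clobber: true,                    # keep an existing local file
    retries: 3,                          # number of tries before giving up
    timeout: 30,                         # network timeout, in seconds
    wait: 2,                             # seconds between retrievals
    random_wait: true,                   # vary the wait between 0.5 and 1.5 * wait
    limit_rate: rate,                    # bytes per second; k and m suffixes allowed
    headers: { 'Accept' => 'text/html' } # key, value hash of HTTP headers
  }
end

opts = polite_wget_options('/home/ubuntu/downloads')

# With a connected host (see Basic Usage), the hash can be splatted in:
#   result = host.wget('https://kanrisuru.com', **opts)
```

Keeping the options in a plain hash makes it easy to reuse one set of defaults across several downloads and override single values, for example `polite_wget_options('/tmp', rate: '1m')`.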
Result
No explicit data struct is returned. Use `success?`, `failure?`, and `status` to see if the program exited properly.
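
A minimal sketch of consuming the result, using only the methods named above. It assumes `status` holds wget's numeric exit code, which this section does not state explicitly. `summarize_wget` is a hypothetical helper, not part of Kanrisuru.

```ruby
# Hypothetical helper, not part of Kanrisuru: turns a wget result into a
# short message. Assumes result.status is the numeric exit code.
def summarize_wget(result)
  if result.success?
    'download succeeded'
  else
    "download failed with status #{result.status}"
  end
end

# With a connected host (see Basic Usage):
#   result = host.wget('https://kanrisuru.com', directory_prefix: '/home/ubuntu/downloads')
#   puts summarize_wget(result)
```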
Exit Status
Code | Description |
---|---|
0 | Success |
1 | Generic error code |
2 | Parse error, for instance, when parsing command-line options, the .wgetrc or .netrc |
3 | File I/O error |
4 | Network failure |
5 | SSL verification failure |
6 | Username/password authentication failure |
7 | Protocol errors |
8 | Server issued an error response |
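
The table above can be mirrored in a small lookup for logging or error messages. This is a sketch only: `WGET_EXIT_STATUS` and `describe_wget_status` are hypothetical names, not part of Kanrisuru, and passing `result.status` to the helper assumes it returns the numeric exit code.

```ruby
# Hypothetical lookup, not part of Kanrisuru: descriptions copied from the
# Exit Status table (code 2 is shortened).
WGET_EXIT_STATUS = {
  0 => 'Success',
  1 => 'Generic error code',
  2 => 'Parse error',
  3 => 'File I/O error',
  4 => 'Network failure',
  5 => 'SSL verification failure',
  6 => 'Username/password authentication failure',
  7 => 'Protocol errors',
  8 => 'Server issued an error response'
}.freeze

# Returns the description for a code, or a fallback for codes not in the table.
def describe_wget_status(code)
  WGET_EXIT_STATUS.fetch(code, "Unknown exit status #{code}")
end

# Example, assuming result.status is numeric:
#   warn describe_wget_status(result.status) if result.failure?
```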
Tested On
- Ubuntu, Debian, CentOS, Fedora, Red Hat, openSUSE, SLES