pyhdfs module
WebHDFS client with support for NN HA and automatic error checking
For details on the WebHDFS endpoints, see the Hadoop documentation:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html
- class pyhdfs.ContentSummary(**kwargs: object)[source]
Bases:
_BoilerplateClass
- Parameters:
directoryCount (int) – The number of directories.
fileCount (int) – The number of files.
length (int) – The number of bytes used by the content.
quota (int) – The namespace quota of this directory.
spaceConsumed (int) – The disk space consumed by the content.
spaceQuota (int) – The disk space quota.
typeQuota (Dict[str, TypeQuota]) – Quota usage for ARCHIVE, DISK, SSD
- directoryCount: int
- fileCount: int
- length: int
- quota: int
- spaceConsumed: int
- spaceQuota: int
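As a rough illustration of how these fields combine, the sketch below computes how much of a directory's space quota is in use. It assumes that HDFS reports a negative quota (typically -1) when no quota is set. space_quota_used and sample are hypothetical names, and sample is a hand-made stand-in for the ContentSummary that get_content_summary() would return.

```python
from types import SimpleNamespace


def space_quota_used(summary):
    """Return the fraction of the space quota in use, or None if no quota is set.

    Assumption: a non-positive spaceQuota means no quota is configured.
    """
    if summary.spaceQuota <= 0:
        return None
    return summary.spaceConsumed / summary.spaceQuota


# Hand-made stand-in for a ContentSummary; real values come from the cluster.
sample = SimpleNamespace(
    directoryCount=3, fileCount=10, length=1024,
    quota=-1, spaceConsumed=3072, spaceQuota=12288,
)
print(space_quota_used(sample))
```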
- class pyhdfs.FileChecksum(**kwargs: object)[source]
Bases:
_BoilerplateClass
- Parameters:
algorithm (str) – The name of the checksum algorithm.
bytes (str) – The byte sequence of the checksum in hexadecimal.
length (int) – The length of the bytes (not the length of the string).
- algorithm: str
- bytes: str
- length: int
- class pyhdfs.FileStatus(**kwargs: object)[source]
Bases:
_BoilerplateClass
- Parameters:
accessTime (int) – The access time.
blockSize (int) – The block size of a file.
group (str) – The group owner.
length (int) – The number of bytes in a file.
modificationTime (int) – The modification time.
owner (str) – The user who is the owner.
pathSuffix (str) – The path suffix.
permission (str) – The permission represented as an octal string.
replication (int) – The replication factor of a file.
symlink (str | None) – The link target of a symlink.
type (str) – The type of the path object.
childrenNum (int) – How many children this directory has, or 0 for files.
- accessTime: int
- blockSize: int
- childrenNum: int
- group: str
- length: int
- modificationTime: int
- owner: str
- pathSuffix: str
- permission: str
- replication: int
- symlink: str | None
- type: str
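The fields above can be turned into a readable listing line. The following is a minimal sketch, not part of pyhdfs: describe and sample are hypothetical names, the type value "DIRECTORY" and the millisecond time unit are assumptions taken from WebHDFS conventions, and anything that is not a directory (including symlinks) is rendered as a regular file.

```python
import stat
from datetime import datetime, timezone
from types import SimpleNamespace


def describe(status):
    """Format a FileStatus-like object as a single `ls -l` style line."""
    mode = int(status.permission, 8)  # permission is an octal string, e.g. "644"
    kind = stat.S_IFDIR if status.type == "DIRECTORY" else stat.S_IFREG
    # Assumption: modificationTime is in milliseconds since the Unix epoch.
    modified = datetime.fromtimestamp(status.modificationTime / 1000, tz=timezone.utc)
    return (
        f"{stat.filemode(kind | mode)} {status.owner}:{status.group} "
        f"{status.length} {modified:%Y-%m-%d %H:%M} {status.pathSuffix}"
    )


# Hand-made stand-in for a FileStatus; real objects come from get_file_status().
sample = SimpleNamespace(
    permission="644", type="FILE", owner="alice", group="staff",
    length=12, modificationTime=0, pathSuffix="a.txt",
)
print(describe(sample))
```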
- exception pyhdfs.HdfsAccessControlException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsIOException
- class pyhdfs.HdfsClient(hosts: str | Iterable[str] = 'localhost', randomize_hosts: bool = True, user_name: str | None = None, timeout: float = 20, max_tries: int = 2, retry_delay: float = 5, scheme: str = 'http', requests_session: Session | None = None, requests_kwargs: dict[str, Any] | None = None)[source]
Bases:
object
HDFS client backed by WebHDFS.
All functions take arbitrary query parameters to pass to WebHDFS, in addition to any documented keyword arguments. In particular, any function will accept user.name, which for convenience may be passed as user_name.
If multiple HA NameNodes are given, all functions submit HTTP requests to both NameNodes until they find the active NameNode.
- Parameters:
hosts (list or str) – List of NameNode HTTP host:port strings, either as a list or a comma separated string. Port defaults to 50070 if left unspecified. Note that in Hadoop 3, the default NameNode HTTP port changed to 9870; the old default of 50070 is left as-is for backwards compatibility.
randomize_hosts (bool) – By default randomize host selection.
user_name – What Hadoop user to run as. Defaults to the HADOOP_USER_NAME environment variable if present, otherwise getpass.getuser().
timeout (float) – How long to wait on a single NameNode in seconds before moving on. In some cases the standby NameNode can be unresponsive (e.g. loading fsimage or checkpointing), so we don’t want to block on it.
max_tries (int) – How many times to retry a request for each NameNode. If NN1 is standby and NN2 is active, we might first contact NN1 and then observe a failover to NN1 when we contact NN2. In this situation we want to retry against NN1.
retry_delay (float) – How long to wait in seconds before going through NameNodes again.
scheme (str) – Use http by default or https for secure HDFS cluster.
requests_session – A requests.Session object for advanced usage. If absent, this class will use the default requests behavior of making a new session per HTTP request. Caller is responsible for closing session.
requests_kwargs – Additional **kwargs to pass to requests.
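A possible way to assemble these constructor arguments for an HA pair is sketched below. The host names are made up, client_kwargs and make_client are hypothetical helpers, and the numeric values simply repeat the defaults from the signature. The ports are written out because the default stays at 50070 even on Hadoop 3.

```python
def client_kwargs(namenodes, user_name=None):
    """Build keyword arguments for pyhdfs.HdfsClient from (host, port) pairs."""
    return {
        # Ports are spelled out because the default is 50070, not Hadoop 3's 9870.
        "hosts": ",".join(f"{host}:{port}" for host, port in namenodes),
        "user_name": user_name,  # None falls back to HADOOP_USER_NAME or the login user
        "timeout": 20,
        "max_tries": 2,
        "retry_delay": 5,
    }


def make_client(namenodes, user_name=None):
    """Create a client; pyhdfs is imported lazily so client_kwargs works without it."""
    import pyhdfs

    return pyhdfs.HdfsClient(**client_kwargs(namenodes, user_name))


# Hypothetical host names for an HA pair on Hadoop 3.
NAMENODES = [("nn1.example.com", 9870), ("nn2.example.com", 9870)]
print(client_kwargs(NAMENODES)["hosts"])
```

With pyhdfs installed, make_client(NAMENODES) would return a client that tries both NameNodes until it finds the active one.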
- append(path: str, data: bytes | IO[bytes], **kwargs: str | int | None | list[str]) None[source]
Append to the given file.
- Parameters:
data – bytes or a file-like object
buffersize (int) – The size of the buffer used in transferring data.
- concat(target: str, sources: list[str], **kwargs: str | int | None | list[str]) None[source]
Concat existing files together.
For preconditions, see https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#void_concatPath_p_Path_sources
- Parameters:
target – the path to the target destination.
sources (list) – the paths to the sources to use for the concatenation.
- copy_from_local(localsrc: str, dest: str, **kwargs: str | int | None | list[str]) None[source]
Copy a single file from the local file system to dest.
Takes all arguments that create() takes.
- copy_to_local(src: str, localdest: str, **kwargs: str | int | None | list[str]) None[source]
Copy a single file from src to the local file system.
Takes all arguments that open() takes.
- create(path: str, data: IO[bytes] | bytes, **kwargs: str | int | None | list[str]) None[source]
Create a file at the given path.
- Parameters:
data – bytes or a file-like object to upload
overwrite (bool) – If a file already exists, should it be overwritten?
blocksize (long) – The block size of a file.
replication (short) – The number of replications of a file.
permission (octal) – The permission of a file/directory. Any radix-8 integer (leading zeros may be omitted.)
buffersize (int) – The size of the buffer used in transferring data.
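To show how create() fits together with append() and open(), here is a minimal sketch of a write-then-read round trip. The helper name write_then_read is hypothetical, and client is assumed to be a pyhdfs.HdfsClient (or any object exposing the same three methods) connected to a cluster where appends are permitted.

```python
from contextlib import closing


def write_then_read(client, path, first, second):
    """Create `path`, append to it, then read the whole file back as bytes."""
    client.create(path, first, overwrite=True)  # replace any existing file
    client.append(path, second)
    # open() returns a file-like object; closing() releases it when done.
    with closing(client.open(path)) as stream:
        return stream.read()


# Example call against a real cluster (host name is made up):
#   import pyhdfs
#   client = pyhdfs.HdfsClient(hosts="nn1.example.com:9870")
#   write_then_read(client, "/tmp/demo.txt", b"hello ", b"world")
```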
- create_snapshot(path: str, **kwargs: str | int | None | list[str]) str[source]
Create a snapshot
- Parameters:
path – The directory where snapshots will be taken
snapshotname – The name of the snapshot
- Returns:
the snapshot path
- create_symlink(link: str, destination: str, **kwargs: str | int | None | list[str]) None[source]
Create a symbolic link at link pointing to destination.
- Parameters:
link – the path to be created that points to target
destination – the target of the symbolic link
createParent (bool) – If the parent directories do not exist, should they be created?
- Raises:
HdfsUnsupportedOperationException – This feature doesn’t actually work, at least on CDH 5.3.0.
- delete(path: str, **kwargs: str | int | None | list[str]) bool[source]
Delete a file.
- Parameters:
recursive (bool) – If path is a directory and recursive is set to true, the directory is deleted; otherwise an exception is thrown. For a file, recursive can be set to either true or false.
- Returns:
true if delete is successful else false.
- Return type:
bool
- delete_snapshot(path: str, snapshotname: str, **kwargs: str | int | None | list[str]) None[source]
Delete a snapshot of a directory
- exists(path: str, **kwargs: str | int | None | list[str]) bool[source]
Return true if the given path exists
- get_active_namenode(max_staleness: float | None = None) str[source]
Return the address of the currently active NameNode.
- Parameters:
max_staleness (float) – This function caches the active NameNode. If the age of this cached result is less than max_staleness seconds, return it. Otherwise, or if this parameter is None, do a lookup.
- Raises:
HdfsNoServerException – can’t find an active NameNode
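The max_staleness rule can be pictured with a small standalone cache. This is only a sketch of the semantics described above, not the code pyhdfs uses: StaleAwareCache and its lookup and clock arguments are hypothetical names.

```python
import time


class StaleAwareCache:
    """Cache one looked-up value and reuse it while it is fresh enough."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup
        self._clock = clock
        self._value = None
        self._stamp = None  # time of the last lookup, None before the first one

    def get(self, max_staleness=None):
        # Reuse the cached value only when a limit is given and the value is younger.
        if max_staleness is not None and self._stamp is not None:
            if self._clock() - self._stamp < max_staleness:
                return self._value
        # Otherwise (no limit given, nothing cached, or too old) do a fresh lookup.
        self._value = self._lookup()
        self._stamp = self._clock()
        return self._value
```

Passing max_staleness=None always triggers a lookup, which mirrors the default behaviour described for get_active_namenode().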
- get_content_summary(path: str, **kwargs: str | int | None | list[str]) ContentSummary[source]
Return the ContentSummary of a given Path.
- get_file_checksum(path: str, **kwargs: str | int | None | list[str]) FileChecksum[source]
Get the checksum of a file.
- Return type:
FileChecksum
- get_file_status(path: str, **kwargs: str | int | None | list[str]) FileStatus[source]
Return a FileStatus object that represents the path.
- get_home_directory(**kwargs: str | int | None | list[str]) str[source]
Return the current user’s home directory in this filesystem.
- get_xattrs(path: str, xattr_name: str | list[str] | None = None, encoding: str = 'text', **kwargs: str | int | None | list[str]) dict[str, bytes | str | None][source]
Get one or more xattr values for a file or directory.
- Parameters:
xattr_name – str to get one attribute, list to get multiple attributes, None to get all attributes.
encoding – text | hex | base64, defaults to text
- Returns:
Dictionary mapping xattr name to value. With text encoding, the value will be a unicode string. With hex or base64 encoding, the value will be a byte array.
- Return type:
dict
- list_status(path: str, **kwargs: str | int | None | list[str]) list[FileStatus][source]
List the statuses of the files/directories in the given path if the path is a directory.
- Return type:
list of FileStatus objects
- list_xattrs(path: str, **kwargs: str | int | None | list[str]) list[str][source]
Get all of the xattr names for a file or directory.
- Return type:
list
- listdir(path: str, **kwargs: str | int | None | list[str]) list[str][source]
Return a list containing names of files in the given path
- mkdirs(path: str, **kwargs: str | int | None | list[str]) bool[source]
Create a directory with the provided permission.
The permission of the directory is set to be the provided permission as in setPermission, not permission&~umask.
- Parameters:
permission (octal) – The permission of a file/directory. Any radix-8 integer (leading zeros may be omitted.)
- Returns:
true if the directory creation succeeds; false otherwise
- Return type:
bool
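Because permission is read as octal digits, a numeric Python mode should be rendered as an octal string before it is passed. The helper below is a small sketch; octal_permission is a hypothetical name, and the concern it addresses rests on the assumption that keyword arguments are forwarded to WebHDFS as plain query parameters, in which case the integer 0o755 would be sent as the decimal text 493.

```python
import stat


def octal_permission(mode):
    """Render a numeric mode as the octal digit string WebHDFS expects."""
    # S_IMODE keeps only the permission bits and drops any file-type bits.
    return format(stat.S_IMODE(mode), "o")


print(octal_permission(0o755))
print(octal_permission(stat.S_IRWXU | stat.S_IRGRP))
```

A call such as client.mkdirs("/data/reports", permission=octal_permission(0o750)) would then send 750. The same string form applies to the permission argument of create() and set_permission().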
- open(path: str, **kwargs: str | int | None | list[str]) IO[bytes][source]
Return a file-like object for reading the given HDFS path.
- Parameters:
offset (long) – The starting byte position.
length (long) – The number of bytes to be processed.
buffersize (int) – The size of the buffer used in transferring data.
- Return type:
file-like object
- remove_xattr(path: str, xattr_name: str, **kwargs: str | int | None | list[str]) None[source]
Remove an xattr of a file or directory.
- rename(path: str, destination: str, **kwargs: str | int | None | list[str]) bool[source]
Rename path to destination.
- Returns:
true if rename is successful
- Return type:
bool
- rename_snapshot(path: str, oldsnapshotname: str, snapshotname: str, **kwargs: str | int | None | list[str]) None[source]
Rename a snapshot
- set_owner(path: str, **kwargs: str | int | None | list[str]) None[source]
Set owner of a path (i.e. a file or a directory).
The parameters owner and group cannot both be null.
- Parameters:
owner – user
group – group
- set_permission(path: str, **kwargs: str | int | None | list[str]) None[source]
Set permission of a path.
- Parameters:
permission (octal) – The permission of a file/directory. Any radix-8 integer (leading zeros may be omitted.)
- set_replication(path: str, **kwargs: str | int | None | list[str]) bool[source]
Set replication for an existing file.
- Parameters:
replication (short) – new replication
- Returns:
true if successful; false if file does not exist or is a directory
- Return type:
bool
- set_times(path: str, **kwargs: str | int | None | list[str]) None[source]
Set the modification time and access time of a file.
- Parameters:
modificationtime (long) – Set the modification time of this file. The number of milliseconds since Jan 1, 1970.
accesstime (long) – Set the access time of this file. The number of milliseconds since Jan 1, 1970.
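Both values are expressed in milliseconds, whereas Python timestamps are in seconds. A minimal conversion sketch follows; to_hdfs_millis is a hypothetical helper and it insists on timezone-aware datetimes so that the result does not depend on the local time zone.

```python
from datetime import datetime, timezone


def to_hdfs_millis(moment):
    """Convert an aware datetime to whole milliseconds since Jan 1, 1970 UTC."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return round(moment.timestamp() * 1000)


stamp = to_hdfs_millis(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(stamp)
```

The result could then be passed as client.set_times(path, modificationtime=stamp, accesstime=stamp).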
- set_xattr(path: str, xattr_name: str, xattr_value: str | None, flag: str, **kwargs: str | int | None | list[str]) None[source]
Set an xattr of a file or directory.
- Parameters:
xattr_name – The name must be prefixed with the namespace followed by a dot (.). For example, user.attr.
flag – CREATE or REPLACE
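Since the caller has to pick between CREATE and REPLACE, one option is to check list_xattrs() first. The sketch below does that; upsert_xattr is a hypothetical helper, client is assumed to be a pyhdfs.HdfsClient, and the name check only verifies that some namespace prefix and a dot are present.

```python
def upsert_xattr(client, path, name, value):
    """Set an xattr, choosing CREATE or REPLACE based on whether it exists."""
    namespace, dot, rest = name.partition(".")
    if not (namespace and dot and rest):
        raise ValueError(f"xattr name needs a namespace prefix, e.g. 'user.{name}'")
    # list_xattrs() returns the attribute names currently set on the path.
    flag = "REPLACE" if name in client.list_xattrs(path) else "CREATE"
    client.set_xattr(path, name, value, flag)
    return flag
```

The check and the write are two separate requests, so another writer could create or remove the attribute in between; callers that care should be prepared for the second request to fail.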
- walk(top: str, topdown: bool = True, onerror: Callable[[HdfsException], None] | None = None, **kwargs: str | int | None | list[str]) Iterator[tuple[str, list[str], list[str]]][source]
See os.walk for documentation.
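Because walk() yields the same (dirpath, dirnames, filenames) tuples as os.walk, the usual traversal idioms carry over. The sketch below collects files by suffix; find_files is a hypothetical helper, client is assumed to be a pyhdfs.HdfsClient, and posixpath is used because HDFS paths always use forward slashes.

```python
import posixpath


def find_files(client, top, suffix):
    """Return the sorted full paths of all files under `top` ending in `suffix`."""
    matches = []
    for dirpath, dirnames, filenames in client.walk(top):
        for name in filenames:
            if name.endswith(suffix):
                matches.append(posixpath.join(dirpath, name))
    return sorted(matches)


# Example call against a real cluster:
#   find_files(client, "/data", ".csv")
```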
- exception pyhdfs.HdfsDSQuotaExceededException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsQuotaExceededException
- exception pyhdfs.HdfsException[source]
Bases:
Exception
Base class for all errors while communicating with WebHDFS server
- exception pyhdfs.HdfsFileAlreadyExistsException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsIOException
- exception pyhdfs.HdfsFileNotFoundException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsIOException
- exception pyhdfs.HdfsHadoopIllegalArgumentException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsIllegalArgumentException
- exception pyhdfs.HdfsHttpException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsException
The client was able to talk to the server but got an HTTP error code.
- Parameters:
message – Exception message
exception – Name of the exception
javaClassName – Java class name of the exception
status_code (int) – HTTP status code
kwargs – any extra attributes in case Hadoop adds more stuff
- exception pyhdfs.HdfsIOException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsHttpException
- exception pyhdfs.HdfsIllegalArgumentException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsHttpException
- exception pyhdfs.HdfsInvalidPathException(message: str, exception: str, status_code: int, **kwargs: object)[source]
- exception pyhdfs.HdfsNSQuotaExceededException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsQuotaExceededException
- exception pyhdfs.HdfsNoServerException[source]
Bases:
HdfsException
The client was not able to reach any of the given servers
- exception pyhdfs.HdfsPathIsNotEmptyDirectoryException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsIOException
- exception pyhdfs.HdfsQuotaExceededException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsIOException
- exception pyhdfs.HdfsRemoteException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsIOException
- exception pyhdfs.HdfsRetriableException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsIOException
- exception pyhdfs.HdfsRuntimeException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsHttpException
- exception pyhdfs.HdfsSecurityException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsHttpException
- exception pyhdfs.HdfsSnapshotException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsIOException
- exception pyhdfs.HdfsStandbyException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsIOException
- exception pyhdfs.HdfsUnsupportedOperationException(message: str, exception: str, status_code: int, **kwargs: object)[source]
Bases:
HdfsHttpException