kedro.io.PartitionedDataSet

class kedro.io.PartitionedDataSet(path, dataset, filepath_arg='filepath', filename_suffix='', credentials=None, load_args=None, fs_args=None)

Bases: kedro.io.core.AbstractDataSet

PartitionedDataSet loads and saves partitioned file-like data using the underlying dataset definition. For filesystem-level operations it uses fsspec: https://github.com/intake/filesystem_spec.

Example:
    import pandas as pd
    from kedro.io import PartitionedDataSet

    # these credentials will be passed to both 'fsspec.filesystem()' call
    # and the dataset initializer
    credentials = {"key1": "secret1", "key2": "secret2"}

    data_set = PartitionedDataSet(
        path="s3://bucket-name/path/to/folder",
        dataset="CSVDataSet",
        credentials=credentials
    )
    loaded = data_set.load()
    # assert isinstance(loaded, dict)

    combine_all = pd.DataFrame()

    for partition_id, partition_load_func in loaded.items():
        partition_data = partition_load_func()
        combine_all = pd.concat(
            [combine_all, partition_data], ignore_index=True, sort=True
        )

    new_data = pd.DataFrame({"new": [1, 2]})
    # creates "s3://bucket-name/path/to/folder/new/partition.csv"
    data_set.save({"new/partition.csv": new_data})
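The example above shows that load() returns a dictionary mapping each partition id to a function that loads that partition on demand. The following standard-library sketch imitates that contract without kedro, to illustrate that partitions can be selected by id before any file is read. The helpers read_csv_rows and lazy_partitions are hypothetical stand-ins, not part of the kedro API.

```python
import csv
import tempfile
from functools import partial
from pathlib import Path


def read_csv_rows(path):
    """Read one CSV file into a list of dicts (stand-in for a dataset's load)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def lazy_partitions(folder):
    """Map each partition id (path relative to folder) to a zero-argument loader."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): partial(read_csv_rows, path)
        for path in sorted(folder.rglob("*.csv"))
    }


# Build a throwaway folder with two partitions to exercise the sketch.
tmp = tempfile.TemporaryDirectory()
root = Path(tmp.name)
(root / "2020").mkdir()
(root / "2020" / "jan.csv").write_text("value\n1\n2\n")
(root / "2020" / "feb.csv").write_text("value\n3\n")

loaded = lazy_partitions(root)

# No file has been read yet; choose partitions by id, then call the loaders.
rows = []
for partition_id, load_partition in loaded.items():
    if partition_id.endswith("jan.csv"):
        rows.extend(load_partition())

tmp.cleanup()
```

Only the January partition is read here; the February loader is never called, so its file is never opened.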
Methods

- PartitionedDataSet.__init__(path, dataset[, …]) – Creates a new instance of PartitionedDataSet.
- PartitionedDataSet.exists() – Checks whether a data set’s output already exists by calling the provided _exists() method.
- PartitionedDataSet.from_config(name, config) – Create a data set instance using the configuration provided.
- PartitionedDataSet.load() – Loads data by delegation to the provided load method.
- PartitionedDataSet.release() – Release any cached data.
- PartitionedDataSet.save(data) – Saves data by delegation to the provided save method.
__init__(path, dataset, filepath_arg='filepath', filename_suffix='', credentials=None, load_args=None, fs_args=None)

Creates a new instance of PartitionedDataSet.

Parameters:
- path (str) – Path to the folder containing partitioned data. If path starts with a protocol (e.g., s3://), the corresponding fsspec concrete filesystem implementation will be used. If no protocol is specified, fsspec.implementations.local.LocalFileSystem will be used. Note: Some concrete implementations are bundled with fsspec, while others (like s3 or gcs) must be installed separately before using PartitionedDataSet.
- dataset (Union[str, Type[AbstractDataSet], Dict[str, Any]]) – Underlying dataset definition. This is used to instantiate the dataset for each file located inside the path. Accepted formats are: a) a class that inherits from AbstractDataSet; b) a string representing a fully qualified class name of such a class; c) a dictionary with a type key pointing to a string from b), where the other keys are passed to the dataset initializer. Credentials for the dataset can be explicitly specified in this configuration.
- filepath_arg (str) – Underlying dataset initializer argument that will contain a path to each corresponding partition file. If unspecified, defaults to “filepath”.
- filename_suffix (str) – If specified, only partitions that end with this string will be processed.
- credentials (Optional[Dict[str, Any]]) – Protocol-specific options that will be passed to fsspec.filesystem (https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.filesystem) and to the dataset initializer. If the dataset config contains an explicit credentials spec, that spec takes precedence. Note: the dataset_credentials key has been deprecated and should not be specified. All possible credentials management scenarios are documented here: https://kedro.readthedocs.io/en/stable/04_user_guide/08_advanced_io.html#partitioned-dataset-credentials
- load_args (Optional[Dict[str, Any]]) – Keyword arguments to be passed into the find() method of the filesystem implementation.
- fs_args (Optional[Dict[str, Any]]) – Extra arguments to pass into the underlying filesystem class constructor (e.g. {“project”: “my-project”} for GCSFileSystem).

Raises: DataSetError – If versioning is enabled for the underlying dataset.
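The dataset and credentials parameters interact: a dictionary definition may carry its own credentials, and such an explicit spec takes precedence over the top-level credentials argument. The sketch below illustrates that rule in plain Python. It is not kedro's implementation; split_definition and resolve_dataset_credentials are hypothetical helper names introduced only for this illustration.

```python
from copy import deepcopy


def split_definition(dataset):
    """Reduce a str / class / dict definition to a (type, config) pair."""
    if isinstance(dataset, dict):
        config = deepcopy(dataset)
        if "type" not in config:
            raise ValueError("a dict definition needs a 'type' key")
        return config.pop("type"), config
    # Formats a) and b): a class or a class-name string, with no extra config.
    return dataset, {}


def resolve_dataset_credentials(dataset, credentials=None):
    """Pick the credentials the underlying dataset would receive (assumed rule)."""
    _, config = split_definition(dataset)
    if "credentials" in config:
        # Explicit credentials in the dataset config take precedence.
        return config["credentials"]
    return deepcopy(credentials)


top_level = {"key1": "secret1"}

# b) string form: the top-level credentials are reused for the dataset.
from_string = resolve_dataset_credentials("CSVDataSet", top_level)

# c) dict form with its own credentials: the explicit spec wins.
definition = {"type": "CSVDataSet", "credentials": {"key1": "other"}}
from_dict = resolve_dataset_credentials(definition, top_level)
```

In both cases the top-level credentials would still be the ones handed to fsspec.filesystem; only the dataset initializer sees the override.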
exists()

Checks whether a data set’s output already exists by calling the provided _exists() method.

Return type: bool
Returns: Flag indicating whether the output already exists.
Raises: DataSetError – When the underlying exists method raises an error.
classmethod from_config(name, config, load_version=None, save_version=None)

Create a data set instance using the configuration provided.

Parameters:
- name (str) – Data set name.
- config (Dict[str, Any]) – Data set config dictionary.
- load_version (Optional[str]) – Version string to be used for the load operation if the data set is versioned. Has no effect on the data set if versioning was not enabled.
- save_version (Optional[str]) – Version string to be used for the save operation if the data set is versioned. Has no effect on the data set if versioning was not enabled.

Return type: AbstractDataSet
Returns: An instance of an AbstractDataSet subclass.
Raises: DataSetError – When the function fails to create the data set from its config.
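As an illustration of the config argument, the dictionary below shows what an entry for this class might look like. It assumes that the type key selects the data set class and that the remaining keys are forwarded to __init__; the name "my_partitions" and all values are placeholders, and the from_config call is left as a comment so the snippet runs without kedro installed.

```python
# Hypothetical catalog entry expressed as a Python dict; values are placeholders.
config = {
    "type": "PartitionedDataSet",
    "path": "s3://bucket-name/path/to/folder",
    "dataset": "CSVDataSet",
    "filename_suffix": ".csv",
}

# With kedro installed, the entry could be turned into a data set like this:
# data_set = PartitionedDataSet.from_config("my_partitions", config)

# Assumed split: everything except "type" corresponds to an __init__ parameter.
init_kwargs = {key: value for key, value in config.items() if key != "type"}
```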
load()

Loads data by delegation to the provided load method.

Return type: Any
Returns: Data returned by the provided load method.
Raises: DataSetError – When the underlying load method raises an error.
release()

Release any cached data.

Raises: DataSetError – When the underlying release method raises an error.
Return type: None
save(data)

Saves data by delegation to the provided save method.

Parameters: data (Any) – The value to be saved by the provided save method.
Raises: DataSetError – When the underlying save method raises an error.
Return type: None
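For this class, the class-level example passes save() a dictionary whose keys are partition ids, and each key becomes a path below the data set folder. The sketch below shows one plausible mapping between partition ids and paths. It is an assumption-laden illustration rather than kedro's code: partition_to_path and path_to_partition are hypothetical helpers, and appending filename_suffix on save and stripping it on load is assumed behaviour.

```python
def partition_to_path(root, partition_id, filename_suffix=""):
    """Build the path a partition would be written to."""
    return root.rstrip("/") + "/" + partition_id.lstrip("/") + filename_suffix


def path_to_partition(root, path, filename_suffix=""):
    """Recover the partition id from a path located under root."""
    prefix = root.rstrip("/") + "/"
    if not path.startswith(prefix):
        raise ValueError(f"{path!r} is not located under {root!r}")
    partition_id = path[len(prefix):]
    if filename_suffix and partition_id.endswith(filename_suffix):
        partition_id = partition_id[: -len(filename_suffix)]
    return partition_id


root = "s3://bucket-name/path/to/folder"

# Matches the class example: the key becomes a path below the folder.
saved_to = partition_to_path(root, "new/partition.csv")

# With a suffix configured, ids can omit the extension (assumed behaviour).
with_suffix = partition_to_path(root, "new/partition", filename_suffix=".csv")
round_trip = path_to_partition(root, with_suffix, filename_suffix=".csv")
```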