scitex_io
scitex-io: Universal scientific data I/O with plugin registry.
Supports 30+ formats out of the box. Register custom handlers via:
from scitex_io import register_saver, register_loader
@register_saver(".myformat")
def save_myformat(obj, path, **kw): ...
@register_loader(".myformat")
def load_myformat(path, **kw): ...
- scitex_io.register_saver(ext, fn=None, *, builtin=False)[source]
Register a save handler for a file extension.
Can be used as a decorator or called directly:
@register_saver(".json")
def my_json_saver(obj, path, **kwargs): ...

register_saver(".json", my_json_saver)
- scitex_io.register_loader(ext, fn=None, *, builtin=False)[source]
Register a load handler for a file extension.
Same API as register_saver().
- scitex_io.unregister_saver(ext)[source]
Remove a user-registered saver. Returns True if found.
- Return type:
bool
- scitex_io.unregister_loader(ext)[source]
Remove a user-registered loader. Returns True if found.
- Return type:
bool
- scitex_io.save(obj, specified_path, makedirs=True, verbose=True, symlink_from_cwd=False, symlink_to=None, dry_run=False, no_csv=False, use_caller_path=False, **kwargs)[source]
Save an object to a file with the specified format.
- Parameters:
obj (Any) – The object to be saved.
specified_path (Union[str, Path]) – The file name or path where the object should be saved.
makedirs (bool, optional) – If True, create the directory path if it does not exist. Default is True.
verbose (bool, optional) – If True, print a message upon successful saving. Default is True.
symlink_from_cwd (bool, optional) – If True, create a symlink from the current working directory. Default is False.
symlink_to (Union[str, Path], optional) – If specified, create a symlink at this path pointing to the saved file.
dry_run (bool, optional) – If True, simulate the saving process without writing files. Default is False.
no_csv (bool, optional) – If True, skip CSV export for image saves. Default is False.
use_caller_path (bool, optional) – If True, skip internal library frames for path detection. Default is False.
**kwargs – Additional keyword arguments to pass to the underlying save function.
- Returns:
Path to saved file on success, False on error.
- Return type:
Path or None
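The interplay of `makedirs` and `dry_run` can be illustrated with a small standalone function. This is only a sketch under assumptions: `save_json_sketch` is a hypothetical name, it handles JSON only, and what the real `save` returns during a dry run is not stated above, so returning the target path here is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional, Union


def save_json_sketch(
    obj: Any,
    path: Union[str, Path],
    makedirs: bool = True,
    dry_run: bool = False,
) -> Optional[Path]:
    path = Path(path)
    if dry_run:
        # Report the target without touching the file system (assumed behavior).
        return path
    if makedirs:
        # Create missing parent directories before writing.
        path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results" / "run1" / "metrics.json"
    planned = save_json_sketch({"acc": 0.9}, target, dry_run=True)
    existed_after_dry_run = target.exists()
    saved = save_json_sketch({"acc": 0.9}, target)
    existed_after_save = target.exists()
```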
- scitex_io.load(lpath, ext=None, show=False, verbose=False, cache=True, **kwargs)[source]
Load data from various file formats.
This function supports loading data from multiple file formats with optional caching.
- Parameters:
lpath (Union[str, Path]) – The path to the file to be loaded. Can be a string or pathlib.Path object.
ext (str, optional) – File extension to use for loading. If None, automatically detects from filename. Useful for files without extensions (e.g., UUID-named files). Examples: ‘pdf’, ‘json’, ‘csv’
show (bool, optional) – If True, display additional information during loading. Default is False.
verbose (bool, optional) – If True, print verbose output during loading. Default is False.
cache (bool, optional) – If True, enable caching for faster repeated loads. Default is True.
**kwargs (dict) – Additional keyword arguments to be passed to the specific loading function.
- Returns:
The loaded data object, which can be of various types depending on the input file format.
- Return type:
- Raises:
ValueError – If the file extension is not supported.
FileNotFoundError – If the specified file does not exist.
Supported Extensions:
- Data formats: .csv, .tsv, .xls, .xlsx, .xlsm, .xlsb, .json, .yaml, .yml
- Scientific: .npy, .npz, .mat, .hdf5, .con
- ML/DL: .pth, .pt, .cbm, .joblib, .pkl
- Documents: .txt, .log, .event, .md, .docx, .pdf, .xml
- Images: .jpg, .png, .tiff, .tif
- EEG data: .vhdr, .vmrk, .edf, .bdf, .gdf, .cnt, .egi, .eeg, .set
- Database: .db
Examples
>>> data = load('data.csv')
>>> image = load('image.png')
>>> model = load('model.pth')
>>> # Load file without extension (e.g., UUID PDF)
>>> pdf = load('f2694ccb-1b6f-4994-add8-5111fd4d52f1', ext='pdf')
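Extension-based dispatch with an explicit `ext` override, as described for `load`, might look roughly like the following. It is a sketch only: `load_sketch`, `_LOADERS`, and the two loader helpers are hypothetical names, only JSON and CSV are covered, and caching is omitted.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Optional, Union


def _load_json(path: Path) -> Any:
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def _load_csv(path: Path) -> list:
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


# Hypothetical loader table keyed by lowercase suffix.
_LOADERS: Dict[str, Callable[[Path], Any]] = {".json": _load_json, ".csv": _load_csv}


def load_sketch(path: Union[str, Path], ext: Optional[str] = None) -> Any:
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(path)
    # An explicit `ext` wins; otherwise fall back to the file name's suffix.
    suffix = "." + ext.lstrip(".") if ext else path.suffix
    loader = _LOADERS.get(suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported extension: {suffix!r}")
    return loader(path)


with tempfile.TemporaryDirectory() as tmp:
    named = Path(tmp) / "data.json"
    named.write_text('{"a": 1}', encoding="utf-8")
    bare = Path(tmp) / "f2694ccb"  # no extension, so `ext` must be given
    bare.write_text("x,y\n1,2\n", encoding="utf-8")
    from_suffix = load_sketch(named)
    from_ext = load_sketch(bare, ext="csv")
```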
- scitex_io.load_configs(IS_DEBUG=None, show=False, verbose=False, config_dir=None)[source]
Load YAML configuration files from specified directory.
- Parameters:
IS_DEBUG (bool, optional) – Debug mode flag. If None, reads from IS_DEBUG.yaml
show (bool) – Show configuration changes
verbose (bool) – Print detailed information
config_dir (Union[str, Path], optional) – Directory containing configuration files. Can be a string or pathlib.Path object. Defaults to “./config” if None
- Returns:
Merged configuration dictionary
- Return type:
DotDict
- scitex_io.glob(expression, parse=False, ensure_one=False)[source]
Perform a glob operation with natural sorting and extended pattern support.
This function extends the standard glob functionality by adding natural sorting and support for curly brace expansion in the glob pattern.
- Parameters:
expression (Union[str, Path]) – The glob pattern to match against file paths. Can be a string or pathlib.Path object. Supports standard glob syntax and curly brace expansion (e.g., ‘dir/{a,b}/*.txt’).
parse (bool, optional) – Whether to parse the matched paths. Default is False.
ensure_one (bool, optional) – Ensure exactly one match is found. Default is False.
- Returns:
If parse=False, a naturally sorted list of file paths. If parse=True, a tuple of (paths, parsed results).
- Return type:
Union[List[str], Tuple[List[str], List[dict]]]
Examples
>>> glob('data/*.txt')
['data/file1.txt', 'data/file2.txt', 'data/file10.txt']
>>> glob('data/{a,b}/*.txt')
['data/a/file1.txt', 'data/a/file2.txt', 'data/b/file1.txt']
>>> paths, parsed = glob('data/subj_{id}/run_{run}.txt', parse=True)
>>> paths
['data/subj_001/run_01.txt', 'data/subj_001/run_02.txt']
>>> parsed
[{'id': '001', 'run': '01'}, {'id': '001', 'run': '02'}]
>>> paths, parsed = glob('data/subj_{id}/run_{run}.txt', parse=True, ensure_one=True)
AssertionError  # if more than one file matches
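Natural sorting and curly brace expansion, the two extensions named above, can each be written in a few lines of standard-library Python. The helpers `natural_key` and `expand_braces` are hypothetical names for this sketch and are not taken from scitex_io. Requiring a comma inside the braces is a design choice of the sketch so that single-name groups such as `{id}` are left alone.

```python
import re
from typing import List


def natural_key(path: str) -> list:
    # Split into digit and non-digit runs so "file10" sorts after "file2".
    return [int(tok) if tok.isdigit() else tok.lower() for tok in re.split(r"(\d+)", path)]


def expand_braces(pattern: str) -> List[str]:
    # Expand the first {a,b,...} group, then recurse on each result.
    match = re.search(r"\{([^{}]*,[^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    expanded: List[str] = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


names = ["data/file10.txt", "data/file2.txt", "data/file1.txt"]
sorted_names = sorted(names, key=natural_key)
# sorted_names == ['data/file1.txt', 'data/file2.txt', 'data/file10.txt']

patterns = expand_braces("dir/{a,b}/*.txt")
# patterns == ['dir/a/*.txt', 'dir/b/*.txt']
```

Each expanded pattern could then be passed to the standard `glob.glob` and the combined results sorted with `natural_key`.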
- scitex_io.parse_glob(expression, ensure_one=False)[source]
Convenience function for glob with parsing enabled.
- Parameters:
expression (Union[str, Path]) – The glob pattern to match against file paths. Can be a string or pathlib.Path object.
ensure_one (bool, optional) – Ensure exactly one match is found. Default is False.
- Returns:
Matched paths and parsed results.
- Return type:
Tuple[List[str], List[dict]]
Examples
>>> paths, parsed = parse_glob('data/subj_{id}/run_{run}.txt')
>>> paths
['data/subj_001/run_01.txt', 'data/subj_001/run_02.txt']
>>> parsed
[{'id': '001', 'run': '01'}, {'id': '001', 'run': '02'}]
>>> paths, parsed = parse_glob('data/subj_{id}/run_{run}.txt', ensure_one=True)
AssertionError  # if more than one file matches
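One way to obtain the parsed dictionaries shown in the examples is to turn each `{name}` placeholder into a named regex group. The following is a sketch under that assumption; `to_glob_pattern`, `placeholder_regex`, and `parse_paths` are hypothetical helpers, and the actual parsing in scitex_io may work differently.

```python
import re
from typing import Dict, List


def to_glob_pattern(expression: str) -> str:
    # Each {name} placeholder becomes a wildcard for the file-system search.
    return re.sub(r"\{\w+\}", "*", expression)


def placeholder_regex(expression: str) -> re.Pattern:
    # re.split with a capture group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", expression)
    regex = ""
    for i, part in enumerate(parts):
        # Odd positions are names: match one path component and remember it.
        regex += f"(?P<{part}>[^/]+)" if i % 2 else re.escape(part)
    return re.compile(regex + r"\Z")


def parse_paths(expression: str, paths: List[str]) -> List[Dict[str, str]]:
    pattern = placeholder_regex(expression)
    parsed = []
    for path in paths:
        match = pattern.match(path)
        if match is not None:
            parsed.append(match.groupdict())
    return parsed


paths = ["data/subj_001/run_01.txt", "data/subj_001/run_02.txt"]
parsed = parse_paths("data/subj_{id}/run_{run}.txt", paths)
glob_pattern = to_glob_pattern("data/subj_{id}/run_{run}.txt")
```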
- scitex_io.reload(module_or_func, verbose=False)[source]
Reload a module or the module containing a given function.
This function attempts to reload a module directly if a module is passed, or reloads the module containing the function if a function is passed. This is useful during development to reflect changes without restarting the Python interpreter.
- Parameters:
module_or_func (module or function) – The module to reload, or a function whose containing module should be reloaded.
verbose (bool, optional) – If True, print additional information during the reload process. Default is False.
- Returns:
None
- Raises:
Exception – If the module cannot be found or if there’s an error during the reload process.
Notes
Reloading modules can have unexpected side effects, especially for modules that maintain state or have complex imports. Use with caution.
This function modifies sys.modules, which affects the global state of the Python interpreter.
Examples
>>> import my_module
>>> reload(my_module)
>>> from my_module import my_function
>>> reload(my_function)
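Resolving a function to its defining module and reloading that module can be done with `inspect.getmodule` and `importlib.reload` from the standard library. The sketch below uses the hypothetical name `reload_sketch` and, unlike the documented function, returns the reloaded module so the result can be inspected.

```python
import importlib
import inspect
import json
from types import ModuleType


def reload_sketch(module_or_func) -> ModuleType:
    # Accept a module directly, or resolve the module that defines the function.
    if isinstance(module_or_func, ModuleType):
        module = module_or_func
    else:
        module = inspect.getmodule(module_or_func)
        if module is None:
            raise ImportError(f"Cannot find the module of {module_or_func!r}")
    # importlib.reload re-executes the module's code in the existing module object.
    return importlib.reload(module)


# Passing a function reloads the module in which it is defined.
reloaded = reload_sketch(json.dumps)
```

Names imported earlier with `from module import name` keep pointing at the old objects after a reload, which is one source of the side effects mentioned in the notes.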
- scitex_io.flush(sys=<module 'sys' (built-in)>)[source]
Flushes the system’s stdout and stderr, and syncs the file system. This ensures all pending write operations are completed.
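A rough standard-library equivalent of this behavior is shown below. It is a sketch, assuming that "syncs the file system" corresponds to `os.sync()`, which is only available on Unix-like systems; `flush_sketch` is a hypothetical name.

```python
import os
import sys


def flush_sketch() -> None:
    # Push buffered text out of Python's stdout and stderr.
    sys.stdout.flush()
    sys.stderr.flush()
    # Ask the OS to write dirty buffers to disk; os.sync is Unix-only.
    if hasattr(os, "sync"):
        os.sync()
```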
- scitex_io.cache(id, *args)[source]
Store or fetch data using a pickle file.
This function provides a simple caching mechanism for storing and retrieving Python objects. It uses pickle to serialize the data and stores it in a file with a unique identifier. If the data is already cached, it can be retrieved without recomputation.
- Parameters:
id (str) – A unique identifier for the cache file.
*args (str) – Variable names to be cached or loaded.
- Returns:
A tuple of cached values corresponding to the input variable names.
- Return type:
tuple
- Raises:
ValueError – If the cache file is not found and not all variables are defined.
Example
>>> import scitex
>>> import numpy as np
>>>
>>> # Variables to cache
>>> var1 = "x"
>>> var2 = 1
>>> var3 = np.ones(10)
>>>
>>> # Saving
>>> var1, var2, var3 = scitex.io.cache("my_id", "var1", "var2", "var3")
>>> print(var1, var2, var3)
>>>
>>> # Loading when not all variables are defined and the id exists
>>> del var1, var2, var3
>>> var1, var2, var3 = scitex.io.cache("my_id", "var1", "var2", "var3")
>>> print(var1, var2, var3)
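The store-or-fetch idea behind `cache` can be sketched with `pickle` and an explicit compute function. This differs from the documented interface, which takes variable names as strings; `cache_sketch`, its `compute` callable, and the `cache_dir` argument are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def cache_sketch(cache_id: str, compute: Callable[[], Dict[str, Any]], cache_dir: Path) -> Dict[str, Any]:
    cache_file = cache_dir / f"{cache_id}.pkl"
    if cache_file.exists():
        # Cache hit: return the stored values without recomputing.
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    # Cache miss: compute once, then store for later calls.
    values = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    with open(cache_file, "wb") as f:
        pickle.dump(values, f)
    return values


calls = []


def expensive() -> Dict[str, Any]:
    calls.append(1)  # record how often the computation actually runs
    return {"var1": "x", "var2": 1}


with tempfile.TemporaryDirectory() as tmp:
    first = cache_sketch("my_id", expensive, Path(tmp))
    second = cache_sketch("my_id", expensive, Path(tmp))
```

Because unpickling can execute arbitrary code, a cache like this should only read files that the same trusted process wrote.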
- class scitex_io.H5Explorer(filepath, mode='r')[source]
Bases: object
Interactive HDF5 file explorer.
This class provides convenient methods to explore HDF5 files, inspect their structure, and load data.
Example
>>> explorer = H5Explorer('data.h5')
>>> explorer.explore()  # Display file structure
>>> data = explorer.load('group1/dataset1')  # Load specific dataset
>>> explorer.close()
- scitex_io.has_h5_key(h5_path, key, max_retries=3, action_on_corrupted='delete')[source]
Check whether a key exists in an HDF5 file. This is a robust version of the check that handles corrupted files and lock conflicts.
- class scitex_io.ZarrExplorer(storepath, mode='r')[source]
Bases: object
Interactive Zarr store explorer.
- scitex_io.has_zarr_key(zarr_path, key)[source]
Check if key exists in Zarr store (no locking issues!).
- Return type:
bool
- scitex_io.get_cache_info()[source]
Get cache statistics and configuration.
- Returns:
Cache information including stats and config
- Return type:
Dict[str, Any]
- scitex_io.configure_cache(enabled=None, max_size=None, verbose=None)[source]
Configure cache settings.
- scitex_io.save_text(obj, spath)
Save text content to a file.
- scitex_io.save_mp4(fig, spath_mp4)
- scitex_io.save_listed_dfs_as_csv(listed_dfs, spath_csv, indi_suffix=None, overwrite=False, verbose=False)
- Parameters:
listed_dfs – List of DataFrames, [df1, df2, df3, …, dfN]. They are written vertically in that order.
spath_csv – Path of the output CSV file, e.g. /hoge/fuga/foo.csv.
indi_suffix – Labels for the individual DataFrames. ‘{}’.format(indi_suffix[i]) is written to the top-left cell in the output CSV file, where i is the index of the DataFrame. If indi_suffix=None, ‘{}’.format(i) is written instead.
- scitex_io.save_listed_scalars_as_csv(listed_scalars, spath_csv, column_name='_', indi_suffix=None, round=3, overwrite=False, verbose=False)
Put the scalars into a DataFrame and save it as a CSV file.
- scitex_io.migrate_h5_to_zarr(h5_path, zarr_path=None, compressor='zstd', chunks=True, overwrite=False, show_progress=True, validate=True)[source]
Migrate HDF5 file to Zarr format.
- Parameters:
h5_path (str or Path) – Path to input HDF5 file
zarr_path (str or Path, optional) – Path for output Zarr store. If None, uses h5_path with .zarr extension
compressor (str or compressor object, optional) – Compression to use: ‘zstd’, ‘lz4’, ‘gzip’, ‘blosc’, or None
chunks (bool or tuple, optional) – Chunking strategy. True for auto, False for no chunks, or specific shape
overwrite (bool, optional) – Whether to overwrite existing Zarr store
show_progress (bool, optional) – Whether to show migration progress
validate (bool, optional) – Whether to validate the migration by comparing shapes
- Returns:
Path to created Zarr store
- Return type:
- scitex_io.migrate_h5_to_zarr_batch(h5_paths, output_dir=None, compressor='zstd', chunks=True, overwrite=False, parallel=False, n_workers=None)[source]
Migrate multiple HDF5 files to Zarr format.
- Parameters:
h5_paths (list of str or Path) – List of HDF5 files to migrate
output_dir (str or Path, optional) – Directory for output Zarr stores
compressor (str or compressor object, optional) – Compression to use
overwrite (bool, optional) – Whether to overwrite existing Zarr stores
parallel (bool, optional) – Whether to process files in parallel
n_workers (int, optional) – Number of parallel workers
- Returns:
Paths to created Zarr stores
- Return type: