Uaine.Pydat Package

This package provides comprehensive tools for data handling, processing, and database operations in Python. It includes specialised functions for file I/O operations, data transformation, data cleaning, system information gathering, and data generation. Key features include PSV file handling, cryptographic hash functions, data table cleaning methods, DuckDB integration with snippet queries, random data generation utilities, and XML/config file parsers. The package aims to simplify common data manipulation tasks while providing a flexible framework for both basic and advanced data operations.

License: GPL v3

Description

A Python package that streamlines data handling, processing, and database operations through a collection of utility functions and tools.

Uaine.Pydat provides a comprehensive toolkit for data scientists, analysts, and developers working with structured data. The package simplifies common data manipulation tasks while offering specialized functionality for file operations, data transformation, table cleaning, and database interactions.

Key Features include File I/O Operations for reading/writing various file formats, listing files by extension, and managing system paths; Data Transformation tools for reshaping, converting, and manipulating data structures with minimal code; Data Cleaning methods to sanitize, standardize, and prepare data tables for analysis; DuckDB Integration with helper functions and snippet queries; Cryptographic Hashing for data integrity and anonymization; System Information utilities to gather system metrics and resource usage; Data Generation for testing and development scenarios; and Configuration Handling for XML, INI, and other formats.

Core Modules include dataio.py for data input/output operations and format conversion; fileio.py for file system operations and path management; datatransform.py for data structure transformation and manipulation; dataclean.py for data cleaning and standardization; duckfunc.py for DuckDB database interactions; datahash.py for cryptographic hashing functions; systeminfo.py for system information gathering; datagen.py for random data generation; and bitgen.py for low-level bit generation utilities.

Common Use Cases for this package include simplifying ETL workflows, streamlining data preparation for analysis and machine learning, managing database operations with less boilerplate code, generating test data for development, monitoring system resources during data processing tasks, and securing sensitive data through hashing and anonymization.

This package aims to reduce the complexity of common data handling tasks by providing ready-made solutions that follow best practices while remaining flexible enough to adapt to various data processing requirements.
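
A minimal quick-start sketch is shown below; the pip distribution name and import path are assumptions and may differ in the installed package:

# pip install uaine-pydat   (assumed distribution name)
from uaine_pydat import datagen, dataclean, dataio  # assumed import path

df = datagen.gen_sample_dataframe(rows=100)   # synthetic test data
df = dataclean.clean_whitespace_in_df(df)     # trim whitespace in string columns
dataio.write_flat_df(df, 'sample.csv')        # write to a flat file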

Dependencies

  • pandas

  • pyreadstat

  • requests

  • duckdb

  • wheel

  • twine

  • psutil

  • lxml

  • polars

  • azure-storage-blob

  • tqdm

bitgen

bitgen.gen_random_hex_string(bitlength) str

Generates a random hexadecimal string of the given bit length.

Parameters:

bitlength (int): The length of the bit string to generate.

Returns:

str: A random hexadecimal string of the given bit length.

bitgen.generate_256_bit_string() str

Generates a random 256-bit hexadecimal string.

Returns:

str: A random 256-bit hexadecimal string.

bitgen.generate_8_bit_string() str

Generates a random 8-bit hexadecimal string.

Returns:

str: A random 8-bit hexadecimal string.

bitgen.generate_custom_uuid(marker: str = '-') str

Generates a random UUID (Universally Unique Identifier) using UUID version 4, formatted with a specified marker.

Parameters:

marker (str): The character to use as a separator in the UUID. Default is ‘-‘.

Returns:

str: A string representation of a random UUID with the specified marker.

bitgen.generate_random_uuid() str

Generates a random UUID (Universally Unique Identifier) using UUID version 4, with dashes removed.

Returns:

str: A string representation of a random UUID without dashes.
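
Example usage (a sketch; the import path is an assumption):

from uaine_pydat import bitgen  # assumed import path

token = bitgen.gen_random_hex_string(128)         # random 128-bit hex string
key = bitgen.generate_256_bit_string()            # random 256-bit hex string
uid = bitgen.generate_random_uuid()               # UUID4 with dashes removed
custom = bitgen.generate_custom_uuid(marker='_')  # UUID4 with '_' as the separator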

blobhelper

blobhelper.check_sas_token(account_url, container, sastoken)

Checks if the provided SAS token is valid for the given container.

Args:

account_url (str): The Azure Storage account URL. container (str): The name of the container to check access for. sastoken (str): The SAS token to validate.

Returns:

bool: True if the SAS token is valid and has access, False otherwise. str: Optional error message if invalid.

blobhelper.download_all_blobs(account_url, container, folder_path, sastoken, download_loc, file_extn='', makedirs=True)

Download all blobs from an Azure Storage container to a local directory.

Args:

account_url (str): The Azure Storage account URL.
container (str): The name of the container to download blobs from.
folder_path (str): The folder path prefix to filter blobs by.
sastoken (str): The SAS token for authentication.
download_loc (str): Local directory path where blobs will be downloaded.
file_extn (str, optional): File extension to filter blobs by (e.g., 'txt', 'pdf'). If empty, downloads all blobs. Defaults to "".
makedirs (bool, optional): Whether to create the download directory if it doesn't exist. Defaults to True.

Note:

This function will create the download directory if it doesn’t exist and makedirs is True. Files are downloaded with their original names from the blob storage.

blobhelper.get_account_url(storage_account)

Generate the Azure Storage account URL from the storage account name.

Args:

storage_account (str): The name of the Azure Storage account.

Returns:
str: The complete Azure Storage account URL in the format
'https://{storage_account}.blob.core.windows.net'.

Example:

>>> get_account_url('myaccount')
'https://myaccount.blob.core.windows.net'

blobhelper.get_blob_container_path(storage_account, container)

Returns the absolute path of the blob container.

Parameters:
  • storage_account – Name of the Azure Storage account.

  • container – Name of the blob container.

Returns:

Absolute path as a string.

blobhelper.get_blob_md5_checksums(account_url, container, sastoken, blob_list, use_hex=False)

Retrieves MD5 checksums for a list of blobs in Azure Blob Storage.

Args:

account_url (str): The Azure Storage account URL.
container (str): The name of the container.
sastoken (str): The SAS token for authentication.
blob_list (list): List of BlobProperties objects.
use_hex (bool): If True, returns checksum as hex; otherwise Base64. Default is False.

Returns:

dict: A dictionary mapping blob names to their MD5 checksums, or None if not available.

blobhelper.get_blob_subfolder_path(storage_account, container, subfolder)

Returns the absolute path of a blob subfolder using the container path function.

Parameters:
  • storage_account – Name of the Azure Storage account.

  • container – Name of the blob container.

  • subfolder – Path to subfolder within the container.

Returns:

Absolute path as a string.

blobhelper.list_blob_content(account_url, container, folder_path, sastoken, file_extn='')

List blobs in an Azure Storage container with optional file extension filtering.

Args:

account_url (str): The Azure Storage account URL.
container (str): The name of the container to list blobs from.
folder_path (str): The folder path prefix to filter blobs by.
sastoken (str): The SAS token for authentication.
file_extn (str, optional): File extension to filter blobs by (e.g., 'txt', 'pdf'). If empty, returns all blobs. Defaults to "".

Returns:

list: A list of BlobProperties objects matching the specified criteria.
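
Example (a sketch; the import path is an assumption, and a valid account name, container, and SAS token are required):

from uaine_pydat import blobhelper  # assumed import path

sastoken = '<your-sas-token>'
account_url = blobhelper.get_account_url('myaccount')

# list CSV blobs under a prefix, fetch their MD5 checksums, then download them
blobs = blobhelper.list_blob_content(account_url, 'mycontainer', 'raw/2024/', sastoken, file_extn='csv')
checksums = blobhelper.get_blob_md5_checksums(account_url, 'mycontainer', sastoken, blobs, use_hex=True)
blobhelper.download_all_blobs(account_url, 'mycontainer', 'raw/2024/', sastoken, './downloads', file_extn='csv')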

dataclean

dataclean.check_column_completeness(df: DataFrame) dict

Calculate the percentage of non-missing values for each column.

Args:

df (DataFrame): Input DataFrame

Returns:

dict: Dictionary mapping column names to completeness percentage

dataclean.clean_whitespace_in_df(df: DataFrame) DataFrame

Remove leading and trailing whitespace from all string columns in a DataFrame.

Parameters:

df (DataFrame): The input DataFrame.

Returns:

DataFrame: The input DataFrame with leading and trailing whitespace removed from string columns.

dataclean.convert_to_numeric(df: DataFrame, columns: list) DataFrame

Convert specified columns to numeric type, with errors coerced to NaN.

Args:

df (DataFrame): Input DataFrame
columns (list): List of column names to convert

Returns:

DataFrame: DataFrame with specified columns converted to numeric

dataclean.keep_alphanumeric(input_string: str) str

Filter a string to keep only alphanumeric characters.

Args:

input_string (str): The input string to filter

Returns:

str: String containing only alphanumeric characters

dataclean.keep_only_letters(input_string: str) str

Filter a string to keep only alphabetic characters (letters).

Args:

input_string (str): The input string to filter

Returns:

str: String containing only letters from the input

dataclean.normalize_text(input_string: str) str

Normalise text by converting to lowercase and removing accents.

Args:

input_string (str): The input string to normalize

Returns:

str: Normalized string

dataclean.remove_empty_rows(df: DataFrame) DataFrame

Remove rows where all values are empty or NaN.

Args:

df (DataFrame): Input DataFrame

Returns:

DataFrame: The input DataFrame with all empty rows removed
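
Example (a sketch; the import path is an assumption):

import pandas as pd
from uaine_pydat import dataclean  # assumed import path

df = pd.DataFrame({'name': ['  Ada ', ' Bob', None], 'score': ['10', 'x', None]})
df = dataclean.clean_whitespace_in_df(df)               # strip leading/trailing whitespace
df = dataclean.convert_to_numeric(df, ['score'])        # non-numeric values become NaN
df = dataclean.remove_empty_rows(df)                    # drop rows that are entirely empty/NaN
completeness = dataclean.check_column_completeness(df)  # % of non-missing values per column
label = dataclean.normalize_text('Café Déjà Vu')        # lowercase, accents removed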

datagen

datagen.gen_bool_column(size: int, true_prob: float = 0.5, null_prob: float = 0.0, labels: Tuple[str, str] | None = None) Series

Generate a pandas Series of random boolean values.

Parameters:

size: int

Number of values to generate

true_prob: float

Probability of generating a True value (0.0 to 1.0)

null_prob: float

Probability of generating a null value (0.0 to 1.0)

labels: tuple(str, str), optional

Custom labels for (False, True) values. If provided, returns strings instead of booleans.

Returns:

pd.Series

Series of randomly generated boolean values or custom labels

datagen.gen_categorical_column(size: int, categories: List[str] | None = None, weights: List[float] | None = None, null_prob: float = 0.0) Series

Generate a pandas Series of random categorical values.

Parameters:

size: int

Number of values to generate

categories: list of str

List of possible categorical values

weights: list of float, optional

Probability weights for each category. Must sum to 1.0 if provided.

null_prob: float

Probability of generating a null value (0.0 to 1.0)

Returns:

pd.Series

Series of randomly generated categorical values

datagen.gen_dataframe(rows: int, columns: dict, include_id: bool = True) DataFrame

Generate a pandas DataFrame with specified columns.

Parameters:

rows: int

Number of rows to generate

columns: dict

Dictionary where keys are column names and values are functions to generate the column data

include_id: bool

Whether to include an ‘id’ column with sequential integers

Returns:

pd.DataFrame

Generated DataFrame with the specified columns

datagen.gen_date_column(size: int, start_date: str | datetime = '2020-01-01', end_date: str | datetime = '2023-12-31', date_format: str = '%Y-%m-%d', null_prob: float = 0.0, distribution: str = 'uniform') Series

Generate a pandas Series of random dates.

Parameters:

size: int

Number of dates to generate

start_date: str or datetime

Starting date (inclusive)

end_date: str or datetime

Ending date (inclusive)

date_format: str

Format string for date output (if returning strings)

null_prob: float

Probability of generating a null value (0.0 to 1.0)

distribution: str

Distribution to use for generating dates:

  • 'uniform': Uniform distribution between start and end dates

  • 'normal': Normal distribution centered on the midpoint

  • 'recent': Bias towards more recent dates

Returns:

pd.Series

Series of randomly generated dates as strings in the specified format

datagen.gen_numeric_column(size: int, data_type: str = 'float', min_val: int | float = 0, max_val: int | float = 100, distribution: str = 'uniform', null_prob: float = 0.0, precision: int | None = None) Series

Generate a pandas Series of random numbers.

Parameters:

size: int

Number of values to generate

data_type: str

Type of numeric data: ‘int’, ‘float’, or ‘decimal’

min_val: int or float

Minimum value (inclusive)

max_val: int or float

Maximum value (inclusive for ints, exclusive for floats)

distribution: str

Distribution to use for generating values:

  • 'uniform': Uniform distribution between min and max

  • 'normal': Normal distribution with mean=(min+max)/2 and std=(max-min)/6

  • 'exponential': Exponential distribution

  • 'lognormal': Log-normal distribution

null_prob: float

Probability of generating a null value (0.0 to 1.0)

precision: int, optional

For float/decimal, number of decimal places to round to

Returns:

pd.Series

Series of randomly generated numeric values

datagen.gen_sample_dataframe(rows: int, include_id: bool = True) DataFrame

Generate a sample pandas DataFrame from various column types.

Arguments:

rows (int): The number of rows to generate for the DataFrame. include_id (bool): Whether to include an ‘id’ column as a unique identifier for each row. Defaults to True.

Returns:

pd.DataFrame: A pandas DataFrame containing the sample data.

datagen.gen_string_column(size: int, length: int | Tuple[int, int] = 10, charset: str | None = None, prefix: str = '', suffix: str = '', null_prob: float = 0.0, pattern: str | None = None) Series

Generate a pandas Series of random strings.

Parameters:

size: int

Number of strings to generate

length: int or tuple(int, int)

If int, the exact length of each string. If tuple, the (min, max) length range for random string length.

charset: str, optional

String containing characters to use. If None, uses lowercase letters

prefix: str, optional

Prefix to add to each generated string

suffix: str, optional

Suffix to add to each generated string

null_prob: float, optional

Probability of generating a null value (0.0 to 1.0)

pattern: str, optional

Pattern to use for string generation with character classes:

  • ‘L’ = uppercase letter

  • ‘l’ = lowercase letter

  • ‘d’ = digit

  • ‘c’ = special character

  • ‘a’ = any alphanumeric character

Example: ‘Llldd-lldd’ would generate something like ‘Tgh45-jk78’

Returns:

pd.Series

Series of randomly generated strings
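
Example assembling several generated columns into a test table (a sketch; the import path is an assumption):

import pandas as pd
from uaine_pydat import datagen  # assumed import path

n = 1000
df = pd.DataFrame({
    'user_id': datagen.gen_string_column(n, pattern='Llldd-lldd'),
    'age': datagen.gen_numeric_column(n, data_type='int', min_val=18, max_val=90),
    'signup': datagen.gen_date_column(n, start_date='2021-01-01', end_date='2023-12-31', distribution='recent'),
    'tier': datagen.gen_categorical_column(n, categories=['free', 'pro', 'enterprise'], weights=[0.7, 0.2, 0.1]),
    'active': datagen.gen_bool_column(n, true_prob=0.8, null_prob=0.05),
})

# or a ready-made sample table with common column types
sample = datagen.gen_sample_dataframe(rows=100)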

datahash

datahash.hash256(datastr: str, sha_salt: str) str

Create a SHA-256 hash using the provided data and salt.

Args:

datastr (str): The data to be hashed. sha_salt (str): The salt to be used in the SHA-256 hashing.

Returns:

str: The hexadecimal digest of the SHA-256 hash.

datahash.hashhmac(datastr: str, sha_salt: bytes, method=<built-in function openssl_sha256>) str

Create an HMAC hash using the provided data, salt, and method.

Args:

datastr (str): The data to be hashed. sha_salt (bytes): The salt to be used in the HMAC hashing. method: The hashing method to be used (default is hashlib.sha256).

Returns:

str: The hexadecimal digest of the HMAC hash.

datahash.hashmd5(datastr: str, sha_salt: str) str

Create an MD5 hash using the provided data and salt.

Args:

datastr (str): The data to be hashed. sha_salt (str): The salt to be used in the MD5 hashing.

Returns:

str: The hexadecimal digest of the MD5 hash.

datahash.randomize_hash(hash_string: str, salt_length: int = 16) str

Randomizes a hash by using hash256 with a random salt.

Args:

hash_string (str): The original hash or string to randomize. salt_length (int): The length of the random salt in bytes (default is 16).

Returns:

str: A new randomized hash derived from the original using hash256.
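
Example (a sketch; the import path is an assumption and the salts shown are placeholders):

from uaine_pydat import datahash  # assumed import path

digest = datahash.hash256('jane.doe@example.com', 'my-static-salt')   # salted SHA-256 hex digest
mac = datahash.hashhmac('jane.doe@example.com', b'secret-key')        # HMAC (SHA-256 by default) hex digest
legacy = datahash.hashmd5('jane.doe@example.com', 'my-static-salt')   # salted MD5 digest
scrambled = datahash.randomize_hash(digest)                           # re-hash with a random 16-byte salt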

dataio

dataio.csv_to_parquet(input_file: str, separator: str = ',', output_file: str | None = None) None

Converts a CSV file to a Parquet file using Polars in streaming mode.

Parameters:
  • input_file (str): The path to the CSV file.

  • separator (str): The delimiter used in the CSV (default: comma).

  • output_file (Optional[str]): The path for the output Parquet file.

    If not provided, it defaults to the same prefix as input_file with a .parquet extension.

dataio.df_memory_usage(df: DataFrame) float

Calculate the total memory usage of a DataFrame with deep=False.

Parameters:

df (pd.DataFrame): The DataFrame whose memory usage is to be calculated.

Returns:

float: The total memory usage of the DataFrame in bytes.

dataio.read_flat_df(filepath: str) DataFrame

Read a flat file into a DataFrame.

Args:

filepath (str): The path to the flat file.

Returns:

pd.DataFrame: The DataFrame read from the file.

dataio.read_flat_psv(path: str) DataFrame

Read a pipe-separated values (PSV) file into a DataFrame.

Args:

path (str): The path to the PSV file.

Returns:

pd.DataFrame: The DataFrame read from the PSV file.

dataio.read_ini_file(file_path: str) dict

Read an INI file and return its contents as a dictionary.

Args:

file_path (str): The path to the INI file.

Returns:

dict: A dictionary containing the key-value pairs from the INI file.

dataio.read_json_file(filepath: str, orient: str = 'records', normalize: bool = False, record_path: str | None = None, meta: list | None = None, encoding: str = 'utf-8') DataFrame

Read a JSON file into a DataFrame.

Args:

filepath (str): The path to the JSON file.
orient (str): The format of the JSON structure. Default is 'records'.
normalize (bool): Whether to normalize nested JSON data. Default is False.
record_path (str or list): Path to the records in nested JSON. Default is None.
meta (list): Fields to use as metadata for each record. Default is None.
encoding (str): The file encoding. Default is 'utf-8'.

Returns:

pd.DataFrame: The DataFrame read from the JSON file.

dataio.read_sas_colnames(filepath: str, encoding: str = 'latin-1') list

Read SAS file column names.

Args:

filepath (str): The path to the SAS file. encoding (str): The encoding to use for reading the SAS file. Default is “latin-1”.

Returns:

list: A list of column names from the SAS file.

dataio.read_sas_metadata(filepath: str, encoding: str = 'latin-1') dict

Read SAS file metadata and return names, labels, formats, and lengths of columns.

Args:

filepath (str): The path to the SAS file. encoding (str): The encoding to use for reading the SAS file. Default is “latin-1”.

Returns:

dict: A dictionary containing the column names, labels, formats, and lengths.

dataio.read_xml_file(filepath: str, xpath: str = './*', attrs_only: bool = False, encoding: str = 'utf-8') DataFrame

Read an XML file into a DataFrame.

Args:

filepath (str): The path to the XML file.
xpath (str): XPath string to parse specific nodes. Default is './*'.
attrs_only (bool): Parse only the attributes, not the child elements. Default is False.
encoding (str): The file encoding. Default is 'utf-8'.

Returns:

pd.DataFrame: The DataFrame read from the XML file.

dataio.select_dataset_ui(directory: str, extension: str) str

List the files with the specified extension in the given directory and prompt the user to select one.

Parameters:

directory (str): The directory to search for files.
extension (str): The file extension to filter by.

Returns:

str: The filename of the selected dataset.

dataio.set_globals_from_config(configpath: str) int

Sets global variables from a configuration file.

Parameters:

configpath (str): Path to the configuration file.

Returns:

int: The number of global variables set.

dataio.write_flat_df(df: DataFrame, filepath: str, index: bool = False)

Write a DataFrame to a flat file in different formats.

Args:

df (pd.DataFrame): The DataFrame to be written. filepath (str): The path where the file will be saved. index (bool): Whether to write row names (index). Default is False.

Returns:

None

dataio.write_json_file(df: DataFrame, filepath: str, orient: str = 'records', index: bool = False, indent: int = 4)

Write a DataFrame to a JSON file.

Args:

df (pd.DataFrame): The DataFrame to be written.
filepath (str): The path where the JSON file will be saved.
orient (str): The format of the JSON structure. Default is 'records'.
index (bool): Whether to include the index in the JSON. Default is False.
indent (int): The indentation level for the JSON file. Default is 4.

Returns:

None

dataio.write_xml_file(df: DataFrame, filepath: str, index: bool = False, root_name: str = 'data', row_name: str = 'row', attr_cols: list | None = None)

Write a DataFrame to an XML file.

Args:

df (pd.DataFrame): The DataFrame to be written.
filepath (str): The path where the XML file will be saved.
index (bool): Whether to include the index in the XML. Default is False.
root_name (str): The name of the root element. Default is 'data'.
row_name (str): The name of each row element. Default is 'row'.
attr_cols (list): List of columns to write as attributes, not elements. Default is None.

Returns:

None
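
Example of a small read/write workflow (a sketch; the import path is an assumption and the file names are placeholders):

from uaine_pydat import dataio  # assumed import path

df = dataio.read_flat_df('input.csv')              # format inferred from the file
meta = dataio.read_sas_metadata('input.sas7bdat')  # column names, labels, formats, lengths
dataio.write_json_file(df, 'output.json', orient='records', indent=2)
dataio.csv_to_parquet('big_input.csv', separator='|')  # streams to big_input.parquet via Polars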

datatransform

datatransform.add_prefix(string: str, prefix: str) str

Add the specified prefix to the string.

Parameters:

string (str): The original string.
prefix (str): The prefix to add to the string.

Returns:

str: The string with the prefix added.

datatransform.add_suffix(string: str, suffix: str) str

Add the specified suffix to the string.

Parameters:

string (str): The original string.
suffix (str): The suffix to add to the string.

Returns:

str: The string with the suffix added.

datatransform.break_into_lines(string: str) list[str]

Breaks a string into a list of lines.

Args:

string (str): The input string to be broken into lines.

Returns:

list[str]: A list of lines from the input string.

datatransform.dataframe_to_json(df: DataFrame, orient: str = 'records', date_format: str = 'iso', indent: int | None = None) str

Convert a DataFrame to a JSON string with various orientation options.

Parameters:

df: pd.DataFrame

The DataFrame to convert to JSON

orient: str, default 'records'

The JSON string orientation. See json_to_dataframe for options.

date_format: str, default 'iso'

Format for dates in the resulting JSON:

  • 'epoch': Use Unix epoch (seconds since 1970-01-01)

  • 'iso': ISO 8601 formatted dates

indent: int, default None

Indentation level for the resulting JSON string. None = no indentation.

Returns:

str

JSON string representation of the DataFrame

datatransform.dataframe_to_xml(df: DataFrame, root_name: str = 'data', row_name: str = 'row') str

Convert a DataFrame to an XML string.

Parameters:

df: pd.DataFrame

The DataFrame to convert to XML

root_name: str, default 'data'

The name of the root XML element

row_name: str, default 'row'

The name of each row element

Returns:

str

XML string representation of the DataFrame

datatransform.json_extract_subtree(json_data, path: str) any

Extract a subtree from a JSON object using a dot-notation path.

Parameters:

json_data: dict or list

The JSON data to extract from

path: str

Path to the subtree using dot notation (e.g., 'person.address.city'). Use array indices like 'results.0.name' to access list elements.

Returns:

any

The subtree at the specified path, or None if path doesn’t exist

Examples:

>>> data = {'person': {'name': 'John', 'addresses': [{'city': 'New York'}, {'city': 'Boston'}]}}
>>> json_extract_subtree(data, 'person.addresses.0.city')
'New York'

datatransform.json_to_dataframe(json_data, orient='records', normalize=False, record_path=None, meta=None, encoding='utf-8')

Convert JSON data into a pandas DataFrame.

Parameters:

json_data: str, dict, list, or path to file

The JSON data to convert. Can be:

  • A string containing JSON data

  • A Python dict or list containing JSON data

  • A file path to a JSON file

orient: str, default 'records'

The JSON string orientation. Allowed values:

  • 'records': list-like [{column -> value}, ... ]

  • 'split': dict-like {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}

  • 'index': dict-like {index -> {column -> value}}

  • 'columns': dict-like {column -> {index -> value}}

  • 'values': just the values array

normalize: bool, default False

Whether to normalize semi-structured JSON data into a flat table

record_path: str or list of str, default None

Path in each object to list of records. If not passed, data will be assumed to be an array of records.

meta: list of str, default None

Fields to use as metadata for each record in resulting DataFrame

encoding: str, default 'utf-8'

Encoding to use when reading JSON from a file

Returns:

pd.DataFrame

The converted DataFrame

Examples:

# From a JSON string
>>> json_str = '{"name": "John", "age": 30, "city": "New York"}'
>>> df = json_to_dataframe(json_str)

# From a file
>>> df = json_to_dataframe('data.json')

# With nested data
>>> json_str = '{"users": [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]}'
>>> df = json_to_dataframe(json_str, record_path='users')

datatransform.merge_dataframes(df_list: list[DataFrame]) DataFrame

Merges a list of DataFrames into a single DataFrame, aligning columns by name. Missing columns will be filled with NaN.
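
Example (a sketch; the import path is an assumption):

import pandas as pd
from uaine_pydat import datatransform  # assumed import path

a = pd.DataFrame({'id': [1, 2], 'x': [10, 20]})
b = pd.DataFrame({'id': [3], 'y': [99]})
merged = datatransform.merge_dataframes([a, b])  # columns id, x, y; missing values filled with NaN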

datatransform.merge_json_objects(json1: dict, json2: dict, merge_lists: bool = False) dict

Merge two JSON objects, with the second one taking precedence for overlapping keys.

Parameters:

json1: dict

First JSON object (base)

json2: dict

Second JSON object (takes precedence when keys overlap)

merge_lists: bool, default False

If True, merge list items; if False, replace lists entirely

Returns:

dict

Merged JSON object

datatransform.xml_to_dataframe(xml_data, xpath: str = './*') DataFrame

Convert XML data to a pandas DataFrame.

Parameters:

xml_data: str or file-like object or path

The XML data to convert. Can be:

  • A string containing XML data

  • A file path to an XML file

  • A file-like object containing XML data

xpath: str, default './*'

XPath string to parse specific nodes

Returns:

pd.DataFrame

The DataFrame representation of the XML data

duckfunc

duckfunc.does_table_exist(db_con, dbname: str, tablename: str) bool

Check if a table exists in the specified database.

Args:

db_con: The database connection object. dbname (str): The name of the database. tablename (str): The name of the table.

Returns:

bool: True if the table exists, False otherwise.

duckfunc.getCurrentTimeForDuck(timezone_included: bool = False) str

Get the current time formatted for DuckDB, optionally including the timezone.

Args:

timezone_included (bool): If True, includes the timezone in the returned string.

Returns:

str: The current time formatted as ‘YYYY-MM-DD HH:MM:SS’ (with optional timezone).

duckfunc.getDuckVersion(con) str

Get the connected DuckDB version.

Args:

con: The database connection object.

Returns:

str: The version of the connected DuckDB instance.

duckfunc.get_attached_dbs(db_con) DataFrame

Get the list of attached databases.

Args:

db_con: The database connection object.

Returns:

DataFrame: A DataFrame containing the database name, path, and type.

duckfunc.get_inventory(db_con) DataFrame

Get the inventory of tables.

Args:

db_con: The database connection object.

Returns:

DataFrame: A DataFrame containing all tables.

duckfunc.get_table_as_df(con, db_name: str, table_name: str) DataFrame

Query a table from the specified database and return it as a pandas DataFrame.

Args:

con: Database connection object
db_name (str): Name of the database
table_name (str): Name of the table

Returns:

DataFrame: The table contents as a pandas DataFrame, or None if the table doesn’t exist

duckfunc.init_table(con, frame: DataFrame, db: str, tablename: str) bool

Initialize a table in the specified database.

Args:

con: The database connection object.
frame (DataFrame): A DataFrame containing columns VARNAME and TYPE, which should be DuckDB-compatible.
db (str): The name of the database.
tablename (str): The name of the table.

Returns:

bool: True if the table was created, False if it already exists.

duckfunc.save_from_db(con, db_name: str, table_name: str, output_path: str) bool

Query a table from the specified database and save it to the given output path. The output format is determined from the file extension of the output path.

Args:

con: Database connection object
db_name (str): Name of the database
table_name (str): Name of the table
output_path (str): Path to save the output file (extension determines format)

Returns:

bool: True if the table existed and was saved, False otherwise
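
A typical round trip (a sketch; the import path is an assumption, duckdb.connect comes from the duckdb dependency, and the database name is assumed to match the file basename):

import duckdb
import pandas as pd
from uaine_pydat import duckfunc  # assumed import path

con = duckdb.connect('analytics.duckdb')
print(duckfunc.getDuckVersion(con))

# declare a simple schema and create the table if it does not exist
schema = pd.DataFrame({'VARNAME': ['id', 'name'], 'TYPE': ['INTEGER', 'VARCHAR']})
duckfunc.init_table(con, schema, 'analytics', 'customers')

if duckfunc.does_table_exist(con, 'analytics', 'customers'):
    df = duckfunc.get_table_as_df(con, 'analytics', 'customers')
    duckfunc.save_from_db(con, 'analytics', 'customers', 'customers.parquet')  # format from extension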

fileio

fileio.addsyspath(directory: str)

Add the specified directory to the system path if it is not already included.

Parameters:

directory (str): The directory to be added to the system path.

Returns:

None

fileio.calculate_checksums(dir_path)

Calculate MD5 checksums for all files in the specified directory.

Parameters:

dir_path (str) – Path to the directory containing files.

Returns:

Dictionary mapping file paths to their MD5 checksum.

Return type:

dict

fileio.create_filepath_dirs(path: str)

Creates all directories needed for a given file path.

If the path contains folders, this function creates all necessary directories in the path if they don’t already exist.

Parameters:

path (str): The file path for which to create directories.

Returns:

None

fileio.download_file_from_url(url: str, save_path: str)

Downloads a file from the given URL and saves it to the specified path.

Args:

url (str): The URL of the file to download. save_path (str): The file path where the downloaded file will be saved.

Returns:

None

fileio.gen_random_subfolder(master_dir: str) str

Generates a random subfolder within the specified master directory.

Args:

master_dir (str): The path to the master directory where the subfolder will be created.

Returns:

str: The path to the newly created subfolder.

fileio.get_file_extension(filepath: str) str

Get the file extension of the given file path.

Parameters:

filepath – The path of the file.

Returns:

The file extension of the file.

fileio.list_dirs(main_dir: str) list

List all directories within the specified main directory.

Args:

main_dir (str): The main directory path to list directories from.

Returns:

list: A list of directory names within the specified main directory.

fileio.list_files_of_extension(directory: str, extn: str) list[str]

List all files in the specified directory with the given extension.

Parameters:
  • directory – The directory to search in.

  • extn – The file extension to filter by.

Returns:

A list of file paths with the specified extension.

fileio.mv_file(src: str, dest: str)

Moves a file from the source path to the destination path using shutil.

Parameters:

src (str): The path of the file to be moved. dest (str): The destination path where the file should be moved.

Returns:

None

fileio.read_file_to_bytes(file_path: str) bytes

Read the string content from the specified file and convert it to bytes using UTF-8 encoding.

Parameters:

file_path (str): The path to the file.

Returns:

bytes: The content of the file as bytes.

fileio.read_file_to_string(file_path: str) str

Read the string content from the specified file.

Parameters:

file_path (str): The path to the file.

Returns:

str: The content of the file as a string.

fileio.remove_directory(dir_path: str) bool

Removes a directory at the specified path.

Attempts to remove the directory and prints the result. If an error occurs during removal, the exception is caught and an error message is printed.

Parameters:

dir_path (str): The path to the directory to be removed.

Returns:

bool: True if directory was successfully removed, False otherwise.

Raises:

No exceptions are raised as they are caught and printed internally.
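
Example (a sketch; the import path is an assumption and the paths/URL are placeholders):

from uaine_pydat import fileio  # assumed import path

fileio.create_filepath_dirs('output/reports/summary.csv')   # make any missing folders
fileio.download_file_from_url('https://example.com/data.csv', 'output/reports/data.csv')
csv_files = fileio.list_files_of_extension('output/reports', 'csv')
checksums = fileio.calculate_checksums('output/reports')     # {file path: MD5 checksum}
fileio.mv_file('output/reports/data.csv', 'archive/data.csv')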

systeminfo

systeminfo.free_gb_in_drive(drive: str) float

Calculate the free space in a specified drive in gigabytes (GB).

Parameters:

drive (str): The drive to check the free space of.

Returns:

float: The free space in gigabytes (GB).

systeminfo.gather_free_space_in_drive(drive: str) float

Gather the free space in a specified drive.

Parameters:

drive (str): The drive to check the free space of. If the drive is a single letter, it is assumed to be a Windows drive.

Returns:

float: The free space in bytes.

systeminfo.get_battery_info() dict

Get information about the system battery.

Returns:
dict: Dictionary containing battery percentage, time left, and power plugged status.

Returns None if no battery is present.

systeminfo.get_cpu_usage_percent() float

Get the current CPU usage as a percentage.

Returns:

float: Current CPU usage percentage.

systeminfo.get_formatted_uptime() str

Get the system uptime formatted as days, hours, minutes, seconds.

Returns:

str: Formatted uptime string.

systeminfo.get_free_ram() int

Get the amount of free RAM available in bytes.

Returns:

int: The amount of free RAM in bytes.

systeminfo.get_free_ram_in_gb() float

Get the amount of free RAM on the system in gigabytes.

This function uses the psutil library to retrieve the amount of free RAM and converts it from bytes to gigabytes.

Returns:

float: The amount of free RAM in gigabytes.

systeminfo.get_installed_ram_gb() int

Get the total amount of installed RAM in gigabytes (GB).

Returns:

int: The total amount of installed RAM in gigabytes (GB).

systeminfo.get_largest_drive() dict[str, any]

Identifies and returns information about the drive with the most free space.

The function finds the drive with the maximum available free space and returns its information with only the letters kept in the drive name.

Returns:

dict[str, any]: Dictionary containing information about the drive with the most free space, with the drive name containing only letters.

systeminfo.get_network_stats() DataFrame

Get statistics for all network interfaces.

Returns:

pd.DataFrame: DataFrame with network interface statistics.

systeminfo.get_number_virtual_cores() int

Get the number of virtual (logical) CPU cores including hyperthreads.

Returns:

int: The number of virtual CPU cores.

systeminfo.get_per_cpu_usage_percent() list[float]

Get CPU usage percentage for each individual CPU core.

Returns:

list[float]: List of CPU usage percentages for each core.

systeminfo.get_physical_cores() int

Get the number of physical CPU cores.

Returns:

int: The number of physical CPU cores.

systeminfo.get_system_info() dict

Get general system information.

Returns:

dict: Dictionary containing OS, hostname, and platform information.

systeminfo.get_system_uptime() float

Get the system uptime in seconds.

Returns:

float: System uptime in seconds.

systeminfo.get_top_processes(n=5) DataFrame

Get the top n processes by memory usage.

Parameters:

n (int): Number of processes to return. Default is 5.

Returns:

pd.DataFrame: DataFrame with top processes information.

systeminfo.list_drive_spaces() DataFrame

List all available drives and their free space in gigabytes (GB).

Returns:

pd.DataFrame: A DataFrame with the drive names and their free space in gigabytes (GB).

systeminfo.list_drives() list[str]

List all available drives on the system.

Returns:

list[str]: A list of device names for all available drives.
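
Example (a sketch; the import path is an assumption):

from uaine_pydat import systeminfo  # assumed import path

info = systeminfo.get_system_info()         # OS, hostname, platform
free_gb = systeminfo.get_free_ram_in_gb()   # free RAM in GB
cpu = systeminfo.get_cpu_usage_percent()    # current CPU usage %
drives = systeminfo.list_drive_spaces()     # DataFrame of drives and free space in GB
top = systeminfo.get_top_processes(n=10)    # DataFrame of the 10 largest processes by memory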

Release History

Version 1.5.3

  • Blob helper module now contains a function for checking if a SAS token works for a container

Version 1.5.2

  • Blob helper module now contains a checksum function for listed blobs

Version 1.5.1

  • Blob helper module fix for the makedirs argument when downloading blobs

  • New chunking based function for downloading blobs

Version 1.5

  • Enhanced blob helper module with new functions:

    • list_blob_content: Lists all blobs in a specified container or subfolder

    • download_all_blobs: Downloads all blobs from a container or subfolder to a specified local path

    • get_account_url: Generates Azure Storage account URL from storage account name

Version 1.4

  • Added new function to datatransform module:

    • merge_dataframes: Merges a list of pandas DataFrames into a single DataFrame, aligning columns by name and filling missing columns with NaN.

Version 1.3

  • Added new data conversion functionality in dataio module:

    • csv_to_parquet: Converts CSV files to Parquet format using Polars in streaming mode with support for custom separators and output file naming

  • Enhanced hashing capabilities with new functions in datahash module:

    • hashmd5: Creates an MD5 hash using provided data and salt

  • Added new Azure blob module (blobhelper) for path operations:

    • get_blob_container_path: Returns the absolute path of a blob container

    • get_blob_subfolder_path: Returns the absolute path of a blob subfolder within a container

Version 1.2.1

  • Added new function to fileio module:

    • calculate_checksums: Calculates MD5 checksums for all files in a specified directory, returning a dictionary mapping file paths to their checksums.

  • Methods that return no value are no longer displayed in the function signature

  • Fix applied for move file function to be platform independent

Version 1.2

  • Enhanced datatransform module with:

    • json_to_dataframe: Converts JSON data into pandas DataFrames with support for:

      • Multiple input types (JSON strings, Python dicts/lists, file paths)

      • Custom orientation options for structured data

      • Normalization of nested JSON structures

      • Handling of both single objects and arrays of records

      • Customizable encoding for file reading

  • Fixed systeminfo module with:

    • Reworked drive listing to continue functioning when some drives are inaccessible

    • Improved reliability of get_largest_drive() function

  • Improved fileio module with:

    • Enhanced remove_directory() to return boolean success/failure status instead of only printing messages

    • Added proper type hints and improved documentation for key functions

Version 1.1

  • Added new generic DuckDB functions to output tables to files

  • Expanded the systeminfo module to report:

    • CPU usage

    • System uptime

    • System information

    • Network stats

    • Top processes

  • Enhanced dataclean module with:

    • keep_only_letters: Filters strings to keep only alphabetical characters

    • keep_alphanumeric: Filters strings to keep only alphanumeric characters

    • normalize_text: Converts text to lowercase and removes accents

    • remove_empty_rows: Removes rows where all values are empty or NaN

    • convert_to_numeric: Converts specified columns to numeric type

    • check_column_completeness: Calculates the percentage of non-missing values for each column

  • Added new datagen module for synthetic data generation:

    • gen_string_column: Generates random string data with patterns and customization

    • gen_numeric_column: Generates numeric data with various distributions

    • gen_date_column: Generates date data with customizable ranges and formats

    • gen_categorical_column: Generates categorical data with optional weighted distributions

    • gen_bool_column: Generates boolean data with custom labels

    • gen_dataframe: Creates complete dataframes with customizable columns

    • gen_sample_dataframe: Generates ready-to-use test dataframes with common column types

Version 1.0.2

  • Added Bitgen Features:

    • generate_random_uuid: Generates a random UUID (Universally Unique Identifier) using UUID version 4, with dashes removed.

    • generate_custom_uuid: Generates a random UUID using UUID version 4, formatted with a specified marker. The default marker is a dash (‘-‘), but it can be customized.

Version 1.0.1

  • First full release of package