Uaine.Pydat Package

This package provides comprehensive tools for data handling, processing, and database operations in Python. It includes specialised functions for file I/O operations, data transformation, data cleaning, system information gathering, and data generation. Key features include PSV file handling, cryptographic hash functions, data table cleaning methods, DuckDB integration with snippet queries, random data generation utilities, and XML/config file parsers. The package aims to simplify common data manipulation tasks while providing a flexible framework for both basic and advanced data operations.

License: GPL v3

Description

A Python package that streamlines data handling, processing, and database operations through a collection of utility functions and tools.

Uaine.Pydat provides a comprehensive toolkit for data scientists, analysts, and developers working with structured data. The package simplifies common data manipulation tasks while offering specialized functionality for file operations, data transformation, table cleaning, and database interactions.

Key Features include File I/O Operations for reading/writing various file formats, listing files by extension, and managing system paths; Data Transformation tools for reshaping, converting, and manipulating data structures with minimal code; Data Cleaning methods to sanitize, standardize, and prepare data tables for analysis; DuckDB Integration with helper functions and snippet queries; Cryptographic Hashing for data integrity and anonymization; System Information utilities to gather system metrics and resource usage; Data Generation for testing and development scenarios; and Configuration Handling for XML, INI, and other formats.

Core Modules include dataio.py for data input/output operations and format conversion; fileio.py for file system operations and path management; datatransform.py for data structure transformation and manipulation; dataclean.py for data cleaning and standardization; duckfunc.py for DuckDB database interactions; datahash.py for cryptographic hashing functions; systeminfo.py for system information gathering; datagen.py for random data generation; and bitgen.py for low-level bit generation utilities.

Common Use Cases for this package include simplifying ETL workflows, streamlining data preparation for analysis and machine learning, managing database operations with less boilerplate code, generating test data for development, monitoring system resources during data processing tasks, and securing sensitive data through hashing and anonymization.

This package aims to reduce the complexity of common data handling tasks by providing ready-made solutions that follow best practices while remaining flexible enough to adapt to various data processing requirements.
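
A minimal quick-start sketch is shown below; the pip distribution name and import path are assumptions and may differ in the installed package:

# pip install uaine-pydat   (assumed distribution name)
from uaine_pydat import datagen, dataclean, dataio  # assumed import path

df = datagen.gen_sample_dataframe(rows=100)   # synthetic test data
df = dataclean.clean_whitespace_in_df(df)     # trim whitespace in string columns
dataio.write_flat_df(df, 'sample.csv')        # write to a flat file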

Dependencies

  • pandas

  • pyreadstat

  • requests

  • duckdb

  • wheel

  • twine

  • psutil

  • lxml

  • polars

  • azure-storage-blob

  • tqdm

bitgen

bitgen.gen_random_hex_string(bitlength) str

Generates a random hexadecimal string of the given bit length.

Parameters:

bitlength (int): The length of the bit string to generate.

Returns:

str: A random hexadecimal string of the given bit length.

bitgen.generate_256_bit_string() str

Generates a random 256-bit hexadecimal string.

Returns:

str: A random 256-bit hexadecimal string.

bitgen.generate_8_bit_string() str

Generates a random 8-bit hexadecimal string.

Returns:

str: A random 8-bit hexadecimal string.

bitgen.generate_custom_uuid(marker: str = '-') str

Generates a random UUID (Universally Unique Identifier) using UUID version 4, formatted with a specified marker.

Parameters:

marker (str): The character to use as a separator in the UUID. Default is ‘-‘.

Returns:

str: A string representation of a random UUID with the specified marker.

bitgen.generate_random_uuid() str

Generates a random UUID (Universally Unique Identifier) using UUID version 4, with dashes removed.

Returns:

str: A string representation of a random UUID without dashes.
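
Example usage (a sketch; the import path is an assumption):

from uaine_pydat import bitgen  # assumed import path

token = bitgen.gen_random_hex_string(128)         # random 128-bit hex string
key = bitgen.generate_256_bit_string()            # random 256-bit hex string
uid = bitgen.generate_random_uuid()               # UUID4 with dashes removed
custom = bitgen.generate_custom_uuid(marker='_')  # UUID4 with '_' as the separator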

blobhelper

blobhelper.check_sas_token(account_url, container, sastoken)

Checks if the provided SAS token is valid for the given container.

Args:

account_url (str): The Azure Storage account URL. container (str): The name of the container to check access for. sastoken (str): The SAS token to validate.

Returns:

bool: True if the SAS token is valid and has access, False otherwise. str: Optional error message if invalid.

blobhelper.download_all_blobs(account_url, container, folder_path, sastoken, download_loc, file_extn='', makedirs=True)

Download all blobs from an Azure Storage container to a local directory.

Args:

account_url (str): The Azure Storage account URL.
container (str): The name of the container to download blobs from.
folder_path (str): The folder path prefix to filter blobs by.
sastoken (str): The SAS token for authentication.
download_loc (str): Local directory path where blobs will be downloaded.
file_extn (str, optional): File extension to filter blobs by (e.g., 'txt', 'pdf'). If empty, downloads all blobs. Defaults to "".
makedirs (bool, optional): Whether to create the download directory if it doesn't exist. Defaults to True.

Note:

This function will create the download directory if it doesn’t exist and makedirs is True. Files are downloaded with their original names from the blob storage.

blobhelper.get_account_url(storage_account)

Generate the Azure Storage account URL from the storage account name.

Args:

storage_account (str): The name of the Azure Storage account.

Returns:
str: The complete Azure Storage account URL in the format
'https://{storage_account}.blob.core.windows.net'.

Example:

>>> get_account_url('myaccount')
'https://myaccount.blob.core.windows.net'

blobhelper.get_blob_container_path(storage_account, container)

Returns the absolute path of the blob container.

Parameters:
  • storage_account – Name of the Azure Storage account.

  • container – Name of the blob container.

Returns:

Absolute path as a string.

blobhelper.get_blob_md5_checksums(account_url, container, sastoken, blob_list, use_hex=False)

Retrieves MD5 checksums for a list of blobs in Azure Blob Storage.

Args:

account_url (str): The Azure Storage account URL.
container (str): The name of the container.
sastoken (str): The SAS token for authentication.
blob_list (list): List of BlobProperties objects.
use_hex (bool): If True, returns checksum as hex; otherwise Base64. Default is False.

Returns:

dict: A dictionary mapping blob names to their MD5 checksums, or None if not available.

blobhelper.get_blob_subfolder_path(storage_account, container, subfolder)

Returns the absolute path of a blob subfolder using the container path function.

Parameters:
  • storage_account – Name of the Azure Storage account.

  • container – Name of the blob container.

  • subfolder – Path to subfolder within the container.

Returns:

Absolute path as a string.

blobhelper.list_blob_content(account_url, container, folder_path, sastoken, file_extn='')

List blobs in an Azure Storage container with optional file extension filtering.

Args:

account_url (str): The Azure Storage account URL.
container (str): The name of the container to list blobs from.
folder_path (str): The folder path prefix to filter blobs by.
sastoken (str): The SAS token for authentication.
file_extn (str, optional): File extension to filter blobs by (e.g., 'txt', 'pdf'). If empty, returns all blobs. Defaults to "".

Returns:

list: A list of BlobProperties objects matching the specified criteria.
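
Example (a sketch; the import path is an assumption, and a valid account name, container, and SAS token are required):

from uaine_pydat import blobhelper  # assumed import path

sastoken = '<your-sas-token>'
account_url = blobhelper.get_account_url('myaccount')

# list CSV blobs under a prefix, fetch their MD5 checksums, then download them
blobs = blobhelper.list_blob_content(account_url, 'mycontainer', 'raw/2024/', sastoken, file_extn='csv')
checksums = blobhelper.get_blob_md5_checksums(account_url, 'mycontainer', sastoken, blobs, use_hex=True)
blobhelper.download_all_blobs(account_url, 'mycontainer', 'raw/2024/', sastoken, './downloads', file_extn='csv')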

dataclean

dataclean.check_column_completeness(df: DataFrame) dict

Calculate the percentage of non-missing values for each column.

Args:

df (DataFrame): Input DataFrame

Returns:

dict: Dictionary mapping column names to completeness percentage

dataclean.clean_whitespace_in_df(df: DataFrame) DataFrame

Remove leading and trailing whitespace from all string columns in a DataFrame.

Parameters:

df (DataFrame): The input DataFrame.

Returns:

DataFrame: The input DataFrame with leading and trailing whitespace removed from string columns.

dataclean.convert_to_numeric(df: DataFrame, columns: list) DataFrame

Convert specified columns to numeric type, with errors coerced to NaN.

Args:

df (DataFrame): Input DataFrame
columns (list): List of column names to convert

Returns:

DataFrame: DataFrame with specified columns converted to numeric

dataclean.keep_alphanumeric(input_string: str) str

Filter a string to keep only alphanumeric characters.

Args:

input_string (str): The input string to filter

Returns:

str: String containing only alphanumeric characters

dataclean.keep_only_letters(input_string: str) str

Filter a string to keep only alphabetic characters (letters).

Args:

input_string (str): The input string to filter

Returns:

str: String containing only letters from the input

dataclean.normalize_text(input_string: str) str

Normalise text by converting to lowercase and removing accents.

Args:

input_string (str): The input string to normalize

Returns:

str: Normalized string

dataclean.remove_empty_rows(df: DataFrame) DataFrame

Remove rows where all values are empty or NaN.

Args:

df (DataFrame): Input DataFrame

Returns:

DataFrame: The input DataFrame with all empty rows removed
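
Example (a sketch; the import path is an assumption):

import pandas as pd
from uaine_pydat import dataclean  # assumed import path

df = pd.DataFrame({'name': ['  Ada ', ' Bob', None], 'score': ['10', 'x', None]})
df = dataclean.clean_whitespace_in_df(df)               # strip leading/trailing whitespace
df = dataclean.convert_to_numeric(df, ['score'])        # non-numeric values become NaN
df = dataclean.remove_empty_rows(df)                    # drop rows that are entirely empty/NaN
completeness = dataclean.check_column_completeness(df)  # % of non-missing values per column
label = dataclean.normalize_text('Café Déjà Vu')        # lowercase, accents removed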

datagen

datagen.gen_bool_column(size: int, true_prob: float = 0.5, null_prob: float = 0.0, labels: Tuple[str, str] | None = None) Series

Generate a pandas Series of random boolean values.

Parameters:

size: int

Number of values to generate

true_prob: float

Probability of generating a True value (0.0 to 1.0)

null_prob: float

Probability of generating a null value (0.0 to 1.0)

labels: tuple(str, str), optional

Custom labels for (False, True) values. If provided, returns strings instead of booleans.

Returns:

pd.Series

Series of randomly generated boolean values or custom labels

datagen.gen_categorical_column(size: int, categories: List[str] | None = None, weights: List[float] | None = None, null_prob: float = 0.0) Series

Generate a pandas Series of random categorical values.

Parameters:

size: int

Number of values to generate

categories: list of str

List of possible categorical values

weights: list of float, optional

Probability weights for each category. Must sum to 1.0 if provided.

null_prob: float

Probability of generating a null value (0.0 to 1.0)

Returns:

pd.Series

Series of randomly generated categorical values

datagen.gen_dataframe(rows: int, columns: dict, include_id: bool = True) DataFrame

Generate a pandas DataFrame with specified columns.

Parameters:

rows: int

Number of rows to generate

columns: dict

Dictionary where keys are column names and values are functions to generate the column data

include_id: bool

Whether to include an ‘id’ column with sequential integers

Returns:

pd.DataFrame

Generated DataFrame with the specified columns

datagen.gen_date_column(size: int, start_date: str | datetime = '2020-01-01', end_date: str | datetime = '2023-12-31', date_format: str = '%Y-%m-%d', null_prob: float = 0.0, distribution: str = 'uniform') Series

Generate a pandas Series of random dates.

Parameters:

size: int

Number of dates to generate

start_date: str or datetime

Starting date (inclusive)

end_date: str or datetime

Ending date (inclusive)

date_format: str

Format string for date output (if returning strings)

null_prob: float

Probability of generating a null value (0.0 to 1.0)

distribution: str

Distribution to use for generating dates:

  • 'uniform': Uniform distribution between start and end dates

  • 'normal': Normal distribution centered on the midpoint

  • 'recent': Bias towards more recent dates

Returns:

pd.Series

Series of randomly generated dates as strings in the specified format

datagen.gen_numeric_column(size: int, data_type: str = 'float', min_val: int | float = 0, max_val: int | float = 100, distribution: str = 'uniform', null_prob: float = 0.0, precision: int | None = None) Series

Generate a pandas Series of random numbers.

Parameters:

size: int

Number of values to generate

data_type: str

Type of numeric data: ‘int’, ‘float’, or ‘decimal’

min_val: int or float

Minimum value (inclusive)

max_val: int or float

Maximum value (inclusive for ints, exclusive for floats)

distribution: str

Distribution to use for generating values:

  • 'uniform': Uniform distribution between min and max

  • 'normal': Normal distribution with mean=(min+max)/2 and std=(max-min)/6

  • 'exponential': Exponential distribution

  • 'lognormal': Log-normal distribution

null_prob: float

Probability of generating a null value (0.0 to 1.0)

precision: int, optional

For float/decimal, number of decimal places to round to

Returns:

pd.Series

Series of randomly generated numeric values

datagen.gen_sample_dataframe(rows: int, include_id: bool = True) DataFrame

Generate a sample pandas DataFrame from various column types.

Arguments:

rows (int): The number of rows to generate for the DataFrame. include_id (bool): Whether to include an ‘id’ column as a unique identifier for each row. Defaults to True.

Returns:

pd.DataFrame: A pandas DataFrame containing the sample data.

datagen.gen_string_column(size: int, length: int | Tuple[int, int] = 10, charset: str | None = None, prefix: str = '', suffix: str = '', null_prob: float = 0.0, pattern: str | None = None) Series

Generate a pandas Series of random strings.

Parameters:

size: int

Number of strings to generate

length: int or tuple(int, int)

If int, the exact length of each string. If tuple, the (min, max) length range for random string length.

charset: str, optional

String containing characters to use. If None, uses lowercase letters

prefix: str, optional

Prefix to add to each generated string

suffix: str, optional

Suffix to add to each generated string

null_prob: float, optional

Probability of generating a null value (0.0 to 1.0)

pattern: str, optional

Pattern to use for string generation with character classes:

  • ‘L’ = uppercase letter

  • ‘l’ = lowercase letter

  • ‘d’ = digit

  • ‘c’ = special character

  • ‘a’ = any alphanumeric character

Example: ‘Llldd-lldd’ would generate something like ‘Tgh45-jk78’

Returns:

pd.Series

Series of randomly generated strings
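
Example assembling several generated columns into a test table (a sketch; the import path is an assumption):

import pandas as pd
from uaine_pydat import datagen  # assumed import path

n = 1000
df = pd.DataFrame({
    'user_id': datagen.gen_string_column(n, pattern='Llldd-lldd'),
    'age': datagen.gen_numeric_column(n, data_type='int', min_val=18, max_val=90),
    'signup': datagen.gen_date_column(n, start_date='2021-01-01', end_date='2023-12-31', distribution='recent'),
    'tier': datagen.gen_categorical_column(n, categories=['free', 'pro', 'enterprise'], weights=[0.7, 0.2, 0.1]),
    'active': datagen.gen_bool_column(n, true_prob=0.8, null_prob=0.05),
})

# or a ready-made sample table with common column types
sample = datagen.gen_sample_dataframe(rows=100)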

datahash

datahash.hash256(datastr: str, sha_salt: str) str

Create a SHA-256 hash using the provided data and salt.

Args:

datastr (str): The data to be hashed. sha_salt (str): The salt to be used in the SHA-256 hashing.

Returns:

str: The hexadecimal digest of the SHA-256 hash.

datahash.hashhmac(datastr: str, sha_salt: bytes, method=<built-in function openssl_sha256>) str

Create an HMAC hash using the provided data, salt, and method.

Args:

datastr (str): The data to be hashed. sha_salt (bytes): The salt to be used in the HMAC hashing. method: The hashing method to be used (default is hashlib.sha256).

Returns:

str: The hexadecimal digest of the HMAC hash.

datahash.hashmd5(datastr: str, sha_salt: str) str

Create an MD5 hash using the provided data and salt.

Args:

datastr (str): The data to be hashed. sha_salt (str): The salt to be used in the MD5 hashing.

Returns:

str: The hexadecimal digest of the MD5 hash.

datahash.randomize_hash(hash_string: str, salt_length: int = 16) str

Randomizes a hash by using hash256 with a random salt.

Args:

hash_string (str): The original hash or string to randomize. salt_length (int): The length of the random salt in bytes (default is 16).

Returns:

str: A new randomized hash derived from the original using hash256.
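
Example (a sketch; the import path is an assumption and the salts shown are placeholders):

from uaine_pydat import datahash  # assumed import path

digest = datahash.hash256('jane.doe@example.com', 'my-static-salt')   # salted SHA-256 hex digest
mac = datahash.hashhmac('jane.doe@example.com', b'secret-key')        # HMAC (SHA-256 by default) hex digest
legacy = datahash.hashmd5('jane.doe@example.com', 'my-static-salt')   # salted MD5 digest
scrambled = datahash.randomize_hash(digest)                           # re-hash with a random 16-byte salt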

dataio

dataio.csv_to_parquet(input_file: str, separator: str = ',', output_file: str | None = None) None

Converts a CSV file to a Parquet file using Polars in streaming mode.

Parameters:
  • input_file (str): The path to the CSV file.

  • separator (str): The delimiter used in the CSV (default: comma).

  • output_file (Optional[str]): The path for the output Parquet file.

    If not provided, it defaults to the same prefix as input_file with a .parquet extension.

dataio.df_memory_usage(df: DataFrame) float

Calculate the total memory usage of a DataFrame with deep=False.

Parameters:

df (pd.DataFrame): The DataFrame whose memory usage is to be calculated.

Returns:

float: The total memory usage of the DataFrame in bytes.

dataio.read_flat_df(filepath: str) DataFrame

Read a flat file into a DataFrame.

Args:

filepath (str): The path to the flat file.

Returns:

pd.DataFrame: The DataFrame read from the file.

dataio.read_flat_psv(path: str) DataFrame

Read a pipe-separated values (PSV) file into a DataFrame.

Args:

path (str): The path to the PSV file.

Returns:

pd.DataFrame: The DataFrame read from the PSV file.

dataio.read_ini_file(file_path: str) dict

Read an INI file and return its contents as a dictionary.

Args:

file_path (str): The path to the INI file.

Returns:

dict: A dictionary containing the key-value pairs from the INI file.

dataio.read_json_file(filepath: str, orient: str = 'records', normalize: bool = False, record_path: str | None = None, meta: list | None = None, encoding: str = 'utf-8') DataFrame

Read a JSON file into a DataFrame.

Args:

filepath (str): The path to the JSON file.
orient (str): The format of the JSON structure. Default is 'records'.
normalize (bool): Whether to normalize nested JSON data. Default is False.
record_path (str or list): Path to the records in nested JSON. Default is None.
meta (list): Fields to use as metadata for each record. Default is None.
encoding (str): The file encoding. Default is 'utf-8'.

Returns:

pd.DataFrame: The DataFrame read from the JSON file.

dataio.read_sas_colnames(filepath: str, encoding: str = 'latin-1') list

Read SAS file column names.

Args:

filepath (str): The path to the SAS file. encoding (str): The encoding to use for reading the SAS file. Default is “latin-1”.

Returns:

list: A list of column names from the SAS file.

dataio.read_sas_metadata(filepath: str, encoding: str = 'latin-1') dict

Read SAS file metadata and return names, labels, formats, and lengths of columns.

Args:

filepath (str): The path to the SAS file. encoding (str): The encoding to use for reading the SAS file. Default is “latin-1”.

Returns:

dict: A dictionary containing the column names, labels, formats, and lengths.

dataio.read_xml_file(filepath: str, xpath: str = './*', attrs_only: bool = False, encoding: str = 'utf-8') DataFrame

Read an XML file into a DataFrame.

Args:

filepath (str): The path to the XML file.
xpath (str): XPath string to parse specific nodes. Default is './*'.
attrs_only (bool): Parse only the attributes, not the child elements. Default is False.
encoding (str): The file encoding. Default is 'utf-8'.

Returns:

pd.DataFrame: The DataFrame read from the XML file.

dataio.select_dataset_ui(directory: str, extension: str) str

List the files with the specified extension in the given directory and prompt the user to select one.

Parameters:

directory (str): The directory to search for files.
extension (str): The file extension to filter by.

Returns:

str: The filename of the selected dataset.

dataio.set_globals_from_config(configpath: str) int

Sets global variables from a configuration file.

Parameters:

configpath (str): Path to the configuration file.

Returns:

int: The number of global variables set.

dataio.write_flat_df(df: DataFrame, filepath: str, index: bool = False)

Write a DataFrame to a flat file in different formats.

Args:

df (pd.DataFrame): The DataFrame to be written. filepath (str): The path where the file will be saved. index (bool): Whether to write row names (index). Default is False.

Returns:

None

dataio.write_json_file(df: DataFrame, filepath: str, orient: str = 'records', index: bool = False, indent: int = 4)

Write a DataFrame to a JSON file.

Args:

df (pd.DataFrame): The DataFrame to be written.
filepath (str): The path where the JSON file will be saved.
orient (str): The format of the JSON structure. Default is 'records'.
index (bool): Whether to include the index in the JSON. Default is False.
indent (int): The indentation level for the JSON file. Default is 4.

Returns:

None

dataio.write_xml_file(df: DataFrame, filepath: str, index: bool = False, root_name: str = 'data', row_name: str = 'row', attr_cols: list | None = None)

Write a DataFrame to an XML file.

Args:

df (pd.DataFrame): The DataFrame to be written.
filepath (str): The path where the XML file will be saved.
index (bool): Whether to include the index in the XML. Default is False.
root_name (str): The name of the root element. Default is 'data'.
row_name (str): The name of each row element. Default is 'row'.
attr_cols (list): List of columns to write as attributes, not elements. Default is None.

Returns:

None
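
Example of a small read/write workflow (a sketch; the import path is an assumption and the file names are placeholders):

from uaine_pydat import dataio  # assumed import path

df = dataio.read_flat_df('input.csv')              # format inferred from the file
meta = dataio.read_sas_metadata('input.sas7bdat')  # column names, labels, formats, lengths
dataio.write_json_file(df, 'output.json', orient='records', indent=2)
dataio.csv_to_parquet('big_input.csv', separator='|')  # streams to big_input.parquet via Polars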

datatransform

datatransform.add_prefix(string: str, prefix: str) str

Add the specified prefix to the string.

Parameters:

string (str): The original string.
prefix (str): The prefix to add to the string.

Returns:

str: The string with the prefix added.

datatransform.add_suffix(string: str, suffix: str) str

Add the specified suffix to the string.

Parameters:

string (str): The original string.
suffix (str): The suffix to add to the string.

Returns:

str: The string with the suffix added.

datatransform.break_into_lines(string: str) list[str]

Breaks a string into a list of lines.

Args:

string (str): The input string to be broken into lines.

Returns:

list[str]: A list of lines from the input string.

datatransform.dataframe_to_json(df: DataFrame, orient: str = 'records', date_format: str = 'iso', indent: int | None = None) str

Convert a DataFrame to a JSON string with various orientation options.

Parameters:

df: pd.DataFrame

The DataFrame to convert to JSON

orient: str, default 'records'

The JSON string orientation. See json_to_dataframe for options.

date_format: str, default 'iso'

Format for dates in the resulting JSON:

  • 'epoch': Use Unix epoch (seconds since 1970-01-01)

  • 'iso': ISO 8601 formatted dates

indent: int, default None

Indentation level for the resulting JSON string. None = no indentation.

Returns:

str

JSON string representation of the DataFrame

datatransform.dataframe_to_xml(df: DataFrame, root_name: str = 'data', row_name: str = 'row') str

Convert a DataFrame to an XML string.

Parameters:

df: pd.DataFrame

The DataFrame to convert to XML

root_name: str, default 'data'

The name of the root XML element

row_name: str, default 'row'

The name of each row element

Returns:

str

XML string representation of the DataFrame

datatransform.json_extract_subtree(json_data, path: str) any

Extract a subtree from a JSON object using a dot-notation path.

Parameters:

json_data: dict or list

The JSON data to extract from

path: str

Path to the subtree using dot notation (e.g., 'person.address.city'). Use array indices like 'results.0.name' to access list elements.

Returns:

any

The subtree at the specified path, or None if path doesn’t exist

Examples:

>>> data = {'person': {'name': 'John', 'addresses': [{'city': 'New York'}, {'city': 'Boston'}]}}
>>> json_extract_subtree(data, 'person.addresses.0.city')
'New York'

datatransform.json_to_dataframe(json_data, orient='records', normalize=False, record_path=None, meta=None, encoding='utf-8')

Convert JSON data into a pandas DataFrame.

Parameters:

json_data: str, dict, list, or path to file

The JSON data to convert. Can be:

  • A string containing JSON data

  • A Python dict or list containing JSON data

  • A file path to a JSON file

orient: str, default 'records'

The JSON string orientation. Allowed values:

  • 'records': list-like [{column -> value}, ... ]

  • 'split': dict-like {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}

  • 'index': dict-like {index -> {column -> value}}

  • 'columns': dict-like {column -> {index -> value}}

  • 'values': just the values array

normalize: bool, default False

Whether to normalize semi-structured JSON data into a flat table

record_path: str or list of str, default None

Path in each object to list of records. If not passed, data will be assumed to be an array of records.

meta: list of str, default None

Fields to use as metadata for each record in resulting DataFrame

encoding: str, default 'utf-8'

Encoding to use when reading JSON from a file

Returns:

pd.DataFrame

The converted DataFrame

Examples:

# From a JSON string
>>> json_str = '{"name": "John", "age": 30, "city": "New York"}'
>>> df = json_to_dataframe(json_str)

# From a file
>>> df = json_to_dataframe('data.json')

# With nested data
>>> json_str = '{"users": [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]}'
>>> df = json_to_dataframe(json_str, record_path='users')

datatransform.merge_dataframes(df_list: list[DataFrame]) DataFrame

Merges a list of DataFrames into a single DataFrame, aligning columns by name. Missing columns will be filled with NaN.
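
Example (a sketch; the import path is an assumption):

import pandas as pd
from uaine_pydat import datatransform  # assumed import path

a = pd.DataFrame({'id': [1, 2], 'x': [10, 20]})
b = pd.DataFrame({'id': [3], 'y': [99]})
merged = datatransform.merge_dataframes([a, b])  # columns id, x, y; missing values filled with NaN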

datatransform.merge_json_objects(json1: dict, json2: dict, merge_lists: bool = False) dict

Merge two JSON objects, with the second one taking precedence for overlapping keys.

Parameters:

json1: dict

First JSON object (base)

json2: dict

Second JSON object (takes precedence when keys overlap)

merge_lists: bool, default False

If True, merge list items; if False, replace lists entirely

Returns:

dict

Merged JSON object

datatransform.xml_to_dataframe(xml_data, xpath: str = './*') DataFrame

Convert XML data to a pandas DataFrame.

Parameters:

xml_data: str or file-like object or path

The XML data to convert. Can be:

  • A string containing XML data

  • A file path to an XML file

  • A file-like object containing XML data

xpath: str, default './*'

XPath string to parse specific nodes

Returns:

pd.DataFrame

The DataFrame representation of the XML data

duckfunc

duckfunc.does_table_exist(db_con, dbname: str, tablename: str) bool

Check if a table exists in the specified database.

Args:

db_con: The database connection object. dbname (str): The name of the database. tablename (str): The name of the table.

Returns:

bool: True if the table exists, False otherwise.

duckfunc.getCurrentTimeForDuck(timezone_included: bool = False) str

Get the current time formatted for DuckDB, optionally including the timezone.

Args:

timezone_included (bool): If True, includes the timezone in the returned string.

Returns:

str: The current time formatted as ‘YYYY-MM-DD HH:MM:SS’ (with optional timezone).

duckfunc.getDuckVersion(con) str

Get the connected DuckDB version.

Args:

con: The database connection object.

Returns:

str: The version of the connected DuckDB instance.

duckfunc.get_attached_dbs(db_con) DataFrame

Get the list of attached databases.

Args:

db_con: The database connection object.

Returns:

DataFrame: A DataFrame containing the database name, path, and type.

duckfunc.get_inventory(db_con) DataFrame

Get the inventory of tables.

Args:

db_con: The database connection object.

Returns:

DataFrame: A DataFrame containing all tables.

duckfunc.get_table_as_df(con, db_name: str, table_name: str) DataFrame

Query a table from the specified database and return it as a pandas DataFrame.

Args:

con: Database connection object
db_name (str): Name of the database
table_name (str): Name of the table

Returns:

DataFrame: The table contents as a pandas DataFrame, or None if the table doesn’t exist

duckfunc.init_table(con, frame: DataFrame, db: str, tablename: str) bool

Initialize a table in the specified database.

Args:

con: The database connection object.
frame (DataFrame): A DataFrame containing columns VARNAME and TYPE, which should be DuckDB-compatible.
db (str): The name of the database.
tablename (str): The name of the table.

Returns:

bool: True if the table was created, False if it already exists.

duckfunc.save_from_db(con, db_name: str, table_name: str, output_path: str) bool

Query a table from the specified database and save it to the given output path. The output format is determined from the file extension of the output path.

Args:

con: Database connection object
db_name (str): Name of the database
table_name (str): Name of the table
output_path (str): Path to save the output file (extension determines format)

Returns:

bool: True if the table existed and was saved, False otherwise
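
A typical round trip (a sketch; the import path is an assumption, duckdb.connect comes from the duckdb dependency, and the database name is assumed to match the file basename):

import duckdb
import pandas as pd
from uaine_pydat import duckfunc  # assumed import path

con = duckdb.connect('analytics.duckdb')
print(duckfunc.getDuckVersion(con))

# declare a simple schema and create the table if it does not exist
schema = pd.DataFrame({'VARNAME': ['id', 'name'], 'TYPE': ['INTEGER', 'VARCHAR']})
duckfunc.init_table(con, schema, 'analytics', 'customers')

if duckfunc.does_table_exist(con, 'analytics', 'customers'):
    df = duckfunc.get_table_as_df(con, 'analytics', 'customers')
    duckfunc.save_from_db(con, 'analytics', 'customers', 'customers.parquet')  # format from extension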

fileio

fileio.addsyspath(directory: str)

Add the specified directory to the system path if it is not already included.

Parameters:

directory (str): The directory to be added to the system path.

Returns:

None

fileio.calculate_checksums(dir_path)

Calculate MD5 checksums for all files in the specified directory.

Parameters:

dir_path (str) – Path to the directory containing files.

Returns:

Dictionary mapping file paths to their MD5 checksum.

Return type:

dict

fileio.create_filepath_dirs(path: str)

Creates all directories needed for a given file path.

If the path contains folders, this function creates all necessary directories in the path if they don’t already exist.

Parameters:

path (str): The file path for which to create directories.

Returns:

None

fileio.download_file_from_url(url: str, save_path: str)

Downloads a file from the given URL and saves it to the specified path.

Args:

url (str): The URL of the file to download. save_path (str): The file path where the downloaded file will be saved.

Returns:

None

fileio.gen_random_subfolder(master_dir: str) str

Generates a random subfolder within the specified master directory.

Args:

master_dir (str): The path to the master directory where the subfolder will be created.

Returns:

str: The path to the newly created subfolder.

fileio.get_file_extension(filepath: str) str

Get the file extension of the given file path.

Parameters:

filepath – The path of the file.

Returns:

The file extension of the file.

fileio.list_dirs(main_dir: str) list

List all directories within the specified main directory.

Args:

main_dir (str): The main directory path to list directories from.

Returns:

list: A list of directory names within the specified main directory.

fileio.list_files_of_extension(directory: str, extn: str) list[str]

List all files in the specified directory with the given extension.

Parameters:
  • directory – The directory to search in.

  • extn – The file extension to filter by.

Returns:

A list of file paths with the specified extension.

fileio.mv_file(src: str, dest: str)

Moves a file from the source path to the destination path using shutil.

Parameters:

src (str): The path of the file to be moved. dest (str): The destination path where the file should be moved.

Returns:

None

fileio.read_file_to_bytes(file_path: str) bytes

Read the string content from the specified file and convert it to bytes using UTF-8 encoding.

Parameters:

file_path (str): The path to the file.

Returns:

bytes: The content of the file as bytes.

fileio.read_file_to_string(file_path: str) str

Read the string content from the specified file.

Parameters:

file_path (str): The path to the file.

Returns:

str: The content of the file as a string.

fileio.remove_directory(dir_path: str) bool

Removes a directory at the specified path.

Attempts to remove the directory and prints the result. If an error occurs during removal, the exception is caught and an error message is printed.

Parameters:

dir_path (str): The path to the directory to be removed.

Returns:

bool: True if directory was successfully removed, False otherwise.

Raises:

No exceptions are raised as they are caught and printed internally.
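
Example (a sketch; the import path is an assumption and the paths/URL are placeholders):

from uaine_pydat import fileio  # assumed import path

fileio.create_filepath_dirs('output/reports/summary.csv')   # make any missing folders
fileio.download_file_from_url('https://example.com/data.csv', 'output/reports/data.csv')
csv_files = fileio.list_files_of_extension('output/reports', 'csv')
checksums = fileio.calculate_checksums('output/reports')     # {file path: MD5 checksum}
fileio.mv_file('output/reports/data.csv', 'archive/data.csv')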

systeminfo

systeminfo.free_gb_in_drive(drive: str) float

Calculate the free space in a specified drive in gigabytes (GB).

Parameters:

drive (str): The drive to check the free space of.

Returns:

float: The free space in gigabytes (GB).

systeminfo.gather_free_space_in_drive(drive: str) float

Gather the free space in a specified drive.

Parameters:

drive (str): The drive to check the free space of. If the drive is a single letter, it is assumed to be a Windows drive.

Returns:

float: The free space in bytes.

systeminfo.get_battery_info() dict

Get information about the system battery.

Returns:
dict: Dictionary containing battery percentage, time left, and power plugged status.

Returns None if no battery is present.

systeminfo.get_cpu_usage_percent() float

Get the current CPU usage as a percentage.

Returns:

float: Current CPU usage percentage.

systeminfo.get_formatted_uptime() str

Get the system uptime formatted as days, hours, minutes, seconds.

Returns:

str: Formatted uptime string.

systeminfo.get_free_ram() int

Get the amount of free RAM available in bytes.

Returns:

int: The amount of free RAM in bytes.

systeminfo.get_free_ram_in_gb() float

Get the amount of free RAM on the system in gigabytes.

This function uses the psutil library to retrieve the amount of free RAM and converts it from bytes to gigabytes.

Returns:

float: The amount of free RAM in gigabytes.

systeminfo.get_installed_ram_gb() int

Get the total amount of installed RAM in gigabytes (GB).

Returns:

int: The total amount of installed RAM in gigabytes (GB).

systeminfo.get_largest_drive() dict[str, any]

Identifies and returns information about the drive with the most free space.

The function finds the drive with the maximum available free space and returns its information with only the letters kept in the drive name.

Returns:

dict[str, any]: Dictionary containing information about the drive with the most free space, with the drive name containing only letters.

systeminfo.get_network_stats() DataFrame

Get statistics for all network interfaces.

Returns:

pd.DataFrame: DataFrame with network interface statistics.

systeminfo.get_number_virtual_cores() int

Get the number of virtual (logical) CPU cores including hyperthreads.

Returns:

int: The number of virtual CPU cores.

systeminfo.get_per_cpu_usage_percent() list[float]

Get CPU usage percentage for each individual CPU core.

Returns:

list[float]: List of CPU usage percentages for each core.

systeminfo.get_physical_cores() int

Get the number of physical CPU cores.

Returns:

int: The number of physical CPU cores.

systeminfo.get_system_info() dict

Get general system information.

Returns:

dict: Dictionary containing OS, hostname, and platform information.

systeminfo.get_system_uptime() float

Get the system uptime in seconds.

Returns:

float: System uptime in seconds.

systeminfo.get_top_processes(n=5) DataFrame

Get the top n processes by memory usage.

Parameters:

n (int): Number of processes to return. Default is 5.

Returns:

pd.DataFrame: DataFrame with top processes information.

systeminfo.list_drive_spaces() DataFrame

List all available drives and their free space in gigabytes (GB).

Returns:

pd.DataFrame: A DataFrame with the drive names and their free space in gigabytes (GB).

systeminfo.list_drives() list[str]

List all available drives on the system.

Returns:

list[str]: A list of device names for all available drives.
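
Example (a sketch; the import path is an assumption):

from uaine_pydat import systeminfo  # assumed import path

info = systeminfo.get_system_info()         # OS, hostname, platform
free_gb = systeminfo.get_free_ram_in_gb()   # free RAM in GB
cpu = systeminfo.get_cpu_usage_percent()    # current CPU usage %
drives = systeminfo.list_drive_spaces()     # DataFrame of drives and free space in GB
top = systeminfo.get_top_processes(n=10)    # DataFrame of the 10 largest processes by memory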

Release History

Version 1.5.3

  • Blob helper module now contains a function for checking if a SAS token works for a container

Version 1.5.2

  • Blob helper module now contains a checksum function for listed blobs

Version 1.5.1

  • Blob helper module fix for the makedirs argument when downloading blobs

  • New chunking based function for downloading blobs

Version 1.5

  • Enhanced blob helper module with new functions:

    • list_blob_content: Lists all blobs in a specified container or subfolder

    • download_all_blobs: Downloads all blobs from a container or subfolder to a specified local path

    • get_account_url: Generates Azure Storage account URL from storage account name

Version 1.4

  • Added new function to datatransform module:

    • merge_dataframes: Merges a list of pandas DataFrames into a single DataFrame, aligning columns by name and filling missing columns with NaN.

Version 1.3

  • Added new data conversion functionality in dataio module:

    • csv_to_parquet: Converts CSV files to Parquet format using Polars in streaming mode with support for custom separators and output file naming

  • Enhanced hashing capabilities with new functions in datahash module:

    • hashmd5: Creates an MD5 hash using provided data and salt

  • Added new Azure blob module (blobhelper) for path operations:

    • get_blob_container_path: Returns the absolute path of a blob container

    • get_blob_subfolder_path: Returns the absolute path of a blob subfolder within a container

Version 1.2.1

  • Added new function to fileio module:

    • calculate_checksums: Calculates MD5 checksums for all files in a specified directory, returning a dictionary mapping file paths to their checksums.

  • Methods that return no value are no longer displayed in the function signature

  • Fix applied for move file function to be platform independent

Version 1.2

  • Enhanced datatransform module with:

    • json_to_dataframe: Converts JSON data into pandas DataFrames with support for:

      • Multiple input types (JSON strings, Python dicts/lists, file paths)

      • Custom orientation options for structured data

      • Normalization of nested JSON structures

      • Handling of both single objects and arrays of records

      • Customizable encoding for file reading

  • Fixed systeminfo module with:

    • Reworked drive listing to continue functioning when some drives are inaccessible

    • Improved reliability of get_largest_drive() function

  • Improved fileio module with:

    • Enhanced remove_directory() to return boolean success/failure status instead of only printing messages

    • Added proper type hints and improved documentation for key functions

Version 1.1

  • Added new generic DuckDB functions to output tables to files

  • Expanded the systeminfo module to report:

    • CPU usage

    • System uptime

    • System information

    • Network stats

    • Top processes

  • Enhanced dataclean module with:

    • keep_only_letters: Filters strings to keep only alphabetical characters

    • keep_alphanumeric: Filters strings to keep only alphanumeric characters

    • normalize_text: Converts text to lowercase and removes accents

    • remove_empty_rows: Removes rows where all values are empty or NaN

    • convert_to_numeric: Converts specified columns to numeric type

    • check_column_completeness: Calculates the percentage of non-missing values for each column

  • Added new datagen module for synthetic data generation:

    • gen_string_column: Generates random string data with patterns and customization

    • gen_numeric_column: Generates numeric data with various distributions

    • gen_date_column: Generates date data with customizable ranges and formats

    • gen_categorical_column: Generates categorical data with optional weighted distributions

    • gen_bool_column: Generates boolean data with custom labels

    • gen_dataframe: Creates complete dataframes with customizable columns

    • gen_sample_dataframe: Generates ready-to-use test dataframes with common column types

Version 1.0.2

  • Added Bitgen Features:

    • generate_random_uuid: Generates a random UUID (Universally Unique Identifier) using UUID version 4, with dashes removed.

    • generate_custom_uuid: Generates a random UUID using UUID version 4, formatted with a specified marker. The default marker is a dash (‘-‘), but it can be customized.

Version 1.0.1

  • First full release of package