Uaine.Pydat Package
This package provides comprehensive tools for data handling, processing, and database operations in Python. It includes specialised functions for file I/O operations, data transformation, data cleaning, system information gathering, and data generation. Key features include PSV file handling, cryptographic hash functions, data table cleaning methods, DuckDB integration with snippet queries, random data generation utilities, and XML/config file parsers. The package aims to simplify common data manipulation tasks while providing a flexible framework for both basic and advanced data operations.
Description
A Python package that streamlines data handling, processing, and database operations through a collection of utility functions and tools.
Uaine.Pydat provides a comprehensive toolkit for data scientists, analysts, and developers working with structured data. The package simplifies common data manipulation tasks while offering specialized functionality for file operations, data transformation, table cleaning, and database interactions.
Key Features include File I/O Operations for reading/writing various file formats, listing files by extension, and managing system paths; Data Transformation tools for reshaping, converting, and manipulating data structures with minimal code; Data Cleaning methods to sanitize, standardize, and prepare data tables for analysis; DuckDB Integration with helper functions and snippet queries; Cryptographic Hashing for data integrity and anonymization; System Information utilities to gather system metrics and resource usage; Data Generation for testing and development scenarios; and Configuration Handling for XML, INI, and other formats.
Core Modules include dataio.py for data input/output operations and format conversion; fileio.py for file system operations and path management; datatransform.py for data structure transformation and manipulation; dataclean.py for data cleaning and standardization; duckfunc.py for DuckDB database interactions; datahash.py for cryptographic hashing functions; systeminfo.py for system information gathering; datagen.py for random data generation; and bitgen.py for low-level bit generation utilities.
Common Use Cases for this package include simplifying ETL workflows, streamlining data preparation for analysis and machine learning, managing database operations with less boilerplate code, generating test data for development, monitoring system resources during data processing tasks, and securing sensitive data through hashing and anonymization.
This package aims to reduce the complexity of common data handling tasks by providing ready-made solutions that follow best practices while remaining flexible enough to adapt to various data processing requirements.
Dependencies
pandas
pyreadstat
requests
duckdb
wheel
twine
psutil
lxml
polars
azure-storage-blob
tqdm
bitgen
- bitgen.gen_random_hex_string(bitlength) str
Generates a random hexadecimal string of the given bit length.
- Parameters:
bitlength (int): The length of the bit string to generate.
- Returns:
str: A random hexadecimal string of the given bit length.
- bitgen.generate_256_bit_string() str
Generates a random 256-bit hexadecimal string.
- Returns:
str: A random 256-bit hexadecimal string.
- bitgen.generate_8_bit_string() str
Generates a random 8-bit hexadecimal string.
- Returns:
str: A random 8-bit hexadecimal string.
- bitgen.generate_custom_uuid(marker: str = '-') str
Generates a random UUID (Universally Unique Identifier) using UUID version 4, formatted with a specified marker.
- Parameters:
marker (str): The character to use as a separator in the UUID. Default is ‘-‘.
- Returns:
str: A string representation of a random UUID with the specified marker.
- bitgen.generate_random_uuid() str
Generates a random UUID (Universally Unique Identifier) using UUID version 4, with dashes removed.
- Returns:
str: A string representation of a random UUID without dashes.
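A short usage sketch of the bitgen functions above. The import path shown (uainepydat.bitgen) is an assumption and may differ in your installation.

# Import path is an assumption -- adjust to how the package is installed.
from uainepydat import bitgen

token = bitgen.gen_random_hex_string(128)      # random hex string covering 128 bits
key = bitgen.generate_256_bit_string()         # random 256-bit hex string
flag = bitgen.generate_8_bit_string()          # random 8-bit hex string

plain_uuid = bitgen.generate_random_uuid()     # UUID4 with dashes removed
custom_uuid = bitgen.generate_custom_uuid('_') # UUID4 with '_' as the separator

print(token, key, flag, plain_uuid, custom_uuid, sep='\n')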
blobhelper
- blobhelper.check_sas_token(account_url, container, sastoken)
Checks if the provided SAS token is valid for the given container.
- Args:
account_url (str): The Azure Storage account URL.
container (str): The name of the container to check access for.
sastoken (str): The SAS token to validate.
- Returns:
bool: True if the SAS token is valid and has access, False otherwise.
str: Optional error message if invalid.
- blobhelper.download_all_blobs(account_url, container, folder_path, sastoken, download_loc, file_extn='', makedirs=True)
Download all blobs from an Azure Storage container to a local directory.
- Args:
account_url (str): The Azure Storage account URL.
container (str): The name of the container to download blobs from.
folder_path (str): The folder path prefix to filter blobs by.
sastoken (str): The SAS token for authentication.
download_loc (str): Local directory path where blobs will be downloaded.
file_extn (str, optional): File extension to filter blobs by (e.g., ‘txt’, ‘pdf’). If empty, downloads all blobs. Defaults to “”.
makedirs (bool, optional): Whether to create the download directory if it doesn’t exist. Defaults to True.
- Note:
This function will create the download directory if it doesn’t exist and makedirs is True. Files are downloaded with their original names from the blob storage.
- blobhelper.get_account_url(storage_account)
Generate the Azure Storage account URL from the storage account name.
- Args:
storage_account (str): The name of the Azure Storage account.
- Returns:
- str: The complete Azure Storage account URL in the format
‘https://{storage_account}.blob.core.windows.net’
- Example:
>>> get_account_url('myaccount')
'https://myaccount.blob.core.windows.net'
- blobhelper.get_blob_container_path(storage_account, container)
Returns the absolute path of the blob container.
- Parameters:
storage_account – Name of the Azure Storage account.
container – Name of the blob container.
- Returns:
Absolute path as a string.
- blobhelper.get_blob_md5_checksums(account_url, container, sastoken, blob_list, use_hex=False)
Retrieves MD5 checksums for a list of blobs in Azure Blob Storage.
- Args:
account_url (str): The Azure Storage account URL.
container (str): The name of the container.
sastoken (str): The SAS token for authentication.
blob_list (list): List of BlobProperties objects.
use_hex (bool): If True, returns checksum as hex; otherwise Base64. Default is False.
- Returns:
dict: A dictionary mapping blob names to their MD5 checksums, or None if not available.
- blobhelper.get_blob_subfolder_path(storage_account, container, subfolder)
Returns the absolute path of a blob subfolder using the container path function.
- Parameters:
storage_account – Name of the Azure Storage account.
container – Name of the blob container.
subfolder – Path to subfolder within the container.
- Returns:
Absolute path as a string.
- blobhelper.list_blob_content(account_url, container, folder_path, sastoken, file_extn='')
List blobs in an Azure Storage container with optional file extension filtering.
- Args:
account_url (str): The Azure Storage account URL.
container (str): The name of the container to list blobs from.
folder_path (str): The folder path prefix to filter blobs by.
sastoken (str): The SAS token for authentication.
file_extn (str, optional): File extension to filter blobs by (e.g., ‘txt’, ‘pdf’). If empty, returns all blobs. Defaults to “”.
- Returns:
list: A list of BlobProperties objects matching the specified criteria.
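A usage sketch combining the helpers above. The import path, account name, container name, folder prefix, and SAS token are placeholders; check_sas_token is used here as a simple truthiness check, following its documented return value.

# Import path and all credentials below are placeholders/assumptions.
from uainepydat import blobhelper

account_url = blobhelper.get_account_url('myaccount')   # 'https://myaccount.blob.core.windows.net'
container = 'raw-data'
sastoken = '<sas-token>'

if blobhelper.check_sas_token(account_url, container, sastoken):
    # List only CSV blobs under a folder prefix, then download them locally.
    blobs = blobhelper.list_blob_content(account_url, container, 'exports/2024',
                                         sastoken, file_extn='csv')
    checksums = blobhelper.get_blob_md5_checksums(account_url, container, sastoken,
                                                  blobs, use_hex=True)
    blobhelper.download_all_blobs(account_url, container, 'exports/2024', sastoken,
                                  download_loc='./downloads', file_extn='csv', makedirs=True)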
dataclean
- dataclean.check_column_completeness(df: DataFrame) dict
Calculate the percentage of non-missing values for each column.
- Args:
df (DataFrame): Input DataFrame
- Returns:
dict: Dictionary mapping column names to completeness percentage
- dataclean.clean_whitespace_in_df(df: DataFrame) DataFrame
Remove leading and trailing whitespace from all string columns in a DataFrame.
- Parameters:
df (DataFrame): The input DataFrame.
- Returns:
DataFrame: The input DataFrame with leading and trailing whitespace removed from string columns.
- dataclean.convert_to_numeric(df: DataFrame, columns: list) DataFrame
Convert specified columns to numeric type, with errors coerced to NaN.
- Args:
df (DataFrame): Input DataFrame
columns (list): List of column names to convert
- Returns:
DataFrame: DataFrame with specified columns converted to numeric
- dataclean.keep_alphanumeric(input_string: str) str
Filter a string to keep only alphanumeric characters.
- Args:
input_string (str): The input string to filter
- Returns:
str: String containing only alphanumeric characters
- dataclean.keep_only_letters(input_string: str) str
Filter a string to keep only alphabetic characters (letters).
- Args:
input_string (str): The input string to filter
- Returns:
str: String containing only letters from the input
- dataclean.normalize_text(input_string: str) str
Normalise text by converting to lowercase and removing accents.
- Args:
input_string (str): The input string to normalize
- Returns:
str: Normalized string
- dataclean.remove_empty_rows(df: DataFrame) DataFrame
Remove rows where all values are empty or NaN.
- Args:
df (DataFrame): Input DataFrame
- Returns:
DataFrame: The input DataFrame with all empty rows removed
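A minimal sketch of a cleaning pass using the functions above; the import path and the sample data are illustrative only.

import pandas as pd
# Import path is an assumption -- adjust as needed.
from uainepydat import dataclean

df = pd.DataFrame({
    'name': ['  Alice ', ' Bob', None],
    'score': ['10', 'twelve', ' 8 '],
})

df = dataclean.clean_whitespace_in_df(df)          # strip leading/trailing whitespace
df = dataclean.convert_to_numeric(df, ['score'])   # non-numeric values become NaN
df = dataclean.remove_empty_rows(df)               # drop rows that are entirely empty/NaN

print(dataclean.check_column_completeness(df))     # % of non-missing values per column
print(dataclean.normalize_text('Crème Brûlée'))    # 'creme brulee'
print(dataclean.keep_alphanumeric('AB-12/34'))     # 'AB1234'
print(dataclean.keep_only_letters('AB-12/34'))     # 'AB'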
datagen
- datagen.gen_bool_column(size: int, true_prob: float = 0.5, null_prob: float = 0.0, labels: Tuple[str, str] | None = None) Series
Generate a pandas Series of random boolean values.
Parameters:
- size: int
Number of values to generate
- true_prob: float
Probability of generating a True value (0.0 to 1.0)
- null_prob: float
Probability of generating a null value (0.0 to 1.0)
- labels: tuple(str, str), optional
Custom labels for (False, True) values. If provided, returns strings instead of booleans.
Returns:
- pd.Series
Series of randomly generated boolean values or custom labels
- datagen.gen_categorical_column(size: int, categories: List[str] | None = None, weights: List[float] | None = None, null_prob: float = 0.0) Series
Generate a pandas Series of random categorical values.
Parameters:
- size: int
Number of values to generate
- categories: list of str
List of possible categorical values
- weights: list of float, optional
Probability weights for each category. Must sum to 1.0 if provided.
- null_prob: float
Probability of generating a null value (0.0 to 1.0)
Returns:
- pd.Series
Series of randomly generated categorical values
- datagen.gen_dataframe(rows: int, columns: dict, include_id: bool = True) DataFrame
Generate a pandas DataFrame with specified columns.
Parameters:
- rows: int
Number of rows to generate
- columns: dict
Dictionary where keys are column names and values are functions to generate the column data
- include_id: bool
Whether to include an ‘id’ column with sequential integers
Returns:
- pd.DataFrame
Generated DataFrame with the specified columns
- datagen.gen_date_column(size: int, start_date: str | datetime = '2020-01-01', end_date: str | datetime = '2023-12-31', date_format: str = '%Y-%m-%d', null_prob: float = 0.0, distribution: str = 'uniform') Series
Generate a pandas Series of random dates.
Parameters:
- size: int
Number of dates to generate
- start_date: str or datetime
Starting date (inclusive)
- end_date: str or datetime
Ending date (inclusive)
- date_format: str
Format string for date output (if returning strings)
- null_prob: float
Probability of generating a null value (0.0 to 1.0)
- distribution: str
Distribution to use for generating dates:
‘uniform’: Uniform distribution between start and end dates
‘normal’: Normal distribution centered on the midpoint
‘recent’: Bias towards more recent dates
Returns:
- pd.Series
Series of randomly generated dates as strings in the specified format
- datagen.gen_numeric_column(size: int, data_type: str = 'float', min_val: int | float = 0, max_val: int | float = 100, distribution: str = 'uniform', null_prob: float = 0.0, precision: int | None = None) Series
Generate a pandas Series of random numbers.
Parameters:
- size: int
Number of values to generate
- data_type: str
Type of numeric data: ‘int’, ‘float’, or ‘decimal’
- min_val: int or float
Minimum value (inclusive)
- max_val: int or float
Maximum value (inclusive for ints, exclusive for floats)
- distribution: str
Distribution to use for generating values:
‘uniform’: Uniform distribution between min and max
‘normal’: Normal distribution with mean=(min+max)/2 and std=(max-min)/6
‘exponential’: Exponential distribution
‘lognormal’: Log-normal distribution
- null_prob: float
Probability of generating a null value (0.0 to 1.0)
- precision: int, optional
For float/decimal, number of decimal places to round to
Returns:
- pd.Series
Series of randomly generated numeric values
- datagen.gen_sample_dataframe(rows: int, include_id: bool = True) DataFrame
Generate a sample pandas DataFrame from various column types.
- Arguments:
rows (int): The number of rows to generate for the DataFrame.
include_id (bool): Whether to include an ‘id’ column as a unique identifier for each row. Defaults to True.
- Returns:
pd.DataFrame: A pandas DataFrame containing the sample data.
- datagen.gen_string_column(size: int, length: int | Tuple[int, int] = 10, charset: str | None = None, prefix: str = '', suffix: str = '', null_prob: float = 0.0, pattern: str | None = None) Series
Generate a pandas Series of random strings.
Parameters:
- size: int
Number of strings to generate
- length: int or tuple(int, int)
If int, the exact length of each string. If tuple, the (min, max) length range for the random string length.
- charset: str, optional
String containing characters to use. If None, uses lowercase letters
- prefix: str, optional
Prefix to add to each generated string
- suffix: str, optional
Suffix to add to each generated string
- null_prob: float, optional
Probability of generating a null value (0.0 to 1.0)
- pattern: str, optional
Pattern to use for string generation with character classes:
‘L’ = uppercase letter
‘l’ = lowercase letter
‘d’ = digit
‘c’ = special character
‘a’ = any alphanumeric character
Example: ‘Llldd-lldd’ would generate something like ‘Tgh45-jk78’
Returns:
- pd.Series
Series of randomly generated strings
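The column generators above can be combined into a test table. The sketch below assumes the import path shown and builds the DataFrame manually from individual Series; gen_sample_dataframe is the quickest route to a ready-made table.

import pandas as pd
# Import path is an assumption -- adjust as needed.
from uainepydat import datagen

n = 100
df = pd.DataFrame({
    'code': datagen.gen_string_column(n, pattern='Llldd-lldd'),      # e.g. 'Tgh45-jk78'
    'amount': datagen.gen_numeric_column(n, data_type='float', min_val=0, max_val=500,
                                         distribution='normal', precision=2),
    'signup': datagen.gen_date_column(n, start_date='2022-01-01', end_date='2023-12-31',
                                      distribution='recent'),
    'tier': datagen.gen_categorical_column(n, categories=['free', 'pro', 'team'],
                                           weights=[0.6, 0.3, 0.1]),
    'active': datagen.gen_bool_column(n, true_prob=0.8, null_prob=0.05),
})

# Or grab a ready-made sample table for quick testing.
sample = datagen.gen_sample_dataframe(rows=100, include_id=True)
print(df.head(), sample.head(), sep='\n')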
datahash
- datahash.hash256(datastr: str, sha_salt: str) str
Create a SHA-256 hash using the provided data and salt.
- Args:
datastr (str): The data to be hashed. sha_salt (str): The salt to be used in the SHA-256 hashing.
- Returns:
str: The hexadecimal digest of the SHA-256 hash.
- datahash.hashhmac(datastr: str, sha_salt: bytes, method=<built-in function openssl_sha256>) str
Create an HMAC hash using the provided data, salt, and method.
- Args:
datastr (str): The data to be hashed.
sha_salt (bytes): The salt to be used in the HMAC hashing.
method: The hashing method to be used (default is hashlib.sha256).
- Returns:
str: The hexadecimal digest of the HMAC hash.
- datahash.hashmd5(datastr: str, sha_salt: str) str
Create an MD5 hash using the provided data and salt.
- Args:
datastr (str): The data to be hashed. sha_salt (str): The salt to be used in the MD5 hashing.
- Returns:
str: The hexadecimal digest of the MD5 hash.
- datahash.randomize_hash(hash_string: str, salt_length: int = 16) str
Randomizes a hash by using hash256 with a random salt.
- Args:
hash_string (str): The original hash or string to randomize. salt_length (int): The length of the random salt in bytes (default is 16).
- Returns:
str: A new randomized hash derived from the original using hash256.
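A sketch of salted hashing and anonymisation with the functions above; the salt and record values are placeholders and the import path is an assumption.

# Import path is an assumption -- adjust as needed.
from uainepydat import datahash

salt = 'example-salt'
record_id = 'customer-0042'

sha = datahash.hash256(record_id, salt)            # salted SHA-256 hex digest
md5 = datahash.hashmd5(record_id, salt)            # salted MD5 hex digest (legacy use only)
mac = datahash.hashhmac(record_id, salt.encode())  # HMAC (default SHA-256) hex digest; salt passed as bytes

# Re-salt an existing hash with a fresh random salt for anonymisation.
anon = datahash.randomize_hash(sha, salt_length=16)
print(sha, md5, mac, anon, sep='\n')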
dataio
- dataio.csv_to_parquet(input_file: str, separator: str = ',', output_file: str | None = None) None
Converts a CSV file to a Parquet file using Polars in streaming mode.
- Parameters:
input_file (str): The path to the CSV file.
separator (str): The delimiter used in the CSV (default: comma).
- output_file (Optional[str]): The path for the output Parquet file.
If not provided, it defaults to the same prefix as input_file with a .parquet extension.
- dataio.df_memory_usage(df: DataFrame) float
Calculate the total memory usage of a DataFrame with deep=False.
Parameters: df (pd.DataFrame): The DataFrame whose memory usage is to be calculated.
Returns: float: The total memory usage of the DataFrame in bytes.
- dataio.read_flat_df(filepath: str) DataFrame
Read a flat file into a DataFrame.
- Args:
filepath (str): The path to the flat file.
- Returns:
pd.DataFrame: The DataFrame read from the file.
- dataio.read_flat_psv(path: str) DataFrame
Read a pipe-separated values (PSV) file into a DataFrame.
- Args:
path (str): The path to the PSV file.
- Returns:
pd.DataFrame: The DataFrame read from the PSV file.
- dataio.read_ini_file(file_path: str) dict
Read an INI file and return its contents as a dictionary.
- Args:
file_path (str): The path to the INI file.
- Returns:
dict: A dictionary containing the key-value pairs from the INI file.
- dataio.read_json_file(filepath: str, orient: str = 'records', normalize: bool = False, record_path: str | None = None, meta: list | None = None, encoding: str = 'utf-8') DataFrame
Read a JSON file into a DataFrame.
- Args:
filepath (str): The path to the JSON file.
orient (str): The format of the JSON structure. Default is ‘records’.
normalize (bool): Whether to normalize nested JSON data. Default is False.
record_path (str or list): Path to the records in nested JSON. Default is None.
meta (list): Fields to use as metadata for each record. Default is None.
encoding (str): The file encoding. Default is ‘utf-8’.
- Returns:
pd.DataFrame: The DataFrame read from the JSON file.
- dataio.read_sas_colnames(filepath: str, encoding: str = 'latin-1') list
Read SAS file column names.
- Args:
filepath (str): The path to the SAS file. encoding (str): The encoding to use for reading the SAS file. Default is “latin-1”.
- Returns:
list: A list of column names from the SAS file.
- dataio.read_sas_metadata(filepath: str, encoding: str = 'latin-1') dict
Read SAS file metadata and return names, labels, formats, and lengths of columns.
- Args:
filepath (str): The path to the SAS file. encoding (str): The encoding to use for reading the SAS file. Default is “latin-1”.
- Returns:
dict: A dictionary containing the column names, labels, formats, and lengths.
- dataio.read_xml_file(filepath: str, xpath: str = './*', attrs_only: bool = False, encoding: str = 'utf-8') DataFrame
Read an XML file into a DataFrame.
- Args:
filepath (str): The path to the XML file.
xpath (str): XPath string to parse specific nodes. Default is ./*
attrs_only (bool): Parse only the attributes, not the child elements. Default is False.
encoding (str): The file encoding. Default is ‘utf-8’.
- Returns:
pd.DataFrame: The DataFrame read from the XML file.
- dataio.select_dataset_ui(directory: str, extension: str) str
List the files with the specified extension in the given directory and prompt the user to select one.
Parameters: directory (str): The directory to search for files. extension (str): The file extension to filter by.
Returns: str: The filename of the selected dataset.
- dataio.set_globals_from_config(configpath: str) int
Sets global variables from a configuration file.
Parameters: configpath (str): Path to the configuration file.
Returns: int: The number of global variables set.
- dataio.write_flat_df(df: DataFrame, filepath: str, index: bool = False)
Write a DataFrame to a flat file in different formats.
- Args:
df (pd.DataFrame): The DataFrame to be written.
filepath (str): The path where the file will be saved.
index (bool): Whether to write row names (index). Default is False.
- Returns:
None
- dataio.write_json_file(df: DataFrame, filepath: str, orient: str = 'records', index: bool = False, indent: int = 4)
Write a DataFrame to a JSON file.
- Args:
df (pd.DataFrame): The DataFrame to be written.
filepath (str): The path where the JSON file will be saved.
orient (str): The format of the JSON structure. Default is ‘records’.
index (bool): Whether to include the index in the JSON. Default is False.
indent (int): The indentation level for the JSON file. Default is 4.
- Returns:
None
- dataio.write_xml_file(df: DataFrame, filepath: str, index: bool = False, root_name: str = 'data', row_name: str = 'row', attr_cols: list | None = None)
Write a DataFrame to an XML file.
- Args:
df (pd.DataFrame): The DataFrame to be written.
filepath (str): The path where the XML file will be saved.
index (bool): Whether to include the index in the XML. Default is False.
root_name (str): The name of the root element. Default is ‘data’.
row_name (str): The name of each row element. Default is ‘row’.
attr_cols (list): List of columns to write as attributes, not elements. Default is None.
- Returns:
None
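A sketch of a round trip through the readers and writers above; the file names are placeholders and the import path is an assumption.

# Import path and file names are placeholders/assumptions.
from uainepydat import dataio

orders = dataio.read_flat_psv('orders.psv')                      # pipe-separated input
users = dataio.read_json_file('users.json', orient='records')    # JSON input

dataio.write_flat_df(orders, 'orders_clean.csv', index=False)
dataio.write_json_file(users, 'users_out.json', orient='records', indent=2)
dataio.write_xml_file(users, 'users_out.xml', root_name='users', row_name='user')

# Convert a large CSV to Parquet via Polars streaming (writes big_export.parquet).
dataio.csv_to_parquet('big_export.csv', separator=',')

print(dataio.df_memory_usage(orders), 'bytes')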
datatransform
- datatransform.add_prefix(string: str, prefix: str) str
Add the specified prefix to the string.
Parameters: string (str): The original string. prefix (str): The prefix to add to the string.
Returns: str: The string with the prefix added.
- datatransform.add_suffix(string: str, suffix: str) str
Add the specified suffix to the string.
Parameters: string (str): The original string. suffix (str): The suffix to add to the string.
Returns: str: The string with the suffix added.
- datatransform.break_into_lines(string: str) list[str]
Breaks a string into a list of lines.
- Args:
string (str): The input string to be broken into lines.
- Returns:
list[str]: A list of lines from the input string.
- datatransform.dataframe_to_json(df: DataFrame, orient: str = 'records', date_format: str = 'iso', indent: int | None = None) str
Convert a DataFrame to a JSON string with various orientation options.
Parameters:
- df: pd.DataFrame
The DataFrame to convert to JSON
- orient: str, default ‘records’
The JSON string orientation. See json_to_dataframe for options.
- date_format: str, default ‘iso’
Format for dates in the resulting JSON:
‘epoch’: Use Unix epoch (seconds since 1970-01-01)
‘iso’: ISO 8601 formatted dates
- indent: int, default None
Indentation level for the resulting JSON string. None = no indentation.
Returns:
- str
JSON string representation of the DataFrame
- datatransform.dataframe_to_xml(df: DataFrame, root_name: str = 'data', row_name: str = 'row') str
Convert a DataFrame to an XML string.
Parameters:
- df: pd.DataFrame
The DataFrame to convert to XML
- root_name: str, default ‘data’
The name of the root XML element
- row_name: str, default ‘row’
The name of each row element
Returns:
- str
XML string representation of the DataFrame
- datatransform.json_extract_subtree(json_data, path: str) any
Extract a subtree from a JSON object using a dot-notation path.
Parameters:
- json_data: dict or list
The JSON data to extract from
- path: str
Path to the subtree using dot notation (e.g., ‘person.address.city’). Use array indices like ‘results.0.name’ to access list elements.
Returns:
- any
The subtree at the specified path, or None if path doesn’t exist
Examples:
>>> data = {'person': {'name': 'John', 'addresses': [{'city': 'New York'}, {'city': 'Boston'}]}}
>>> json_extract_subtree(data, 'person.addresses.0.city')
'New York'
- datatransform.json_to_dataframe(json_data, orient='records', normalize=False, record_path=None, meta=None, encoding='utf-8')
Convert JSON data into a pandas DataFrame.
Parameters:
- json_data: str, dict, list, or path to file
The JSON data to convert. Can be:
A string containing JSON data
A Python dict or list containing JSON data
A file path to a JSON file
- orient: str, default ‘records’
The JSON string orientation. Allowed values:
‘records’: list-like [{column -> value}, … ]
‘split’: dict-like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
‘index’: dict-like {index -> {column -> value}}
‘columns’: dict-like {column -> {index -> value}}
‘values’: just the values array
- normalize: bool, default False
Whether to normalize semi-structured JSON data into a flat table
- record_path: str or list of str, default None
Path in each object to list of records. If not passed, data will be assumed to be an array of records.
- meta: list of str, default None
Fields to use as metadata for each record in resulting DataFrame
- encoding: str, default ‘utf-8’
Encoding to use when reading JSON from a file
Returns:
- pd.DataFrame
The converted DataFrame
Examples:
# From a JSON string
>>> json_str = '{"name": "John", "age": 30, "city": "New York"}'
>>> df = json_to_dataframe(json_str)
# From a file
>>> df = json_to_dataframe('data.json')
# With nested data
>>> json_str = '{"users": [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]}'
>>> df = json_to_dataframe(json_str, record_path='users')
- datatransform.merge_dataframes(df_list: list[DataFrame]) DataFrame
Merges a list of DataFrames into a single DataFrame, aligning columns by name. Missing columns will be filled with NaN.
- datatransform.merge_json_objects(json1: dict, json2: dict, merge_lists: bool = False) dict
Merge two JSON objects, with the second one taking precedence for overlapping keys.
Parameters:
- json1: dict
First JSON object (base)
- json2: dict
Second JSON object (takes precedence when keys overlap)
- merge_lists: bool, default False
If True, merge list items; if False, replace lists entirely
Returns:
- dict
Merged JSON object
- datatransform.xml_to_dataframe(xml_data, xpath: str = './*') DataFrame
Convert XML data to a pandas DataFrame.
Parameters:
- xml_data: str, file-like object, or path
The XML data to convert. Can be:
A string containing XML data
A file path to an XML file
A file-like object containing XML data
- xpath: str, default ./*
XPath string to parse specific nodes
Returns:
- pd.DataFrame
The DataFrame representation of the XML data
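A sketch tying together the JSON helpers above; the import path is an assumption and the data is illustrative.

import pandas as pd
# Import path is an assumption -- adjust as needed.
from uainepydat import datatransform

json_str = '{"users": [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]}'

# Nested JSON -> DataFrame, and back to a JSON string.
df = datatransform.json_to_dataframe(json_str, record_path='users')
out = datatransform.dataframe_to_json(df, orient='records', indent=2)

# Pull a value out of a nested structure with dot notation.
data = {'person': {'addresses': [{'city': 'New York'}, {'city': 'Boston'}]}}
city = datatransform.json_extract_subtree(data, 'person.addresses.1.city')   # 'Boston'

# Combine frames with different columns; missing columns are filled with NaN.
merged = datatransform.merge_dataframes([df, pd.DataFrame({'name': ['Ada'], 'role': ['admin']})])
print(out, city, merged, sep='\n')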
duckfunc
- duckfunc.does_table_exist(db_con, dbname: str, tablename: str) bool
Check if a table exists in the specified database.
- Args:
db_con: The database connection object.
dbname (str): The name of the database.
tablename (str): The name of the table.
- Returns:
bool: True if the table exists, False otherwise.
- duckfunc.getCurrentTimeForDuck(timezone_included: bool = False) str
Get the current time formatted for DuckDB, optionally including the timezone.
- Args:
timezone_included (bool): If True, includes the timezone in the returned string.
- Returns:
str: The current time formatted as ‘YYYY-MM-DD HH:MM:SS’ (with optional timezone).
- duckfunc.getDuckVersion(con) str
Get the connected DuckDB version.
- Args:
con: The database connection object.
- Returns:
str: The version of the connected DuckDB instance.
- duckfunc.get_attached_dbs(db_con) DataFrame
Get the list of attached databases.
- Args:
db_con: The database connection object.
- Returns:
DataFrame: A DataFrame containing the database name, path, and type.
- duckfunc.get_inventory(db_con) DataFrame
Get the inventory of tables.
- Args:
db_con: The database connection object.
- Returns:
DataFrame: A DataFrame containing all tables.
- duckfunc.get_table_as_df(con, db_name: str, table_name: str) DataFrame
Query a table from the specified database and return it as a pandas DataFrame.
- Args:
con: Database connection object
db_name (str): Name of the database
table_name (str): Name of the table
- Returns:
DataFrame: The table contents as a pandas DataFrame, or None if the table doesn’t exist
- duckfunc.init_table(con, frame: DataFrame, db: str, tablename: str) bool
Initialize a table in the specified database.
- Args:
con: The database connection object.
frame (DataFrame): A DataFrame containing columns VARNAME and TYPE, which should be DuckDB-compatible.
db (str): The name of the database.
tablename (str): The name of the table.
- Returns:
bool: True if the table was created, False if it already exists.
- duckfunc.save_from_db(con, db_name: str, table_name: str, output_path: str) bool
Query a table from the specified database and save it to the given output path. The output format is determined from the file extension of the output path.
- Args:
con: Database connection object
db_name (str): Name of the database
table_name (str): Name of the table
output_path (str): Path to save the output file (extension determines format)
- Returns:
bool: True if the table existed and was saved, False otherwise
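A sketch of the DuckDB helpers above. The import path, database file, database name, and table schema are placeholders; the documented VARNAME/TYPE schema frame is assumed for init_table.

import duckdb
import pandas as pd
# Import path is an assumption -- adjust as needed.
from uainepydat import duckfunc

con = duckdb.connect('analytics.duckdb')
print(duckfunc.getDuckVersion(con))
print(duckfunc.get_attached_dbs(con))          # name, path and type of attached databases

# Create a table from a VARNAME/TYPE schema frame if it doesn't exist yet.
schema = pd.DataFrame({'VARNAME': ['id', 'amount'], 'TYPE': ['INTEGER', 'DOUBLE']})
duckfunc.init_table(con, schema, 'analytics', 'orders')

if duckfunc.does_table_exist(con, 'analytics', 'orders'):
    orders = duckfunc.get_table_as_df(con, 'analytics', 'orders')
    duckfunc.save_from_db(con, 'analytics', 'orders', 'orders_backup.parquet')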
fileio
- fileio.addsyspath(directory: str)
Add the specified directory to the system path if it is not already included.
- Parameters:
directory (str): The directory to be added to the system path.
- Returns:
None
- fileio.calculate_checksums(dir_path)
Calculate MD5 checksums for all files in the specified directory.
- Parameters:
dir_path (str) – Path to the directory containing files.
- Returns:
Dictionary mapping file paths to their MD5 checksum.
- Return type:
dict
- fileio.create_filepath_dirs(path: str)
Creates all directories needed for a given file path.
If the path contains folders, this function creates all necessary directories in the path if they don’t already exist.
- Parameters:
path (str): The file path for which to create directories.
- Returns:
None
- fileio.download_file_from_url(url: str, save_path: str)
Downloads a file from the given URL and saves it to the specified path.
- Args:
url (str): The URL of the file to download. save_path (str): The file path where the downloaded file will be saved.
- Returns:
None
- fileio.gen_random_subfolder(master_dir: str) str
Generates a random subfolder within the specified master directory.
- Args:
master_dir (str): The path to the master directory where the subfolder will be created.
- Returns:
str: The path to the newly created subfolder.
- fileio.get_file_extension(filepath: str) str
Get the file extension of the given file path.
- Parameters:
filepath – The path of the file.
- Returns:
The file extension of the file.
- fileio.list_dirs(main_dir: str) list
List all directories within the specified main directory.
- Args:
main_dir (str): The main directory path to list directories from.
- Returns:
list: A list of directory names within the specified main directory.
- fileio.list_files_of_extension(directory: str, extn: str) list[str]
List all files in the specified directory with the given extension.
- Parameters:
directory – The directory to search in.
extn – The file extension to filter by.
- Returns:
A list of file paths with the specified extension.
- fileio.mv_file(src: str, dest: str)
Moves a file from the source path to the destination path using shutil.
- Parameters:
src (str): The path of the file to be moved. dest (str): The destination path where the file should be moved.
- Returns:
None
- fileio.read_file_to_bytes(file_path: str) bytes
Read the string content from the specified file and convert it to bytes using UTF-8 encoding.
- Parameters:
file_path (str): The path to the file.
- Returns:
bytes: The content of the file as bytes.
- fileio.read_file_to_string(file_path: str) str
Read the string content from the specified file.
- Parameters:
file_path (str): The path to the file.
- Returns:
str: The content of the file as a string.
- fileio.remove_directory(dir_path: str) bool
Removes a directory at the specified path.
Attempts to remove the directory and prints the result. If an error occurs during removal, the exception is caught and an error message is printed.
- Parameters:
dir_path (str): The path to the directory to be removed.
- Returns:
bool: True if directory was successfully removed, False otherwise.
- Raises:
No exceptions are raised as they are caught and printed internally.
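A sketch of common fileio operations; the URL and paths are placeholders and the import path is an assumption.

# Import path, URL and paths below are placeholders/assumptions.
from uainepydat import fileio

fileio.create_filepath_dirs('downloads/reports/summary.csv')
fileio.download_file_from_url('https://example.com/summary.csv',
                              'downloads/reports/summary.csv')

csv_files = fileio.list_files_of_extension('downloads/reports', 'csv')
checksums = fileio.calculate_checksums('downloads/reports')
print(csv_files, checksums, sep='\n')

fileio.create_filepath_dirs('archive/summary.csv')
fileio.mv_file('downloads/reports/summary.csv', 'archive/summary.csv')
removed = fileio.remove_directory('downloads/reports')   # True on success, False otherwise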
systeminfo
- systeminfo.free_gb_in_drive(drive: str) float
Calculate the free space in a specified drive in gigabytes (GB).
- Parameters:
drive (str): The drive to check the free space of.
- Returns:
float: The free space in gigabytes (GB).
- systeminfo.gather_free_space_in_drive(drive: str) float
Gather the free space in a specified drive.
- Parameters:
drive (str): The drive to check the free space of. If the drive is a single letter, it is assumed to be a Windows drive.
- Returns:
float: The free space in bytes.
- systeminfo.get_battery_info() dict
Get information about the system battery.
- Returns:
- dict: Dictionary containing battery percentage, time left, and power plugged status.
Returns None if no battery is present.
- systeminfo.get_cpu_usage_percent() float
Get the current CPU usage as a percentage.
- Returns:
float: Current CPU usage percentage.
- systeminfo.get_formatted_uptime() str
Get the system uptime formatted as days, hours, minutes, seconds.
- Returns:
str: Formatted uptime string.
- systeminfo.get_free_ram() int
Get the amount of free RAM available in bytes.
- Returns:
int: The amount of free RAM in bytes.
- systeminfo.get_free_ram_in_gb() float
Get the amount of free RAM on the system in gigabytes.
This function uses the psutil library to retrieve the amount of free RAM and converts it from bytes to gigabytes.
- Returns:
float: The amount of free RAM in gigabytes.
- systeminfo.get_installed_ram_gb() int
Get the total amount of installed RAM in gigabytes (GB).
- Returns:
int: The total amount of installed RAM in gigabytes (GB).
- systeminfo.get_largest_drive() dict[str, any]
Identifies and returns information about the drive with the most free space.
The function finds the drive with the maximum available free space and returns its information, with the drive name reduced to letters only.
- Returns:
dict[str, any]: Dictionary containing information about the drive with the most free space, with the drive name containing only letters.
- systeminfo.get_network_stats() DataFrame
Get statistics for all network interfaces.
- Returns:
pd.DataFrame: DataFrame with network interface statistics.
- systeminfo.get_number_virtual_cores() int
Get the number of virtual (logical) CPU cores including hyperthreads.
- Returns:
int: The number of virtual CPU cores.
- systeminfo.get_per_cpu_usage_percent() list[float]
Get CPU usage percentage for each individual CPU core.
- Returns:
list[float]: List of CPU usage percentages for each core.
- systeminfo.get_physical_cores() int
Get the number of physical CPU cores.
- Returns:
int: The number of physical CPU cores.
- systeminfo.get_system_info() dict
Get general system information.
- Returns:
dict: Dictionary containing OS, hostname, and platform information.
- systeminfo.get_system_uptime() float
Get the system uptime in seconds.
- Returns:
float: System uptime in seconds.
- systeminfo.get_top_processes(n=5) DataFrame
Get the top n processes by memory usage.
- Parameters:
n (int): Number of processes to return. Default is 5.
- Returns:
pd.DataFrame: DataFrame with top processes information.
- systeminfo.list_drive_spaces() DataFrame
List all available drives and their free space in gigabytes (GB).
- Returns:
pd.DataFrame: A DataFrame with the drive names and their free space in gigabytes (GB).
- systeminfo.list_drives() list[str]
List all available drives on the system.
- Returns:
list[str]: A list of device names for all available drives.
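A sketch that prints a quick snapshot of the machine using the functions above; the import path is an assumption.

# Import path is an assumption -- adjust as needed.
from uainepydat import systeminfo

print(systeminfo.get_system_info())                 # OS, hostname, platform
print(systeminfo.get_formatted_uptime())            # days, hours, minutes, seconds
print(systeminfo.get_cpu_usage_percent(), '% CPU')
print(systeminfo.get_free_ram_in_gb(), 'GB free RAM')
print(systeminfo.get_physical_cores(), 'physical /',
      systeminfo.get_number_virtual_cores(), 'logical cores')

# Tabular views returned as pandas DataFrames.
print(systeminfo.list_drive_spaces())
print(systeminfo.get_top_processes(n=3))
print(systeminfo.get_largest_drive())               # drive with the most free space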
Release History
Version 1.5.3
Blob helper module now contains a function for checking if a SAS token works for a container
Version 1.5.2
Blob helper module now contains a checksum function for listed blobs
Version 1.5.1
Fixed the makedirs argument handling in the blob helper module's download function
Added a new chunk-based function for downloading blobs
Version 1.5
Enhanced blob helper module with new functions:
list_blob_content: Lists all blobs in a specified container or subfolder
download_all_blobs: Downloads all blobs from a container or subfolder to a specified local directory
get_account_url: Generates Azure Storage account URL from storage account name
Version 1.4
Added new function to datatransform module:
merge_dataframes: Merges a list of pandas DataFrames into a single DataFrame, aligning columns by name and filling missing columns with NaN.
Version 1.3
Added new data conversion functionality in dataio module:
csv_to_parquet: Converts CSV files to Parquet format using Polars in streaming mode with support for custom separators and output file naming
Enhanced hashing capabilities with new functions in datahash module:
hashmd5: Creates an MD5 hash using provided data and salt
Added new Azure blob module (blobhelper) for path operations:
get_blob_container_path: Returns the absolute path of a blob container
get_blob_subfolder_path: Returns the absolute path of a blob subfolder within a container
Version 1.2.1
Added new function to fileio module:
calculate_checksums: Calculates MD5 checksums for all files in a specified directory, returning a dictionary mapping file paths to their checksums.
Methods that return no value are no longer displayed in the function signature
Fix applied for move file function to be platform independent
Version 1.2
Enhanced datatransform module with:
json_to_dataframe: Converts JSON data into pandas DataFrames with support for:
Multiple input types (JSON strings, Python dicts/lists, file paths)
Custom orientation options for structured data
Normalization of nested JSON structures
Handling of both single objects and arrays of records
Customizable encoding for file reading
Fixed systeminfo module with:
Reworked drive listing to continue functioning when some drives are inaccessible
Improved reliability of get_largest_drive() function
Improved fileio module with:
Enhanced remove_directory() to return boolean success/failure status instead of only printing messages
Added proper type hints and improved documentation for key functions
Version 1.1
Added new generic functions for the DuckDB module to output tables to files
Expanded the systeminfo module to report:
CPU usage
System uptime
System information
Network stats
Top processes
Enhanced dataclean module with:
keep_only_letters: Filters strings to keep only alphabetical characters
keep_alphanumeric: Filters strings to keep only alphanumeric characters
normalize_text: Converts text to lowercase and removes accents
remove_empty_rows: Removes rows where all values are empty or NaN
convert_to_numeric: Converts specified columns to numeric type
check_column_completeness: Calculates the percentage of non-missing values for each column
Added new datagen module for synthetic data generation:
gen_string_column: Generates random string data with patterns and customization
gen_numeric_column: Generates numeric data with various distributions
gen_date_column: Generates date data with customizable ranges and formats
gen_categorical_column: Generates categorical data with optional weighted distributions
gen_bool_column: Generates boolean data with custom labels
gen_dataframe: Creates complete dataframes with customizable columns
gen_sample_dataframe: Generates ready-to-use test dataframes with common column types
Version 1.0.2
Added Bitgen Features:
generate_random_uuid: Generates a random UUID (Universally Unique Identifier) using UUID version 4, with dashes removed.
generate_custom_uuid: Generates a random UUID using UUID version 4, formatted with a specified marker. The default marker is a dash (‘-‘), but it can be customized.
Version 1.0.1
First full release of package