.. Uaine.Pydat documentation master file, created by
   sphinx-quickstart on Wed Feb 12 22:51:18 2025.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Uaine.Pydat Package
=========================

This package provides comprehensive tools for data handling, processing, and database operations in Python.
It includes specialised functions for file I/O operations, data transformation, data cleaning, system information gathering, and data generation.
Key features include PSV file handling, cryptographic hash functions, data table cleaning methods, DuckDB integration with snippet queries, random data generation utilities, and XML/config file parsers.
The package aims to simplify common data manipulation tasks while providing a flexible framework for both basic and advanced data operations.

License: GPL v3

Description
=========================

A Python package that streamlines data handling, processing, and database operations through a collection of utility functions and tools.

Uaine.Pydat provides a comprehensive toolkit for data scientists, analysts, and developers working with structured data.
The package simplifies common data manipulation tasks while offering specialized functionality for file operations, data transformation, table cleaning, and database interactions.


Key features include:

* File I/O Operations for reading/writing various file formats, listing files by extension, and managing system paths
* Data Transformation tools for reshaping, converting, and manipulating data structures with minimal code
* Data Cleaning methods to sanitize, standardize, and prepare data tables for analysis
* DuckDB Integration with helper functions and snippet queries
* Cryptographic Hashing for data integrity and anonymization
* System Information utilities to gather system metrics and resource usage
* Data Generation for testing and development scenarios
* Configuration Handling for XML, INI, and other formats

Core modules include:

* dataio.py for data input/output operations and format conversion
* fileio.py for file system operations and path management
* datatransform.py for data structure transformation and manipulation
* dataclean.py for data cleaning and standardization
* duckfunc.py for DuckDB database interactions
* datahash.py for cryptographic hashing functions
* systeminfo.py for system information gathering
* datagen.py for random data generation
* bitgen.py for low-level bit generation utilities

Common use cases for this package include:

* Simplifying ETL workflows
* Streamlining data preparation for analysis and machine learning
* Managing database operations with less boilerplate code
* Generating test data for development
* Monitoring system resources during data processing tasks
* Securing sensitive data through hashing and anonymization

This package aims to reduce the complexity of common data handling tasks by providing ready-made solutions that follow best practices while remaining flexible enough to adapt to various data processing requirements.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

Dependencies
=========================

* pandas
* pyreadstat
* requests
* duckdb
* wheel
* twine
* psutil
* lxml
* polars
* azure-storage-blob
* tqdm

bitgen
=========================

.. automodule:: bitgen
   :members:

blobhelper
=========================

.. automodule:: blobhelper
   :members:

dataclean
=========================

.. automodule:: dataclean
   :members:

datagen
=========================

.. automodule:: datagen
   :members:

datahash
=========================

.. automodule:: datahash
   :members:

dataio
=========================

.. automodule:: dataio
   :members:

datatransform
=========================

.. automodule:: datatransform
   :members:

duckfunc
=========================

.. automodule:: duckfunc
   :members:

fileio
=========================

.. automodule:: fileio
   :members:

systeminfo
=========================

.. automodule:: systeminfo
   :members:

Release History
=========================

Version 1.5.3
^^^^^^^^^^^^^

* Blob helper module now contains a function for checking if a SAS token works for a container

Version 1.5.2
^^^^^^^^^^^^^

* Blob helper module now contains a checksum function for listed blobs

Version 1.5.1
^^^^^^^^^^^^^

* Blob helper module fix on makedirs argument for Downloads
* New chunking based function for downloading blobs

Version 1.5
^^^^^^^^^^^^^

* Enhanced blob helper module with new functions:

  - `list_blob_content`: Lists all blobs in a specified container or subfolder
  - `download_all_blobs`: Downloads a blob to a specified local path
  - `get_account_url`: Generates Azure Storage account URL from storage account name

Version 1.4
^^^^^^^^^^^^^

* Added new function to datatransform module:

  - `merge_dataframes`: Merges a list of pandas DataFrames into a single DataFrame, aligning columns by name and filling missing columns with NaN.

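
The behaviour described for `merge_dataframes` can be approximated with ``pandas.concat``. This is a minimal sketch, not the package's implementation: the name ``merge_dataframes_sketch`` is hypothetical, and it assumes that "merging" here means stacking rows while aligning columns by name.

.. code-block:: python

   import pandas as pd


   def merge_dataframes_sketch(frames):
       """Stack DataFrames row-wise, aligning columns by name (hypothetical helper)."""
       # pd.concat raises ValueError on an empty list, so return an empty frame instead.
       if not frames:
           return pd.DataFrame()
       # Columns missing from a given frame are filled with NaN in the result.
       return pd.concat(frames, ignore_index=True, sort=False)


   left = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
   right = pd.DataFrame({"id": [3], "score": [0.5]})
   merged = merge_dataframes_sketch([left, right])

In this sketch ``merged`` has the columns ``id``, ``name`` and ``score``; rows that came from a frame without a given column hold NaN in that column.
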

Version 1.3
^^^^^^^^^^^^^

* Added new data conversion functionality in dataio module:

  - `csv_to_parquet`: Converts CSV files to Parquet format using Polars in streaming mode with support for custom separators and output file naming

* Enhanced hashing capabilities with new functions in datahash module:

  - `hashmd5`: Creates an MD5 hash using provided data and salt

* Added new Azure blob module (blobhelper) for path operations:

  - `get_blob_container_path`: Returns the absolute path of a blob container
  - `get_blob_subfolder_path`: Returns the absolute path of a blob subfolder within a container

Version 1.2.1
^^^^^^^^^^^^^

* Added new function to fileio module:

  - `calculate_checksums`: Calculates MD5 checksums for all files in a specified directory, returning a dictionary mapping file paths to their checksums.

* Methods that return no value are no longer displayed in the function signature
* Fix applied for move file function to be platform independent

Version 1.2
^^^^^^^^^^^^^

* Enhanced datatransform module with:

  - `json_to_dataframe`: Converts JSON data into pandas DataFrames with support for:

    - Multiple input types (JSON strings, Python dicts/lists, file paths)
    - Custom orientation options for structured data
    - Normalization of nested JSON structures
    - Handling of both single objects and arrays of records
    - Customizable encoding for file reading

* Fixed systeminfo module with:

  - Reworked drive listing to continue functioning when some drives are inaccessible
  - Improved reliability of `get_largest_drive()` function

* Improved fileio module with:

  - Enhanced `remove_directory()` to return boolean success/failure status instead of only printing messages
  - Added proper type hints and improved documentation for key functions

Version 1.1
^^^^^^^^^^^^^

* Added new generic DuckDB functions for writing tables to files
* Expanded the systeminfo module to report:

  - CPU usage
  - System uptime
  - System information
  - Network stats
  - Top processes

* Enhanced dataclean module with:

  - `keep_only_letters`: Filters strings to keep only alphabetical characters
  - `keep_alphanumeric`: Filters strings to keep only alphanumeric characters
  - `normalize_text`: Converts text to lowercase and removes accents
  - `remove_empty_rows`: Removes rows where all values are empty or NaN
  - `convert_to_numeric`: Converts specified columns to numeric type
  - `check_column_completeness`: Calculates the percentage of non-missing values for each column

* Added new datagen module for synthetic data generation:

  - `gen_string_column`: Generates random string data with patterns and customization
  - `gen_numeric_column`: Generates numeric data with various distributions
  - `gen_date_column`: Generates date data with customizable ranges and formats
  - `gen_categorical_column`: Generates categorical data with optional weighted distributions
  - `gen_bool_column`: Generates boolean data with custom labels
  - `gen_dataframe`: Creates complete dataframes with customizable columns
  - `gen_sample_dataframe`: Generates ready-to-use test dataframes with common column types

Version 1.0.2
^^^^^^^^^^^^^

* Added Bitgen Features:

  - `generate_random_uuid`: Generates a random UUID (Universally Unique Identifier) using UUID version 4, with dashes removed.
  - `generate_custom_uuid`: Generates a random UUID using UUID version 4, formatted with a specified marker. The default marker is a dash ('-'), but it can be customized.

Version 1.0.1
^^^^^^^^^^^^^

* First full release of package
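
The two Bitgen features listed under Version 1.0.2 can be illustrated with the standard library ``uuid`` module. This is only a sketch under stated assumptions: the function names ``random_uuid_sketch`` and ``custom_uuid_sketch`` are hypothetical, and the ``marker`` parameter is assumed to replace the dashes of the canonical UUID string. The package's own signatures may differ.

.. code-block:: python

   import uuid


   def random_uuid_sketch():
       """Return a version 4 UUID as 32 hex characters, with dashes removed."""
       return uuid.uuid4().hex


   def custom_uuid_sketch(marker="-"):
       """Return a version 4 UUID with its dashes replaced by `marker` (assumed behaviour)."""
       return str(uuid.uuid4()).replace("-", marker)


   plain = random_uuid_sketch()
   underscored = custom_uuid_sketch("_")

Here ``plain`` contains no separators, while ``underscored`` keeps the usual five-group layout with underscores between the groups.
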