8000 GitHub - fnyamweya/augmentx: A powerful, flexible, and modular data augmentation tool designed to enhance the performance of machine learning models by expanding datasets in meaningful ways.
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

A powerful, flexible, and modular data augmentation tool designed to enhance the performance of machine learning models by expanding datasets in meaningful ways.

Notifications You must be signed in to change notification settings

fnyamweya/augmentx

Repository files navigation

AugmentX: Data Augmentation Made Easy

AugmentX is a powerful, flexible, and modular data augmentation tool designed to enhance the performance of machine learning models by expanding datasets in meaningful ways. It provides out-of-the-box support for tabular, image, and text data augmentation, ensuring that datasets are well-balanced, diverse, and rich for training robust models.

Built with FastAPI for the API layer, and utilizing modern data augmentation techniques and libraries, AugmentX is designed to be simple to use, yet scalable for large datasets and production environments. The application can run as a standalone service using Docker, making it portable and easy to integrate with existing machine learning workflows.

Key Features:

  • Tabular Data Augmentation:

    • SMOTE (Synthetic Minority Over-sampling Technique) to handle imbalanced datasets.
    • Noise injection to simulate variations and increase data diversity.
    • Outlier detection and handling to either amplify or remove anomalies for robust model training.
  • Image Data Augmentation:

    • Geometric transformations such as rotation, flipping, and scaling.
    • Color adjustments for brightness, contrast, and saturation.
    • Noise addition and blurring to simulate real-world scenarios.
  • Text Data Augmentation:

    • Synonym replacement using WordNet for natural variation in text data.
    • Random word insertion and deletion to test text model robustness.
    • Back-translation using pre-trained models from Hugging Face for generating diverse paraphrases.
  • REST API for Easy Integration:

    • Exposes a user-friendly API via FastAPI, allowing users to upload data and apply augmentations via simple HTTP requests.
    • Auto-generated API documentation with Swagger UI.
  • Dockerized for Portability:

    • Can be run in any environment using Docker, ensuring consistency across development, testing, and production.
    • Supports Redis and Celery for background processing and scheduling long-running augmentation tasks.
  • Customizable Pipelines:

    • Allows users to create custom augmentation pipelines, chaining multiple transformations for complex data needs.
    • Configurable via environment variables for easy adjustments in different environments.

Ideal Use Cases:

  • Machine Learning Model Training:

    • Generate more training data to avoid overfitting and improve model generalization.
  • Data Preprocessing:

    • Prepare imbalanced, noisy, or incomplete datasets for machine learning pipelines with a few augmentation techniques.
  • Natural Language Processing:

    • Enhance text datasets with semantic variations to improve NLP model robustness.
  • Computer Vision:

    • Augment image datasets for classification, object detection, or segmentation tasks with a variety of transformations.

Technologies Used:

  • FastAPI: High-performance Python framework for building the REST API.
  • Uvicorn: ASGI server for running FastAPI.
  • Pandas, Scikit-learn, Imbalanced-learn: Libraries for handling and augmenting tabular data.
  • Pillow, OpenCV, Torchvision: Libraries for image processing and augmentation.
  • NLTK, Transformers: Libraries for text augmentation, including synonym replacement and back-translation.
  • Docker: For containerizing the app and simplifying deployment.
  • Redis + Celery: For background task management.

Example Usage:

  1. Tabular Data: Apply SMOTE and noise injection to balance your dataset before training a classification model.
  2. Image Data: Randomly rotate, flip, and adjust the brightness of images in a dataset to improve the robustness of a computer vision model.
  3. Text Data: Use back-translation to create paraphrased variations of training sentences, improving the generalization of NLP models.

Getting Started:

To run AugmentX locally or in a cloud environment, simply clone the repository, configure your environment variables, and use Docker to build and deploy the application.

About

A powerful, flexible, and modular data augmentation tool designed to enhance the performance of machine learning models by expanding datasets in meaningful ways.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published
0