fluid/stable/: fluid-ml-training-0.1.0 metadata and description

Repo to manage training data and create training jobs.

| Field | Value |
|---|---|
| author | Ameya Kirtane |
| author_email | ameya.kirtane@fluidanalytics.ai |
| description_content_type | text/markdown |
| metadata_version | 2.4 |
| requires_python | >=3.10 |
# Fluid ML Training
Repository for managing versioned training datasets stored in S3 and creating training jobs.
## Overview

This repository provides tools for managing versioned datasets in S3, including:

- `VersionedDatasetManager`: A Python class for managing versioned datasets with frames, annotations, and manifests in S3
- `DatasetReviewer`: An interactive tool for reviewing and editing COCO-format annotations with bounding boxes
## VersionedDatasetManager

The `VersionedDatasetManager` class provides a comprehensive interface for managing versioned datasets stored in Amazon S3. It handles frames (as zip files), annotations (JSON files), and dataset manifests that track version history.
### Features
- Version Management: Automatically track and increment dataset versions
- S3 Integration: Seamlessly upload and download datasets from S3
- Manifest Tracking: Maintain JSON manifests that track all dataset versions and metadata
- Path Management: Automatic S3 path construction following a consistent structure
### S3 Structure

Datasets are organized in S3 with the following structure:

```text
{prefix}/
├── RAW-DATA/
│   └── frames/
│       └── dataset_id={dataset_id}/
│           └── frames.zip
├── annotations/
│   └── dataset_id={dataset_id}/
│       └── version={version}/
│           └── annotations.json
└── manifests/
    └── dataset_id={dataset_id}/
        └── dataset_manifest.json
```
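The layout above amounts to pure key construction. The helper below is a hypothetical sketch that only mirrors the documented tree (the class itself exposes `get_manifest_path` and related helpers for this):

```python
def s3_keys(prefix: str, dataset_id: str, version: str) -> dict[str, str]:
    """Hypothetical helper mirroring the documented S3 layout."""
    base = f"{prefix}/" if prefix else ""
    return {
        "frames": f"{base}RAW-DATA/frames/dataset_id={dataset_id}/frames.zip",
        "annotations": f"{base}annotations/dataset_id={dataset_id}/version={version}/annotations.json",
        "manifest": f"{base}manifests/dataset_id={dataset_id}/dataset_manifest.json",
    }


keys = s3_keys("datasets", "project1", "1.0")
print(keys["frames"])  # datasets/RAW-DATA/frames/dataset_id=project1/frames.zip
```

Note that an empty `prefix` drops the leading path segment entirely rather than producing a key that starts with `/`.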
### Installation

```bash
pip install -e .
```
### Basic Usage

```python
import boto3

from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Initialize the manager
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(
    bucket="my-bucket",
    s3_client=s3_client,
    prefix="datasets"  # Optional S3 prefix
)

# Create a new dataset
manager.create_dataset(
    dataset_id="project1",
    dataset_name="Project 1 Dataset",
    frames_dir="./local_frames",
    annotation_file="./annotations.json",
    annotation_type="COCO"
)

# Update an existing dataset (creates a new version)
manager.update_dataset(
    dataset_id="project1",
    frames_dir="./updated_frames",
    annotation_file="./updated_annotations.json",
    annotation_type="COCO"
)

# Download a dataset
manager.download_dataset(
    dataset_id="project1",
    output_dir="./downloaded_data",
    version="2.0"  # Optional: defaults to latest
)

# List all datasets
all_datasets = manager.get_datasets()
for dataset_id, manifest in all_datasets.items():
    print(f"{dataset_id}: {manifest.dataset_name}")
    print(f"  Versions: {list(manifest.versions.keys())}")
```
## API Reference

### Initialization

```python
VersionedDatasetManager(bucket: str, s3_client, prefix: str = "")
```

- `bucket`: Name of the S3 bucket where datasets are stored
- `s3_client`: Boto3 S3 client instance
- `prefix`: Optional S3 prefix/path to prepend to all dataset paths
### Creating Datasets

```python
create_dataset(dataset_id, dataset_name, frames_dir, annotation_file, annotation_type="COCO")
```

Creates a new dataset with version 1.0. Uploads frames (as a zip), uploads annotations, and creates the initial manifest.

```python
update_dataset(dataset_id, frames_dir, annotation_file, annotation_type="COCO")
```

Updates an existing dataset by creating a new version. Automatically increments the version number by 1.0.

```python
create_or_update_dataset(dataset_id, frames_dir, annotation_file, annotation_type="COCO", dataset_name="")
```

Automatically determines whether to create or update a dataset. If the dataset doesn't exist, creates it with version 1.0. If it exists, creates a new version.
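The create-or-update decision and the "increment by 1.0" rule can be sketched as plain version arithmetic. `next_version` below is a hypothetical illustration of the documented behavior, not the library's implementation:

```python
def next_version(existing_versions: list[str]) -> str:
    """Sketch of the documented rule: new datasets start at 1.0, updates add 1.0."""
    if not existing_versions:
        # No versions yet: the is_new_dataset case, handled by create_dataset
        return "1.0"
    # Versions exist: the update_dataset case, incrementing the numeric maximum
    latest = max(float(v) for v in existing_versions)
    return f"{latest + 1.0:.1f}"


print(next_version([]))              # 1.0
print(next_version(["1.0", "2.0"]))  # 3.0
```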
### Downloading Datasets

```python
download_dataset(dataset_id, output_dir, version="")
```

Downloads a dataset version from S3 to a local directory. Downloads:

- `dataset_manifest.json` - The dataset manifest
- `annotations.json` - Annotations for the specified version
- `data/frames.zip` - The frames zip file

If `version` is empty, downloads the latest version.
### Querying Datasets

```python
get_datasets() -> dict[str, DatasetManifest]
```

Returns a dictionary mapping `dataset_id` to `DatasetManifest` objects for all datasets in S3.

```python
get_latest_version(dataset_id) -> str
```

Returns the latest version string for a dataset by comparing numeric version values.
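Comparing numeric values matters once versions pass 9.0, where plain string ordering goes wrong. A minimal sketch of the idea (the method's actual implementation may differ):

```python
versions = ["1.0", "9.0", "10.0"]

# Lexicographic string comparison would wrongly pick "9.0" ...
assert max(versions) == "9.0"

# ... while comparing numeric values picks the true latest version.
assert max(versions, key=float) == "10.0"
```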
```python
version_exists(dataset_id, version) -> bool
```

Checks if a specific version exists for a dataset.

```python
is_new_dataset(dataset_id) -> bool
```

Checks if a dataset is new (doesn't exist in S3 yet).
### Path Helpers

```python
get_manifest_path(dataset_id, allow_new=False) -> str
```

Returns the S3 key path for a dataset's manifest file.

```python
get_frames_path(dataset_id, allow_new=False) -> str
```

Returns the S3 key path for a dataset's frames zip file.

```python
get_annotations_path(dataset_id, version="", allow_new=False) -> str
```

Returns the S3 key path for a dataset's annotation file for a specific version.
## Dataset Manifest Structure

The dataset manifest is a JSON file that tracks dataset metadata and version history:
```json
{
  "dataset_id": "project1",
  "dataset_name": "Project 1 Dataset",
  "frames_s3_paths": "RAW-DATA/frames/dataset_id=project1/frames.zip",
  "versions": {
    "1.0": {
      "version": "1.0",
      "annotation_type": "COCO",
      "annotation_file": "annotations/dataset_id=project1/version=1.0/annotations.json",
      "num_frames": 1000,
      "num_clips": 0,
      "num_videos": 1
    },
    "2.0": {
      "version": "2.0",
      "annotation_type": "COCO",
      "annotation_file": "annotations/dataset_id=project1/version=2.0/annotations.json",
      "num_frames": 1200,
      "num_clips": 0,
      "num_videos": 1
    }
  }
}
```
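For illustration, a manifest of this shape can be loaded into typed objects. The package itself models this with pydantic (`DatasetManifest`); the stdlib dataclass sketch below is hypothetical, with field names taken directly from the JSON above:

```python
import json
from dataclasses import dataclass


@dataclass
class VersionInfo:
    version: str
    annotation_type: str
    annotation_file: str
    num_frames: int
    num_clips: int
    num_videos: int


@dataclass
class Manifest:
    dataset_id: str
    dataset_name: str
    frames_s3_paths: str
    versions: dict[str, VersionInfo]

    @classmethod
    def from_json(cls, raw: str) -> "Manifest":
        data = json.loads(raw)
        # Promote each version entry from a plain dict to a VersionInfo
        data["versions"] = {k: VersionInfo(**v) for k, v in data["versions"].items()}
        return cls(**data)


raw = json.dumps({
    "dataset_id": "project1",
    "dataset_name": "Project 1 Dataset",
    "frames_s3_paths": "RAW-DATA/frames/dataset_id=project1/frames.zip",
    "versions": {
        "1.0": {
            "version": "1.0",
            "annotation_type": "COCO",
            "annotation_file": "annotations/dataset_id=project1/version=1.0/annotations.json",
            "num_frames": 1000,
            "num_clips": 0,
            "num_videos": 1,
        }
    },
})
m = Manifest.from_json(raw)
print(m.versions["1.0"].num_frames)  # 1000
```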
## Example Workflow

```python
import boto3

from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Setup
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(bucket="my-training-data", s3_client=s3_client)

# Step 1: Create the initial dataset
manager.create_dataset(
    dataset_id="pipe-inspection-v1",
    dataset_name="Pipe Inspection Dataset v1",
    frames_dir="./data/frames",
    annotation_file="./data/annotations.json",
    annotation_type="COCO"
)

# Step 2: Later, update with new annotations
manager.update_dataset(
    dataset_id="pipe-inspection-v1",
    frames_dir="./data/frames",  # Can be the same or updated
    annotation_file="./data/annotations_v2.json",
    annotation_type="COCO"
)

# Step 3: Download a specific version for training
manager.download_dataset(
    dataset_id="pipe-inspection-v1",
    output_dir="./training_data",
    version="2.0"
)

# Step 4: Check what versions are available
manifest = manager._load_manifest("pipe-inspection-v1")
print(f"Available versions: {list(manifest.versions.keys())}")
latest = manager.get_latest_version("pipe-inspection-v1")
print(f"Latest version: {latest}")
```
## DatasetReviewer

The `DatasetReviewer` is an interactive tool for reviewing and editing COCO-format annotations with bounding boxes. It provides a matplotlib-based UI for visualizing images with their annotations and allows interactive editing of annotation labels and attributes.
### Features
- Interactive Visualization: View images with bounding box annotations overlaid
- Keyboard Navigation: Navigate through images using arrow keys
- Annotation Editing: Click on bounding boxes to edit category labels and attributes
- COCO Format Support: Works with standard COCO annotation format
### Usage

```python
from dataset_reviewer import DatasetReviewer

# Initialize the reviewer
reviewer = DatasetReviewer(
    dataset_dir="./downloaded_data",
    frames_zip_path=None,  # Optional: defaults to {dataset_dir}/frames.zip
    annotations_file_path=None  # Optional: defaults to {dataset_dir}/annotations.json
)

# Start an interactive review session
reviewer.review()
```
### Keyboard Controls

- Right Arrow or `n`: Next image
- Left Arrow or `p`: Previous image
- `s`: Save annotations to JSON file
- `q` or Escape: Quit
### Mouse Controls

- Click on a bounding box: Opens an edit dialog to modify:
  - Category label
  - Attributes (`distance_from_camera`, `visibility`, `water_level`, `secondary_defect`, `clock_position`, `pipe_color`, `pipe_condition`, `difficult`, `occluded`, `rotation`)
### Example Workflow

```python
from dataset_reviewer import DatasetReviewer

# Download the dataset first using the VersionedDatasetManager
# (`manager` as set up in the examples above)
manager.download_dataset(
    dataset_id="pipe-inspection-v1",
    output_dir="./review_data"
)

# Review and edit annotations
reviewer = DatasetReviewer("./review_data")
reviewer.review()

# After editing, annotations.json is updated locally.
# You can then upload the updated annotations as a new version:
manager.update_dataset(
    dataset_id="pipe-inspection-v1",
    frames_dir="./review_data/data",  # Extract frames.zip first
    annotation_file="./review_data/annotations.json",
    annotation_type="COCO"
)
```
## Integration Example

Complete workflow combining both tools:

```python
import zipfile

import boto3

from dataset_reviewer import DatasetReviewer
from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Set up the manager
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(bucket="my-bucket", s3_client=s3_client)

# 1. Download the dataset for review
manager.download_dataset(
    dataset_id="my-dataset",
    output_dir="./review_workspace"
)

# 2. Review and edit annotations
reviewer = DatasetReviewer("./review_workspace")
reviewer.review()  # Interactive session - edit annotations as needed

# 3. Extract frames for upload (if needed)
with zipfile.ZipFile("./review_workspace/data/frames.zip", 'r') as zip_ref:
    zip_ref.extractall("./review_workspace/frames")

# 4. Upload the updated dataset as a new version
manager.update_dataset(
    dataset_id="my-dataset",
    frames_dir="./review_workspace/frames",
    annotation_file="./review_workspace/annotations.json",
    annotation_type="COCO"
)
```
## Requirements
- Python >= 3.10
- boto3 >= 1.40.64
- pydantic >= 2.12.3, < 3.0.0
For `DatasetReviewer` (currently commented out in the dependency list):
- matplotlib
- PIL (Pillow)
- numpy
- tkinter
## Configuration

### S3 Configuration

The `VersionedDatasetManager` uses standard AWS credentials. Ensure your AWS credentials are configured via one of:

- AWS credentials file (`~/.aws/credentials`)
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- IAM role (if running on EC2)
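For the environment-variable option, a minimal config fragment might look like this (placeholder values shown; the region variable is the standard AWS one, not something this package requires):

```shell
# Placeholder values - substitute real credentials, or prefer an IAM role
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
```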
### S3 Path Structure

Paths are configurable via the `prefix` parameter in `VersionedDatasetManager.__init__()`. The default structure uses:

- `RAW-DATA/frames/` for frame zip files
- `annotations/` for annotation JSON files
- `manifests/` for dataset manifest files
## Development

```bash
# Install in development mode
pip install -e .

# Run tests
pytest

# Lint code
ruff check .
```
## License
[Add your license information here]