fluid/stable/: fluid-ml-training-0.1.0 metadata and description

Repo to manage training data and create training jobs

author Ameya Kirtane
author_email ameya.kirtane@fluidanalytics.ai
classifiers
  • Programming Language :: Python :: 3
  • Programming Language :: Python :: 3.10
  • Programming Language :: Python :: 3.11
  • Programming Language :: Python :: 3.12
  • Programming Language :: Python :: 3.13
  • Programming Language :: Python :: 3.14
description_content_type text/markdown
requires_dist
  • pydantic (>=2.12.3,<3.0.0)
  • boto3 (>=1.40.64,<2.0.0)
  • pycocotools (>=2.0.0,<3.0.0)
requires_python >=3.10

Because this project isn't in the mirror_whitelist, no releases from root/pypi are included.

Files

fluid_ml_training-0.1.0-py3-none-any.whl
  • Size: 13 KB
  • Type: Python Wheel (Python 3)
  • Replaced 1 time(s)
  • Uploaded to fluid/stable by fluid 2025-12-05 05:39:24

fluid_ml_training-0.1.0.tar.gz
  • Size: 13 KB
  • Type: Source
  • Replaced 1 time(s)
  • Uploaded to fluid/stable by fluid 2025-12-05 05:39:25

Fluid ML Training

Repository for managing versioned training datasets stored in S3 and creating training jobs.

Overview

This repository provides tools for managing versioned datasets in S3, built around two components: the VersionedDatasetManager for S3-backed dataset versioning, and the DatasetReviewer for interactive annotation review.

VersionedDatasetManager

The VersionedDatasetManager class provides a comprehensive interface for managing versioned datasets stored in Amazon S3. It handles frames (as zip files), annotations (JSON files), and dataset manifests that track version history.

Features

  • Versioned dataset creation and updates with automatic version numbering
  • Upload and download of frames (as zip files) and annotation files
  • Dataset manifests that track metadata and version history per dataset
  • Query and path helpers for inspecting datasets stored in S3

S3 Structure

Datasets are organized in S3 with the following structure:

{prefix}/
├── RAW-DATA/
│   └── frames/
│       └── dataset_id={dataset_id}/
│           └── frames.zip
├── annotations/
│   └── dataset_id={dataset_id}/
│       └── version={version}/
│           └── annotations.json
└── manifests/
    └── dataset_id={dataset_id}/
        └── dataset_manifest.json
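
For example, with prefix="datasets", dataset_id="project1", and version "1.0", the resulting keys would be (assuming the prefix is simply prepended):

datasets/RAW-DATA/frames/dataset_id=project1/frames.zip
datasets/annotations/dataset_id=project1/version=1.0/annotations.json
datasets/manifests/dataset_id=project1/dataset_manifest.json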

Installation

pip install -e .
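
The package is also published to the fluid/stable devpi index, so it can be installed from there; the host below is a placeholder, not a real URL:

pip install fluid-ml-training --index-url https://<devpi-host>/fluid/stable/+simple/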

Basic Usage

import boto3
from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Initialize the manager
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(
    bucket="my-bucket",
    s3_client=s3_client,
    prefix="datasets"  # Optional S3 prefix
)

# Create a new dataset
manager.create_dataset(
    dataset_id="project1",
    dataset_name="Project 1 Dataset",
    frames_dir="./local_frames",
    annotation_file="./annotations.json",
    annotation_type="COCO"
)

# Update an existing dataset (creates new version)
manager.update_dataset(
    dataset_id="project1",
    frames_dir="./updated_frames",
    annotation_file="./updated_annotations.json",
    annotation_type="COCO"
)

# Download a dataset
manager.download_dataset(
    dataset_id="project1",
    output_dir="./downloaded_data",
    version="2.0"  # Optional: defaults to latest
)

# List all datasets
all_datasets = manager.get_datasets()
for dataset_id, manifest in all_datasets.items():
    print(f"{dataset_id}: {manifest.dataset_name}")
    print(f"  Versions: {list(manifest.versions.keys())}")

API Reference

Initialization

VersionedDatasetManager(bucket: str, s3_client, prefix: str = "")

  • bucket: name of the S3 bucket that stores the datasets
  • s3_client: a boto3 S3 client used for all uploads and downloads
  • prefix: optional key prefix prepended to all dataset paths (defaults to "")

Creating Datasets

create_dataset(dataset_id, dataset_name, frames_dir, annotation_file, annotation_type="COCO")

Creates a new dataset with version 1.0. Uploads frames (as zip), annotations, and creates the initial manifest.

update_dataset(dataset_id, frames_dir, annotation_file, annotation_type="COCO")

Updates an existing dataset by creating a new version. Automatically increments the version number by 1.0.
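
Repeated updates therefore produce versions "1.0", "2.0", "3.0", and so on. A minimal sketch of the increment, assuming numeric version strings as shown in the manifest structure below:

current = manager.get_latest_version("project1")  # e.g. "2.0"
next_version = f"{float(current) + 1.0:.1f}"      # "3.0"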

create_or_update_dataset(dataset_id, frames_dir, annotation_file, annotation_type="COCO", dataset_name="")

Automatically determines whether to create or update a dataset. If the dataset doesn't exist, creates it with version 1.0. If it exists, creates a new version.
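
This is conceptually equivalent to dispatching on is_new_dataset yourself; a minimal sketch using the documented helpers (paths and IDs are illustrative):

# Create on first upload, otherwise add a new version
if manager.is_new_dataset("project1"):
    manager.create_dataset(
        dataset_id="project1",
        dataset_name="Project 1 Dataset",
        frames_dir="./local_frames",
        annotation_file="./annotations.json",
    )
else:
    manager.update_dataset(
        dataset_id="project1",
        frames_dir="./local_frames",
        annotation_file="./annotations.json",
    )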

Downloading Datasets

download_dataset(dataset_id, output_dir, version="")

Downloads a dataset version from S3 to a local directory: the frames zip file and the annotations JSON for the requested version.

If version is empty, downloads the latest version.
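
For example, omitting version fetches the latest (paths are illustrative):

manager.download_dataset(
    dataset_id="project1",
    output_dir="./data",
)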

Querying Datasets

get_datasets() -> dict[str, DatasetManifest]

Returns a dictionary mapping dataset_id to DatasetManifest objects for all datasets in S3.

get_latest_version(dataset_id) -> str

Returns the latest version string for a dataset by comparing numeric version values.
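
The numeric comparison matters once versions reach double digits; a minimal sketch of the idea, assuming version strings like those in the manifest structure below:

versions = ["1.0", "2.0", "10.0"]
latest = max(versions, key=float)  # "10.0" (a plain string sort would pick "2.0")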

version_exists(dataset_id, version) -> bool

Checks if a specific version exists for a dataset.

is_new_dataset(dataset_id) -> bool

Checks if a dataset is new (doesn't exist in S3 yet).
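
A small guard pattern combining these helpers (dataset ID and version are illustrative):

if manager.is_new_dataset("project1"):
    print("project1 has not been uploaded yet")
elif manager.version_exists("project1", "2.0"):
    manager.download_dataset(
        dataset_id="project1",
        output_dir="./data",
        version="2.0",
    )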

Path Helpers

get_manifest_path(dataset_id, allow_new=False) -> str

Returns the S3 key path for a dataset's manifest file.

get_frames_path(dataset_id, allow_new=False) -> str

Returns the S3 key path for a dataset's frames zip file.

get_annotations_path(dataset_id, version="", allow_new=False) -> str

Returns the S3 key path for a dataset's annotation file for a specific version.
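
These helpers return keys matching the S3 structure above; for example, with prefix="datasets" (the outputs shown as comments are illustrative):

print(manager.get_manifest_path("project1"))
# datasets/manifests/dataset_id=project1/dataset_manifest.json

print(manager.get_frames_path("project1"))
# datasets/RAW-DATA/frames/dataset_id=project1/frames.zip

print(manager.get_annotations_path("project1", version="1.0"))
# datasets/annotations/dataset_id=project1/version=1.0/annotations.json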

Dataset Manifest Structure

The dataset manifest is a JSON file that tracks dataset metadata and version history:

{
  "dataset_id": "project1",
  "dataset_name": "Project 1 Dataset",
  "frames_s3_paths": "RAW-DATA/frames/dataset_id=project1/frames.zip",
  "versions": {
    "1.0": {
      "version": "1.0",
      "annotation_type": "COCO",
      "annotation_file": "annotations/dataset_id=project1/version=1.0/annotations.json",
      "num_frames": 1000,
      "num_clips": 0,
      "num_videos": 1
    },
    "2.0": {
      "version": "2.0",
      "annotation_type": "COCO",
      "annotation_file": "annotations/dataset_id=project1/version=2.0/annotations.json",
      "num_frames": 1200,
      "num_clips": 0,
      "num_videos": 1
    }
  }
}
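
Since pydantic is a dependency and get_datasets() returns DatasetManifest objects, the manifest plausibly maps onto models like the following sketch; the class layout and the VersionInfo name are assumptions inferred from the JSON above:

from pydantic import BaseModel

class VersionInfo(BaseModel):  # hypothetical name for the per-version entry
    version: str
    annotation_type: str
    annotation_file: str
    num_frames: int
    num_clips: int
    num_videos: int

class DatasetManifest(BaseModel):
    dataset_id: str
    dataset_name: str
    frames_s3_paths: str
    versions: dict[str, VersionInfo]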

Example Workflow

import boto3
from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Setup
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(bucket="my-training-data", s3_client=s3_client)

# Step 1: Create initial dataset
manager.create_dataset(
    dataset_id="pipe-inspection-v1",
    dataset_name="Pipe Inspection Dataset v1",
    frames_dir="./data/frames",
    annotation_file="./data/annotations.json",
    annotation_type="COCO"
)

# Step 2: Later, update with new annotations
manager.update_dataset(
    dataset_id="pipe-inspection-v1",
    frames_dir="./data/frames",  # Can be same or updated
    annotation_file="./data/annotations_v2.json",
    annotation_type="COCO"
)

# Step 3: Download specific version for training
manager.download_dataset(
    dataset_id="pipe-inspection-v1",
    output_dir="./training_data",
    version="2.0"
)

# Step 4: Check what versions are available
manifest = manager.get_datasets()["pipe-inspection-v1"]
print(f"Available versions: {list(manifest.versions.keys())}")
latest = manager.get_latest_version("pipe-inspection-v1")
print(f"Latest version: {latest}")

DatasetReviewer

The DatasetReviewer is an interactive tool for reviewing and editing COCO-format annotations with bounding boxes. It provides a matplotlib-based UI for visualizing images with their annotations and allows interactive editing of annotation labels and attributes.

Features

  • Matplotlib-based visualization of images with their bounding-box annotations
  • Interactive editing of annotation labels and attributes
  • Keyboard and mouse controls for navigating the dataset

Usage

from dataset_reviewer import DatasetReviewer

# Initialize reviewer
reviewer = DatasetReviewer(
    dataset_dir="./downloaded_data",
    frames_zip_path=None,  # Optional: defaults to {dataset_dir}/frames.zip
    annotations_file_path=None  # Optional: defaults to {dataset_dir}/annotations.json
)

# Start interactive review session
reviewer.review()

Keyboard Controls

Mouse Controls

Example Workflow

from dataset_reviewer import DatasetReviewer

# Download dataset first using VersionedDatasetManager
manager.download_dataset(
    dataset_id="pipe-inspection-v1",
    output_dir="./review_data"
)

# Review and edit annotations
reviewer = DatasetReviewer("./review_data")
reviewer.review()

# After editing, annotations.json is updated locally
# You can then upload the updated annotations as a new version
manager.update_dataset(
    dataset_id="pipe-inspection-v1",
    frames_dir="./review_data/data",  # Extract frames.zip first
    annotation_file="./review_data/annotations.json",
    annotation_type="COCO"
)

Integration Example

Complete workflow combining both tools:

import boto3
from fluid_ml_training.dataset_manager import VersionedDatasetManager
from dataset_reviewer import DatasetReviewer

# Setup manager
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(bucket="my-bucket", s3_client=s3_client)

# 1. Download dataset for review
manager.download_dataset(
    dataset_id="my-dataset",
    output_dir="./review_workspace"
)

# 2. Review and edit annotations
reviewer = DatasetReviewer("./review_workspace")
reviewer.review()  # Interactive session - edit annotations as needed

# 3. Extract frames for upload (if needed)
import zipfile
with zipfile.ZipFile("./review_workspace/data/frames.zip", 'r') as zip_ref:
    zip_ref.extractall("./review_workspace/frames")

# 4. Upload updated dataset as new version
manager.update_dataset(
    dataset_id="my-dataset",
    frames_dir="./review_workspace/frames",
    annotation_file="./review_workspace/annotations.json",
    annotation_type="COCO"
)

Requirements

For DatasetReviewer (currently commented out):

  • matplotlib (the reviewer provides a matplotlib-based UI)

Configuration

S3 Configuration

The VersionedDatasetManager uses standard AWS credentials. Ensure your AWS credentials are configured via one of the usual boto3 mechanisms:

  • Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  • The shared credentials file (~/.aws/credentials)
  • An IAM role (for example, on EC2 or in a container)
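
For example, to build the client from a named profile instead of the default credential chain (the profile name is illustrative):

import boto3

session = boto3.Session(profile_name="ml-training")  # hypothetical profile
s3_client = session.client("s3")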

S3 Path Structure

Paths are configurable via the prefix parameter in VersionedDatasetManager.__init__(). The default structure uses:

  • RAW-DATA/frames/ for frame zip archives
  • annotations/ for versioned annotation files
  • manifests/ for dataset manifest files

Development

# Install in development mode
pip install -e .

# Run tests
pytest

# Lint code
ruff check .

License

[Add your license information here]