Metadata-Version: 2.4
Name: fluid-ml-training
Version: 0.1.0
Summary: Repo to manage training data and create training jobs
Author: Ameya Kirtane
Author-email: ameya.kirtane@fluidanalytics.ai
Requires-Python: >=3.10
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: boto3 (>=1.40.64,<2.0.0)
Requires-Dist: pycocotools (>=2.0.0,<3.0.0)
Requires-Dist: pydantic (>=2.12.3,<3.0.0)
Description-Content-Type: text/markdown

# Fluid ML Training

Repository for managing versioned training datasets stored in S3 and creating training jobs.

## Overview

This repository provides tools for managing versioned datasets in S3, including:
- **VersionedDatasetManager**: A Python class for managing versioned datasets with frames, annotations, and manifests in S3
- **DatasetReviewer**: An interactive tool for reviewing and editing COCO-format annotations with bounding boxes

## VersionedDatasetManager

The `VersionedDatasetManager` class provides a comprehensive interface for managing versioned datasets stored in Amazon S3. It handles frames (as zip files), annotations (JSON files), and dataset manifests that track version history.

### Features

- **Version Management**: Automatically track and increment dataset versions
- **S3 Integration**: Upload datasets to S3 and download them back with a single call
- **Manifest Tracking**: Maintain JSON manifests that track all dataset versions and metadata
- **Path Management**: Automatic S3 path construction following a consistent structure

### S3 Structure

Datasets are organized in S3 with the following structure:

```
{prefix}/
├── RAW-DATA/
│   └── frames/
│       └── dataset_id={dataset_id}/
│           └── frames.zip
├── annotations/
│   └── dataset_id={dataset_id}/
│       └── version={version}/
│           └── annotations.json
└── manifests/
    └── dataset_id={dataset_id}/
        └── dataset_manifest.json
```

### Installation

```bash
pip install -e .
```

### Basic Usage

```python
import boto3
from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Initialize the manager
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(
    bucket="my-bucket",
    s3_client=s3_client,
    prefix="datasets"  # Optional S3 prefix
)

# Create a new dataset
manager.create_dataset(
    dataset_id="project1",
    dataset_name="Project 1 Dataset",
    frames_dir="./local_frames",
    annotation_file="./annotations.json",
    annotation_type="COCO"
)

# Update an existing dataset (creates new version)
manager.update_dataset(
    dataset_id="project1",
    frames_dir="./updated_frames",
    annotation_file="./updated_annotations.json",
    annotation_type="COCO"
)

# Download a dataset
manager.download_dataset(
    dataset_id="project1",
    output_dir="./downloaded_data",
    version="2.0"  # Optional: defaults to latest
)

# List all datasets
all_datasets = manager.get_datasets()
for dataset_id, manifest in all_datasets.items():
    print(f"{dataset_id}: {manifest.dataset_name}")
    print(f"  Versions: {list(manifest.versions.keys())}")
```

### API Reference

#### Initialization

```python
VersionedDatasetManager(bucket: str, s3_client, prefix: str = "")
```

- `bucket`: Name of the S3 bucket where datasets are stored
- `s3_client`: Boto3 S3 client instance
- `prefix`: Optional S3 prefix/path to prepend to all dataset paths

#### Creating Datasets

**`create_dataset(dataset_id, dataset_name, frames_dir, annotation_file, annotation_type="COCO")`**

Creates a new dataset with version 1.0. Uploads frames (as a zip archive) and annotations, and writes the initial manifest.

**`update_dataset(dataset_id, frames_dir, annotation_file, annotation_type="COCO")`**

Updates an existing dataset by creating a new version. Automatically increments the version number by 1.0.

**`create_or_update_dataset(dataset_id, frames_dir, annotation_file, annotation_type="COCO", dataset_name="")`**

Automatically determines whether to create or update a dataset. If the dataset doesn't exist, creates it with version 1.0. If it exists, creates a new version.
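This makes ingestion scripts idempotent. The snippet below is illustrative and assumes the `manager` from the Basic Usage section:

```python
# Creates the dataset with version 1.0 on the first run,
# and publishes a new version on every subsequent run.
manager.create_or_update_dataset(
    dataset_id="project1",
    frames_dir="./local_frames",
    annotation_file="./annotations.json",
    annotation_type="COCO",
    dataset_name="Project 1 Dataset",  # only needed when the dataset is first created
)
```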

#### Downloading Datasets

**`download_dataset(dataset_id, output_dir, version="")`**

Downloads a dataset version from S3 to a local directory. The following files are written to `output_dir`:
- `dataset_manifest.json` - The dataset manifest
- `annotations.json` - Annotations for the specified version
- `data/frames.zip` - The frames zip file

If `version` is empty, downloads the latest version.
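For example, a training script might download the latest version and unpack the frames archive in one step (paths below are illustrative):

```python
import zipfile
from pathlib import Path

output_dir = Path("./training_data")
manager.download_dataset(dataset_id="project1", output_dir=str(output_dir))

# Unpack the downloaded frames archive next to the manifest and annotations
with zipfile.ZipFile(output_dir / "data" / "frames.zip") as zf:
    zf.extractall(output_dir / "frames")
```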

#### Querying Datasets

**`get_datasets() -> dict[str, DatasetManifest]`**

Returns a dictionary mapping `dataset_id` to `DatasetManifest` objects for all datasets in S3.

**`get_latest_version(dataset_id) -> str`**

Returns the latest version string for a dataset by comparing numeric version values.

**`version_exists(dataset_id, version) -> bool`**

Checks if a specific version exists for a dataset.

**`is_new_dataset(dataset_id) -> bool`**

Checks if a dataset is new (doesn't exist in S3 yet).
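These helpers are useful as pre-flight checks before uploading or downloading. A short sketch (dataset ID and version are illustrative):

```python
dataset_id = "project1"

if manager.is_new_dataset(dataset_id):
    print(f"{dataset_id} does not exist yet; call create_dataset() first")
else:
    latest = manager.get_latest_version(dataset_id)
    print(f"Latest version of {dataset_id}: {latest}")
    if not manager.version_exists(dataset_id, "2.0"):
        print("Version 2.0 has not been published yet")
```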

#### Path Helpers

**`get_manifest_path(dataset_id, allow_new=False) -> str`**

Returns the S3 key path for a dataset's manifest file.

**`get_frames_path(dataset_id, allow_new=False) -> str`**

Returns the S3 key path for a dataset's frames zip file.

**`get_annotations_path(dataset_id, version="", allow_new=False) -> str`**

Returns the S3 key path for a dataset's annotation file for a specific version.
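These helpers are handy when accessing objects directly (for example with the AWS CLI or custom tooling). The returned keys follow the structure shown earlier; the values in the comments below assume an empty prefix and are illustrative:

```python
manifest_key = manager.get_manifest_path("project1")
frames_key = manager.get_frames_path("project1")
annotations_key = manager.get_annotations_path("project1", version="1.0")

# With an empty prefix these resolve to keys such as:
#   manifests/dataset_id=project1/dataset_manifest.json
#   RAW-DATA/frames/dataset_id=project1/frames.zip
#   annotations/dataset_id=project1/version=1.0/annotations.json
print(manifest_key, frames_key, annotations_key, sep="\n")
```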

### Dataset Manifest Structure

The dataset manifest is a JSON file that tracks dataset metadata and version history:

```json
{
  "dataset_id": "project1",
  "dataset_name": "Project 1 Dataset",
  "frames_s3_paths": "RAW-DATA/frames/dataset_id=project1/frames.zip",
  "versions": {
    "1.0": {
      "version": "1.0",
      "annotation_type": "COCO",
      "annotation_file": "annotations/dataset_id=project1/version=1.0/annotations.json",
      "num_frames": 1000,
      "num_clips": 0,
      "num_videos": 1
    },
    "2.0": {
      "version": "2.0",
      "annotation_type": "COCO",
      "annotation_file": "annotations/dataset_id=project1/version=2.0/annotations.json",
      "num_frames": 1200,
      "num_clips": 0,
      "num_videos": 1
    }
  }
}
```
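In code, the manifest is exposed as a `DatasetManifest` object (the package depends on pydantic). The actual models live in the package; the sketch below is only an approximation reconstructed from the JSON above, not the authoritative definition:

```python
from pydantic import BaseModel


class DatasetVersion(BaseModel):
    # Approximate shape of one entry under "versions"
    version: str
    annotation_type: str
    annotation_file: str
    num_frames: int
    num_clips: int
    num_videos: int


class DatasetManifest(BaseModel):
    # Approximate shape of dataset_manifest.json
    dataset_id: str
    dataset_name: str
    frames_s3_paths: str
    versions: dict[str, DatasetVersion]
```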

### Example Workflow

```python
import boto3
from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Setup
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(bucket="my-training-data", s3_client=s3_client)

# Step 1: Create initial dataset
manager.create_dataset(
    dataset_id="pipe-inspection-v1",
    dataset_name="Pipe Inspection Dataset v1",
    frames_dir="./data/frames",
    annotation_file="./data/annotations.json",
    annotation_type="COCO"
)

# Step 2: Later, update with new annotations
manager.update_dataset(
    dataset_id="pipe-inspection-v1",
    frames_dir="./data/frames",  # Can be same or updated
    annotation_file="./data/annotations_v2.json",
    annotation_type="COCO"
)

# Step 3: Download specific version for training
manager.download_dataset(
    dataset_id="pipe-inspection-v1",
    output_dir="./training_data",
    version="2.0"
)

# Step 4: Check what versions are available
manifest = manager.get_datasets()["pipe-inspection-v1"]
print(f"Available versions: {list(manifest.versions.keys())}")
latest = manager.get_latest_version("pipe-inspection-v1")
print(f"Latest version: {latest}")
```

## DatasetReviewer

The `DatasetReviewer` is an interactive tool for reviewing and editing COCO-format annotations with bounding boxes. It provides a matplotlib-based UI for visualizing images with their annotations and allows interactive editing of annotation labels and attributes.

### Features

- **Interactive Visualization**: View images with bounding box annotations overlaid
- **Keyboard Navigation**: Navigate through images using arrow keys
- **Annotation Editing**: Click on bounding boxes to edit category labels and attributes
- **COCO Format Support**: Works with standard COCO annotation format

### Usage

```python
from dataset_reviewer import DatasetReviewer

# Initialize reviewer
reviewer = DatasetReviewer(
    dataset_dir="./downloaded_data",
    frames_zip_path=None,  # Optional: defaults to {dataset_dir}/frames.zip
    annotations_file_path=None  # Optional: defaults to {dataset_dir}/annotations.json
)

# Start interactive review session
reviewer.review()
```

### Keyboard Controls

- **Right Arrow** or **'n'**: Next image
- **Left Arrow** or **'p'**: Previous image
- **'s'**: Save annotations to JSON file
- **'q'** or **Escape**: Quit

### Mouse Controls

- **Click on bounding box**: Open edit dialog to modify:
  - Category label
  - Attributes (distance_from_camera, visibility, water_level, secondary_defect, clock_position, pipe_color, pipe_condition, difficult, occluded, rotation)

### Example Workflow

```python
from dataset_reviewer import DatasetReviewer

# Download dataset first using VersionedDatasetManager
manager.download_dataset(
    dataset_id="pipe-inspection-v1",
    output_dir="./review_data"
)

# Review and edit annotations
reviewer = DatasetReviewer("./review_data")
reviewer.review()

# After editing, annotations.json is updated locally
# You can then upload the updated annotations as a new version
manager.update_dataset(
    dataset_id="pipe-inspection-v1",
    frames_dir="./review_data/data",  # Extract frames.zip first
    annotation_file="./review_data/annotations.json",
    annotation_type="COCO"
)
```

## Integration Example

Complete workflow combining both tools:

```python
import boto3
from fluid_ml_training.dataset_manager import VersionedDatasetManager
from dataset_reviewer import DatasetReviewer

# Setup manager
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(bucket="my-bucket", s3_client=s3_client)

# 1. Download dataset for review
manager.download_dataset(
    dataset_id="my-dataset",
    output_dir="./review_workspace"
)

# 2. Review and edit annotations
reviewer = DatasetReviewer("./review_workspace")
reviewer.review()  # Interactive session - edit annotations as needed

# 3. Extract frames for upload (if needed)
import zipfile
with zipfile.ZipFile("./review_workspace/data/frames.zip", 'r') as zip_ref:
    zip_ref.extractall("./review_workspace/frames")

# 4. Upload updated dataset as new version
manager.update_dataset(
    dataset_id="my-dataset",
    frames_dir="./review_workspace/frames",
    annotation_file="./review_workspace/annotations.json",
    annotation_type="COCO"
)
```

## Requirements

- Python >= 3.10
- boto3 >= 1.40.64, < 2.0.0
- pycocotools >= 2.0.0, < 3.0.0
- pydantic >= 2.12.3, < 3.0.0

For the `DatasetReviewer` (these dependencies are currently commented out in the project dependencies and must be installed separately):
- matplotlib
- PIL (Pillow)
- numpy
- tkinter

## Configuration

### S3 Configuration

The `VersionedDatasetManager` uses standard AWS credentials via the boto3 client you pass in. Ensure your AWS credentials are configured through one of the following (an explicit-session example follows the list):
- AWS credentials file (`~/.aws/credentials`)
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- IAM role (if running on EC2)
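If you need a non-default profile or region, construct the S3 client explicitly and pass it to the manager. The profile and region names below are placeholders:

```python
import boto3
from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Placeholder profile/region; substitute your own AWS configuration
session = boto3.Session(profile_name="my-profile", region_name="us-east-1")
manager = VersionedDatasetManager(
    bucket="my-bucket",
    s3_client=session.client("s3"),
)
```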

### S3 Path Structure

Paths are configurable via the `prefix` parameter in `VersionedDatasetManager.__init__()`; a short example follows the list below. The default structure uses:
- `RAW-DATA/frames/` for frame zip files
- `annotations/` for annotation JSON files
- `manifests/` for dataset manifest files
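For example, constructing the manager with a prefix prepends it to every key (the prefix and dataset ID below are illustrative):

```python
manager = VersionedDatasetManager(
    bucket="my-bucket",
    s3_client=s3_client,
    prefix="datasets",
)

# Keys are now prefixed, e.g. (illustrative):
#   datasets/manifests/dataset_id=project1/dataset_manifest.json
print(manager.get_manifest_path("project1"))
```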

## Development

```bash
# Install in development mode
pip install -e .

# Run tests
pytest

# Lint code
ruff check .
```

## License

[Add your license information here]

