fluid/stable/: fluid-ml-training-0.1.0 metadata and description

Repo to manage training data and create training jobs.

| Field | Value |
|---|---|
| author | Ameya Kirtane |
| author_email | ameya.kirtane@fluidanalytics.ai |
| description_content_type | text/markdown |
| metadata_version | 2.4 |
| requires_python | >=3.10 |
# Fluid ML Training
Repository for managing versioned training datasets stored in S3 and creating training jobs.
## Overview

This repository provides tools for managing versioned datasets in S3, including:

- `VersionedDatasetManager`: A Python class for managing versioned datasets with frames, annotations, and manifests in S3
- `DatasetReviewer`: An interactive tool for reviewing and editing COCO-format annotations with bounding boxes
## VersionedDatasetManager

The `VersionedDatasetManager` class provides a comprehensive interface for managing versioned datasets stored in Amazon S3. It handles frames (as zip files), annotations (JSON files), and dataset manifests that track version history.
### Features
- Version Management: Automatically track and increment dataset versions
- S3 Integration: Seamlessly upload and download datasets from S3
- Manifest Tracking: Maintain JSON manifests that track all dataset versions and metadata
- Path Management: Automatic S3 path construction following a consistent structure
### S3 Structure

Datasets are organized in S3 with the following structure:

```text
{prefix}/
├── RAW-DATA/
│   └── frames/
│       └── dataset_id={dataset_id}/
│           └── frames.zip
├── annotations/
│   └── dataset_id={dataset_id}/
│       └── version={version}/
│           └── annotations.json
└── manifests/
    └── dataset_id={dataset_id}/
        └── dataset_manifest.json
```
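The layout above amounts to pure key construction. The helper below is a hypothetical sketch that only mirrors the documented tree (the class itself exposes `get_manifest_path` and related helpers for this):

```python
def s3_keys(prefix: str, dataset_id: str, version: str) -> dict[str, str]:
    """Hypothetical helper mirroring the documented S3 layout."""
    base = f"{prefix}/" if prefix else ""
    return {
        "frames": f"{base}RAW-DATA/frames/dataset_id={dataset_id}/frames.zip",
        "annotations": f"{base}annotations/dataset_id={dataset_id}/version={version}/annotations.json",
        "manifest": f"{base}manifests/dataset_id={dataset_id}/dataset_manifest.json",
    }


keys = s3_keys("datasets", "project1", "1.0")
print(keys["frames"])  # datasets/RAW-DATA/frames/dataset_id=project1/frames.zip
```

Note that an empty `prefix` drops the leading path segment entirely rather than producing a key that starts with `/`.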
### Installation

```bash
pip install -e .
```
### Basic Usage

```python
import boto3

from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Initialize the manager
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(
    bucket="my-bucket",
    s3_client=s3_client,
    prefix="datasets"  # Optional S3 prefix
)

# Create a new dataset
manager.create_dataset(
    dataset_id="project1",
    dataset_name="Project 1 Dataset",
    frames_dir="./local_frames",
    annotation_file="./annotations.json",
    annotation_type="COCO"
)

# Update an existing dataset (creates a new version)
manager.update_dataset(
    dataset_id="project1",
    frames_dir="./updated_frames",
    annotation_file="./updated_annotations.json",
    annotation_type="COCO"
)

# Download a dataset
manager.download_dataset(
    dataset_id="project1",
    output_dir="./downloaded_data",
    version="2.0"  # Optional: defaults to latest
)

# List all datasets
all_datasets = manager.get_datasets()
for dataset_id, manifest in all_datasets.items():
    print(f"{dataset_id}: {manifest.dataset_name}")
    print(f"  Versions: {list(manifest.versions.keys())}")
```
## API Reference

### Initialization

```python
VersionedDatasetManager(bucket: str, s3_client, prefix: str = "")
```

- `bucket`: Name of the S3 bucket where datasets are stored
- `s3_client`: Boto3 S3 client instance
- `prefix`: Optional S3 prefix/path to prepend to all dataset paths
### Creating Datasets

```python
create_dataset(dataset_id, dataset_name, frames_dir, annotation_file, annotation_type="COCO")
```

Creates a new dataset with version 1.0. Uploads frames (as a zip), uploads annotations, and creates the initial manifest.

```python
update_dataset(dataset_id, frames_dir, annotation_file, annotation_type="COCO")
```

Updates an existing dataset by creating a new version. Automatically increments the version number by 1.0.

```python
create_or_update_dataset(dataset_id, frames_dir, annotation_file, annotation_type="COCO", dataset_name="")
```

Automatically determines whether to create or update a dataset. If the dataset doesn't exist, creates it with version 1.0. If it exists, creates a new version.
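The create-or-update decision and the "increment by 1.0" rule can be sketched as plain version arithmetic. `next_version` below is a hypothetical illustration of the documented behavior, not the library's implementation:

```python
def next_version(existing_versions: list[str]) -> str:
    """Sketch of the documented rule: new datasets start at 1.0, updates add 1.0."""
    if not existing_versions:
        # No versions yet: the is_new_dataset case, handled by create_dataset
        return "1.0"
    # Versions exist: the update_dataset case, incrementing the numeric maximum
    latest = max(float(v) for v in existing_versions)
    return f"{latest + 1.0:.1f}"


print(next_version([]))              # 1.0
print(next_version(["1.0", "2.0"]))  # 3.0
```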
### Downloading Datasets

```python
download_dataset(dataset_id, output_dir, version="")
```

Downloads a dataset version from S3 to a local directory. Downloads:

- `dataset_manifest.json` - The dataset manifest
- `annotations.json` - Annotations for the specified version
- `data/frames.zip` - The frames zip file

If `version` is empty, downloads the latest version.
### Querying Datasets

```python
get_datasets() -> dict[str, DatasetManifest]
```

Returns a dictionary mapping `dataset_id` to `DatasetManifest` objects for all datasets in S3.

```python
get_latest_version(dataset_id) -> str
```

Returns the latest version string for a dataset by comparing numeric version values.
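Comparing numeric values matters once versions pass 9.0, where plain string ordering goes wrong. A minimal sketch of the idea (the method's actual implementation may differ):

```python
versions = ["1.0", "9.0", "10.0"]

# Lexicographic string comparison would wrongly pick "9.0" ...
assert max(versions) == "9.0"

# ... while comparing numeric values picks the true latest version.
assert max(versions, key=float) == "10.0"
```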
```python
version_exists(dataset_id, version) -> bool
```

Checks if a specific version exists for a dataset.

```python
is_new_dataset(dataset_id) -> bool
```

Checks if a dataset is new (doesn't exist in S3 yet).
### Path Helpers

```python
get_manifest_path(dataset_id, allow_new=False) -> str
```

Returns the S3 key path for a dataset's manifest file.

```python
get_frames_path(dataset_id, allow_new=False) -> str
```

Returns the S3 key path for a dataset's frames zip file.

```python
get_annotations_path(dataset_id, version="", allow_new=False) -> str
```

Returns the S3 key path for a dataset's annotation file for a specific version.
## Dataset Manifest Structure

The dataset manifest is a JSON file that tracks dataset metadata and version history:
```json
{
  "dataset_id": "project1",
  "dataset_name": "Project 1 Dataset",
  "frames_s3_paths": "RAW-DATA/frames/dataset_id=project1/frames.zip",
  "versions": {
    "1.0": {
      "version": "1.0",
      "annotation_type": "COCO",
      "annotation_file": "annotations/dataset_id=project1/version=1.0/annotations.json",
      "num_frames": 1000,
      "num_clips": 0,
      "num_videos": 1
    },
    "2.0": {
      "version": "2.0",
      "annotation_type": "COCO",
      "annotation_file": "annotations/dataset_id=project1/version=2.0/annotations.json",
      "num_frames": 1200,
      "num_clips": 0,
      "num_videos": 1
    }
  }
}
```
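For illustration, a manifest of this shape can be loaded into typed objects. The package itself models this with pydantic (`DatasetManifest`); the stdlib dataclass sketch below is hypothetical, with field names taken directly from the JSON above:

```python
import json
from dataclasses import dataclass


@dataclass
class VersionInfo:
    version: str
    annotation_type: str
    annotation_file: str
    num_frames: int
    num_clips: int
    num_videos: int


@dataclass
class Manifest:
    dataset_id: str
    dataset_name: str
    frames_s3_paths: str
    versions: dict[str, VersionInfo]

    @classmethod
    def from_json(cls, raw: str) -> "Manifest":
        data = json.loads(raw)
        # Promote each version entry from a plain dict to a VersionInfo
        data["versions"] = {k: VersionInfo(**v) for k, v in data["versions"].items()}
        return cls(**data)


raw = json.dumps({
    "dataset_id": "project1",
    "dataset_name": "Project 1 Dataset",
    "frames_s3_paths": "RAW-DATA/frames/dataset_id=project1/frames.zip",
    "versions": {
        "1.0": {
            "version": "1.0",
            "annotation_type": "COCO",
            "annotation_file": "annotations/dataset_id=project1/version=1.0/annotations.json",
            "num_frames": 1000,
            "num_clips": 0,
            "num_videos": 1,
        }
    },
})
m = Manifest.from_json(raw)
print(m.versions["1.0"].num_frames)  # 1000
```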
## Example Workflow

```python
import boto3

from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Setup
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(bucket="my-training-data", s3_client=s3_client)

# Step 1: Create the initial dataset
manager.create_dataset(
    dataset_id="pipe-inspection-v1",
    dataset_name="Pipe Inspection Dataset v1",
    frames_dir="./data/frames",
    annotation_file="./data/annotations.json",
    annotation_type="COCO"
)

# Step 2: Later, update with new annotations
manager.update_dataset(
    dataset_id="pipe-inspection-v1",
    frames_dir="./data/frames",  # Can be the same or updated
    annotation_file="./data/annotations_v2.json",
    annotation_type="COCO"
)

# Step 3: Download a specific version for training
manager.download_dataset(
    dataset_id="pipe-inspection-v1",
    output_dir="./training_data",
    version="2.0"
)

# Step 4: Check what versions are available
manifest = manager._load_manifest("pipe-inspection-v1")
print(f"Available versions: {list(manifest.versions.keys())}")
latest = manager.get_latest_version("pipe-inspection-v1")
print(f"Latest version: {latest}")
```
## DatasetReviewer

The `DatasetReviewer` is an interactive tool for reviewing and editing COCO-format annotations with bounding boxes. It provides a matplotlib-based UI for visualizing images with their annotations and allows interactive editing of annotation labels and attributes.
### Features
- Interactive Visualization: View images with bounding box annotations overlaid
- Keyboard Navigation: Navigate through images using arrow keys
- Annotation Editing: Click on bounding boxes to edit category labels and attributes
- COCO Format Support: Works with standard COCO annotation format
### Usage

```python
from dataset_reviewer import DatasetReviewer

# Initialize the reviewer
reviewer = DatasetReviewer(
    dataset_dir="./downloaded_data",
    frames_zip_path=None,  # Optional: defaults to {dataset_dir}/frames.zip
    annotations_file_path=None  # Optional: defaults to {dataset_dir}/annotations.json
)

# Start an interactive review session
reviewer.review()
```
### Keyboard Controls

- Right Arrow or `n`: Next image
- Left Arrow or `p`: Previous image
- `s`: Save annotations to JSON file
- `q` or Escape: Quit
### Mouse Controls

- Click on a bounding box: Opens an edit dialog to modify:
  - Category label
  - Attributes (`distance_from_camera`, `visibility`, `water_level`, `secondary_defect`, `clock_position`, `pipe_color`, `pipe_condition`, `difficult`, `occluded`, `rotation`)
### Example Workflow

```python
from dataset_reviewer import DatasetReviewer

# Download the dataset first using the VersionedDatasetManager
# (`manager` as set up in the examples above)
manager.download_dataset(
    dataset_id="pipe-inspection-v1",
    output_dir="./review_data"
)

# Review and edit annotations
reviewer = DatasetReviewer("./review_data")
reviewer.review()

# After editing, annotations.json is updated locally.
# You can then upload the updated annotations as a new version:
manager.update_dataset(
    dataset_id="pipe-inspection-v1",
    frames_dir="./review_data/data",  # Extract frames.zip first
    annotation_file="./review_data/annotations.json",
    annotation_type="COCO"
)
```
## Integration Example

Complete workflow combining both tools:

```python
import zipfile

import boto3

from dataset_reviewer import DatasetReviewer
from fluid_ml_training.dataset_manager import VersionedDatasetManager

# Set up the manager
s3_client = boto3.client('s3')
manager = VersionedDatasetManager(bucket="my-bucket", s3_client=s3_client)

# 1. Download the dataset for review
manager.download_dataset(
    dataset_id="my-dataset",
    output_dir="./review_workspace"
)

# 2. Review and edit annotations
reviewer = DatasetReviewer("./review_workspace")
reviewer.review()  # Interactive session - edit annotations as needed

# 3. Extract frames for upload (if needed)
with zipfile.ZipFile("./review_workspace/data/frames.zip", 'r') as zip_ref:
    zip_ref.extractall("./review_workspace/frames")

# 4. Upload the updated dataset as a new version
manager.update_dataset(
    dataset_id="my-dataset",
    frames_dir="./review_workspace/frames",
    annotation_file="./review_workspace/annotations.json",
    annotation_type="COCO"
)
```
## Requirements
- Python >= 3.10
- boto3 >= 1.40.64
- pydantic >= 2.12.3, < 3.0.0
For `DatasetReviewer` (currently commented out in the dependency list):
- matplotlib
- PIL (Pillow)
- numpy
- tkinter
## Configuration

### S3 Configuration

The `VersionedDatasetManager` uses standard AWS credentials. Ensure your AWS credentials are configured via one of:

- AWS credentials file (`~/.aws/credentials`)
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- IAM role (if running on EC2)
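For the environment-variable option, a minimal config fragment might look like this (placeholder values shown; the region variable is the standard AWS one, not something this package requires):

```shell
# Placeholder values - substitute real credentials, or prefer an IAM role
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
```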
### S3 Path Structure

Paths are configurable via the `prefix` parameter in `VersionedDatasetManager.__init__()`. The default structure uses:

- `RAW-DATA/frames/` for frame zip files
- `annotations/` for annotation JSON files
- `manifests/` for dataset manifest files
## Development

```bash
# Install in development mode
pip install -e .

# Run tests
pytest

# Lint code
ruff check .
```
## License
[Add your license information here]