Added 3 Pixmo vision-language datasets by jrwana · Pull Request #1523 · oumi-ai/oumi · GitHub

Added 3 Pixmo vision-language datasets #1523


Merged
merged 22 commits into oumi-ai:main on Apr 10, 2025

Conversation

jrwana
Contributor
@jrwana jrwana commented Mar 8, 2025

Description

Added 3 Pixmo vision-language datasets.
Since some of the image URLs return 404 and 429 errors, the integration test was modified to test only a subset of the training set.
For reference, the YAML files used for manual testing with SkyPilot are included; these should be deleted before merging.

Related issues

Fixes #1401

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the contributor guideline Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

Reviewers

At least one review from a member of oumi-ai/oumi-staff is required.

jrwana added 2 commits March 6, 2025 12:47
…n test. Some image URLs return a 404 or 429 so only a small portion of the dataset was tested.

Added a config file I used to test it for training but got the error: ValueError: BuilderConfig ParquetConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['data/train-*']}, description=None, batch_size=None, columns=None, features=None, filters=None) doesn't have a 'processor_name' key."
… train on a subset of data. Removed one redundant test from test_pixmo.py.
@taenin taenin requested review from nikg4 and optas March 9, 2025 03:56
@nikg4 nikg4 requested a review from taenin March 10, 2025 18:22
# # limit: 4096 # Uncomment to limit dataset size!
# return_tensors: True
- dataset_name: "allenai/pixmo-cap-qa"
split: "train[10:20]"
Collaborator

[10:20] is too little - just 10 examples. Can we drop [10:20]?

Contributor Author

Around 7% of the image URLs are invalid. Just looking at the first 1000, I get the following counts of HTTP statuses:
200: 930
429: 31
404: 19
403: 12
405: 3
406: 1
202: 1

How many rows do you think would be sufficient? I can slice the dataframe to include just rows with valid URLs.
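For illustration, a minimal sketch of that filtering step, assuming a dataframe with the `url` and `url_status` columns like the sample shared in this thread (the filename is hypothetical):

from datasets import load_dataset  # not needed here, shown only if rebuilding the sample
import pandas as pd

# Hypothetical filename; assumes columns "url" and "url_status" as in the sample above.
image_df = pd.read_csv("image_df.csv")

# Keep only rows whose image URL returned HTTP 200.
valid_df = image_df[image_df["url_status"] == 200]
print(f"{len(valid_df)} of {len(image_df)} URLs are valid")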

Contributor Author

Here is a CSV file of the sample of 1000:
image_df.csv

Collaborator

429 is a throttling error ("Too Many Requests"). Could you share the snippet you used to reproduce these errors?

I'm curious whether there is some way to send fewer requests. A repro script would also be good to have for communication with the Pixmo owners. Thanks!

Contributor Author

from datasets import load_dataset
import requests
from urllib.parse import urlparse
import pandas as pd

dataset = "allenai/pixmo-cap"
nRows = 20

data = load_dataset(dataset, split="train")

def get_image_url_status_code(image_url, allow_redirects=True, timeout=5):
    """
    Retrieves the HTTP status code of an image URL.

    Args:
        image_url (str): The URL of the image.
        allow_redirects (bool, optional): Whether to follow redirects. Defaults to True.
        timeout (int, optional): The request timeout in seconds. Defaults to 5.

    Returns:
        int: The HTTP status code, or None if an error occurred.
    """
    try:
        # Basic URL Validity Check (optional, but good practice)
        parsed_url = urlparse(image_url)
        if not all([parsed_url.scheme, parsed_url.netloc]):
            print(f"Invalid URL: {image_url}")  # Or raise an exception
            return None

        if parsed_url.scheme not in ('http', 'https'):
            print(f"Invalid scheme. Must be http or https: {image_url}")
            return None

        # Use a HEAD request to get headers without the full content
        response = requests.head(image_url, allow_redirects=allow_redirects, timeout=timeout)
        return response.status_code

    except requests.exceptions.RequestException as e:
        print(f"Error during request: {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

# columns of dataframe
urls = []
url_statuses = []

for i in range(nRows):
    status_code = get_image_url_status_code(data[i]["image_url"])
    urls.append(data[i]["image_url"])
    url_statuses.append(status_code)

    if status_code == 200:
        print(f"Image {i} is valid")
    else:
        print(f"Image {i} is invalid")


image_df = pd.DataFrame({
    "url": urls,
    "url_status": url_statuses
})

image_df.head(10)

counts = image_df['url_status'].value_counts()
print(counts)

filename = dataset.replace("/", "_")
filename = f"{filename}-{nRows}_df.csv"

image_df.to_csv(f"{filename}", index=False)
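On the question of sending fewer requests: one sketch (not part of the PR) is to space out the HEAD requests and retry once after a delay when the server throttles; the pause length and the use of the Retry-After header are assumptions, not documented behavior of the image hosts.

import time
import requests

def head_with_backoff(url, timeout=5, pause=1.0):
    """HEAD a URL, retrying once after a delay if the server returns 429 (Too Many Requests)."""
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    if response.status_code == 429:
        # Honor Retry-After when it is a plain number of seconds; otherwise fall back to `pause`.
        retry_after = response.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else pause
        time.sleep(wait)
        response = requests.head(url, allow_redirects=True, timeout=timeout)
    time.sleep(pause)  # Space out consecutive requests to reduce throttling.
    return response.status_code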

LoadDatasetInfo(
    dataset_name="allenai/pixmo-ask-model-anything",
    model_name=_DEFAULT_MODEL_NAME,
    dataset_split="train[10:20]",  # 404 error for some image URLs
Collaborator
@nikg4 nikg4 Mar 10, 2025

For this 404 issue, can you please report it to the Pixmo owners at https://huggingface.co/datasets/allenai/pixmo-docs/discussions and link the GitHub question or issue here as a comment?

Contributor Author

Ok, reported it here: discussion

Collaborator

That's a lot of 404s in the dataset!

Collaborator

I also posted the same question under allenai/pixmo-cap-qa : https://huggingface.co/datasets/allenai/pixmo-cap-qa/discussions/4

Contributor Author

Should I wait for them to change the dataset?
Or should I make a new dataset with the same data, but with the image URLs replaced by bytes, including only the ones that work?
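For what the bytes-based variant could look like, here is a minimal sketch assuming the `image_url` column from pixmo-cap; the split slice and the `image_bytes` column name are illustrative only:

import requests
from datasets import load_dataset

data = load_dataset("allenai/pixmo-cap", split="train[:100]")  # small illustrative slice

def fetch_bytes(example):
    """Download the image and store its bytes; mark the row as invalid if the fetch fails."""
    try:
        response = requests.get(example["image_url"], timeout=5)
        example["image_bytes"] = response.content if response.status_code == 200 else None
    except requests.exceptions.RequestException:
        example["image_bytes"] = None
    return example

with_bytes = data.map(fetch_bytes)
with_bytes = with_bytes.filter(lambda ex: ex["image_bytes"] is not None)  # drop broken URLs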

Collaborator
@nikg4 nikg4 Mar 18, 2025

Waiting for their feedback may take a long time (no guarantee they'll respond).
Let's do this:

  • Clean up this PR
  • Make sure that the presubmit tests pass
  • Add comments to all Pixmo datasets noting that they are affected by image URL 404 issues
  • Submit

We can think later about how to handle that: either re-upload with image bytes (ideally, though that can probably only be done by the owners of the dataset (?), which is outside of our control), or add dataset functionality in Oumi to ignore bad records (see the sketch below).

I also updated https://huggingface.co/datasets/allenai/pixmo-cap-qa/discussions/4
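
A minimal sketch of the "ignore bad records" option mentioned above; `iter_valid_examples` is a hypothetical helper, not existing Oumi functionality:

import requests

def iter_valid_examples(dataset, url_column="image_url", timeout=5):
    """Yield only examples whose image URL answers a HEAD request with HTTP 200."""
    for example in dataset:
        try:
            status = requests.head(
                example[url_column], allow_redirects=True, timeout=timeout
            ).status_code
        except requests.exceptions.RequestException:
            continue  # Unreachable URL: skip the record.
        if status == 200:
            yield example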

@nikg4 nikg4 requested a review from wizeng23 March 10, 2025 18:38
Collaborator
@taenin taenin left a comment

Overall the changes look reasonable. +1 to Nikolai's feedback above

@nikg4
Collaborator
nikg4 commented Mar 13, 2025

@jrwana There are some test errors. PTAL

assert dataset.num_rows == info.expected_rows, debug_tag
# TODO: Remove this once we have a way to handle the expected rows for
# pixmo datasets
# if info.expected_rows is not None:
Collaborator

Uncomment?

Contributor Author

The integration tests don't pass if I uncomment it, since the invalid URLs break the test.
But I can uncomment it so the other datasets will be tested correctly.

Collaborator

This means the other datasets are less tested. Instead, can you do this:
1. Uncomment lines 223...224
2. Update the Pixmo test definitions to use None:

LoadDatasetInfo(
    dataset_name="allenai/pixmo-cap",
    ...
    expected_rows=None,
),

@nikg4
Collaborator
nikg4 commented Mar 19, 2025

@jrwana Tests are green now, almost there... Let's address remaining comments and submit. Thanks!


A "transcripts" column is also available but not used yet.

The dataset is affected by some image URLs having a 404 issue.
Collaborator
@nikg4 nikg4 Mar 19, 2025

Move the 404 warning to the class docstring, as in PixmoAskModelAnythingDataset.
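
A sketch of where that warning could live; the base class name is an assumption for illustration:

class PixmoAskModelAnythingDataset(VisionLanguageSftDataset):  # base class assumed
    """Dataset class for the allenai/pixmo-ask-model-anything dataset.

    The dataset is affected by some image URLs having a 404 issue.
    """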

It starts with a [USER] and ends with an [ASSISTANT] role tag.
The Assistant response appears in the "answer" field.

The dataset is affected by some image URLs having a 404 issue.
Collaborator

Move the 404 warning to the class docstring.

@wizeng23 wizeng23 merged commit 2b62dbb into oumi-ai:main Apr 10, 2025
2 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature] Add support for Pixmo vision-language datasets
4 participants