Added 3 Pixmo vision-language datasets #1523
Conversation
…n test. Some image URLs return a 404 or 429, so only a small portion of the dataset was tested. Added a config file I used to test training, but got the error: ValueError: BuilderConfig ParquetConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['data/train-*']}, description=None, batch_size=None, columns=None, features=None, filters=None) doesn't have a 'processor_name' key.
… train on a subset of data. Removed one redundant test from test_pixmo.py.
```yaml
# # limit: 4096 # Uncomment to limit dataset size!
# return_tensors: True
- dataset_name: "allenai/pixmo-cap-qa"
  split: "train[10:20]"
```
`[10:20]` is too little - just 10 examples. Can we drop `[10:20]`?
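For reference, dropping the slice would leave the config entry as:

```yaml
- dataset_name: "allenai/pixmo-cap-qa"
  split: "train"
```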
Around 7% of the image URLs are invalid. Looking at just the first 1000, I get the following counts of HTTP statuses:
200: 930
429: 31
404: 19
403: 12
405: 3
406: 1
202: 1
How many rows do you think would be sufficient? I can slice the dataframe to include just rows with valid URLs.
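For illustration, keeping only the rows whose URL returned HTTP 200 is a one-liner on the `image_df` built by the script further below (a sketch, not part of the PR):

```python
# Keep only rows whose image URL responded with HTTP 200.
valid_df = image_df[image_df["url_status"] == 200]
print(f"{len(valid_df)} of {len(image_df)} URLs are valid")
```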
Here is a CSV file of the sample of 1000:
image_df.csv
429 is a throttling error ("Too Many Requests"). Could you share the snippet you used to reproduce these errors?
I'm curious whether there is some way to send fewer requests. A repro script would also be good to have for communication with the Pixmo owners. Thanks!
```python
from urllib.parse import urlparse

import pandas as pd
import requests
from datasets import load_dataset

dataset = "allenai/pixmo-cap"
nRows = 20
data = load_dataset(dataset, split="train")


def get_image_url_status_code(image_url, allow_redirects=True, timeout=5):
    """Retrieves the HTTP status code of an image URL.

    Args:
        image_url (str): The URL of the image.
        allow_redirects (bool, optional): Whether to follow redirects. Defaults to True.
        timeout (int, optional): The request timeout in seconds. Defaults to 5.

    Returns:
        int: The HTTP status code, or None if an error occurred.
    """
    try:
        # Basic URL validity check (optional, but good practice).
        parsed_url = urlparse(image_url)
        if not all([parsed_url.scheme, parsed_url.netloc]):
            print(f"Invalid URL: {image_url}")  # Or raise an exception
            return None
        if parsed_url.scheme not in ("http", "https"):
            print(f"Invalid scheme. Must be http or https: {image_url}")
            return None
        # Use a HEAD request to get headers without the full content.
        response = requests.head(image_url, allow_redirects=allow_redirects, timeout=timeout)
        return response.status_code
    except requests.exceptions.RequestException as e:
        print(f"Error during request: {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None


# Columns of the dataframe.
urls = []
url_statuses = []
for i in range(nRows):
    status_code = get_image_url_status_code(data[i]["image_url"])
    urls.append(data[i]["image_url"])
    url_statuses.append(status_code)
    if status_code == 200:
        print(f"Image {i} is valid")
    else:
        print(f"Image {i} is invalid")

image_df = pd.DataFrame({"url": urls, "url_status": url_statuses})
print(image_df.head(10))

counts = image_df["url_status"].value_counts()
print(counts)

filename = dataset.replace("/", "_")
filename = f"{filename}-{nRows}_df.csv"
image_df.to_csv(filename, index=False)
```
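Regarding sending fewer requests: one possible mitigation (a sketch, not part of the PR) is to reuse a single requests.Session configured to retry 429 responses with exponential backoff, so throttled URLs are retried after a delay instead of being counted as failures:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# One shared session: reuses connections and retries 429s with backoff
# (waits 1s, 2s, 4s between attempts) instead of failing immediately.
session = requests.Session()
retries = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429],
    allowed_methods=["HEAD", "GET"],
)
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))


def head_status(image_url, timeout=5):
    """Drop-in replacement for the requests.head() call in the script above."""
    try:
        return session.head(image_url, allow_redirects=True, timeout=timeout).status_code
    except requests.RequestException:
        return None
```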
```python
LoadDatasetInfo(
    dataset_name="allenai/pixmo-ask-model-anything",
    model_name=_DEFAULT_MODEL_NAME,
    dataset_split="train[10:20]",  # 404 error for some image URLs
```
For this 404 issue, can you please report it to the Pixmo owners at https://huggingface.co/datasets/allenai/pixmo-docs/discussions and link the GitHub question or issue here as a comment?
Ok, reported it here: discussion
That's a lot of 404s in the dataset!
I also posted the same question under allenai/pixmo-cap-qa: https://huggingface.co/datasets/allenai/pixmo-cap-qa/discussions/4
Should I wait for them to change the dataset?
Or should I make a new dataset with the same data, but replace image URLs with bytes and only include the ones that work?
Waiting for their feedback may take a long time (no guarantee they'll respond).
Let's do this:
- Clean up this PR
- Make sure that presubmit tests pass
- Add comments to all Pixmo datasets noting that they are affected by the image URL 404 issue
- Submit

We can later think about how to handle it: either re-upload with image bytes (ideally, though that can likely only be done by the owners of the dataset (?), which is outside of our control), or add dataset functionality in Oumi to ignore bad records (see the sketch below).
I also updated https://huggingface.co/datasets/allenai/pixmo-cap-qa/discussions/4
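For the "ignore bad records" option, a rough sketch (hypothetical, not what the PR implements) of pre-filtering dead URLs with datasets.filter:

```python
import requests
from datasets import load_dataset


def url_is_live(url, timeout=5):
    """Return True if the image URL still resolves (not 404/429/etc.)."""
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False


data = load_dataset("allenai/pixmo-cap-qa", split="train[:100]")
# Drop records whose image URL is dead. This issues one request per row,
# so it is slow and best done once, with the result cached or re-uploaded.
data = data.filter(lambda row: url_is_live(row["image_url"]))
```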
Overall the changes look reasonable. +1 to Nikolai's feedback above
@jrwana There are some test errors. PTAL
```python
assert dataset.num_rows == info.expected_rows, debug_tag
# TODO: Remove this once we have a way to handle the expected rows for
# pixmo datasets
# if info.expected_rows is not None:
```
Uncomment?
The integration tests don't pass if I uncomment it, since the invalid URLs break the test.
But I can uncomment it so the other datasets will be tested correctly.
This means other datasets are less tested. Instead, can you do this:
1. Uncomment lines 223-224
2. Update the Pixmo test definitions to use None:

```python
LoadDatasetInfo(
    dataset_name="allenai/pixmo-cap",
    ...
    expected_rows=None,
),
```
…o some 404 URLs. Changed test_pixmo to reflect: 1) the removed transcripts column in pixmo-cap, and 2) combining two ContentItems into one message for pixmo-ask-model-anything
…wana/pixmo Merging from remote origin
@jrwana Tests are green now, almost there... Let's address remaining comments and submit. Thanks!
```python
A "transcripts" column is also available but not used yet.

The dataset is affected by some image URLs having a 404 issue.
```
Move the 404 warning to the class docstring, as in PixmoAskModelAnythingDataset.
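For illustration, the warning could sit in the class docstring along these lines (a hypothetical sketch; the base class name and exact wording are assumptions):

```python
class PixmoAskModelAnythingDataset(VisionLanguageSftDataset):  # base class assumed
    """Dataset class for the `allenai/pixmo-ask-model-anything` dataset.

    Note: the dataset is affected by some image URLs returning 404 errors.
    """
```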
```python
It starts with a [USER] and ends with an [ASSISTANT] role tag.
The Assistant response appears in the "answer" field.

The dataset is affected by some image URLs having a 404 issue.
```
Move the 404 warning to the class docstring here as well.
Description
Added 3 Pixmo vision-language datasets.
Since some of the image URLs return 404 and 429 errors, the integration test was modified to test only a subset of the training set.
For reference, added the YAML files used for manual testing with SkyPilot, but these should be deleted.
Related issues
Fixes #1401
Before submitting
Reviewers
At least one review from a member of oumi-ai/oumi-staff is required.