Normalize urlparse fragment behavior across all supported Python versions #686
Conversation
did you want to include the updated test?
Fixed
If this seems too scary, I could normalize the other way and have our
I'm so good at this
@@ -27,6 +29,18 @@
from mrjob.compat import uses_020_counters


### danger: monkey patch urlparse on python < 2.7.4 ###
Yeah, sorry, no monkey-patching allowed. It's just not part of the deal when you import a library in Python.
Since the whole point of this code is to activate one line of code that says:
    url, fragment = url.split('#', 1)
why not just add that to our urlparse() wrapper as a post-processing step? Something to the effect of:
    if allow_fragments and not fragment and '#' in path:
        url, fragment = url.split('#', 1)

Yay, thanks for addressing this! And good research into the Python source code. I agree that it's better to emulate the behavior of the newer, fixed urlparse(). It's probably fine for our code that parses S3 and HDFS URIs to throw away the fragment. Fine with punting on that.
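For concreteness, here is a minimal sketch of that post-processing suggestion. It is not part of the original thread: urlparse_buggy is a hypothetical alias for the stdlib function, and the exact structure of the wrapper is an assumption.

    from urlparse import urlparse as urlparse_buggy, ParseResult

    def urlparse(url, scheme='', allow_fragments=True):
        """Like urlparse.urlparse(), but split '#' fragments out of the
        path even on Python versions whose urlparse() fails to."""
        scheme, netloc, path, params, query, fragment = \
            urlparse_buggy(url, scheme, allow_fragments)
        # on affected Python versions, the fragment is left stuck to the path
        if allow_fragments and '#' in path and not fragment:
            path, fragment = path.split('#', 1)
        return ParseResult(scheme, netloc, path, params, query, fragment)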
…for all Python versions
Try this.
m = NETLOC_RE.match(path)
netloc = m.group(1)
path = m.group(2)
if '#' in path and not fragment:
Almost there. To match urlparse's behavior, this should be:
    if allow_fragments and '#' in path and not fragment:
You might as well just grab urlparse()'s function signature rather than doing the wraps() thing:
    def urlparse(url, scheme='', allow_fragments=True):
I'd still rather use the kwargs and @wraps because it's supposed to be a transparent pass-through. Also, without looking at the docs and/or source (which of course I could do), I don't know that the function signature hasn't changed. Trying to make as few assumptions as possible.

When you mentioned allow_fragments earlier I thought you were referring to an internal variable in the urlparse source, which is not the case. Will fix.
(@wraps also preserves the docstring, I believe.)
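A quick illustration of that point, not from the thread itself: functools.wraps copies the docstring onto the wrapper along with the name.

    import functools

    def original(url):
        """Parse a URL."""
        return url

    @functools.wraps(original)
    def wrapper(url):
        return original(url)

    print wrapper.__name__  # 'original'
    print wrapper.__doc__   # 'Parse a URL.'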
Fudge, allow_fragments needs to be a positional argument too. So I'll just write out the whole signature after double-checking the docs for old versions. I feel like this part of Python could be better.
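To illustrate the pitfall here (an editor's example, not part of the original thread): callers may pass allow_fragments positionally, so a wrapper that only inspects **kwargs would miss it.

    from urlparse import urlparse  # Python 2 stdlib

    # both calls are legal, so a wrapper can't rely on finding
    # allow_fragments in **kwargs alone
    urlparse('http://example.com/path#frag', '', False)
    urlparse('http://example.com/path#frag', allow_fragments=False)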
I just meant that you're going to have to search for allow_fragments in both args and kwargs, and you could save yourself some trouble. Yeah, there's no reason not to use wraps(), and you might want to do something like:
    def urlparse(url, scheme='', allow_fragments=True, *args, **kwargs):
just to be safe.
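Putting the thread together, the final wrapper plausibly looks something like the sketch below. This is not the PR's verbatim code: the NETLOC_RE pattern is a guess consistent with the diff fragment above (only the match() call and group numbers appear there), and urlparse_buggy is again a hypothetical alias.

    import re
    from functools import wraps
    from urlparse import urlparse as urlparse_buggy, ParseResult

    # hypothetical pattern; the actual regex isn't shown in the diff above
    NETLOC_RE = re.compile(r'//(.*?)((/.*?)?)$')

    @wraps(urlparse_buggy)
    def urlparse(url, scheme='', allow_fragments=True, *args, **kwargs):
        (scheme, netloc, path, params, query, fragment) = \
            urlparse_buggy(url, scheme, allow_fragments, *args, **kwargs)
        # work around Pythons that leave the netloc stuck to the path
        if netloc == '' and path.startswith('//'):
            m = NETLOC_RE.match(path)
            netloc = m.group(1)
            path = m.group(2)
        # work around Pythons that leave the fragment stuck to the path
        if allow_fragments and '#' in path and not fragment:
            path, fragment = path.split('#', 1)
        return ParseResult(scheme, netloc, path, params, query, fragment)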
Updated
Normalize urlparse fragment behavior across all supported Python versions
secondary sort and self-terminating job flows

* jobs:
  * SORT_VALUES: Secondary sort by value (Yelp#240)
    * see mrjob/examples/
  * can now override jobconf() again (Yelp#656)
  * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
  * examples:
    * bash_wrap/ (mapper/reducer_cmd() example)
    * mr_most_used_word.py (two step job)
    * mr_next_word_stats.py (SORT_VALUES example)
* runners:
  * All runners:
    * single --setup option works but is not yet documented (Yelp#206)
    * setup now uses sh rather than python internally
  * EMR runner:
    * max_hours_idle: self-terminating idle job flows (Yelp#628)
    * mins_to_end_of_hour option gives finer control over self-termination
    * Can reuse pooled job flows where previous job failed (Yelp#633)
    * Throws IOError if output path already exists (Yelp#634)
    * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
    * Automatically infers EMR/S3 endpoints from region (Yelp#658)
    * ls() supports s3n:// schema (Yelp#672)
    * Fixed log parsing crash on JarSteps (Yelp#645)
    * visible_to_all_users works with boto <2.8.0 (Yelp#701)
    * must use --interpreter with non-Python scripts (Yelp#683)
    * cat() can decompress gzipped data (Yelp#601)
  * Hadoop runner:
    * check_input_paths: can disable input path checking (Yelp#583)
    * cat() can decompress gzipped data (Yelp#601)
  * Inline/Local runners:
    * Fixed counter parsing for multi-step jobs in inline mode
    * Supports per-step jobconf (Yelp#616)
* Documentation revamp
* mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
* deprecated:
  * many constants in mrjob.emr replaced with functions in mrjob.aws
* removed deprecated features:
  * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
  * built-in protocols must be instances (Yelp#488)
See code comment. This includes the test fix from #684 and an implementation fix for everything else.