Normalize urlparse fragment behavior across all supported Python versions by irskep · Pull Request #686 · Yelp/mrjob · GitHub

Normalize urlparse fragment behavior across all supported Python versions #686


Merged: 4 commits into Yelp:master on Aug 5, 2013

Conversation

@irskep (Contributor) commented Aug 3, 2013

See code comment. This includes the test fix from #684 and an implementation fix for everything else.

@irskep mentioned this pull request Aug 3, 2013
@timtadh (Contributor) commented Aug 3, 2013

did you want to include the updated test?

@irskep (Author) commented Aug 3, 2013

Fixed

@irskep (Author) commented Aug 3, 2013

If this seems too scary, I could normalize the other way and have our urlparse() wrapper put the fragment back on the path component for all versions of Python.

@irskep (Author) commented Aug 3, 2013

I'm so good at this

@irskep (Author) commented Aug 3, 2013

mrjob.parse.urlparse() is only used in mrjob.fs.s3 and mrjob.fs.hadoop, neither of which appears to rely on the old behavior, so this change should be safe, affecting only the tests.

@@ -27,6 +29,18 @@
from mrjob.compat import uses_020_counters


### danger: monkey patch urlparse on python < 2.7.4 ###
Review comment (Collaborator):

Yeah, sorry, no monkey-patching allowed. It's just not part of the deal when you import a library in Python.

Since the whole point of this code is to activate one line of code that says:

url, fragment = url.split('#', 1)

why not just add that to our urlparse() wrapper as a post-processing step? Something to the effect of:

if allow_fragments and not fragment and '#' in path:
    path, fragment = path.split('#', 1)
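That post-processing step can be written out as a complete wrapper. Below is a sketch, not mrjob's actual implementation: the name `urlparse_normalized` and the `ParseResult` rebuild are my assumptions, and it is written against Python 3's `urllib.parse` (on the Python 2 versions discussed here, the same names lived in the `urlparse` module).

```python
from urllib.parse import urlparse as _urlparse, ParseResult

def urlparse_normalized(url, scheme='', allow_fragments=True):
    """urlparse() that always splits '#fragment' off the path,
    matching the behavior of Python >= 2.7.4 for any URL scheme."""
    scheme_, netloc, path, params, query, fragment = _urlparse(
        url, scheme, allow_fragments)
    # older interpreters left the fragment glued to the path for
    # schemes they didn't recognize (such as s3://); split it off here
    if allow_fragments and '#' in path and not fragment:
        path, fragment = path.split('#', 1)
    return ParseResult(scheme_, netloc, path, params, query, fragment)
```

On a modern interpreter the extra branch is a no-op, which is the point: callers see the new behavior everywhere.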

@coyotemarin (Collaborator) commented:

Yay, thanks for addressing this! And good research into the Python source code.

I agree that it's better to emulate the behavior of urlparse() in the newer versions of Python rather than the older ones.

It's probably fine for our code that parses S3 and HDFS URIs to throw away the fragment. Fine with punting on that.

@irskep (Author) commented Aug 5, 2013

Try this.

m = NETLOC_RE.match(path)
netloc = m.group(1)
path = m.group(2)
if '#' in path and not fragment:
Review comment (Collaborator):

Almost there. To match urlparse's behavior, this should be if allow_fragments and '#' in path and not fragment.

You might as well just grab urlparse()'s function signature rather than doing the wraps() thing:

def urlparse(url, scheme='', allow_fragments=True):

@irskep (Author) replied:

I'd still rather use the kwargs and @wraps because it's supposed to be a transparent pass-through. Also, without looking at the docs and/or source (which of course I could do), I don't know that the function signature hasn't changed. Trying to make as few assumptions as possible.

When you mentioned allow_fragments earlier I thought you were referring to an internal variable in the urlparse source, which is not the case. Will fix.

@irskep (Author) replied:

(@wraps also preserves the docstring, I believe.)

@irskep (Author) replied:

Fudge, it needs to be a positional argument too. So I'll just write out the whole signature after double-checking the docs for old versions. I feel like this part of Python could be better.

@coyotemarin (Collaborator) replied:

I just meant that you're going to have to search for allow_fragments in both args and kwargs, and you could save yourself some trouble. Yeah, there's no reason not to use wraps(), and you might want to do something like:

def urlparse(url, scheme='', allow_fragments=True, *args, **kwargs):

just to be safe.
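Putting the suggestions together, a defensive version of the wrapper might look like the sketch below. This is my reconstruction, not the merged code: it spells out the known leading arguments so `allow_fragments` is in hand whether passed positionally or by keyword, and forwards anything extra untouched.

```python
from functools import wraps
from urllib.parse import urlparse as _urlparse, ParseResult

@wraps(_urlparse)  # preserves __name__ and the docstring
def urlparse(url, scheme='', allow_fragments=True, *args, **kwargs):
    # write out the signature so allow_fragments is always available,
    # and pass any extra arguments through in case the stdlib grows some
    scheme_, netloc, path, params, query, fragment = _urlparse(
        url, scheme, allow_fragments, *args, **kwargs)
    if allow_fragments and '#' in path and not fragment:
        path, fragment = path.split('#', 1)
    return ParseResult(scheme_, netloc, path, params, query, fragment)
```

Note that when `allow_fragments=False`, the `#` stays in the path, exactly as the stdlib behaves.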

@irskep (Author) commented Aug 5, 2013

Updated

coyotemarin pushed a commit that referenced this pull request Aug 5, 2013
Normalize urlparse fragment behavior across all supported Python versions
@coyotemarin coyotemarin merged commit b3493b6 into Yelp:master Aug 5, 2013
scottknight added a commit to timtadh/mrjob that referenced this pull request Oct 10, 2013
secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (Yelp#240)
     * see mrjob/examples/
   * can now override jobconf() again (Yelp#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * mr_most_used_word.py (two step job)
     * mr_next_word_stats.py (SORT_VALUES example)
 * runners:
   * All runners:
     * single --setup option works but is not yet documented (Yelp#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (Yelp#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (Yelp#633)
     * Throws IOError if output path already exists (Yelp#634)
     * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
     * Automatically infers EMR/S3 endpoints from region (Yelp#658)
     * ls() supports s3n:// schema (Yelp#672)
     * Fixed log parsing crash on JarSteps (Yelp#645)
     * visible_to_all_users works with boto <2.8.0 (Yelp#701)
     * must use --interpreter with non-Python scripts (Yelp#683)
     * cat() can decompress gzipped data (Yelp#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (Yelp#583)
     * cat() can decompress gzipped data (Yelp#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (Yelp#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in mrjob.aws
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
   * built-in protocols must be instances (Yelp#488)