Normalize urlparse fragment behavior across all supported Python versions by irskep · Pull Request #686 · Yelp/mrjob · GitHub

Normalize urlparse fragment behavior across all supported Python versions #686


Merged: 4 commits into Yelp:master on Aug 5, 2013

Conversation

@irskep (Contributor) commented Aug 3, 2013

See code comment. This includes the test fix from #684 and an implementation fix for everything else.

@irskep mentioned this pull request Aug 3, 2013
@timtadh (Contributor) commented Aug 3, 2013

did you want to include the updated test?

@irskep (Author) commented Aug 3, 2013

Fixed

@irskep (Author) commented Aug 3, 2013

If this seems too scary, I could normalize the other way and have our urlparse() wrapper put the fragment back on the path component for all versions of Python.

@irskep (Author) commented Aug 3, 2013

I'm so good at this

@irskep (Author) commented Aug 3, 2013

mrjob.parse.urlparse() is only used in mrjob.fs.s3 and mrjob.fs.hadoop, neither of which appears to rely on the old behavior, so this change should be safe, affecting only the tests.

@@ -27,6 +29,18 @@
from mrjob.compat import uses_020_counters


### danger: monkey patch urlparse on python < 2.7.4 ###
Review comment (Collaborator):

Yeah, sorry, no monkey-patching allowed. It's just not part of the deal when you import a library in Python.

Since the whole point of this code is to activate one line of code that says:

url, fragment = url.split('#', 1)

why not just add that to our urlparse() wrapper as a post-processing step? Something to the effect of:

if allow_fragments and not fragment and '#' in path:
    path, fragment = path.split('#', 1)
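That post-processing step can be written out as a complete wrapper. Below is a sketch, not mrjob's actual implementation: the name `urlparse_normalized` and the `ParseResult` rebuild are my assumptions, and it is written against Python 3's `urllib.parse` (on the Python 2 versions discussed here, the same names lived in the `urlparse` module).

```python
from urllib.parse import urlparse as _urlparse, ParseResult

def urlparse_normalized(url, scheme='', allow_fragments=True):
    """urlparse() that always splits '#fragment' off the path,
    matching the behavior of Python >= 2.7.4 for any URL scheme."""
    scheme_, netloc, path, params, query, fragment = _urlparse(
        url, scheme, allow_fragments)
    # older interpreters left the fragment glued to the path for
    # schemes they didn't recognize (such as s3://); split it off here
    if allow_fragments and '#' in path and not fragment:
        path, fragment = path.split('#', 1)
    return ParseResult(scheme_, netloc, path, params, query, fragment)
```

On a modern interpreter the extra branch is a no-op, which is the point: callers see the new behavior everywhere.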

@coyotemarin (Collaborator) commented:

Yay, thanks for addressing this! And good research into the Python source code.

I agree that it's better to emulate the behavior of urlparse() in the newer versions of Python rather than the older ones.

It's probably fine for our code that parses S3 and HDFS URIs to throw away the fragment. Fine with punting on that.

@irskep (Author) commented Aug 5, 2013

Try this.

m = NETLOC_RE.match(path)
netloc = m.group(1)
path = m.group(2)
if '#' in path and not fragment:
Review comment (Collaborator):

Almost there. To match urlparse's behavior, this should be if allow_fragments and '#' in path and not fragment.

You might as well just grab urlparse()'s function signature rather than doing the wraps() thing:

def urlparse(url, scheme='', allow_fragments=True):

@irskep (Author) replied:

I'd still rather use the kwargs and @wraps because it's supposed to be a transparent pass-through. Also, without looking at the docs and/or source (which of course I could do), I don't know that the function signature hasn't changed. Trying to make as few assumptions as possible.

When you mentioned allow_fragments earlier I thought you were referring to an internal variable in the urlparse source, which is not the case. Will fix.

@irskep (Author) replied:

(@wraps also preserves the docstring, I believe.)

@irskep (Author) replied:

Fudge, it needs to be a positional argument too. So I'll just write out the whole signature after double-checking the docs for old versions. I feel like this part of Python could be better.

@coyotemarin (Collaborator) replied:

I just meant that you're going to have to search for allow_fragments in both args and kwargs, and you could save yourself some trouble. Yeah, there's no reason not to use wraps(), and you might want to do something like:

def urlparse(url, scheme='', allow_fragments=True, *args, **kwargs):

just to be safe.
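Putting the suggestions together, a defensive version of the wrapper might look like the sketch below. This is my reconstruction, not the merged code: it spells out the known leading arguments so `allow_fragments` is in hand whether passed positionally or by keyword, and forwards anything extra untouched.

```python
from functools import wraps
from urllib.parse import urlparse as _urlparse, ParseResult

@wraps(_urlparse)  # preserves __name__ and the docstring
def urlparse(url, scheme='', allow_fragments=True, *args, **kwargs):
    # write out the signature so allow_fragments is always available,
    # and pass any extra arguments through in case the stdlib grows some
    scheme_, netloc, path, params, query, fragment = _urlparse(
        url, scheme, allow_fragments, *args, **kwargs)
    if allow_fragments and '#' in path and not fragment:
        path, fragment = path.split('#', 1)
    return ParseResult(scheme_, netloc, path, params, query, fragment)
```

Note that when `allow_fragments=False`, the `#` stays in the path, exactly as the stdlib behaves.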

@irskep (Author) commented Aug 5, 2013

Updated

coyotemarin pushed a commit that referenced this pull request Aug 5, 2013
Normalize urlparse fragment behavior across all supported Python versions
@coyotemarin coyotemarin merged commit b3493b6 into Yelp:master Aug 5, 2013
scottknight added a commit to timtadh/mrjob that referenced this pull request Oct 10, 2013
secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (Yelp#240)
     * see mrjob/examples/
   * can now override jobconf() again (Yelp#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * mr_most_used_word.py (two step job)
     * mr_next_word_stats.py (SORT_VALUES example)
 * runners:
   * All runners:
     * single --setup option works but is not yet documented (Yelp#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (Yelp#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (Yelp#633)
     * Throws IOError if output path already exists (Yelp#634)
     * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
     * Automatically infers EMR/S3 endpoints from region (Yelp#658)
     * ls() supports s3n:// schema (Yelp#672)
     * Fixed log parsing crash on JarSteps (Yelp#645)
     * visible_to_all_users works with boto <2.8.0 (Yelp#701)
     * must use --interpreter with non-Python scripts (Yelp#683)
     * cat() can decompress gzipped data (Yelp#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (Yelp#583)
     * cat() can decompress gzipped data (Yelp#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (Yelp#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in mrjob.aws
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
   * built-in protocols must be instances (Yelp#488)