Added an option to disable input path checking in hadoop by tarnfeld · Pull Request #583 · Yelp/mrjob · GitHub

Added an option to disable input path checking in hadoop #583


Merged 11 commits into Yelp:master on Jul 20, 2013

Conversation

tarnfeld
Contributor

I've added a new option that disables input path checking for the hadoop runner when set to False. This pull request also includes a change that uses -ls rather than -stat to check paths, since -stat throws an error if the path contains a wildcard (*).

The option is named check_hadoop_input_paths, unsurprisingly.

We've needed both of these changes ourselves (partly for performance reasons), and figured they might be worth pushing upstream if you feel they're worth including.
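
Roughly, the -ls based check looks like the sketch below (a simplified illustration rather than the exact patch; hdfs_path_exists and the hadoop_bin default are stand-in names):

import subprocess

def hdfs_path_exists(path, hadoop_bin='hadoop'):
    # Use `hadoop fs -ls` instead of `-stat`: -ls resolves glob patterns
    # like /logs/2012-*/part-* instead of raising an error.
    proc = subprocess.Popen(
        [hadoop_bin, 'fs', '-ls', path],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    proc.communicate()
    # A non-zero exit status (255 in practice) means nothing matched.
    return proc.returncode == 0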

@irskep
Contributor
irskep commented Dec 15, 2012

check_hadoop_input_paths would be a better name, and a --check-hadoop-input-paths switch would be welcome. Also documentation!

I suspect the maintainers will take care of these things if you don't, but it would be nice.

And add yourself to CONTRIBUTORS.txt.

@tarnfeld
Contributor Author

Good point, that's a much better name. I'll also update the docs to explain what's going on here.

Hmm, is there a CONTRIBUTORS.txt? 😕

@tarnfeld
Contributor Author

Never mind, found it in __init__.py

@tarnfeld
Contributor Author

Just a quick TL;DR of the latest commits... (sigh @ the GitHub merge commits)

  • Changed the internal option name to check_hadoop_input_paths
  • Added a command line option to skip the check
  • Added a test for the option
  • Updated the documentation with the option
  • Fixed a typo in my name :-)
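
For reference, the conf-file form would look something like this in mrjob.conf (a sketch using this branch's option name):

runners:
  hadoop:
    check_hadoop_input_paths: false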

For those interested, this is a realistic example of the behaviour...

Without the option and an invalid path (python count.py test.txsdft -r hadoop)

creating tmp directory /var/folders/xg/2rxgs_l13wd318641f1_nyth0000gn/T/count.tom.20121217.113523.312108
Traceback (most recent call last):
  File "count.py", line 36, in <module>
    MRWordCount.run()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 207, in run_job
    runner.run()
  File "/Users/tom/Projects/mrjob/mrjob/runner.py", line 448, in run
    self._run()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 230, in _run
    self._check_input_exists()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 246, in _check_input_exists
    'Input path %s does not exist!' % (path,))
AssertionError: Input path /Users/tom/Desktop/test.txtdsf does not exist!

With the option and an invalid path (python count.py test.txsdft -r hadoop --skip-hadoop-input-check)

creating tmp directory /var/folders/xg/2rxgs_l13wd318641f1_nyth0000gn/T/count.tom.20121217.114705.181459
Copying local files into hdfs:///user/tom/tmp/mrjob/count.tom.20121217.114705.181459/files/
STDERR: put: File /Users/tom/Desktop/test.txsdft does not exist.
Traceback (most recent call last):
  File "count.py", line 35, in <module>
    MRWordCount.run()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 207, in run_job
    runner.run()
  File "/Users/tom/Projects/mrjob/mrjob/runner.py", line 448, in run
    self._run()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 233, in _run
    self._upload_local_files_to_hdfs()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 264, in _upload_local_files_to_hdfs
    self._upload_to_hdfs(path, uri)
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 272, in _upload_to_hdfs
    self.invoke_hadoop(['fs', '-put', path, target])
  File "/Users/tom/Projects/mrjob/mrjob/fs/hadoop.py", line 104, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/local/hadoop/bin/hadoop', 'fs', '-put', '/Users/tom/Desktop/test.txsdft', 'hdfs:///user/tom/tmp/mrjob/count.tom.20121217.114705.181459/files/test.txsdft']' returned non-zero exit status 255

@tarnfeld
Contributor Author
tarnfeld commented Jan 2, 2013

Just a small note in case anyone misses it: this branch also has #586 merged in.

@irskep
Contributor
irskep commented Jun 3, 2013

Ping @sudarshang

@tarnfeld
Contributor Author
tarnfeld commented Jun 3, 2013

I'll fix the unit tests on this.

@irskep
Contributor
irskep commented Jul 19, 2013

Ping @tarnfeld for status

@tarnfeld
Contributor Author

@irskep Oops, I kind of dropped this one. I'll push tests in the next day or two.

@tarnfeld
Contributor Author

@irskep I've pushed a master merge and unit test fixes for this branch. I had to shuffle some of the mocking around to add an ls mock hadoop method. I also noticed the return code for a missing path was -1; in practice it's actually 255 (from looking at Hadoop), so I've updated both mock functions to use that (so it's catchable in the path_exists method).

Update... I take that back: from looking at the Hadoop source it should be -1, though from testing locally I don't see that. :-/

Update again... From digging a little deeper, I see that -1 == 255 because exit statuses are reported as unsigned 8-bit values (https://issues.apache.org/jira/browse/HADOOP-6143), so I'm going to switch them back to -1 and change ok_returncodes=[0, -1, 255].
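
Here's a quick way to see that truncation (a minimal sketch; on POSIX systems the OS keeps only the low 8 bits of an exit status):

import subprocess
import sys

# Spawn a child process that exits with -1...
proc = subprocess.Popen([sys.executable, '-c', 'import sys; sys.exit(-1)'])
proc.wait()
# ...and observe that the status is reported as 255, because -1 is
# truncated to an unsigned 8-bit value.
print(proc.returncode)  # 255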

opt_group.add_option(
    '--skip-hadoop-input-check', dest='check_hadoop_input_paths',
    default=True, action='store_false',
    help='Skip the checks to ensure all input paths exist'),
Contributor


The naming convention we use would result in --check-hadoop-input-paths and --no-check-hadoop-input-paths. Both switches should exist in case you want to override a mrjob.conf setting in either direction.
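
Something like the sketch below, following the usual optparse store_true/store_false pairing (illustrative, not the exact code):

opt_group.add_option(
    '--check-hadoop-input-paths', dest='check_hadoop_input_paths',
    default=True, action='store_true',
    help='Check that input paths exist before running the job (default)'),
opt_group.add_option(
    '--no-check-hadoop-input-paths', dest='check_hadoop_input_paths',
    action='store_false',
    help='Skip the check that input paths exist'),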

@irskep
Contributor
irskep commented Jul 20, 2013

Could you fix my naming nitpick, and add the switches to configs_reference.rst?


@tarnfeld
Contributor Author

For sure, any thoughts on the exit code stuff?

@irskep
Contributor
irskep commented Jul 20, 2013

I can't contribute meaningfully to that discussion because my limited working knowledge of Hadoop doesn't cover that area. As long as it works.

@coyotemarin
Collaborator

I've dealt with Hadoop exit codes some, and that sounds correct to me. Merging!

coyotemarin pushed a commit that referenced this pull request Jul 20, 2013
Added an option to disable input path checking in hadoop
@coyotemarin merged commit e6269b2 into Yelp:master Jul 20, 2013
@coyotemarin
Collaborator

Just FYI, this will be called simply check_input_paths in v0.4.1, and the switch will be --no-check-input-paths.

scottknight added a commit to timtadh/mrjob that referenced this pull request Oct 10, 2013
secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (Yelp#240)
     * see mrjob/examples/
   * can now override jobconf() again (Yelp#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * mr_most_used_word.py (two step job)
     * mr_next_word_stats.py (SORT_VALUES example)
 * runners:
   * All runners:
     * single --setup option works but is not yet documented (Yelp#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (Yelp#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (Yelp#633)
     * Throws IOError if output path already exists (Yelp#634)
     * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
     * Automatically infers EMR/S3 endpoints from region (Yelp#658)
     * ls() supports s3n:// schema (Yelp#672)
     * Fixed log parsing crash on JarSteps (Yelp#645)
     * visible_to_all_users works with boto <2.8.0 (Yelp#701)
     * must use --interpreter with non-Python scripts (Yelp#683)
     * cat() can decompress gzipped data (Yelp#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (Yelp#583)
     * cat() can decompress gzipped data (Yelp#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (Yelp#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in mrjob.aws
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
   * built-in protocols must be instances (Yelp#488)