Added an option to disable input path checking in hadoop #583
Conversation
I suspect the maintainers will take care of these things if you don't, but it would be nice. And add yourself to …
Good point, that's a much better name. I'll also update the docs to explain what's going on here. Hmm, is there a …
Never mind, found it in …
Just a quick tl;dr on the latest commits... (sigh @ the GitHub merge commits)
For those interested, this is a realistic example of the behaviour.

Without the option and an invalid path:

```
creating tmp directory /var/folders/xg/2rxgs_l13wd318641f1_nyth0000gn/T/count.tom.20121217.113523.312108
Traceback (most recent call last):
  File "count.py", line 36, in <module>
    MRWordCount.run()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 207, in run_job
    runner.run()
  File "/Users/tom/Projects/mrjob/mrjob/runner.py", line 448, in run
    self._run()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 230, in _run
    self._check_input_exists()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 246, in _check_input_exists
    'Input path %s does not exist!' % (path,))
AssertionError: Input path /Users/tom/Desktop/test.txtdsf does not exist!
```

With the option and an invalid path:

```
creating tmp directory /var/folders/xg/2rxgs_l13wd318641f1_nyth0000gn/T/count.tom.20121217.114705.181459
Copying local files into hdfs:///user/tom/tmp/mrjob/count.tom.20121217.114705.181459/files/
STDERR: put: File /Users/tom/Desktop/test.txsdft does not exist.
Traceback (most recent call last):
  File "count.py", line 35, in <module>
    MRWordCount.run()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 207, in run_job
    runner.run()
  File "/Users/tom/Projects/mrjob/mrjob/runner.py", line 448, in run
    self._run()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 233, in _run
    self._upload_local_files_to_hdfs()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 264, in _upload_local_files_to_hdfs
    self._upload_to_hdfs(path, uri)
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 272, in _upload_to_hdfs
    self.invoke_hadoop(['fs', '-put', path, target])
  File "/Users/tom/Projects/mrjob/mrjob/fs/hadoop.py", line 104, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/local/hadoop/bin/hadoop', 'fs', '-put', '/Users/tom/Desktop/test.txsdft', 'hdfs:///user/tom/tmp/mrjob/count.tom.20121217.114705.181459/files/test.txsdft']' returned non-zero exit status 255
```
Just a small note in case anyone misses it, this branch also has #586 merged in.
Conflicts: tests/test_hadoop.py
Ping @sudarshang

I'll fix the unit tests on this.

Ping @tarnfeld for status

@irskep Oops, kind of dropped this one, will push tests in the next day or two.
@irskep I've pushed a master merge and unit test fixes for this branch. I had to shuffle around some of the mocking to add an …

Update: I take that back, from looking at the hadoop source it should be …

Update again: From digging a little deeper I see …
```python
opt_group.add_option(
    '--skip-hadoop-input-check', dest='check_hadoop_input_paths',
    default=True, action='store_false',
    help='Skip the checks to ensure all input paths exist'),
```
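A minimal sketch of how a runner could honor this flag, raising the same `AssertionError` shown in the traceback above. This is a hypothetical standalone helper, not mrjob's actual code (the real check lives in `mrjob/hadoop.py`'s `_check_input_exists()`, per the traceback):

```python
import os

def check_input_paths(paths, check=True):
    """Verify that local input paths exist, unless checking is disabled.

    Sketch only: `check` would come from the check_hadoop_input_paths
    option, which --skip-hadoop-input-check sets to False.
    """
    if not check:
        return  # user opted out of input path checking
    for path in paths:
        if not os.path.exists(path):
            raise AssertionError('Input path %s does not exist!' % (path,))
```

With `check=False` the missing path slips through here and only fails later, at `hadoop fs -put` time, as the second traceback above demonstrates.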
The naming convention we use would result in `--check-hadoop-input-paths` and `--no-check-hadoop-input-paths`. Both options, in case you want to override `mrjob.conf`.
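Roughly what the reviewer's convention amounts to: a positive/negative switch pair writing to the same `dest`, so either direction can override a value from `mrjob.conf`. An `optparse` sketch (hypothetical standalone parser, not mrjob's actual option group):

```python
from optparse import OptionParser

parser = OptionParser()
# positive switch matches the default, but lets you override a conf file
parser.add_option(
    '--check-hadoop-input-paths', dest='check_hadoop_input_paths',
    default=True, action='store_true',
    help='Check that all input paths exist before running (the default)')
# negative switch flips the same dest off
parser.add_option(
    '--no-check-hadoop-input-paths', dest='check_hadoop_input_paths',
    action='store_false',
    help='Skip the checks to ensure all input paths exist')
```

Since both options share `dest='check_hadoop_input_paths'`, whichever appears last on the command line wins.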
Could you fix my naming nitpick, and add the switches to …
For sure, any thoughts on the exit code stuff?

I can't contribute meaningfully to that discussion because my limited working knowledge of Hadoop doesn't cover that area. As long as it works.

I've dealt with Hadoop exit codes some, and that sounds correct to me. Merging!
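The "exit code stuff" boils down to trusting the return code of `hadoop fs -ls`: per the PR description, `-ls` is used rather than `-stat` because `-stat` errors out on wildcard paths. A hedged sketch of that check; the `hadoop_bin` parameter is a hypothetical injection point (added here so the function can be exercised with a stand-in command), not mrjob's actual API:

```python
import subprocess

def hdfs_path_exists(path, hadoop_bin=('hadoop', 'fs')):
    """Return True if `<hadoop_bin> -ls <path>` exits with status 0.

    Sketch only. Unlike -stat, -ls tolerates glob patterns such as
    data/*.txt, which is why this PR switched to it.
    """
    proc = subprocess.run(
        list(hadoop_bin) + ['-ls', path],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.returncode == 0
```

A nonzero exit status simply means "nothing matched", which is exactly the condition the input check cares about.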
Just FYI, this will be called simply …
secondary sort and self-terminating job flows

* jobs:
  * SORT_VALUES: Secondary sort by value (Yelp#240)
    * see mrjob/examples/
  * can now override jobconf() again (Yelp#656)
  * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
* examples:
  * bash_wrap/ (mapper/reducer_cmd() example)
  * mr_most_used_word.py (two step job)
  * mr_next_word_stats.py (SORT_VALUES example)
* runners:
  * All runners:
    * single --setup option works but is not yet documented (Yelp#206)
    * setup now uses sh rather than python internally
  * EMR runner:
    * max_hours_idle: self-terminating idle job flows (Yelp#628)
    * mins_to_end_of_hour option gives finer control over self-termination
    * Can reuse pooled job flows where previous job failed (Yelp#633)
    * Throws IOError if output path already exists (Yelp#634)
    * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
    * Automatically infers EMR/S3 endpoints from region (Yelp#658)
    * ls() supports s3n:// schema (Yelp#672)
    * Fixed log parsing crash on JarSteps (Yelp#645)
    * visible_to_all_users works with boto <2.8.0 (Yelp#701)
    * must use --interpreter with non-Python scripts (Yelp#683)
    * cat() can decompress gzipped data (Yelp#601)
  * Hadoop runner:
    * check_input_paths: can disable input path checking (Yelp#583)
    * cat() can decompress gzipped data (Yelp#601)
  * Inline/Local runners:
    * Fixed counter parsing for multi-step jobs in inline mode
    * Supports per-step jobconf (Yelp#616)
* Documentation revamp
* mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
* deprecated:
  * many constants in mrjob.emr replaced with functions in mrjob.aws
* removed deprecated features:
  * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
  * built-in protocols must be instances (Yelp#488)
I've added a new option that disables input path checking for the hadoop runner when set to `False`. This pull request also includes a change that uses `-ls` rather than `-stat` to check paths, as `-stat` throws an error if you have a path with a wildcard (`*`) in it. The option is named `check_hadoop_input_paths`, unsurprisingly.

Both of these are things we've needed (partly for performance reasons), and I figured it might be worth pushing upstream if it's something you feel is worth including.
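Per the changelog above, the option ultimately shipped as `check_input_paths` on the Hadoop runner. A minimal `mrjob.conf` sketch of turning it off, assuming mrjob's standard `runners:` conf layout:

```yaml
# mrjob.conf (sketch): skip input path checking for the hadoop runner
runners:
  hadoop:
    check_input_paths: false
```

The command-line switches discussed in the review exist precisely so a per-run flag can override whatever this conf file sets.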