Added an option to disable input path checking in hadoop by tarnfeld · Pull Request #583 · Yelp/mrjob · GitHub

Added an option to disable input path checking in hadoop #583


Merged 11 commits into Yelp:master on Jul 20, 2013

Conversation

tarnfeld
Contributor

I've added a new option that disables input path checking for the hadoop runner when set to False. This pull request also includes a change that uses -ls rather than -stat to check paths, since -stat throws an error if the path contains a wildcard (*).

The option is named check_hadoop_input_paths, unsurprisingly.

We've needed both of these changes ourselves (partly for performance reasons), and figured they might be worth pushing upstream if you feel they're worth including.
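
Roughly, the -ls based check looks like the sketch below (a simplified illustration rather than the exact patch; hdfs_path_exists and the hadoop_bin default are stand-in names):

import subprocess

def hdfs_path_exists(path, hadoop_bin='hadoop'):
    # Use `hadoop fs -ls` instead of `-stat`: -ls resolves glob patterns
    # like /logs/2012-*/part-* instead of raising an error.
    proc = subprocess.Popen(
        [hadoop_bin, 'fs', '-ls', path],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    proc.communicate()
    # A non-zero exit status (255 in practice) means nothing matched.
    return proc.returncode == 0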

@irskep
Contributor
irskep commented Dec 15, 2012

check_hadoop_input_paths would be a better name, and a --check-hadoop-input-paths switch would be welcome. Also documentation!

I suspect the maintainers will take care of these things if you don't, but it would be nice.

And add yourself to CONTRIBUTORS.txt.

@tarnfeld
Contributor Author

Good point, that's a much better name. I'll also update the docs to explain what's going on here.

Hmm, is there a CONTRIBUTORS.txt? 😕

@tarnfeld
Contributor Author

Never mind, found it in __init__.py

@tarnfeld
Contributor Author

Just a quick TL;DR of the latest commits... (sigh @ the GitHub merge commits)

  • Changed the internal option name to check_hadoop_input_paths
  • Added a command line option to skip the check
  • Added a test for the option
  • Updated the documentation with the option
  • Fixed a typo in my name :-)
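
For reference, the conf-file form would look something like this in mrjob.conf (a sketch using this branch's option name):

runners:
  hadoop:
    check_hadoop_input_paths: false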

For those interested, this is a realistic example of the behaviour...

Without the option and an invalid path (python count.py test.txsdft -r hadoop)

creating tmp directory /var/folders/xg/2rxgs_l13wd318641f1_nyth0000gn/T/count.tom.20121217.113523.312108
Traceback (most recent call last):
  File "count.py", line 36, in <module>
    MRWordCount.run()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 207, in run_job
    runner.run()
  File "/Users/tom/Projects/mrjob/mrjob/runner.py", line 448, in run
    self._run()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 230, in _run
    self._check_input_exists()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 246, in _check_input_exists
    'Input path %s does not exist!' % (path,))
AssertionError: Input path /Users/tom/Desktop/test.txtdsf does not exist!

With the option and an invalid path (python count.py test.txsdft -r hadoop --skip-hadoop-input-check)

creating tmp directory /var/folders/xg/2rxgs_l13wd318641f1_nyth0000gn/T/count.tom.20121217.114705.181459
Copying local files into hdfs:///user/tom/tmp/mrjob/count.tom.20121217.114705.181459/files/
STDERR: put: File /Users/tom/Desktop/test.txsdft does not exist.
Traceback (most recent call last):
  File "count.py", line 35, in <module>
    MRWordCount.run()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/Users/tom/Projects/mrjob/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/Users/tom/Projects/mrjob/mrjob/launch.py", line 207, in run_job
    runner.run()
  File "/Users/tom/Projects/mrjob/mrjob/runner.py", line 448, in run
    self._run()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 233, in _run
    self._upload_local_files_to_hdfs()
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 264, in _upload_local_files_to_hdfs
    self._upload_to_hdfs(path, uri)
  File "/Users/tom/Projects/mrjob/mrjob/hadoop.py", line 272, in _upload_to_hdfs
    self.invoke_hadoop(['fs', '-put', path, target])
  File "/Users/tom/Projects/mrjob/mrjob/fs/hadoop.py", line 104, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/local/hadoop/bin/hadoop', 'fs', '-put', '/Users/tom/Desktop/test.txsdft', 'hdfs:///user/tom/tmp/mrjob/count.tom.20121217.114705.181459/files/test.txsdft']' returned non-zero exit status 255

@tarnfeld
Contributor Author
tarnfeld commented Jan 2, 2013

Just a small note in case anyone misses it: this branch also has #586 merged in.

@irskep
Contributor
irskep commented Jun 3, 2013

Ping @sudarshang

@tarnfeld
Contributor Author
tarnfeld commented Jun 3, 2013

I'll fix the unit tests on this.

@irskep
Contributor
irskep commented Jul 19, 2013

Ping @tarnfeld for status

@tarnfeld
Contributor Author

@irskep Oops, I kind of dropped this one. I'll push tests in the next day or two.

@tarnfeld
Contributor Author

@irskep I've pushed a master merge and unit test fixes for this branch. I had to shuffle some of the mocking around to add an ls mock hadoop method. I also noticed the return code for a missing path was -1; in practice it's actually 255 (from looking at Hadoop), so I've updated both mock functions to use that (so it's catchable in the path_exists method).

Update... I take that back: from looking at the Hadoop source it should be -1, though from testing locally I don't see that. :-/

Update again... From digging a little deeper, I see that -1 == 255 because exit statuses are reported as unsigned 8-bit values (https://issues.apache.org/jira/browse/HADOOP-6143), so I'm going to switch them back to -1 and change ok_returncodes=[0, -1, 255].
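
Here's a quick way to see that truncation (a minimal sketch; on POSIX systems the OS keeps only the low 8 bits of an exit status):

import subprocess
import sys

# Spawn a child process that exits with -1...
proc = subprocess.Popen([sys.executable, '-c', 'import sys; sys.exit(-1)'])
proc.wait()
# ...and observe that the status is reported as 255, because -1 is
# truncated to an unsigned 8-bit value.
print(proc.returncode)  # 255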

opt_group.add_option(
    '--skip-hadoop-input-check', dest='check_hadoop_input_paths',
    default=True, action='store_false',
    help='Skip the checks to ensure all input paths exist'),
Contributor


The naming convention we use would result in --check-hadoop-input-paths and --no-check-hadoop-input-paths. Both switches should exist in case you want to override a mrjob.conf setting in either direction.
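
Something like the sketch below, following the usual optparse store_true/store_false pairing (illustrative, not the exact code):

opt_group.add_option(
    '--check-hadoop-input-paths', dest='check_hadoop_input_paths',
    default=True, action='store_true',
    help='Check that input paths exist before running the job (default)'),
opt_group.add_option(
    '--no-check-hadoop-input-paths', dest='check_hadoop_input_paths',
    action='store_false',
    help='Skip the check that input paths exist'),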

@irskep
Contributor
irskep commented Jul 20, 2013

Could you fix my naming nitpick, and add the switches to configs_reference.rst?


@tarnfeld
Contributor Author

For sure, any thoughts on the exit code stuff?

@irskep
Contributor
irskep commented Jul 20, 2013

I can't contribute meaningfully to that discussion because my limited working knowledge of Hadoop doesn't cover that area. As long as it works.

@coyotemarin
Collaborator

I've dealt with Hadoop exit codes some, and that sounds correct to me. Merging!

coyotemarin pushed a commit that referenced this pull request Jul 20, 2013
Added an option to disable input path checking in hadoop
@coyotemarin merged commit e6269b2 into Yelp:master Jul 20, 2013
@coyotemarin
Collaborator

Just FYI, this will be called simply check_input_paths in v0.4.1, and the switch will be --no-check-input-paths.

scottknight added a commit to timtadh/mrjob that referenced this pull request Oct 10, 2013
secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (Yelp#240)
     * see mrjob/examples/
   * can now override jobconf() again (Yelp#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * mr_most_used_word.py (two step job)
     * mr_next_word_stats.py (SORT_VALUES example)
 * runners:
   * All runners:
     * single --setup option works but is not yet documented (Yelp#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (Yelp#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (Yelp#633)
     * Throws IOError if output path already exists (Yelp#634)
     * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
     * Automatically infers EMR/S3 endpoints from region (Yelp#658)
     * ls() supports s3n:// schema (Yelp#672)
     * Fixed log parsing crash on JarSteps (Yelp#645)
     * visible_to_all_users works with boto <2.8.0 (Yelp#701)
     * must use --interpreter with non-Python scripts (Yelp#683)
     * cat() can decompress gzipped data (Yelp#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (Yelp#583)
     * cat() can decompress gzipped data (Yelp#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (Yelp#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in mrjob.aws
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
   * built-in protocols must be instances (Yelp#488)