Tags: liurusi101/mrjob
that's one small step for a JAR
 * jobs:
   * can interpolate input and output path(s) into arguments of JarSteps, so they can be part of multi-step jobs (Yelp#773)
     * see mrjob/examples/mr_jar_step_example.py
   * JarStep now takes keyword arguments only (Yelp#769)
     * removed useless "name" field; "step_args" is now just "args"
   * MRJobStep (usually accessed via MRJob.mr()) is now MRStep
 * runners:
   * All runners:
     * --setup is now fully functional (Yelp#206)
       * --python-archive, --setup-cmd, and --setup-script are deprecated
     * --bootstrap option works and uses sh (Yelp#206)
       * --bootstrap-cmd, --bootstrap-file, --bootstrap-python-package, --bootstrap-script are deprecated
     * setup commands can no longer corrupt a task's input and output (Yelp#803)
     * sh_bin is now "sh -e" by default so setup fails fast (Yelp#810)
       * default is "/bin/sh -e" on EMR
   * EMR:
     * JarSteps work again (Yelp#763)
     * auto-uploads jars for JarSteps (Yelp#772)
       * JARs on the EMR instances can be accessed with file:/// URIs
     * ssh_cat() no longer raises an error when catting a file containing an error (Yelp#807)
     * Fixed SignatureDoesNotMatchError that happens with boto 2.10.0+ with Python prior to 2.7.5 (Yelp#778)
   * Hadoop:
     * now handles JarSteps too (Yelp#770)
 * Fix to mrjob.parse.urlparse() that was breaking Python 2.5
 * mrjob.util.buffer_iterator_to_line_iterator() is now more efficient and uses a bounded amount of memory
 * bz2 decompression no longer discards data (Yelp#817)

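To illustrate what "a bounded amount of memory" can mean when turning a stream of byte chunks into lines, here is a minimal sketch. It is not mrjob's implementation of buffer_iterator_to_line_iterator(); the name chunks_to_lines is hypothetical, and the sketch assumes lines end in "\n". Only the current chunk plus any unfinished line is held at once.

```python
def chunks_to_lines(chunks):
    """Yield lines from an iterator of byte chunks.

    Hypothetical helper for illustration only. Each yielded line keeps
    its trailing newline, except possibly the last one.
    """
    leftover = b""  # the unfinished line carried over from the previous chunk
    for chunk in chunks:
        data = leftover + chunk
        lines = data.split(b"\n")
        # The last piece has no newline yet, so hold it back.
        leftover = lines.pop()
        for line in lines:
            yield line + b"\n"
    if leftover:
        # Final line of the stream had no trailing newline.
        yield leftover
```

Because the function is a generator, a caller can consume a very large file chunk by chunk without ever building the full list of lines.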
secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (Yelp#240)
     * see mrjob/examples/
   * can now override jobconf() again (Yelp#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * mr_most_used_word.py (two step job)
     * mr_next_word_stats.py (SORT_VALUES example)
 * runners:
   * All runners:
     * single --setup option works but is not yet documented (Yelp#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (Yelp#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (Yelp#633)
     * Throws IOError if output path already exists (Yelp#634)
     * Gracefully handles SSL cert issues (Yelp#621, Yelp#706)
     * Automatically infers EMR/S3 endpoints from region (Yelp#658)
     * ls() supports s3n:// schema (Yelp#672)
     * Fixed log parsing crash on JarSteps (Yelp#645)
     * visible_to_all_users works with boto <2.8.0 (Yelp#701)
     * must use --interpreter with non-Python scripts (Yelp#683)
     * cat() can decompress gzipped data (Yelp#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (Yelp#583)
     * cat() can decompress gzipped data (Yelp#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (Yelp#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (Yelp#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in mrjob.aws
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (Yelp#747)
   * built-in protocols must be instances (Yelp#488)

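As a rough picture of what a secondary sort by value gives a reducer, the sketch below groups (key, value) pairs in memory so that each key's values arrive in sorted order. It only illustrates the ordering guarantee and is not how mrjob or Hadoop implements SORT_VALUES. The function name is hypothetical, and it assumes values compare as plain Python objects, whereas a real job's ordering would depend on how values are encoded.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_sorted_values(pairs):
    """Yield (key, values) with each key's values in sorted order.

    Hypothetical in-memory stand-in for the shuffle/sort phase.
    """
    # Sorting tuples orders by key first, then by value within each key.
    ordered = sorted(pairs)
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]
```

With this ordering, a reducer can rely on seeing, say, the smallest value for a key first without buffering all of that key's values itself.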
v0.3.5, 2012-08-21 -- The Last Ride of v0.3.x[?]
 * EMR:
   * --pool-wait-minutes option lets you wait up to X minutes before creating a job flow (Yelp#455)
   * Job flow ID included in error messages on failure (Yelp#452)
   * JOB and JOB_FLOW cleanup options (Yelp#485, Yelp#455)
 * EMR and Hadoop:
   * Compatibility fixes related to deprecated options and Hadoop's bizarre non-sequential version numbers (Yelp#489, Yelp#534)
 * Other:
   * Warn when *_PROTOCOL is not a class (Yelp#490)
 * Bug fixes:
   * Unicode strings can be used when specifying interpreters (Yelp#431)
   * --enable-emr-logging no longer causes the wrong counters/logs to be parsed (Yelp#446)
   * TMP_DIR inserted into 'sort' environment variables (Yelp#477)
   * Setting hadoop_home in mrjob.conf works again
   * Gzipped input files work when specified with relative paths (Yelp#494)
   * Passthrough options are not re-ordered when sent to Hadoop Streaming (Yelp#509)
   * json module is supported again if simplejson doesn't exist (Yelp#544)
   * HadoopJobRunner.path_exists() is no longer backwards (Yelp#549)

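The json/simplejson item above refers to a common fallback-import pattern, sketched below. This is an illustration of the general idiom, not a copy of mrjob's code. The helper encode_pair is hypothetical and simply shows that the rest of the program can use the name json without caring which module was loaded.

```python
try:
    # Prefer simplejson when it happens to be installed (optional dependency).
    import simplejson as json
except ImportError:
    # Otherwise fall back to the standard library module.
    import json


def encode_pair(key, value):
    """Hypothetical helper: encode a key/value pair as tab-separated JSON."""
    return json.dumps(key) + "\t" + json.dumps(value)
```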
v0.3.4, 2012-06-11 -- We are friendly people.
 * Experimental support for IronPython in the local and inline runners
 * set_status() and increment_counter() will encode messages/names of type 'unicode' as UTF-8 when writing to Hadoop Streaming
 * EMR and Hadoop counter parsing is more correct
 * mrjob.tools.emr.fetch_logs fetches logs from S3 when asked instead of incorrectly refusing to do so
 * jobconf values can be booleans in mrjob.conf as well as 'true' and 'false' strings
 * hadoop_version can be a float in mrjob.conf, but a warning is printed to the console
 * Command line help is split across several --help-* commands
 * Local runner sorts output consistently

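For context on the set_status() and increment_counter() item: Hadoop Streaming reads counter and status updates as specially formatted lines on a task's stderr. The sketch below builds such lines and encodes them as UTF-8. It is written for Python 3, where every str is Unicode, whereas this release targeted Python 2's separate unicode type. The function names are hypothetical and this is not mrjob's own code.

```python
def counter_line(group, counter, amount=1):
    """Build a Hadoop Streaming counter update as UTF-8 bytes.

    Hypothetical helper; the result would be written to stderr.
    """
    line = "reporter:counter:%s,%s,%d\n" % (group, counter, amount)
    return line.encode("utf-8")


def status_line(message):
    """Build a Hadoop Streaming status update as UTF-8 bytes."""
    return ("reporter:status:%s\n" % message).encode("utf-8")
```

Encoding explicitly means non-ASCII counter names or status messages reach Hadoop as well-defined bytes instead of failing when written to the stream.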
v0.3.3.2, 2012-04-10 -- It's a race [condition]!
 * Option parsing no longer dies when -- is used as an argument (Yelp#435)
 * Fixed race condition where two jobs can join same job flow thinking it is idle, delaying one of the jobs (Yelp#438)
 * Better error message when a config file contains no data for the current runner (Yelp#433)