Closed
Description
To duplicate the semantics of MRJobs, our Spark harness encodes and decodes data between mappers and reducers and between job steps. This is usually unnecessary, since PySpark already knows how to serialize Python data structures.
It should be possible to turn this off, so that the job only decodes the initial input and encodes the final output (e.g. --no-internal-protocols).
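To make the proposal concrete, here is a minimal pure-Python sketch of the difference, not the harness's actual code. `JSONLineProtocol` is a hypothetical stand-in that only imitates the `read(line)` / `write(key, value)` shape of a protocol, and `run_step` and its `protocol` argument are names invented for illustration. Passing `protocol=None` plays the role of the proposed switch.

```python
import json
from itertools import groupby


class JSONLineProtocol:
    """Hypothetical stand-in for a protocol: bytes <-> (key, value)."""

    def read(self, line):
        key, value = line.split(b"\t", 1)
        return json.loads(key), json.loads(value)

    def write(self, key, value):
        return json.dumps(key).encode() + b"\t" + json.dumps(value).encode()


def run_step(pairs, mapper, reducer, protocol=None):
    """Run one map/reduce step over (key, value) pairs.

    If a protocol is given, mapper output is encoded and decoded again
    before reaching the reducer, imitating the current behavior.
    If protocol is None, the Python objects are passed through as-is.
    """
    mapped = [kv for k, v in pairs for kv in mapper(k, v)]
    if protocol is not None:
        mapped = [protocol.read(protocol.write(k, v)) for k, v in mapped]
    # Group values by key, as a shuffle would.
    mapped.sort(key=lambda kv: kv[0])
    out = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        out.extend(reducer(key, (v for _, v in group)))
    return out


def word_mapper(_, line):
    for word in line.split():
        yield word, 1


def sum_reducer(word, counts):
    yield word, sum(counts)


lines = [(None, "a b a"), (None, "b a")]
with_protocols = run_step(lines, word_mapper, sum_reducer, JSONLineProtocol())
without_protocols = run_step(lines, word_mapper, sum_reducer, protocol=None)
```

For this word-count job both calls produce the same result, so the internal round trip is pure overhead here.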
It might make sense to enable this option automatically for jobs that use PickleProtocol internally.