8000 ORC-1920: [C++] Support `Geometry` and `Geography` types by ffacs · Pull Request #2269 · apache/orc · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

ORC-1920: [C++] Support Geometry and Geography types #2269

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

ffacs
Copy link
Contributor
@ffacs ffacs commented Jun 16, 2025

What changes were proposed in this pull request?
Support Geometry and Geography types for c++ side

Why are the changes needed?
Add support for Geometry and Geography types

How was this patch tested?
UT passed

Was this patch authored or co-authored using generative AI tooling?
No

@ffacs ffacs changed the title ORC-1920: Support Geometry and Geography types ORC-1920: [C++] Support Geometry and Geography types Jun 16, 2025
@github-actions github-actions bot added the CPP label Jun 16, 2025
@ffacs ffacs changed the title ORC-1920: [C++] Support Geometry and Geography types ORC-1920: [C++] Support Geometry and Geography types Jun 16, 2025
Copy link
Member
@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @ffacs .

cc @wgtmac and @williamhyun

@dongjoon-hyun dongjoon-hyun added this to the 2.2.0 milestone Jun 16, 2025
///
/// These values correspond to the 1, 2, ..., 7 component of the WKB integer
/// geometry type (i.e., the value of geometry_type % 1000).
enum class GeometryType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Java, we had an additional value, -1, for UNKNOWN_TYPE_ID.

private static final int UNKNOWN_TYPE_ID = -1;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ employs a different implementation that invalidates statistics when unknown types are encountered; therefore, UNKNOWN_TYPE_ID is unnecessary in this context.

}

std::stringstream ss;
ss << "<GeoStatistics>";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto. Why do we have this only in C++ implementation? If this is required, can we have this in Java first?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is due to gap between Parquet Java and C++ implementations...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is due to gap between Parquet Java and C++ implementations..

Copy link
Member
@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few minor comments about the feature parity between Java and C++ implementations.

Copy link
Member
@williamhyun williamhyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you @ffacs for this PR, waiting on the above comment before merging.

@dongjoon-hyun
Copy link
Member

Gentle ping, @ffacs .

@ffacs
Copy link
Contributor Author
ffacs commented Jun 24, 2025

Gentle ping, @ffacs .

Pong. Thank you for your review @dongjoon-hyun , I'd update this patch these days.

@dongjoon-hyun
Copy link
Member

Thank you, @ffacs . Also, please participate the 1.8.10 vote too when you have some time.

}

std::stringstream ss;
ss << "<GeoStatistics>";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is due to gap between Parquet Java and C++ implementations...

@dongjoon-hyun
Copy link
Member

Could you fix the above comments, @ffacs ?

@wgtmac
Copy link
Member
wgtmac commented Jul 3, 2025

Let me know when ready to review.

@ffacs
Copy link
Contributor Author
ffacs commented Jul 3, 2025

@wgtmac @dongjoon-hyun I'm ready to review now, please take a look when you're free~

@dongjoon-hyun
Copy link
Member

Thank you, @ffacs .

#include "orc/Type.hh"
#include "orc/Vector.hh"
#include "orc/orc-config.hh"

#include <sstream>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this removed due to the #include <ostream> from Geospatial.hh?

virtual ~GeospatialColumnStatistics();

/**
* get bounding box
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Maybe, get -> Get for consistency with other code around this?

* This file contains code adapted from the Apache Arrow project.
*
* Original source:
* https://github.com/apache/arrow/blob/main/cpp/src/parquet/geospatial/statistics.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide a tag based location instead of the main branch because it changes always.

* This file contains code adapted from the Apache Arrow project.
*
* Original source:
* https://github.com/apache/arrow/blob/main/cpp/src/parquet/geospatial/statistics.cc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide a tag based location instead of the main branch because it changes always.

Copy link
Member
@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

#include <ostream>
#include <string>

namespace orc {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can use namespace orc::geospatial {

@@ -25,6 +25,19 @@

namespace orc {

namespace geospatial {
enum EdgeInterpolationAlgorithm {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
enum EdgeInterpolationAlgorithm {
enum class EdgeInterpolationAlgorithm {

ANDOYER = 3,
KARNEY = 4
};
using EIAlgo = EdgeInterpolationAlgorithm;
Copy link
Member

Choose a reason for hiding this comment

CEB7 The reason will be displayed to describe this comment to others. Learn more.

I don't think this acronym is clear. Can we use its full name instead?

Comment on lines +77 to +78
virtual const std::string& getCRS() const = 0;
virtual geospatial::EIAlgo getEIAlgo() const = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
virtual const std::string& getCRS() const = 0;
virtual geospatial::EIAlgo getEIAlgo() const = 0;
virtual const std::string& getCrs() const = 0;
virtual geospatial::EdgeInterpolationAlgorithm getAlgorithm() const = 0;

It would be good to add comment to say these two functions are for geometry & geography types only.

const char* incomingMask) override {
ColumnWriter::add(rowBatch, offset, numValues, incomingMask);

const StringVectorBatch& strBatch = dynamic_cast<const StringVectorBatch&>(rowBatch);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we provide a clear exception message if it fails?

}
}

} // namespace orc::geospatial
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a blank line

namespace orc::geospatial {

template <typename T>
inline std::enable_if_t<std::is_trivially_copyable_v<T>, T> SafeLoadAs(const uint8_t* unaligned) {
3D11 Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we fix the function names in this file to have lowercased initials?

void WKBGeometryBounder::mergeGeometryInternal(WKBBuffer* src, bool recordWkbType) {
uint8_t endian = src->ReadUInt8();
bool swap = endian != 0x00;
if (isLittleEndian()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can use if constexpr (std::endian::native == std::endian::little)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::endian requires C++20, we are using c++17 now.

} // namespace geospatial
} // namespace orc

#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a blank line

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0