-
Notifications
You must be signed in to change notification settings - Fork 76
Improve using //
in XPath performance
#249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5bf858e
to
119ed65
Compare
I have some concerns about maintainability because cache invalidation bug is hard to debug. Here are some ideas of cache strategy.
def node_test
namespaces_cache = {}
# Cache scope is enclosed in this method.
# There is no need to worry about cache invalidation.
call node.namespaces(namespaces_cache) to use the cache
end
def namespaces(cache = nil)
if cache
cache[self] ||= calculate_namespaces(cache)
else
calculate_namespaces
end
end
def namespaces
parent_namespaces = parent&.namespaces
# If parent.namespaces or @attributes is changed, cache is automatically invalidated
return @namespaces if parent_namespaces.equal?(@parent_namespaces) && @attrubutes.frozen?
@attributes.freeze # Need to unfreeze before modifying @attributes
@parent_namespaces = parent_namespaces
@namespaces = @parent_namespaces.merge(@attributes).freeze
end |
119ed65
to
34cca4c
Compare
For maintainability, I have changed my mind about adopting policy 1 above. |
I have a bit concern about adding a I feel that it may a bit ad-hoc approach. I'm not object this approach but can we improve our API? For example, how about keeping a cache per document? class XPathParser
def parse path, nodeset
path_stack = @parser.parse( path )
nodeset.first.document.enable_cache do
# New document level cache is created and available in this block.
# This API is thread unsafe. Users can't change this document in this block.
match( path_stack, nodeset )
end
end
end |
34cca4c
to
cea9bb8
Compare
I add |
cea9bb8
to
ffe6f9a
Compare
I have corrected the points you pointed out. |
ffe6f9a
to
24cfc1b
Compare
I have corrected the points you pointed out. |
ff07142
to
acf7ecf
Compare
I have corrected the points you pointed out. |
#252) `XPath.match`, `XPath.first`, `XPath.each`, `XPathParser#parse` and `XPathParser#match` accepted nodeset as element. This pull request changes the first parameter of these method to be an element instead of nodeset. Passing nodeset will be deprecated. ```ruby # Documented usage. OK REXML::XPath.match(element, xpath) # Undocumented usage. Deprecate in this pull request nodeset = [element] REXML::XPath.match(nodeset, xpath) ``` ### Background #249 will introduce a temporary cache. ```ruby def parse path, nodeset path_stack = @parser.parse( path ) nodeset.first.document.send(:enable_cache) do match( path_stack, nodeset ) end end ``` But the signature `XPathParser#match(path, nodeset)` does not guarantee that all nodes in the nodeset has the same root document. So cache does not work in the code below. It's still slow. ```ruby REXML::XPath.match(2.times.map { REXML::Document.new('<a>'*400+'</a>'*400) }, 'a//a') ``` The interface is holding our back, so I propose to drop accepting array as element. This change is a backward incompatibility, but it just drops undocumented feature. I think only the test code was unintentionally using this feature. ### XPath.match with array XPath.match only traverse the first element of the array for some selectors. ```ruby nodeset = [REXML::Document.new("<a><b/></a>"), REXML::Document.new("<a><c/></a>")] REXML::XPath.match(nodeset, "a/*") #=> [<b/>, <c/>] REXML::XPath.match(nodeset, "//a/*") #=> [<b/>] # I expect [<b/>, <c/>] but the second document is ignored ``` It indicates that `XPath.match` is not designed to search inside multiple nodes/documents. --------- Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
acf7ecf
to
7279481
Compare
#252 merged, so I rebased this PR and updated. |
7279481
to
88ec36f
Compare
I have corrected the points you pointed out. |
88ec36f
to
1f67fbd
Compare
I have corrected the points you pointed out. |
1f67fbd
to
b28407e
Compare
I have corrected the points you pointed out. |
## Why? for fix get_namespace performance > # FIXME: This DOUBLES the time XPath searches take Co-authored-by: tomoya ishida <tomoyapenguin@gmail.com> Co-authored-by: Sutou Kouhei <kou@clear-code.com>
## Benchmark (Comparison with rexml 3.4.1) ``` $ benchmark-driver benchmark/xpath.yaml Calculating ------------------------------------- rexml 3.4.1 master 3.4.1(YJIT) master(YJIT) REXML::XPath.match(REXML::Document.new(xml), 'a//a') 29.215 234.909 108.945 492.410 i/s - 100.000 times in 3.422925s 0.425697s 0.917898s 0.203083s Comparison: REXML::XPath.match(REXML::Document.new(xml), 'a//a') master(YJIT): 492.4 i/s master: 234.9 i/s - 2.10x slower 3.4.1(YJIT): 108.9 i/s - 4.52x slower rexml 3.4.1: 29.2 i/s - 16.85x slower ``` - YJIT=ON : 4.52x faster - YJIT=OFF : 8.04x faster
b28407e
to
6c573e0
Compare
I have corrected the points you pointed out. |
Thanks. |
When using
//
in XPath, the deeper the tag hierarchy, the slower it becomes due to the namespace acquisition process.Caching namespace information improves performance when using
//
with XPath.Benchmark (Comparison with rexml 3.4.1)