JyAttribute.getAttr significantly slow down during multi-threading #360

Open
newprogrammer101 opened this issue Oct 12, 2024 · 16 comments · May be fixed by #361

newprogrammer101 commented Oct 12, 2024

I have a service deployed on AWS and use CodeGuru to analyze its performance. I have found that under high concurrency (>20 TPS), JyAttribute.getAttr keeps threads in the BLOCKED state for about 95% of the time. There was a related issue in 2016 (https://bugs.jython.org/issue2452), but I do not think it was resolved completely, as there is still a synchronized (ob) block.

This synchronization causes a large latency impact. I am looking for a way to improve performance under high concurrency and avoid this kind of lock. Thanks.

Stewori (Contributor) commented Oct 13, 2024

I think JyAttribute can be adjusted such that read access can be unsynchronized. I suspect the blocking you observe is mainly (like 99%) from read access (correct me if I'm wrong). I'm preparing a PR to de-synchronize JyAttribute read-access.

jeff5 (Member) commented Oct 13, 2024

It's not generally true that only write access needs synchronisation. If a field is final (and initialised by one thread before it can be seen by any other) we're good.

JyAttribute is a cunning field that is sometimes a value, sometimes a linked list head, but fundamentally it is modifiable at any moment for arbitrary purposes. Yet lock-free structures exist for particular cases: they're just difficult to prove correct. I realise my nervousness is because I don't know enough about how to make those things (e.g. the lock-free lookup in ClassValue).

What if thread A reads ob.attributes and caches it (maybe just in the L2 cache of one core), and then another thread B modifies it? A still thinks its value is good. Might volatile be enough?
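A minimal sketch of that hazard, with hypothetical class names rather than Jython code: a plain field gives no visibility guarantee across threads, while a volatile one does.

class VisibilitySketch {
    Object attributes;             // plain field: a reader may keep using a stale cached value
    volatile Object attributesV;   // volatile field: a write becomes visible to all later reads

    void writerThread(Object fresh) {
        attributes = fresh;        // may stay invisible to a reader thread indefinitely
        attributesV = fresh;       // guaranteed visible to any subsequent volatile read
    }

    Object readerThread() {
        return attributesV;        // never older than the most recent volatile write
    }
}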


Stewori (Contributor) commented Oct 14, 2024

The read/cache -> update -> write sequence is precisely the reason why write access will continue to be synchronized. However, this can be organized such that a concurrent read access finds a valid state at every point in time. The trade-off is that a concurrent read might still get the "old" state because a slightly earlier write has not yet finished. In other words, in the case of concurrent access it will be undefined whether the old or the new state is read, but it will be one of the two.
(This only refers to read/cache -> update -> write inside the original synchronized block. Such a sequence outside of the JyAttribute getters/setters would have been affected by concurrent-modification issues already in the status quo; it cannot be a goal here to prevent that.)
You are right that this (AFAIK) requires making the PyObject.attributes field volatile. Then again, before JyAttribute was introduced, the javaProxy field was not volatile, and I suspect there are many other fields in Jython that may be subject to concurrent access and are not properly declared volatile. Anyway, the suspicion that it might have been done wrongly earlier or elsewhere would be a bad excuse for getting it wrong on this occasion.
After reviewing JyAttribute.setAttr and JyAttribute.delAttr, and provided that we make PyObject.attributes, JyAttribute.AttributeLink.next and JyAttribute.TransientAttributeLink.next volatile, the following getAttr implementation should work sanely without synchronization:

public static Object getAttr(PyObject ob, byte attr_type) {
    Object oatt = ob.attributes; // atomic one-time access to ob.attributes
    if (oatt instanceof JyAttribute) {
        JyAttribute att = (JyAttribute) oatt;
        while (att != null && att.attr_type < attr_type) {
            att = att.getNext(); // atomic access to next
            // perhaps att is affected by a concurrent write access, perhaps not (who cares)
            // Since the next-field is volatile, att is always a properly initialized
            // attribute sequence head at this point.
        }
        return att != null && att.attr_type == attr_type ? att.getValue() : null;
    }
    return attr_type == JAVA_PROXY_ATTR ? oatt : null;
}

Since the (Transient)AttributeLink.value fields are not read, but just written, by setAttr/delAttr, I suppose they don't have to be volatile. I would, however, declare JyAttribute.attr_type final (because it is conceptually immutable and should have been marked final in the first place).
Then (more of an academic concern), the methods reserve(Transient)CustomAttrType should be synchronized, since they perform non-atomic access on nonBuiltin(Transient)AttrTypeOffset.
I'm not sure whether it is necessary, but for safety we should probably make nonBuiltin(Transient)AttrTypeOffset volatile as well. I suppose the reserve methods are rarely called (if ever), so synchronizing them shouldn't have an impact.
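If that reads correctly, the declarations the unsynchronized read path relies on would look roughly like the sketch below (illustrative mock classes, not the actual patch; the real code also has the transient variant).

class PyObjectSketch {
    volatile Object attributes;            // either a plain value or the head of a JyAttributeSketch list
}

class JyAttributeSketch {
    final byte attr_type;                  // conceptually immutable, hence final
    volatile JyAttributeSketch next;       // volatile: an unsynchronized reader always sees an initialized link
    Object value;                          // written only inside the (still synchronized) setAttr/delAttr

    JyAttributeSketch(byte attr_type, Object value) {
        this.attr_type = attr_type;
        this.value = value;
    }
}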

jeff5 (Member) commented Oct 15, 2024

Ah, I hadn't recognised this was yours originally. But it is documented, so I should have guessed. :)

I have never entirely settled in my mind what guarantees we ought to give about concurrent access to objects we design. Currently I think that unmanaged concurrent access should not break an object at the Java level, but that the application threads should not then expect to see the same state. I think this is what you are aiming at.

There are probably Python applications that work correctly in CPython only because of the inadvertent synchronisation brought about by the GIL, but I don't think they're ours to fix.

By "the original synchronized block" above you mean whatever keeps the threads of the outer application from tripping over each other? It could be some other kind of lock taken before the object in question is accessed.

It's a Java claim that uncontended access is cheap. If so, for the problem to be this significant, threads getting an attribute here are likely in real contention for a shared PyObject (or several). It seems unusual to me that the getter and setter are static methods, but Darjus' change to synchronise on the target instance makes sense. I don't think the method being static makes a difference to the locking.
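A small sketch of that last point, with hypothetical names: a static helper that synchronizes on the target instance contends on exactly the same monitor as a synchronized instance method would, so being static indeed makes no difference to the locking.

class LockingSketch {
    private Object attributes;

    // Static helper that synchronizes on the target instance, in the style of the current JyAttribute code.
    static Object getAttrStatic(LockingSketch ob) {
        synchronized (ob) {                // the lock is the instance, not the class
            return ob.attributes;
        }
    }

    // An instance method declared synchronized locks `this`, i.e. the very same monitor.
    synchronized Object getAttrInstance() {
        return attributes;
    }
}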

jeff5 (Member) commented Oct 15, 2024

I'll admit this makes my head spin, but I think it contains the concepts for a little comment explaining why your solution is OK.

I was going to add what I thought the solution might be, but I think I'm still wrong. Concurrency is hard. -- Brian Goetz. (Let's go shopping. -- Barbie.)

Stewori (Contributor) commented Oct 15, 2024

I mainly learned about the issue of possible reordering from http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html (once shared by you, thx!). From that doc I take it that the reordering of assignment and object initialization is prevented for volatile fields in modern (that is, not totally ancient) Java. The de-synchronization of getAttr works based on that assumption. Re static methods: we wanted to move the JyAttribute logic out of PyObject (to avoid cluttering that class even more), but that logic manages PyObject.attributes, so we made static utility methods. I assume that the guarantee of volatile preventing assign/init reordering applies to static methods in a similar manner.
By "the original synchronized block" I meant the synchronized blocks in the current JyAttribute status quo. However, as you point out, some synchronization outside of that code is already required in the current state, so it is the user's responsibility anyway.

newprogrammer101 (Author) commented Oct 15, 2024

Thanks Stewori and jeff5 for the quick reply.
Below is the full stack trace printed by AWS CodeGuru, where JyAttribute.getAttr shows 95% of CPU time in the BLOCKED state. The Jython version I am using is 2.7.4.

[Screenshot: AWS CodeGuru profile, 2024-10-15 10:33 AM]

org.python.util.PythonInterpreter.exec -> org.python.core.Py.flushLine -> org.python.core.StdoutWrapper.flushLine -> org.python.core.StdoutWrapper.myFile -> org.python.core.PyObject.getJavaProxy -> org.python.core.JyAttribute.getAttr

Hope this helps.

Stewori (Contributor) commented Oct 16, 2024

@newprogrammer101 That confirms that solving the read access should solve the problem.

@jeff5 I think the linked JLS doc justifies the described approach: 17.4.4 says "A write to a volatile variable v (§8.3.1.4) synchronizes-with all subsequent reads of v by any thread", and 17.4.5 turns that into a happens-before relation hb(x, y): "If an action x synchronizes-with a following action y, then we also have hb(x, y)."

The complexity of this matter makes it somewhat hard to believe that it actually works the way it is supposed to :) We'll see what tests say...
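A tiny illustration of how those two rules combine (hypothetical names, not the Jython classes): everything the writer does before the volatile write of next happens-before anything a reader does after reading it.

class HappensBeforeSketch {
    static class Link {
        final int payload;
        Link(int payload) { this.payload = payload; }
    }

    volatile Link next;

    void writer() {
        Link l = new Link(7);   // fully constructed before the volatile write below
        next = l;               // volatile write: synchronizes-with every later read of `next`
    }

    Integer reader() {
        Link l = next;          // volatile read: hb(write, read) per JLS 17.4.4 / 17.4.5
        return l == null ? null : l.payload;   // if non-null, the Link is guaranteed initialized
    }
}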

Stewori (Contributor) commented Oct 18, 2024

@newprogrammer101 can you clone my branch https://github.com/Stewori/jython/tree/desync_JyAttribute_getAttr and check whether it solves the performance issue in your scenario?

newprogrammer101 (Author) commented Nov 5, 2024

Hi Stewori, I appreciate your resolving the issue by changing synchronized to volatile to mitigate the race condition. It looks good to me. Will you publish a new Jython version that includes this commit?

Stewori (Contributor) commented Nov 5, 2024

As discussed in the PR, this change is too theoretical and, given our near-zero experience with this kind of synchronization tuning, too experimental to be deployed into master without further testing. I wrote a consistency and stress test for the implementation with the help of AI (that was actually an experiment in its own right: I could have done it without AI, and the code needed several adjustments, but I wanted to see if it could actually save time). The test is currently standalone, based on a mocked PyObject but with the original JyAttribute code (I told the AI it was a multi-threaded stress test for a custom Dict implementation, hence the class Dict extends PyObject in DictStressTest.java), and not yet integrated with the rest of Jython or its test suite:
test_JyAttribute.tar.gz
I currently don't have time to move forward with the test, and I think it should not be used as-is out of the box. The number of parallel threads and the timing should start with more conservative values, and one would carefully adjust them towards an actual stress test for the hardware at hand. Also, one might better replace System.currentTimeMillis with a time-measurement approach based on System.nanoTime. I don't have time for this fiddling at the moment, but please feel free to proceed with the given code.
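For orientation, the overall shape of such a loop might look like the sketch below: a generic harness against a mocked target, not the attached DictStressTest, already using System.nanoTime as suggested.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class StressSketch {
    static class Target { volatile Object attributes; }

    public static void main(String[] args) throws InterruptedException {
        final Target target = new Target();
        final AtomicBoolean stop = new AtomicBoolean(false);
        final int threads = 4;                        // start conservative, then raise carefully
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            final boolean writer = (i == 0);          // one writer thread, the rest readers
            pool.execute(() -> {
                while (!stop.get()) {
                    if (writer) {
                        target.attributes = new Object();   // publish a new value
                    } else {
                        Object o = target.attributes;       // old or new value: both are acceptable
                        if (o instanceof String) {          // placeholder for a real consistency check
                            throw new AssertionError("unexpected state: " + o);
                        }
                    }
                }
            });
        }
        Thread.sleep(2000);                           // run duration; tune for the hardware at hand
        stop.set(true);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.printf("ran for %.2f s%n", (System.nanoTime() - start) / 1e9);
    }
}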

Also, it would be good if you could confirm that the PR actually solves the performance issue in your setting before it is integrated into master. If a different approach turns out to be required, that trial-and-error development should not take place in master.

newprogrammer101 (Author) commented:
Hi Stewori, thanks for the follow-up. I manually applied your change to the Jython dependency locally and can see an improvement in latency. If my understanding is correct, you want to perform more stress testing before committing this change to the master branch. If so, there is no urgency at the moment. Once the new version of Jython is released, I will ask my company's Ops team to import the latest change. Thank you.

newprogrammer101 (Author) commented Dec 3, 2024

Hi Stewori, is there a plan for when this commit will be merged into the main branch for a new version? From my side, I can confirm that changing the synchronized keyword to volatile significantly reduces latency, since there is no lock anymore.

Thank you.

newprogrammer101 (Author) commented:
Another concern: I notice Jython calls PrintStream.flush often, but according to https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/PrintStream.java#L402, it looks like that method is also synchronized, which may not be good for multi-threading.

jeff5 (Member) commented Dec 18, 2024

Concurrency is hard to reason about successfully, so the test is important. (Not the test that it goes faster: the test that it is still correct.) You can see above that @Stewori has done some work towards a test and, if it passes, it would be nice to get this into a 2.7.5 beta.

The link to PrintStream is interesting, thanks, particularly the history. You can see they changed a couple of times how they make it thread safe. I don't understand the detail, but from the comments I see this is about the re-introduction of lightweight threads. If an output stream is shared between threads, it must surely be synchronised, and therefore it is a bottleneck. Is that what you mean?

The stream output from Jython is complex because of the uneasy relationship amongst character output in Java, the byte oriented streams in Python, and terminal encoding (a real mess on Windows). ISTR frequent flushing was necessary for data to emerge in the right order.

newprogrammer101 (Author) commented Dec 18, 2024

Hi @jeff5, thanks for the reply. "If an output stream is shared between threads, it must surely be synchronised, and therefore it is a bottleneck. Is that what you mean?" Yes! That is what I am concerned about. We have now optimized some of the locking in the Jython package itself, but we still face a performance issue because a dependency also uses a heavyweight synchronized lock.

I am not sure whether using a thread pool to create multiple instances might help, since the synchronized keyword usually locks a single shared instance (and as long as it is an I/O operation, the CPU can be used by other threads).
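If it helps, the per-thread idea could be sketched like this, assuming the standard PythonInterpreter.setOut API; whether separate interpreters are acceptable depends on how much state your threads need to share.

import java.io.ByteArrayOutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.python.util.PythonInterpreter;

class PerThreadInterpreterSketch {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            final int id = i;
            pool.execute(() -> {
                // Each worker owns its interpreter and output sink, so a flush in one
                // thread no longer blocks the others on a shared PrintStream monitor.
                PythonInterpreter interp = new PythonInterpreter();
                try {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    interp.setOut(out);
                    interp.exec("print('hello from worker " + id + "')");
                    System.out.print(out.toString());   // merge results outside the hot path
                } finally {
                    interp.cleanup();                   // release interpreter resources
                }
            });
        }
        pool.shutdown();
    }
}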
