Description
As a continuation of #866, I have this snippet that loads Parquet files (compressed or otherwise) in a separate thread.
Steps to reproduce
import sys
import glob
import time
import threading

import atoti as tt

# The jars folder contains the zstd jar.
session = tt.Session(extra_jars=["atoti/jars"], java_options=["-Xmx32G", "-Xms32G"])


# Create the table from a Parquet file, or load the file into the existing table.
def load_parquet(pqfile, table_name, keys=None):
    table = session.tables.get(table_name, None)
    if table is None:
        table = session.read_parquet(pqfile, table_name=table_name, keys=keys)
    else:
        table.load_parquet(pqfile)
    return table


pq_files = sorted(glob.glob("/path/to/compressed/files_*.parquet"))

# Load the first file & create a cube.
tbl = load_parquet(pq_files[0], "mytable", keys=["id"])
cube = session.create_cube(tbl)


# Load positions in a separate thread.
def loader():
    for idx, pq in enumerate(pq_files):
        sys.stdout.write(f"\rLoading file #{idx}: {pq}")
        load_parquet(pq, "mytable")
        time.sleep(1)  # wait a second between files


threading.Thread(target=loader).start()
Actual Result
Process memory continues to grow when my loader thread runs!
The process starts at 40 GB VIRT and 2.6 GB RSS (observed via htop).
The initial server log reports different values (roughly 1.5 GB of heap in use and 3 GB of direct memory):
2024-04-02T10:55:32.129-04:00 INFO 3615 --- [activepivot-health-event-dispatcher] c.a.h.m.ILoggingHealthEventHandler : [jvm, memory] INFO 2024-04-02T14:55:32.127Z uptime=34570ms com.activeviam.health.monitor.impl.JvmHealthCheck.createEvent:61 thread=activeviam-health-check-worker thread_id=52 event_type=JvmMemoryReport JVM Memory Usage report: G1 Young Generation[count=10 (+0), time=0s (+0)] G1 Old Generation[count=0 (+0), time=0s (+0)] Heap[used=1 GiB 465 MiB (1561350664) (+(0)), committed=32 GiB (34359738368) (+(0)), max=32 GiB (34359738368) (+(0))] Direct[used=3 GiB 46 MiB (3269516297) (+(0)), count=11569 (+0), max=32 GiB (34359738368) (+(0))] Threads[count=103 (+0), peak=104 (+0)]
As the loader thread runs and loads files, I see total process memory keep rising; it only stops rising when the loader stops.
By the end it had reached ~85 GB VIRT and ~74 GB RSS (again via htop).
Yet the last memory line of the server log says that only about 13 GB of heap and 16 GB of direct memory are in use:
2024-04-02T12:56:32.199-04:00 INFO 3615 --- [activepivot-health-event-dispatcher] c.a.h.m.ILoggingHealthEventHandler : [jvm, memory] INFO 2024-04-02T16:56:32.199Z uptime=7294642ms com.activeviam.health.monitor.impl.JvmHealthCheck.createEvent:61 thread=activeviam-health-check-worker thread_id=52 event_type=JvmMemoryReport JVM Memory Usage report: G1 Young Generation[count=108 (+0), time=2s (+0)] G1 Old Generation[count=3 (+0), time=1s (+0)] Heap[used=12 GiB 980 MiB (13912595624) (+(0)), committed=32 GiB (34359738368) (+(0)), max=32 GiB (34359738368) (+(0))] Direct[used=16 GiB 380 MiB (17578900307) (+(0)), count=286824 (+0), max=32 GiB (34359738368) (+(0))] Threads[count=40 (+0), peak=113 (+0)]
So the numbers don't add up: heap (~13 GB) plus direct (~16 GB) comes to about 29 GB, within the 32 GB upper bound I set initially, and every such memory log line stays within that bound. Yet the process is actually using roughly 74 GB of RSS, so around 45 GB is unaccounted for.
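In case it helps reproduce the measurement, this is roughly how the server JVM's RSS/VIRT can be tracked alongside the health log. It is just a sketch assuming psutil is installed; the PID is the one shown in the log lines above (and in htop). I used htop interactively for the numbers quoted here, this is only a scriptable equivalent:

    import time

    import psutil  # assumption: psutil is available for process inspection

    # PID of the atoti server JVM (3615 in the log lines above, also visible in htop).
    JVM_PID = 3615

    server = psutil.Process(JVM_PID)
    while True:  # Ctrl-C to stop
        mem = server.memory_info()
        print(f"RSS={mem.rss / 2**30:.1f} GiB  VIRT={mem.vms / 2**30:.1f} GiB")
        time.sleep(60)  # sample once a minute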
Expected Result
I expected the -Xmx option to put a hard limit on the RAM used by the process, but I think it only caps the JVM heap? I have read that Atoti also allocates data off-heap; is that what is happening here?
Is there a way to restrict the TOTAL memory usage of the process?
Could it perhaps be a "leak" in the parquet loader?
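For what it's worth, the only related knob I know of is the JVM's direct-memory cap, which can be passed through java_options the same way as -Xmx. I have not verified whether it bounds everything atoti allocates off-heap, so treat this as a guess:

    import atoti as tt

    # Sketch: cap direct (off-heap) buffers in addition to the heap.
    # -XX:MaxDirectMemorySize is a standard JVM flag; whether it covers all of
    # atoti's off-heap allocations is an assumption on my part.
    session = tt.Session(
        extra_jars=["atoti/jars"],
        java_options=[
            "-Xmx32G",
            "-Xms32G",
            "-XX:MaxDirectMemorySize=32G",
        ],
    )

Even then, the RSS growth I'm seeing goes well beyond heap + direct, so this may not be the whole story.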
Environment
- atoti: 0.8.10
- Python: 3.12.2
- Operating system: Linux
- Machine under test: 32 cores & 256 GB RAM
Logs
I have detailed logs as well; please let me know what additional information you need and I'll be happy to provide it!