
Add vm.max_map_count to startup checks #17078

Open
henrikingo opened this issue Nov 27, 2024 · 7 comments

Comments

@henrikingo

Problem Statement

CrateDB has a user-friendly feature where it checks various ulimit settings at startup and logs an error if they have not been raised to values suitable for a scalable production database. Recently we had a case where we hit failures due to a too-low vm.max_map_count. For some reason this wasn't caught by the startup checks.

Possible Solutions

Add vm.max_map_count to the startup checks already performed for other ulimit and configuration values.

Considered Alternatives

No response

@mfussenegger
Member

It's already part of the startup checks:

@Override
public BootstrapCheckResult check(final Settings settings) {
    // we only enforce the check if a store is allowed to use mmap at all
    if (IndexModule.NODE_STORE_ALLOW_MMAP.getWithFallback(settings)) {
        long maxMapCount = getMaxMapCount();
        if (maxMapCount != -1 && maxMapCount < LIMIT) {
            final String message = String.format(
                Locale.ROOT,
                "max virtual memory areas vm.max_map_count [%d] is too low, " +
                "increase to at least [%d] by adding `vm.max_map_count = 262144` to `/etc/sysctl.conf` " +
                "or invoking `sysctl -w vm.max_map_count=262144`",
                maxMapCount,
                LIMIT);
            return BootstrapCheckResult.failure(message);
        } else {
            return BootstrapCheckResult.success();
        }
    } else {
        return BootstrapCheckResult.success();
    }
}

https://cratedb.com/docs/guide/admin/bootstrap-checks.html
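
For reference, the remediation quoted in that error message can be checked and applied from a shell. A minimal sketch (assumes root privileges for the write steps; the value 262144 is taken from the message above):

# read the currently active limit (the value the bootstrap check compares against LIMIT)
cat /proc/sys/vm/max_map_count

# raise it immediately, no reboot required
sysctl -w vm.max_map_count=262144

# persist it across reboots and apply the config file
echo 'vm.max_map_count = 262144' >> /etc/sysctl.conf
sysctl -p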

@mfussenegger
Member

Whoops, didn't mean to close.
If people managed to get past the check but still ran into the issue, it could be an indication that we need to bump the required value further. The downside is that this would be a breaking change for anyone already running CrateDB with a value that currently passes the check and is sufficient for their number of shards.

It would also help to know what the actual configured value was and how much they had to increase it to resolve the issue.

@henrikingo
Author

Summoning @romseygeek, who was also on the same call: do you happen to remember? (And if not, would you mind asking the customer in the Slack channel? I'm happy to do it myself; I'm just trying to avoid an unnecessary back and forth.)

I failed to google it, so just asking here: what is our max file size for the Lucene index files? It seems like we are trying to ensure a node can hold and mmap enough files so that 4 GB * "enough files" = about 10 TB? (Enough files = 262144.)

@henrikingo
Author

Also noting while we wait for a response to the above: at some point we also suspected vm.overcommit_memory to be the culprit, but concluded it was not. Just noting it for the record (given that max_map_count was already checked...).

@mfussenegger
Member
mfussenegger commented Dec 2, 2024

what is our max file size for the Lucene index files? It seems like we are trying to ensure a node can hold and mmap enough files so that 4 GB * "enough files" = about 10 TB? (Enough files = 262144.)

The number of mappings depends on the number of segments and their size.
Since Lucene 9.10 it mmaps Lucene indexes in chunks of 16 GB (before it was 1 GB).
(CrateDB 5.7 is the first release with Lucene 9.10.)

So in an ideal scenario:

>>> 262144 * 16 GB -> TB
= 4194.3 TB

Caveat: you need to subtract quite a bit from that for other files, and more importantly because not all segments will be at their maximum size, depending on how merges are going on the system. Merges can require additional files. The mapping is also not a clean 1:1: you could end up with 2 mapped files per segment, or, if transparent huge pages are enabled, the other way around. cat /proc/<pid>/maps gives some insight into that (see the sketch below).

But it should still be plenty to handle the typical hardware setup.
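
As a rough way to see how many mappings a running node actually uses, you can count the lines of that maps file, since each line is one mapping. A small sketch (the <pid> placeholder is the CrateDB process id; the grep path is only an illustration and depends on where your data directory lives):

# total number of memory mappings for the process, to compare against vm.max_map_count
wc -l < /proc/<pid>/maps

# mappings that point at files under the data directory (path is an assumption, adjust as needed)
grep -c '/data/' /proc/<pid>/maps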

@romseygeek
Contributor

IIRC, because they had created completely new installs, it was set to the system default of 65530; they had updated the value to 262144 in the various conf files but hadn't restarted the node, so the new value hadn't taken effect. Possibly this means that /proc/sys/vm/max_map_count was reporting the new value, so the bootstrap check didn't fail?

@mfussenegger
Member

chunks of 16 GB (before it was 1 GB).

Small correction, GiB not GB.

>>> 262144 * 16 GiB -> TB
= 4503.6 TB

they had updated the value to 262144 in the various conf files but hadn't restarted the node, so the new value hadn't taken. Possibly this means that /proc/sys/vm/max_map_count was reporting the new version, so the bootstrap check didn't fail?

If it was changed via sysctl -w vm.max_map_count=262144 it should take immediate effect, no reboot required.
And afaik /proc/sys/vm/max_map_count shows the currently active value, not a configured but unapplied value.
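
A quick way to tell the two apart (a sketch; the config file only shows what is requested, which may not have been applied yet via sysctl -p or a reboot):

# the live kernel value, which is what the bootstrap check sees
cat /proc/sys/vm/max_map_count
sysctl vm.max_map_count

# what the config file requests; this is not necessarily the active value
grep vm.max_map_count /etc/sysctl.conf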
