[BUG] not removing orphaned chunks · Issue #581 · moosefs/moosefs · GitHub

[BUG] not removing orphaned chunks #581


Open
onlyjob opened this issue Aug 10, 2024 · 3 comments
Labels
data safety Tag issues and questions regarding potential data safety issues. Improve existing documentation.

Comments

@onlyjob
Copy link
Contributor
onlyjob commented Aug 10, 2024

I had one chunkserver offline for weeks, so it was removed from the cluster entirely. It has a unique label that is not used in any storage class. All storage classes are STRICT.

When I started that particular chunkserver (its host server was booted after being offline for a few months), I expected all its chunks to be removed. However, only some chunks were deleted, and about a million orphaned chunks (~7.5 TiB) were left on that chunkserver, not removed at all.

Apparently MooseFS does nothing to remove orphaned chunks that were forgotten by the master.

Experienced on MooseFS 3.0.117 (Debian).

@chogata
Copy link
Member
chogata commented Aug 10, 2024

MooseFS removes orphaned chunks - after 1 week. This is on purpose, in case those chunks are the result of a crash or something similar. This gives admins time to save them manually if they turn out to be needed after all.

This is not a bug :)

Edit: to be perfectly clear: after 7 days of continuous work. If you shut down the chunkserver after a day, the counter will reset.
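The retention rule described above can be sketched as a toy model (illustrative only; all names are hypothetical, this is not MooseFS source code): orphaned chunks are deleted only once the chunkserver has accumulated 7 days of *continuous* uptime, and any restart resets the clock.

```python
# Illustrative model of the orphan retention rule described above.
# Names are hypothetical; this is not the actual MooseFS implementation.
ORPHAN_RETENTION_SECONDS = 7 * 24 * 3600  # "after 7 days of continuous work"

class ChunkserverModel:
    def __init__(self):
        self.uptime = 0      # seconds since this process started
        self.orphans = set() # chunk ids the master rejected

    def restart(self):
        self.uptime = 0      # the counter resets on every restart

    def tick(self, seconds):
        self.uptime += seconds
        if self.uptime >= ORPHAN_RETENTION_SECONDS:
            self.orphans.clear()  # orphaned chunks finally removed

cs = ChunkserverModel()
cs.orphans.add("chunk_0000000000000001_00000001")
cs.tick(6 * 24 * 3600)  # 6 days of uptime: orphan kept
cs.restart()            # shut down before day 7: counter resets
cs.tick(6 * 24 * 3600)  # another 6 days: still kept
cs.tick(1 * 24 * 3600)  # day 7 of continuous uptime: removed
print(len(cs.orphans))  # -> 0
```

This models why the reporter's chunks may have survived months of calendar time: intermittent boots within each 7-day window keep resetting the counter.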

@chogata chogata added the data safety Tag issues and questions regarding potential data safety issues. Improve existing documentation. label Aug 10, 2024
@onlyjob
Copy link
Contributor Author
onlyjob commented Aug 10, 2024

Thanks for the quick reply. I also hope that it is not a bug, but I'm not convinced yet.

As I've said, a few months passed, but it seems that orphaned chunks were not deleted even after CS_DAYS_TO_REMOVE_UNUSED. (I'm not actually absolutely sure about that, as that server could have been temporarily booted within the CS_DAYS_TO_REMOVE_UNUSED period, thereby resetting it.)

I've also explicitly removed chunkserver in question from "Servers" tab in web interface.

What if I cannot guarantee continuous availability of a chunkserver for 7 days? How do I ensure that orphaned chunks are deleted?
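For a chunkserver that is being permanently decommissioned, one possible workaround (a sketch, not an official MooseFS procedure) is to locate the leftover chunk files on disk yourself: chunkservers store chunks as `chunk_<id>_<version>.mfs` files under the paths listed in mfshdd.cfg. The path below is an example, and this assumes the server will never rejoin the cluster.

```shell
#!/bin/sh
# Sketch only: LIST (do not delete) leftover chunk files on a chunkserver
# that has been permanently removed from the cluster.
# /mnt/mfschunks1 is an example path; check mfshdd.cfg for the real ones.
CHUNK_DIR="${CHUNK_DIR:-/mnt/mfschunks1}"
find "$CHUNK_DIR" -type f -name 'chunk_*.mfs' -print 2>/dev/null || true
```

Review the listing carefully before deleting anything; on a server that might reconnect, manual deletion defeats the safety window described above.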

@chogata
Copy link
Member
chogata commented Aug 13, 2024

CS_DAYS_TO_REMOVE_UNUSED tells the master how many days it should keep an inactive chunkserver on the server list before removing it permanently. It has nothing to do with how long a chunkserver will keep orphaned chunks. You can also remove a disconnected chunkserver from the list via the CGI sooner than the value of CS_DAYS_TO_REMOVE_UNUSED, and again, that has nothing to do with chunks.
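For reference, CS_DAYS_TO_REMOVE_UNUSED lives in mfsmaster.cfg on the master server; a sketch of the entry (the value shown is an example, not a recommendation - check your own config and the mfsmaster.cfg man page):

```ini
# mfsmaster.cfg (on the master server)
# Days the master keeps a disconnected chunkserver on its server list
# before dropping it permanently. Per the comment above, this does NOT
# affect how long a chunkserver keeps orphaned chunks on disk.
CS_DAYS_TO_REMOVE_UNUSED = 7
```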

When a chunk server - no matter whether it is on the list or not - connects to a master and presents a correct metaid (meaning the chunk server was connected to this master in the past and its chunks actually belong to it), the master will start accepting chunk ids sent by the chunk server (this is called the "chunk registration process"). If, during this process, the chunk server sends a chunk id that the master thinks should not exist, the master will reply that it is not interested in this chunk, and the chunk server will flag it as orphaned - and remove it after 7 days.
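The registration handshake described above can be sketched as a toy model (hypothetical names; the real protocol lives in the MooseFS sources, not here):

```python
# Toy model of the chunk registration process described above.
# Names and structure are hypothetical, not the MooseFS protocol code.

class Master:
    def __init__(self, metaid, known_chunks):
        self.metaid = metaid
        self.known_chunks = set(known_chunks)

    def accepts(self, cs_metaid):
        # registration proceeds only if the chunkserver's metaid matches,
        # i.e. it was connected to this master in the past
        return cs_metaid == self.metaid

    def wants(self, chunk_id):
        # the master keeps chunk ids it knows about, rejects the rest
        return chunk_id in self.known_chunks

def register(master, cs_metaid, cs_chunks):
    """Return the chunk ids the chunkserver must flag as orphaned."""
    if not master.accepts(cs_metaid):
        raise RuntimeError("metaid mismatch: chunks belong to another cluster")
    return {c for c in cs_chunks if not master.wants(c)}

m = Master(metaid=0xABCD, known_chunks={1, 2, 3})
orphaned = register(m, 0xABCD, {1, 2, 3, 99})
print(sorted(orphaned))  # -> [99]; chunk 99 is flagged, removed after 7 days
```

In this model the orphan set is exactly the difference between what the chunkserver holds and what the master still tracks, which matches the behaviour the reporter observed after the long offline period.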

We never thought to add any way of shortening this period because, frankly, this should not happen - there should not be any orphaned chunks unless there is an emergency situation. And in an emergency we want to keep them long enough that if an admin realises some important data is missing, there is still time to react and try to recover the data manually. Hence the quite arbitrary 7 days.

We never expected any use cases where users would disconnect entire servers, take them offline, and then bring them back online after prolonged periods of time, with old chunk data still on them.

But we see a few people doing just that, and while MooseFS will behave safely in such cases, it may sometimes not behave as a user would expect :)

Anyway, we can add a feature request to our list: a config setting that will allow adjusting the retention time for orphaned chunks, and also reporting their existence, similar to what we have for duplicates.
