File deletion doesn't properly clean up database entries, causing issues with re-uploads #7181

Open
2 tasks
eml-henn opened this issue Nov 21, 2024 · 2 comments

eml-henn commented Nov 21, 2024

Bug Report


Installation Method

Kubernetes on Azure Kubernetes Service

Environment

  • Open WebUI Version: 0.4.0

  • Operating System: AKSUbuntu-2204

  • Browser: Firefox 132.0.2

Confirmation:

  • [X] I have read and followed all the instructions provided in the README.md.
  • [ ] I am on the latest version of both Open WebUI and Ollama.
  • [ ] I have included the browser console logs.
  • [X] I have included the Docker container logs.
  • [X] I have provided the exact steps to reproduce the bug in the "Steps to Reproduce" section below.

Expected Behavior:

  1. File deletion from a knowledge base should remove both the UI entry and the corresponding database content
  2. The operation should provide feedback about its success/failure
  3. Re-uploading a previously deleted file should work as if it were a new file

Actual Behavior:

  1. File deletion sometimes only removes the file from the frontend FilesTable
  2. No feedback is provided about whether the database deletion was successful
  3. The vector database can retain the old entries even after file deletion
  4. Attempting to re-upload the same file results in a "duplicate content" error

Description

Bug Summary:
When deleting a processed file from a knowledge base through the frontend, the file appears to be removed from the UI but its content sometimes remains in the vector database. This creates issues when trying to re-upload the same file, as the system detects it as duplicate content.
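
To make the suspected mechanism concrete, here is a minimal toy model, not Open WebUI code. It assumes duplicate detection is keyed on a content hash (SHA-256 is an assumption) and that the buggy delete path removes only the file record. The names `ToyKnowledgeBase`, `remove_file_ui_only`, and `DuplicateContentError` are hypothetical.

```python
import hashlib


class DuplicateContentError(ValueError):
    """Hypothetical stand-in for the "Duplicate content detected" error."""


class ToyKnowledgeBase:
    """Toy model: a file listing plus a hash index standing in for the vector DB."""

    def __init__(self):
        self.files = {}          # file_id -> content (what the UI lists)
        self.vector_hashes = {}  # content hash -> file_id (vector DB stand-in)

    def add_file(self, file_id, content):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self.vector_hashes:
            raise DuplicateContentError("Duplicate content detected.")
        self.vector_hashes[digest] = file_id
        self.files[file_id] = content

    def remove_file_ui_only(self, file_id):
        # Models the suspected bug: the listing is updated,
        # but the hash entry in the vector store is left behind.
        self.files.pop(file_id, None)


kb = ToyKnowledgeBase()
kb.add_file("file-1", "hello world")
kb.remove_file_ui_only("file-1")

try:
    kb.add_file("file-2", "hello world")  # re-upload of the same content
    reupload_failed = False
except DuplicateContentError:
    reupload_failed = True
```

In this model the file no longer appears in `kb.files`, yet the re-upload is rejected because the stale hash is still indexed, which matches the behavior described above.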

Reproduction Details

Frustratingly, this is an intermittent ("sometimes") error, so my first proposal is to add logging to make it easier to reproduce.

Steps to Reproduce:

  1. Upload a file to a knowledge base
  2. Delete the file using the remove button in the FilesTable
  3. Try to upload the same file again
  4. Observe the "duplicate content" error.

Logs and Screenshots

Docker Container Logs:
// When removing a file:
INFO: 10.1.1.4:0 - "POST /api/v1/knowledge/{id}/file/remove HTTP/1.1" 200 OK

// When trying to re-upload (error due to remaining content):
INFO: 10.1.1.4:0 - "GET /api/v1/knowledge/{id} HTTP/1.1" 200 OK
INFO [open_webui.apps.webui.routers.files] file.content_type: text/plain
INFO [open_webui.apps.retrieval.main] save_docs_to_vector_db: document {file} {file_collection_id}
INFO [open_webui.apps.retrieval.main] adding to collection {file_collection_id}
Collection {file_collection_id} does not exist.
INFO: 10.1.1.4:0 - "POST /api/v1/files/ HTTP/1.1" 200 OK
INFO [open_webui.apps.retrieval.main] save_docs_to_vector_db: document {file} {id}
INFO [open_webui.apps.retrieval.main] Document with hash {file hash} already exists
ERROR [open_webui.apps.retrieval.main] Duplicate content detected. Please provide unique content to proceed.
Traceback (most recent call last):
  File "/app/backend/open_webui/apps/retrieval/main.py", line 1001, in process_file
    raise e
  File "/app/backend/open_webui/apps/retrieval/main.py", line 975, in process_file
    result = save_docs_to_vector_db(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/backend/open_webui/apps/retrieval/main.py", line 759, in save_docs_to_vector_db
    raise ValueError(ERROR_MESSAGES.DUPLICATE_CONTENT)
ValueError: Duplicate content detected. Please provide unique content to proceed.
INFO: 10.1.1.4:0 - "POST /api/v1/knowledge/{id}/file/add HTTP/1.1" 400 Bad Request

// Compare with Logs when adding a file:
INFO: 10.1.1.222:0 - "POST /api/v1/files/ HTTP/1.1" 200 OK
INFO [open_webui.apps.webui.routers.files] file.content_type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
INFO [open_webui.apps.retrieval.main] save_docs_to_vector_db: document {filename} {collection_id}
INFO [open_webui.apps.retrieval.main] collection {collection_id} already exists
INFO [open_webui.apps.retrieval.main] adding to collection {collection_id}

Additional Information

The issue appears to be in the file removal endpoint (@router.post("/{id}/file/remove")) in knowledge.py. Currently:

  • The deletion operation doesn't verify if the vector database cleanup was successful
  • The frontend updates regardless of the backend operation's success
  • The deletion doesn't give feedback on the new state of the collection

Proposed Solution

  1. Add verification of vector database deletion
  2. Add error handling and user feedback
  3. Only update the frontend UI after confirmed successful deletion
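
A minimal sketch of what "delete, then verify" could look like, under stated assumptions: the store interface (`delete_by_file_id`, `count_by_file_id`) and the `remove_file` helper are hypothetical and do not correspond to the actual Open WebUI or vector DB client API. The returned dictionary is what a frontend could inspect before updating the UI.

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger(__name__)


@dataclass
class InMemoryVectorStore:
    """Hypothetical vector store: entry_id -> metadata dict with a "file_id" key."""
    entries: dict = field(default_factory=dict)

    def delete_by_file_id(self, file_id):
        stale = [k for k, meta in self.entries.items() if meta["file_id"] == file_id]
        for key in stale:
            del self.entries[key]

    def count_by_file_id(self, file_id):
        return sum(1 for meta in self.entries.values() if meta["file_id"] == file_id)


class LeakyVectorStore(InMemoryVectorStore):
    """Simulates the failure mode: delete silently does nothing."""

    def delete_by_file_id(self, file_id):
        pass


def remove_file(store, file_id):
    """Delete a file's entries, then verify that nothing is left behind."""
    store.delete_by_file_id(file_id)
    remaining = store.count_by_file_id(file_id)
    if remaining:
        log.warning("file %s: %d vector entries remain after delete", file_id, remaining)
    return {"ok": remaining == 0, "file_id": file_id, "remaining_entries": remaining}


def sample_entries():
    return {"e1": {"file_id": "f1"}, "e2": {"file_id": "f1"}, "e3": {"file_id": "f2"}}


ok_result = remove_file(InMemoryVectorStore(sample_entries()), "f1")
bad_result = remove_file(LeakyVectorStore(sample_entries()), "f1")
```

With the well-behaved store the result reports `ok: True`; with the leaky store it reports `ok: False` and two remaining entries, so the caller can surface an error instead of silently removing the row from the FilesTable.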
@tjbck
Contributor
tjbck commented Nov 22, 2024

Would love to investigate more, but we'll need a more reliable way to reproduce the issue. Please continue troubleshooting and keep us posted!

@sreinwald

I just ran across this issue as well, and I can reproduce it 100% of the time using the API on v0.4.7, deployed via Docker Compose.

Steps to reproduce:

  • Reset Vector Storage/Knowledge
  • Upload file via API
  • Delete file via API
  • Open chroma.sqlite3 with a SQLite browser and run:
sqlite> select * from embedding_metadata;

In my specific example:

curl -X POST $HOST'/api/v1/files/' \
  --header 'Authorization: Bearer '$API_KEY \
  --header 'Accept: application/json' \
  -F 'file=@/home/sre/foo.txt;type=text/plain'
{"id":"6052a7df-99aa-4570-b5f5-1f06a54acddf","user_id":"126aaeba-bc1c-48b5-b47c-c9dc9a5394a1","hash":"ee1abdd5d09f7426d4950b928c7a73ba28e5085b06c7d909790a36891348ee29","filename":"foo.txt","data":{"content":""},"meta":{"name":"foo.txt","content_type":"text/plain","size":1491,"collection_name":"file-6052a7df-99aa-4570-b5f5-1f06a54acddf"},"created_at":1733137414,"updated_at":1733137414}

curl -X DELETE $HOST'/api/v1/files/6052a7df-99aa-4570-b5f5-1f06a54acddf' \
  --header 'Authorization: Bearer '$API_KEY
{"message":"File deleted successfully"}
root@ai /var/lib/docker/volumes/open-webui_open-webui/_data/vector_db # sqlite3 chroma.sqlite3
SQLite version 3.45.1 2024-01-30 16:01:20                                               

Enter ".help" for usage hints.
sqlite> select * from embedding_metadata;
1|source|foo.txt|||
1|name|foo.txt|||
1|created_by|126aaeba-bc1c-48b5-b47c-c9dc9a5394a1|||
1|file_id|6052a7df-99aa-4570-b5f5-1f06a54acddf|||
1|start_index||0||
1|hash|ee1abdd5d09f7426d4950b928c7a73ba28e5085b06c7d909790a36891348ee29|||
1|embedding_config|{"engine": "ollama", "model": "bge-m3:latest"}|||
1|chroma:document|{content}|||
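
The same check can be scripted. The sketch below uses Python's standard `sqlite3` module to list metadata rows that still reference a given file ID. The column names (`id`, `key`, `string_value`, and so on) are an assumption inferred from the six-field output above, and `leftover_rows` and `demo` are hypothetical helpers; the demo runs against an in-memory table rather than a real Chroma database.

```python
import sqlite3


def leftover_rows(conn, file_id):
    """Return metadata rows of every embedding still tagged with file_id."""
    query = """
        SELECT id, "key", string_value
        FROM embedding_metadata
        WHERE id IN (
            SELECT id FROM embedding_metadata
            WHERE "key" = 'file_id' AND string_value = ?
        )
        ORDER BY id, "key"
    """
    return conn.execute(query, (file_id,)).fetchall()


def demo():
    # In-memory table mimicking the assumed layout of embedding_metadata.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE embedding_metadata ('
        'id INTEGER, "key" TEXT, string_value TEXT, '
        'int_value INTEGER, float_value REAL, bool_value INTEGER)'
    )
    conn.executemany(
        'INSERT INTO embedding_metadata (id, "key", string_value) VALUES (?, ?, ?)',
        [
            (1, "name", "foo.txt"),
            (1, "file_id", "6052a7df-99aa-4570-b5f5-1f06a54acddf"),
            (2, "file_id", "some-other-file"),
        ],
    )
    rows = leftover_rows(conn, "6052a7df-99aa-4570-b5f5-1f06a54acddf")
    conn.close()
    return rows


rows = demo()
```

Against a real deployment, one would instead connect to the `chroma.sqlite3` file shown above after calling the delete endpoint; a non-empty result would indicate that the vector entries survived the deletion.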

The issue with re-uploads very much seems to be related to the issue above, and I can reproduce it 100% with these steps using the API directly:

  • Reset Vector DB
  • Create a knowledge base
  • Upload a file
  • Add the file to the knowledge base
  • Delete the file
  • Upload the file again
  • Trying to add the file to the knowledge base will now fail with 400: Duplicate content detected
