Additional bindings logic by shabani1 · Pull Request #12 · lexy-ai/lexy · GitHub

Additional bindings logic #12


Merged

shabani1 merged 2 commits into main from additional-bindings-logic on Oct 18, 2023

Conversation

shabani1
Contributor

What

This PR adds logic for processing a new binding.

  • When a new binding is added, tasks are generated for each applicable document (sketched below).
  • Task outputs are saved using a DB task that does not hard-code index field names.
  • This PR does not address the issue of serialization for embeddings.
    • There are currently two DB tasks that need to be combined into one:
      • save_result_to_index works only for embedding fields
      • save_records_to_index works only for non-embedding fields
    • Will address this issue in the next PR.
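A minimal sketch of the flow described above, with hypothetical names throughout (`process_new_binding`, `get_documents`, `matches_filters`, and `run_transformer` are illustrative assumptions, not the repo's actual API; only `save_records_to_index` is named in this PR):

```python
# Hypothetical sketch of the new-binding flow; helper names are assumed.
def process_new_binding(binding):
    """Generate one transformer task per applicable document."""
    documents = get_documents(collection_id=binding.collection_id)
    applicable = [d for d in documents if matches_filters(d, binding.filters)]
    tasks = []
    for doc in applicable:
        # Run the binding's transformer on the document, then chain a DB
        # task that saves the result to the binding's index table.
        task = run_transformer.apply_async(
            args=[doc.content],
            kwargs=binding.transformer_params,
            link=save_records_to_index.s(index_id=binding.index_id),
        )
        tasks.append(task)
    return tasks
```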

Test plan

  • Create a new transformer with the following payload (a hypothetical sketch of this transformer appears after the test plan):
```json
{
  "transformer_id": "text.counter.word_counter",
  "path": "lexy.transformers.counter.word_counter",
  "description": "Returns count of words and the longest word"
}
```
  • Create a new index with the following payload:
```json
{
  "index_id": "word_counts",
  "description": "Word counts",
  "index_table_schema": {},
  "index_fields": {
      "word_count": {"type": "int"},
      "longest_word": {"type": "string", "optional": true}
  }
}
```
  • Ensure that the index table `zzidx__word_counts` has been created
    • This is a manual step right now; run `lexy.core.events.create_new_index_table('word_counts')`
  • Create a new binding with the following payload:
```json
{
  "collection_id": "default",
  "transformer_id": "text.counter.word_counter",
  "index_id": "word_counts",
  "description": "New binding for word counts",
  "execution_params": {},
  "transformer_params": {},
  "filters": {}
}
```

The test should produce tasks for each document in the default collection using the `word_counter` transformer, and save the results in the index table `zzidx__word_counts`.
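For reference, a minimal sketch of what the `word_counter` transformer might look like, based only on its description ("Returns count of words and the longest word"); this is an illustration, not the actual code at `lexy.transformers.counter.word_counter`:

```python
def word_counter(text: str) -> tuple[int, str]:
    """Hypothetical sketch: return the word count and the longest word."""
    words = text.split()
    if not words:
        return 0, ""
    return len(words), max(words, key=len)
```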

shabani1 requested a review from jnnnthnn on October 17, 2023
shabani1 merged commit 873a123 into main on Oct 18, 2023
shabani1 deleted the additional-bindings-logic branch on October 18, 2023
shabani1 added a commit that referenced this pull request on Oct 18, 2023
# What

This is a follow-up to #12, allowing the use of `save_results_to_index` instead of having to use two different DB tasks to save embeddings and non-embeddings.

- It uses a very inefficient conversion of NumPy arrays to lists.
    - Will update this when updating to Pydantic 2.0.
- It requires the use of `text_embedding_transformer` instead of `text_embedding`; the former simply returns the result of the latter as `{'embedding': result}` (sketched below).
- Need to create a decorator to wrap output with column names as part of a future PR.
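A minimal sketch of the wrapping and serialization described above (hypothetical; it assumes `text_embedding` returns a NumPy array, and the actual repo code may differ):

```python
import numpy as np

def text_embedding_transformer(text: str) -> dict:
    """Hypothetical sketch: wrap text_embedding's output under an
    'embedding' key, converting NumPy arrays to plain lists so the
    result can be serialized by a single DB task."""
    result = text_embedding(text)  # assumed to return np.ndarray
    if isinstance(result, np.ndarray):
        result = result.tolist()   # the "very inefficient" conversion
    return {"embedding": result}
```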

# Test plan

Following the test plan of #12, adding a new document to the default
collection should now generate two tasks, one for text embedding and one
for word counts.
shabani1 added a commit that referenced this pull request on Oct 22, 2023
# What

This PR adds the `lexy_transformer` decorator. The decorator can be
imported and applied as follows.

```python
import torch

from lexy.transformers import lexy_transformer

@lexy_transformer(name="text.embeddings.minilm")
def text_embeddings(sentences: list[str]) -> torch.Tensor:
    ...
```

When applied to a function, the decorator will:
* Register the function as a celery shared task with the name `lexy.transformers.{name}`
* Add the kwarg `lexy_index_fields` to the function signature:
    * When run without the argument, the decorated function behaves as it normally would.
    * With the argument, the decorated function returns its output with the labels specified in `lexy_index_fields` (see example below).

```python
@lexy_transformer(name="add_and_subtract")
def add_and_subtract(a, b):
    return a + b, a - b

add_and_subtract(5, 3)
# returns (8, 2)

add_and_subtract(5, 3, lexy_index_fields=["sum", "difference"])
# returns [{'sum': 8, 'difference': 2}]
```
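A minimal sketch of how such a decorator could be implemented (an illustration of the behavior described above, not the repo's actual implementation; the tuple-zipping logic is inferred from the `add_and_subtract` example):

```python
import functools
from celery import shared_task

def lexy_transformer(name: str):
    """Hypothetical sketch of the decorator."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, lexy_index_fields=None, **kwargs):
            result = fn(*args, **kwargs)
            if lexy_index_fields is None:
                # Without the kwarg, behave exactly like the original function
                return result
            # With the kwarg, pair each returned value with its field label
            values = result if isinstance(result, tuple) else (result,)
            return [dict(zip(lexy_index_fields, values))]
        # Register under the namespaced celery task name
        return shared_task(name=f"lexy.transformers.{name}")(wrapper)
    return decorator
```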

# Test plan

Similar to the test plan for #12, though bindings need the additional
keyword argument `lexy_index_fields`.

- Create a new transformer with the following payload:
```json
{
  "transformer_id": "text.counter.word_counter",
  "path": "lexy.transformers.counter.word_counter",
  "description": "Returns count of words and the longest word"
}
```
- Create a new index with the following payload:
```json
{
  "index_id": "word_counts",
  "description": "Word counts",
  "index_table_schema": {},
  "index_fields": {
      "word_count": {"type": "int"}, 
      "longest_word": {"type": "string", "optional": true}
  }
}
```
- Ensure that the index table `zzidx__word_counts` has been created
    - This is a manual step right now; run `lexy.core.events.create_new_index_table('word_counts')`
- Create a new binding with the following payload:
```json
{
  "collection_id": "default",
  "transformer_id": "text.counter.word_counter",
  "index_id": "word_counts",
  "description": "New binding for word counts",
  "execution_params": {},
  "transformer_params": {
    "lexy_index_fields": ["word_count", "longest_word"]
  },
  "filters": {}
}
```

The test should produce tasks for each document in the default
collection using the `word_counter` transformer, and save the results in
the index table `zzidx__word_counts`.

- Finally, the existing binding for `default_text_embeddings` (i.e.,
binding with `id=1`) should be patched with the following payload:
```json
{
  "transformer_params": {
    "lexy_index_fields": ["embedding"]
  }
}
```

Now, any new documents added to the default collection should trigger
two jobs, one for each binding.