The following README will guide you through the whole process of Entity Matching using pyJedAI.
💡 Tip: Find JSON examples here.
💡 Tip: If you want to learn more about pyJedAI, read the docs here.
For every key in the `inputs` JSON, exactly one file path must be provided.
Attributes | Info | Value Type | Required |
---|---|---|---|
dataset_1 | .csv format | list | ✔ |
dataset_2 | .csv format | list | |
ground_truth | .csv format | list | |
embeddings_dataset_1 | Used for loading embeddings in `EmbeddingsNNWorkflow`; .npy format | list | |
embeddings_dataset_2 | Used for loading embeddings in `EmbeddingsNNWorkflow`; .npy format | list | |
{
  "inputs": {
    "dataset_1": [
      "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
    ],
    "dataset_2": [
      "cb37e262-a606-4d82-9712-b80e8f4d723d"
    ],
    "ground_truth": [
      "db006da0-16ed-4ef5-bf1e-d142488d533e"
    ]
  }
}
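The rule above (one entry per key, with only `dataset_1` required) can be checked before a run is started. The following is a minimal sketch and not part of pyJedAI: the function name `validate_inputs` is hypothetical, and the key sets are simply copied from the table.

```python
import json

# Keys listed in the inputs table; only dataset_1 is marked as required.
ALLOWED_KEYS = {
    "dataset_1", "dataset_2", "ground_truth",
    "embeddings_dataset_1", "embeddings_dataset_2",
}
REQUIRED_KEYS = {"dataset_1"}


def validate_inputs(inputs: dict) -> list:
    """Return a list of problems found in an `inputs` object (empty if none)."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(inputs)):
        problems.append(f"missing required key: {key}")
    for key, value in inputs.items():
        if key not in ALLOWED_KEYS:
            problems.append(f"unknown key: {key}")
        elif not isinstance(value, list) or len(value) != 1:
            problems.append(f"{key} must be a list with exactly one entry")
    return problems


config = json.loads("""
{
  "inputs": {
    "dataset_1": ["d5e730ba-c1d5-4ec1-ae95-88a637204c19"],
    "ground_truth": ["db006da0-16ed-4ef5-bf1e-d142488d533e"]
  }
}
""")
print(validate_inputs(config["inputs"]))  # prints []
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in a single pass.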
💡 Tip: If `dataset_2` is provided, matches will only be of type (e_1, e_2), where e_1 is an entity in `dataset_1` and e_2 is an entity in `dataset_2`.
💡 Tip: If `ground_truth` is provided, metrics will be returned.
Concerning the inputs, additional information must be provided.
Attributes | Info | Value Type | Required |
---|---|---|---|
dataset_1 | Provide info for dataset to be processed correctly | dataset_object | ✔ |
dataset_2 | Provide info for dataset to be processed correctly | dataset_object | |
ground_truth | Provide info for dataset to be processed correctly | ground_truth_object | |
workflow | Select your preferred workflow: `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow`, or `JoinWorkflow` | string | ✔ |
block_building | Block building method and parameters; used only for `BlockingBasedWorkflow` and `EmbeddingsNNWorkflow` | block_building_object | ✔ |
block_cleaning | Block cleaning method and parameters; used only for `BlockingBasedWorkflow`. More than one block_cleaning method can be used | block_cleaning_object or list of block_cleaning_object | |
comparison_cleaning | Comparison cleaning method and parameters; used only for `BlockingBasedWorkflow` | comparison-cleaning-object | |
entity_matching | Entity Matching method and parameters; used only for `BlockingBasedWorkflow` | entity-matching-object | ✔ |
clustering | Clustering method and parameters; used for `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow`, or `JoinWorkflow` | clustering-object | |
join | Join method and parameters; used only for `JoinWorkflow` | join-object | ✔ |
💡 Tip: `JoinWorkflow` does not contain a `block_building` step.
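Because each step applies only to certain workflows, it can help to check a `parameters` object against the table before running anything. The sketch below is one reading of the table, not pyJedAI's own schema: the `WORKFLOW_STEPS` mapping and the function name `check_parameters` are assumptions made for illustration, and a step marked as required is treated as required only for the workflows that use it.

```python
# Steps each workflow requires or accepts, as read from the table above.
# This mapping is an interpretation of the table, not pyJedAI's own schema.
WORKFLOW_STEPS = {
    "BlockingBasedWorkflow": {
        "required": {"block_building", "entity_matching"},
        "optional": {"block_cleaning", "comparison_cleaning", "clustering"},
    },
    "EmbeddingsNNWorkflow": {
        "required": {"block_building"},
        "optional": {"clustering"},
    },
    "JoinWorkflow": {
        "required": {"join"},
        "optional": {"clustering"},
    },
}
# Keys that describe data or select the workflow rather than configure a step.
NON_STEP_KEYS = {"dataset_1", "dataset_2", "ground_truth", "workflow"}


def check_parameters(parameters: dict) -> list:
    """Return a list of step-related problems (empty if none)."""
    workflow = parameters.get("workflow")
    if workflow not in WORKFLOW_STEPS:
        return [f"unknown or missing workflow: {workflow!r}"]
    steps = WORKFLOW_STEPS[workflow]
    present = set(parameters) - NON_STEP_KEYS
    problems = [
        f"{workflow} requires {step}"
        for step in sorted(steps["required"] - present)
    ]
    allowed = steps["required"] | steps["optional"]
    problems += [
        f"{workflow} does not use {step}"
        for step in sorted(present - allowed)
    ]
    return problems


print(check_parameters({"workflow": "JoinWorkflow", "block_building": {}}))
# prints ['JoinWorkflow requires join', 'JoinWorkflow does not use block_building']
```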
Attributes of keys: `dataset_1`, `dataset_2`
Attributes | Info | Value Type | Required |
---|---|---|---|
separator | Character separating values in csv | char | ✔ |
id_column_name | Name of Dataset's id column | string | ✔ |
dataset_name | Name of Dataset | string | |
attributes | Columns to be used for matching | list | |
Attributes of key: ground_truth
Attributes | Info | Value Type | Required |
---|---|---|---|
separator | Character separating values in csv | char | ✔ |
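To make the role of `separator`, `id_column_name`, and `attributes` concrete, here is a minimal sketch that reads a delimited file using only Python's standard `csv` module. It does not show how pyJedAI itself loads data; the function name `load_dataset` and the sample rows are made up for illustration.

```python
import csv
import io


def load_dataset(handle, separator, id_column_name,
                 dataset_name=None, attributes=None):
    """Read a delimited file into {entity id: {column: value}}.

    dataset_name is accepted to mirror the table but is not used here.
    """
    reader = csv.DictReader(handle, delimiter=separator)
    entities = {}
    for row in reader:
        entity_id = row[id_column_name]
        # Keep only the requested columns; default to every non-id column.
        columns = attributes or [c for c in row if c != id_column_name]
        entities[entity_id] = {c: row[c] for c in columns}
    return entities


# Inline sample data standing in for a real .csv file.
sample = io.StringIO(
    "id|name|price\n"
    "0|Sony Turntable|99.0\n"
    "1|Bose Speakers|399.0\n"
)
dataset_1 = {"separator": "|", "id_column_name": "id", "attributes": ["name"]}
print(load_dataset(sample, **dataset_1))
# prints {'0': {'name': 'Sony Turntable'}, '1': {'name': 'Bose Speakers'}}
```

Passing the dataset object as keyword arguments keeps the function signature aligned with the attribute names in the table.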
Input Examples
"parameters": {
  "dataset_1": {
    "separator": "|",
    "id_column_name": "id",
    "dataset_name": "abt"
  },
  "dataset_2": {
    "separator": "|",
    "id_column_name": "id",
    "dataset_name": "buy"
  },
  "ground_truth": {
    "separator": "|"
  },
  "workflow": "BlockingBasedWorkflow",
  "block_building": {
    "method": "StandardBlocking",
    "attributes_1": ["name"],
    "attributes_2": ["first_name"]
  },
  "block_cleaning": [
    {
      "method": "BlockFiltering",
      "params": { "ratio": 0.7 }
    }
  ],
  "comparison_cleaning": {
    "method": "BLAST"
  },
  "entity_matching": {
    "method": "EntityMatching",
    "params": {
      "similarity_threshold": 0.8
    }
  },
  "clustering": {
    "method": "UniqueMappingClustering",
    "params": {
      "similarity_threshold": 0.1
    }
  }
}

"parameters": {
  "workflow": "EmbeddingsNNWorkflow",
  "block_building": {
    "method": "EmbeddingsNNBlockBuilding",
    "params": {
      "vectorizer": "st5"
    }
  },
  "clustering": {
    "method": "UniqueMappingClustering",
    "params": {
      "similarity_threshold": 0.4
    }
  }
  ....
}

"parameters": {
  "workflow": "JoinWorkflow",
  "join":
  {
    "method": "TopKJoin",
    "params": {
      "metrics": "cosine",
      "tokenization": "qgrams",
      "reverse_order": "False"
    },
    "attributes_1": ["name"],
    "attributes_2": ["name"]
  },
  "clustering": {
    "method": "UniqueMappingClustering",
    "params": {
      "similarity_threshold": 0.4
    }
  }
  ....
}
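The parameters table allows `block_cleaning` to be either a single object or a list of objects, and the first example uses the list form. Code that consumes the configuration can normalize both shapes up front. This is a small sketch under that assumption; the helper name `as_step_list` is hypothetical and not part of pyJedAI.

```python
def as_step_list(step):
    """Return a list of step objects from None, a single object, or a list."""
    if step is None:
        return []
    if isinstance(step, dict):
        return [step]
    return list(step)


parameters = {
    "block_cleaning": {"method": "BlockFiltering", "params": {"ratio": 0.7}},
}
for step in as_step_list(parameters.get("block_cleaning")):
    # "params" may be absent, as in the comparison_cleaning entry of the
    # first example, so fall back to an empty dict.
    print(step["method"], step.get("params", {}))
# prints: BlockFiltering {'ratio': 0.7}
```

Normalizing once means the rest of the code can always iterate over a list, whichever form the user supplied.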
For every key in the `outputs` JSON, exactly one file path must be provided.
Attributes | Info | Value Type | Required |
---|---|---|---|
metrics | Creates a file with F1, Recall, and Precision metrics if ground truth exists; .csv format | path | |
pairs | Creates a file with the ids of pairs; .csv format | path | |
entities | Creates a file with all the matched entities; .csv format | list | |
{
  "outputs": {
    "metrics": "s3://klms-bucket/pyjedai-output/metrics.csv",
    "pairs": "s3://klms-bucket/pyjedai-output/pairs.csv",
    "entities": "s3://klms-bucket/pyjedai-output/entities_df.csv"
  }
}
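For readers who want to sanity-check a `metrics` file against a `pairs` file, the three reported measures can be recomputed from the matched pairs and the ground truth. The sketch below uses the standard definitions of precision, recall, and F1. The function name `pair_metrics`, the sample ids, and the CSV column layout are assumptions, not the tool's actual output format, and each pair is treated as an ordered (e_1, e_2) tuple.

```python
import csv
import io


def pair_metrics(predicted, ground_truth):
    """Precision, recall and F1 of predicted pairs against ground-truth pairs."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    # F1 is the harmonic mean; define it as 0 when there are no true positives.
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return {"F1": f1, "Recall": recall, "Precision": precision}


# Made-up ids, for illustration only.
predicted = [("a1", "b1"), ("a2", "b2"), ("a3", "b9"), ("a4", "b4")]
ground_truth = [("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")]
metrics = pair_metrics(predicted, ground_truth)

# Write to an in-memory buffer; a real run would target the "metrics" path.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(metrics))
writer.writeheader()
writer.writerow(metrics)
print(buffer.getvalue())
```

With three of the four predicted pairs correct and three of the four true pairs found, all three measures come out to 0.75 for this sample.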