The following README will guide you through the whole process of Schema Matching using pyJedAI.
💡 Tip: Find JSON examples here.
💡 Tip: If you want to learn more about pyJedAI read the docs here.
For all key attributes in JSON, exactly one file path must be provided.
Attributes | Info | Value Type | Required |
---|---|---|---|
dataset_1 | .csv format | list | ✔ |
dataset_2 | .csv format | list | ✔ |
ground_truth | .csv or .json format. A JSON file must be a list | list | |
embeddings_dataset_1 | Used for loading embeddings in EmbeddingsNNWorkflow. .npy format | list | |
embeddings_dataset_2 | Used for loading embeddings in EmbeddingsNNWorkflow. .npy format | list | |
{
  "inputs": {
    "dataset_1": [
      "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
    ],
    "dataset_2": [
      "cb37e262-a606-4d82-9712-b80e8f4d723d"
    ],
    "ground_truth": [
      "db006da0-16ed-4ef5-bf1e-d142488d533e"
    ]
  }
}
💡 Tip: If ground_truth is provided, metrics will be returned.
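The rules in the table above (two required keys, every value a list holding exactly one entry) can be checked before a job is submitted. The following is a minimal sketch, not part of pyJedAI: the function name `validate_inputs` and the wording of the messages are assumptions made for illustration.

```python
# Hypothetical helper (not a pyJedAI API): checks the "inputs" section
# against the rules listed in the table above.
REQUIRED_INPUTS = ("dataset_1", "dataset_2")
OPTIONAL_INPUTS = ("ground_truth", "embeddings_dataset_1", "embeddings_dataset_2")


def validate_inputs(inputs: dict) -> list:
    """Return a list of problems found in the "inputs" section (empty if valid)."""
    problems = []
    for key in REQUIRED_INPUTS:
        if key not in inputs:
            problems.append(f"missing required key: {key}")
    for key, value in inputs.items():
        if key not in REQUIRED_INPUTS + OPTIONAL_INPUTS:
            problems.append(f"unknown key: {key}")
        elif not isinstance(value, list) or len(value) != 1:
            # "exactly one file path must be provided" for every key
            problems.append(f"{key} must be a list with exactly one entry")
    return problems
```

An empty result means the section follows the table; any other result lists what to fix.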
Concerning the input, additional info must be provided.
Attributes | Info | Value Type | Required |
---|---|---|---|
dataset_1 | Provide info for dataset to be processed correctly | dataset_object | ✔ |
dataset_2 | Provide info for dataset to be processed correctly | dataset_object | |
ground_truth | Provide info for dataset to be processed correctly | ground_truth_object | |
matching_type | content: matching based on rows; composite: matching based on attributes and rows; schema: matching based on attributes | default: schema | |
workflow | Select your preferred workflow: BlockingBasedWorkflow, EmbeddingsNNWorkflow, JoinWorkflow, or ValentineWorkflow | string | ✔ |
block_building | Block building method and parameters. Used only for BlockingBasedWorkflow and EmbeddingsNNWorkflow | block_building_object | ✔ |
block_cleaning | Block cleaning method and parameters. Used only for BlockingBasedWorkflow. More than one block_cleaning method can be used | block_cleaning_object or list of block_cleaning_object | |
comparison_cleaning | Comparison cleaning method and parameters. Used only for BlockingBasedWorkflow | comparison-cleaning-object | |
entity_matching | Entity matching method and parameters. Used only for BlockingBasedWorkflow | entity-matching-object | ✔ |
clustering | Clustering method and parameters. Used only for BlockingBasedWorkflow, EmbeddingsNNWorkflow, or JoinWorkflow | clustering-object | |
join | Join method and parameters. Used only for JoinWorkflow | join-object | ✔ |
valentine_matching | Valentine matching method. Used only for ValentineWorkflow | valentine-object | ✔ |
💡 Tip: JoinWorkflow does not contain a block_building step.
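The table above ties each step to the workflows that use it. A minimal sketch of that mapping as a pre-submission check follows. It is not a pyJedAI API: `STEP_RULES` and `check_steps` are hypothetical names, and the sketch assumes that a ✔ in the Required column applies only to the workflows listed for that step.

```python
# Hypothetical pre-submission check (not a pyJedAI API).
# step -> (workflows that use the step, required for those workflows?)
STEP_RULES = {
    "block_building": ({"BlockingBasedWorkflow", "EmbeddingsNNWorkflow"}, True),
    "block_cleaning": ({"BlockingBasedWorkflow"}, False),
    "comparison_cleaning": ({"BlockingBasedWorkflow"}, False),
    "entity_matching": ({"BlockingBasedWorkflow"}, True),
    "clustering": ({"BlockingBasedWorkflow", "EmbeddingsNNWorkflow", "JoinWorkflow"}, False),
    "join": ({"JoinWorkflow"}, True),
    "valentine_matching": ({"ValentineWorkflow"}, True),
}


def check_steps(parameters: dict) -> list:
    """Return problems with the step keys of a "parameters" section."""
    workflow = parameters.get("workflow")
    known = set().union(*(workflows for workflows, _ in STEP_RULES.values()))
    if workflow not in known:
        return [f"unknown workflow: {workflow!r}"]
    problems = []
    for step, (workflows, required) in STEP_RULES.items():
        present = step in parameters
        if workflow in workflows and required and not present:
            problems.append(f"{step} is required for {workflow}")
        if workflow not in workflows and present:
            problems.append(f"{step} is not used by {workflow}")
    return problems
```

For example, a JoinWorkflow configuration that carries a block_building key but no join key is reported with two problems, one for each rule it breaks.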
Attributes of keys: dataset_1, dataset_2
Attributes | Info | Value Type | Required |
---|---|---|---|
separator | Character separating values in csv | char | ✔ |
dataset_name | Name of dataset | string | |
Attributes of key: ground_truth
Attributes | Info | Value Type | Required |
---|---|---|---|
separator | Character separating values in csv. Must be provided if the file is .csv | char | |
is_json | Set if ground_truth is .json | bool | |
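The two ground_truth attributes select how the file is read: is_json switches to JSON parsing, otherwise separator is needed for CSV. The sketch below illustrates that choice with the Python standard library only. `parse_ground_truth` is a hypothetical helper, not the loader pyJedAI uses, and it returns raw rows because the column layout of the ground truth is not specified here.

```python
import csv
import io
import json


def parse_ground_truth(text: str, options: dict) -> list:
    """Parse ground-truth file content according to the ground_truth attributes.

    Hypothetical helper for illustration; not part of pyJedAI.
    """
    if options.get("is_json", False):
        data = json.loads(text)
        if not isinstance(data, list):
            # The inputs table states that a JSON ground truth must be a list.
            raise ValueError("a JSON ground truth must be a list")
        return data
    separator = options.get("separator")
    if not separator or len(separator) != 1:
        raise ValueError("separator is required for a .csv ground truth")
    return list(csv.reader(io.StringIO(text), delimiter=separator))
```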
Input Examples
"parameters": {
  "dataset_1": {
    "separator": "|",
    "id_column_name": "id",
    "dataset_name": "abt"
  },
  "dataset_2": {
    "separator": "|",
    "id_column_name": "id",
    "dataset_name": "buy"
  },
  "ground_truth": {
    "separator": "|"
  },
  "workflow": "BlockingBasedWorkflow",
  "block_building": {
    "method": "StandardBlocking"
  },
  "block_cleaning": [
    {
      "method": "BlockFiltering",
      "params": { "ratio": 0.7 }
    }
  ],
  "comparison_cleaning": {
    "method": "BLAST"
  },
  "entity_matching": {
    "method": "EntityMatching",
    "params": {
      "similarity_threshold": 0.8
    }
  },
  "clustering": {
    "method": "UniqueMappingClustering",
    "params": {
      "similarity_threshold": 0.1
    }
  },
  "matching_type": "content"
}
"parameters": {
  "workflow": "EmbeddingsNNWorkflow",
  "block_building": {
    "method": "EmbeddingsNNBlockBuilding",
    "params": {
      "vectorizer": "st5"
    }
  },
  "clustering": {
    "method": "UniqueMappingClustering",
    "params": {
      "similarity_threshold": 0.4
    }
  },
  "matching_type": "content"
  ....
}
"parameters": {
  "workflow": "JoinWorkflow",
  "join": {
    "method": "TopKJoin",
    "params": {
      "metrics": "cosine",
      "tokenization": "qgrams",
      "reverse_order": false
    }
  },
  "clustering": {
    "method": "UniqueMappingClustering",
    "params": {
      "similarity_threshold": 0.4
    }
  },
  "matching_type": "content"
  ....
}
"parameters": {
  "workflow": "ValentineWorkflow",
  "valentine_matching": {
    "method": "Coma",
    "params": {
      "max_n": 10,
      "use_instances": false
    }
  },
  "matching_type": "content"
  ....
}
For all key attributes in JSON, exactly one file path must be provided.
Attributes | Info | Value Type | Required |
---|---|---|---|
metrics | Creates a file with F1, Recall, Precision metrics if ground truth exists. .csv format | path | ✔ |
pairs | Creates a file with the attribute pairs. .csv format | path | |
{
  "outputs": {
    "metrics": "s3://klms-bucket/pyjedai-output/metrics.csv",
    "pairs": "s3://klms-bucket/pyjedai-output/pairs.csv"
  }
}
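To make the content of the metrics file concrete, the sketch below computes precision, recall, and F1 from a set of predicted attribute pairs and a set of ground-truth pairs, using the standard definitions. It is an illustration only: the function name `evaluate` and the pair representation (tuples of attribute names) are assumptions, and the exact computation inside pyJedAI may differ.

```python
def evaluate(predicted, ground_truth) -> dict:
    """Precision, recall, and F1 of predicted pairs against ground-truth pairs.

    Hypothetical helper for illustration; pairs are hashable tuples.
    """
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up attribute pairs, purely for demonstration.
predicted_pairs = [("name", "title"), ("price", "cost"), ("id", "sku")]
true_pairs = [("name", "title"), ("price", "cost"),
              ("desc", "description"), ("brand", "maker")]
scores = evaluate(predicted_pairs, true_pairs)
```

Here two of the three predicted pairs are correct and two of the four true pairs are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.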