GitHub - stelar-eu/pyjedai-em: pyJedAI Entity Matching version integrated with the KLMS

stelar-eu/pyjedai-em

 
 


PyJedAI - Entity Matching

The following README will guide you through the whole process of Entity Matching using pyJedAI.

💡 Tip: Find JSON examples here.

💡 Tip: If you want to learn more about pyJedAI, read the docs here.

Input

For all key attributes in JSON, exactly one file path must be provided.

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| dataset_1 | .csv format | list | |
| dataset_2 | .csv format | list | |
| ground_truth | .csv format | list | |
| embeddings_dataset_1 | Used for loading embeddings in EmbeddingsNNWorkflow; .npy format | list | |
| embeddings_dataset_2 | Used for loading embeddings in EmbeddingsNNWorkflow; .npy format | list | |

{
    "inputs": {
        "dataset_1": [
            "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
        ],
        "dataset_2": [
            "cb37e262-a606-4d82-9712-b80e8f4d723d"
        ],
        "ground_truth": [
            "db006da0-16ed-4ef5-bf1e-d142488d533e"
        ]
    }
}

💡 Tip: If dataset_2 is provided, matches will only be of type (e_1, e_2), where e_1 is an entity in dataset_1 and e_2 is an entity in dataset_2.

💡 Tip: If ground_truth is provided, metrics will be returned.
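
The shape of the "inputs" section can be checked before a job is submitted. The following is a minimal sketch, assuming the keys listed in the table above and the one-entry-per-key rule; `check_inputs` and `KNOWN_INPUT_KEYS` are hypothetical names and are not part of pyjedai-em.

```python
# Hypothetical pre-flight check for the "inputs" section (not part of pyjedai-em).
KNOWN_INPUT_KEYS = {
    "dataset_1",
    "dataset_2",
    "ground_truth",
    "embeddings_dataset_1",
    "embeddings_dataset_2",
}


def check_inputs(inputs):
    """Return a list of problems found in the "inputs" section (empty if none)."""
    problems = []
    for key, value in inputs.items():
        if key not in KNOWN_INPUT_KEYS:
            problems.append(f"unknown input key: {key}")
        elif not isinstance(value, list) or len(value) != 1:
            # Each key must carry a list holding exactly one entry.
            problems.append(f"{key} must be a list with exactly one entry")
    return problems


example_inputs = {
    "dataset_1": ["d5e730ba-c1d5-4ec1-ae95-88a637204c19"],
    "dataset_2": ["cb37e262-a606-4d82-9712-b80e8f4d723d"],
    "ground_truth": ["db006da0-16ed-4ef5-bf1e-d142488d533e"],
}
print(check_inputs(example_inputs))
```

An empty list means the section has the expected shape. The check says nothing about whether the referenced files exist.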

Parameters

Concerning the input, additional information must be provided.

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| dataset_1 | Info needed to process the dataset correctly | dataset_object | |
| dataset_2 | Info needed to process the dataset correctly | dataset_object | |
| ground_truth | Info needed to process the ground truth correctly | ground_truth_object | |
| workflow | Select your preferred workflow: BlockingBasedWorkflow, EmbeddingsNNWorkflow, or JoinWorkflow | string | |
| block_building | Block building method and parameters, used only for BlockingBasedWorkflow and EmbeddingsNNWorkflow | block_building_object | |
| block_cleaning | Block cleaning method and parameters, used only for BlockingBasedWorkflow. More than one block_cleaning method can be used | block_cleaning_object or list of block_cleaning_object | |
| comparison_cleaning | Comparison cleaning method and parameters, used only for BlockingBasedWorkflow | comparison-cleaning-object | |
| entity_matching | Entity Matching method and parameters, used only for BlockingBasedWorkflow | entity-matching-object | |
| clustering | Clustering method and parameters, used for BlockingBasedWorkflow, EmbeddingsNNWorkflow or JoinWorkflow | clustering-object | |
| join | Join method and parameters, used only for JoinWorkflow | join-object | |

💡 Tip: JoinWorkflow does not contain block_building step.
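
The step-to-workflow rules in the table above can be written down as a small lookup. This is a sketch under the assumption that the table is exhaustive; `STEPS_BY_WORKFLOW` and `unexpected_steps` are hypothetical names, not part of pyjedai-em.

```python
# Hypothetical lookup derived from the parameters table (not part of pyjedai-em).
STEPS_BY_WORKFLOW = {
    "BlockingBasedWorkflow": {
        "block_building",
        "block_cleaning",
        "comparison_cleaning",
        "entity_matching",
        "clustering",
    },
    "EmbeddingsNNWorkflow": {"block_building", "clustering"},
    "JoinWorkflow": {"join", "clustering"},
}
ALL_STEPS = set().union(*STEPS_BY_WORKFLOW.values())


def unexpected_steps(parameters):
    """List step keys in `parameters` that the selected workflow does not use."""
    workflow = parameters["workflow"]
    if workflow not in STEPS_BY_WORKFLOW:
        raise ValueError(f"unknown workflow: {workflow}")
    allowed = STEPS_BY_WORKFLOW[workflow]
    # Only step keys are considered; dataset_1, dataset_2 and ground_truth are ignored.
    return sorted(k for k in parameters if k in ALL_STEPS and k not in allowed)


params = {
    "workflow": "EmbeddingsNNWorkflow",
    "block_building": {"method": "EmbeddingsNNBlockBuilding"},
    "entity_matching": {"method": "EntityMatching"},
    "clustering": {"method": "UniqueMappingClustering"},
}
print(unexpected_steps(params))
```

Here `entity_matching` is reported, because the table lists it for BlockingBasedWorkflow only.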

Dataset

Attributes of keys: dataset_1, dataset_2

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| separator | Character separating values in csv | char | |
| id_column_name | Name of Dataset's id column | string | |
| dataset_name | Name of Dataset | string | |
| attributes | Columns to be used for matching | list | |
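
To illustrate how these attributes relate to a CSV file, here is a minimal sketch that reads a delimited file with Python's standard `csv` module. `load_dataset` is a hypothetical helper and the sample rows are made up; how pyjedai-em actually loads datasets is not shown in this README.

```python
import csv
import io


def load_dataset(handle, separator, id_column_name, attributes=None):
    """Read rows keyed by id, keeping only the requested attribute columns."""
    reader = csv.DictReader(handle, delimiter=separator)
    rows = {}
    for row in reader:
        entity_id = row[id_column_name]
        # Without an explicit attribute list, keep every column except the id.
        keep = attributes if attributes is not None else [
            col for col in row if col != id_column_name
        ]
        rows[entity_id] = {col: row[col] for col in keep}
    return rows


# Made-up sample data using "|" as separator, as in the examples of this README.
sample = io.StringIO("id|name|price\n1|Sony TV|499\n2|LG Monitor|199\n")
abt = load_dataset(sample, separator="|", id_column_name="id", attributes=["name"])
print(abt)
```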

Ground Truth

Attributes of key: ground_truth

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| separator | Character separating values in csv | char | |

Input Examples

"parameters" : {
        "dataset_1" : {
            "separator" : "|",
            "id_column_name" : "id",
            "dataset_name" : "abt"                    
        },
        "dataset_2" : {
            "separator" : "|", 
            "id_column_name" : "id",
            "dataset_name" : "buy"
        },
        "ground_truth" : {
            "separator" : "|"
        },                
        "workflow": "BlockingBasedWorkflow",
        "block_building": {
              "method": "StandardBlocking",
              "attributes_1" : ["name"],
              "attributes_2" : ["first_name"]
        },
        "block_cleaning" : [
            {
                "method" : "BlockFiltering", 
                "params" : { "ratio" : 0.7 }
            }
        ],
        "comparison_cleaning": {
            "method": "BLAST"
        },
        "entity_matching" : { 
            "method" : "EntityMatching",
            "params" : {
                "similarity_threshold" : 0.8
            }
        },
        "clustering" : {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold" : 0.1
            }
        }
}

"parameters" : {
        "workflow": "EmbeddingsNNWorkflow",
        "block_building": {
            "method" : "EmbeddingsNNBlockBuilding",
            "params" : {
                "vectorizer" : "st5"
            }
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        }
        ....
}

"parameters" : {
        "workflow": "JoinWorkflow",
        "join": {
            "method" : "TopKJoin",
            "params" : {
                "metrics" : "cosine",
                "tokenization": "qgrams",
                "reverse_order": "False"
            },
            "attributes_1": ["name"],
            "attributes_2" : ["name"]
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        }
        ....
}
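
All three examples end with UniqueMappingClustering and a similarity_threshold. As a rough illustration of what such a step does, the sketch below shows the general greedy one-to-one idea: take candidate pairs in decreasing similarity and keep a pair only if neither entity has been matched yet. This is not pyJedAI's implementation; the function name, the made-up scores, and the choice to keep scores equal to the threshold are assumptions.

```python
def unique_mapping(scored_pairs, similarity_threshold):
    """Greedy one-to-one matching: best-scoring pairs first, each entity used once."""
    matched_1, matched_2, result = set(), set(), []
    ordered = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    for e1, e2, score in ordered:
        if score < similarity_threshold:
            break  # every remaining pair scores lower, so stop here
        if e1 in matched_1 or e2 in matched_2:
            continue  # one of the two entities already has a partner
        matched_1.add(e1)
        matched_2.add(e2)
        result.append((e1, e2))
    return result


# Made-up candidate pairs: (entity from dataset_1, entity from dataset_2, similarity).
candidates = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.8),
    ("a2", "b2", 0.6),
    ("a3", "b1", 0.3),
]
matches = unique_mapping(candidates, similarity_threshold=0.4)
print(matches)
```

In this run ("a1", "b2") is skipped because a1 is already matched, and ("a3", "b1") falls below the threshold.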

Output

For all key attributes in JSON, exactly one file path must be provided.

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| metrics | Creates a file with F1, Recall, Precision metrics if ground truth exists; .csv format | path | |
| pairs | Creates a file with the ids of pairs; .csv format | path | |
| entities | Creates a file with all the matched entities; .csv format | list | |

{
  "outputs": {
        "metrics" : "s3://klms-bucket/pyjedai-output/metrics.csv",
        "pairs" : "s3://klms-bucket/pyjedai-output/pairs.csv",
        "entities" : "s3://klms-bucket/pyjedai-output/entities_df.csv"
  }
}
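
For orientation, the sketch below shows how F1, recall, and precision are conventionally computed from predicted pairs and ground-truth pairs, and how they could be written as CSV. The function name, the sample pairs, and the column layout are assumptions; the exact layout of the metrics file produced by pyjedai-em is not documented here.

```python
import csv
import io


def pair_metrics(predicted, ground_truth):
    """Compute precision, recall and F1 over sets of (id_1, id_2) pairs."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"f1": f1, "recall": recall, "precision": precision}


# Made-up pairs: three of the four predictions appear in the ground truth.
predicted = [("1", "10"), ("2", "20"), ("3", "30"), ("4", "41")]
truth = [("1", "10"), ("2", "20"), ("3", "30"), ("5", "50")]
metrics = pair_metrics(predicted, truth)

# Write the metrics as a one-row CSV (hypothetical column names).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["f1", "recall", "precision"])
writer.writeheader()
writer.writerow(metrics)
print(buffer.getvalue())
```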
