PyJedAI - Entity Matching

This README guides you through the whole process of entity matching using pyJedAI.

💡 Tip: Find JSON examples here.

💡 Tip: If you want to learn more about pyJedAI, read the docs here.

Input

For all key attributes in JSON, exactly one file path must be provided.

Attributes	Info	Value Type	Required
`dataset_1`	`.csv` format	`list`	✔
`dataset_2`	`.csv` format	`list`
`ground_truth`	`.csv` format	`list`
`embeddings_dataset_1`	Used for loading embeddings in `EmbeddingsNNWorkflow`; `.npy` format	`list`
`embeddings_dataset_2`	Used for loading embeddings in `EmbeddingsNNWorkflow`; `.npy` format	`list`

{
    "inputs": {
        "dataset_1": [
            "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
        ],
        "dataset_2": [
            "cb37e262-a606-4d82-9712-b80e8f4d723d"
        ],
        "ground_truth": [
            "db006da0-16ed-4ef5-bf1e-d142488d533e"
        ]
    }
}

💡 Tip: If dataset_2 is provided, matches will only be of type (e_1, e_2), where e_1 is an entity in dataset_1 and e_2 is an entity in dataset_2.

💡 Tip: If ground_truth is provided, metrics will be returned.
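
As a minimal sketch (not part of the tool itself), the input rule above can be checked before a job is submitted. The helper name `check_inputs` is hypothetical; it assumes only the JSON layout shown above, where each key maps to a list holding exactly one file path or identifier.

```python
import json

# Keys taken from the Input table; only dataset_1 is marked as required.
REQUIRED_INPUTS = {"dataset_1"}
OPTIONAL_INPUTS = {
    "dataset_2",
    "ground_truth",
    "embeddings_dataset_1",
    "embeddings_dataset_2",
}


def check_inputs(inputs: dict) -> list:
    """Return a list of problems found in the "inputs" object (empty if none)."""
    problems = []
    for key in sorted(REQUIRED_INPUTS - set(inputs)):
        problems.append(f"missing required input: {key}")
    for key, value in inputs.items():
        if key not in REQUIRED_INPUTS | OPTIONAL_INPUTS:
            problems.append(f"unknown input: {key}")
        elif not isinstance(value, list) or len(value) != 1:
            problems.append(f"{key} must be a list with exactly one file path")
    return problems


config = json.loads("""
{
    "inputs": {
        "dataset_1": ["d5e730ba-c1d5-4ec1-ae95-88a637204c19"],
        "dataset_2": ["cb37e262-a606-4d82-9712-b80e8f4d723d"],
        "ground_truth": ["db006da0-16ed-4ef5-bf1e-d142488d533e"]
    }
}
""")
print(check_inputs(config["inputs"]))  # prints []
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a configuration at once.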

Parameters

Concerning the input, additional information must be provided.

Attributes	Info	Value Type	Required
`dataset_1`	Provide info for dataset to be processed correctly	dataset_object	✔
`dataset_2`	Provide info for dataset to be processed correctly	dataset_object
`ground_truth`	Provide info for dataset to be processed correctly	ground_truth_object
`workflow`	Select your preferred workflow: `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow`, or `JoinWorkflow`	`string`	✔
`block_building`	Block building method and parameters; used only for `BlockingBasedWorkflow` and `EmbeddingsNNWorkflow`	block_building_object	✔
`block_cleaning`	Block cleaning method and parameters; used only for `BlockingBasedWorkflow`. More than one `block_cleaning` method can be used	block_cleaning_object or `list` of block_cleaning_object
`comparison_cleaning`	Comparison cleaning method and parameters; used only for `BlockingBasedWorkflow`	comparison_cleaning_object
`entity_matching`	Entity matching method and parameters; used only for `BlockingBasedWorkflow`	entity_matching_object	✔
`clustering`	Clustering method and parameters; used in `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow`, or `JoinWorkflow`	clustering_object
`join`	Join method and parameters; used only for `JoinWorkflow`	join_object	✔

💡 Tip: JoinWorkflow does not contain a block_building step.
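
The table above can be read as a set of required keys per workflow. The following sketch encodes that reading; the function name `missing_parameters` is hypothetical, and the interpretation that a ✔ on a workflow-specific key means "required when that workflow is selected" is an assumption, not something the tool is confirmed to enforce this way.

```python
# Required keys per workflow, read off the Parameters table (an interpretation).
ALWAYS_REQUIRED = {"dataset_1", "workflow"}
REQUIRED_BY_WORKFLOW = {
    "BlockingBasedWorkflow": {"block_building", "entity_matching"},
    "EmbeddingsNNWorkflow": {"block_building"},
    "JoinWorkflow": {"join"},
}


def missing_parameters(parameters: dict) -> set:
    """Return the required keys that are absent from a "parameters" object."""
    workflow = parameters.get("workflow")
    if workflow not in REQUIRED_BY_WORKFLOW:
        raise ValueError(f"unknown workflow: {workflow!r}")
    required = ALWAYS_REQUIRED | REQUIRED_BY_WORKFLOW[workflow]
    return required - set(parameters)


params = {
    "dataset_1": {"separator": "|", "id_column_name": "id"},
    "workflow": "JoinWorkflow",
    "clustering": {"method": "UniqueMappingClustering"},
}
print(missing_parameters(params))  # prints {'join'}
```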

Dataset

Attributes of keys: dataset_1, dataset_2

Attributes	Info	Value Type	Required
`separator`	Character separating values in csv	`char`	✔
`id_column_name`	Name of Dataset's id column	`string`	✔
`dataset_name`	Name of Dataset	`string`
`attributes`	Columns to be used for matching	`list`

Ground Truth

Attributes of key: ground_truth

Attributes	Info	Value Type	Required
`separator`	Character separating values in csv	`char`	✔

Input Examples

"parameters" : {
        "dataset_1" : {
            "separator" : "|",
            "id_column_name" : "id",
            "dataset_name" : "abt"                    
        },
        "dataset_2" : {
            "separator" : "|", 
            "id_column_name" : "id",
            "dataset_name" : "buy"
        },
        "ground_truth" : {
            "separator" : "|"
        },                
        "workflow": "BlockingBasedWorkflow",
        "block_building": {
              "method": "StandardBlocking",
              "attributes_1" : ["name"],
              "attributes_2" : ["first_name"]
        },
        "block_cleaning" : [
            {
                "method" : "BlockFiltering", 
                "params" : { "ratio" : 0.7 }
            }
        ],
        "comparison_cleaning": {
            "method": "BLAST"
        },
        "entity_matching" : { 
            "method" : "EntityMatching",
            "params" : {
                "similarity_threshold" : 0.8
            }
        },
        "clustering" : {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold" : 0.1
            }
        }
}
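
Because `block_cleaning` may be given either as a single object or as a list of objects, code that consumes this configuration has to normalize it. The sketch below shows one way to do that and to list the configured steps; the helper names `as_step_list` and `describe_steps` are hypothetical, and the step order is simply the order in which the keys appear in the Parameters table.

```python
def as_step_list(value):
    """Accept a single step object or a list of them; always return a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def describe_steps(parameters):
    """Return (key, method, params) tuples for the configured steps."""
    order = ["block_building", "block_cleaning", "comparison_cleaning",
             "entity_matching", "clustering"]
    steps = []
    for key in order:
        for step in as_step_list(parameters.get(key)):
            # "params" is optional in the examples, so default to an empty dict.
            steps.append((key, step["method"], step.get("params", {})))
    return steps


parameters = {
    "workflow": "BlockingBasedWorkflow",
    "block_building": {"method": "StandardBlocking"},
    "block_cleaning": [
        {"method": "BlockFiltering", "params": {"ratio": 0.7}},
    ],
    "comparison_cleaning": {"method": "BLAST"},
    "entity_matching": {"method": "EntityMatching",
                        "params": {"similarity_threshold": 0.8}},
    "clustering": {"method": "UniqueMappingClustering",
                   "params": {"similarity_threshold": 0.1}},
}
for key, method, params in describe_steps(parameters):
    print(key, method, params)
```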

"parameters" : {           
        "workflow": "EmbeddingsNNWorkflow",
        "block_building": 
        {
            "method" : "EmbeddingsNNBlockBuilding",
            "params" : {
                "vectorizer" : "st5"
            }
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        }
     ....     
    }

"parameters" : {           
        "workflow": "JoinWorkflow",
        "join": 
        {
            "method" : "TopKJoin",
            "params" : {
                "metrics" : "cosine",
                "tokenization": "qgrams",
                "reverse_order": "False"
            },
            "attributes_1": ["name"],
            "attributes_2" : ["name"]
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        }
     ....     
    }

Output

For all key attributes in JSON, exactly one file path must be provided.

Attributes	Info	Value Type
`metrics`	Creates a file with F1, Recall, and Precision metrics if ground truth exists; `.csv` format	`path`
`pairs`	Creates a file with the ids of pairs; `.csv` format	`path`
`entities`	Creates a file with all the matched entities; `.csv` format	`path`

{
  "outputs": {
        "metrics" : "s3://klms-bucket/pyjedai-output/metrics.csv",
        "pairs" : "s3://klms-bucket/pyjedai-output/pairs.csv",
        "entities" : "s3://klms-bucket/pyjedai-output/entities_df.csv"
  }
}
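
For reference, the three values in the `metrics` file follow the usual definitions of precision, recall, and F1 over matched pairs. The sketch below computes them from a list of predicted pairs and a list of ground-truth pairs; it illustrates the standard formulas only, the function name `evaluate` is hypothetical, and it is not a claim about how pyJedAI computes or formats its metrics.

```python
def evaluate(predicted_pairs, true_pairs):
    """Compute precision, recall and F1 for pairs of the form (e_1, e_2)."""
    predicted = set(predicted_pairs)
    truth = set(true_pairs)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"Precision": precision, "Recall": recall, "F1": f1}


predicted = [("a1", "b1"), ("a2", "b2"), ("a3", "b9"), ("a6", "b7")]
truth = [("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")]
print(evaluate(predicted, truth))
# prints {'Precision': 0.5, 'Recall': 0.5, 'F1': 0.5}
```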
