README PyJedAI - Schema Matching

This README guides you through the whole Schema Matching process using pyJedAI.

💡 Tip: Find JSON examples here.

💡 Tip: If you want to learn more about pyJedAI read the docs here.

Input

For all key attributes in JSON, exactly one file path must be provided.

Attributes	Info	Value Type	Required
`dataset_1`	`.csv` format	`list`	✔
`dataset_2`	`.csv` format	`list`	✔
`ground_truth`	`.csv` or `.json` format; a JSON file must contain a list	`list`
`embeddings_dataset_1`	Used for loading embeddings in `EmbeddingsNNWorkflow`; `.npy` format	`list`
`embeddings_dataset_2`	Used for loading embeddings in `EmbeddingsNNWorkflow`; `.npy` format	`list`

{
    "inputs": {
        "dataset_1": [
            "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
        ],
        "dataset_2": [
            "cb37e262-a606-4d82-9712-b80e8f4d723d"
        ],
        "ground_truth": [
            "db006da0-16ed-4ef5-bf1e-d142488d533e"
        ]
    }
}
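
The input rules above (two required datasets, each key holding a list with exactly one path) can be checked before a job is submitted. The following is a minimal sketch using only the Python standard library; `check_inputs` is a hypothetical helper written for this README, not part of pyJedAI, and it assumes the key names listed in the table are the only ones accepted.

```python
import json

# Key names transcribed from the Input table.
REQUIRED_INPUTS = ("dataset_1", "dataset_2")
OPTIONAL_INPUTS = ("ground_truth", "embeddings_dataset_1", "embeddings_dataset_2")


def check_inputs(config: dict) -> list:
    """Return a list of problems found in the "inputs" section (empty if none)."""
    inputs = config.get("inputs", {})
    problems = []
    for key in REQUIRED_INPUTS:
        if key not in inputs:
            problems.append(f"missing required input: {key}")
    for key, value in inputs.items():
        if key not in REQUIRED_INPUTS + OPTIONAL_INPUTS:
            problems.append(f"unknown input: {key}")
        elif not isinstance(value, list) or len(value) != 1:
            problems.append(f"{key} must be a list holding exactly one path")
    return problems


example = json.loads("""
{
    "inputs": {
        "dataset_1": ["d5e730ba-c1d5-4ec1-ae95-88a637204c19"],
        "dataset_2": ["cb37e262-a606-4d82-9712-b80e8f4d723d"],
        "ground_truth": ["db006da0-16ed-4ef5-bf1e-d142488d533e"]
    }
}
""")
print(check_inputs(example))  # prints []
```

Returning a list of messages rather than raising on the first error lets a caller report every problem in one pass.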

💡 Tip: If `ground_truth` is provided, metrics will be returned.

Parameters

Concerning the input, additional information must be provided.

Attributes	Info	Value Type	Required
`dataset_1`	Provide info for dataset to be processed correctly	dataset_object	✔
`dataset_2`	Provide info for dataset to be processed correctly	dataset_object
`ground_truth`	Provide info for dataset to be processed correctly	ground_truth_object
`matching_type`	`content`: matching based on rows; `composite`: matching based on attributes and rows; `schema`: matching based on attributes	default: `schema`
`workflow`	Select your preferred workflow: `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow`, `JoinWorkflow`, or `ValentineWorkflow`	`string`	✔
`block_building`	Block building method and parameters; used only for `BlockingBasedWorkflow` and `EmbeddingsNNWorkflow`	block_building_object	✔
`block_cleaning`	Block cleaning method and parameters; used only for `BlockingBasedWorkflow`. More than one `block_cleaning` method can be used	block_cleaning_object or `list` of block_cleaning_object
`comparison_cleaning`	Comparison cleaning method and parameters; used only for `BlockingBasedWorkflow`	comparison_cleaning_object
`entity_matching`	Entity matching method and parameters; used only for `BlockingBasedWorkflow`	entity_matching_object	✔
`clustering`	Clustering method and parameters; used only for `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow` or `JoinWorkflow`	clustering_object
`join`	Join method and parameters; used only for `JoinWorkflow`	join_object	✔
`valentine_matching`	Valentine matching method; used only for `ValentineWorkflow`	valentine_object	✔

💡 Tip: JoinWorkflow does not contain block_building step.
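
The table above ties each step key to specific workflows. As a rough illustration, the sketch below encodes that mapping and reports steps that are missing or unused for the selected workflow. It reads a ✔ as "required for the workflows that use the step", which is an interpretation of the table, and `check_steps` is a hypothetical helper; how the tool itself treats unused keys is not stated here.

```python
# Step keys per workflow, transcribed from the Parameters table:
# workflow name -> (required step keys, optional step keys).
WORKFLOW_STEPS = {
    "BlockingBasedWorkflow": (
        {"block_building", "entity_matching"},
        {"block_cleaning", "comparison_cleaning", "clustering"},
    ),
    "EmbeddingsNNWorkflow": ({"block_building"}, {"clustering"}),
    "JoinWorkflow": ({"join"}, {"clustering"}),
    "ValentineWorkflow": ({"valentine_matching"}, set()),
}
# Every step key that appears for any workflow.
ALL_STEPS = set().union(*(req | opt for req, opt in WORKFLOW_STEPS.values()))


def check_steps(parameters: dict) -> list:
    """Return problems with the step keys of a "parameters" object."""
    workflow = parameters.get("workflow")
    if workflow not in WORKFLOW_STEPS:
        return [f"unknown workflow: {workflow!r}"]
    required, optional = WORKFLOW_STEPS[workflow]
    present = ALL_STEPS & set(parameters)
    problems = [f"missing step: {s}" for s in sorted(required - present)]
    problems += [
        f"step not used by {workflow}: {s}"
        for s in sorted(present - required - optional)
    ]
    return problems


print(check_steps({
    "workflow": "ValentineWorkflow",
    "valentine_matching": {"method": "Coma"},
}))  # prints []
```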

Dataset

Attributes of keys: dataset_1, dataset_2

Attributes	Info	Value Type	Required
`separator`	Character separating values in csv	`char`	✔
`dataset_name`	Name of Dataset	`string`

Ground Truth

Attributes of key: ground_truth

Attributes	Info	Value Type	Required
`separator`	Character separating values in the CSV; must be provided if the ground truth is `.csv`	`char`
`is_json`	Whether ground_truth is a `.json` file	`bool`
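
To illustrate how `separator` and `is_json` interact, here is a minimal sketch of a ground-truth parser built on the standard `csv` and `json` modules. `parse_ground_truth` is a hypothetical helper, not the tool's actual loader, and the sample rows are made up; whether a real ground-truth CSV carries a header row is not specified here, so the sketch returns every row as-is.

```python
import csv
import io
import json


def parse_ground_truth(text: str, settings: dict) -> list:
    """Parse ground-truth text according to a ground_truth settings object."""
    if settings.get("is_json", False):
        pairs = json.loads(text)
        # A JSON ground truth must be a list (see the Input table).
        if not isinstance(pairs, list):
            raise ValueError("a JSON ground truth must be a list")
        return pairs
    separator = settings.get("separator")
    if not separator:
        raise ValueError("separator is required for a .csv ground truth")
    return list(csv.reader(io.StringIO(text), delimiter=separator))


# Hypothetical sample data, for illustration only.
print(parse_ground_truth("name|title\nid|key\n", {"separator": "|"}))
print(parse_ground_truth('[["name", "title"]]', {"is_json": True}))
```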

Input Examples

"parameters" : {
        "dataset_1" : {
            "separator" : "|",
            "id_column_name" : "id",
            "dataset_name" : "abt"                    
        },
        "dataset_2" : {
            "separator" : "|", 
            "id_column_name" : "id",
            "dataset_name" : "buy"
        },
        "ground_truth" : {
            "separator" : "|"
        },                
        "workflow": "BlockingBasedWorkflow",
        "block_building": {
              "method": "StandardBlocking"
        },
        "block_cleaning" : [
            {
                "method" : "BlockFiltering", 
                "params" : { "ratio" : 0.7 }
            }
        ],
        "comparison_cleaning": {
            "method": "BLAST"
        },
        "entity_matching" : { 
            "method" : "EntityMatching",
            "params" : {
                "similarity_threshold" : 0.8
            }
        },
        "clustering" : {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold" : 0.1
            }
        },
        "matching_type": "content"
}

"parameters" : {           
        "workflow": "EmbeddingsNNWorkflow",
        "block_building": 
        {
            "method" : "EmbeddingsNNBlockBuilding",
            "params" : {
                "vectorizer" : "st5"
            }
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        },
        "matching_type": "content"
     ....     
    }

"parameters" : {           
        "workflow": "JoinWorkflow",
        "block_building": 
        {
            "method" : "TopKJoin",
            "params" : {
                "metrics" : "cosine",
                "tokenization": "qgrams",
                "reverse_order": "False"
            }
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        },
        "matching_type": "content"
     ....     
    }

"parameters" : {           
        "workflow": "ValentineWorkflow",
        "valentine_matching": 
        {
            "method" : "Coma",
            "params" : {
                "max_n" : 10,
                "use_instances": false
            }
        },
        "matching_type": "content"
     ....     
    }
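
When a configuration is built programmatically, serialising it with a JSON library avoids hand-written syntax slips such as Python-style booleans or trailing commas. A minimal sketch with Python's standard `json` module follows; the values mirror the ValentineWorkflow example, and wrapping them under a top-level `"parameters"` key is an assumption based on the fragments shown in this README.

```python
import json

# Values copied from the ValentineWorkflow example; note the Python boolean.
parameters = {
    "workflow": "ValentineWorkflow",
    "valentine_matching": {
        "method": "Coma",
        "params": {"max_n": 10, "use_instances": False},
    },
    "matching_type": "content",
}

# json.dumps writes Python's False as the JSON literal false.
text = json.dumps({"parameters": parameters}, indent=4)
print(text)

# Parsing the text back yields the original structure.
assert json.loads(text)["parameters"] == parameters
```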

Output

For all key attributes in JSON, exactly one file path must be provided.

Attributes	Info	Value Type	Required
`metrics`	Creates a file with F1, Recall and Precision metrics if ground truth exists; `.csv` format	`path`	✔
`pairs`	Creates a file with the attribute pairs; `.csv` format	`path`

{
  "outputs": {
        "metrics" : "s3://klms-bucket/pyjedai-output/metrics.csv",
        "pairs" : "s3://klms-bucket/pyjedai-output/pairs.csv"
  }
}
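
The output rules can be checked in the same spirit as the inputs. The sketch below assumes that `metrics` and `pairs` are the only output keys and that each value is a single path string ending in `.csv`; `check_outputs` is a hypothetical helper, not part of the tool.

```python
def check_outputs(config: dict) -> list:
    """Return a list of problems found in the "outputs" section (empty if none)."""
    outputs = config.get("outputs", {})
    problems = []
    if "metrics" not in outputs:
        problems.append("missing required output: metrics")
    for key, path in outputs.items():
        if key not in ("metrics", "pairs"):
            problems.append(f"unknown output: {key}")
        elif not isinstance(path, str) or not path.endswith(".csv"):
            problems.append(f"{key} must be a single .csv path")
    return problems


print(check_outputs({
    "outputs": {
        "metrics": "s3://klms-bucket/pyjedai-output/metrics.csv",
        "pairs": "s3://klms-bucket/pyjedai-output/pairs.csv",
    }
}))  # prints []
```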
