stelar-eu/pyjedai-sm

 
 

PyJedAI - Schema Matching

This README guides you through the whole process of schema matching with pyJedAI.

💡 Tip: Find JSON examples here.

💡 Tip: To learn more about pyJedAI, read the docs here.

Input

For all key attributes in JSON, exactly one file path must be provided.

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| dataset_1 | .csv format | list | |
| dataset_2 | .csv format | list | |
| ground_truth | .csv or .json format; a JSON file must be a list | list | |
| embeddings_dataset_1 | Used for loading embeddings in EmbeddingsNNWorkflow; .npy format | list | |
| embeddings_dataset_2 | Used for loading embeddings in EmbeddingsNNWorkflow; .npy format | list | |

{
    "inputs": {
        "dataset_1": [
            "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
        ],
        "dataset_2": [
            "cb37e262-a606-4d82-9712-b80e8f4d723d"
        ],
        "ground_truth": [
            "db006da0-16ed-4ef5-bf1e-d142488d533e"
        ]
    }
}
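
The rule that every input key carries exactly one entry can be checked before a request is submitted. The following is a minimal sketch, assuming the request is held as a Python dict. `validate_inputs` is a hypothetical helper and not part of pyJedAI, and treating `dataset_1` and `dataset_2` as mandatory is an assumption drawn from the example, not a documented rule.

```python
# Keys listed in the Input table of this README.
ALLOWED_KEYS = {
    "dataset_1",
    "dataset_2",
    "ground_truth",
    "embeddings_dataset_1",
    "embeddings_dataset_2",
}
# Assumption: both datasets are always needed; the other keys are optional.
ASSUMED_REQUIRED = {"dataset_1", "dataset_2"}


def validate_inputs(inputs: dict) -> list:
    """Return a list of human-readable problems (empty when none are found)."""
    problems = []
    for key in sorted(ASSUMED_REQUIRED - set(inputs)):
        problems.append(f"missing key: {key}")
    for key, value in inputs.items():
        if key not in ALLOWED_KEYS:
            problems.append(f"unknown key: {key}")
        elif not isinstance(value, list) or len(value) != 1:
            problems.append(f"{key} must be a list with exactly one entry")
    return problems


request = {
    "inputs": {
        "dataset_1": ["d5e730ba-c1d5-4ec1-ae95-88a637204c19"],
        "dataset_2": ["cb37e262-a606-4d82-9712-b80e8f4d723d"],
        "ground_truth": ["db006da0-16ed-4ef5-bf1e-d142488d533e"],
    }
}
print(validate_inputs(request["inputs"]))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a single pass.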

💡 Tip: If ground_truth is provided, metrics will be returned

Parameters

Concerning the input, additional information must be provided.

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| dataset_1 | Information needed to process the dataset correctly | dataset_object | |
| dataset_2 | Information needed to process the dataset correctly | dataset_object | |
| ground_truth | Information needed to process the ground truth correctly | ground_truth_object | |
| matching_type | content: matching based on rows; composite: matching based on attributes and rows; schema: matching based on attributes. Default: schema | | |
| workflow | Select your preferred workflow: BlockingBasedWorkflow, EmbeddingsNNWorkflow, JoinWorkflow, or ValentineWorkflow | string | |
| block_building | Block building method and parameters; used only for BlockingBasedWorkflow and EmbeddingsNNWorkflow | block_building_object | |
| block_cleaning | Block cleaning method and parameters; used only for BlockingBasedWorkflow. More than one block_cleaning method can be used | block_cleaning_object or list of block_cleaning_object | |
| comparison_cleaning | Comparison cleaning method and parameters; used only for BlockingBasedWorkflow | comparison-cleaning-object | |
| entity_matching | Entity matching method and parameters; used only for BlockingBasedWorkflow | entity-matching-object | |
| clustering | Clustering method and parameters; used only for BlockingBasedWorkflow, EmbeddingsNNWorkflow, or JoinWorkflow | clustering-object | |
| join | Join method and parameters; used only for JoinWorkflow | join-object | |
| valentine_matching | Valentine matching method; used only for ValentineWorkflow | valentine-object | |

💡 Tip: JoinWorkflow does not contain a block_building step.
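
The step-to-workflow pairing in the table can be expressed as a lookup, which makes it easy to spot a step that the selected workflow would not use. This is a sketch that mirrors the table only; `STEP_WORKFLOWS` and `unexpected_steps` are hypothetical names, and the tool itself may be more permissive than the table suggests.

```python
# Which workflows use each step, transcribed from the Parameters table.
STEP_WORKFLOWS = {
    "block_building": {"BlockingBasedWorkflow", "EmbeddingsNNWorkflow"},
    "block_cleaning": {"BlockingBasedWorkflow"},
    "comparison_cleaning": {"BlockingBasedWorkflow"},
    "entity_matching": {"BlockingBasedWorkflow"},
    "clustering": {"BlockingBasedWorkflow", "EmbeddingsNNWorkflow", "JoinWorkflow"},
    "join": {"JoinWorkflow"},
    "valentine_matching": {"ValentineWorkflow"},
}


def unexpected_steps(parameters: dict) -> list:
    """Return step keys that are present but not listed for the chosen workflow."""
    workflow = parameters["workflow"]
    return sorted(
        step
        for step, workflows in STEP_WORKFLOWS.items()
        if step in parameters and workflow not in workflows
    )


parameters = {
    "workflow": "ValentineWorkflow",
    "valentine_matching": {"method": "Coma"},
    "clustering": {"method": "UniqueMappingClustering"},
}
print(unexpected_steps(parameters))  # ['clustering']
```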

Dataset

Attributes of keys: dataset_1, dataset_2

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| separator | Character separating values in the .csv file | char | |
| dataset_name | Name of the dataset | string | |

Ground Truth

Attributes of key: ground_truth

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| separator | Character separating values in the .csv file; must be provided if ground_truth is a .csv file | char | |
| is_json | Set if ground_truth is a .json file | bool | |
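
How the two ground-truth options might be interpreted when loading the file is sketched below using only the standard library. `load_ground_truth` is a hypothetical helper; the row layout of the ground truth (one matching pair per row) is an assumption, since this README does not specify it.

```python
import csv
import io
import json
from typing import Optional


def load_ground_truth(text: str, separator: Optional[str] = None,
                      is_json: bool = False) -> list:
    """Parse ground-truth content given as text.

    Assumption: a .csv file holds one pair per row; a .json file holds a list.
    """
    if is_json:
        data = json.loads(text)
        if not isinstance(data, list):
            raise ValueError("JSON ground truth must be a list")
        return data
    if separator is None:
        raise ValueError("separator is required for .csv ground truth")
    # Skip empty rows such as a trailing blank line.
    return [row for row in csv.reader(io.StringIO(text), delimiter=separator) if row]


print(load_ground_truth("name|title\nprice|cost\n", separator="|"))
print(load_ground_truth('[["name", "title"]]', is_json=True))
```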

Input Examples

"parameters" : {
        "dataset_1" : {
            "separator" : "|",
            "id_column_name" : "id",
            "dataset_name" : "abt"                    
        },
        "dataset_2" : {
            "separator" : "|", 
            "id_column_name" : "id",
            "dataset_name" : "buy"
        },
        "ground_truth" : {
            "separator" : "|"
        },                
        "workflow": "BlockingBasedWorkflow",
        "block_building": {
              "method": "StandardBlocking"
        },
        "block_cleaning" : [
            {
                "method" : "BlockFiltering", 
                "params" : { "ratio" : 0.7 }
            }
        ],
        "comparison_cleaning": {
            "method": "BLAST"
        },
        "entity_matching" : { 
            "method" : "EntityMatching",
            "params" : {
                "similarity_threshold" : 0.8
            }
        },
        "clustering" : {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold" : 0.1
            }
        },
        "matching_type": "content"
}

"parameters" : {
        "workflow": "EmbeddingsNNWorkflow",
        "block_building": {
            "method" : "EmbeddingsNNBlockBuilding",
            "params" : {
                "vectorizer" : "st5"
            }
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        },
        "matching_type": "content"
        ....
}

"parameters" : {
        "workflow": "JoinWorkflow",
        "block_building": {
            "method" : "TopKJoin",
            "params" : {
                "metrics" : "cosine",
                "tokenization": "qgrams",
                "reverse_order": "False"
            }
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        },
        "matching_type": "content"
        ....
}

"parameters" : {
        "workflow": "ValentineWorkflow",
        "valentine_matching": {
            "method" : "Coma",
            "params" : {
                "max_n" : 10,
                "use_instances": false
            }
        },
        "matching_type": "content"
        ....
}
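
One way to avoid hand-written JSON mistakes, such as Python-style `False` or trailing commas, is to build the parameters as a Python dict and serialize it. A minimal sketch using the ValentineWorkflow values from the examples; the variable names are illustrative only.

```python
import json

# Parameters written as a plain Python dict; note the Python boolean False.
parameters = {
    "workflow": "ValentineWorkflow",
    "valentine_matching": {
        "method": "Coma",
        "params": {"max_n": 10, "use_instances": False},
    },
    "matching_type": "content",
}

# json.dumps emits valid JSON: False becomes false, and no trailing commas appear.
payload = json.dumps({"parameters": parameters}, indent=4)
print(payload)
```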

Output

For all key attributes in JSON, exactly one file path must be provided.

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| metrics | Creates a file with F1, Recall, and Precision metrics if ground truth exists; .csv format | path | |
| pairs | Creates a file with the attribute pairs; .csv format | path | |

{
  "outputs": {
        "metrics" : "s3://klms-bucket/pyjedai-output/metrics.csv",
        "pairs" : "s3://klms-bucket/pyjedai-output/pairs.csv"
  }
}
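
To illustrate what the metrics file summarizes, the sketch below computes Precision, Recall, and F1 from a set of predicted pairs and a ground truth, then writes them as CSV. It uses the standard definitions of these metrics; the pair values, the column names, and the `compute_metrics` helper are hypothetical and are not taken from the tool's actual output.

```python
import csv
import io


def compute_metrics(predicted, truth) -> dict:
    """Standard precision/recall/F1 over sets of (left, right) pairs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"F1": f1, "Recall": recall, "Precision": precision}


# Made-up attribute pairs, for illustration only.
predicted = [("name", "title"), ("price", "cost"), ("id", "sku")]
truth = [("name", "title"), ("price", "cost"),
         ("brand", "maker"), ("desc", "description")]

metrics = compute_metrics(predicted, truth)

# Write a one-row CSV into an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(metrics))
writer.writeheader()
writer.writerow(metrics)
print(buffer.getvalue())
```

Here two of the three predicted pairs are correct and two of the four true pairs are found, so Precision is 2/3, Recall is 1/2, and F1 is 4/7.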

About

pyJedAI Schema Matching version integrated with the KLMS
