ParsiAnalyzer is an analysis plugin for Elasticsearch. Analysis is a process that consists of the following steps:
- Tokenizing a block of text into individual terms
- Normalizing these terms into a standard form
An analyzer is essentially a wrapper that combines character filters, a tokenizer, and token filters. Elasticsearch provides many built-in analyzers, but there is still room for improvement, especially for the Persian language. This plugin provides tools for tokenizing, normalizing, and stemming Persian text.
- Tokenize Persian text
  - Convert whitespace to a zero-width non-joiner (نیم‌فاصله) wherever necessary. For example, می رود to می‌رود.
  - Convert Persian punctuation to its English equivalent. For example, ۳/۱۴ to ۳.۱۴.
  - Tokenize Persian text on whitespace and punctuation.
- Normalize Persian tokens into a single canonical form
  - Transform all forms of Yeh, Kaf, Heh, and Hamza to a unique form. For example, براي to برای.
  - Convert all Persian and Arabic digits to their English equivalents. For example, ۱۴۳ to 143.
  - Remove diacritics (اِعراب) from words. For example, اَرّه to اره.
  - Remove Kashida from words. For example, بادبــــــادک to بادبادک.
- Remove common Persian stop words
  - Persian stop words such as از and به are removed.
- Stem Persian words
  - Remove common Persian suffixes. For example, ها or ان.
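To make the character-level normalization rules above more concrete, here is a rough shell sketch. It is not the plugin's implementation (the plugin runs as analysis components inside Elasticsearch); the helper name `normalize_fa` is hypothetical, and the sketch only covers Arabic Yeh and Kaf, Kashida, and digits.

```shell
# Rough illustration only: normalize_fa is a hypothetical helper, not part of the plugin.
# Line 1 of the sed script maps Arabic Yeh and Kaf to their Persian forms.
# Line 2 removes Kashida (tatweel).
# Lines 3-4 map Persian digits to ASCII, lines 5-6 map Arabic-Indic digits to ASCII.
normalize_fa() {
  printf '%s\n' "$1" | sed \
    -e 's/ي/ی/g; s/ك/ک/g' \
    -e 's/ـ//g' \
    -e 's/۰/0/g; s/۱/1/g; s/۲/2/g; s/۳/3/g; s/۴/4/g' \
    -e 's/۵/5/g; s/۶/6/g; s/۷/7/g; s/۸/8/g; s/۹/9/g' \
    -e 's/٠/0/g; s/١/1/g; s/٢/2/g; s/٣/3/g; s/٤/4/g' \
    -e 's/٥/5/g; s/٦/6/g; s/٧/7/g; s/٨/8/g; s/٩/9/g'
}

normalize_fa '۱۴۳'   # prints 143
```

Diacritic removal, Heh and Hamza handling, stop words, and stemming are left out of this sketch on purpose.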
To install the plugin for Elasticsearch 8.13.4, run this command:
bin/elasticsearch-plugin install file:///path/to/ParsiAnalyzer.zip
If you want to build ParsiAnalyzer for a specific version of Elasticsearch, follow these steps:
- Make sure the JDK and Maven are installed on your computer.
- Clone the project from https://github.com/NarimanN2/ParsiAnalyzer.git
- Open pom.xml.
- Under the dependencies tag, change the Elasticsearch version to your desired version.
- Open plugin-descriptor.properties.
- Change elasticsearch.version to your desired version.
- Build the Maven project with the package goal (for example, mvn clean package).
- In the target/releases folder, you will now find a zip file. Install the plugin using this command:
bin/elasticsearch-plugin install file:///path/to/ParsiAnalyzer.zip
Note: to set up an ELK Stack, refer to my GitHub.
All commands are available in commands.
Change the Elasticsearch version to 8.13.4 in both pom.xml and plugin-descriptor.properties.
Build the project.
The related packages appear once they have been downloaded.
Run the build with the package goal.
Wait for the build to finish.
The zip file is now in the target/releases folder.
Note: this file is also available under releases on my GitHub.
Upload the zip file into the Elasticsearch container.
Install the plugin for Elasticsearch 8.13.4 inside the Elasticsearch container.
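A possible way to script the upload and install steps, assuming the stack runs under Docker. The container name `elasticsearch`, the `/tmp` destination, and the zip location are placeholders to adjust for your setup, and the `run` helper is hypothetical: it only prints each command unless `DRY_RUN=0` is set.

```shell
# Assumptions: container name, zip location, and destination path are placeholders.
CONTAINER="${CONTAINER:-elasticsearch}"
ZIP="${ZIP:-/path/to/ParsiAnalyzer.zip}"
DEST="/tmp/ParsiAnalyzer.zip"

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '+ %s\n' "$*"; fi
}

install_parsi() {
  # Copy the plugin archive into the container.
  run docker cp "$ZIP" "$CONTAINER:$DEST"
  # Install it non-interactively (--batch skips the confirmation prompt).
  run docker exec "$CONTAINER" bin/elasticsearch-plugin install --batch "file://$DEST"
  # Plugins are loaded at node startup, so restart the container.
  run docker restart "$CONTAINER"
}

install_parsi
```

The relative `bin/elasticsearch-plugin` path assumes the container's working directory is the Elasticsearch home directory, as in the official image.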
Test the installed analyzer with Kibana
After restarting the Elasticsearch container, you can test the analyzer with Elasticsearch's analyze API.
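A minimal sketch of such a request with curl. The analyzer name `parsi`, the `ES_URL` default, and the sample text are assumptions, not taken from this document. Elasticsearch 8 enables TLS and authentication by default, so you may need to add credentials and use https. In Kibana Dev Tools, the same JSON body can be sent with `GET _analyze`.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumption: local node without security
ANALYZER="${ANALYZER:-parsi}"               # assumption: name the plugin registers

# Build the JSON body for the _analyze API (the text must not contain double quotes).
analyze_body() {
  printf '{"analyzer": "%s", "text": "%s"}' "$ANALYZER" "$1"
}

# Send the request; call this only when the node is reachable.
analyze() {
  curl -s -X POST "$ES_URL/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1")"
}

# Show the body that would be sent for a sample phrase.
analyze_body 'می رود'; echo
```

Calling `analyze 'می رود'` against a running node returns the list of tokens the analyzer produces.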
Create your index
ParsiAnalyzer can be specified as the analyzer directly in the field mapping.
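A sketch of what such a mapping could look like. The index name `my_index`, the field name `title`, and the analyzer name `parsi` are hypothetical placeholders; only the mapping structure itself is standard Elasticsearch.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumption: local node without security
INDEX="${INDEX:-my_index}"                  # hypothetical index name
ANALYZER="${ANALYZER:-parsi}"               # assumption: name the plugin registers

# Mapping with one text field that uses the analyzer at index and search time.
mapping_body() {
  cat <<EOF
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "$ANALYZER" }
    }
  }
}
EOF
}

# Create the index; call this only when the node is reachable.
create_index() {
  curl -s -X PUT "$ES_URL/$INDEX" \
    -H 'Content-Type: application/json' \
    -d "$(mapping_body)"
}

# Show the mapping that would be sent.
mapping_body
```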
Insert data into the index.
Search.
Search with the analyzer.
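These last steps might look like the following sketch. The index name, field name, and analyzer name `parsi` are assumptions, and reading "with analyzer" as a match query that names the analyzer explicitly is only one possible interpretation.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumption: local node without security
INDEX="${INDEX:-my_index}"                  # hypothetical index name
ANALYZER="${ANALYZER:-parsi}"               # assumption: name the plugin registers

# Index one document; refresh=true makes it searchable immediately.
index_doc() {
  curl -s -X POST "$ES_URL/$INDEX/_doc?refresh=true" \
    -H 'Content-Type: application/json' \
    -d "{\"title\": \"$1\"}"
}

# Match query on the title field that names the analyzer explicitly.
search_body() {
  printf '{"query": {"match": {"title": {"query": "%s", "analyzer": "%s"}}}}' "$1" "$ANALYZER"
}

# Run the search; call this only when the node is reachable.
search() {
  curl -s -X GET "$ES_URL/$INDEX/_search" \
    -H 'Content-Type: application/json' \
    -d "$(search_body "$1")"
}

# Show the query that would be sent.
search_body 'کتاب'; echo
```

If the field mapping already names the analyzer, the explicit `analyzer` parameter in the query is optional, because Elasticsearch uses the field's analyzer at search time by default.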