Ability to extend supported extension - file parser mapping #3690 #3683
Draft: fractal3000 wants to merge 77 commits into master from feature/3660-2-search-add-on-adds-unnecessary-stacktrace-to-log-if-an-unsupported-file-format-is-provided
09297bd
Message text correction
fractal3000 e8ff174
Parser resolving mechanism refactoring.
fractal3000 b1bf8c1
UnsupportedFileExtensionExceptionTest
fractal3000 062b4fe
FileParserResolverTest
fractal3000 8d44aef
FileParserResolverTest
fractal3000 38aee04
File type correction
fractal3000 0e8ba5e
UnsupportedFileExtensionExceptionTest
fractal3000 b9fc042
Groovy tests correction
fractal3000 8c24fcc
minor change
fractal3000 6f82ea5
minor change
fractal3000 6a47dc0
FilePropertyValueExtractorTest
fractal3000 5dfd0e5
FilePropertyValueExtractorTest enhancement
fractal3000 7a3abd9
FilePropertyValueExtractorTest enhancement
fractal3000 737ff41
dependencies adding
fractal3000 9622593
Review correction
fractal3000 2b691ec
Review correction(exceptions)
fractal3000 4ac2288
Review correction(exceptions)
fractal3000 c9d056a
Review correction(exceptions)
fractal3000 3bd1bf9
Review correction(exceptions)
fractal3000 26857b6
Review correction(exceptions)
fractal3000 c790851
Parser resolvers
fractal3000 eb89525
Parser resolvers
fractal3000 81d3c55
FileParserResolverManager
fractal3000 d00b02b
FileParserResolverManager
fractal3000 2eb438a
FileProcessorTest
fractal3000 a7f2f7d
UnsupportedFileExtensionException
fractal3000 e2dc255
UnsupportedFileExtensionExceptionTest
fractal3000 10c639e
FilePropertyValueExtractorTest
fractal3000 8d8d1fd
OpenOfficeDocumentsParserResolver correction
fractal3000 58fdd1c
adding necessary dependency for testing purposes
fractal3000 4d2aae1
Resolvers correction
fractal3000 5632c05
test correction
fractal3000 c9403f0
Removing not necessary lines
fractal3000 179f1a6
Extensions problem
fractal3000 9edf65e
Packages reorganizing
fractal3000 a20b68d
Message correction.
fractal3000 a20e552
EmptyFileExtensionException message extending
fractal3000 d1b4850
Java doc
fractal3000 3cc0edf
Java doc
fractal3000 c105e4d
FileParserResolverManagerIntegrationTest creation and resolvers corre…
fractal3000 0a9d793
a not necessary extra dependency
fractal3000 6bf15a1
Method renaming
fractal3000 0380c81
FileParserResolver class's signature changing
fractal3000 8f43891
UnsupportedFileTypeException correction
fractal3000 fa9f068
FilePropertyValueExtractorTest correction
fractal3000 78bb10a
FileProcessorTest correction
fractal3000 d61537d
The tests correction
fractal3000 abb539e
FileParserResolverManager and the test correction
fractal3000 60b2126
AbstractExtensionBasedFileParserResolverTest
fractal3000 7c99bc0
AbstractExtensionBasedFileParserResolverTest
fractal3000 27a7afc
JavaDoc
fractal3000 bc20985
JavaDoc
fractal3000 c94bc7b
minor change
fractal3000 0205f72
minor change
fractal3000 711387b
minor change
fractal3000 7a241ea
minor change
fractal3000 78746b5
FileParserResolverManagerIntegrationTest extending
fractal3000 e6bde49
minor change
fractal3000 9380af4
code formatting
fractal3000 cf11816
Capital letters checking
fractal3000 48b2709
Removing not necessary custom exception
fractal3000 1e53e6f
Renaming the exception
fractal3000 c51d4c9
Code style changes
fractal3000 97b8cf4
JavaDocs correction
fractal3000 cca17b6
JavaDocs correction
fractal3000 a611807
Test correction
fractal3000 917a07a
List to Set changing
fractal3000 eecd553
FileParserResolverManager -> FileParserProvider
fractal3000 6f4e470
Message text correction
fractal3000 6372e9a
OldMSOfficeDocumentsParserResolver > LegacyMSOfficeDocumentsParserRes…
fractal3000 300fb7a
JavaDoc
fractal3000 3dd69e9
Getting FileParsingBundle with FileParserResolver
fractal3000 b6b966b
Comment adding
fractal3000 f490f6d
Uppercase extensions' support
fractal3000 32f6568
FileParsingBundle -> FileParserKit
fractal3000 10d8db8
BodyContentHandler -> ContentHandler
fractal3000 57e81d9
MSOfficeDocumentsParserResolver
fractal3000
104 changes: 104 additions & 0 deletions
.../main/java/io/jmix/search/index/fileparsing/AbstractExtensionBasedFileParserResolver.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,104 @@ | ||
/* | ||
* Copyright 2024 Haulmont. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package io.jmix.search.index.fileparsing; | ||
|
||
import com.google.common.base.Strings; | ||
import io.jmix.core.FileRef; | ||
import org.apache.commons.io.FilenameUtils; | ||
import org.apache.tika.metadata.Metadata; | ||
import org.apache.tika.parser.ParseContext; | ||
import org.apache.tika.parser.Parser; | ||
import org.apache.tika.sax.BodyContentHandler; | ||
import org.xml.sax.ContentHandler; | ||
|
||
import java.io.StringWriter; | ||
import java.util.Set; | ||
import java.util.function.Function; | ||
|
||
/** | ||
* Implements the logic common to all extension-based file parser resolvers. | ||
*/ | ||
public abstract class AbstractExtensionBasedFileParserResolver implements FileParserResolver { | ||
|
||
/** | ||
* Returns the set of file extensions supported by this resolver. | ||
* Note that the extension check is case-sensitive, so to support both | ||
* the uppercase and the lowercase form of an extension, each must be listed explicitly, | ||
* e.g. ["xlsx", "XLSX", "docx", "DOCX"]. | ||
* | ||
* @return collection of supported extensions | ||
*/ | ||
public abstract Set<String> getSupportedExtensions(); | ||
|
||
@Override | ||
public String getCriteriaDescription() { | ||
return String.format( | ||
"Parser: %s. Supported extensions: %s.", | ||
this.getClass().getSimpleName(), | ||
getSupportedExtensionsString(getSupportedExtensions())); | ||
} | ||
|
||
@Override | ||
public boolean supports(FileRef fileRef) { | ||
String fileName = fileRef.getFileName(); | ||
String fileExtension = FilenameUtils.getExtension(fileName); | ||
if (Strings.isNullOrEmpty(fileExtension)) { | ||
return false; | ||
} | ||
|
||
return getSupportedExtensions().contains(fileExtension); | ||
} | ||
|
||
protected String getSupportedExtensionsString(Set<String> supportedExtensions) { | ||
return String.join(", ", supportedExtensions); | ||
} | ||
|
||
@Override | ||
public FileParserKit getParserKit() { | ||
return new FileParserKit( | ||
getParser(), | ||
getContentHandlerGenerator(), | ||
getMetadata(), | ||
getParseContext()); | ||
} | ||
|
||
/** | ||
* Returns a parser for the supported file type. | ||
*/ | ||
protected abstract Parser getParser(); | ||
|
||
/** | ||
* Returns a function that creates the ContentHandler needed to parse the given file. | ||
*/ | ||
protected Function<StringWriter, ContentHandler> getContentHandlerGenerator() { | ||
return stringWriter -> new BodyContentHandler(stringWriter); | ||
} | ||
|
||
/** | ||
* Returns the Metadata object used when parsing the given file. | ||
*/ | ||
protected Metadata getMetadata() { | ||
return new Metadata(); | ||
} | ||
|
||
/** | ||
* Returns the ParseContext object used when parsing the given file. | ||
*/ | ||
protected ParseContext getParseContext() { | ||
return new ParseContext(); | ||
} | ||
} |
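To illustrate the case-sensitivity note in the Javadoc of `getSupportedExtensions()`, here is a minimal, self-contained sketch. It does not use the Jmix, Tika, or Commons IO types: `ExtensionMatcher`, `withUppercaseVariants`, and `extensionOf` are hypothetical names, a plain file name stands in for `FileRef`, and the extension lookup is a simplification of `FilenameUtils.getExtension`. It shows one way a resolver could derive the uppercase variants from a lowercase list instead of writing them out by hand.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an extension-based resolver; not part of the PR.
final class ExtensionMatcher {

    private final Set<String> supportedExtensions;

    ExtensionMatcher(Set<String> lowercaseExtensions) {
        this.supportedExtensions = withUppercaseVariants(lowercaseExtensions);
    }

    // Adds the uppercase form of every extension, e.g. "docx" -> "docx" and "DOCX".
    static Set<String> withUppercaseVariants(Set<String> extensions) {
        Set<String> result = new LinkedHashSet<>();
        for (String extension : extensions) {
            result.add(extension);
            result.add(extension.toUpperCase(Locale.ROOT));
        }
        return Set.copyOf(result);
    }

    // Simplified extension lookup: the text after the last dot, or "" if there is no dot.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    // Same shape as supports(FileRef) in the PR: an exact, case-sensitive set lookup.
    boolean supports(String fileName) {
        String extension = extensionOf(fileName);
        return !extension.isEmpty() && supportedExtensions.contains(extension);
    }
}
```

With `new ExtensionMatcher(Set.of("docx", "xlsx"))`, both `report.docx` and `REPORT.DOCX` are accepted, while a mixed-case name such as `report.Docx` is still rejected, which matches the behavior the Javadoc describes.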
32 changes: 32 additions & 0 deletions
jmix-search/search/src/main/java/io/jmix/search/index/fileparsing/FileParserKit.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
/* | ||
* Copyright 2024 Haulmont. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package io.jmix.search.index.fileparsing; | ||
|
||
import jakarta.validation.constraints.NotNull; | ||
import org.apache.tika.metadata.Metadata; | ||
import org.apache.tika.parser.ParseContext; | ||
import org.apache.tika.parser.Parser; | ||
import org.xml.sax.ContentHandler; | ||
|
||
import java.io.StringWriter; | ||
import java.util.function.Function; | ||
|
||
public record FileParserKit( | ||
@NotNull Parser parser, | ||
@NotNull Function<StringWriter, ContentHandler> contentHandlerGenerator, | ||
@NotNull Metadata metadata, | ||
@NotNull ParseContext parseContext) {} |
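The record stores a `Function<StringWriter, ContentHandler>` rather than a ready-made handler. One plausible reading (an assumption, not something the PR states) is that each parse can then get a handler bound to its own writer. The sketch below uses only JDK classes to show that pattern: `HandlerKit`, `WriterBackedHandler`, and `HandlerKitSketch` are hypothetical names, and a direct call to `characters` stands in for a real Tika parser run.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.io.StringWriter;
import java.util.function.Function;

// Hypothetical, trimmed-down counterpart of FileParserKit: only the handler factory is kept.
record HandlerKit(Function<StringWriter, ContentHandler> contentHandlerGenerator) {}

// Minimal handler that copies character data into the writer it was created with.
final class WriterBackedHandler extends DefaultHandler {

    private final StringWriter writer;

    WriterBackedHandler(StringWriter writer) {
        this.writer = writer;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        writer.write(ch, start, length);
    }
}

final class HandlerKitSketch {

    // Stands in for a parser run: creates a fresh writer and handler, then feeds text to the handler.
    static String extract(HandlerKit kit, String text) {
        StringWriter writer = new StringWriter();
        ContentHandler handler = kit.contentHandlerGenerator().apply(writer);
        char[] chars = text.toCharArray();
        try {
            handler.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException("Handler failed", e);
        }
        return writer.toString();
    }

    public static void main(String[] args) {
        HandlerKit kit = new HandlerKit(WriterBackedHandler::new);
        System.out.println(extract(kit, "first file"));
        System.out.println(extract(kit, "second file"));
    }
}
```

Because the writer is created per call, the text collected for one file does not leak into the result for the next one, even though the same kit instance is reused.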
52 changes: 52 additions & 0 deletions
jmix-search/search/src/main/java/io/jmix/search/index/fileparsing/FileParserResolver.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
/* | ||
* Copyright 2024 Haulmont. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package io.jmix.search.index.fileparsing; | ||
|
||
import io.jmix.core.FileRef; | ||
|
||
/** | ||
* Interface to implement in order to add a custom file parser resolver | ||
* or to modify the behavior of the existing ones. It makes it possible to assign a specific parser | ||
* to specific file types, with a custom implementation of the file-checking logic. The parsers are used to extract | ||
* file content, which is then sent to the search server for indexing. | ||
*/ | ||
public interface FileParserResolver { | ||
|
||
/** | ||
* Returns a description of the criteria for the files supported by this resolver. | ||
* This text is used to build the log message that is written | ||
* when none of the resolvers supports the given file. | ||
* | ||
* @return criteria description | ||
*/ | ||
String getCriteriaDescription(); | ||
|
||
/** | ||
* Returns an object that bundles everything needed to parse the supported file type. | ||
* | ||
* @return an instance of a file parser kit | ||
*/ | ||
FileParserKit getParserKit(); | ||
|
||
/** | ||
* Checks whether the file referenced by the given fileRef is supported by this resolver. | ||
* | ||
* @param fileRef object with the file information | ||
* @return {@code true} if the file is supported, {@code false} otherwise | ||
*/ | ||
boolean supports(FileRef fileRef); | ||
} |
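The consumer of this interface is not part of the files shown here, so the following is only a sketch of how the three methods might be used together, under stated assumptions. `NameBasedResolver` is a hypothetical, simplified mirror of `FileParserResolver` (a file name replaces `FileRef`, and `getParserKit()` is omitted), and the message wording in `unsupportedMessage` is invented for illustration. The lookup takes the first resolver whose `supports` check passes; if none does, the criteria descriptions of all resolvers are joined into a single log message.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical, simplified mirror of FileParserResolver; not the PR's interface.
interface NameBasedResolver {
    String getCriteriaDescription();

    boolean supports(String fileName);
}

final class ResolverLookupSketch {

    // Returns the first resolver that accepts the file, in list order.
    static Optional<NameBasedResolver> find(List<NameBasedResolver> resolvers, String fileName) {
        return resolvers.stream()
                .filter(resolver -> resolver.supports(fileName))
                .findFirst();
    }

    // Builds the text a caller could log when no resolver accepts the file (wording is illustrative).
    static String unsupportedMessage(List<NameBasedResolver> resolvers, String fileName) {
        String criteria = resolvers.stream()
                .map(NameBasedResolver::getCriteriaDescription)
                .collect(Collectors.joining(" "));
        return "No parser found for file '" + fileName + "'. Supported criteria: " + criteria;
    }

    // Small factory for demo resolvers that match on a single file-name extension.
    static NameBasedResolver suffixResolver(String name, String extension) {
        return new NameBasedResolver() {
            @Override
            public String getCriteriaDescription() {
                return "Parser: " + name + ". Supported extensions: " + extension + ".";
            }

            @Override
            public boolean supports(String fileName) {
                return fileName.endsWith("." + extension);
            }
        };
    }
}
```

This also shows why the description lives on the resolver: the caller only concatenates the texts and needs no knowledge of how each resolver decides what it supports.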
20 changes: 20 additions & 0 deletions
jmix-search/search/src/main/java/io/jmix/search/index/fileparsing/package-info.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
/* | ||
* Copyright 2020 Haulmont. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
@NonNullApi | ||
package io.jmix.search.index.fileparsing; | ||
|
||
import org.springframework.lang.NonNullApi; |
40 changes: 40 additions & 0 deletions
...ava/io/jmix/search/index/fileparsing/resolvers/LegacyMSOfficeDocumentsParserResolver.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
/* | ||
* Copyright 2024 Haulmont. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package io.jmix.search.index.fileparsing.resolvers; | ||
|
||
import io.jmix.search.index.fileparsing.AbstractExtensionBasedFileParserResolver; | ||
import org.apache.tika.parser.Parser; | ||
import org.apache.tika.parser.microsoft.OfficeParser; | ||
import org.springframework.core.annotation.Order; | ||
import org.springframework.stereotype.Component; | ||
|
||
import java.util.Set; | ||
|
||
@Component("search_LegacyMSOfficeDocumentsParserResolver") | ||
@Order(100) | ||
public class LegacyMSOfficeDocumentsParserResolver extends AbstractExtensionBasedFileParserResolver { | ||
|
||
@Override | ||
public Set<String> getSupportedExtensions() { | ||
return Set.of("doc", "xls", "DOC", "XLS"); | ||
} | ||
|
||
@Override | ||
public Parser getParser() { | ||
return new OfficeParser(); | ||
} | ||
} |
53 changes: 53 additions & 0 deletions
...main/java/io/jmix/search/index/fileparsing/resolvers/MSOfficeDocumentsParserResolver.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
/* | ||
* Copyright 2024 Haulmont. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package io.jmix.search.index.fileparsing.resolvers; | ||
|
||
import io.jmix.search.index.fileparsing.AbstractExtensionBasedFileParserResolver; | ||
import org.apache.tika.parser.ParseContext; | ||
import org.apache.tika.parser.Parser; | ||
import org.apache.tika.parser.microsoft.OfficeParserConfig; | ||
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser; | ||
import org.springframework.core.annotation.Order; | ||
import org.springframework.stereotype.Component; | ||
|
||
import java.util.Set; | ||
|
||
@Component("search_OfficeDocumentsParserResolver") | ||
@Order(100) | ||
public class MSOfficeDocumentsParserResolver extends AbstractExtensionBasedFileParserResolver { | ||
|
||
@Override | ||
public Set<String> getSupportedExtensions() { | ||
return Set.of("docx", "xlsx", "DOCX", "XLSX"); | ||
} | ||
|
||
@Override | ||
public Parser getParser() { | ||
return new OOXMLParser(); | ||
} | ||
|
||
@Override | ||
protected ParseContext getParseContext() { | ||
ParseContext parseContext = super.getParseContext(); | ||
|
||
OfficeParserConfig officeParserConfig = new OfficeParserConfig(); | ||
officeParserConfig.setIncludeHeadersAndFooters(false); | ||
parseContext.set(OfficeParserConfig.class, officeParserConfig); | ||
|
||
return parseContext; | ||
} | ||
} |
40 changes: 40 additions & 0 deletions
...in/java/io/jmix/search/index/fileparsing/resolvers/OpenOfficeDocumentsParserResolver.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
/* | ||
* Copyright 2024 Haulmont. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package io.jmix.search.index.fileparsing.resolvers; | ||
|
||
import io.jmix.search.index.fileparsing.AbstractExtensionBasedFileParserResolver; | ||
import org.apache.tika.parser.Parser; | ||
import org.apache.tika.parser.odf.OpenDocumentParser; | ||
import org.springframework.core.annotation.Order; | ||
import org.springframework.stereotype.Component; | ||
|
||
import java.util.Set; | ||
|
||
@Component("search_OpenOfficeDocumentsParserResolver") | ||
@Order(100) | ||
public class OpenOfficeDocumentsParserResolver extends AbstractExtensionBasedFileParserResolver { | ||
|
||
@Override | ||
public Set<String> getSupportedExtensions() { | ||
return Set.of("odt", "ods", "ODT", "ODS"); | ||
} | ||
|
||
@Override | ||
public Parser getParser() { | ||
return new OpenDocumentParser(); | ||
} | ||
} |
Do we actually need this public API method?
Wouldn't it be better to delegate this logic to the final consumer? Its only purpose is to generate a message like 'The file extension should be one of the following: ...'.
The consumer doesn't know anything about the file-checking criteria implemented in the resolvers. The aim was to give the user comprehensive information about what went wrong. If we remove this method, all we can say is "A resolver (and parser) for the file couldn't be found".
The message of the AbstractExtensionBasedFileParserResolver was corrected.
The method was left unchanged, as discussed.