Ability to extend supported extension - file parser mapping #3690 by fractal3000 · Pull Request #3683 · jmix-framework/jmix · GitHub

Ability to extend supported extension - file parser mapping #3690 #3683

Draft: wants to merge 77 commits into base: master
09297bd
Message text correction
fractal3000 Sep 6, 2024
e8ff174
Parser resolving mechanism refactoring.
fractal3000 Sep 7, 2024
b1bf8c1
UnsupportedFileExtensionExceptionTest
fractal3000 Sep 7, 2024
062b4fe
FileParserResolverTest
fractal3000 Sep 7, 2024
8d44aef
FileParserResolverTest
fractal3000 Sep 7, 2024
38aee04
File type correction
fractal3000 Sep 7, 2024
0e8ba5e
UnsupportedFileExtensionExceptionTest
fractal3000 Sep 7, 2024
b9fc042
Groovy tests correction
fractal3000 Sep 7, 2024
8c24fcc
minor change
fractal3000 Sep 7, 2024
6f82ea5
minor change
fractal3000 Sep 7, 2024
6a47dc0
FilePropertyValueExtractorTest
fractal3000 Sep 7, 2024
5dfd0e5
FilePropertyValueExtractorTest enhancement
fractal3000 Sep 9, 2024
7a3abd9
FilePropertyValueExtractorTest enhancement
fractal3000 Sep 9, 2024
737ff41
dependencies adding
fractal3000 Sep 9, 2024
9622593
Review correction
fractal3000 Sep 10, 2024
2b691ec
Review correction(exceptions)
fractal3000 Sep 10, 2024
4ac2288
Review correction(exceptions)
fractal3000 Sep 10, 2024
c9d056a
Review correction(exceptions)
fractal3000 Sep 10, 2024
3bd1bf9
Review correction(exceptions)
fractal3000 Sep 10, 2024
26857b6
Review correction(exceptions)
fractal3000 Sep 10, 2024
c790851
Parser resolvers
fractal3000 Sep 12, 2024
eb89525
Parser resolvers
fractal3000 Sep 12, 2024
81d3c55
FileParserResolverManager
fractal3000 Sep 12, 2024
d00b02b
FileParserResolverManager
fractal3000 Sep 12, 2024
2eb438a
FileProcessorTest
fractal3000 Sep 13, 2024
a7f2f7d
UnsupportedFileExtensionException
fractal3000 Sep 13, 2024
e2dc255
UnsupportedFileExtensionExceptionTest
fractal3000 Sep 13, 2024
10c639e
FilePropertyValueExtractorTest
fractal3000 Sep 13, 2024
8d8d1fd
OpenOfficeDocumentsParserResolver correction
fractal3000 Sep 13, 2024
58fdd1c
adding necessary dependency for testing purposes
fractal3000 Sep 13, 2024
4d2aae1
Resolvers correction
fractal3000 Sep 13, 2024
5632c05
test correction
fractal3000 Sep 13, 2024
c9403f0
Removing not necessary lines
fractal3000 Sep 13, 2024
179f1a6
Extensions problem
fractal3000 Sep 13, 2024
9edf65e
Packages reorganizing
fractal3000 Sep 13, 2024
a20b68d
Message correction.
fractal3000 Sep 13, 2024
a20e552
EmptyFileExtensionException message extending
fractal3000 Sep 13, 2024
d1b4850
Java doc
fractal3000 Sep 13, 2024
3cc0edf
Java doc
fractal3000 Sep 13, 2024
c105e4d
FileParserResolverManagerIntegrationTest creation and resolvers corre…
fractal3000 Sep 18, 2024
0a9d793
a not necessary extra dependency
fractal3000 Sep 18, 2024
6bf15a1
Method renaming
fractal3000 Sep 18, 2024
0380c81
FileParserResolver class's signature changing
fractal3000 Sep 18, 2024
8f43891
UnsupportedFileTypeException correction
fractal3000 Sep 18, 2024
fa9f068
FilePropertyValueExtractorTest correction
fractal3000 Sep 18, 2024
78bb10a
FileProcessorTest correction
fractal3000 Sep 18, 2024
d61537d
The tests correction
fractal3000 Sep 18, 2024
abb539e
FileParserResolverManager and the test correction
fractal3000 Sep 18, 2024
60b2126
AbstractExtensionBasedFileParserResolverTest
fractal3000 Sep 18, 2024
7c99bc0
AbstractExtensionBasedFileParserResolverTest
fractal3000 Sep 18, 2024
27a7afc
JavaDoc
fractal3000 Sep 18, 2024
bc20985
JavaDoc
fractal3000 Sep 18, 2024
c94bc7b
minor change
fractal3000 Sep 18, 2024
0205f72
minor change
fractal3000 Sep 18, 2024
711387b
minor change
fractal3000 Sep 18, 2024
7a241ea
minor change
fractal3000 Sep 18, 2024
78746b5
FileParserResolverManagerIntegrationTest extending
fractal3000 Sep 18, 2024
e6bde49
minor change
fractal3000 Sep 18, 2024
9380af4
code formatting
fractal3000 Sep 19, 2024
cf11816
Capital letters checking
fractal3000 Sep 26, 2024
48b2709
Removing not necessary custom exception
fractal3000 Sep 26, 2024
1e53e6f
Renaming the exception
fractal3000 Sep 26, 2024
c51d4c9
Code style changes
fractal3000 Sep 26, 2024
97b8cf4
JavaDocs correction
fractal3000 Sep 26, 2024
cca17b6
JavaDocs correction
fractal3000 Sep 26, 2024
a611807
Test correction
fractal3000 Sep 26, 2024
917a07a
List to Set changing
fractal3000 Sep 26, 2024
eecd553
FileParserResolverManager -> FileParserProvider
fractal3000 Sep 26, 2024
6f4e470
Message text correction
fractal3000 Sep 26, 2024
6372e9a
OldMSOfficeDocumentsParserResolver > LegacyMSOfficeDocumentsParserRes…
fractal3000 Sep 26, 2024
300fb7a
JavaDoc
fractal3000 Sep 26, 2024
3dd69e9
Getting FileParsingBundle with FileParserResolver
fractal3000 Sep 26, 2024
b6b966b
Comment adding
fractal3000 Sep 26, 2024
f490f6d
Uppercase extensions' support
fractal3000 Sep 27, 2024
32f6568
FileParsingBundle -> FileParserKit
fractal3000 Sep 27, 2024
10d8db8
BodyContentHandler -> ContentHandler
fractal3000 Sep 27, 2024
57e81d9
MSOfficeDocumentsParserResolver
fractal3000 Sep 27, 2024
2 changes: 2 additions & 0 deletions jmix-search/search/search.gradle
@@ -66,7 +66,9 @@ dependencies {
    testImplementation 'org.junit.jupiter:junit-jupiter-engine'
    testImplementation 'org.junit.jupiter:junit-jupiter-params'
    testImplementation 'org.junit.vintage:junit-vintage-engine'
    testImplementation 'org.spockframework:spock-core'
    testImplementation 'org.mockito:mockito-core'
    testImplementation 'ch.qos.logback:logback-classic'
    testRuntimeOnly 'org.slf4j:slf4j-simple'
    testRuntimeOnly 'org.hsqldb:hsqldb'
}
@@ -16,13 +16,29 @@

 package io.jmix.search.exception;

-import org.apache.commons.io.FilenameUtils;
+import java.util.List;

+/**
+ * An exception thrown when a user adds a file of an unsupported type,
+ * i.e. a file for which there is no known parser.
+ */
 public class UnsupportedFileFormatException extends Exception {

-    public static final String MESSAGE = "The file the-file-with-not-supported-extension.sql with 'sql' extension is not supported. Only following formats are supported: pdf, doc, docx, xls, xlsx, odt, ods, rtf, txt.";
+    private static final String MESSAGE = "The file %s can't be parsed. " +
+            "Only the following file parsing criteria are supported:\n -%s";

+    /**
+     * @param fileName            the name of the file whose type is not supported
+     * @param supportedExtensions the list of the criteria that are supported in the application
+     */
+    public UnsupportedFileFormatException(String fileName, List<String> supportedExtensions) {
+        super(String.format(
+                MESSAGE,
+                fileName,
+                getSupportedExtensionsString(supportedExtensions)));
+    }
+
-    public UnsupportedFileFormatException(String fileName) {
-        super(String.format(MESSAGE, fileName, FilenameUtils.getExtension(fileName)));
+    protected static String getSupportedExtensionsString(List<String> supportedExtensions) {
+        return String.join("\n -", supportedExtensions);
     }
 }
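To make the new message layout easier to picture, here is a minimal, self-contained sketch of the same formatting. The class `UnsupportedFileFormatSketch` and the sample criteria strings are hypothetical stand-ins, not part of the PR; only the message template and the `"\n -"` join are taken from the diff.

```java
import java.util.List;

// Hypothetical stand-in for UnsupportedFileFormatException; it reuses the message template from the diff.
class UnsupportedFileFormatSketch extends Exception {

    private static final String MESSAGE = "The file %s can't be parsed. "
            + "Only the following file parsing criteria are supported:\n -%s";

    UnsupportedFileFormatSketch(String fileName, List<String> criteria) {
        // The template already supplies the first "\n -", so joining with the same
        // separator puts every criterion on its own dash-prefixed line.
        super(String.format(MESSAGE, fileName, String.join("\n -", criteria)));
    }
}

class UnsupportedFileFormatDemo {
    public static void main(String[] args) {
        // Sample criteria, shaped like the output of getCriteriaDescription().
        List<String> criteria = List.of(
                "Parser: ParserA. Supported extensions: doc, xls.",
                "Parser: ParserB. Supported extensions: odt, ods.");
        System.out.println(new UnsupportedFileFormatSketch("report.sql", criteria).getMessage());
    }
}
```

Each criterion ends up on a separate line, which is presumably why the separator changed from a comma-separated extension list to a dash-prefixed list of resolver descriptions.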
@@ -0,0 +1,104 @@
/*
* Copyright 2024 Haulmont.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package io.jmix.search.index.fileparsing;

import com.google.common.base.Strings;
import io.jmix.core.FileRef;
import org.apache.commons.io.FilenameUtils;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.StringWriter;
import java.util.Set;
import java.util.function.Function;

/**
 * Implements the logic common to all extension-based file parser resolvers.
 */
public abstract class AbstractExtensionBasedFileParserResolver implements FileParserResolver {

    /**
     * Returns the set of file extensions supported by this resolver.
     * Note that the extension check is case-sensitive, so to support both the uppercase
     * and the lowercase form of an extension, each form must be listed explicitly,
     * e.g. ["xlsx", "XLSX", "docx", "DOCX"].
     *
     * @return collection of supported extensions
     */
    public abstract Set<String> getSupportedExtensions();

    @Override
    public String getCriteriaDescription() {
        return String.format(
                "Parser: %s. Supported extensions: %s.",
                this.getClass().getSimpleName(),
                getSupportedExtensionsString(getSupportedExtensions()));
    }

    @Override
    public boolean supports(FileRef fileRef) {
        String fileName = fileRef.getFileName();
        String fileExtension = FilenameUtils.getExtension(fileName);
        if (Strings.isNullOrEmpty(fileExtension)) {
            return false;
        }

        return getSupportedExtensions().contains(fileExtension);
    }

    protected String getSupportedExtensionsString(Set<String> supportedExtensions) {
        return String.join(", ", supportedExtensions);
    }

    @Override
    public FileParserKit getParserKit() {
        return new FileParserKit(
                getParser(),
                getContentHandlerGenerator(),
                getMetadata(),
                getParseContext());
    }

    /**
     * Returns a parser for the supported file type.
     */
    protected abstract Parser getParser();

    /**
     * Returns a function that generates the ContentHandler needed to parse the given file.
     */
    protected Function<StringWriter, ContentHandler> getContentHandlerGenerator() {
        return stringWriter -> new BodyContentHandler(stringWriter);
    }

    /**
     * Returns a Metadata object for parsing the given file.
     */
    protected Metadata getMetadata() {
        return new Metadata();
    }

    /**
     * Returns a ParseContext object for parsing the given file.
     */
    protected ParseContext getParseContext() {
        return new ParseContext();
    }
}
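The case-sensitivity note in the Javadoc of `getSupportedExtensions()` can be illustrated with a small standalone sketch. `ExtensionMatchSketch` and its `extensionOf` helper are hypothetical; the helper is a simplified stand-in for `FilenameUtils.getExtension` (it ignores directory separators), and file names replace `FileRef` so that the example needs only the JDK.

```java
import java.util.Set;

class ExtensionMatchSketch {

    // Simplified stand-in for FilenameUtils.getExtension: the text after the last dot, or "" if there is none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    // Mirrors supports(): an empty extension is rejected, otherwise an exact (case-sensitive) Set lookup decides.
    static boolean supports(Set<String> supportedExtensions, String fileName) {
        String extension = extensionOf(fileName);
        if (extension.isEmpty()) {
            return false;
        }
        return supportedExtensions.contains(extension);
    }

    public static void main(String[] args) {
        Set<String> lowerOnly = Set.of("docx", "xlsx");
        Set<String> bothCases = Set.of("docx", "xlsx", "DOCX", "XLSX");

        System.out.println(supports(lowerOnly, "report.XLSX")); // uppercase form is not listed
        System.out.println(supports(bothCases, "report.XLSX")); // uppercase form is listed explicitly
        System.out.println(supports(bothCases, "README"));      // no extension at all
    }
}
```

Under this scheme a mixed-case name such as `report.Xlsx` would match neither set, since only the exact strings in the set are accepted.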
@@ -0,0 +1,32 @@
/*
* Copyright 2024 Haulmont.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package io.jmix.search.index.fileparsing;

import jakarta.validation.constraints.NotNull;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;

import java.io.StringWriter;
import java.util.function.Function;

public record FileParserKit(
        @NotNull Parser parser,
        @NotNull Function<StringWriter, ContentHandler> contentHandlerGenerator,
        @NotNull Metadata metadata,
        @NotNull ParseContext parseContext) {}
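The `contentHandlerGenerator` component is a function rather than a ready-made handler, presumably so that the consumer can bind a fresh `StringWriter` to a handler for each file. The diff does not show the consumer, so the following is only a sketch of that idea: `HandlerGeneratorSketch`, `WriterHandler`, and `extract` are hypothetical names, and `WriterHandler` is a JDK-only stand-in for Tika's `BodyContentHandler`.

```java
import java.io.StringWriter;
import java.util.function.Function;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class HandlerGeneratorSketch {

    // Stand-in for Tika's BodyContentHandler: copies character events into the writer.
    static class WriterHandler extends DefaultHandler {
        private final StringWriter writer;

        WriterHandler(StringWriter writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            writer.write(ch, start, length);
        }
    }

    // Hypothetical consumer: creates a writer, asks the generator for a handler bound to it,
    // feeds SAX events to the handler, and reads the collected text back from the writer.
    static String extract(Function<StringWriter, ContentHandler> generator, String text) {
        StringWriter writer = new StringWriter();
        ContentHandler handler = generator.apply(writer);
        try {
            // A real parser would emit these events while reading the file.
            handler.startDocument();
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return writer.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(WriterHandler::new, "hello"));
    }
}
```

Because the generator receives the writer, the caller keeps ownership of the output buffer while a resolver remains free to choose the handler type.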
@@ -0,0 +1,52 @@
/*
* Copyright 2024 Haulmont.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package io.jmix.search.index.fileparsing;

import io.jmix.core.FileRef;

/**
 * Interface to implement in order to add a custom file parser resolver
 * or to change the behavior of the existing file parser resolvers. It makes it possible to bind a specific parser
 * to specific file types through a custom implementation of the file-checking logic. The parsers are used to extract
 * file content, which is then sent to the search server for indexing.
 */
public interface FileParserResolver {

    /**
     * Returns a description of the criteria that a file must meet to be supported by this resolver.
     * The text is used to build the log message that is written
     * when none of the resolvers supports the given file.
     *
     * @return criteria description
     */
    String getCriteriaDescription();
Review comment (Collaborator):

Do we actually need this public API method?
Wouldn't it be better to delegate this logic to the final consumer? Its only purpose is to generate a message like 'The file extension should be one of the following: ...'.

Reply (Contributor Author):

The consumer knows nothing about the file-checking criteria implemented in the resolvers. The aim was to let the user get comprehensive information about what went wrong. If we remove this method, all we could say is "A resolver (and parser) for the file couldn't be found".

Reply (Contributor Author):

The message of the AbstractExtensionBasedFileParserResolver was corrected.

Reply (Contributor Author):

The method was left unchanged, as discussed.


    /**
     * Returns a composite object that holds everything needed to parse files of the supported type.
     *
     * @return an instance of a file parser kit
     */
    FileParserKit getParserKit();

    /**
     * Checks whether the file referenced by the given fileRef is supported by this resolver.
     *
     * @param fileRef object with the file information
     * @return true if the resolver supports the file, false otherwise
     */
    boolean supports(FileRef fileRef);
}
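The review thread above turns on how a consumer can report why no resolver matched. The diff does not include the consumer (the commit list mentions a `FileParserProvider`, but its code is not shown here), so the following is only a sketch under stated assumptions: `ResolverSelectionSketch`, `Resolver`, `byExtension`, and `findParserKit` are hypothetical, a file name replaces `FileRef`, a `String` replaces `FileParserKit`, and an unchecked exception replaces `UnsupportedFileFormatException`.

```java
import java.util.List;
import java.util.stream.Collectors;

class ResolverSelectionSketch {

    // Simplified stand-in for FileParserResolver.
    interface Resolver {
        String getCriteriaDescription();

        boolean supports(String fileName);

        String getParserKit();
    }

    // Hypothetical factory for a resolver that accepts exactly one extension.
    static Resolver byExtension(String parserName, String extension) {
        return new Resolver() {
            @Override
            public String getCriteriaDescription() {
                return "Parser: " + parserName + ". Supported extensions: " + extension + ".";
            }

            @Override
            public boolean supports(String fileName) {
                return fileName.endsWith("." + extension);
            }

            @Override
            public String getParserKit() {
                return parserName;
            }
        };
    }

    // Assumed lookup: the first resolver that supports the file wins; if none does,
    // the error message lists the criteria description of every resolver.
    static String findParserKit(List<Resolver> resolvers, String fileName) {
        for (Resolver resolver : resolvers) {
            if (resolver.supports(fileName)) {
                return resolver.getParserKit();
            }
        }
        String criteria = resolvers.stream()
                .map(Resolver::getCriteriaDescription)
                .collect(Collectors.joining("\n -"));
        throw new IllegalArgumentException("The file " + fileName + " can't be parsed. "
                + "Only the following file parsing criteria are supported:\n -" + criteria);
    }

    public static void main(String[] args) {
        List<Resolver> resolvers = List.of(byExtension("PdfParser", "pdf"), byExtension("TxtParser", "txt"));
        System.out.println(findParserKit(resolvers, "notes.txt"));
    }
}
```

In this arrangement the consumer never inspects extensions itself; it only relays whatever each resolver says about its own criteria, which is the argument the author makes for keeping `getCriteriaDescription()` in the public API.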
@@ -0,0 +1,20 @@
/*
* Copyright 2020 Haulmont.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

@NonNullApi
package io.jmix.search.index.fileparsing;

import org.springframework.lang.NonNullApi;
@@ -0,0 +1,40 @@
/*
* Copyright 2024 Haulmont.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package io.jmix.search.index.fileparsing.resolvers;

import io.jmix.search.index.fileparsing.AbstractExtensionBasedFileParserResolver;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParser;
import org.springframework.core.annotation.Order;
import org.springframework.stereotype.Component;

import java.util.Set;

@Component("search_LegacyMSOfficeDocumentsParserResolver")
@Order(100)
public class LegacyMSOfficeDocumentsParserResolver extends AbstractExtensionBasedFileParserResolver {

    @Override
    public Set<String> getSupportedExtensions() {
        return Set.of("doc", "xls", "DOC", "XLS");
    }

    @Override
    public Parser getParser() {
        return new OfficeParser();
    }
}
@@ -0,0 +1,53 @@
/*
* Copyright 2024 Haulmont.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package io.jmix.search.index.fileparsing.resolvers;

import io.jmix.search.index.fileparsing.AbstractExtensionBasedFileParserResolver;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.springframework.core.annotation.Order;
import org.springframework.stereotype.Component;

import java.util.Set;

@Component("search_OfficeDocumentsParserResolver")
@Order(100)
public class MSOfficeDocumentsParserResolver extends AbstractExtensionBasedFileParserResolver {

    @Override
    public Set<String> getSupportedExtensions() {
        return Set.of("docx", "xlsx", "DOCX", "XLSX");
    }

    @Override
    public Parser getParser() {
        return new OOXMLParser();
    }

    @Override
    protected ParseContext getParseContext() {
        ParseContext parseContext = super.getParseContext();

        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setIncludeHeadersAndFooters(false);
        parseContext.set(OfficeParserConfig.class, officeParserConfig);

        return parseContext;
    }
}
@@ -0,0 +1,40 @@
/*
* Copyright 2024 Haulmont.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package io.jmix.search.index.fileparsing.resolvers;

import io.jmix.search.index.fileparsing.AbstractExtensionBasedFileParserResolver;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.odf.OpenDocumentParser;
import org.springframework.core.annotation.Order;
import org.springframework.stereotype.Component;

import java.util.Set;

@Component("search_OpenOfficeDocumentsParserResolver")
@Order(100)
public class OpenOfficeDocumentsParserResolver extends AbstractExtensionBasedFileParserResolver {

    @Override
    public Set<String> getSupportedExtensions() {
        return Set.of("odt", "ods", "ODT", "ODS");
    }

    @Override
    public Parser getParser() {
        return new OpenDocumentParser();
    }
}