truerss/content-extractor: Java library. Detect top-level selector on the HTML page.

Content-Extractor

Java library.

Returns the selector with the largest amount of content.

Add the dependency to your build:

// sbt: 

"io.github.truerss" % "content-extractor" % "1.1.0"

// maven: 

<dependency>
  <groupId>io.github.truerss</groupId>
  <artifactId>content-extractor</artifactId>
  <version>1.1.0</version>
</dependency>

// gradle:

implementation 'io.github.truerss:content-extractor:1.1.0'

jsoup must also be present on the classpath.

Example:

import com.github.truerss.ContentExtractor;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Document;

// Fetch and parse the page with jsoup.
String url = "http://example.com/post.html";
Document doc = Jsoup.connect(url).get();
Element body = doc.body();
// Ask the extractor for the selector holding the most content.
ExtractResult result = ContentExtractor.extract(body);
System.out.println("==========> " + result.selector);
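
To give a feel for what "the selector with the largest amount of content" can mean, here is a minimal, dependency-free sketch. It is not the library's actual algorithm: `Node`, `LargestContentSketch`, and `find` are hypothetical names made up for illustration, and the scoring rule (length of the text directly inside each element) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy tree walk, not content-extractor's implementation.
final class LargestContentSketch {

    // Hypothetical stand-in for a parsed HTML element (not a jsoup type).
    static final class Node {
        final String selector;   // e.g. "div#main"
        final String ownText;    // text directly inside this element
        final List<Node> children = new ArrayList<>();

        Node(String selector, String ownText) {
            this.selector = selector;
            this.ownText = ownText;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    private String bestPath = "";
    private int bestLength = -1;

    // Returns the CSS-like path of the node with the most direct text.
    static String find(Node root) {
        LargestContentSketch sketch = new LargestContentSketch();
        sketch.visit(root, root.selector);
        return sketch.bestPath;
    }

    // Depth-first walk that remembers the path with the longest own text.
    private void visit(Node node, String path) {
        int length = node.ownText.trim().length();
        if (length > bestLength) {
            bestLength = length;
            bestPath = path;
        }
        for (Node child : node.children) {
            visit(child, path + " > " + child.selector);
        }
    }

    public static void main(String[] args) {
        Node page = new Node("body", "")
            .add(new Node("nav", "Home About"))
            .add(new Node("div#main", "")
                .add(new Node("article", "A long paragraph of post text that dominates the page.")));
        System.out.println(find(page)); // prints "body > div#main > article"
    }
}
```

On a real page, navigation and footer elements usually carry little text of their own, so a rule like this tends to land on the article container.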

License: MIT
