Release dataset-39996091 · wilsonzlin/hackerverse · GitHub

dataset-39996091

@wilsonzlin released this 08 May 10:48 · 15 commits to master since this release

Dataset up to item ID 39996091 (inclusive)

All *.arrow files are Apache Arrow files. No column contains null or NaN values; where a value is missing, a default (empty string, 1970, 0, false, etc.) is stored instead.

| File | Description | Columns |
| --- | --- | --- |
| comment_sentiments.arrow | Predicted sentiment probabilities (0-1 inclusive, sum to 1) of ~30M HN comments. | id u32, positive f32, neutral f32, negative f32 |
| comment_texts.arrow | Text of ~30M HN comments, HTML, unchanged from HN API. | id u32, text str |
| comments.arrow | Metadata for ~30M HN comments. Authors link to an ID of a users row. | id u32, deleted bool, dead bool, score i16, parent u32, author u32, ts timestamp, post u32 |
| interactions.arrow | Posts that a user has commented on at least once. | user u32, post u32 |
| post_texts.arrow | Text of ~4M posts, HTML, unchanged from HN API. | id u32, text str |
| post_titles.arrow | Titles of ~4M posts, HTML, unchanged from HN API. | id u32, text str |
| posts.arrow | Metadata for ~4M HN posts. emb_missing_page means the page linked to could not be crawled. URLs link to an ID of a row in the url* tables. Authors link to an ID of a users row. | id u32, deleted bool, dead bool, score i16, author u32, ts timestamp, url u32, emb_missing_page bool |
| url_metas.arrow | Extracted metadata about crawled pages. | id u32, description str, image_url str, lang str, snippet str, timestamp timestamp, timestamp_modified timestamp, title str |
| url_texts.arrow | Extracted text from crawled pages. | id u32, text str |
| urls.arrow | Crawler tasks. URL does not contain the protocol. | id u32, url str, proto str, fetched timestamp, fetch_err str, fetched_via str, found_in_archive bool |
| users.arrow | Users. IDs are arbitrary and not official. | id u32, username str |
| comment-embs-data.mat | Embeddings for ~40M comments, generated using jina-v2-small. Packed f32 LE matrix of shape (N, 512). | |
| comment-embs-ids.mat | Corresponding ID for each row in comment-embs-data.mat. Packed u32 LE matrix of shape (N,). | |
| post-embs-data.mat | Embeddings for ~4M posts, generated using jina-v2-small. Packed f32 LE matrix of shape (N, 512). | |
| post-embs-ids.mat | Corresponding ID for each row in post-embs-data.mat. Packed u32 LE matrix of shape (N,). | |
| toppost-embs-data.mat | Embeddings for ~650K posts with score >= 10, generated using bge-m3. Packed f32 LE matrix of shape (N, 1024). | |
| toppost-embs-ids.mat | Corresponding ID for each row in toppost-embs-data.mat. Packed u32 LE matrix of shape (N,). | |
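
The *.mat files are described only as packed little-endian matrices, so they can presumably be read as raw buffers without a dedicated parser. Below is a minimal sketch using NumPy, assuming the files have no header and are stored in row-major order (neither is stated explicitly in the release notes); `load_embeddings` and `top_k_cosine` are hypothetical helper names.

```python
import numpy as np


def load_embeddings(data_path, ids_path, dim):
    """Memory-map a packed embedding matrix and its ID vector.

    Assumes raw little-endian data with no header: f32 values in
    row-major order for the matrix, u32 values for the IDs.
    """
    ids = np.memmap(ids_path, dtype="<u4", mode="r")
    data = np.memmap(data_path, dtype="<f4", mode="r")
    if data.size != ids.size * dim:
        raise ValueError(
            f"expected {ids.size * dim} floats for {ids.size} rows "
            f"of dimension {dim}, found {data.size}"
        )
    return ids, data.reshape(ids.size, dim)


def top_k_cosine(query, embs, ids, k=5):
    """Brute-force cosine similarity search.

    Returns the IDs and scores of the k rows most similar to `query`.
    Scans the whole matrix, so it suits the smaller files best.
    """
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    norms = np.linalg.norm(embs, axis=1)
    scores = (embs @ q) / norms
    order = np.argsort(-scores)[:k]
    return ids[order], scores[order]
```

With these assumptions, `load_embeddings("toppost-embs-data.mat", "toppost-embs-ids.mat", 1024)` would return the ID vector and an (N, 1024) matrix, and the jina-v2-small files would use `dim=512`. For the comment embeddings, a full scan touches the entire file, so processing in chunks or using an approximate nearest-neighbour index is likely a better fit.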