Releases: wilsonzlin/hackerverse
dataset-39996091
Dataset up to item ID 39996091 (inclusive)
All *.arrow files are Apache Arrow files. No column contains null/NaN values; where a value is missing, a default (empty string, 1970, 0, false, etc.) is used instead.
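Because missing values are stored as defaults rather than nulls, consumers may want to map the unambiguous placeholders back to `None` after loading. The following is a minimal sketch, not part of the dataset tooling: the helper name `or_none` is hypothetical, and it assumes that "1970" means the Unix epoch (1970-01-01T00:00:00 UTC) and that naive timestamps are UTC. Defaults of `0` and `false` are left untouched because they cannot be told apart from real values.

```python
from datetime import datetime, timezone

# Assumption: the "1970" default refers to the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def or_none(value):
    """Map an unambiguous default placeholder back to None.

    Only empty strings and epoch timestamps are treated as missing.
    Values such as 0 or False are returned unchanged, since a real
    score of 0 or a real `deleted = false` looks identical to the default.
    """
    if isinstance(value, str) and value == "":
        return None
    if isinstance(value, datetime):
        # Assumption: naive timestamps are interpreted as UTC.
        aware = value if value.tzinfo else value.replace(tzinfo=timezone.utc)
        if aware == EPOCH:
            return None
    return value
```

For example, `or_none(row["fetch_err"])` would yield `None` for a crawl without an error, provided the row has already been converted to plain Python values.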
File | Description | Columns |
---|---|---|
comment_sentiments.arrow | Predicted sentiment probabilities (0-1 inclusive, sum to 1) of ~30M HN comments. | id u32, positive f32, neutral f32, negative f32 |
comment_texts.arrow | Text of ~30M HN comments, HTML, unchanged from HN API. | id u32, text str |
comments.arrow | Metadata for ~30M HN comments. Authors link to an ID of a users row. | id u32, deleted bool, dead bool, score i16, parent u32, author u32, ts timestamp, post u32 |
interactions.arrow | Posts that a user has commented on at least once. | user u32, post u32 |
post_texts.arrow | Text of ~4M posts, HTML, unchanged from HN API. | id u32, text str |
post_titles.arrow | Titles of ~4M posts, HTML, unchanged from HN API. | id u32, text str |
posts.arrow | Metadata for ~4M HN posts. emb_missing_page means the page linked to could not be crawled. URLs link to an ID of a row in the url* tables. Authors link to an ID of a users row. | id u32, deleted bool, dead bool, score i16, author u32, ts timestamp, url u32, emb_missing_page bool |
url_metas.arrow | Extracted metadata about crawled pages. | id u32, description str, image_url str, lang str, snippet str, timestamp timestamp, timestamp_modified timestamp, title str |
url_texts.arrow | Extracted text from crawled pages. | id u32, text str |
urls.arrow | Crawler tasks. URL does not contain the protocol. | id u32, url str, proto str, fetched timestamp, fetch_err str, fetched_via str, found_in_archive bool |
users.arrow | Users. IDs are arbitrary and not official. | id u32, username str |
comment-embs-data.mat | Embeddings for ~40M comments, generated using jina-v2-small. | Packed f32 LE matrix of shape (N, 512). |
comment-embs-ids.mat | Corresponding ID for each row in comment-embs-data.mat. | Packed u32 LE matrix of shape (N,). |
post-embs-data.mat | Embeddings for ~4M posts, generated using jina-v2-small. | Packed f32 LE matrix of shape (N, 512). |
post-embs-ids.mat | Corresponding ID for each row in post-embs-data.mat. | Packed u32 LE matrix of shape (N,). |
toppost-embs-data.mat | Embeddings for ~650K posts with score >= 10, generated using bge-m3. | Packed f32 LE matrix of shape (N, 1024). |
toppost-embs-ids.mat | Corresponding ID for each row in toppost-embs-data.mat. | Packed u32 LE matrix of shape (N,). |
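
The `*.mat` files are described as packed little-endian matrices, so a row can be located by byte offset alone. The sketch below uses only the Python standard library and rests on two assumptions that the description does not state explicitly: the files have no header, and the matrix is stored in row-major order. The function names `read_ids` and `read_row` are hypothetical.

```python
import struct
from pathlib import Path

F32_SIZE = 4  # bytes per little-endian f32
U32_SIZE = 4  # bytes per little-endian u32


def read_ids(ids_path):
    """Read a packed little-endian u32 vector of shape (N,).

    Assumption: the file is a bare array with no header.
    Returns a plain list, which is simple but memory-hungry for ~30M IDs.
    """
    raw = Path(ids_path).read_bytes()
    if len(raw) % U32_SIZE != 0:
        raise ValueError("ID file size is not a multiple of 4 bytes")
    return [value for (value,) in struct.iter_unpack("<I", raw)]


def read_row(data_path, row, dim):
    """Read embedding number `row` from a packed f32 LE matrix of shape (N, dim).

    Assumption: rows are stored contiguously (row-major) with no header,
    so row i starts at byte offset i * dim * 4.
    """
    row_bytes = dim * F32_SIZE
    with open(data_path, "rb") as f:
        f.seek(row * row_bytes)
        buf = f.read(row_bytes)
    if len(buf) != row_bytes:
        raise IndexError(f"row {row} is out of range")
    return list(struct.unpack(f"<{dim}f", buf))
```

Under these assumptions, `read_row("post-embs-data.mat", i, 512)` would return the embedding whose item ID is `read_ids("post-embs-ids.mat")[i]`; use `dim=1024` for the `toppost-embs-*` files. For bulk access, memory-mapping the file with NumPy (`numpy.memmap` with dtype `"<f4"`, reshaped to `(-1, dim)`) is likely a better fit than reading rows one at a time.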