Divide and conquer: Because of the 1GB memory limit, we can't load 100GB of URLs into memory at once. So we split the raw URL input into 1000 shards; each shard is, in expectation, about 100MB and fits in memory. Then we count URLs within each shard and maintain a running top-k. The approach is as follows (a minimal sketch is given after the list):
1. Split the input into 1000 shards, making sure identical URLs always go to the same shard
2. For each shard, use a hash_map to count every URL and update the current top-k (a priority_queue)
3. After all shards are processed, the current top-k is the answer
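A minimal sketch of this flow is shown below. It is an illustration only: the file names, shard count, and k are placeholders, error handling is omitted, and the real solve_topk handles its I/O differently.

```cpp
#include <cstdint>
#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative constants; the real solve_topk derives them from its arguments.
constexpr int kNumShards = 1000;
constexpr size_t kTopK = 100;

using Entry = std::pair<uint64_t, std::string>;  // (count, url)

int main() {
    // 1. Split: hash each URL so identical URLs land in the same shard file.
    //    (Keeping 1000 files open at once may require raising the fd limit.)
    {
        std::ifstream in("urls.txt");
        std::vector<std::ofstream> shards(kNumShards);
        for (int i = 0; i < kNumShards; ++i)
            shards[i].open("shard_" + std::to_string(i) + ".txt");
        std::string url;
        while (std::getline(in, url))
            shards[std::hash<std::string>{}(url) % kNumShards] << url << '\n';
    }

    // 2. Per shard: count with a hash_map, then push candidates into a global
    //    min-heap that always keeps the k largest counts seen so far.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> topk;
    for (int i = 0; i < kNumShards; ++i) {
        std::unordered_map<std::string, uint64_t> counts;
        std::ifstream shard("shard_" + std::to_string(i) + ".txt");
        std::string url;
        while (std::getline(shard, url)) ++counts[url];
        for (const auto& [u, c] : counts) {
            topk.emplace(c, u);
            if (topk.size() > kTopK) topk.pop();  // drop the current minimum
        }
    }

    // 3. The heap now holds the global top-k (smallest count on top).
    while (!topk.empty()) {
        std::cout << topk.top().first << ' ' << topk.top().second << '\n';
        topk.pop();
    }
}
```

This stays within the memory limit because only one shard's hash_map lives in memory at a time, plus a k-entry heap.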
I tried many strategies to improve performance, two of which work very well:
- g++ optimization level: -O2 or -O3. unordered_map runs much faster when -O2 or -O3 is enabled.
- Concurrent hash_map: libcuckoo's cuckoohash_map replaces the built-in unordered_map (a usage sketch follows this list).
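Below is a minimal sketch of how the counting loop could use libcuckoo, assuming its documented upsert API (in libcuckoo versions before 0.3 the class lives in the global namespace rather than libcuckoo::). The thread count and the way URLs are partitioned are illustrative, not the exact scheme used in solve_topk.

```cpp
#include <cstdint>
#include <libcuckoo/cuckoohash_map.hh>
#include <string>
#include <thread>
#include <vector>

// Count the URLs of one shard with several threads sharing a concurrent map.
void count_urls(const std::vector<std::string>& urls,
                libcuckoo::cuckoohash_map<std::string, uint64_t>& counts,
                int num_threads = 8) {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&urls, &counts, t, num_threads] {
            for (size_t i = t; i < urls.size(); i += num_threads)
                // upsert: insert the key with count 1, or apply the lambda
                // to the existing value if the key is already present.
                counts.upsert(urls[i], [](uint64_t& c) { ++c; }, 1);
        });
    }
    for (auto& w : workers) w.join();
}
```

Compared with a single mutex around unordered_map, the cuckoo hash map lets the threads update counters mostly without blocking each other.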
Currently, for 10GB of URL data (100GB is too slow to run on my local machine) with a 100MB memory limit, it takes about 500s to solve the top-k problem, of which:
- 450s is file read/write. I think this is reasonable because 20GB of reads and 10GB of writes are necessary.
- 50s is hash_map cost.
A further optimization is to overlap computation with disk I/O. The work consists of three phases:
1. disk read
2. hash code calculation
3. disk write
Since disk I/O dominates, the hash_map work of phase 2 can be hidden behind phases 1 and 3. The strategy can be implemented with a concurrent queue of producers and consumers, or with asynchronous I/O (aio):
1. Wait for the asynchronous read (aio) of the current shard into memory to finish
2. Issue an asynchronous read of the next shard
3. Process the current shard
Potentially the 50s of hash_map cost in the 10GB task can be hidden this way, a roughly 10% improvement. I leave this strategy as future work; a minimal producer/consumer sketch is shown below.
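The sketch uses a blocking queue built from std::mutex and std::condition_variable. read_shard and process_shard are placeholders for the real disk I/O and hash_map work; in practice the queue would be bounded so prefetched shards still respect the memory limit.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Tiny blocking queue; an empty optional is the end-of-input sentinel.
class ShardQueue {
    std::queue<std::optional<std::vector<std::string>>> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(std::optional<std::vector<std::string>> shard) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(shard)); }
        cv_.notify_one();
    }
    std::optional<std::vector<std::string>> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        auto shard = std::move(q_.front());
        q_.pop();
        return shard;
    }
};

// Placeholders for the real disk read and hash_map counting.
std::vector<std::string> read_shard(int i) { return {"url" + std::to_string(i)}; }
void process_shard(const std::vector<std::string>& urls) { (void)urls; }

void pipeline(int num_shards) {
    ShardQueue queue;
    // Producer: reads shard i+1 from disk while the consumer counts shard i.
    std::thread producer([&queue, num_shards] {
        for (int i = 0; i < num_shards; ++i) queue.push(read_shard(i));
        queue.push(std::nullopt);  // signal that no more shards are coming
    });
    // Consumer: hash_map work overlaps with the producer's disk reads.
    while (auto shard = queue.pop()) process_shard(*shard);
    producer.join();
}
```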
There are many research papers that focus on optimizing the performance of the top-k problem. Their ideas are based on the observation that an exact top-k is often unnecessary and an approximate top-k is acceptable in many industrial scenarios.
Traditional solutions for finding top-k hot items can be divided into two categories: sketch-based and counter-based.
- Sketch-based: the Count-Min sketch and the Count sketch.
- Counter-based: Space-Saving, Frequent, and Lossy Counting.
These approximate solutions can achieve large performance gains. You can find more information in a recent paper named SSS, and I leave an approximate top-k solution as future work here; a minimal sketch of the Space-Saving idea follows for illustration.
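To make the counter-based family concrete, here is a minimal Space-Saving sketch (my own illustration, not code from fast-topk or the SSS paper). Real implementations replace the linear minimum scan with a "stream summary" structure so that each update is O(1).

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Space-Saving keeps at most `capacity` counters. Counts are over-estimates,
// but items whose true frequency is high are guaranteed to stay tracked.
class SpaceSaving {
    size_t capacity_;
    std::unordered_map<std::string, uint64_t> counters_;
public:
    explicit SpaceSaving(size_t capacity) : capacity_(capacity) {}

    void add(const std::string& url) {
        auto it = counters_.find(url);
        if (it != counters_.end()) {
            ++it->second;                      // already tracked: increment
        } else if (counters_.size() < capacity_) {
            counters_.emplace(url, 1);         // free slot: start tracking
        } else {
            // Evict the minimum counter and let the new item inherit count+1,
            // so its counter is an upper bound on its true frequency.
            auto min_it = counters_.begin();
            for (auto i = counters_.begin(); i != counters_.end(); ++i)
                if (i->second < min_it->second) min_it = i;
            uint64_t inherited = min_it->second + 1;
            counters_.erase(min_it);
            counters_.emplace(url, inherited);
        }
    }

    // The approximate top-k can be read off the surviving counters.
    const std::unordered_map<std::string, uint64_t>& counters() const {
        return counters_;
    }
};
```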
If no input is given, we need to generate data ourselves:
cd /path/to/TOPK
make clean && make
./solve_topk 1 0 10
solve_topk needs the following parameters:
<gen>: whether to generate input. 0: no, 1: yes
<real_url>: whether to use realistic URL lengths (60-120 characters). 0: no, 1: yes
<scale>: workload size. 1: 1GB, 10: 10GB
To check fast-topk's correctness, I generate workloads of different scales and implement a naive solution as the reference. The reference solution assumes memory is unlimited; it uses a hash_map to count every URL and a min-heap to extract the top-k.
We call the naive solution ref-topk and the memory-limited solution above fast-topk.
Then we run ref-topk and fast-topk on the different workloads and compare their results to verify correctness.
Finally, fast-topk passes my black-box tests.
Because no real input is given, I generate 1GB and 10GB datasets to tune performance, and workloads of different scales to check correctness.
For 1GB data, it takes 120s, of which:
- 30s to split
- 10s to read shard files
- 80s is probably software cost, most of which comes from hash_map insert operations
With g++ -O2 or -O3, the hash_map cost for 1GB data drops to 20s. With compiler optimization enabled:
- For 1GB data, total 60s:
  - 30s to split (read 1GB, write 1GB across 1000 shards)
  - 10s to read the 1000 shards
  - 20s: hash_map operations
- For 10GB data, total 720s:
  - 340s to split
  - 130s to read shards
  - 250s: hash_map operations
- Faster file writes via explicit write buffering: failed, probably because the OS already buffers writes.
- Faster file reads by calling the read syscall directly instead of fscanf: failed, because the performance is similar (reading a 1GB file takes about 10s either way).
Concurrent hash map: libcuckoo is used to improve hash table performance:
- For 1GB of URLs, total 49s:
  - 30s to split
  - 10s to read the 1000 shards
  - 8s: hash_map operations
- For 10GB of URLs, total 505s:
  - 330s to split
  - 125s to read shards
  - 50s: hash_map operations, a 5x speedup (8 threads used)
The output of my solution on the 10GB workload is as follows:
When the concurrent hash_map is used:
When the built-in unordered_map is used:
The program would take about 2 hours to run on the 100GB workload with a 1GB memory limit, so I try the 1GB and 10GB workloads with 10MB and 100MB memory limits respectively. The workload is generated as follows:
- Define MIN_URL_LEN and MAX_URL_LEN
- Use RAND() to generate the number of rows according to the average URL length
- Use RAND() to generate the URL of each row, made up of characters from 'a' to 'z'
Dense workload: the length of each URL ranges over [4, 8), so the same URL can occur many times.
Sparse workload: the length of each URL ranges over [60, 100), so nearly all URLs occur only once. A minimal generator sketch is shown after this description.
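A minimal sketch of that generator logic (the real code in solve_topk may differ in details such as the seed, the exact row count, and output buffering):

```cpp
#include <cstdint>
#include <fstream>
#include <random>
#include <string>

// Generate random lowercase "URLs" whose lengths lie in [min_len, max_len).
// Short lengths (e.g. [4, 8)) give a dense workload with many repeats;
// long lengths (e.g. [60, 100)) give a sparse one where URLs rarely repeat.
void generate_workload(const std::string& path, uint64_t target_bytes,
                       int min_len, int max_len) {
    std::mt19937_64 rng(42);  // fixed seed here just for reproducibility
    std::uniform_int_distribution<int> len_dist(min_len, max_len - 1);
    std::uniform_int_distribution<int> char_dist('a', 'z');

    // Derive the row count from the target size and the average URL length.
    const uint64_t avg_row = (min_len + max_len) / 2 + 1;  // +1 for '\n'
    const uint64_t rows = target_bytes / avg_row;

    std::ofstream out(path);
    for (uint64_t r = 0; r < rows; ++r) {
        std::string url(len_dist(rng), ' ');
        for (char& c : url) c = static_cast<char>(char_dist(rng));
        out << url << '\n';
    }
}

// Example: roughly 1GB of the dense workload.
// generate_workload("urls.txt", 1ull << 30, 4, 8);
```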
- linux perf
- gperftools
- 8 cores
- 16GB memory
- 140MB/s HDD read bandwidth
- Ubuntu 18.04 Desktop (Linux 4.15.0)
- ext4