
Hoover Inequality Score Data Retrieval and Calculation
Closed, Resolved · Public

Description

User Story: As a PM/UX person, I want to use an inequality score as part of assessing community health.

Based on the input of @GoranSMilovanovic, @guergana.tzatchkova and me, I suggest and document the following:

Data retrieval

To calculate the inequality of account edit contributions, we need to count edits for accounts. This means we need a way to select the accounts and a way to count edits.

Account selection: All accounts that have been active in a certain timeframe. I suggest two levels of granularity: Month and Year.

Counting edits: I suggest two ways to count the edits: edits done within that timeframe, and total edits ever done, measured at the end of that timeframe (e.g. an account with 5 edits in January and 1,000 edits overall contributes 5 to the first count and 1,000 to the second). This is equivalent to measuring income and wealth, respectively, in economics.

This gives us 4 tables:

                | Month | Year
Edits-over-time | edits | edits
Total-at-time   | edits | edits

These tables could either be just a list of edit counts per account ("long form") or be aggregated into a two-column table that shows how often each edit count occurred in the set: we counted X accounts to have an edit count of Y ("aggregated form").
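
For illustration, here is a minimal sketch of the two forms in Python (the data values are made up):

```python
from collections import Counter

# Long form: one edit count per active account (made-up values).
long_form = [1, 1, 3, 7, 3, 1]

# Aggregated form: "we counted X accounts to have an edit count of Y".
aggregated = Counter(long_form)

for edit_count, n_accounts in sorted(aggregated.items()):
    print(f"{n_accounts} accounts with {edit_count} edits")
```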

Calculate Hoover score

The score could be calculated with existing R packages. If they are not fast enough, we might need to consider optimization. The formula for Hoover is quite simple and vectorizable as far as I can tell, so I guess it might run just fine. Let's keep in mind that we only need to run the scoring monthly and yearly, so it is not a continuous load.
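
For reference, the Hoover index of a vector of per-account edit counts x is H = sum(|x_i - mean(x)|) / (2 * sum(x)), i.e. half the total absolute deviation from the mean, normalized by the total number of edits. A minimal vectorized sketch in Python/numpy (illustrative only, not the eventual production code):

```python
import numpy as np

def hoover_index(edits: np.ndarray) -> float:
    """Hoover index: half the summed absolute deviation from the mean,
    normalized by the total number of edits."""
    return float(np.abs(edits - edits.mean()).sum() / (2 * edits.sum()))

print(hoover_index(np.array([5, 5, 5, 5])))    # perfect equality -> 0.0
print(hoover_index(np.array([100, 0, 0, 0])))  # maximal for N=4 -> 0.75
```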

The resulting Hoover scores should be appended to tables of Hoover scores. Again, these will be 4 tables:

                | Monthly | Yearly
Edits-over-time | hoover  | hoover
Total-at-time   | hoover  | hoover

Each table has two columns: one for the point in time and one for the corresponding score at that moment. The way of measurement needs to be stated in the table's name, in some metadata, or in an extra "type" cell for each row (which would be quite redundant).

Presenting the scores

In the simplest case, we only provide the data and can see how it changes over time by importing it into Excel or the like. It would be even better if we had a tool that visualizes the data:

[Wireframe: image.png (381×411 px, 43 KB)]

(The wireframe has no timeframe selection, as the data will be small enough to just scroll back)

Event Timeline

@Jan_Dittrich Got it. I will get back to you if it turns out that I need more info.

@Jan_Dittrich @guergana.tzatchkova @WMDE-leszek

Everything is ready to put this on regular updates. There are some constraints related to the cost of the data engineering procedures. Let me share them with you:

  • Essentially, our ETL procedure extracts several frequency distributions of the number of edits from the wmf.mediawiki_history table in the Data Lake:
    • the distribution of the number of edits for the current snapshot (monthly) of the table,
    • the distribution of the number of edits for the current year (and that would be 2020 in the test case since we still do not have the January 2021 snapshot of the table ready), and
    • the distribution of the number of edits since the beginning of time up to and including the current snapshot.

All extracted distributions are represented as two-column tables, with the first column representing the number of edits made, and the second representing how many editors made that number of edits.
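
Concretely, such an extraction could look roughly like the following PySpark sketch. The column names follow the published wmf.mediawiki_history schema, but the exact filters (bot exclusion, namespace, wiki, the missing timeframe bound) are illustrative assumptions, not the production ETL:

```python
# Illustrative sketch; `spark` is an existing SparkSession on the
# analytics cluster. Not the production ETL.
snapshot = "2021-01"

edits_per_user = spark.sql(f"""
    SELECT event_user_id,
           COUNT(*) AS edit_count
    FROM wmf.mediawiki_history
    WHERE snapshot = '{snapshot}'
      AND event_entity = 'revision'
      AND event_type = 'create'
      AND wiki_db = 'wikidatawiki'
      AND page_namespace = 0
      AND SIZE(event_user_is_bot_by) = 0  -- exclude bots
      -- a bound on event_timestamp would restrict this to a single
      -- month or year instead of the snapshot's full history
    GROUP BY event_user_id
""")

# Aggregated two-column form: how many editors made each number of edits.
distribution = edits_per_user.groupBy("edit_count").count()
```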

We then compute the Hoover index from each table separately following some simple computations.
The Hoover index values, in the same order as the respective tables were introduced above, are: 0.7678569, 0.8724347, and 0.8672452.
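
The "simple computations" are not spelled out here; one standard way to obtain the Hoover index from such a two-column table is to compare each group's share of edits with its share of editors. A hedged sketch in Python/numpy (names and data are illustrative):

```python
import numpy as np

def hoover_from_distribution(edit_count: np.ndarray,
                             n_editors: np.ndarray) -> float:
    """Hoover index from the aggregated two-column form:
    n_editors[i] editors each made edit_count[i] edits."""
    edit_share = edit_count * n_editors / (edit_count * n_editors).sum()
    editor_share = n_editors / n_editors.sum()
    return float(np.abs(edit_share - editor_share).sum() / 2)

# Made-up example: 90 editors with 1 edit each, 10 editors with 500 each.
print(hoover_from_distribution(np.array([1, 500]), np.array([90, 10])))
```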

Note: Bots are filtered out, and the search is constrained to page_namespace = 0 for the test run only.

@Jan_Dittrich Let's talk about the questions and constraints now, please:

  • first, as you can see, I can derive three tables, and not four as you expected:

Counting edits: I suggest two ways to count the edits: edits done within that timeframe, and total edits ever done, measured at the end of that timeframe. This is equivalent to measuring income and wealth, respectively, in economics.

  • my approach was to use (1) the current snapshot of the table (and we get to observe one snapshot monthly) as your "edits done within that timeframe", then
  • to use (2) all available data from the beginning of time up to the current table snapshot as your "total edits ever done, measured at the end of this timeframe",
  • and (3) all available data from the currently completed year to adhere to "I suggest two levels of granularity: Month and Year"; but I am not sure I understand how the entries in the bottom row of your table (Total-at-time/Monthly and Total-at-time/Yearly) would be different.

The constraints:

  • the Apache Spark run to extract the data from the beginning of time is quite expensive: it uses a lot of cluster resources and takes some time to complete, but it is doable on a monthly basis;
  • we would have to use our analytics cluster for hours to extract per-year historical data, not to mention how much cluster time we would need to extract the data with a monthly resolution from the past.

I suggest that we start the data acquisition procedures as of this month, and then slowly accumulate the time-series during 2021.

Besides these technical difficulties, everything else that we need to serve the datasets and results can be solved in no time. The dashboard, if we want one, will be rather simple.

Thanks!

If I understand you correctly, you can/did calculate scores based on:

  • Edits from within a month ("income", monthly)
  • Edits from within a year ("income", yearly)

for the "since the beginning of time" ("wealth") however, you wonder why there needs to be a month/year distinction. This is a good point – the time-frame would always be "begin-now", no matter if one measures it yearly, monthly, every second… You are totally right on this one, we do not need two.

I do not know how costly running "beginning of time" is – I am pretty fine with running it every 3 months or so, too, to save some computing time and CO2.

I suggest that we start the data acquisition procedures as of this month, and then slowly accumulate the time-series during 2021.

Sounds good to me!

Besides these technical difficulties, everything else that we need to serve the datasets and results can be solved in no time. The dashboard, if we want one, will be rather simple.

I am fine with datasets, but if it is easy to do, a simple dashboard might increase the usefulness for people who want to have a brief look and/or are not much into R/Excel.

@Jan_Dittrich

I do not know how costly running "beginning of time" is – I am pretty fine with running it every 3 months or so, too, to save some computing time and CO2.

The planet would survive a monthly update. Many of our systems run similar, regular monthly updates, so this can be packed onto the same train.

I am fine with datasets, but if it is easy to do, a simple dashboard might increase the usefulness for people who want to have a brief look and/or are not much into R/Excel.

The dashboard for this can be produced in minimal time. It will be included in our Wikidata Analytics portal, under "Analytics", if you agree.

The dashboard for this can be produced in minimal time. It will be included in our Wikidata Analytics portal, under "Analytics", if you agree.

Would make sense to me; I guess the final decision is with @Lydia_Pintscher

Can we see it somewhere before adding it to the portal? In general it is probably fine, but a quick review for understandability etc. would be good.

@Lydia_Pintscher @Jan_Dittrich On the test server, of course - no WMF or WMDE domains are there.

Change 662093 had a related patch set uploaded (by GoranSMilovanovic; owner: GoranSMilovanovic):
[analytics/wmde/WD/WikidataAnalytics@master] T270109 init WD_Inequality

https://gerrit.wikimedia.org/r/662093

Change 662093 merged by GoranSMilovanovic:
[analytics/wmde/WD/WikidataAnalytics@master] T270109 init WD_Inequality

https://gerrit.wikimedia.org/r/662093

@Jan_Dittrich @Lydia_Pintscher

  • the computation of the Hoover inequality index will be run every time a new snapshot of wmf.mediawiki_history is detected,
  • using PySpark for ETL, orchestrated by a Python script that checks the snapshot, runs the ETL, and computes the index;
  • the data will be served as a .csv file from the public directory,
  • and the future dashboard will be client-side dependent and use the public dataset to visualize the results.
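
For illustration, here is a rough sketch of what such an orchestration script could look like; the snapshot check via SHOW PARTITIONS, the file paths, and the spark-submit job name are assumptions, not the production setup:

```python
import csv
import subprocess
from pathlib import Path

# Hypothetical location of the published results file.
RESULTS = Path("/srv/published/wd_inequality/hoover.csv")

def latest_snapshot() -> str:
    """Ask Hive which wmf.mediawiki_history snapshot is newest."""
    out = subprocess.run(
        ["hive", "-e", "SHOW PARTITIONS wmf.mediawiki_history;"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Partition lines look like "snapshot=2021-01".
    return sorted(line.split("=")[1] for line in out.split() if "=" in line)[-1]

def already_processed(snapshot: str) -> bool:
    if not RESULTS.exists():
        return False
    with RESULTS.open() as f:
        return any(row["snapshot"] == snapshot for row in csv.DictReader(f))

if __name__ == "__main__":
    snap = latest_snapshot()
    if not already_processed(snap):
        # Run the PySpark ETL, which appends the three Hoover rows
        # for this snapshot to the .csv file (hypothetical job name).
        subprocess.run(["spark-submit", "wd_inequality_etl.py", snap], check=True)
```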

I do not think that it makes sense to start working on the super-simple dashboard for this now, since we only have the results for the 2021-01 snapshot of the wmf.mediawiki_history table (which means that we have exactly three numbers to visualize). My suggestion is to wait for the next update (which is happening at some point in March 2021) and then serve the results on a dashboard.

Thanks!
Just to understand it correctly – what do the 3 values refer to? In particular, are they from the beginning of time until some point ("wealth") or for a specific timeframe ("income")?

@Jan_Dittrich

hoover              measurement                   snapshot
0.775146187351836   current_year                  2021-01
0.872708564923399   history_to_current_snapshot   2021-01
0.774913797649451   current_snapshot              2021-01

where the measurement column stands for what the Hoover value refers to: (1) current_year, (2) history_to_current_snapshot, which is "from beginning of time until some point", and finally (3) current_snapshot, which is really the current month. So, (1) and (3) are Hoover indices for specific timeframes ("income", as you say), while (2) is from the beginning of time ("wealth").

So, the next update will append three more rows to the file for the 2021-02 snapshot of the wmf.mediawiki_history table, as soon as that snapshot becomes available in March 2021.

@guergana.tzatchkova @Jan_Dittrich @WMDE-leszek

  • The data in your public directory have just received an update;
  • I will now start building a simple dashboard to present the results.

One variable I did not ask about: current_year – is this from the beginning of the calendar year to the current snapshot, or from the snapshot 12 months ago until the current snapshot?

  • If it is possible to give "human" labels to the variables, and if I understand them right it might be useful to do the following
    • current_year → over last 12 months
    • current_snapshot → over last month
    • history_to_current_snapshot → all time
  • For the diagram it would be great to have a reset button somewhere in case one creates a mess by zooming in or panning away from the graph. Hope that is just a config option.

@Jan_Dittrich

One variable I did not ask about: current_year – is this from the beginning of the calendar year to the current snapshot, or from the snapshot 12 months ago until the current snapshot?

It is the current calendar year, e.g. 2021, 2022, etc. – NOT "...from the snapshot 12 months ago until the current snapshot...". Also, the latter is not really doable, because our Data Engineering does not keep all the monthly snapshots that we would need for "...from the snapshot 12 months ago until the current snapshot...", see:

show partitions wmf.mediawiki_history;

results in:

snapshot=2020-08
snapshot=2020-09
snapshot=2020-10
snapshot=2020-11
snapshot=2020-12
snapshot=2021-01
snapshot=2021-02

The wmf.mediawiki_history table – a denormalized history of all MediaWiki edits – is really huge, which I guess is the reason why only a constrained set of past snapshots is preserved.

If it is possible to give "human" labels to the variables, and if I understand them right it might be useful to do the following

Of course, I am on it.

For the diagram it would be great to have a reset button somewhere in case one creates a mess by zooming in or panning away from the graph. Hope that is just a config option.

Plotly already provides for that: if a user zooms in on the diagram, an overlay message should pop up in the top right corner of the dashboard saying that double-clicking the diagram resets the zoom. Could you please try it out? It works for me in both Firefox and Chrome. Thank you.
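
For reference, this zoom-reset behaviour corresponds to a standard plotly.js config option, doubleClick. A minimal Python sketch with made-up values (the dashboard itself may wire this up differently):

```python
import plotly.express as px

# Made-up values, for illustration only.
fig = px.line(x=["2021-01", "2021-02", "2021-03"], y=[0.775, 0.78, 0.77],
              labels={"x": "snapshot", "y": "Hoover index"})
# 'doubleClick': 'reset' restores the initial axis ranges on double-click.
fig.show(config={"doubleClick": "reset"})
```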

Plotly already provides for that: if a user zooms in on the diagram, an overlay message should pop up …

Ahh, OK, works for me too. Not great UX, I think, but if it is built in that way let's keep it as it is.

It is the current calendar year

OK, that is equally fine; I was just unsure what it meant.

current_year → over last 12 months

It is the current calendar year

Then my label suggestion changes to "this year".

@Jan_Dittrich

Fixed:

current_year → this year
current_snapshot → over last month
history_to_current_snapshot → all time

see http://datakolektiv.org:3838/WD_Inequality/

@Lydia_Pintscher @Jan_Dittrich @WMDE-leszek

Please help me clear my backlog a bit: this analytics dashboard, for example, needs a review before I productionize it.
Thank you.

Looks fine to me!


\o/
Looking good. Let's close this.

@Lydia_Pintscher

Re-opened; the system is

(1) not in production state, and
(2) not yet included in Wikidata Analytics.

It should not take more than several hours to complete (1) and (2), but I need the ticket until then.