User talk:IagoQnsi


About this board

Previous discussion was archived at User talk:IagoQnsi/Archive 1 on 2020-08-18.

EncycloPetey (talkcontribs)

I'm curious about the reason for this edit. Shouldn't the reference refer to the Data Item for the reference, with full publication details, rather than link to a specific scan of that reference?

2600:1009:B10E:3B45:0:17:91E4:7401 (talkcontribs)

Not every document has its own Wikidata item. For example, I used this to reference one of the files in commons:Category:RIT News and Events, which are just weekly newsletters that the university put out internally. They are not significant enough to create a new item for every single issue. I at least put a "published in" statement in the reference to refer to RIT News and Events (Q110929999).

IagoQnsi (talkcontribs)

Ah dang it, forgot to login (I'm on mobile). The above message was from me.

EncycloPetey (talkcontribs)

I see what you mean, but I disagree that they are "not significant enough" to have a data item for each issue. If they were in a library, each would get full bibliographic information in the library database. --EncycloPetey (talk) 19:22, 11 December 2022 (UTC)

IagoQnsi (talkcontribs)

I just don't think there's much need to create items for some of these things, as it's unlikely they would be used much. It would also be a burden to have to create a new item every time one wants to cite a document. To me, this is no different from using the 'reference URL' property inside a reference; it's just that the URL happens to point to Commons instead of an external site.

EncycloPetey (talkcontribs)

Well, the burden of citing falls either on creating a data item, or else on each later use of that citation, meaning the person has to hunt out each bit of information from the scan for themselves. So the burden will fall either once, on a Wikidata editor, or multiple times: once for each person who wants to use that reference. --EncycloPetey (talk) 22:12, 11 December 2022 (UTC)

IagoQnsi (talkcontribs)

If a document is likely to be cited many times, it definitely makes sense to create an item for it so it's reusable. But some documents are realistically only going to be cited once or twice ever, so it seems like it's probably easier to just put all the metadata directly in a reference instead of creating a new item.
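
For what it's worth, statements sourced this way stay queryable. Here is a rough sketch (untested; it assumes the reference carries a published in (P1433) statement pointing at RIT News and Events (Q110929999), as described above):

SELECT ?item ?property ?refURL WHERE {
  # find references that say "published in: RIT News and Events"
  ?ref pr:P1433 wd:Q110929999 .
  OPTIONAL { ?ref pr:P854 ?refURL . }   # reference URL, e.g. a Commons file page
  # walk back from the reference to the statement and the item it belongs to
  ?statement prov:wasDerivedFrom ?ref .
  ?item ?property ?statement .
}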

Reply to "as reference"
Migrant (talkcontribs)

Hello IagoQnsi. I see that you suggested a Wikidata property for this huge database: Wikidata:Property proposal/The Trading Card Database person ID. What do you think of https://www.laststicker.com/, a database of the same type of cards? Are there any IDs there to sort them out? I think it would be good to cross-check these two databases against each other, since together they would be more complete. The reason I ask is that last weekend I met the famous Supersub of late-1970s English football, David Fairclough (Q1174396), and looked him up in both databases.

Trading Card Database : https://www.tcdb.com/Person.cfm/pid/34358/col/1/yea/0/David-Fairclough

LastSticker: https://www.laststicker.com/search/?q=David+Fairclough

What do you think? Best regards, Migrant.

Reply to "Trading Card databases"

Call for participation in a task-based online experiment

Kholoudsaa (talkcontribs)

Dear IagoQnsi,

I hope you are doing well.

I am Kholoud, a researcher at King's College London, and I work on a project as part of my PhD research, in which I have developed a personalised recommender system that suggests Wikidata items for the editors based on their past edits. I am collaborating on this project with Elena Simperl and Miaojing Shi.

I am inviting you to a task-based study that will ask you to provide your judgments about the relevance of the items suggested by our system based on your previous edits.

Participation is completely voluntary, and your cooperation will enable us to evaluate the accuracy of the recommender system in suggesting relevant items to you. We will analyse the results anonymously, and they will be published at a research venue.

The study will start in late February 2022, and it should take no more than 30 minutes.

If you agree to participate in this study, please either contact me at [] or use this form https://docs.google.com/forms/d/e/1FAIpQLSees9WzFXR0Vl3mHLkZCaByeFHRrBy51kBca53euq9nt3XWog/viewform?usp=sf_link

I will contact you with the link to start the study.

For more information about the study, please read this post: https://www.wikidata.org/wiki/User:Kholoudsaa

In case you have further questions or require more information, don't hesitate to contact me at the email mentioned above.

Thank you for considering taking part in this research.

Regards

Reply to "Call for participation in a task-based online experiment"
OBender12 (talkcontribs)

Your change made every single player ID fail the constraint... I reverted, but it probably needs a new format string altogether, as "julio-cesar-x8022" is now an allowed value.

IagoQnsi (talkcontribs)

Ah my bad, I did a + when I should have done a *. I was trying to make IDs like the one you mentioned work. I'll go ahead and make the change with a *. Thanks for fixing my oops.
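
In case it helps anyone following along, here is the difference in a nutshell. These are made-up patterns purely to illustrate the quantifiers, not the property's actual format string:

SELECT ?id ?plusPattern ?starPattern WHERE {
  VALUES ?id { "julio-cesar-x8022" "zlatan" }
  # "+" requires at least one extra "-segment" after the first word...
  BIND(REGEX(?id, "^[a-z]+(-[a-z0-9]+)+$") AS ?plusPattern)   # "zlatan" fails this
  # ...while "*" makes the extra segments optional
  BIND(REGEX(?id, "^[a-z]+(-[a-z0-9]+)*$") AS ?starPattern)   # both IDs pass this
}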

Reply to "MLS player ID format"
Jura1 (talkcontribs)

Hi IagoQnsi

Thanks for creating that catalog in MxM. It helped interwiki prefix at Wikimedia (P6720) finally get closer to completion.

BTW, I removed the language prefixes, as they can refer to Wikipedia, but also specific Wikisource or Wiktionary languages.

Reply to "MxM 3788 for P6720"
Amadalvarez (talkcontribs)

Hi. I recently voted against the creation of some properties related to sports statistics. I proposed a more generic solution that would allow us to have the information without having to create one property for each situation.

You participated in the initial vote, so I want to share with you the details of my counter-proposal before I publish it.

It would be interesting to have your expert opinion / suggestions on whether to incorporate any changes before proposing it as a substitute property. The work page is: User:Amadalvarez/sports statistics property.

Please leave comments on its talk page. Thanks

cc:@ArthurPSmith

Reply to "Sports statistic properties"
Peteforsyth (talkcontribs)

Hi, I'm delighted to see you have added so many items for local newspapers. I've found several so far that are redundant with existing items (The Bend Bulletin, the Hood River Glacier, and the Klamath Falls Herald). Do you know of a good way to search for and resolve duplicates, or should I just continue searching for them manually?

Also, you may be interested in a campaign this connects to: News On Wiki. We are trying to create a few hundred new English Wikipedia articles about small newspapers by the end of February.

IagoQnsi (talkcontribs)

Hi Pete. Apologies for those duplicates; it just wasn't feasible for me to go through and dedupe all the ones that the automatching missed, as there were some 19,000 newspapers iirc. I don't know of a good way to find and merge duplicates. I'd be happy to give you my OpenRefine project files if you think you could do something with those. I suspect the duplicates aren't too numerous, as many of the papers I imported have been defunct for decades, and many of the ones that still exist did not already have items anyway. I figured editors would gradually stumble upon the duplicates over time and whittle them away.

Peteforsyth (talkcontribs)

Hi, I've delved into this in some more depth now. I resolved most of the duplicates in the state of Washington, I've dabbled in Oregon and Wisconsin, and I think I have a pretty good sense of where things stand. It seems to me that probably well over half of the ~19k items you imported were duplicates. There are two things going on; the first is pretty straightforward, the second has more nuance and room for interpretation.

First, there appear to have been three major imports of newspaper items: Sk!dbot in 2013, 99of9 in 2018, and items for which a Wikipedia article existed (rolling). A whole lot of your items were duplicates of existing items that had come about through these processes. (An example is the Capital Times. But I've merged yours, so you'll have to look closely.) I know that the 2018 import was based on the USNPL database (and a handful of other web databases). I have no idea what Sk!dbot used as a source.

Second, there are items that some might consider duplicates, while others wouldn't. Consider a case like this:

  • The (a) Weekly Guard changed its name to the (b) Guard.
  • The Guard and the (c) Herald (which had both daily and weekly editions at various times, and went through three different owners) merged.
  • The (d) Herald-Guard has continued.

Many newspaper databases (Chronicling America, Newspapers.com, etc.) consider a, b, c, and d four or more distinct items, and may or may not do a great job of expressing the relationships among the items. In WikiProject Periodicals, we discussed it in 2018, and concluded that we should generally consider a, b, c, and d one item, and attach all four names to it (as alternate labels). See the items merged into the Peninsula Daily News for an example of how your items relate to this principle.

Peteforsyth (talkcontribs)

Unfortunately, all this adds up to a setback for the News On Wiki campaign, which has relied on reasonably tidy and stable Wikidata data (and the in-progress PaceTrack software) to track our progress (in addition to having the improvement of relevant Wikidata content as a core component of our mission). There are two impacts:

  • Prior to your import, this query returned about 6,000 results. It was a little cumbersome, but it was possible to scroll and zoom around. Now, it returns about 25,000 results, and it's sluggish.
  • Prior to your import, the green items (indicating that there was a Wikipedia article) were pretty prominent. But now, the map looks mostly red, making it less useful as a visual indicator of how thoroughly English Wikipedia covers U.S. newspapers.

The second problem results in part from some stuff that predates your import, and maybe I can figure out a way to address it. If a city has 4 papers and one of them has a Wikipedia article, it would be better to have a green dot than a red dot (indicating that at least one newspaper in that city has a Wikipedia article). But unfortunately it goes the other way. I bet I can adjust the code to make that change, or maybe even find a more graceful way of handling it than that.
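
Roughly what I have in mind is to aggregate per city rather than per paper. Something like this sketch (not the actual campaign query; the class and country filters are my assumptions, and it may need to be narrowed further to stay under the query timeout):

SELECT ?place ?placeLabel (SUM(IF(?hasArticle, 1, 0)) AS ?papersWithArticle) (COUNT(?paper) AS ?papers) WHERE {
  ?paper (wdt:P31/wdt:P279*) wd:Q11032 ;   # newspapers...
         wdt:P291 ?place .                 # ...grouped by place of publication
  ?place wdt:P17 wd:Q30 .                  # limited to the United States
  # flag papers that have an English Wikipedia article
  OPTIONAL { ?article schema:about ?paper ;
                      schema:isPartOf <https://en.wikipedia.org/> . }
  BIND(BOUND(?article) AS ?hasArticle)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en".
                           ?place rdfs:label ?placeLabel . }
}
GROUP BY ?place ?placeLabel

A city could then be drawn green whenever ?papersWithArticle is at least 1, regardless of how many article-less papers it also has.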

Anyway, just wanted to give you an overview of what I'd learned. I don't know whether you discussed this import at WikiProject Periodicals (or a similar venue) prior to performing it, but if not, I'd urge you to do that in the future, to at least have a chance of detecting these kinds of issues ahead of time. I know it's a learning process for all of us, so please don't take that as anything but a suggestion on how to improve future imports.

If you do have thoughts on how to address any of the issues I brought up, I'd be very interested to hear.

IagoQnsi (talkcontribs)

Hi Pete, thanks for the detailed message and the time you've put into this. My apologies for the high duplicate rate -- I had expected it to be much lower. I think the core issue is really that it's just hard to de-dupe newspapers due to how they're named; many newspapers have similar or identical names, and newspapers are often known by several name variants. My goal wasn't really to import new newspapers so much as to import newspapers.com links -- it just worked out that I wasn't able to automatically match that many of the links to existing items.

I don't know that there's an easy solution to this situation. Perhaps we could have better tooling for identifying likely duplicates, but I think this is fundamentally a problem that requires lots of manual cleanup.

IagoQnsi (talkcontribs)

I also do wonder if the rate of duplicates you've found stands up across the entire dataset, as Newspapers.com's collection isn't evenly distributed across the country. They seem to have a particular emphasis in the middle of the country -- in Kansas, Nebraska, and Oklahoma. When I was working on the import, I found that a lot of these newspapers were very obscure; perhaps they existed for two years in a town that existed for ten years but has now been abandoned for a century. I actually had to create a surprising number of new items for the towns these newspapers existed in, as they had not yet made their way into Wikidata. This is why I went forward with the import despite the volume of new items -- it seemed likely to me that a majority of them were indeed completely new.

IagoQnsi (talkcontribs)

By the way, how are you finding all these duplicates? Do you just use that map? I'd be happy to help out in the de-duping process.

Matthias Winkelmann (talkcontribs)

I found a cool 400 duplicates with the most straightforward of queries: identical places of publication, identical labels, and no dates of inception/dissolution that would differentiate them:

SELECT DISTINCT ?item ?other ?itemIncept ?otherIncept ?itemPubLabel ?otherPubLabel ?np ?label ?otherLabel WHERE {
  ?item wdt:P7259 ?np.
  ?item rdfs:label ?label.
  FILTER(LANG(?label) = 'en').
  ?other rdfs:label ?label.
  ?other (wdt:P31/wdt:P279*) wd:Q11032 .
  FILTER(?other != ?item).
  OPTIONAL { ?item wdt:P291 ?itemPub.
             ?other wdt:P291 ?otherPub. }
  OPTIONAL { ?item wdt:P571 ?itemIncept.
             ?other wdt:P571 ?otherIncept. }
  FILTER(?itemPub = ?otherPub)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

It will probably not show much now that I've merged these. But you can find thousands more with a bit of creativity, such as dropping "The" from labels, etc. Example: Chadron Record (Q55667983) vs. The Chadron Recorder (Q100288438).
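
A variant along these lines catches the leading-"The" cases (rough sketch; it may need to be narrowed, e.g. to a single state, to stay within the query timeout):

SELECT DISTINCT ?item ?label ?other ?otherLabel WHERE {
  ?item wdt:P7259 [] ;                      # items from the Newspapers.com import
        wdt:P291 ?pub ;
        rdfs:label ?label .
  FILTER(LANG(?label) = 'en')
  ?other (wdt:P31/wdt:P279*) wd:Q11032 ;    # any other newspaper in the same place
         wdt:P291 ?pub ;
         rdfs:label ?otherLabel .
  FILTER(LANG(?otherLabel) = 'en')
  FILTER(?other != ?item)
  # compare labels with a leading "The " stripped from both sides
  FILTER(REPLACE(?label, "^The ", "") = REPLACE(?otherLabel, "^The ", ""))
}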

Checking for duplicates at that level is what I would consider the bare minimum level of care before creating thousands of new items.

Duplicate items are far worse than missing data, because they create an illusion of knowledge/truth, i.e. the consumer of such data will wrangle with "unknown unknowns" instead of "known unknowns". With that in mind, it's simply unacceptable to create tens of thousands of items when the work is shoddy enough to warrant a disclaimer in the description: "Created a new Item: adding Sports-Reference college basketball players (likely contains some duplicates)" (see here for 500 out of x usages of that: ).

About half of the duplicates I cleaned up were instances where you created both the original and the duplicate, meaning you didn't even deduplicate within your own data. "Simply sorting alphabetically would have made it easy to sort this out" is something I would usually say. But many of these cases have (had) consecutive IDs, meaning they were already sorted alphabetically. Did you just not care enough to quickly scroll through the data?

IagoQnsi (talkcontribs)

You're right, I should have caught those. I assumed that Newspapers.com would have done such basic de-duping, as their website presents each newspaper as being a distinct and complete entity. Clearly I was mistaken in this assumption. Mea culpa.

The batch of basketball players I tagged as "likely contains some duplicates" was the set of players for which OpenRefine had found a potential match, but with a very low level of confidence. I manually checked a number of these and found that the matches were completely wrong most of the time, but occasionally there was a correct match. To me the rate of duplicates seemed fairly low, and so I figured the data was worth having rather than leaving a gap in the SRCBB data.

Although I agree that I could and should have done more to clean up the Newspapers.com batch, I disagree that no data is better than data with duplicates. Duplicates are easily merged, and de-duping is a job that is excellently handled by the crowdsourcing of a wiki. It's very difficult for 1 person to solve 5000 potential duplicates, but it's very easy for 5000 people to stumble upon 1 duplicate each and de-dupe them.

Matthias Winkelmann (talkcontribs)

I just noticed you also added 16k+ aliases that are identical to the labels, which is a complete waste of resources, among other things (a query sketch for listing them is below). As to who should clean that (and everything else) up, I disagree with the idea that "it's very easy for 5000 people to stumble upon 1 duplicate each and de-dupe them", except in the sense that it is easier for you. I'll also cite this:

"Users doing the edits are responsible for fixing or undoing their changes if issues are found."

(from Help:QuickStatements#Best practices)
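
For the record, the label-identical aliases can be listed with something like this (rough sketch, scoped to the Newspapers.com batch via P7259):

SELECT ?item ?label WHERE {
  ?item wdt:P7259 [] ;        # items from the Newspapers.com import
        rdfs:label ?label ;
        skos:altLabel ?alias .
  FILTER(LANG(?label) = "en" && LANG(?alias) = "en")
  FILTER(STR(?label) = STR(?alias))   # alias repeats the label verbatim
}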

IagoQnsi (talkcontribs)

Certainly I agree that problems should be dealt with by the original editor. I should have deduped more and I take responsibility for that. I was talking about deduping of the type that can't be done automatically -- things that require manual inspection to determine that they are the same item. I had thought when I uploaded the dataset that the duplicates would be primarily of this type. I'm sorry that my data upload was lower quality than I had thought and intended it to be.

IagoQnsi (talkcontribs)

@Matthias Winkelmann Here's an example of the kinds of complex cases that I didn't want to systematically apply merges to. The town of Cain City, Kansas has three similarly named newspapers on Newspapers.com: Cain City News, The Cain City News, and Cain-City News. I initially merged all three of these into one item, but upon further investigation, I discovered that only the 2nd and 3rd were duplicates; the 1st entry was a different newspaper. Cain City News (Q100252116) had its Vol. 1 No. 1 issue in 1889, while The Cain City News (Q100252118) had its Vol. 1 No. 1 in 1882 (and its archives end in 1886). In merging newspapers just on the basis of having the same title and the same location, you bulldoze over these sorts of cases. This is why I was so hesitant to merge seeming duplicates -- they often aren't in fact duplicates.

Peteforsyth (talkcontribs)

Oh my, I thought I had replied here long ago, but must have failed to save.

First, I want to say: while it's true this issue has been pretty frustrating for our campaign and for our ability to access information about our subject matter and our progress, I understand that it came about through good-faith efforts. I do think there are some important lessons for next time (and I think it's worth some further discussion, here and maybe at a more general venue, to figure out exactly what those lessons are -- as they may not be 100% clear to anybody yet).

Specifically in response to the comment immediately above, about the Cain City News: Personally, I strongly disagree with the conclusion; I understand that such merges would be sub-optimal, but in the grand scheme of things, if the choice is:

  • Create tens of thousands of items, of which maybe 40% are duplicates, or
  • De-dupe, potentially merging some hundreds or even thousands of items that are not actually the same, and then create far fewer items

I think the second option is VASTLY superior. These are items that did not previously exist in Wikidata; to the extent they are considered important, they will be fixed up by human editors and/or automated processes over time.

Furthermore, with many smart minds on the problem, it might be possible to substantially reduce the number of false-positives in the de-duping process, so that most items like the Cain City News get caught and fixed ahead of time. (Maybe.)

Which is all to say, it's my understanding that best practice in these cases is to consult meaningfully with other Wikidata editors prior to importing thousands of items, and to allow some time for the discussion to unfold and ideas to emerge. I think this particular instance really underscores that need. Maybe others would agree with your assessment about items like Cain City, or maybe they would agree with me; but without asking the question, how do we know? We don't. It's important in a collaborative environment to assess consensus prior to taking large-scale actions that are difficult to undo.

Anyway, I think it would be a good idea to bring some of the points in this discussion up at WikiProject Periodicals or similar, and get some other perspectives that could inform future imports.

Reply to "Newspaper items"
Matthias Winkelmann (talkcontribs)

While resolving the duplicate that is The Dallas Morning News (Q889935) vs The Dallas Morning News (Q100292555) (an issue others have already brought up, as I see), I noticed your use of archives at (P485) on all these newspapers.

That is, at the very least, redundant with newspaper archive URL (P7213). "Newspaper" there is an actual word of the English language, and not just the parasitic website that is Newspapers.com. That should be obvious from the property's translations: even without knowing any of those languages, you will notice that none of them keep the English word "newspaper", which they would if it were a proper noun. The capitalisation, the absence of ".com", and the parallel existence of Newspapers.com paper ID (P7259) should be further clues, as should the property's discussion page and original proposal.

Further, archives at (P485) does not refer to an archive of the paper's issues, but rather to the paper's own archives. That is: consistent with its use for presidents or really anything else, it would refer to a collection of all the institution's artefacts.

IagoQnsi (talkcontribs)

Thanks for the explanation, and for cleaning up those bad statements. I'll be more conservative about adding properties like those in the future.
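
For my own follow-up, here is a quick way to double-check that no such statements are left on the batch (rough sketch):

SELECT ?paper ?archive WHERE {
  ?paper wdt:P7259 [] ;       # items from the Newspapers.com import
         wdt:P485 ?archive .  # any remaining "archives at" statements to review
}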

IagoQnsi (talkcontribs)

I've just run a batch of university sports clubs in which I accidentally added some items' entity IDs as aliases. This is obviously unhelpful, so I'm going to undo it. I'm creating this talk page note to serve as a to-do item, and so that anyone else who may discover the issue knows that I'm working on reverting it.
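
A rough sketch of the kind of query I'll use to find the affected items. The instance of (P31) = sports club (Q847017) scoping is only an assumption on my part; the real batch may need a different filter:

SELECT ?item ?alias WHERE {
  ?item wdt:P31 wd:Q847017 ;    # assumed scope: instances of sports club
        skos:altLabel ?alias .
  FILTER(LANG(?alias) = "en")
  # aliases that are just the item's own Q-id
  FILTER(STR(?alias) = STRAFTER(STR(?item), "/entity/"))
}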