ATD-MCL-Overseas: Overseas Travelogues from Arukikata Travelogue Dataset with Geographic Entity Mention, Coreference, and Link Annotation
Requirements: Python >= 3.8.0
- Obtain the Arukikata Travelogue Dataset (ATD) original data (`data.zip`) from the NII IDR site https://www.nii.ac.jp/dsc/idr/arukikata/.
- Decompress `data.zip` and then move the `data` directory under the `atd` directory (or create a symbolic link to the `data` directory in the `atd` directory).
- Execute `bin/gen_full_data_json.sh`.
  - The restored data will be placed at `atd-mcl-overseas/full/main/json_per_doc/`.
  - The data used for calculating inter-annotator agreement scores will be placed at `atd-mcl-overseas/full/agreement/`.
- Execute `bin/gen_full_data_tsv.sh`.
  - The restored data will be placed at `atd-mcl-overseas/full/main/link_tsv_per_doc` and `atd-mcl-overseas/full/main/mention_tsv_per_doc`.
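
After running both scripts, a quick check that the output directories exist can catch a misplaced `data` directory early. The following is a minimal sketch and not part of the repository; the helper name `missing_outputs` is hypothetical, and it assumes it is given the repository root.

```python
from pathlib import Path

# Output locations named in the steps above, relative to the repository root.
EXPECTED_DIRS = [
    "atd-mcl-overseas/full/main/json_per_doc",
    "atd-mcl-overseas/full/agreement",
    "atd-mcl-overseas/full/main/link_tsv_per_doc",
    "atd-mcl-overseas/full/main/mention_tsv_per_doc",
]


def missing_outputs(repo_root="."):
    """Return the expected output directories that do not exist yet."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_outputs(".")
    if missing:
        print("Missing output directories:")
        for d in missing:
            print("  " + d)
    else:
        print("All expected output directories are present.")
```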
Attribute | Number |
---|---|
Document | 78 |
Section | 1,309 |
Sentence | 4,318 |
Chars | 112,591 |
Mention | 5,116 |
Entity | 2,263 |
Entity w/ OSM link | 1,361 |
These statistics can be obtained by executing the following command: `python src/show_data_statistics.py -i atd-mcl-overseas/full/main/json_per_doc/`.
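
To show how such counts relate to the fields of the JSON data, here is a rough sketch that tallies them from loaded document objects. It is not the repository's `src/show_data_statistics.py`. Counting an entity as OSM-linked when `best_ref_type` equals `"OSM"`, counting characters as the summed length of sentence `text`, and the `*.json` file naming are all assumptions, so the result may not match the table exactly.

```python
import json
from pathlib import Path


def count_statistics(docs):
    """Tally corpus statistics from a {document_id: document_object} mapping."""
    stats = {"Document": 0, "Section": 0, "Sentence": 0, "Chars": 0,
             "Mention": 0, "Entity": 0, "Entity w/ OSM link": 0}
    for doc in docs.values():
        stats["Document"] += 1
        stats["Section"] += len(doc.get("sections", {}))
        sentences = doc.get("sentences", {})
        stats["Sentence"] += len(sentences)
        # Assumption: "Chars" is the total length of all sentence texts.
        stats["Chars"] += sum(len(s["text"]) for s in sentences.values())
        stats["Mention"] += len(doc.get("mentions", {}))
        entities = doc.get("entities", {})
        stats["Entity"] += len(entities)
        # Assumption: an entity is OSM-linked when best_ref_type is "OSM".
        stats["Entity w/ OSM link"] += sum(
            1 for e in entities.values() if e.get("best_ref_type") == "OSM")
    return stats


def load_docs(json_dir):
    """Merge every *.json file in json_dir into one mapping keyed by document ID.

    Assumption: each file holds an object of the form {document_id: {...}}.
    """
    docs = {}
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            docs.update(json.load(f))
    return docs
```

A call such as `count_statistics(load_docs("atd-mcl-overseas/full/main/json_per_doc/"))` would then return the tallies as a dictionary.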
The JSON data (`atd-mcl-overseas/full/main/json_per_doc`) holds full annotation information as follows.
- A document object value is associated with a key that represents the document ID (e.g., `00019`). Each document object has the sets of `sections`, `sentences`, `mentions`, and `entities`.

      {
        "00711": {
          "sections": {
            "001": { ... },
          },
          "sentences": {
            "001-01": { ... },
          },
          "mentions": {
            "M001": { ... },
          },
          "entities": {
            "E001": { ... }
          }
        }
      }

- A section object under `sections` is as follows:

      "sections": {
        "001": {
          "sentence_ids": [
            "001-01",
            "001-02",
            "001-03",
            "001-04",
            "001-05"
          ]
        },
        ...

- A sentence object under `sentences` is as follows:
  - A sentence object may have one or more geographic entity mentions.
  - Sentence IDs with a branch number (e.g., "026-01" and "026-02") indicate that a line of text in the original ATD data was split into those multiple sentences.

        "sentences": {
          "001-01": {
            "section_id": "001",
            "span_in_orig_text": [
              0,
              33
            ],
            "text": "パラオではオプショナルツアーに参加しないとほとんど観光できません。",
            "mention_ids": [
              "M001"
            ]
          },
          ...
          "006-06": {
            "section_id": "006",
            "span_in_orig_text": [
              168,
              173
            ],
            "text": "オススメ!",
            "mention_ids": []
          }
        },

- A mention object under `mentions` is as follows:
  - A mention object may be associated with an entity.

        "mentions": {
          "M001": {
            "sentence_id": "001-01",
            "span": [
              0,
              3
            ],
            "text": "パラオ",
            "entity_type": "LOC_NAME",
            "entity_id": "E001"
          },

- An entity object, which corresponds to a coreference cluster of one or more mentions, under `entities` is as follows:
  - An entity object is associated with one or more mentions.
  - `has_name` indicates whether at least one member mention's entity type is `*_NAME` or not.

        "entities": {
          "E001": {
            "original_entity_id": "E001",
            "normalized_name": "Republic of Palau;Palau",
            "entity_type_merged": "LOC",
            "has_name": true,
            "has_reference": true,
            "best_ref_type": "OSM",
            "best_ref_url": "https://www.openstreetmap.org/relation/571805",
            "best_ref_query": "Palau",
            "best_ref_area_type": "FOREIGN",
            "member_mention_ids": [
              "M001",
              "M012",
              "M018"
            ]
          },

The mention TSV data (`atd-mcl-overseas/full/main/mention_tsv_per_doc`) holds mention-related annotation information as follows.
- 1st column: document_id
- 2nd column: section_id:sentence_id
- 3rd column: Sentence `text`
- 4th column: Mention information with the following elements. Multiple mentions are enumerated with ";".
  - 1st element: mention_id
  - 2nd element: `span`
  - 3rd element: `entity_type`
  - 4th element: mention `text`
  - 5th element: `entity_id`
  - 6th element: `generic`
  - 7th element: `ref_spec_amb`
  - 8th element: `ref_hie_amb`
Example:
00711 002:002-01 日本で化粧品が発売されて有名になった、ミルキーウェイです。 M006,0:2,LOC_NAME,日本,E004,,,;M007,19:26,LOC_NAME,ミルキーウェイ,E005,,,
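
The column layout above can be read with a few string splits. The following is a minimal sketch under stated assumptions: columns are separated by tab characters, the elements of one mention are separated by "," as in the example, and mention text never contains "," or ";". The function name `parse_mention_line` is hypothetical.

```python
def parse_mention_line(line):
    """Parse one line of a mention TSV file into a dict."""
    # Assumption: the columns are tab-separated.
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (4 - len(fields))  # tolerate a missing 4th column
    doc_id, sec_sent, text, mention_field = fields[:4]
    section_id, sentence_id = sec_sent.split(":")

    mentions = []
    # Assumption: mention text never contains "," or ";".
    for item in filter(None, mention_field.split(";")):
        (mention_id, span, entity_type, mention_text,
         entity_id, generic, ref_spec_amb, ref_hie_amb) = item.split(",")
        begin, end = (int(x) for x in span.split(":"))
        mentions.append({
            "mention_id": mention_id,
            "span": (begin, end),
            "entity_type": entity_type,
            "text": mention_text,
            "entity_id": entity_id,
            "generic": generic,
            "ref_spec_amb": ref_spec_amb,
            "ref_hie_amb": ref_hie_amb,
        })
    return {
        "document_id": doc_id,
        "section_id": section_id,
        "sentence_id": sentence_id,
        "text": text,
        "mentions": mentions,
    }


# The example line above, written with explicit tab separators.
example = ("00711\t002:002-01\t"
           "日本で化粧品が発売されて有名になった、ミルキーウェイです。\t"
           "M006,0:2,LOC_NAME,日本,E004,,,;"
           "M007,19:26,LOC_NAME,ミルキーウェイ,E005,,,")
parsed = parse_mention_line(example)
for m in parsed["mentions"]:
    print(m["mention_id"], m["span"], m["text"], m["entity_id"])
```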
The link TSV data (`atd-mcl-overseas/full/main/link_tsv_per_doc`) holds link-related annotation information.
Specifically, entities and their member mentions (except for GENERIC and SPEC_AMB entities/mentions) are listed in TSV rows.
A row with a non-empty `entity_id` value corresponds to an entity, and a row with a non-empty `mention_id` value corresponds to a member mention of the preceding entity row.
Example:
#document_id | entity_id | mention_id | best_ref_type | best_ref_url | best_ref_query | best_ref_status | best_ref_area_type | second_A_ref_type | second_A_ref_url | second_A_ref_query | second_A_ref_status | second_A_ref_area_type | second_B_ref_type | second_B_ref_url | second_B_ref_query | second_B_ref_status | second_B_ref_area_type | entity_type | span | normalized_name | mention_text | ref_hie_amb | sentence_id | sentence_text |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
00711 | E001 | - | OSM | https://www.openstreetmap.org/relation/571805 | Palau | overseas | - | LOC | - | Republic of Palau;Palau | - | - | - | - | ||||||||||
00711 | - | E001:M001 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | LOC_NAME | 0:3 | - | パラオ | 001:001-01 | パラオではオプショナルツアーに参加しないとほとんど観光できません。 |
Notes:
- `mention_id` column values actually represent "entity_id:mention_id".
- `sentence_id` column values actually represent "section_id:sentence_id".
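
The entity-row and mention-row layout can be regrouped into one record per entity. The sketch below is illustrative only. It assumes the files are tab-separated, that the header line starts with `#` as in the example, and that an empty cell may appear either as an empty string or as `-` (the form shown in the rendered example). The name `group_link_rows` is hypothetical.

```python
import csv

# Assumption: either value may mark an empty cell.
EMPTY = {"", "-"}


def group_link_rows(lines):
    """Group mention rows under their entity, keyed by (document_id, entity_id)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#") for name in next(reader)]
    entities = {}
    for values in reader:
        row = dict(zip(header, values))
        doc_id = row["document_id"]
        if row["entity_id"] not in EMPTY:
            # Entity row: carries the link information (best_ref_url etc.).
            record = entities.setdefault(
                (doc_id, row["entity_id"]), {"entity": None, "mentions": []})
            record["entity"] = row
        elif row["mention_id"] not in EMPTY:
            # Mention row: the mention_id cell holds "entity_id:mention_id".
            entity_id, mention_id = row["mention_id"].split(":")
            row["mention_id"] = mention_id
            record = entities.setdefault(
                (doc_id, entity_id), {"entity": None, "mentions": []})
            record["mentions"].append(row)
    return entities


# Reduced example with only a few of the columns listed above.
sample = [
    "#document_id\tentity_id\tmention_id\tbest_ref_url\tmention_text\tsentence_id",
    "00711\tE001\t-\thttps://www.openstreetmap.org/relation/571805\t-\t-",
    "00711\t-\tE001:M001\t-\tパラオ\t001:001-01",
]
grouped = group_link_rows(sample)
for key, record in grouped.items():
    print(key, record["entity"]["best_ref_url"],
          [m["mention_text"] for m in record["mentions"]])
```

For a real file, an open file object (opened with `encoding="utf-8"` and `newline=""`) can be passed in place of `sample`.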
See `docs/data_specification`.
- Shohei Higashiyama <shohei.higashiyama [at] nict.go.jp>
This study was partly supported by JSPS KAKENHI Grant Number JP22H03648.
The annotation data was constructed by IR-Advanced Linguistic Technologies Inc.
This dataset is an extension of atd-mcl-overseas-alpha, with added entity link annotations.
TBA