SAMPLES: wordID + lexicon
You can purchase n-grams sets that contain all 1-, 2-, 3-, 4-, and 5-grams that
occur at least four times in the one-billion-word Corpus of
Contemporary American English. The sample files
available on this page include the first 50,000 entries for words beginning with the
letter [m].
When you purchase the data, you can choose either the
"word" format or the "wordID + lexicon" format. The [
wordID + lexicon ] format is more complicated, but it is also more powerful. Each word
is represented as an integer value, and the meaning of each integer is
found in the "lexicon" file (which gives the case-sensitive
word form, the lemma, and the part of speech).
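As a rough sketch of what this looks like in a relational database, the two files might be loaded into tables like the ones below. The table and column names here (lexicon, ngrams3, wordID1, etc.) are assumptions for illustration; the purchased files may use different names and types.

CREATE TABLE lexicon (
    wordID INTEGER PRIMARY KEY,
    word   VARCHAR(100),   -- case-sensitive word form
    lemma  VARCHAR(100),
    pos    VARCHAR(10)     -- part-of-speech tag
);

CREATE TABLE ngrams3 (
    freq    INTEGER,       -- frequency of the 3-gram in the corpus
    wordID1 INTEGER,       -- each word is an integer key into lexicon
    wordID2 INTEGER,
    wordID3 INTEGER
);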
Joining the n-grams and lexicon data. The
example below shows the first few entries (for word1 beginning with "m") of
the 3-grams file. The leftmost column in all of the n-grams tables is the
frequency of the n-gram. The other columns are the integer values for the words (two columns for
2-grams, three for 3-grams, etc).
freq  | wordID1 | wordID2 | wordID3
59319 | 276     | 5       | 3
47138 | 68      | 9222131 | 11
37069 | 446     | 5       | 3
35531 | 133     | 5       | 3
32657 | 68      | 2       | 7
Each number corresponds to an entry in the [lexicon] table. For example, the
three entries [276], [5], and [3] in the lexicon table are:
wordID | word (case sensitive) | lemma | part of speech (info)
3      | the                   | the   | at
5      | of                    | of    | io
276    | most                  | most  | dat
This means that the first entry in the 3-grams table above is for the string
[ most of the ], which occurs 59,319 times in the corpus.
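This lookup is an ordinary join. Using the hypothetical table and column names from the sketch above, a query like the following would reconstruct the word strings for the most frequent 3-grams:

SELECT n.freq, l1.word, l2.word, l3.word
FROM ngrams3 AS n
JOIN lexicon AS l1 ON l1.wordID = n.wordID1   -- resolve each integer
JOIN lexicon AS l2 ON l2.wordID = n.wordID2   -- back to its word form
JOIN lexicon AS l3 ON l3.wordID = n.wordID3
ORDER BY n.freq DESC
LIMIT 5;

Run against the sample rows above, the first result would be 59319 | most | of | the.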
Note that you would be responsible for creating the SQL statements to group by lemma, word, PoS,
etc., and to limit and sort the data. We assume a good knowledge of SQL, as well
as the ability to create the databases and tables from the CSV files.
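For example, grouping by lemma rather than word form is just a matter of selecting a different lexicon column in the join. The sketch below (again using the assumed lexicon/ngrams3 names) sums the frequencies of all 3-grams by the lemma of their first word:

SELECT l1.lemma, SUM(n.freq) AS total_freq
FROM ngrams3 AS n
JOIN lexicon AS l1 ON l1.wordID = n.wordID1
GROUP BY l1.lemma          -- collapse inflected forms into one lemma
ORDER BY total_freq DESC   -- most frequent lemmas first
LIMIT 20;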