The .lt corpus is an amateurish Lithuanian language text corpus generated from Common Crawl data. The corpus was made from web pages originating from the .lt top-level domain (TLD) that were collected as part of the September 2015 crawl.
Every page from the .lt TLD was fetched, extracted, checked for Lithuanian text, cleaned up, tokenized into sentences and words, and lastly put into this corpus.
.lt corpus consists of:
To generate this corpus, I’ve taken the following steps:
.lt TLD
MediaWords::Language::lt
Every line in the file is a JSON document containing a single extracted and tokenized web page, e.g. (pretty-printed here for readability):
{
    "text": "Įlinkusi fechtuotojo špaga. Blykčiodama gręžė apvalų arbūzą.",
    "sentences": [
        {
            "sentence": "Įlinkusi fechtuotojo špaga.",
            "words": [
                "įlinkusi",
                "fechtuotojo",
                "špaga"
            ]
        },
        {
            "sentence": "Blykčiodama gręžė apvalų arbūzą.",
            "words": [
                "blykčiodama",
                "gręžė",
                "apvalų",
                "arbūzą"
            ]
        }
    ],
    "url": "http://www.url.lt/the/document/was/fetched/from.html"
}
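As a rough illustration of how a record with this layout might be consumed, here is a minimal Python sketch that tallies word frequencies from the "sentences" / "words" fields. The function name `count_words` is made up for this example and is not part of the corpus tooling; the sketch only assumes the one-JSON-document-per-line layout described above.

```python
import json
from collections import Counter
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Tally words across corpus records (hypothetical helper).

    Each item in `lines` is expected to be one JSON document with a
    "sentences" list, where every sentence carries a "words" list.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        page = json.loads(line)
        for sentence in page.get("sentences", []):
            counts.update(sentence.get("words", []))
    return counts


if __name__ == "__main__":
    # A single record shaped like the example above, serialized on one line.
    sample = json.dumps(
        {
            "text": "Įlinkusi fechtuotojo špaga.",
            "sentences": [
                {
                    "sentence": "Įlinkusi fechtuotojo špaga.",
                    "words": ["įlinkusi", "fechtuotojo", "špaga"],
                }
            ],
            "url": "http://www.url.lt/the/document/was/fetched/from.html",
        },
        ensure_ascii=False,
    )
    print(count_words([sample]).most_common(3))
```

Because the words in each record are already lowercased and stripped of punctuation, no further normalization is attempted here.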
To uncompress the archives, use XZ Utils or 7-Zip.
commoncrawl-corpus-lt.xz: ~124.6 MB, expands to a single ~1 GB file.
commoncrawl-corpus-lt-html.tar.xz: ~721 MB, expands to ~6 GB in 92,740 files.
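If unpacking ~1 GB to disk is inconvenient, the text corpus can probably also be read straight from the compressed archive. The sketch below uses Python's standard `lzma` module to stream `commoncrawl-corpus-lt.xz` one record at a time. The helper name `iter_pages` is hypothetical, and UTF-8 encoding is an assumption on my part rather than something stated above.

```python
import json
import lzma
import os
from typing import Iterator


def iter_pages(path: str) -> Iterator[dict]:
    """Yield one parsed page per line of an .xz-compressed corpus file.

    Assumes UTF-8 text with one JSON document per line (hypothetical helper).
    """
    with lzma.open(path, "rt", encoding="utf-8") as corpus:
        for line in corpus:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    path = "commoncrawl-corpus-lt.xz"
    if os.path.exists(path):
        # Peek at the first record only; the file is decompressed lazily.
        for page in iter_pages(path):
            print(page["url"], len(page["sentences"]))
            break
```

Reading lazily like this keeps memory use roughly constant regardless of corpus size, at the cost of re-decompressing the file on every pass.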
I didn’t create any of the content in the corpus (except for a couple of my own websites). That’s all I know.
Feb 17, 2016.