.lt corpus

The .lt corpus is an amateurish Lithuanian-language text corpus generated from Common Crawl data. It was built from web pages under the .lt top-level domain that were collected as part of the September 2015 crawl.

Every page from the .lt TLD was fetched, had its text extracted, was checked for Lithuanian text, cleaned up, tokenized into sentences and words, and finally added to this corpus.

Basic stats

The .lt corpus consists of:

72,066 documents (web pages)
3,026,827 sentences
37,443,433 words
272,440,746 characters
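
For anyone who wants to recompute these figures from the corpus file, a minimal sketch follows. The function name `corpus_stats` is hypothetical, and it assumes that characters are counted as the length of each document's "text" field and that words are the entries of the "words" lists; the published numbers may have been produced differently.

```python
import json

def corpus_stats(lines):
    """Count documents, sentences, words and characters in JSON lines."""
    stats = {"documents": 0, "sentences": 0, "words": 0, "characters": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        stats["documents"] += 1
        stats["sentences"] += len(doc["sentences"])
        stats["words"] += sum(len(s["words"]) for s in doc["sentences"])
        # Assumption: characters are counted over the "text" field.
        stats["characters"] += len(doc["text"])
    return stats
```

The function accepts any iterable of strings, so it can be fed an open file object for the decompressed corpus.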

Procedure

To generate this corpus, I took the following steps:

Created a list of crawled pages coming from the .lt TLD
Fetched said pages from Common Crawl’s archive
Removed WARC and HTTP headers
Filtered out empty documents and non-textual data
Fixed common HTML typos
Extracted the text of the majority of pages with Readability
Filtered out non-Lithuanian text using langdetect
Stripped extra whitespace from the resulting text
Tokenized text into sentences and words using MediaWords::Language::lt
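
Two of the simpler steps above, removing the WARC and HTTP headers and stripping extra whitespace, might look roughly like the sketch below. This is not the code that built the corpus: the function names are hypothetical, the header removal assumes a WARC response record in which each header block ends with an empty line, and collapsing every whitespace run to a single space is just one possible reading of "extra whitespace".

```python
import re

def strip_headers(record: bytes) -> bytes:
    """Drop the WARC header block and the HTTP header block from one record."""
    # Assumption: each header block is terminated by an empty line (CRLF CRLF).
    _warc_headers, _, rest = record.partition(b"\r\n\r\n")
    _http_headers, _, body = rest.partition(b"\r\n\r\n")
    return body

def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()
```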

Format

Every line in the file is a JSON document containing a single extracted and tokenized web page. For example (pretty-printed here; in the file each document occupies one line):

{
    "text": "Įlinkusi fechtuotojo špaga. Blykčiodama gręžė apvalų arbūzą.",
    "sentences": [
        {
            "sentence": "Įlinkusi fechtuotojo špaga.",
            "words": [
                "įlinkusi",
                "fechtuotojo",
                "špaga"
            ]
        },
        {
            "sentence": "Blykčiodama gręžė apvalų arbūzą.",
            "words": [
                "blykčiodama",
                "gręžė",
                "apvalų",
                "arbūzą"
            ]
        }
    ],
    "url": "http://www.url.lt/the/document/was/fetched/from.html"
}
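
Because each line is a separate JSON document, the corpus can be read one page at a time without decompressing it to disk first. Below is a minimal sketch using only Python's standard library; the name `iter_documents` is hypothetical and the file is assumed to be UTF-8 encoded.

```python
import json
import lzma

def iter_documents(path):
    """Yield one parsed document (a dict) per line of the .xz corpus file."""
    # Assumption: the decompressed file is UTF-8 text.
    with lzma.open(path, mode="rt", encoding="utf-8") as corpus:
        for line in corpus:
            if line.strip():  # ignore blank lines, if any
                yield json.loads(line)
```

Each yielded dict then has the "text", "sentences" and "url" keys shown in the example above, so a loop over iter_documents("commoncrawl-corpus-lt.xz") can, for instance, print doc["url"] for every page.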

Download

To uncompress the archives, use XZ Utils or 7-Zip.

Corpus

commoncrawl-corpus-lt.xz; ~124.6 MB, expands to a single ~1 GB file.

Raw HTML files

commoncrawl-corpus-lt-html.tar.xz; ~721 MB, expands to 92,740 files totalling ~6 GB.
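
If unpacking ~6 GB is inconvenient, the raw HTML archive can also be walked file by file straight from the .tar.xz. The sketch below uses Python's standard tarfile module; the name `iter_html_files` is hypothetical, and nothing is assumed about how files are named or laid out inside the archive.

```python
import tarfile

def iter_html_files(path):
    """Yield (member name, raw bytes) for each regular file in the archive."""
    # "r:xz" opens an xz-compressed tar archive for reading.
    with tarfile.open(path, mode="r:xz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            extracted = archive.extractfile(member)
            if extracted is not None:
                yield member.name, extracted.read()
```

The contents are yielded as bytes rather than text, since the character encoding of individual pages is not known in advance.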

Copyright & legal

I didn’t create any of the content in the corpus (except for a couple of my own websites). That’s all I know.

Last updated

Feb 17, 2016.