.lt corpus

.lt corpus is an amateurish Lithuanian-language text corpus generated from Common Crawl data. The corpus was built from web pages under the .lt top-level domain that were collected as part of the September 2015 crawl.

Every page from the .lt TLD was fetched, extracted, checked for Lithuanian text, cleaned up, tokenized into sentences and words, and finally added to this corpus.

Basic stats

.lt corpus consists of:

Procedure

To generate this corpus, I’ve taken the following steps:

  1. Created a list of crawled pages coming from the .lt TLD
  2. Fetched said pages from Common Crawl’s archive
  3. Removed WARC and HTTP headers
  4. Filtered out empty documents and non-textual data
  5. Fixed common HTML typos
  6. Extracted the majority of pages with Readability
  7. Filtered out non-Lithuanian text using langdetect
  8. Stripped extra whitespace from the resulting text
  9. Tokenized text into sentences and words using MediaWords::Language::lt
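
Two of the simpler steps above (1 and 8) can be approximated in a few lines of Python. This is only a rough sketch, not the code that was used to build the corpus: the function names is_lt_url and strip_extra_whitespace are hypothetical, and the exact whitespace rules are an assumption.

```python
import re
from urllib.parse import urlsplit


def is_lt_url(url: str) -> bool:
    """Return True if the URL's host is under the .lt top-level domain (step 1)."""
    # urlsplit() only fills in .hostname when the URL has a scheme and host part.
    host = urlsplit(url).hostname or ""
    # Tolerate a trailing dot ("example.lt.") before checking the suffix.
    return host.rstrip(".").endswith(".lt")


def strip_extra_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines (step 8, assumed rules)."""
    lines = (
        re.sub(r"[ \t\u00a0]+", " ", line).strip()
        for line in text.splitlines()
    )
    return "\n".join(line for line in lines if line)
```

For example, is_lt_url("http://www.url.lt/page.html") is true, while a URL on another TLD whose path merely ends in ".lt" is rejected, because only the host name is inspected.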

Format

Every line in the file is a JSON document containing a single extracted and tokenized web page, e.g.:

{
    "text": "Įlinkusi fechtuotojo špaga. Blykčiodama gręžė apvalų arbūzą.",
    "sentences": [
        {
            "sentence": "Įlinkusi fechtuotojo špaga.",
            "words": [
                "įlinkusi",
                "fechtuotojo",
                "špaga"
            ]
        },
        {
            "sentence": "Blykčiodama gręžė apvalų arbūzą.",
            "words": [
                "blykčiodama",
                "gręžė",
                "apvalų",
                "arbūzą"
            ]
        }
    ],
    "url": "http://www.url.lt/the/document/was/fetched/from.html"
}
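
Because the corpus is one JSON document per line, it can be streamed without unpacking the whole archive first. The sketch below uses only Python's standard library; the helper names iter_pages and count_words are hypothetical, and it assumes the file is UTF-8 encoded with the keys shown in the example above.

```python
import json
import lzma
from typing import Iterator


def iter_pages(path: str) -> Iterator[dict]:
    """Yield one parsed page per non-empty line of an .xz-compressed corpus file."""
    # lzma.open() in text mode decompresses on the fly, so the full
    # uncompressed file never has to sit on disk or in memory.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_words(page: dict) -> int:
    """Count the tokenized words across all sentences of a single page."""
    return sum(len(sentence["words"]) for sentence in page["sentences"])
```

Typical use would be a loop such as "for page in iter_pages('commoncrawl-corpus-lt.xz'): ..." that reads page["url"], page["text"] and page["sentences"] for each document in turn.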

Download

To uncompress the archives, use XZ Utils or 7-Zip.

Corpus

commoncrawl-corpus-lt.xz; ~124.6 MB, expands to a single file of ~1 GB.

Raw HTML files

commoncrawl-corpus-lt-html.tar.xz; ~721 MB, expands to 92,740 files totalling ~6 GB.

I didn’t create any of the content in the corpus (except for a couple of my own websites). That’s all I know.

Last updated

Feb 17, 2016.