Author Archives: Markus Killer


Extremely promising new Python NLP tool: spaCy (commercial open-source software):

Unfortunately, it is only able to deal with English input at the moment and installation on Windows seems to be tricky. The project is currently under intense development and it will be interesting to check the following links on a regular basis:

Link to github project

Link to documentation

License: originally AGPLv3 (free for open-source projects); changed to the MIT License on 27 Sep 2015

Source: [accessed: 24/07/2015]

The bizarre world of instructional LPs

Instructional LPs – Relaxed English – Excerpts on BBC Radio 4

Source: [accessed: 18/07/2015]

Programme description: «How to make an Archive on 4», available on BBC iPlayer

Ever wondered how to make an Archive on 4? Here’s your chance to find out!

Alan Dein enters the strange world of instructional records where you can teach yourself just about anything – from yodelling to training your budgie to talk.

It all started in 1901 when Polish émigré Jacques Roston harnessed the new technology of sound recording to teach foreign languages, signing up such luminaries as George Bernard Shaw and JRR Tolkien to lend their support.

By the 50s and 60s you could buy LPs on how to do just about anything – from keep fit to playing a musical instrument, relaxation and passing your driving test.

Perhaps the most surprising are those which help you to train your pet budgerigar to talk – with help from Sparkie, Britain’s favourite budgie, who supposedly had a vocabulary of over 500 words.

With help from Sparkie, Alan Dein tells the story of instructional records and, along the way, reveals a few of the secrets of how to make an Archive on 4.

Source: [accessed: 18/07/2015]



A tour of the British Isles in accents

A dialect coach, Andrew Jack, gives a tour of the accents of the British Isles. (Release date: 20/02/2014; remix using Google Maps, 02/04/2014, by Philip Barker)

Source (audio): [accessed: 21/06/2015]
Source (remix): [accessed: 21/06/2015]

Thin results – random sample (lines) from text file

When working with corpora it is sometimes useful to be able to generate random samples from corpus results for manual analysis (e.g. to determine distribution percentages or recall/precision of queries). BNCweb, CQPweb or (No)SketchEngine provide a thin function for this purpose. However, if the results of corpus queries are only available as text files, there is a random thinning option available as part of GNU coreutils. The examples below create a random sample of 100 lines (adapt sample size according to your project’s needs). The reliability of manually checked results can be improved by obtaining several samples of 100 lines (typically 2-3) and using averaged scores.

On Linux, there is a straightforward way to achieve this (type: man shuf for details):
cd path_to_text_file
shuf -n 100 results.txt

In order to save the random sample into a new text file, specify an output file:
shuf -n 100 -o random_sample.txt results.txt

On Mac OSX, it is slightly more complicated, as a Linux-like package manager (e.g. Homebrew) and the coreutils package have to be installed first (gshuf Tutorial OSX and corresponding for novice users who are not familiar with the OSX terminal). Once the gshuf command is available, the invocation is analogous (type: man gshuf for details):
cd path_to_text_file
gshuf -n 100 results.txt

In order to save the random sample into a new text file, specify an output file:
gshuf -n 100 -o random_sample.txt results.txt

On Windows, the following Python code snippet could be used to achieve a similar result (please let me know if there are any built-in options):

Random sampler (Python, Algorithm 1)
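
A minimal sketch of such a sampler, assuming Python 3 and UTF-8 encoded result files. It is not necessarily the algorithm referred to above: it simply reads all lines into memory and draws a sample without replacement using the standard library. The function names sample_lines and write_sample are hypothetical; the file names follow the shuf examples.

```python
import random


def sample_lines(in_path, k=100, seed=None):
    """Return up to k lines drawn at random (without replacement) from in_path."""
    rng = random.Random(seed)  # pass a seed to make the sample reproducible
    with open(in_path, encoding="utf-8") as handle:
        # Strip newlines so a final line without "\n" is handled like the rest.
        lines = [line.rstrip("\n") for line in handle]
    # If the file has fewer than k lines, return all of them in random order.
    return rng.sample(lines, min(k, len(lines)))


def write_sample(in_path, out_path, k=100, seed=None):
    """Write a random sample of lines from in_path to out_path; return its size."""
    sample = sample_lines(in_path, k, seed)
    with open(out_path, "w", encoding="utf-8") as handle:
        for line in sample:
            handle.write(line + "\n")
    return len(sample)
```

Calling write_sample("results.txt", "random_sample.txt", k=100) from the folder that contains the results file would then correspond roughly to the shuf invocation with -o. For result files too large to fit into memory, a single-pass method such as reservoir sampling would be a better choice.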

Source: [accessed: 31/05/2015]

CQPweb tutorial (German)

Noah Bubenhofer's CQPweb Tutorial (German)

Link to Noah Bubenhofer’s CQPweb Tutorial (German)


2016-04-05
Developer / Project Head: Andrew Hardie
Home web interface to cwb
stable: 3.0.16 dev: 3.2.1[r816] 26 Dec 2013 / 5 Apr 2016
Work web-based (also on localhost) open source
License: GNU GPLv2+
Other free
Programming Language(s): PHP, MySQL, Perl
Key features: SERVER INSTALLATION, MANAGE YOUR OWN CORPORA, WEB INTERFACE, CQP QUERIES
Website: CQPweb project page
Website: CQPWeb SVN Repository
Website: UCREL Lancaster Corpus Server (free access to a lot of corpus resources after registration, including the extended Brown-family of corpora)
Website: CQPweb at Beijing Foreign Studies University – Large Number of publicly accessible corpora (username: test, password: test)
Website: CQPweb Video Tutorials
Website: CQPwebInABox Video Tutorials

New release: ParaVoz2


2015-05-21
Developer / Project Head: Ruprecht von Waldenfels
Home Simple web interface for querying (cwb-indexed) parallel corpora.
git-commit: 29600cc 21 May 2015
Work Linux/OSX open source
License: GNU GPLv2+
Other free
Programming Language(s): PHP, XSLT
Key features: ONLINE PARALLEL CONCORDANCER, CQP-QUERY SUPPORT, SIMPLE INTERFACE FOR 2-3 LANGUAGES, SUPPORT FOR SENTENCE- AND WORD-ALIGNMENT, IMPROVED INSTALLATION AND PRE-PROCESSING INSTRUCTIONS
Website: (v2)
Website: Bitbucket Repository (v2)

Release announcement:

> Date: Thu, 21 May 2015 14:41:13 +0200
> From: ruprecht.waldenfels _(at)_
> To: cwb _(at)_
> Subject: [CWB] Interface for parallel corpora
> Dear colleagues,
> we would like to let you know that a new version of the ParaVoz corpus
> interface for parallel corpora hosted with CWB has been released.
> ParaVoz 2.0 has a user friendly interface, it features basic metadata
> management and supports word alignment.
> ParaVoz 2.0 extends (but not replaces) Paravoz 1.0; it is open-source
> and found here:
> A demo version is found here:
> Best,
> Ruprecht von Waldenfels
> Michał Woźniak
> Institute of Polish, Polish Academy of Sciences, Cracow
