Author Archives: Markus Killer
Recent collection of text processing tools: http://textprocessing.org/
Extremely promising new Python NLP tool: spaCy (commercial open-source software):
Unfortunately, it is only able to deal with English input at the moment and installation on Windows seems to be tricky. The project is currently under intense development and it will be interesting to check the following links on a regular basis:
License: originally AGPLv3 (free for open-source projects), changed to the MIT License on 27 Sep 2015
Source: http://spacy.io/index.html#detailed-speed-comparison [accessed: 24/07/2015]
Recent overview of Python NLP resources on DataScienceCentral
Link to community-edited list on Pansop
Source: http://www.datasciencecentral.com/profiles/blogs/python-nlp-tools [accessed: 24/07/2015]
Keyboard Shortcuts for Novelists
Source: http://www.newyorker.com/books/page-turner/keyboard-shortcuts-for-novelists [accessed: 20/07/2015, spotted on the @NewYorker Twitter feed]
The bizarre world of instructional LPs
Source: http://www.bbc.com/news/uk-33464722 [accessed: 18/07/2015]
Programme description: «How to make an Archive on 4» available on BBC iPlayer
Ever wondered how to make an Archive on 4? Here’s your chance to find out!
Alan Dein enters the strange world of instructional records where you can teach yourself just about anything – from yodelling to training your budgie to talk.
It all started in 1901 when Polish émigré Jacques Roston harnessed the new technology of sound recording to teach foreign languages, signing up such luminaries as George Bernard Shaw and JRR Tolkien to lend their support.
By the 50s and 60s you could buy LPs on how to do just about anything – from keep fit to playing a musical instrument, relaxation and passing your driving test.
Perhaps the most surprising are those which help you to train your pet budgerigar to talk – with help from Sparkie, Britain’s favourite budgie, who supposedly had a vocabulary of over 500 words.
With help from Sparkie, Alan Dein tells the story of instructional records and, along the way, reveals a few of the secrets of how to make an Archive on 4.
Source: http://www.bbc.co.uk/programmes/b062dhgb [accessed: 18/07/2015]
A tour of the British Isles in accents
A dialect coach, Andrew Jack, gives a tour of the accents of the British Isles. (Release date: 20/02/2014; remix using Google Maps, 02/04/2014, by Philip Barker)
Source (audio): http://www.bbc.co.uk/programmes/p01slnp5 [accessed: 21/06/2015]
Source (remix): https://www.youtube.com/watch?v=-8mzWkuOxz8 [accessed: 21/06/2015]
Thin results – random sample (lines) from text file
When working with corpora it is sometimes useful to be able to generate random samples from corpus results for manual analysis (e.g. to determine distribution percentages or recall/precision of queries). BNCweb, CQPweb or (No)SketchEngine provide a thin function for this purpose. However, if the results of corpus queries are only available as text files, there is a random thinning option available as part of GNU coreutils. The examples below create a random sample of 100 lines (adapt sample size according to your project’s needs). The reliability of manually checked results can be improved by obtaining several samples of 100 lines (typically 2-3) and using averaged scores.
On Linux, there is a straightforward way to achieve this (type man shuf for details):
shuf -n 100 results.txt
In order to save the random sample into a new text file, specify an output file:
shuf -n 100 -o random_sample.txt results.txt
On Mac OS X, it is slightly more complicated, as a Linux-like package manager (e.g. Homebrew) and the coreutils package have to be installed first (see the gshuf Tutorial OSX and the corresponding random_sample.zip for novice users who are not familiar with the OS X terminal). Once the gshuf command is available, the invocation is analogous (type man gshuf for details):
gshuf -n 100 results.txt
In order to save the random sample into a new text file, specify an output file:
gshuf -n 100 -o random_sample.txt results.txt
On Windows, the following Python code snippet could be used to achieve a similar result (please let me know if there are any built-in options):
Source: http://metadatascience.com/2014/02/27/random-sampling-from-very-large-files [accessed: 31/05/2015]
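For orientation, here is a minimal sketch of how such a sample could be drawn with Python's standard library alone. It is not the code from the source above; it uses reservoir sampling, so the file is read once and never loaded into memory as a whole. The function names sample_lines and write_sample and the file names results.txt and random_sample.txt are assumptions chosen to mirror the shuf examples.

```python
import random


def sample_lines(path, k, seed=None, encoding="utf-8"):
    """Return up to k lines drawn uniformly at random from a text file.

    Reservoir sampling: the file is read once, line by line, so it also
    works for result files that are too large to fit into memory.
    """
    rng = random.Random(seed)  # pass a seed to make a sample reproducible
    reservoir = []
    with open(path, encoding=encoding) as handle:
        for index, line in enumerate(handle):
            line = line.rstrip("\n")
            if index < k:
                # Fill the reservoir with the first k lines.
                reservoir.append(line)
            else:
                # Keep the new line with probability k / (index + 1).
                slot = rng.randint(0, index)
                if slot < k:
                    reservoir[slot] = line
    return reservoir


def write_sample(source, target, k=100, seed=None, encoding="utf-8"):
    """Write a random sample of k lines from source to target."""
    lines = sample_lines(source, k, seed=seed, encoding=encoding)
    with open(target, "w", encoding=encoding) as out:
        for line in lines:
            out.write(line + "\n")
    return len(lines)
```

Calling write_sample("results.txt", "random_sample.txt", k=100) would roughly correspond to shuf -n 100 -o random_sample.txt results.txt. Unlike shuf, this sketch does not shuffle the order of the selected lines, and if the file has fewer than k lines it simply returns all of them.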
CQPweb tutorial (German)
Link to Noah Bubenhofer’s CQPweb Tutorial (German)
CQPweb (2016-04-05). Developer / Project Head: Andrew Hardie
CQPweb is a web-based graphical user interface (GUI) for some elements of the CWB – and in particular, the CQP query processor.
CQPweb is designed to replicate the user-interface of the popular BNCweb tool, which also (in its most recent versions) uses CQP as a back-end. Like BNCweb, CQPweb uses a database alongside the CWB to provide extra functions beyond those built into CWB/CQP. However, unlike BNCweb, CQPweb can be used with any corpus.
CQPweb is especially suitable for students, non-linguists, and others for whom a Unix-like command-line is a terrifying prospect. […]
[More] screenshots of CQPweb can be downloaded from this link.
Source: http://cwb.sourceforge.net/cqpweb.php [accessed: 14/01/2014]
New release: ParaVoz2
ParaVoz2 (2015-05-21). Developer / Project Head: Ruprecht von Waldenfels
The ParaVoz package provides a simple, yet effective interface for a parallel corpus using OpenCWB (http://cwb.sourceforge.net). It should work on any linux machine with only minimal changes in the settings files to reflect paths, and language codes. All settings are found in the settings directory.
ParaVoz 2.0 extends (but not replaces) ParaVoz 1.0 and is more intuitive, but probably less suited for corpus with a large number of languages; it is best used with a corpus of two or three language. In distinction to ParaVoz 1, with ParaVoz 2.0, the parallel corpus is encoded as a single corpus file for each language, rather than for each text in the corpus. ParaVoz 2.0 now supports both sentence and word alignment.
For ParaSol 2.0, see the demo at http://parasolcorpus.org/ParaVoz. For ParaSol 1.0, see the movie on the ParaSol website (http://parasolcorpus.org; movie at http://parasolcorpus.org/ParaSol_demo.mp4).
This web interface to CWB was initially written by Roland Meyer for use with the ParaSol corpus (then Regensburg Parallel Corpus) in 2006 and has since been in development by successive authors. The java script based functionality was mainly added by Andreas Zeman, XSLT-support in the new modular interface mainly by Ruprecht von Waldenfels, who has supervised the publication as open source. Part of the architecture is described in Waldenfels (2011). We thank the Center for the Study of Language and Society, University of Berne, (http://www.csls.unibe.ch) for granting financial support enabling the publication of ParaVoz as open source at this stage.
ParaVoz 2.0 was then developed during the work on a German-Polish parallel corpus supported by a grant of the Johannes Gutenberg University Mainz; mostly by Michal Wozniak, with valuable input from Jan Machalica and under supervision by Ruprecht von Waldenfels.
Source: https://bitbucket.org/rvwfels/paravoz2 [accessed: 21/05/2015]
Cite the tool as:
- Roland Meyer, Ruprecht von Waldenfels, Michal Wozniak, Andreas Zeman (2006-2015): ParaVoz – a simple web interface for querying parallel corpora. Second Version. Bern, Regensburg, Berlin, Krakow.
- Ruprecht von Waldenfels (2011): Recent Developments in ParaSol: Breadth for Depth and XSLT based web concordancing with CWB. In: Daniela Majchráková and Radovan Garabík (eds.): Natural Language Processing, Multilinguality. Proceedings of Slovko 2011, Bratislava: Tribun, 156-162. Available online.
> Date: Thu, 21 May 2015 14:41:13 +0200
> From: ruprecht.waldenfels _(at)_ gmx.net
> To: cwb _(at)_ sslmit.unibo.it
> Subject: [CWB] Interface for parallel corpora
> Dear colleagues,
> we would like to let you know that a new version of the ParaVoz corpus
> interface for parallel corpora hosted with CWB has been released.
> ParaVoz 2.0 has a user friendly interface, it features basic metadata
> management and supports word alignment.
> ParaVoz 2.0 extends (but not replaces) Paravoz 1.0; it is open-source
> and found here: https://bitbucket.org/rvwfels/paravoz2
> A demo version is found here: www.parasolcorpus.org/ParaVoz
> Ruprecht von Waldenfels
> Michał Woźniak
> Institute of Polish, Polish Academy of Sciences, Cracow
> CWB mailing list
> CWB _(at)_ sslmit.unibo.it