Author Archives: Markus Killer

Focus week: Datavisualisation & Corpus Linguistics

Class: I1b (KSH)

17 April 2018 – 20 April 2018

KSH-AW I1b April 2018 (in German)

Fünf in der Sprachwissenschaft häufig anzutreffende Visualisierungstypen: Liste, Karte, Partitur, Vektoren, Graph/Netz (gerichtet: Baum; ungerichtet: Netz) [Bubenhofer/Kupietz 2018: 46]

spaCy goes German

textprocessing.org

Recent collection of text processing tools: http://textprocessing.org/

spaCy

Extremely promising new Python NLP tool: spaCy (commercial open-source software):

Unfortunately, it is only able to deal with English input at the moment and installation on Windows seems to be tricky. The project is currently under intense development and it will be interesting to check the following links on a regular basis:

Link to github project

Link to documentation

License: ~~AGPLv3 (free for open-source projects)~~, changed to MIT License (27 Sep 2015)

Source: http://spacy.io/index.html#detailed-speed-comparison [accessed: 24/07/2015]

Keyboard Shortcuts for Novelists

Source: http://www.newyorker.com/books/page-turner/keyboard-shortcuts-for-novelists [accessed: 20/07/2015, spotted on @NewYorker twitter feed]

The bizarre world of instructional LPs

Instructional LPs – Relaxed English – Excerpts on BBC Radio 4

Source: http://www.bbc.com/news/uk-33464722 [accessed: 18/07/2015]

Programme description: «How to make an Archive on 4» available on BBC iPlayer

Ever wondered how to make an Archive on 4? Here’s your chance to find out!

Alan Dein enters the strange world of instructional records where you can teach yourself just about anything – from yodelling to training your budgie to talk.

It all started in 1901 when Polish émigré Jacques Roston harnessed the new technology of sound recording to teach foreign languages, signing up such luminaries as George Bernard Shaw and JRR Tolkien to lend their support.

By the 50s and 60s you could buy LPs on how to do just about anything – from keep fit to playing a musical instrument, relaxation and passing your driving test.

Perhaps the most surprising are those which help you to train your pet budgerigar to talk – with help from Sparkie, Britain’s favourite budgie, who supposedly had a vocabulary of over 500 words.

With help from Sparkie, Alan Dein tells the story of instructional records and, along the way, reveals a few of the secrets of how to make an Archive on 4.

Source: http://www.bbc.co.uk/programmes/b062dhgb [accessed: 18/07/2015]

A tour of the British Isles in accents

A dialect coach, Andrew Jack, gives a tour of the accents of the British Isles. (Release date: 20/02/2014, remix, using google maps 02/04/2014 by Philip Barker)

Source (audio): http://www.bbc.co.uk/programmes/p01slnp5 [accessed: 21/06/2015]
Source (remix): https://www.youtube.com/watch?v=-8mzWkuOxz8 [accessed: 21/06/2015]

Thin results – random sample (lines) from text file

When working with corpora it is sometimes useful to be able to generate random samples from corpus results for manual analysis (e.g. to determine distribution percentages or recall/precision of queries). BNCweb, CQPweb or (No)SketchEngine provide a thin function for this purpose. However, if the results of corpus queries are only available as text files, there is a random thinning option available as part of GNU coreutils. The examples below create a random sample of 100 lines (adapt sample size according to your project’s needs). The reliability of manually checked results can be improved by obtaining several samples of 100 lines (typically 2-3) and using averaged scores.

On Linux, there is a very easy straight-forward way to achieve this (type: man shuf for details):cd path_to_text_file shuf -n 100 results.txt

In order to save the random sample into a new text file, specify an output file:
shuf -n 100 -o random_sample.txt results.txt

On Mac OSX, it is slightly more complicated, as a Linux-like package manager (e.g. Homebrew) and the coreutils package have to be installed first (gshuf Tutorial OSX and corresponding random_sample.zip for novice users who are not familiar with OSX terminal). Once the gshuf command is available, the invocation is anologous (type: man gshuf for details):cd path_to_text_file gshuf -n 100 results.txt

In order to save the random sample into a new text file, specify an output file:
gshuf -n 100 -o random_sample.txt results.txt

On Windows, the following Python code snipped could be used to achieve a similar result (please let me know if there are any built-in options):

Random sampler (Python, Algorithm 1)

Source: http://metadatascience.com/2014/02/27/random-sampling-from-very-large-files [accessed: 31/05/2015]

Random sample from text file Mac OSX

Quick step-by-step guide:
Get a random sample of 100 lines per text file on Mac OSX:

Steps 1 to 4 only have to be followed once per computer. After that only steps 6 & 8 are needed.

Open Terminal window:
Install „Homebrew“ package manager (this allows you to install additional Unix/Linux programmes on your Mac). Copy and paste the following line into the Terminal window (all one line):

ruby -e “$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)”

Source: http://brew.sh/ (for further documentation)

If it asks you to install “Commandline developer tools”, say YES (might take a while).

Wait for installation to finish, press RETURN and enter your password (the one you use to log on to your Mac).
Type: brew install coreutils
Extract attached zip to your Desktop (make sure that the folder random_sample is visible on your Desktop and that there is a file called test.txt in it.
Go back to Terminal window and type: cd Desktop/random_sample
And now comes the actual shuffling bit: gshuf -n 2 test.txt

Instead of test.txt you can use your query results and instead of 2, you can enter the size of your sample.

If you want to save the sample into a new text file instead of just displaying it in the terminal window, type: gshuf -n 2 test.txt > random_sample1.txt and the results will be saved in the file txt in the same folder (feel free to adapt filenames and be aware of the fact that if you use one name twice the contents of the file with the same name will be overwritten).

Explanation of the different parts of the command:

shuffle command	sample size (display shuffled lines, up to the line number specified by -n switch)	name of file you want to shuffle (lines)	write output into file	name of output file
gshuf	-n SAMPLESIZE	test.txt	>	out.txt

An easy way to navigate to a particular folder: type cd [space] into the terminal window, drag&drop the folder you want to work in from your Finder into the Terminal and press RETURN/ENTER.

Other basic folder/directory navigation from Terminal window:

Source: http://www.cheatography.com/davechild/cheat-sheets/linux-command-line/

Example:

test.txt

gshuf -n 2 test.txt

line 1: Aarau

line 2: Basel

line 3: Bern

line 4: Luzern

line 5: Olten

line 6: St. Gallen

line 7: Zürich

Command for a sample of 100:

cd path_to_folder_with_file_you_want_to_shuffle

gshuf -n 100 results.txt > random_sample1.txt

langui.ch ⇨ /'læŋgwɪtʃ/ ⇨ /'læŋgwɪdʒ/

Collection of resources: corpus linguistics, e-learning, natural language processing, teaching English as a foreign language and tools related to these topics