Sketching words
Some months ago I came across Sketch Engine. Sketch Engine is a website that offers a collection of pre-loaded corpora in several languages and the ability to automatically extract collocation information from them among other things.
You can get a 30-day free trial account if you want to check it out. I thought it was really cool, but a bit pricey: 55 euros/year for an individual and 1080 euros/year for a site (up to 50 employees and students), and these are the academic licenses! And I don’t even have any real need for it.
So I thought it would be an interesting project to do something similar, albeit focusing only on the ‘word sketching’ part, as described in this paper.
After a weekend I got it working, although I didn’t spend a second on making it look good, as you can see: Collocations and other stuff.
For now only the corpus of the State of the Union addresses is loaded, with almost 400,000 words. You can select that corpus, click on sketch and get the sketch of any word, for example, the ‘word sketch’ for problem.
We can see that the adjective used most often with ‘problem’ is ‘serious’, although if we look at the relative frequency it’s ‘complex’. The verb that most often takes ‘problem’ as its object is ‘solve’, followed by ‘approach’, ‘address’ and ‘deal’. You can also click on the numbers to see the actual sentences in which these words appear, for example, ‘serious problem’.
So how does it work? First of all it does part of speech tagging using Apertium. Once the text is POS-tagged we apply a set of ‘regular expression’-like rules to identify the relation between words, such as:
*DUAL
=a_modifier/modifies
2:[tag=adj] [tag=n]{0,2} 1:[tag=n] [tag!=n]
This rule expresses the relation between adjectives and the nouns they modify, matching sentences like ‘the red ball’ and ‘the red football ball’. Each relation is stored in the database with extra info about its position in the text. Once the database is created, accessing it to display the sketch and concordance information is really simple.
The site and auxiliary tools were written in around 1400 lines of Python/Django. I am still not sure what to do with this. If anyone is interested in adding some corpora to it, continuing development or anything else, please let me know.
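To make the rule a bit more concrete, here is a minimal sketch of how the adjective/noun pattern above could be matched and counted. It is not the code running on the site: the function names (`adj_noun_pairs`, `build_relations`, `sketch`) are made up for illustration, the input is hand-tagged rather than real Apertium output, a plain list stands in for the database, and I am assuming that the end of a sentence counts as the closing non-noun.

```python
from collections import Counter


def adj_noun_pairs(tagged):
    """Yield (noun, adjective, adj_pos, noun_pos) for every match of
    2:[tag=adj] [tag=n]{0,2} 1:[tag=n] [tag!=n] in one tagged sentence.

    `tagged` is a list of (word, tag) pairs. Assumption: running off the
    end of the sentence is treated like meeting a non-noun token.
    """
    n = len(tagged)
    for i, (word, tag) in enumerate(tagged):
        if tag != "adj":
            continue
        # Walk over the maximal run of nouns right after the adjective.
        j = i + 1
        while j < n and tagged[j][1] == "n":
            j += 1
        run = j - (i + 1)
        # The rule allows 0-2 extra nouns plus the head noun, so 1 to 3 in
        # total. The run is maximal, so whatever follows is not a noun.
        if 1 <= run <= 3:
            head = j - 1
            yield (tagged[head][0], word, i, head)


def build_relations(sentences):
    """Collect every match with its position: a stand-in for the database."""
    relations = []
    for sent_idx, tagged in enumerate(sentences):
        for noun, adj, adj_pos, noun_pos in adj_noun_pairs(tagged):
            relations.append((noun, adj, sent_idx, adj_pos, noun_pos))
    return relations


def sketch(relations, noun):
    """Count the adjectives that modify `noun`."""
    return Counter(adj for n, adj, *_ in relations if n == noun)


if __name__ == "__main__":
    sentences = [
        [("a", "det"), ("serious", "adj"), ("problem", "n"), ("remains", "vblex")],
        [("a", "det"), ("complex", "adj"), ("tax", "n"), ("problem", "n")],
    ]
    print(sketch(build_relations(sentences), "problem").most_common())
```

Keeping the sentence index and token positions with each relation is what makes the concordance view cheap: to show the sentences behind a number in the sketch you only need to look up the stored positions.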
September 18th, 2008 at 2:40 pm
I worked as a student programmer on a project to make such corpus searches more user friendly. The stuff I worked on is still available at: http://corp.hum.sdu.dk/
The really cool thing about this is that if you choose the new interface and click refine, you get a nice visual interface for building complex cqp (corpus query language) queries in an intuitive way.
There are several large corpora, in many languages, on that site that are free to use.
September 19th, 2008 at 7:30 am
i always wanted to dig into text analysis. it seems like the obvious tool to figure out what i’ve been up to the last couple of weeks (i tend to forget). push all your websites and feeds through a parser and check if a pattern emerges (like when i try to code i tend to google for snippets all the time).
another goodie would be the option to group akonadi/akregator items according to semantic relevance: all posts that deal with plasma widgets, svn updates or the recent financial market crisis.
i’m not really sure how to approach this, but grabbing the significant words, building a vector out of them and then comparing the vectors seems like a nifty idea i picked up somewhere.
September 19th, 2008 at 8:28 am
Hi Adrian, you can see an example of how to do something like what you want in the book, Programming Collective Intelligence: http://oreilly.com/catalog/9780596529321/
Have a look at it if you have the chance.