Word frequencies

Apr. 16, 2008 | 02:00 pm

Reading a foreign text and having to look up lots of words on every page is pretty exhausting. And if there are really a lot, I tend to forget the words rather quickly.
My strategies so far for avoiding endless flipping through the dictionary:

1: choose a scientific book, as the author will try to explain his point as clearly as possible, giving multiple examples and using rather formal language
2: select a topic that you are fairly familiar with, so you can at times make an educated guess about the meaning of an unknown word.

But even under these conditions progress is pretty slow. OK, I didn't choose a really easy text, but of course it has to hold my attention as well.

Enter the Jargonizer. It is a C# program I finished today, which basically does a histogram analysis of the text. It returns a file with two columns: the word and the number of times it occurs. I manually removed the words I know, and this gave me a list of the 200 most frequent unknown (to me) words in the text. This should speed up the reading.
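The histogram step can be sketched in a few lines. This is a rough Python illustration, not the Jargonizer's actual C# code; the function names, the way words are split (runs of letters, lowercased) and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs, ignoring case."""
    # [^\W\d_]+ matches runs of Unicode letters, so Cyrillic text works too.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)


def top_unknown(freqs, known_words, limit=200):
    """Return the most frequent (word, count) pairs that are not in known_words."""
    unknown = [(w, n) for w, n in freqs.most_common() if w not in known_words]
    return unknown[:limit]


def format_table(pairs):
    """Render the pairs as 'count word' lines, one word per line."""
    return "\n".join(f"{n} {w}" for w, n in pairs)


# Made-up sample sentence, only to show the calls.
sample = "И сказки, и сказка. Сказки о сказке!"
freqs = word_frequencies(sample)
print(format_table(top_unknown(freqs, known_words={"и", "о"})))
```

The manual removal of known words corresponds to the `known_words` set here; in practice that set would be built up by hand over several texts.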

Some actual data:

Book: Русская Сказка by В.Я. Пропп

Top 10 words:

4527 и
3803 в
1711 не
1133 на
1127 с
1114 сказки
1102 о
917 что
897 к
819 а

Top 10 unknown words:

119 изучения
113 совершенно
100 ред
75 значение
74 изучение
66 является
63 указатель
57 случаях
57 рке
56 происхождение


Comments {5}


(no subject)

from: pphi
date: Apr. 20, 2008 07:00 am (UTC)

ред is an abbreviation of редакция (spelling?)
рке is in my opinion an artefact of the OCR software used to produce the document. The printed original has уже in all these locations :-)

In the meantime I gathered a list of words sorted by frequency from http://www.comp.leeds.ac.uk/ssharoff/frqlist/frqlist-en.html

This should be helpful in filtering out the most common words. As this list also shows you the type of each word, filtering out most of the inflected forms should be possible with not too many false negatives.
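A minimal sketch of that filtering step, in Python. The line layout of the frequency list (rank, frequency, word, word type, separated by whitespace) is an assumption for this example and should be checked against the real file; the two list lines below are made-up sample data, and the function names are hypothetical.

```python
def load_common_words(lines, top_n=1000):
    """Collect the first top_n words from a frequency list.

    Assumed (unverified) line layout: rank, frequency, word, word type,
    separated by whitespace. Lines that do not fit are skipped.
    """
    common = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        common.add(fields[2].lower())
        if len(common) >= top_n:
            break
    return common


def drop_common(word_counts, common):
    """Keep only the (word, count) pairs whose word is not in the common set."""
    return [(w, n) for w, n in word_counts if w not in common]


# Made-up list lines in the assumed layout.
frequency_list = ["1 35801.8 и conj", "2 31374.2 в prep"]
common = load_common_words(frequency_list)
print(drop_common([("и", 4527), ("в", 3803), ("изучения", 119)], common))
```

This only removes forms that match a list entry exactly. Handling inflected forms, as suggested above, would need the word-type column plus some knowledge of the endings, which this sketch does not attempt.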

Finally, I discovered that the program Freelang (http://www.freelang.net) uses a word list format that is pretty easy to decode: record length=184, translation starts at position 31. So producing a reasonably complete word list for an etext appears to be feasible.

When the program is improved to that level, I shall certainly run some Russian classics through it and report on the results.
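Decoding such a fixed-length format could look roughly like this in Python. Only the two numbers (record length 184, translation at position 31) come from the description above; that position 31 is counted from zero, that the headword fills the bytes before it, that fields are padded with NUL bytes or spaces, and the text encoding are all assumptions that would need checking against a real Freelang file.

```python
RECORD_LENGTH = 184        # from the description above
TRANSLATION_OFFSET = 31    # assumed to be a zero-based byte offset


def parse_records(data, encoding="latin-1"):
    """Split raw bytes into (word, translation) pairs.

    The encoding and the padding characters are guesses; adjust them
    after inspecting an actual word list file.
    """
    entries = []
    # Step through the data one full record at a time; a trailing
    # partial record, if any, is ignored.
    for start in range(0, len(data) - RECORD_LENGTH + 1, RECORD_LENGTH):
        record = data[start:start + RECORD_LENGTH]
        word = record[:TRANSLATION_OFFSET].decode(encoding).strip("\x00 ")
        translation = record[TRANSLATION_OFFSET:].decode(encoding).strip("\x00 ")
        entries.append((word, translation))
    return entries


# Synthetic record built by hand, not taken from a real Freelang file.
fake = b"dom".ljust(TRANSLATION_OFFSET, b"\x00") + b"house".ljust(
    RECORD_LENGTH - TRANSLATION_OFFSET, b"\x00"
)
print(parse_records(fake))
```

With the entries loaded into a dictionary keyed on the word, each unknown word from the frequency table could be looked up to produce the word list for an etext.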
