Order from chaos – our pseudorandom sentence generator
My current project in Natural Language Processing is to make a “language model” that is capable of generating random sentences. When I say “random”, though, I don’t mean purely random, as in the program picking words completely willy-nilly, which would produce gibberish like “helped spring to ferocious city while gaps thereby progression”. Instead, the program scans some training data (in our case, hundreds and hundreds of newspaper articles) to build up a “model” of the English language: as it reads, it learns which words are more likely to come after which other words. Then, when it’s done reading, it generates sentences by first picking a word at random, noting which words are likely to follow it, and building up from there.
This is called an n-gram model, where n − 1 is the number of words before the target word that we take into consideration. For example, in a bigram model (n = 2), we only consider the word immediately before the target word. So in that last sentence, our program would note, “if I see the word only, then consider is likely to come after it, and if I see the word immediately, then before is likely to come after it”. If, instead, we use a trigram model (n = 3), then the program would note, “if I see immediately before, then the next word is likely to be the”, and so on for any value of n we wish. Higher values of n obviously produce more accurate results, but the returns diminish quickly, and each increase in n immensely increases the amount of work the program has to do.
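The post doesn’t include our actual code, but the idea is simple enough to sketch in a few lines of Python. This is a minimal bigram version (the `<s>`/`</s>` start- and end-of-sentence markers are my own convention here, not necessarily what our program used):

```python
import random
from collections import defaultdict

def train_bigrams(sentences):
    """For each word, count how often every other word follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def generate(counts):
    """Walk forward from the start marker, sampling each next word
    in proportion to how often it followed the previous one."""
    word, out = "<s>", []
    while True:
        followers = counts[word]
        word = random.choices(list(followers), weights=followers.values())[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)

# Toy training data -- the real program read hundreds of newspaper articles.
corpus = ["the deal with the united states", "the first partners will meet"]
model = train_bigrams(corpus)
print(generate(model))
```

With a two-sentence corpus the output is dull, but the same code fed a big pile of newspaper text produces sentences like the samples below.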
You might think that a bigram model wouldn’t produce very good results, since it considers only one word at a time, but with our large training set, the sentences our program generated were surprisingly good. Obviously, in terms of content the sentences are gibberish, but it looks like almost-decent English, like the kind you would get if you ran a foreign website through Google Translate:
the first western partners will meet with the deal with an increase of the percentage of the united states .
bill clinton replied : it means that relations between the joint operations at the netherlands , including the need to comment on building socialism with no money supply natural gas off the map official sources said that , visited the korea has not less profitable .
local economy and expectations of the normal country are embarking on his speech at home , and the zulus must punish those who dare ignite a means to mount putuo .
it would give power-seekers pause for an effort in contempt of others who receive party organizations in japan .
Not bad at all, I think! Note that the third sentence contains the phrase “… must punish those who dare ignite …”. That’s a six-word phrase of legitimate English, and the program only had to look at one word at a time to produce it. Correct prepositional phrases, verb infinitives, and subject-predicate construction are all noticeable in the sample sentences too.
It got even better when we upped our program to use a trigram model. Take a look at these sample sentences:
some sources doubted the possibility of falling oil prices and help retire the agencys charges without admitting or denying any wrongdoing .
despite the problems of our nice little tricks on countervail and what-have-you , we are off of us to watch them grow here without some help .
Apart from the wonky “and help retire” bit in the middle, that first sentence is perfect English! It has subject-predicate agreement, and a relatively complex phrase, “without admitting or denying any wrongdoing”.
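Moving from bigrams to trigrams is mostly a matter of keying the counts on a two-word context instead of a single word. Here is a hedged sketch of how that generalization might look for any n (again, the `<s>`/`</s>` markers and the function names are my own, not from our actual program):

```python
import random
from collections import defaultdict

def train_ngrams(sentences, n=3):
    """For each (n-1)-word context, count the words that follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Pad the front so even the first real word has a full context.
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return counts

def generate_ngram(counts, n=3):
    """Slide an (n-1)-word window forward, sampling each next word."""
    context, out = ("<s>",) * (n - 1), []
    while True:
        followers = counts[context]
        word = random.choices(list(followers), weights=followers.values())[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)
        context = context[1:] + (word,)
```

One practical note this makes visible: with a two-word key, far fewer contexts repeat in the training data, which is why higher n needs so much more text to work well.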
I also find it a little amusing that you can tell that our program was trained using newspaper articles: the sentences all have a very newspaper-like tone, and I’m not just talking about vocabulary.
The main point my professor is making in NLP this year is that researchers who are trying to get computers to understand and generate English (or any other language) have all but abandoned the “rules-based approach”, where we try to teach the computer the rules of the language. Human languages are complicated things with many exceptions and a great deal of subjective interpretation, and trying to get a deterministic machine to emulate them has proved nigh-impossible with our current technology.
Instead, the wave of the future is “probability-based approaches”, where we feed the computer huge samples of text that humans have already made, and the computer builds a model of the language that way. Probability-based approaches are both easier and far more successful, as our results demonstrate: in just two weeks, three undergraduates with little to no linguistics knowledge got a computer to spit out legitimate English, something that rules-based approaches couldn’t do after decades.
As an aside, this is one of the reasons Google is going to keep being successful in the years ahead. With access to vast quantities of data, they can teach a computer to do just about anything. Translation is a perfect example: pretty much everyone acknowledges that Google is way ahead of the competition in translating text between languages, because they can use their enormous datasets to do probability-based translation, whereas Babelfish et al. are stuck in the past with rules-based methods.
Filed under: school
Tags: google, language, programming, school