A tutorial on generating pseudo-random text with Markov chains in Python

Source: Internet
Author: User

First, look at the definition from Wolfram:

A Markov chain is a collection of random variables {X_t} (where the index t runs through 0, 1, ...) such that, given the present state, the future is conditionally independent of the past.

Wikipedia's definition is a bit clearer:

... a Markov chain is a stochastic process with the Markov property. [This means that] state changes are probabilistic, and the future state depends only on the current state.

Markov chains have many uses; let's now look at how to use one to produce seemingly presentable nonsense.

The algorithm is as follows:

    1. Find a corpus of text; the corpus is what the next transition is selected from.
    2. Start with two consecutive words from the text; the last two words constitute the current state.
    3. Generating the next word is the Markov transition. To generate the next word, look through the corpus for all words that follow these two words, and pick one of them at random.
    4. Repeat step 3 until the generated text reaches the desired size.


The code is as follows

import random


class Markov(object):

    def __init__(self, open_file):
        self.cache = {}
        self.open_file = open_file
        self.words = self.file_to_words()
        self.word_size = len(self.words)
        self.database()

    def file_to_words(self):
        self.open_file.seek(0)
        data = self.open_file.read()
        words = data.split()
        return words

    def triples(self):
        """Generates triples from the given data string. So if our string
        were "What a lovely day", we'd generate (What, a, lovely) and then
        (a, lovely, day)."""
        if len(self.words) < 3:
            return
        for i in range(len(self.words) - 2):
            yield (self.words[i], self.words[i + 1], self.words[i + 2])

    def database(self):
        for w1, w2, w3 in self.triples():
            key = (w1, w2)
            if key in self.cache:
                self.cache[key].append(w3)
            else:
                self.cache[key] = [w3]

    def generate_markov_text(self, size=25):
        seed = random.randint(0, self.word_size - 3)
        seed_word, next_word = self.words[seed], self.words[seed + 1]
        w1, w2 = seed_word, next_word
        gen_words = []
        for i in range(size):
            gen_words.append(w1)
            w1, w2 = w2, random.choice(self.cache[(w1, w2)])
        gen_words.append(w2)
        return ' '.join(gen_words)

To see an example result, we took Wodehouse's "My Man Jeeves" from Project Gutenberg as the text; a sample result follows.

In [1]: file_ = open('/home/shabda/jeeves.txt')

In [2]: import markovgen

In [3]: markov = markovgen.Markov(file_)

In [4]: markov.generate_markov_text()
Out[4]: 'Can and put a few years of your twin-brother who was apt to rally round a bit. I should strongly advocate the blue with milk'

(If you want to run this example, please download jeeves.txt and markovgen.py.)

How does this relate to the Markov algorithm?

    • The last two words make up the current state.
    • The next word depends only on the last two words, i.e. on the current state.
    • The next word is chosen at random from the corpus's statistical model.

Here is an example text:

"The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."

This text corresponds to the following corpus:

{('The', 'quick'): ['brown'],
 ('brown', 'fox'): ['jumps', 'who', 'who'],
 ('fox', 'jumps'): ['over'],
 ('fox', 'who'): ['is', 'is'],
 ('is', 'slow'): ['jumps'],
 ('jumps', 'over'): ['the', 'the'],
 ('over', 'the'): ['brown', 'brown'],
 ('quick', 'brown'): ['fox'],
 ('slow', 'jumps'): ['over'],
 ('the', 'brown'): ['fox', 'fox'],
 ('who', 'is'): ['slow', 'dead.']}

Now, if we start from the state ("brown", "fox"), the next word can be "jumps" or "who". If we choose "jumps", the current state becomes ("fox", "jumps"), the next word must then be "over", and so on.
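The corpus above can be rebuilt in a few lines; the snippet below is a minimal sketch of the same pair-to-successors mapping (variable names are just for illustration, not from the markovgen module):

```python
import random
from collections import defaultdict

text = ("The quick brown fox jumps over the brown fox who is slow "
        "jumps over the brown fox who is dead.")
words = text.split()

# Map each pair of consecutive words to the list of words seen after it.
cache = defaultdict(list)
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    cache[(w1, w2)].append(w3)

print(cache[('brown', 'fox')])   # -> ['jumps', 'who', 'who']

# One Markov transition from the state ('brown', 'fox'):
w1, w2 = 'brown', 'fox'
w1, w2 = w2, random.choice(cache[(w1, w2)])
print((w1, w2))                  # e.g. ('fox', 'jumps') or ('fox', 'who')
```

Note that "who" appears twice in the successor list for ('brown', 'fox'), so random.choice naturally picks it twice as often as "jumps": the duplicates are what encode the statistics.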

Tips

    • The larger the text we choose, the more transition choices there are at each step, and the better the generated text looks.
    • The state can be made to depend on one word, two words, or any number of words. As the number of words per state increases, the generated text becomes less random.
    • Do not strip out punctuation and the like. It makes the corpus more representative, and the random text looks better.
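The second tip can be sketched by parameterizing the state size; build_chain and generate below are hypothetical names, not part of the article's markovgen module:

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Map each state (a tuple of `order` consecutive words) to the
    list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(words, order=2, size=25):
    """Generate up to `size` words of pseudo-random text."""
    chain = build_chain(words, order)
    start = random.randint(0, len(words) - order - 1)
    state = tuple(words[start:start + order])
    out = list(state)
    while len(out) < size:
        followers = chain.get(state)
        if not followers:      # state only occurs at the very end of the text
            break
        out.append(random.choice(followers))
        state = tuple(out[-order:])
    return ' '.join(out)
```

With order=1 the output is close to word salad; with order=3 or more on a small corpus, the generator mostly reproduces the source verbatim, which illustrates the trade-off described in the tip.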
