How to generate pseudo-random text based on the Markov chain in Python

Source: Internet
Author: User
This article describes how to generate pseudo-random text based on the Markov chain in Python. it is a small implementation based on the Markov algorithm and fully reflects the use of Python in scientific computing, for more information, see the definition of Wolfram.

A Markov chain is a set of random variables {X_t} (t runs through 0, 1,...). Given the current state, the future is independent of the past conditions.

Definition of Wikipedia is clearer.

... Markov chains are random processes with Markov properties... [This means] state changes are probabilistic, and future states only depend on the current state.

The Markov chain has a variety of uses. Now let me see how to use it to produce seemingly nonsense.

The algorithm is as follows,

  1. Find a text that serves as a corpus and use it to select the next conversion.
  2. Starting from two consecutive words in the text, the last two words constitute the current state.
  3. The process of generating the next word is Markov transformation. To generate the next word, first check the corpus and find the words that follow the two words. Select one of them randomly.
  4. Repeat 2 until the generated text reaches the required size.


The code is as follows:

import random class Markov(object):   def __init__(self, open_file):  self.cache = {}  self.open_file = open_file  self.words = self.file_to_words()  self.word_size = len(self.words)  self.database()      def file_to_words(self):  self.open_file.seek(0)  data = self.open_file.read()  words = data.split()  return words      def triples(self):  """ Generates triples from the given data string. So if our string were    "What a lovely day", we'd generate (What, a, lovely) and then    (a, lovely, day).  """     if len(self.words) < 3:   return     for i in range(len(self.words) - 2):   yield (self.words[i], self.words[i+1], self.words[i+2])     def database(self):  for w1, w2, w3 in self.triples():   key = (w1, w2)   if key in self.cache:    self.cache[key].append(w3)   else:    self.cache[key] = [w3]      def generate_markov_text(self, size=25):  seed = random.randint(0, self.word_size-3)  seed_word, next_word = self.words[seed], self.words[seed+1]  w1, w2 = seed_word, next_word  gen_words = []  for i in xrange(size):   gen_words.append(w1)   w1, w2 = w2, random.choice(self.cache[(w1, w2)])  gen_words.append(w2)  return ' '.join(gen_words)

To see an example, we took My man jeeves from the Furnet plan as the text. The Example result is as follows.

In [1]: file_ = open('/home/shabda/jeeves.txt') In [2]: import markovgen In [3]: markov = markovgen.Markov(file_) In [4]: markov.generate_markov_text()Out[4]: 'Can you put a few years of your twin-brother Alfred,who was apt to rally round a bit. I should strongly advocatethe blue with milk'

If you want to execute this example, download jeeves.txt and markovgen. py
What about Markov algorithms?

  • The last two words are in the current state.
  • The next word only depends on the last two words, that is, the current state.
  • The following words are randomly selected from the statistical model of the corpus.

This is an example text.

The code is as follows:

"The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead ."

The Corpus corresponding to this text is like this,

{('The', 'quick'): ['brown'], ('brown', 'fox'): ['jumps', 'who', 'who'], ('fox', 'jumps'): ['over'], ('fox', 'who'): ['is', 'is'], ('is', 'slow'): ['jumps'], ('jumps', 'over'): ['the', 'the'], ('over', 'the'): ['brown', 'brown'], ('quick', 'brown'): ['fox'], ('slow', 'jumps'): ['over'], ('the', 'brown'): ['fox', 'fox'], ('who', 'is'): ['slow', 'dead.']}

Now, if we start with "brown fox", the next word can be "jumps" or "who ". If we select jumps, the current status will change to "fox jumps", and the next word will be "over", and then we will push it like this.

Prompt

  • The larger the text we select, the more choices will be made for each conversion, and the generated text will look better.
  • The status can be set to dependent on one word, two words, or any number of words. As the number of words in each state increases, the generated text is not random.
  • Do not remove punctuation marks. They make the corpus more representative and the random text look better.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.