Making a Markov Chain Twitter Bot in Python
The study of Markov Chains is an interesting topic that has many applications. Such techniques can be used to model the progression of diseases, the weather, or even board games. We are going to introduce and motivate the concept mathematically, and then build a “Markov bot” for Twitter in Python. The resulting bot is available on GitHub.
Introduction to Markov Chains
A Markov Chain models events whose next outcome depends only on the current state. That is to say, what the system does next depends only on where it currently is. In many ways, a Markov Chain can describe how I play most strategic games… Anyway, if our state at time \(t\) is represented by \(X_t\), then this is the same as writing \begin{equation} P(X_{t+1} = x | \text{where we are + where we’ve been}) = P(X_{t+1} = x | \text{where we are now}). \end{equation}
This can of course be written more formally, but it gets the point across. When you have this situation, you can arrange these probabilities into a matrix: the rows and columns are indexed by all the possible states, and the \((i, j)\) entry is the probability of moving from state \(i\) to state \(j\).
Let’s pretend our mood each day is governed by a Markov Chain. If we are happy, then there’s a 75% chance we’ll be happy the next day; otherwise we’re sad. If we’re sad, there’s a 10% chance we’ll become happy the next day, and a 90% chance we’ll stay the same. Putting this into a matrix we have
\[P = \begin{bmatrix} .75 & .25 \\ .1 & .9\\ \end{bmatrix}.\]If we start happy, we can represent that as the row vector \(s_0 = [1, 0]\). If we want to know how we’ll feel tomorrow, we can find the probability distribution like this:
\[s_1 = s_0P = [1,0] \begin{bmatrix} .75 & .25 \\ .1 & .9\\ \end{bmatrix} = [.75, .25],\]so we have a 75% chance to still be happy, and a 25% chance otherwise. After another day…
import numpy as np
s_1 = [.75,.25]
P = np.array([[.75,.25],[.1,.9]])
s_2 = np.dot(s_1, P)
print(s_2)
[ 0.5875 0.4125]
And in 100 days…
s_100 = s_1
for k in range(0, 99):
    s_100 = np.dot(s_100, P)
print(s_100)
[ 0.28571429 0.71428571]
And a thousand days…
s_1000 = s_100
for k in range(0, 900):
    s_1000 = np.dot(s_1000, P)
print(s_1000)
[ 0.28571429 0.71428571]
Wait, the distribution is the same? We’ve stumbled onto what is called a limiting distribution. For special transition matrices, these limiting distributions exist, which means no matter what state you start at, you’ll end up at the same distribution eventually. In this case, the interpretation is that this person will be sad about 71% (exactly 5/7) of the time.
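We can check the “no matter where you start” claim directly by starting from the sad state instead; the same distribution comes out:
s_sad = np.array([0, 1])  # Start sad instead of happy
print(np.dot(s_sad, np.linalg.matrix_power(P, 100)))
[ 0.28571429 0.71428571]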
Where does this distribution come from? It’s the eigenvector of \(P^T\) associated with the eigenvalue \(1\) scaled to sum to 1!
P_t = np.transpose(P)
D, S = np.linalg.eig(P_t)
# Pick the eigenvector whose eigenvalue is (numerically) 1
stationary = S[:, np.argmin(np.abs(D - 1))].real
stationary = stationary / np.sum(stationary)  # Scale to sum to 1
print(stationary)
[ 0.28571429 0.71428571]
Limiting distributions are related to stationary distributions. It turns out that as long as your transition matrix \(P\) satisfies a few properties, its largest eigenvalue will equal 1, the (left) eigenvector associated with 1 will have entries all of the same sign, and all other eigenvalues will be strictly smaller than 1 in absolute value. The mathematical underpinning of this guarantee is a special case of the Perron-Frobenius Theorem. We call this left eigenvector, scaled so its entries sum to 1, the stationary distribution: it satisfies \(\pi P = \pi\), so once the chain reaches it, it stays there. A limiting distribution demands more: it is a vector \(\pi\) such that for an arbitrary probability vector \(v\), \begin{equation} \lim_{n\to\infty} v P^n = \pi. \end{equation}
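As a quick sanity check, the stationary distribution of our mood chain is \(\pi = [2/7, 5/7]\), and multiplying it by \(P\) leaves it unchanged:
pi = np.array([2/7, 5/7])
print(np.dot(pi, P))  # Unchanged, so pi P = pi
[ 0.28571429 0.71428571]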
Now we will look at an example of this behavior with right eigenvectors to accustom ourselves to the idea of limiting distributions. Let \(M\) be a 2-by-2 matrix which has eigenvalues \(1, \epsilon\) with \(|\epsilon|<1\) and eigenvectors \(e_1, e_2\) which span \(\mathbb{R}^2\). This means any vector \(x\) in \(\mathbb{R}^2\) can be written \(x = c_1 e_1 + c_2 e_2\). Another way to put it is that the eigenvectors form a basis for the space. This isn’t always true, but assume we have a matrix where it is.
Then, look what happens when we start multiplying \(x\) by our matrix \(M\):
\begin{equation} M x = M( c_1 e_1 + c_2 e_2) = c_1 M e_1 + c_2 M e_2 = c_1 e_1 + c_2 \epsilon e_2 \end{equation} which follows since \(e_1\) and \(e_2\) are eigenvectors. Notice, we can keep doing this for the relation: \begin{equation} M^n x = M^n( c_1 e_1 + c_2 e_2) = c_1 M^n e_1 + c_2 M^n e_2 = c_1 (1)^n e_1 + c_2 (\epsilon)^n e_2 \end{equation}
As \(n\to \infty\), the term \(c_2 (\epsilon)^n e_2 \to 0\) and the first term is the only one left, so \begin{equation} \lim_{n\to \infty} M^n x = c_1 e_1 \end{equation} for all \(x\in \mathbb{R}^2\). For a transition matrix, the entries of a probability vector must sum to 1, which pins \(c_1\) to the same value for every starting distribution, and \(c_1 e_1\) is exactly the \(\pi\) from before. This gives us an idea of how limiting distributions can come about in the wild, but it is not the whole story!
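We can watch this decay numerically with our mood matrix, whose eigenvalues are \(1\) and \(0.65\); the component along the second eigenvector shrinks like \(0.65^n\):
x = np.array([0.9, 0.1])  # An arbitrary starting distribution
for n in [1, 5, 50]:
    print(np.dot(x, np.linalg.matrix_power(P, n)))
[ 0.685  0.315]
[ 0.35698928  0.64301072]
[ 0.28571429  0.71428571]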
It’s interesting to note that if a transition matrix \(P\) has a stationary distribution, that distribution is not necessarily a limiting distribution. However, in an idealized case like our matrix \(M\), the stationary distribution is the limiting distribution.
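The classic counterexample is the chain that always switches between its two states: \([.5, .5]\) is stationary for it, but the powers of the matrix oscillate forever, so no limiting distribution exists:
P_switch = np.array([[0., 1.], [1., 0.]])  # Always flip to the other state
print(np.dot([1, 0], np.linalg.matrix_power(P_switch, 100)))
print(np.dot([1, 0], np.linalg.matrix_power(P_switch, 101)))
print(np.dot([.5, .5], P_switch))  # Stationary: unchanged by P_switch
[ 1.  0.]
[ 0.  1.]
[ 0.5  0.5]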
Implicitly Building a Transition Matrix From Text
With the theory out of the way, let’s talk about how you can actually build such a matrix (implicitly). We’re going to use a line from “Smells Like Teen Spirit” by Nirvana as an example.
Hello, hello, hello, how low?
In the case of a text Markov Chain, a state is the word you’re currently at, and the transition matrix is determined by what comes after that word throughout the whole “corpus” you are using (in this case, this line of the song).
Think about what the transition matrix of this sentence would look like when you ignore punctuation and capitalization. If you are at the word “hello”, there is a 2/3 chance you’ll end up back at the word “hello” and a 1/3 chance you’ll end up at “how”. If you are at “how”, you have a 100% chance to get to “low”. For text, it’s usually easier to represent this as a dictionary in Python than to compute the full matrix:
corpus = 'hello hello hello how low'

def build_transition_matrix(corpus):
    corpus = corpus.split(' ')
    transitions = {}
    for k in range(0, len(corpus)):
        word = corpus[k]
        if k != len(corpus) - 1:  # Deal with the last word
            next_word = corpus[k + 1]
        else:
            next_word = corpus[0]  # Loop back to the beginning
        if word not in transitions:
            transitions[word] = []
        transitions[word].append(next_word)
    return transitions

print(build_transition_matrix(corpus))
{'low': ['hello'], 'hello': ['hello', 'hello', 'how'], 'how': ['low']}
Now, this isn’t quite a matrix, but we can use it in the same way.
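If you ever do want the explicit matrix, a small sketch (the helper name to_matrix is ours, purely for illustration) converts the dictionary into a row-stochastic NumPy array:
def to_matrix(transitions):
    # Hypothetical helper: count followers, then normalize each row
    states = sorted(transitions)
    index = {w: i for i, w in enumerate(states)}
    mat = np.zeros((len(states), len(states)))
    for word, next_words in transitions.items():
        for nw in next_words:
            mat[index[word], index[nw]] += 1
    return states, mat / mat.sum(axis=1, keepdims=True)

states, mat = to_matrix(build_transition_matrix(corpus))
print(states)
print(mat)
['hello', 'how', 'low']
[[ 0.66666667  0.33333333  0.        ]
 [ 0.          0.          1.        ]
 [ 1.          0.          0.        ]]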
Sampling from the Chain
Let’s say we want to “sample” a sentence from the corpus that is typical of its underlying distribution (via its transition matrix). We first choose a starting state (word), and then we keep applying the transition matrix (via the dictionary) until we have our final sentence. We are really using a Markov Chain Monte Carlo technique in the hope of sampling from the limiting distribution. So, if your corpus really has a limiting distribution, you end up with a realization of the chain, or a path, in which each word is approximately a sample from that limiting distribution.
For this experiment we will use the full song without repeating any parts:
corpus = '''
Load up on guns, bring your friends
It's fun to lose and to pretend
She's over bored and self assured
Oh no, I know a dirty word
Hello, hello, hello, how low?
Hello, hello, hello!
With the lights out, it's less dangerous
Here we are now, entertain us
I feel stupid and contagious
Here we are now, entertain us
A muchacho
An albino
A mosquito
My libido
Yeah, hey, yay
I'm worse at what I do best
And for this gift I feel blessed
Our little group has always been
And always will until the end
And I forget just why I taste
Oh yeah, I guess it makes me smile
I found it hard, it's hard to find
Oh well, whatever, never mind
'''
def sample_sentence(corpus, sentence_length, burn_in=1000):
    sentence = []
    transitions = build_transition_matrix(corpus)
    current_word = np.random.choice(corpus.split(' '), size=1)[0]
    # Run the chain for `burn_in` steps before recording any words,
    # in the hope of getting near the stationary distribution.
    for k in range(0, burn_in + sentence_length):
        # Sampling uniformly from the list reproduces the transition
        # probabilities, since repeated followers appear multiple times.
        current_word = np.random.choice(transitions[current_word], size=1)[0]
        if k >= burn_in:
            sentence.append(current_word)
    return ' '.join(sentence)

print(sample_sentence(corpus, 50, 1000))
print(sample_sentence(corpus, 50, 1000))
a dirty word
Hello, hello, hello, hello, hello, how low?
Hello, hello, how low?
Hello, hello, hello, hello, how low?
Hello, hello, hello!
With the lights out, it's hard to lose and self assured
Oh no, I taste
Oh yeah, I feel blessed
Our little group has always will until the lights out, it's hard to find
Oh well,
Normally, the output of these chains is fun to read in a nonsensical way. With a larger corpus, you begin to get better combinations.
longer_sample = sample_sentence(corpus, 100, 10000)
print(longer_sample)
just why I taste
Oh yeah, I guess it hard, it's hard to find
Oh well, whatever, never mind
Load up on guns, bring your friends
It's fun to pretend
She's over bored and self assured
Oh no, I feel blessed
Our little group has always will until the end
And I guess it hard, it's less dangerous
Here we are now, entertain us
A muchacho
An albino
A mosquito
My libido
Yeah, hey, yay
I'm worse at what I taste
Oh yeah, I feel blessed
Our little group has always been
And always been
And always will until the end
And I guess it makes me smile
I found it makes me smile
I found it hard, it's less dangerous
Here
Leaving in the newlines makes it more humorous because you get two words that “go” together in the song. We can actually expand our state space to pairs of words rather than single words to get better results, which is exactly what the bot below does. But who’s going to see our results? The answer… the world:
Assembling the Bot
Now that we have explored how to use a Markov Chain to generate (essentially) gibberish, we’re going to unleash that gibberish on the world with an automated Twitter bot. In what follows, we piece together the components necessary to have a bot. The resulting bot is available on GitHub.
To build a bot, there are really two steps:
- Sample from a Markov Chain
- Post to twitter
The bot will allow you to input a series of text files as corpora and use all of them to create your tweets. Having multiple documents complicates things a little when building the transition matrix, but it’s not too bad.
The Basics of the Bot
First, I have a class called Bot. Bot not only stores the transition dictionary, but it can also generate tweets and post them to Twitter via the Twitter_Api class shown later. The class’s __init__ gives you all of the options and creates some basic instance-level variables that will get filled out later.
class Bot:
    def __init__(self,
                 documents,
                 twitter_api,
                 max_document_length=1000,
                 burn_in=250,
                 tweets_per_hour=6):
        self._documents = documents
        self._twitter_api = twitter_api
        self._corpus = {}
        self._logger = logging.getLogger(__name__)
        self._max_sentence_length = max_document_length
        self._burn_in = burn_in
        # Calculate the sleep timer; 6 tweets per hour is one every 10 minutes
        self.sleep_timer = int(60 * 60 / tweets_per_hour)
The main loop of the bot is the following function. In an eternal loop, it samples a tweet, posts it, and waits before starting all over again, for all eternity.
def run(self):
    self._load_data()  # Load the corpora and build the transition dictionary
    while True:
        try:
            tweet = self._get_tweet()  # Sample a tweet from the chain
            self._twitter_api.tweet(tweet)  # Post it to Twitter
        except Exception as e:
            self._logger.error(e, exc_info=True)
            self._twitter_api.disconnect()
        time.sleep(self.sleep_timer)  # Wait before tweeting again
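The _get_tweet method isn’t shown in this post. As a rough sketch of what it has to do (an assumption about the bot’s behavior, not its actual code), it samples text from the chain and trims the result down to tweet length:
def _get_tweet(self):
    # Hypothetical sketch: sample text, then trim to Twitter's 140-character limit
    text = self._generate_text(size=self._max_sentence_length)
    return text[:140].strip()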
Cleaning Up the Output
Earlier, we sampled text but did nothing to clean it up or make it more readable. There’s a lot that can be done to make the output more coherent. First, you can leave in the punctuation so the text flows better on the page. Another trick, for a larger corpus, is to make each state two words rather than a single word. The output reads more naturally because each new word is chosen based on the two words before it.
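To make the two-word-state idea concrete before diving into the bot’s code, here is a minimal sketch on a single string (the helper name build_bigram_transitions is ours, purely for illustration):
def build_bigram_transitions(corpus):
    # Each state is a pair of adjacent words; the value list holds the
    # words observed immediately after that pair.
    words = corpus.split()
    transitions = {}
    for k in range(len(words) - 2):
        key = (words[k], words[k + 1])
        transitions.setdefault(key, []).append(words[k + 2])
    return transitions

print(build_bigram_transitions('hello hello hello how low'))
{('hello', 'hello'): ['hello', 'how'], ('hello', 'how'): ['low']}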
Transition Matrix with Two-Word States
In this next snippet of code, we have corpora stored in self._documents. We follow the advice of the previous paragraph and build a transition matrix (dictionary) whose keys are pairs of words. The code is slightly more complicated because a key can straddle a boundary, with one word at the end of one line or corpus and the next word at the start of another, but it’s not so bad.
def _load_data(self):
    next_key = None
    for doc in self._documents:
        with open(doc, "r") as f:
            for line in f.readlines():
                parsed, add = self._line_to_array(line)
                if add:
                    if next_key is None:
                        a = -2
                    else:
                        a = 0
                    for k in range(0, (len(parsed) + a)):
                        if next_key is not None:
                            key = next_key
                            next_key = (next_key[1], parsed[k])
                        else:
                            key = (parsed[k], parsed[k + 1])
                        self._add_to_corpus(parsed, key, k, next_key)  # You can imagine what this function does
                        if k == len(parsed) - 3 and next_key is None:
                            next_key = (parsed[k + 1], parsed[k + 2])
    self.last_key = next_key
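The helpers _line_to_array and _add_to_corpus aren’t shown. A minimal sketch of _add_to_corpus, under our reading of the loop above (so treat this as an assumption, not the bot’s actual code), appends the word that follows the two-word state:
def _add_to_corpus(self, parsed, key, k, next_key):
    # Hypothetical sketch. When a key is carried over (next_key is not None),
    # the word that follows it is parsed[k]; otherwise the key was just built
    # from parsed[k] and parsed[k + 1], so the follower is parsed[k + 2].
    follower = parsed[k] if next_key is not None else parsed[k + 2]
    self._corpus.setdefault(key, []).append(follower)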
Generating the Text in a Nicer Way
This part is more of an art than a science. We apply a few heuristics to make the text flow better. As before, we have a burn-in section to get ourselves further into the state space. Then we make sure our tweet begins at the first word of a sentence and add proper capitalization.
def _generate_text(self, size=10000):
    size += self._burn_in  # Pad with burn-in words; they are discarded below
    start_word = self._grab_random_two_words()
    text = [''] * size
    cap = [False] * size
    cap[0] = True
    text[0] = start_word[0]
    text[1] = start_word[1]
    punc = set(st.punctuation)  # st is Python's string module
    # Create sample
    i = 2
    while i < size:
        # Mark the next word for capitalization after sentence-ending punctuation
        if any(k in punc and k != ',' for k in text[i - 1]):
            cap[i] = True
        key = (text[i - 2], text[i - 1])
        if key == self.last_key:
            # Restart from a random state if the last key is chosen,
            # since it may have no recorded successors
            new_key = self._grab_random_two_words()
            text[i] = new_key[0]
            if i + 1 < size:
                text[i + 1] = new_key[1]
            key = new_key
            i += 2
        choice = np.random.choice(self._corpus[key])
        if i < size:
            text[i] = choice
        i += 1
    # Capitalize, and add a period to the final word if it needs one
    for k in range(0, size):
        if cap[k]:
            text[k] = text[k].capitalize()
        if k == size - 1:
            if not any(j in punc and j != '\'' for j in text[k]):
                text[k] = text[k] + '.'
    # Find the first period after the burn-in section and start the tweet there
    for first_period in range(self._burn_in, size):
        if '.' in text[first_period]:
            break
    return ' '.join(text[(first_period + 1):]).strip()
Posting to Twitter
The Twitter_Api class is a custom class that wraps the tweepy library. It takes authentication information and allows you to log in, tweet, and disconnect from Twitter.
class Twitter_Api():
    def __init__(self, consumer_key, consumer_secret, access_key, access_secret):
        self._logger = logging.getLogger(__name__)
        self._consumer_key = consumer_key
        self._consumer_secret = consumer_secret
        self._access_key = access_key
        self._access_secret = access_secret
        self._authorization = None
        if consumer_key is None:
            # No credentials given: run in test mode and only log the tweets
            self.tweet = lambda x: self._logger.info("Test tweet: " + x)
            self._login = lambda: self._logger.debug("Test login completed.")

    def _login(self):
        auth = tweepy.OAuthHandler(self._consumer_key, self._consumer_secret)
        auth.set_access_token(self._access_key, self._access_secret)
        self._authorization = auth

    def tweet(self, tweet):
        if self._authorization is None:
            self._login()
        api = tweepy.API(self._authorization)
        stat = api.update_status(tweet)
        self._logger.info("Tweeted: " + tweet)
        self._logger.info(stat)

    def disconnect(self):
        self._authorization = None
Using the Bot
We run the bot by calling the main function below, where we specify the corpora and the Twitter authentication information.
def main():
    # Configure the logger
    logger = logging.getLogger()
    handler = logging.StreamHandler()
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    # Get these from Twitter; read tweepy's docs
    consumer_key = None
    consumer_secret = None
    access_key = None
    access_secret = None
    documents = ["corpus_1.txt", "corpus_2.txt", "corpus_3.txt"]  # Specify the documents
    twitter_api = Twitter_Api(consumer_key, consumer_secret, access_key, access_secret)
    bot = Bot(documents, twitter_api)
    bot.run()
Then, you can sit back, pour a tasty beverage, and watch it post the tweets to the console and your Twitter.