One of my playground projects is this bot. Based on a generative grammar, it generates daily horoscope predictions (in Spanish) by combining different phrases randomly, or mostly randomly. Some of the generated sentences are quite long and complex, and can evoke a whole situation in the mind of the reader.
Here are two reasons to pick horoscope predictions as a topic:
- Most of the sentences use verbs in the future tense (predicting what is going to happen) or in the imperative (telling you to do this or to avoid that).
- At the same time, they are often ambiguous enough that readers have to imagine or visualize for themselves what the sentence actually means.
These two properties simplify the rules needed for creating a generative bot, so that we can scope down the theory / grammar part and focus on getting a first version that generates consistent text.
Parts of a sentence
Basically there are only 3 types of templates:
- Outer template: some kind of structure with gaps that can be filled with random parts of a sentence
- Noun phrase: a part of a sentence describing one or several objects or persons. These can work as a subject or as an object.
- Verb phrase: a part of a sentence that describes an action
You could combine noun phrases and verb phrases directly to create basic sentences, by picking one from each column:
|Noun phrase||Verb phrase|
|Someone||is not telling the truth|
|My horoscope bot||is going to ruin your day|
|Your neighbour||wants to hack your wifi|
|Terminator||does not like sweet potato ice cream|
|That girl you used to know||evolved into a crazy cat lady|
If you look closely, all these verb phrases contain noun phrases of their own: the truth, your day, your wifi, sweet potato ice cream, a crazy cat lady.
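Here is a minimal Python sketch of the "one from each column" idea, using the English examples from the table. The list names and the `basic_sentence` function are made up for illustration; the real bot works with Spanish phrases.

```python
import random

# Example phrases copied from the table above.
NOUN_PHRASES = [
    "Someone",
    "My horoscope bot",
    "Your neighbour",
    "Terminator",
    "That girl you used to know",
]
VERB_PHRASES = [
    "is not telling the truth",
    "is going to ruin your day",
    "wants to hack your wifi",
    "does not like sweet potato ice cream",
    "evolved into a crazy cat lady",
]


def basic_sentence(rng=random):
    """Pick one phrase from each column and join them into a sentence."""
    return f"{rng.choice(NOUN_PHRASES)} {rng.choice(VERB_PHRASES)}."


print(basic_sentence())
```

Passing a seeded `random.Random` instance as `rng` makes the output repeatable, which helps when checking sentences by hand.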
But these sentences are kind of boring, so this is where the outer templates come in to make the generated text flow better:
- [noun phrase] will tell you that [noun phrase] [verb phrase]
- Did you know that [noun phrase] [verb phrase]? Well, now you know!
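Filling the gaps of an outer template could look roughly like this in Python. It is only a sketch: the `[slot]` notation is taken from the examples above, while `PHRASES`, `OUTER_TEMPLATES` and `fill` are hypothetical names, and the phrase lists are shortened.

```python
import random
import re

OUTER_TEMPLATES = [
    "[noun phrase] will tell you that [noun phrase] [verb phrase]",
    "Did you know that [noun phrase] [verb phrase]? Well, now you know!",
]
# Phrases are stored in lowercase so they can appear mid-sentence.
PHRASES = {
    "noun phrase": ["someone", "your neighbour", "my horoscope bot"],
    "verb phrase": ["is not telling the truth", "wants to hack your wifi"],
}


def fill(template, rng=random):
    """Replace every [slot] with a random phrase of that type."""
    def pick(match):
        return rng.choice(PHRASES[match.group(1)])

    text = re.sub(r"\[([a-z ]+)\]", pick, template)
    # Capitalise only the first letter; str.capitalize() would lowercase the rest.
    return text[0].upper() + text[1:]


print(fill(random.choice(OUTER_TEMPLATES)))
```

Each slot is filled independently, so the same noun phrase can show up twice in one sentence. Whether that is acceptable is a design choice.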
Unfortunately this is not enough, since combining templates must not break the grammar rules of the language. Example:
- Your neighbour wants to hack my wifi
- You wants to hack my wifi
Can you tell which grammar rule is being broken in the second sentence? Hint…
Some constraints are set in order to make the generated paragraphs credible, or at least grammatically correct. As native speakers of a language, we are often mostly unaware of the grammar rules of our own language, but our brains are still watching these rules for us.
So your generator has to take care of this one way or another. One possibility is to restrict the templates so that they always stay on one side of the rule (example: use only singular subjects). Another is to write functions that rewrite certain templates to comply with the rules (example: if the subject is in the third person singular, append the -s suffix to a regular verb).
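The second option could be sketched like this for English. This is an illustration only: it handles regular present-tense verbs, the `conjugate` function and the `SUBJECTS` mapping are hypothetical, and the Spanish rules used by the bot are different.

```python
def conjugate(verb, third_person_singular):
    """Return the present-tense form of a regular English verb."""
    if not third_person_singular:
        return verb
    if verb.endswith(("s", "sh", "ch", "x", "z", "o")):
        return verb + "es"
    if len(verb) > 1 and verb.endswith("y") and verb[-2] not in "aeiou":
        return verb[:-1] + "ies"
    return verb + "s"


# Each subject carries the grammatical feature the rule depends on.
SUBJECTS = {
    "Your neighbour": True,  # third person singular
    "You": False,            # second person
}

for subject, third in SUBJECTS.items():
    print(f"{subject} {conjugate('want', third)} to hack my wifi")
```

The point is that every noun phrase needs to carry the features (person, number, and in many languages gender) that the verb rules depend on.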
I will not focus on the specific rules for the horoscope bot, since it generates text in Spanish.
Grammar rules are different for each language, so you will need to check resources about grammar or linguistics for the target language. And you will find yourself hoping that the rules you need to encode are not too irregular.
On a side note, I tried to test my generator after translating some of the templates into English, and the result, though hilarious and understandable, was not correct English at all. But it was fun to give it a try.
In addition to grammar constraints, you will need some semantic constraints. Take transitive verbs, for example (verbs that take an object, such as "need"). These should be very flexible when it comes to the objects they accept, since the objects are selected randomly by the generator. You might need a rubber chicken, but you cannot drink a rubber chicken. This makes the verb "to drink" unsuitable for verb templates that take a variable object.
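One simple way to respect this constraint without any semantic model is to keep picky verbs away from random objects. The sketch below assumes made-up verbs and objects, and `verb_phrase` is a hypothetical helper, not the bot's actual code.

```python
import random

# Verbs that accept almost any object can take a random one.
FLEXIBLE_VERBS = ["need", "find", "lose"]
# Picky verbs are stored together with an object that makes sense.
FIXED_VERB_PHRASES = ["drink a glass of water"]
OBJECTS = ["a rubber chicken", "a new job", "your keys"]


def verb_phrase(rng=random):
    """Build 'You will <verb> <object>' without any semantic model."""
    options = [f"{verb} {obj}" for verb in FLEXIBLE_VERBS for obj in OBJECTS]
    options += FIXED_VERB_PHRASES
    return "You will " + rng.choice(options)


print(verb_phrase())
```

With this split, "drink" can still appear in the output, but only with the object it was written with.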
In theory, we could store some kind of semantic information about verbs, such as: what type of word usually follows "to drink"? Then we would need some kind of model to classify the objects: "is rubber chicken a drink?". Or maybe some kind of statistics, like the ones used to autocomplete text. I decided to leave semantics out of the generator to make it much, much simpler - after all, it is just a hobby project for fun. But if you happen to know about frameworks or tools offering support around semantics, I would be really happy to learn about them :-)
Once you set up the generator rules, the code of the generator itself is quite simple. The most complicated part is adding templates that can be mixed together into correct sentences. That, and of course testing. Lots of manual testing. Because you cannot guarantee that you are always generating correct grammar without having a flawless parser to test with. And the only one known (to me, at least) is the human brain.
How many templates do I need?
Lots of them! The bot generates the horoscope paragraphs once a day, with:
- 12 different horoscope signs
- 3 sentences per sign
- Each sentence uses 1 outer template + 1-2 verb phrases + 1-3 noun phrases
This means that templates end up repeating quite often unless you have a long list to choose from. So there is a need to add new templates to the lists relatively often. Which is… time consuming.
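To get a feeling for the numbers, here is the arithmetic behind the list above as a small Python snippet. The variable names are mine; the figures are the ones given in the list.

```python
SIGNS = 12
SENTENCES_PER_SIGN = 3

sentences_per_day = SIGNS * SENTENCES_PER_SIGN

# (min, max) number of templates of each kind used by one sentence.
PER_SENTENCE = {
    "outer template": (1, 1),
    "verb phrase": (1, 2),
    "noun phrase": (1, 3),
}

# Scale the per-sentence range up to a full day of horoscopes.
daily_draws = {
    kind: (low * sentences_per_day, high * sentences_per_day)
    for kind, (low, high) in PER_SENTENCE.items()
}

for kind, (low, high) in daily_draws.items():
    print(f"{kind}: {low}-{high} draws per day")
```

That is 36 sentences a day, using 36 outer templates, 36 to 72 verb phrases and 36 to 108 noun phrases. Any list much shorter than that will show repeats within a single day.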
In the next post, I will explain how I semi-automated the process of adding new templates to the bot, so that I can keep it fresh with minimum effort.