NLP-Learn

SentenceLearn generates an alignment-based language model from a document by mapping individual words and two-word phrases to their successors in the training text. It can then generate random sentences using the lexicon and the grammar captured by that model.

API

To get up and running, simply place the text to train on in a file and then run:

python3 sentence_gen.py [training file name]

This code will construct a language model (or load one if it already exists) and will allow you to generate random sentences of a given length.

There are 3 pieces of code that are significant:

process.py

Before running this file, put any text into a text file and place it in the same directory as process.py. When run, process.py prompts for the filename in the terminal (read from stdin), constructs the word model, and saves it as a file in the same directory. A pre-processed copy of the original text file is also written to the same directory.

sentence_gen.py

This code will generate a language model (or load a pre-computed one) and then enter a loop in which it reads a sentence length from stdin and generates a random sentence of that length.

successor_model.py

Public Methods:

a. random_sent() - Generates a random sentence

b. generate_k_sentences(int k) - Generates k random sentences.

c. generate_sentence_length(int k) - Generates a sentence of length k

Description of the model

I built a successor table: a dictionary that maps each word to all of the words that have followed it in the learning text. The random sentence generator first picks randomly from the words that start a sentence (words that follow a period) and then, for every word, randomly picks the next word from the successors of the current word. The selection is weighted by frequency, because the table also keeps track of how many times each word has succeeded another. This process continues until a period is picked. I called the successors table a unigram table.
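As a rough illustration of the unigram table described above, here is a minimal sketch. The function names (build_successor_table, pick_successor, random_sentence), the tokenization, and the max_words safety cap are assumptions made for this example and are not taken from successor_model.py.

```python
import random


def build_successor_table(text):
    """Map each word to a list of [successor, count] pairs."""
    # Treat "." as its own token; a leading "." marks the first word
    # of the text as a sentence starter.
    tokens = ["."] + text.replace(".", " . ").split()
    table = {}
    for current, nxt in zip(tokens, tokens[1:]):
        pairs = table.setdefault(current, [])
        for pair in pairs:
            if pair[0] == nxt:
                pair[1] += 1  # seen before: bump the count
                break
        else:
            pairs.append([nxt, 1])  # first time nxt follows current
    return table


def pick_successor(table, word, rng=random):
    """Pick a successor of `word`, weighted by how often it was seen."""
    pairs = table[word]
    words = [p[0] for p in pairs]
    weights = [p[1] for p in pairs]
    return rng.choices(words, weights=weights, k=1)[0]


def random_sentence(table, rng=random, max_words=50):
    """Start from a word that followed a period and walk until a period."""
    word = pick_successor(table, ".", rng)
    sentence = []
    while word != "." and len(sentence) < max_words:
        sentence.append(word)
        if word not in table:  # word only appeared at the very end
            break
        word = pick_successor(table, word, rng)
    return " ".join(sentence) + "."


sample = "the cat sat. the cat ran. a dog sat."
print(random_sentence(build_successor_table(sample)))
```

Because successors are drawn with random.choices and the stored counts as weights, a word that followed "the" twice is twice as likely to be picked as one that followed it once.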

This model is actually surprisingly successful, although not perfect. Frequently, the sentences produced by this model lose their "train of thought" or start devolving into strange and incorrect sentences. I felt like, in general, the problem was a lack of context.

Therefore, I also constructed a bigram model, which maps two-word phrases to their successors. Then, when generating random sentences, the previous two words can be taken into account: if we also search the bigram set, we can find a successor for the sentence that might make more sense.
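A bigram table of this kind could be built along the following lines. This is only a sketch: the function name build_bigram_table and the choice to join the two words with a space to form the key are assumptions, not details confirmed by the repository.

```python
def build_bigram_table(tokens):
    """Map each two-word phrase "w1 w2" to [successor, count] pairs."""
    table = {}
    # Slide a window of three tokens: two words of context, one successor.
    for first, second, nxt in zip(tokens, tokens[1:], tokens[2:]):
        key = first + " " + second  # assumed key format
        pairs = table.setdefault(key, [])
        for pair in pairs:
            if pair[0] == nxt:
                pair[1] += 1
                break
        else:
            pairs.append([nxt, 1])
    return table


tokens = "the cat sat . the cat ran . the cat sat .".split()
bigrams = build_bigram_table(tokens)
print(bigrams["the cat"])
```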

I integrated the bigram set with the unigram set and set a constant by which I increased the probabilistic weight of a bigram result.
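One way the two sets of candidates might be merged is sketched below. The constant BIGRAM_BOOST, its value of 3, and the helper names are hypothetical; the actual constant and the exact merging rule used in successor_model.py are not stated here.

```python
import random

BIGRAM_BOOST = 3  # hypothetical value, chosen only for illustration


def combined_candidates(unigram_pairs, bigram_pairs, boost=BIGRAM_BOOST):
    """Merge two [word, count] lists, multiplying bigram counts by `boost`."""
    weights = {}
    for word, count in unigram_pairs:
        weights[word] = weights.get(word, 0) + count
    for word, count in bigram_pairs:
        weights[word] = weights.get(word, 0) + boost * count
    return [[word, weight] for word, weight in weights.items()]


def pick(pairs, rng=random):
    """Weighted random choice over [word, weight] pairs."""
    words = [p[0] for p in pairs]
    weights = [p[1] for p in pairs]
    return rng.choices(words, weights=weights, k=1)[0]


# Successors of "cat" (unigram) and of "the cat" (bigram).
candidates = combined_candidates([["sat", 1], ["ran", 3]], [["sat", 2]])
print(candidates, pick(candidates))
```

With this scheme a successor that is supported by the two-word context gains weight relative to one that only follows the last word, which is the effect the boost constant is meant to have.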

In terms of data structures, I used a dictionary to map words to their successors, and each successor word is stored in a 2-element list together with its weight. Therefore, the primary data structure is a dictionary with strings as keys and lists of 2-element lists as values.

Random Notes

There are a few more tweaks in the model:

I increased the factor by which I preferred the bigram results when looking for a successor of a "stop" word (of, in, etc.) because context is especially useful after those words.
I eliminated quotation marks and dashes, which were creating bizarre sentences.
I kept tweaking the value for the factor by which I prefer bigrams to generate better sentences until I found a happy medium.
I throw out random sentences that are longer than 15 words (these tend to be sentences that started well intentioned but devolved).
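The tweaks above could look roughly like this in code. Everything named here is an assumption for illustration: the stop-word set contains only the two examples given above, and the boost values are placeholders, not the tuned values.

```python
import re

STOP_WORDS = {"of", "in"}  # only the examples named above; the real list is longer
BASE_BOOST = 3             # hypothetical bigram factor
STOP_WORD_BOOST = 6        # hypothetical larger factor after a stop word
MAX_WORDS = 15             # sentences longer than this are thrown out


def clean(text):
    """Replace quotation marks and dashes with spaces, then collapse whitespace."""
    stripped = re.sub(r'["\u201c\u201d\u2013\u2014-]', " ", text)
    return " ".join(stripped.split())


def bigram_boost(previous_word):
    """Prefer bigram results more strongly after a stop word."""
    if previous_word.lower() in STOP_WORDS:
        return STOP_WORD_BOOST
    return BASE_BOOST


def keep(sentence):
    """Accept only sentences of at most MAX_WORDS words."""
    return len(sentence.split()) <= MAX_WORDS


print(clean('He said "hi" -- twice'))
```

Note that this cleaning step also splits hyphenated words, which may or may not match what process.py does.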
