x = np.random.randint(0, 100, (100, 100))
np.save('test.npy', x)
read_file = load_pmi('test.npy')
Loaded test.npy
loader (path:str|Path, extension:str, recursive:bool=True, symlinks:bool=True, file_glob:str=None, file_re:str=None, folder_re:str=None, skip_file_glob:str=None, skip_file_re:str=None, skip_folder_re:str=None, func:callable=<function join>, ret_folders:bool=False)
Given a Path and an extension, returns all files with the extension in the path
Parameter | Type | Default | Details
---|---|---|---
path | str \| Path | | path to a given folder
extension | str | | extension of the file you want
recursive | bool | True | search subfolders
symlinks | bool | True | follow symlinks?
file_glob | str | None | Only include files matching glob
file_re | str | None | Only include files matching regex
folder_re | str | None | Only enter folders matching regex
skip_file_glob | str | None | Skip files matching glob
skip_file_re | str | None | Skip files matching regex
skip_folder_re | str | None | Skip folders matching regex
func | callable | join | function to apply to each matched file
ret_folders | bool | False | return folders, not just files
Returns | L | | an L of the matching files
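As a rough illustration of the core behaviour (collecting files by extension, optionally descending into subfolders), here is a minimal standard-library sketch. `find_by_extension` is a hypothetical helper, not the library's loader: it ignores symlink handling and all of the glob/regex filters listed in the table, and returns a plain list rather than an L.

```python
from pathlib import Path
import tempfile

def find_by_extension(path, extension, recursive=True):
    """Hypothetical stand-in for `loader`: list files under `path` ending in `extension`."""
    root = Path(path)
    pattern = f"*{extension}"
    # rglob walks subfolders, glob only looks at the top level
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())

# Small demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "sub").mkdir()
    (tmp / "a.txt").write_text("a")
    (tmp / "sub" / "b.txt").write_text("b")
    (tmp / "c.npy").write_bytes(b"")
    all_txt = [p.name for p in find_by_extension(tmp, ".txt")]
    top_txt = [p.name for p in find_by_extension(tmp, ".txt", recursive=False)]
```

With `recursive=False` only the top-level `a.txt` is found; the default also picks up `sub/b.txt`.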
get_data (fname:str|pathlib.Path)
Reads from a txt file
Parameter | Type | Details
---|---|---
fname | str \| Path | path to the file
Returns | str | returns content of the file
load_pmi (fname:str|pathlib.Path)
Loads the PMI matrix
Parameter | Type | Details
---|---|---
fname | str \| Path | name of pmi file
Returns | np.ndarray | pmi matrix
Loaded test.npy
load_dictionary (fname:str)
Given a fname, loads a pkl dictionary from the current directory
Parameter | Type | Details
---|---|---
fname | str | path to the pkl file
Returns | dict | returns the contents
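For readers unfamiliar with pkl files, the round trip can be sketched with the standard `pickle` module. `load_pickled_dict` is a hypothetical stand-in and the dictionary contents are made up for the demo; the library's own load_dictionary may differ in details such as path handling.

```python
import pickle
import tempfile
from pathlib import Path

def load_pickled_dict(fname):
    """Hypothetical stand-in for `load_dictionary`: read one pickled object from `fname`."""
    with open(fname, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "example.pkl"
    # Write a small dictionary so there is something to load back
    with open(pkl_path, "wb") as f:
        pickle.dump({"method": "demo", "values": [1, 2, 3]}, f)
    loaded = load_pickled_dict(pkl_path)
```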
normalize (data:numpy.ndarray)
Given an input array, return normalized array
Parameter | Type | Details
---|---|---
data | np.ndarray | input array
Returns | np.ndarray | normalized array
chelp ()
Show help for all console scripts
clean_file Takes name of a txt file and writes the tokenized sentences into a new txt file
corr_hm Generates correlation plots from normalized SSMs
cp_help Show help for all console scripts
heatmaps Generates plots for embeddings in the folder
heatmaps_pkl Generates SSMs from pkl files
histograms Generates histograms for embeddings in the folder
lex_ts Generate lexical TS from Lexical SSM
make_pkl Create pkl for time series from embeddings
ts_pkl Plot timeseries from the pkl file
Before using any of the cleaning utils in the file, please run download_nltk_dep first.
download_nltk_dep ()
Downloads the nltk dependencies
split_by_newline (text:str)
Only use when sentences are already tokenized. Returns sentences split by newline (\n).
Parameter | Type | Details
---|---|---
text | str | sentences separated by \n
Returns | L | list of sentences
(#2) ['Hello there!','This is how this functions works!']
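A minimal sketch of the same idea in plain Python, assuming the input holds one sentence per line. `split_lines` is a hypothetical name, it returns a regular list instead of an L, and dropping blank lines is an assumption rather than documented behaviour.

```python
def split_lines(text):
    """Hypothetical stand-in for `split_by_newline`: one sentence per line, blanks dropped."""
    return [line.strip() for line in text.split("\n") if line.strip()]

sentences = split_lines("Hello there!\nThis is how this functions works!")
```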
rm_useless_spaces (t:str)
Removes useless spaces
Parameter | Type | Details
---|---|---
t | str | sentence with extra spaces
Returns | str | sentence without extra spaces
'This is test sentence. This removes all the extra spaces.'
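One common way to get this effect is a single regular-expression substitution. The sketch below uses a hypothetical name, `collapse_spaces`, and is not necessarily how rm_useless_spaces is implemented.

```python
import re

def collapse_spaces(t):
    """Hypothetical stand-in for `rm_useless_spaces`: squeeze runs of spaces into one."""
    return re.sub(r" {2,}", " ", t)

cleaned = collapse_spaces("This  is test   sentence.    This removes all the extra spaces.")
```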
make_sentences (text:str)
Converts given bulk text into sentences
Parameter | Type | Details
---|---|---
text | str | bulk text
Returns | L | list of sentences
write_to_file_cleaned (sentences:list, fname:str)
Writes the sentences to a .txt file
Parameter | Type | Details
---|---|---
sentences | list | list of sentences
fname | str | name of output file
Returns | None |
clean (fname:str)
Takes name of a txt file and writes the tokenized sentences into a new txt file
Parameter | Type | Details
---|---|---
fname | str | name of input txt file
Returns | None |
All functions mentioned above are merged into a single function called clean. You only need to call it with the name of the .txt file that you want to clean.
MARLEY was dead: to begin with. There is no doubt
whatever about that. The register of his burial was
signed by the clergyman, the clerk, the undertaker,
and the chief mourner. Scrooge signed it: and
Scrooge's name was good upon 'Change, for anything he
chose to put his hand to. Old Marley was as dead as a
door-nail.
Mind! I don't mean to say that I know, of my
own knowledge, what there is particularly dead about
a door-nail. I might have been inclined, myself, to
regard a coffin-nail as the deadest piece of ironmongery
in the trade. But the wisdom of our ancestors
is in the simile; and my unhallowed hands
shall not disturb it, or the Country's done for. You
will therefore permit me to repeat, emphatically, that
Marley was as dead as a door-nail.
This is a new sentence.
It goes from this to
(#11) ['MARLEY was dead: to begin with.','There is no doubt whatever about that.','The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner.',"Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to.",'Old Marley was as dead as a door-nail.','Mind!',"I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail.",'I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade.',"But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country's done for.",'You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail.'...]
The clean function writes these sentences into a txt file with the name <fname>_cleaned.txt
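The overall flow (read the file, undo the hard line wraps, split into sentences, write one sentence per line) can be sketched as follows. Everything here is illustrative: `clean_sketch` is a hypothetical name, the regex split is a naive substitute for a real sentence tokenizer, and building the output name from the file stem is an assumption about how <fname>_cleaned.txt is formed.

```python
import re
import tempfile
from pathlib import Path

def clean_sketch(fname):
    """Hypothetical sketch of `clean`: unwrap lines, split sentences, write them out."""
    src = Path(fname)
    # Collapse hard-wrapped lines and blank lines into single spaces
    text = " ".join(src.read_text().split())
    # Naive split after ., ! or ? (an assumption; not the library's tokenizer)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    out = src.with_name(f"{src.stem}_cleaned.txt")
    out.write_text("\n".join(sentences))
    return out

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Old Marley was as dead as a\ndoor-nail.\n\nMind! This is a new sentence.")
    out_path = clean_sketch(sample)
    out_name = out_path.name
    cleaned_lines = out_path.read_text().split("\n")
```

A regex split like this breaks on abbreviations such as "Mr.", which is one reason to prefer a trained tokenizer for real text.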
get_wordnet_pos (word:str)
Map POS tag to first character lemmatize() accepts
Parameter | Type | Details
---|---|---
word | str | input word token
Returns | str | POS of the given word
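The mapping step can be illustrated without nltk. WordNet lemmatizers accept single-character POS codes such as 'a' (adjective), 'n' (noun), 'v' (verb) and 'r' (adverb), while Penn Treebank tags start with J, N, V and R for those classes. The sketch below starts from an already-computed tag, whereas get_wordnet_pos takes the word itself; `pos_to_wordnet_char` and the noun fallback are assumptions.

```python
def pos_to_wordnet_char(penn_tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS character."""
    tag_map = {"J": "a", "N": "n", "V": "v", "R": "r"}  # adjective, noun, verb, adverb
    # Fall back to noun when the tag is not recognised (assumption)
    return tag_map.get(penn_tag[:1].upper(), "n")

tags = [pos_to_wordnet_char(t) for t in ("VBD", "NNS", "JJ", "RB", "DT")]
```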
remove_stopwords (sentence:str)
Takes a sentence and removes stopwords from it
Parameter | Type | Details
---|---|---
sentence | str | input sentence
Returns | str | output sentence
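A minimal sketch of stopword filtering, assuming whitespace tokenization and case-insensitive matching. The tiny `STOPWORDS` set and the name `drop_stopwords` are made up for illustration; the library's list is not shown here and is probably much larger.

```python
# Tiny illustrative stopword set (assumption, not the library's list)
STOPWORDS = {"was", "to", "with", "there", "is", "no", "about", "that", "this", "a"}

def drop_stopwords(sentence):
    """Hypothetical stand-in for `remove_stopwords`."""
    kept = [w for w in sentence.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

result = drop_stopwords("MARLEY was dead to begin with")
only_stopwords = drop_stopwords("This is that")
```

A sentence made up entirely of stopwords comes back as an empty string, so callers may want to check for that.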
remove_punctuations (sentence:str)
Takes a sentence and removes punctuations from it
Parameter | Type | Details
---|---|---
sentence | str | input sentence
Returns | str | output sentence
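One way to strip punctuation is to translate every punctuation character to a space and then collapse the leftover whitespace, which turns "door-nail" into "door nail". This is a sketch under that assumption; `strip_punctuation` is a hypothetical name and only ASCII punctuation is handled.

```python
import string

def strip_punctuation(sentence):
    """Hypothetical stand-in for `remove_punctuations`."""
    # Map every ASCII punctuation character to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split/join collapses the doubled spaces left behind
    return " ".join(sentence.translate(table).split())

no_punct = strip_punctuation("Old Marley was as dead as a door-nail.")
```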
remove_punc_clean (sentence:str, lemmatize:bool=False)
Takes a sentence and removes punctuations and stopwords from it. Will lemmatize words if lemmatize = True.
Parameter | Type | Default | Details
---|---|---|---
sentence | str | | input sentence
lemmatize | bool | False | flag to lemmatize
Returns | str | |
When using remove_punc_clean, a sentence may be eliminated completely if it contained only stopwords.
process_for_lexical (fname:str)
Given an input txt file, return removed sentences
Parameter | Type | Details
---|---|---
fname | str | name of the input txt file
Returns | L |
(#11) ['MARLEY was dead: to begin with.','There is no doubt whatever about that.','The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner.',"Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to.",'Old Marley was as dead as a door-nail.','Mind!',"I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail.",'I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade.',"But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country's done for.",'You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail.'...]
Let’s continue the same example from above
Here, the remove_punc_clean function removes punctuations and STOPWORDS, lemmatizes the words, and returns the cleaned sentence.
It is possible that a sentence may be removed completely as it may contain only STOPWORDS.
This method is to be used for methods involving lexical analysis.
Without lemmatization
MARLEY dead begin
doubt whatever
register burial signed clergyman clerk undertaker chief mourner
Scrooge signed Scrooge name good upon Change anything chose put hand
Old Marley dead door nail
Mind
mean say know knowledge particularly dead door nail
might inclined regard coffin nail deadest piece ironmongery trade
wisdom ancestors simile unhallowed hands shall disturb Country done
therefore permit repeat emphatically Marley dead door nail
new sentence
With lemmatization
MARLEY dead begin
doubt whatever
register burial sign clergyman clerk undertaker chief mourner
Scrooge sign Scrooge name good upon Change anything chose put hand
Old Marley dead door nail
Mind
mean say know knowledge particularly dead door nail
might inclined regard coffin nail deadest piece ironmongery trade
wisdom ancestor simile unhallowed hand shall disturb Country do
therefore permit repeat emphatically Marley dead door nail
new sentence
num_words (sentence:str)
Returns the number of words in a sentence
Parameter | Type | Details
---|---|---
sentence | str | input sentence
Returns | int | number of words
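Counting words is often done by splitting on whitespace. The sketch below assumes that definition, so a hyphenated token like "door-nail." counts once; `count_words` is a hypothetical name and num_words may count differently.

```python
def count_words(sentence):
    """Hypothetical stand-in for `num_words`: whitespace-separated token count."""
    return len(sentence.split())

n = count_words("Old Marley was as dead as a door-nail.")
```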
pathlib.Path
Besides the utility functions above, a few additional functions are applied to pathlib.Path. There are 3 additional functions/properties available when a Path object points to a numpy array or a txt file.
Path.shape ()
Imagine I read a numpy array and I wish to see its shape. If I were to use the regular route, I would have to…
Instead of all of that, I can just call Path().shape, like this
Path.text ()
Using this same logic, when I have a txt file inside a Path object, I can read its text
"MARLEY was dead: to begin with. There is no doubt\nwhatever about that. The register of his burial was\nsigned by the clergyman, the clerk, the undertaker,\nand the chief mourner. Scrooge signed it: and\nScrooge's name was good upon 'Change, for anything he\nchose to put his hand to. Old Marley was as dead as a\ndoor-nail.\n\nMind! I don't mean to say that I know, of my\nown knowledge, what there is particularly dead about\na door-nail. I might have been inclined, myself, to\nregard a coffin-nail as the deadest piece of ironmongery\nin the trade. But the wisdom of our ancestors\nis in the simile; and my unhallowed hands\nshall not disturb it, or the Country's done for. You\nwill therefore permit me to repeat, emphatically, that\nMarley was as dead as a door-nail.\n\nThis is a new sentence."
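For readers curious how a property can be attached to pathlib.Path at all, here is a minimal sketch using plain attribute assignment on the class. The property name `text_contents` is hypothetical (chosen so it does not clash with anything the library defines), and the library may well use a patching decorator instead.

```python
import tempfile
from pathlib import Path

def _read_text_contents(self):
    # Read the file this Path points to and return it as a string
    return self.read_text()

# Attach as a read-only property; `text_contents` is a hypothetical name
Path.text_contents = property(_read_text_contents)

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "sample.txt"
    p.write_text("MARLEY was dead: to begin with.")
    contents = p.text_contents
```

Because the property lives on the Path class, every Path instance gains it, including ones created before the assignment.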
Path.sentences ()
(#11) ['MARLEY was dead: to begin with.','There is no doubt whatever about that.','The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner.',"Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to.",'Old Marley was as dead as a door-nail.','Mind!',"I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail.",'I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade.',"But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country's done for.",'You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail.'...]