Utils

Assorted utilities for loading data, cleaning text, and extending pathlib.Path.

Generic utils


source

loader

 loader (path:str|Path, extension:str, recursive:bool=True,
         symlinks:bool=True, file_glob:str=None, file_re:str=None,
         folder_re:str=None, skip_file_glob:str=None,
         skip_file_re:str=None, skip_folder_re:str=None,
         func:callable=<function join>, ret_folders:bool=False)

Given a path and an extension, returns all files under that path that have the given extension.

Type Default Details
path str | Path path to a given folder
extension str extension of the file you want
recursive bool True search subfolders
symlinks bool True follow symlinks?
file_glob str None Only include files matching glob
file_re str None Only include files matching regex
folder_re str None Only enter folders matching regex
skip_file_glob str None Skip files matching glob
skip_file_re str None Skip files matching regex
skip_folder_re str None Skip folders matching regex
func callable join function to apply to each matched file
ret_folders bool False return folders, not just files
Returns L L of matched paths
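
To make the parameters above concrete, here is a minimal sketch of the kind of search loader describes, written with the standard library only. The name find_files is hypothetical, it covers just extension, recursive and skip_file_re, and it returns a plain list rather than an L, so treat it as an illustration and not as the library's implementation.

```python
import re
import tempfile
from pathlib import Path


def find_files(path, extension, recursive=True, skip_file_re=None):
    """Hypothetical stand-in for `loader`: list files under `path` ending in `extension`."""
    root = Path(path)
    # rglob descends into subfolders, glob stays at the top level
    candidates = root.rglob(f"*{extension}") if recursive else root.glob(f"*{extension}")
    skip = re.compile(skip_file_re) if skip_file_re else None
    found = []
    for p in candidates:
        if not p.is_file():
            continue  # ignore folders whose name happens to match
        if skip and skip.search(p.name):
            continue  # file name matches the skip pattern
        found.append(p)
    return sorted(found)


# Build a small throwaway folder tree to try it on
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("a")
    (root / "sub" / "b.txt").write_text("b")
    (root / "c.npy").write_text("placeholder")

    all_txt = sorted(p.name for p in find_files(root, ".txt"))
    top_txt = sorted(p.name for p in find_files(root, ".txt", recursive=False))
    no_b = sorted(p.name for p in find_files(root, ".txt", skip_file_re=r"^b"))
```

With recursive=False only the top-level file is found, and the skip pattern drops any file whose name starts with b.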

source

get_data

 get_data (fname:str|pathlib.Path)

Reads from a txt file

Type Details
fname str | Path path to the file
Returns str returns content of the file

source

load_pmi

 load_pmi (fname:str|pathlib.Path)

Loads the PMI matrix

Type Details
fname str | Path name of pmi file
Returns np.ndarray pmi matrix
x = np.random.randint(0 , 100, (100, 100))
np.save('test.npy', x)
read_file = load_pmi('test.npy')
Loaded test.npy
test_eq(x, read_file)

source

load_dictionary

 load_dictionary (fname:str)

Given a file name, loads a pickled dictionary from the current directory.

Type Details
fname str path to the pkl file
Returns dict returns the contents
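
As a rough illustration of what loading a pkl dictionary involves, the sketch below writes a dictionary with pickle and reads it back. The helper name load_pickled_dict and the example keys are made up for this sketch; load_dictionary itself may resolve paths or report progress differently.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickled_dict(fname):
    """Hypothetical stand-in for `load_dictionary`: unpickle one object from `fname`."""
    with open(fname, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "example.pkl"
    original = {"sentence_0": [0.1, 0.2], "sentence_1": [0.3, 0.4]}
    # Write the dictionary first so there is something to load
    with open(pkl_path, "wb") as f:
        pickle.dump(original, f)
    restored = load_pickled_dict(pkl_path)
```

Only unpickle files you trust, since pickle can execute arbitrary code while loading.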

source

normalize

 normalize (data:numpy.ndarray)

Given an input array, returns the normalized array.

Type Details
data np.ndarray input array
Returns np.ndarray normalized array
test_eq(normalize([1, 2, 3, 4, 5]), [0.  , 0.25, 0.5 , 0.75, 1.  ])
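
The test above is consistent with min-max scaling to the range [0, 1]. Assuming that is what normalize does, the arithmetic can be sketched in plain Python as follows. The name min_max_normalize and the handling of constant input are choices made for this sketch, not taken from the library.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant input: avoid dividing by zero (a choice made for this sketch)
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


scaled = min_max_normalize([1, 2, 3, 4, 5])
```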

source

chelp

 chelp ()

Show help for all console scripts

chelp()
clean_file                Takes name of a txt file and writes the tokenized sentences into a new txt file
corr_hm                   Generates correlation plots from normalized SSMs
cp_help                   Show help for all console scripts
heatmaps                  Generates plots for embeddings in the folder
heatmaps_pkl              Generates SSMs from pkl files
histograms                Generates histograms for embeddings in the folder
lex_ts                    Generate lexical TS from Lexical SSM
make_pkl                  Create pkl for time series from embeddings
ts_pkl                    Plot timeseries from the pkl file

Utils for cleaning text

Before using any of the cleaning utils in the file, please run download_nltk_dep first.


source

download_nltk_dep

 download_nltk_dep ()

Downloads the nltk dependencies


source

split_by_newline

 split_by_newline (text:str)

Only use when sentences are already tokenized. Returns sentences split by \n

Type Details
text str sentences separated by \n
Returns L list of sentences
text = "Hello there!\nThis is how this functions works!"
split_by_newline(text)
(#2) ['Hello there!','This is how this functions works!']

source

rm_useless_spaces

 rm_useless_spaces (t:str)

Removes useless spaces

Type Details
t str sentence with extra spaces
Returns str sentence without extra spaces
rm_useless_spaces('  This is      test sentence.  This removes  all the extra  spaces.  ')
'This is test sentence. This removes all the extra spaces.'
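
One way to get the behavior shown above is a regular expression that collapses runs of spaces, followed by a strip. This is a sketch under that assumption. The name collapse_spaces is hypothetical and the library's own implementation may differ.

```python
import re


def collapse_spaces(t):
    """Replace runs of two or more spaces with one space and trim both ends."""
    return re.sub(r" {2,}", " ", t).strip()


messy = "  This is      test sentence.  This removes  all the extra  spaces.  "
tidy = collapse_spaces(messy)
```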

source

make_sentences

 make_sentences (text:str)

Converts the given bulk text into sentences

Type Details
text str bulk text
Returns L list of sentences

source

write_to_file_cleaned

 write_to_file_cleaned (sentences:list, fname:str)

Writes the sentences to a .txt file

Type Details
sentences list list of sentences
fname str name of output file
Returns None

source

clean

 clean (fname:str)

Takes name of a txt file and writes the tokenized sentences into a new txt file

Type Details
fname str name of input txt file
Returns None

All the functions mentioned above are combined into a single function called clean. You only need to pass it the name of the .txt file that you want to clean.

fname = '../files/dummy.txt'
text = get_data(fname)
print(text)
MARLEY was dead: to begin with. There is no doubt
whatever about that. The register of his burial was
signed by the clergyman, the clerk, the undertaker,
and the chief mourner. Scrooge signed it: and
Scrooge's name was good upon 'Change, for anything he
chose to put his hand to. Old Marley was as dead as a
door-nail.

Mind! I don't mean to say that I know, of my
own knowledge, what there is particularly dead about
a door-nail. I might have been inclined, myself, to
regard a coffin-nail as the deadest piece of ironmongery
in the trade. But the wisdom of our ancestors
is in the simile; and my unhallowed hands
shall not disturb it, or the Country's done for. You
will therefore permit me to repeat, emphatically, that
Marley was as dead as a door-nail.

This is a new sentence.

make_sentences turns the text above into a list of sentences:

make_sentences(get_data(fname))
(#11) ['MARLEY was dead: to begin with.','There is no doubt whatever about that.','The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner.',"Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to.",'Old Marley was as dead as a door-nail.','Mind!',"I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail.",'I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade.',"But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country's done for.",'You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail.'...]

The clean function writes these sentences into a txt file named <fname>_cleaned.txt.
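
The read, split and write steps can be sketched end to end as below. This is not the library's code: a naive regex splitter stands in for the NLTK-based tokenizer, the names naive_sentences and clean_sketch are hypothetical, and writing one sentence per line is an assumption about the output layout.

```python
import re
import tempfile
from pathlib import Path


def naive_sentences(text):
    """Join wrapped lines, then split after '.', '!' or '?' (a crude stand-in for NLTK)."""
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def clean_sketch(fname):
    """Write the sentences of `fname` to <stem>_cleaned.txt, one per line (assumed layout)."""
    src = Path(fname)
    sentences = naive_sentences(src.read_text())
    out = src.with_name(f"{src.stem}_cleaned.txt")
    out.write_text("\n".join(sentences))
    return out


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "dummy.txt"
    src.write_text("Old Marley was as dead as a\ndoor-nail.\n\nMind! This is a new sentence.")
    out_path = clean_sketch(src)
    out_name = out_path.name
    cleaned_lines = out_path.read_text().split("\n")
```

The regex splitter breaks on abbreviations such as "Mr.", which is one reason a trained tokenizer is the better choice for real text.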


source

get_wordnet_pos

 get_wordnet_pos (word:str)

Map POS tag to first character lemmatize() accepts

Type Details
word str input word token
Returns str POS of the given word
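
The mapping itself is small: the first letter of a Penn Treebank tag selects one of the single-character WordNet POS codes ('a', 'n', 'v', 'r'). The sketch below shows only that mapping and takes a tag instead of a word, so it runs without NLTK. The name treebank_to_wordnet and the noun fallback are assumptions based on the common recipe, not confirmed details of get_wordnet_pos.

```python
# First letter of a Penn Treebank tag -> WordNet POS character
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}


def treebank_to_wordnet(tag):
    """Map a Treebank tag such as 'VBD' to a WordNet POS code, defaulting to noun."""
    return WORDNET_POS.get(tag[:1].upper(), "n")


verb_code = treebank_to_wordnet("VBD")  # past-tense verb
adj_code = treebank_to_wordnet("JJ")    # adjective
other_code = treebank_to_wordnet("DT")  # determiner, not in the mapping
```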

source

remove_stopwords

 remove_stopwords (sentence:str)

Takes a sentence and removes stopwords from it

Type Details
sentence str input sentence
Returns str output sentence

source

remove_punctuations

 remove_punctuations (sentence:str)

Takes a sentence and removes punctuation from it

Type Details
sentence str input sentence
Returns str output sentence

source

remove_punc_clean

 remove_punc_clean (sentence:str, lemmatize:bool=False)

Takes a sentence and removes punctuation and stopwords from it.

Will lemmatize words if lemmatize = True.

Type Default Details
sentence str input sentence
lemmatize bool False flag to lemmatize
Returns str
Note

When using remove_punc_clean, a sentence may be eliminated completely if it contains only stopwords.
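
The following sketch shows how a sentence can vanish. It uses a tiny made-up stopword set in place of NLTK's list and a hypothetical helper toy_punc_clean, so the exact words removed will differ from the real function.

```python
import string

# Toy stopword set for illustration only; the real list comes from NLTK
TOY_STOPWORDS = {"it", "is", "was", "as", "a", "the", "to", "that"}

# Replace every punctuation character with a space so "door-nail" becomes two words
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def toy_punc_clean(sentence, stopwords=TOY_STOPWORDS):
    """Strip punctuation, then drop every word found in `stopwords` (case-insensitive)."""
    words = sentence.translate(_PUNCT_TO_SPACE).split()
    return " ".join(w for w in words if w.lower() not in stopwords)


kept = toy_punc_clean("Old Marley was as dead as a door-nail.")
gone = toy_punc_clean("It is.")
```

Here gone is the empty string, so callers that build a list of cleaned sentences may want to filter out empty results.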


source

process_for_lexical

 process_for_lexical (fname:str)

Given an input txt file, return removed sentences

Type Details
fname str name of the input txt file
Returns L

Example contd.

data = get_data(fname)
sentences = make_sentences(data)
sentences
(#11) ['MARLEY was dead: to begin with.','There is no doubt whatever about that.','The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner.',"Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to.",'Old Marley was as dead as a door-nail.','Mind!',"I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail.",'I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade.',"But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country's done for.",'You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail.'...]

Continuing the example above, remove_punc_clean removes punctuation and stopwords, optionally lemmatizes the words, and returns the cleaned sentence.

A sentence may be removed completely if it contains only stopwords.

Use this function for methods that involve lexical analysis.

Without lemmatization

for sentence in sentences:
    print(remove_punc_clean(sentence))
MARLEY dead begin
doubt whatever
register burial signed clergyman clerk undertaker chief mourner
Scrooge signed Scrooge name good upon Change anything chose put hand
Old Marley dead door nail
Mind
mean say know knowledge particularly dead door nail
might inclined regard coffin nail deadest piece ironmongery trade
wisdom ancestors simile unhallowed hands shall disturb Country done
therefore permit repeat emphatically Marley dead door nail
new sentence

With lemmatization

for sentence in sentences:
    print(remove_punc_clean(sentence, lemmatize=True))
MARLEY dead begin
doubt whatever
register burial sign clergyman clerk undertaker chief mourner
Scrooge sign Scrooge name good upon Change anything chose put hand
Old Marley dead door nail
Mind
mean say know knowledge particularly dead door nail
might inclined regard coffin nail deadest piece ironmongery trade
wisdom ancestor simile unhallowed hand shall disturb Country do
therefore permit repeat emphatically Marley dead door nail
new sentence
clean('../files/dummy.txt')
dummy.txt contains 11 sentences
process_for_lexical('../files/dummy.txt')
Done processing dummy.txt
(#0) []
Path('../files/').ls()
(#2) [Path('../files/dummy.txt'),Path('../files/dummy_cleaned.txt')]

source

num_words

 num_words (sentence:str)

Returns the number of words in a sentence

Type Details
sentence str input sentence
Returns int number of words
print(sentences[0])
num_words(sentences[0])
MARLEY was dead: to begin with.
6
print(sentences[1])
num_words(sentences[1])
There is no doubt whatever about that.
7
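
The counts above match what a plain whitespace split gives. A minimal sketch under that assumption follows. The name count_words is hypothetical and num_words may tokenize differently.

```python
def count_words(sentence):
    """Count whitespace-separated tokens; punctuation stays attached to its word."""
    return len(sentence.split())


first = count_words("MARLEY was dead: to begin with.")
second = count_words("There is no doubt whatever about that.")
```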

Patches to pathlib.Path

In addition to the utility functions above, a few extra properties are patched onto pathlib.Path. There are 3 of them, for when a Path object points to a numpy array or a txt file.


source

Path.shape

 Path.shape ()

Imagine I read a numpy array and I wish to see its shape. If I were to use the regular route, I would have to…

with working_directory('/home/deven'):
    p = 'test.npy'
    arr = np.load(p)
arr.shape
(100, 100)

Instead of all of that, I can just read the shape property of the Path, like this:

with working_directory('/home/deven'):
    shp = Path('test.npy').shape
    test_eq(arr.shape, Path('test.npy').shape)

source

Path.text

 Path.text ()

By the same logic, when a Path object points to a txt file, Path.text returns its contents:

Path('../files/dummy.txt').text
"MARLEY was dead: to begin with. There is no doubt\nwhatever about that. The register of his burial was\nsigned by the clergyman, the clerk, the undertaker,\nand the chief mourner. Scrooge signed it: and\nScrooge's name was good upon 'Change, for anything he\nchose to put his hand to. Old Marley was as dead as a\ndoor-nail.\n\nMind! I don't mean to say that I know, of my\nown knowledge, what there is particularly dead about\na door-nail. I might have been inclined, myself, to\nregard a coffin-nail as the deadest piece of ironmongery\nin the trade. But the wisdom of our ancestors\nis in the simile; and my unhallowed hands\nshall not disturb it, or the Country's done for. You\nwill therefore permit me to repeat, emphatically, that\nMarley was as dead as a door-nail.\n\nThis is a new sentence."

source

Path.sentences

 Path.sentences ()
Path('../files/dummy.txt').sentences
(#11) ['MARLEY was dead: to begin with.','There is no doubt whatever about that.','The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner.',"Scrooge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to.",'Old Marley was as dead as a door-nail.','Mind!',"I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail.",'I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade.',"But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country's done for.",'You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail.'...]
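
For readers curious how a property ends up on pathlib.Path, here is a minimal sketch of the patching idea using only the standard library. The attribute name text_sketch is hypothetical and chosen so it cannot clash with the real Path.text. The library may well attach its properties with a helper such as a patch decorator instead.

```python
import tempfile
from pathlib import Path


def _read_contents(self):
    """Return the file's contents as a string."""
    return self.read_text()


# Attach a read-only property to the Path class; every Path instance now has it
Path.text_sketch = property(_read_contents)

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "note.txt"
    p.write_text("This is a new sentence.")
    contents = p.text_sketch  # no parentheses, it is a property
```

Because the attribute is set on the class, it applies to all Path objects in the process, which is convenient in a notebook but worth keeping in mind in shared code.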