I have a lot of text generated by a voice-to-text application (I honestly do not know the name of the application, since I do not have physical access to it, only to its live output). I am mining this data in real time, and the output text looks like the first attachment: some parts are very clean, and some are, well, very redundant.
I have now written a piece of software in Python that cleans up the text (attachment two). The catch is that it only works on a lot of text at a time, e.g. my backups, which hold hundreds of megabytes of pure text. When the text comes in in real time, it is hard to process only a few lines at a time, since the semi-redundancy lasts 15-25 lines (as you can see in attachment one).
The software works on the bigger files, and I am now trying to rewrite the code so it works with the live output.
But since I am a self-taught programmer, I was wondering if anybody could share how they would approach the job.
My approach is as follows (also see attachment two; I am bad at commenting, though, so I do not know how much you would get out of it):
1. Open the file (plain text) and wait until 25 lines have been written to it
2. Read 25 lines into a list, let's call it MasterList
3. Run clean-up functions (1-7) on MasterList (see below)
4. Print lines 10-14 to the cleaned-up file (the first time, it prints lines 0-14)
5. Shift lines 5-24 of MasterList to the beginning of MasterList, so that they now have indices 0-19
6. Read 5 new lines into MasterList, or wait until 5 new lines are ready
7. Go back to #3
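A minimal sketch of the loop above, not the code from attachment two. The names `sliding_clean`, `clean`, `emit_to` and `mark_exact_repeats` are made up for illustration. It assumes the clean-up step keeps the window length unchanged and marks a deleted line as `None`, so that the index arithmetic in the steps stays valid, and it assumes the input is any iterable of lines (a generator that tails the live file would also fit). When the input ends, the remaining lines are flushed; a truly live stream would never reach that point.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Window = List[Optional[str]]


def sliding_clean(lines: Iterable[str],
                  clean: Callable[[Window], Window],
                  window: int = 25, step: int = 5,
                  emit_to: int = 15) -> Iterator[str]:
    """Yield cleaned lines using a 25-line window that advances 5 lines at a time."""
    master: Window = []
    emitted = 0  # leading window slots that were already written out
    for line in lines:
        master.append(line)
        if len(master) < window:
            continue  # wait until the window is full
        master = clean(master)
        # First pass emits slots 0-14, later passes emit slots 10-14.
        for item in master[emitted:emit_to]:
            if item is not None:
                yield item
        master = master[step:]      # slots 5-24 become slots 0-19
        emitted = emit_to - step    # slots 0-9 are already written
    # End of input: clean and flush whatever has not been written yet.
    master = clean(master)
    for item in master[emitted:]:
        if item is not None:
            yield item


def mark_exact_repeats(window: Window) -> Window:
    """Placeholder cleaner: mark a line as None if it equals the last kept line."""
    out: Window = []
    last = None
    for line in window:
        if line is not None and line == last:
            out.append(None)
        else:
            out.append(line)
            if line is not None:
                last = line
    return out
```

Writing the result out could then look like `for line in sliding_clean(source, mark_exact_repeats): cleaned_file.write(line)`, where `source` and `cleaned_file` are whatever input and output handles are in use.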
Note regarding #3: the clean-up functions do the following:
* Compare lines using fuzzy string matching (fuzzywuzzy) and delete duplicate or semi-duplicate lines
* Check if the first word in a sentence is the same as the last word in the previous sentence, and in that case delete the last word in the previous sentence
* Smaller stuff, to make the text look clean
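To illustrate the first two clean-up steps, here is a rough sketch that operates on a plain list of strings and returns a new list. It is not the code from attachment two: it swaps in `difflib.SequenceMatcher` from the standard library in place of fuzzywuzzy so that it runs without extra packages, and the function names, the 0.85 threshold and the choice to keep the first of two similar lines are all assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two lines, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def drop_near_duplicates(lines: List[str], threshold: float = 0.85,
                         lookback: int = 25) -> List[str]:
    """Keep a line only if it is not too similar to a recently kept line."""
    kept: List[str] = []
    for line in lines:
        recent = kept[-lookback:]
        if any(similarity(line, other) >= threshold for other in recent):
            continue  # duplicate or semi-duplicate: skip it
        kept.append(line)
    return kept


def trim_overlapping_word(lines: List[str]) -> List[str]:
    """If a line starts with the previous line's last word, drop that last word."""
    result: List[str] = []
    for line in lines:
        if result:
            prev_words = result[-1].split()
            words = line.split()
            if prev_words and words and prev_words[-1].lower() == words[0].lower():
                result[-1] = " ".join(prev_words[:-1])
        result.append(line)
    return result
```

The threshold would need tuning against real output, and keeping the longer of two similar lines instead of the first may suit this kind of data better.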
My question is: would you go about it in a completely different way? Maybe machine learning? Would another language be better suited? Are there any libraries, or even existing software, that already do this?
If you do read my code, I am also eager to learn from my mistakes. If you see me doing something stupid, criticism (even harsh criticism, if you feel like bashing me) is very welcome.