Gender? | w2 – python with command line- textblob

 

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The above is the poem I made in the end.

It is built from two source texts, both English translations of ancient Chinese poems. One poet is male and the other is female.

The poem by the male poet is called Bring In The Wine; the poem by the female poet is called Little Overlapping Hills. Here they are.

The process I used to make the final poem:

  • Use textblob to tag each word with its part of speech, such as noun or verb.
  • Write a Python script that counts the 10 most frequent nouns in each poem.
  • Write another Python script that combines the two word lists line by line, joining each pair of words with a “:”.

The commands I ran on the command line are below. The Python scripts themselves follow.

  • python wordCount_textBlob_tag.py <bring_in_the_wine.txt >male.txt
  • python wordCount_textBlob_tag.py <little_over_lapping_hills.txt >female.txt
  • python combineTwoFiles.py male.txt female.txt >final.txt

Code for wordCount_textBlob_tag.py


import sys, string
from textblob import TextBlob
from collections import Counter

# Read the whole poem from stdin and normalize it to lowercase.
text = sys.stdin.read().lower()
# Strip punctuation so that "wine," and "wine" count as the same word.
text = "".join(ch for ch in text if ch not in string.punctuation)

# Tag every word with its part of speech.
blob = TextBlob(text)
word_list = []
for word, tag in blob.tags:
    # Noun tags all contain "NN" (NN, NNS, NNP, NNPS).
    if "NN" in tag:
        word_list.append(word)

# Print the 10 most frequent nouns, one per line.
counter = Counter(word_list)
for word, count in counter.most_common(10):
    print(word)
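
To see what the "NN" in tag test keeps, here is a minimal sketch that runs the same filter over a hand-written list of (word, tag) pairs instead of calling textblob. The sample pairs and the helper name top_nouns are made up for illustration, and the tags are assumed to follow the Penn Treebank convention, where the noun tags are NN, NNS, NNP and NNPS. It is written for Python 3.

```python
from collections import Counter

# Hypothetical sample of (word, tag) pairs, shaped like the pairs in blob.tags.
# Tags are assumed to follow the Penn Treebank convention.
SAMPLE_TAGS = [
    ("wine", "NN"), ("pours", "VBZ"), ("cups", "NNS"),
    ("golden", "JJ"), ("wine", "NN"), ("moon", "NN"),
    ("drink", "VB"), ("cups", "NNS"), ("wine", "NN"),
]


def top_nouns(tagged_words, limit=10):
    """Return the most frequent words whose tag contains "NN"."""
    nouns = [word for word, tag in tagged_words if "NN" in tag]
    return [word for word, _count in Counter(nouns).most_common(limit)]


if __name__ == "__main__":
    for word in top_nouns(SAMPLE_TAGS):
        print(word)
```

With this sample, the verbs and the adjective are dropped, and both singular (NN) and plural (NNS) nouns are counted.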

 

 

Code for combineTwoFiles.py


import sys

# Read both word lists, one word per line, without the trailing newlines.
with open(sys.argv[1]) as file1:
    file1_lines = file1.read().splitlines()
with open(sys.argv[2]) as file2:
    file2_lines = file2.read().splitlines()

# zip() stops at the shorter list, so extra words in the longer one are dropped.
for first, second in zip(file1_lines, file2_lines):
    print(first + ":" + second)
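
The combine script stops at the shorter of the two word lists, so any extra words in the longer file never reach the poem. The sketch below shows that behaviour on in-memory lists, next to one possible alternative that keeps the leftover words. The function names and sample words are hypothetical, and the padded version is not what the script does; it is only there for comparison. It is written for Python 3.

```python
from itertools import zip_longest


def combine_truncating(first_words, second_words, separator=":"):
    """Pair words by position, stopping at the shorter list."""
    return [a + separator + b for a, b in zip(first_words, second_words)]


def combine_padded(first_words, second_words, separator=":"):
    """Pair words by position, padding the shorter list with empty strings."""
    pairs = zip_longest(first_words, second_words, fillvalue="")
    return [a + separator + b for a, b in pairs]


if __name__ == "__main__":
    # Made-up word lists for illustration only.
    male = ["wine", "moon"]
    female = ["spring", "hills", "tea"]
    print(combine_truncating(male, female))
    print(combine_padded(male, female))
```

Truncating keeps every line of the poem as a balanced pair, which is probably why it suits a poem built from two voices.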


>>>>>>>A detour I made before

At first I listed the top 10 words with a plain word count, without running the text through textblob. But the result was full of words like “the, a, of …”, so I was not happy with it.

The following is the plain word count script I wrote in Python.

 


import sys, string
from collections import Counter

# Read the whole poem from stdin and normalize it to lowercase.
text = sys.stdin.read().lower()
# Remove all punctuation so that "wine," and "wine" count as the same word.
text = "".join(ch for ch in text if ch not in string.punctuation)
# split() with no arguments splits on any whitespace, including newlines.
word_list = text.split()

# Print the 10 most frequent words, one per line.
counter = Counter(word_list)
for word, count in counter.most_common(10):
    print(word)
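
Another way to deal with the “the, a, of …” problem, without part-of-speech tagging, would be to drop a list of stop words before counting. The sketch below assumes a tiny hand-picked stop-word set and a made-up sample sentence; neither comes from the poems, and the helper name top_words is hypothetical. It is written for Python 3.

```python
import string
from collections import Counter

# A tiny, hand-picked stop-word set for illustration; not a standard list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def top_words(text, limit=10, stop_words=STOP_WORDS):
    """Count words after lowercasing, stripping punctuation, and dropping stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in cleaned.split() if w not in stop_words]
    return [word for word, _count in Counter(words).most_common(limit)]


if __name__ == "__main__":
    sample = "The wine of the river, the wine of the moon."  # made-up text
    print(top_words(sample))
```

Unlike the textblob version, this keeps verbs and adjectives, so the result depends entirely on which words are put in the stop-word set.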
