Extracting information from text: Using Python NLTK library and TF-IDF algorithm

Update: NLTK has its own TF-IDF implementation (see nltk.text.TextCollection); please forgive my ignorance.



I’ve been working on extracting information from a large body of text, mainly the abstracts of thousands of journal articles. TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Wikipedia has a page introducing this algorithm in detail, so I will not go over it again here.
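
As a quick illustration of the statistic, here is a minimal sketch on a made-up, pre-tokenized corpus. It assumes the common variant where TF is the raw count of a term in a document and IDF is the natural log of (number of documents / number of documents containing the term). The names docs, tf, idf and tf_idf are hypothetical and only serve this example.

```python
from math import log

# Three tiny "documents", already tokenized (made-up data).
docs = [
    ["gene", "expression", "in", "yeast"],
    ["gene", "therapy", "in", "mice"],
    ["protein", "folding", "in", "yeast"],
]

def tf(term, doc):
    """Raw count of term in one document."""
    return doc.count(term)

def idf(term, docs):
    """log(N / number of documents containing term); 0 if no document does."""
    n_containing = sum(1 for doc in docs if term in doc)
    if n_containing == 0:
        return 0.0
    return log(float(len(docs)) / n_containing)

def tf_idf(term, doc, docs):
    """TF-IDF score of term within doc, relative to the whole collection."""
    return tf(term, doc) * idf(term, docs)

# Score every distinct term of the first document.
scores = {term: tf_idf(term, docs[0], docs) for term in set(docs[0])}
```

Here 'in' scores 0 because it occurs in every document, while 'expression', which occurs in only one document, gets the highest score of the four terms.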

I was expecting to find a TF-IDF function in Python’s NLTK library; however, I did not find one (I am an NLTK beginner, so this may be due to my ignorance). So I wrote the function myself. Several points are worth noting:

  1. Encoding is tricky, but codecs is a great helper.
  2. The computation takes a lot of time. Computing IDF requires traversing all the articles for each term. Maybe my algorithm is inefficient.
  3. There are ghost terms, which exist in the whole corpus but cannot be found when searching within each article individually. I am using the variable ghost_terms to debug this.
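
On points 2 and 3, one possible way to avoid re-scanning every article for every term is to tokenize each article once and count, per term, how many articles contain it. The following is only a sketch under assumptions, not the script below: document_frequencies and idf_table are hypothetical helper names, the abstracts are made up, and str.split stands in for nltk's word_tokenize so the example runs without NLTK data. It also shows one plausible source of ghost terms: lowercasing the whole corpus but not the individual articles.

```python
from collections import Counter
from math import log

def document_frequencies(docs_tokens):
    """Map each term to the number of documents that contain it."""
    df = Counter()
    for tokens in docs_tokens:
        # set() so a term counts once per document, however often it repeats.
        df.update(set(tokens))
    return df

def idf_table(docs_tokens):
    """Compute log(N / df) for every term in a single pass over the corpus."""
    n_docs = len(docs_tokens)
    df = document_frequencies(docs_tokens)
    return {term: log(float(n_docs) / count) for term, count in df.items()}

# Made-up abstracts; str.split stands in for nltk.word_tokenize here.
abstracts = ["DNA repair in yeast", "Repair of damaged DNA", "Yeast cell growth"]

# Vocabulary of the whole corpus, lowercased.
corpus_vocabulary = set(" ".join(abstracts).lower().split())

# Consistent lowercasing: every corpus term is found in some article.
idf_by_term = idf_table([a.lower().split() for a in abstracts])
ghost_terms = sorted(corpus_vocabulary - set(idf_by_term))

# Articles left in mixed case: 'dna' only ever appears as 'DNA',
# so it exists in the lowercased corpus but in no individual article.
idf_mixed_case = idf_table([a.split() for a in abstracts])
ghosts_mixed_case = sorted(corpus_vocabulary - set(idf_mixed_case))
```

With this layout the IDF of a term is a dictionary lookup, so the cost is one pass over the articles instead of one pass per term. Whether the case mismatch explains all ghost terms in a real corpus is not certain; tokenizer differences between the joined corpus and single articles could contribute as well.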
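
import codecs
from math import log

from nltk import word_tokenize, FreqDist

# codecs.open decodes the UTF-8 file to unicode text while reading.
raw_file = codecs.open('V6.4all-utf8', 'r', 'utf8')

# Collect the abstract lines, which are tagged with the 'AB' prefix.
ab_corpus_list = []
for x in raw_file:
    if x.startswith('AB'):
        # Assumed: each matching line is kept as one article's abstract.
        ab_corpus_list.append(x)
raw_file.close()

print('ab_corpus_list complete')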

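# Join all abstracts into one string to count terms over the whole corpus.
ab_corpus_string = ''.join(ab_corpus_list)

print('ab_corpus_string complete')

# Lowercase before tokenizing so that 'The' and 'the' count as one term.
ab_corpus_token = word_tokenize(ab_corpus_string.lower())

print('ab_corpus_token complete')

# Corpus-wide frequency of every term.
word_freq_total = FreqDist(ab_corpus_token)

print('word_freq_total complete')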

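# Terms that appear in the corpus vocabulary but in no individual article.
ghost_terms = []

def idf(tm):
    """Inverse document frequency of the term tm over ab_corpus_list."""
    # Count the articles containing the term. Each article is lowercased
    # to match the corpus-wide tokenization above; without this, terms
    # that only occur capitalized are never found.
    num_doc_tm = 0
    for x in ab_corpus_list:
        if tm in word_tokenize(x.lower()):
            num_doc_tm += 1
    if num_doc_tm == 0:
        # Assumed: this is where ghost terms get recorded for debugging.
        ghost_terms.append(tm)
        return 0
    return log(float(len(ab_corpus_list)) / num_doc_tm)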

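tf_idf = []
progress = 0
for x in word_freq_total:
    # Score = corpus-wide term frequency times inverse document frequency.
    tf_idf_score = float(word_freq_total[x]) * float(idf(x))
    # Assumed: each result is stored as a [term, score] pair, the layout
    # the output loop below reads via x[0] and x[1].
    tf_idf_medium = [x, tf_idf_score]
    tf_idf.append(tf_idf_medium)
    progress += 1
    print(str(round(float(progress) / len(word_freq_total) * 100, 2)) + '%')

print(tf_idf[-10:])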

# codecs.open encodes the unicode terms as UTF-8 when writing.
tf_idf_file = codecs.open('result.txt', 'w', 'utf8')
for x in tf_idf:
    tf_idf_file.write(x[0] + u', ' + str(x[1]) + u'\n')
tf_idf_file.close()

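print(ghost_terms)
ghost_terms_file = codecs.open('ghost_terms.txt', 'w', 'utf8')
for x in ghost_terms:
    # Assumed: one ghost term per line, mirroring result.txt.
    ghost_terms_file.write(x + u'\n')
ghost_terms_file.close()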