Extracting information from text: Using Python NLTK library and TF-IDF algorithm

Update: NLTK ships its own TF-IDF implementation (the tf_idf method of nltk.text.TextCollection).



I’ve been working on extracting information from a large amount of text, mainly the abstracts of thousands of journal articles. TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Wikipedia has a page that introduces the algorithm in detail, so I will not go into it further here.
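To make the idea concrete, here is a minimal sketch of TF-IDF in plain Python. It assumes one common variant of the formula (relative term frequency multiplied by log(N / document frequency)); NLTK and other libraries may weight or smooth things differently. The tokenize and tf_idf helpers and the three sample abstracts are made up for illustration and are not the code used for the real corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of letters only.
    # A real pipeline might use an NLTK tokenizer and a stop-word list instead.
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Made-up sample abstracts, for illustration only.
abstracts = [
    "carbon offset projects reduce emissions",
    "carbon markets price emissions",
    "wind projects generate power",
]

scores = tf_idf(abstracts)

# Print the highest-scoring terms of the first abstract.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1])[:3]:
    print(f"{term}: {score:.4f}")
```

Terms that appear in only one abstract (such as "offset") score higher than terms shared across several abstracts (such as "carbon"), and a term that appears in every document scores zero, which is the behaviour that makes TF-IDF useful for picking out distinctive words.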

Understanding the Journal Review Process: How Associate Editors Work

I submitted a manuscript in mid-January, and since then checking its status has become another routine alongside refreshing my Facebook page. The status stayed at “Awaiting Referee Selection” for about two months; today it changed to “Awaiting Referee Invitation.” I am curious (and also frustrated) about the review process, and the following slides satisfy that curiosity perfectly: they show how Associate Editors work.

This is an operating manual for Associate Editors (AEs) using Manuscript Central (MC), a popular manuscript processing system and the one through which I submitted my paper. I have embedded the file in this post; the original link is:


A simple data visualization example: Project Database for Chinese Offset Projects

Data visualization is one of my main focuses this semester. I have tried several visualization tools, such as Python’s matplotlib, VTK, and Plotly, and finally arrived at Tableau. Being online and interactive are two important characteristics, especially when visualizing very large amounts of information (what we may call big data). Tableau is the tool that best meets my demands so far: you don’t have to write code, it is entirely GUI-based yet still retains great flexibility, and it is online and interactive. The following chart is a simple example.

