Data Science – MA, Ji@UT Austin LBJ School of Public Affairs

June 3, 2022February 7, 2024

2023 Spring Course: Computational Social Science Methods (Text and Network Analysis)

The course may also be cross-listed at UT’s soc, info, and govt departments.

The course has demanding prerequisites (https://css.jima.me/prerequisites/), and I hope to recruit highly motivated students to join the self-learning group which helps them meet the requirements. Students who are interested in this course should prepare for the prerequisites over the summer and fall so that they can meet the requirements before the 2023 spring. Course details are below.

Computational Social Science Methods (2023 Spring, graduate course)

Course detail: https://css.jima.me/
Instructor: Dr. Ji Ma (https://jima.me/)
Email: maji@austin.utexas.edu

This course introduces computational social science methods and contextualizes these methods within the social science research design. The first part of this course (w1–w3) gives you an overview of this course, programming fundamentals, and how to use high-performance cloud computing resources (https://www.tacc.utexas.edu/systems/chameleon). The second part (w4–w12) is analysis-oriented and primarily covers text analysis (w4–w8; with an emphasis on multilingual language analysis) and network analysis (w9–w12). The last few weeks focus on research design with computational methods and the final project. Bilingual or multilingual language ability is a plus. Programming is an essential part of this course but not the purpose and will not be taught. We will be coding for social good.

The course has demanding prerequisites (https://css.jima.me/prerequisites); therefore, students may need to work on the prerequisites in 2022 summer and fall if they are highly motivated. All registrations need to be approved by the instructor in late 2022 fall. Students who are interested in this course can join the learning group (https://uta-css.slack.com/) where more learning resources will be shared.

February 20, 2020February 20, 2020

Has Hong Kong Anti-Extradition Movement Ended?

Needless to introduce the background of the 2019 Hong Kong anti-extradition movement which started in March 2019. It has been a year since its inception. Although people are still talking about the protests, the movement gradually steps out the mass media’s front page. An important question to ask is, has this movement ended?

February 19, 2020April 10, 2021

Build your own computing cluster on ChameleonCloud

Social scientists also run heavy computational jobs. In one of my projects, I need to analyze the psychological state of a few billion Telegram messages. ChameleonCloud provides hosts with up to 64 cores (or “threads”, sometimes “workers”, yes these terms are confusing but CS folks to blame). But even with parallel computing on the best server, the job will run for years, and I need this project for tenure.

January 25, 2020February 19, 2020

Operating large files on ChameleonCloud

I primarily use Chameleon Cloud (CC) for my research projects. It provides great flexibility because I can run bare-metal servers (e.g., 44 threads/cores, 128G+ RAM) for a seven-day lease which is also renewable if the hosts I’m using are not booked by others. Its supporting team is also amazing.

But everything becomes slow if you are working with a really big dataset. For example, I’m working on a Telegram project and have 1TB+ data. This really gets me a headache. Well, the CC machines are able to handle this but need extra configurations.

February 10, 2016September 29, 2019

Parallel computing using IPython: Important notes for naive scholars without CS background

Analysis of network and complex system requires too much computing resources. Although the learning curve is deep, the power of parallel computing must be utilized, otherwise, more time will be spent on waiting. Moreover, for exploratory academic research, we will not know what’s the next step until we finish the current analysis. So the research life-cycle becomes hypothesis -> operationalization -> LONG TIME coding and debugging -> LONG TIME waiting for result -> new hypothesis.

With IPython Notebook, parallel computing can be easily operated; however, like what I’ve said: We cannot understand the easiest programming skills unless we are able to operate them. I’ll not come to this post if I do not have to wait for a week only for one result. Playing parallel computing with IPython is easy, but for real jobs, it’s not. Scholars in social science area may be less skilled in programming – we are not trained to be. I’ve made great efforts and finally got some progress which may be laughed by CS guys.

While using IPython Notebook (now named Jupyter Notebook) for parallel computing, Jupyter will start several remote engines beside the local one we are using. These remote engines are blank which means that the variables and functions defined and modules imported on the local engine do not work on the remote ones. Specifically, the puzzle for me was (yes, was!): How to operate the variables, functions, and modules on the remote engines.

Continue reading “Parallel computing using IPython: Important notes for naive scholars without CS background”

September 20, 2015July 29, 2016

Parallel Computing using IPython (1)

We cannot understand the easiest programming skills unless we are able to manipulate them. Parallel computing first seemed to be hard, but once we could write even several lines of codes, it became so easy.

Continue reading “Parallel Computing using IPython (1)”

April 12, 2015September 3, 2017

Extracting information from text: Using Python NLTK library and TF-IDF algorithm

Update: NLTK has its own algorithm for TF-IDF, please forgive my ignorance.

http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf_idf

=====================

I’ve been working on extracting information from a large amount of text which mainly consists of thousands of journal article’s abstract. TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Wikipedia has a page introducing this algorithm in detail, so I will not discuss more about this.

Continue reading “Extracting information from text: Using Python NLTK library and TF-IDF algorithm”

March 20, 2015

Information Visualization with Geographical Attributes

Information is abstract – may be a reason that why we tend to visualize. If it has some geographical attributes, visualization will be much more concrete and understandable. Below is an example using the same data set (Mainland China offset projects).

[cjtoolbox name=’china offset project app’]

March 13, 2015

A simple data visualization example: Project Database for Chinese Offset Projects

Data visualization is one of my main focuses this semester. I have been trying several different visualization tools, such as python’s matplotlib, VTK, Plotly, and finally I came to Tableau. Online and interaction are two important characters, especially for the visualization of information in huge size (which we may call big data). Tableau is the best one that can perfectly meet my demands (so far): you don’t have to coding, totally GUI interface but still retains great flexibility, online and interactive. The following chart is a simple example.

[cjtoolbox name=’Project Database for Chinese Offset Projects’]