I’m working on crawling data from some websites for my research, and the most challenging issue is the verification image (CAPTCHA) – the barrier websites set up to prevent automated crawling. I’ve tried different approaches, but all failed: the success rate is too low to be usable. It looks like such verification mechanisms are not as vulnerable as people often assume. Still, it is worth writing down the lessons, for my own reference and for other folks who may want to give it a try. Promising ways to avoid triggering verification are IP pools and delayed requests (be courteous to servers!). Continue reading “web crawling and OCR of verification image”
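The two mitigations mentioned above, rotating requests through an IP pool and delaying them, can be sketched with the standard library. This is a minimal sketch, not the setup I actually used; the proxy addresses are placeholders, not real servers:

```python
import itertools
import time
import urllib.request

# Placeholder proxy pool -- substitute your own proxy addresses.
PROXY_POOL = itertools.cycle([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])

def fetch(url, delay=2.0):
    """Fetch a page through the next proxy in the pool,
    sleeping first so requests are spaced out."""
    time.sleep(delay)  # courtesy to the server
    proxy = next(PROXY_POOL)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}))
    return opener.open(url, timeout=10).read()
```

`itertools.cycle` hands out proxies round-robin, so no single IP hits the server in rapid succession.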
Ji Ma, Sara Konrath
This empirical study examines knowledge production between 1986 and 2015 in nonprofit and philanthropic studies using science mapping and network analysis. Results suggest that scholars in this field have been actively generating a considerable amount of literature and a solid intellectual base for the continuing development of this field as a new discipline. Knowledge production in this field is also growing in cohesion – several main themes have been formed and actively developed since the mid-1980s. Future advancement of this field faces a critical challenge: the lack of geographic and cultural diversity resulting from the domination of research taking place in the “Anglosphere.” We also emphasize the importance of new paradigms in mitigating the tension between theory and practice – a challenge commonly faced by academic disciplines. Methodological and pedagogical implications, limitations, and future directions are also discussed.
Number of Pages in PDF File: 52
Keywords: nonprofit and philanthropic studies, network analysis, knowledge production, paradigm shift, science mapping
Full text available at SSRN.
I was on a project reviewing the literature on nonprofit management education. The outcomes of this project are an unpublished English manual and an article in a peer-reviewed Chinese journal (The China Nonprofit Review). The following items are the references in the literature pool. They should be helpful if you are developing a course (or a series of courses) on nonprofit management.
Update 12/2018: Another paper which reviews the scholarship on nonprofit studies in the last century was recently published and selected as the “Editor’s Choice Free Article”: A Century of Nonprofit Studies: Scaling the Knowledge of the Field (Ma, J. & Konrath, S. Voluntas (2018) 29: 1139. https://doi.org/10.1007/s11266-018-00057-5)
Ji Ma, Simon DeDeo
In response to failures of central planning, the Chinese government has experimented not only with free-market trade zones, but with allowing non-profit foundations to operate in a decentralized fashion. A network study shows how these foundations have connected together by sharing board members, in a structural parallel to what is seen in corporations in the United States. This board interlock leads to the emergence of an elite group with privileged network positions. While the presence of government officials on non-profit boards is widespread, state officials are much less common in a subgroup of foundations that control just over half of all revenue in the network. This subgroup, associated with business elites, not only enjoys higher levels of within-elite links, but even preferentially excludes government officials from the nodes with higher degree. The emergence of this structurally autonomous sphere is associated with major political and social events in the state-society relationship.
For full text, refer to http://arxiv.org/abs/1606.08103
Analyzing networks and complex systems requires substantial computing resources. Although the learning curve is steep, the power of parallel computing must be harnessed; otherwise, most of our time is spent waiting. Moreover, in exploratory academic research we do not know the next step until we finish the current analysis, so the research life-cycle becomes: hypothesis -> operationalization -> LONG TIME coding and debugging -> LONG TIME waiting for results -> new hypothesis.
With IPython Notebook, parallel computing can be operated easily; however, as I’ve said: we cannot understand even the easiest programming skills unless we are able to operate them. I would not have written this post if I did not have to wait a week for a single result. Playing with parallel computing in IPython is easy; doing real jobs with it is not. Scholars in the social sciences may be less skilled in programming – we are not trained to be. I’ve made great efforts and finally achieved some progress that may amuse the CS folks.
While using IPython Notebook (now named Jupyter Notebook) for parallel computing, Jupyter starts several remote engines besides the local one we are working in. These remote engines are blank: the variables and functions defined, and the modules imported, on the local engine do not exist on the remote ones. Specifically, the puzzle for me was (yes, was!): how to operate variables, functions, and modules on the remote engines.
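In ipyparallel (the library behind IPython’s remote engines), the usual fix is to ship state to the engines explicitly, e.g. with a `DirectView`’s `push` and `sync_imports`. Demonstrating that requires a running cluster, so here is the same principle illustrated with the standard library’s process pool: worker processes, like remote engines, start blank and receive only what is explicitly sent to them. The function and data names are illustrative, not from the original post.

```python
from concurrent.futures import ProcessPoolExecutor
import re

def word_count(text):
    # A worker process receives only what is explicitly shipped to it:
    # the (pickled) function and its arguments. Nothing defined in the
    # interactive session travels along for free -- the worker is "blank".
    return len(re.findall(r"\w+", text))

def count_all(texts):
    # Map the function over the inputs on a pool of worker processes.
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(word_count, texts))

if __name__ == "__main__":
    print(count_all(["two words", "one two three"]))  # prints [2, 3]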
We cannot understand even the easiest programming skills unless we are able to manipulate them. Parallel computing seemed hard at first, but once we could write even a few lines of code, it became easy.
Update: NLTK has its own implementation of TF-IDF; please forgive my ignorance.
I’ve been working on extracting information from a large amount of text, mainly the abstracts of thousands of journal articles. TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Wikipedia has a page introducing the algorithm in detail, so I will not discuss it further here.
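For the record, the statistic is simple enough to hand-roll. Below is a minimal sketch of the classic formulation, with a toy corpus of tokenized “abstracts” that is purely illustrative (not my actual data):

```python
import math

def tf_idf(term, doc, corpus):
    """Classic TF-IDF: term frequency in a document, weighted by the log
    of (corpus size / number of documents containing the term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # document frequency
    return tf * math.log(len(corpus) / df)   # assumes the term occurs somewhere

# A toy corpus of tokenized abstracts (illustrative only).
corpus = [
    ["nonprofit", "management", "education"],
    ["nonprofit", "network", "analysis"],
    ["network", "science", "mapping"],
]

# "management" occurs in one of three abstracts, "nonprofit" in two,
# so "management" scores as the more distinctive term for the first one.
print(tf_idf("management", corpus[0], corpus))  # (1/3) * log(3)   ≈ 0.366
print(tf_idf("nonprofit", corpus[0], corpus))   # (1/3) * log(3/2) ≈ 0.135
```

In practice a library implementation (NLTK, as noted above, or scikit-learn) also handles smoothing and normalization.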
Information is abstract – perhaps one reason why we tend to visualize it. If the information has geographic attributes, visualization becomes much more concrete and understandable. Below is an example using the same data set (Mainland China offset projects).
[cjtoolbox name=’china offset project app’]
I submitted a manuscript in mid-January; ever since, I have had another routine besides refreshing my Facebook page. The status stayed at “Awaiting Referee Selection” for about two months; today it changed to “Awaiting Referee Invitation.” I am very curious (and also frustrated) about the review process, and the following slide satisfies my curiosity perfectly – it shows how Associate Editors work.
This is an operation manual of Manuscript Central for AEs. Manuscript Central is a popular manuscript-processing system, through which I submitted my paper. I have embedded the file in this post; its original link is:
Data visualization is one of my main focuses this semester. I have tried several different visualization tools, such as Python’s matplotlib, VTK, and Plotly, and finally I came to Tableau. Being online and interactive are two important characteristics, especially for visualizing information at large scale (what we may call big data). Tableau is the tool that best meets my demands so far: you don’t have to code – it is an entirely GUI-based interface that still retains great flexibility, online and interactive. The chart below is a simple example.
[cjtoolbox name=’Project Database for Chinese Offset Projects’]