Working with large files on Chameleon Cloud

I primarily use Chameleon Cloud (CC) for my research projects. It provides great flexibility because I can run bare-metal servers (e.g., 44 threads/cores, 128G+ RAM) on a seven-day lease, which is renewable if the hosts I’m using are not booked by others. Its support team is also amazing.

But everything becomes slow when you work with a really big dataset. For example, I’m working on a Telegram project and have 1TB+ of data, which gives me a real headache. The CC machines can handle this, but they need extra configuration.

Use Object Store (OS)

The OS can store up to 8TB of data, and the guide is clear. The advantage of using the OS is that I don’t have to upload the 1TB dataset every time I start a new server; I can mount the OS on the new server directly as a disk. The catch is that the dataset needs to be split into smaller files (i.e., < 4G) for uploading. My plan was to merge the segments back into one file after uploading the 1TB dataset. But the mounted OS is not a “real disk.” The reason is complicated and beyond my knowledge, but the consequence is clear: I can’t operate on the segments the way I would on a “real disk.” I won’t detail my failed attempts and frustration. It just doesn’t work! It’s like Gandalf standing in front of me: “You shall not pass!”
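The splitting step can be sketched in Python. This is only a minimal sketch, not the tool I actually used: the function name split_file, the ".partNNNNN" naming scheme, and the default segment size of 3 GiB (chosen to stay under the 4G limit however it is counted) are all assumptions.

```python
from pathlib import Path


def split_file(src, out_dir, segment_size=3 * 1024**3, buffer_size=64 * 1024**2):
    """Split src into numbered segments of at most segment_size bytes.

    Hypothetical helper: the names and defaults are illustrative assumptions.
    Data is copied in buffer_size pieces so memory use stays small.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    segments = []
    with src.open("rb") as fin:
        index = 0
        while True:
            chunk = fin.read(min(buffer_size, segment_size))
            if not chunk:
                break  # nothing left to read
            # Zero-padded index so that the names sort in the right order.
            seg_path = out_dir / f"{src.name}.part{index:05d}"
            with seg_path.open("wb") as fout:
                written = 0
                while chunk:
                    fout.write(chunk)
                    written += len(chunk)
                    remaining = segment_size - written
                    if remaining == 0:
                        break  # this segment is full
                    chunk = fin.read(min(buffer_size, remaining))
            segments.append(seg_path)
            index += 1
    return segments
```

Calling split_file("dataset.zip", "segments") would write dataset.zip.part00000, dataset.zip.part00001, and so on into the segments folder (both names are placeholders).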

Mount Additional Disks

But I have to make it work. The strategy is to mount a large real disk on the server, merge the segments, and save the merged file on that real disk.

Mounting a disk sounds like an easy task, but there are glitches all the time. This post is very helpful.
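The merge itself is just a concatenation of the segments in order. A minimal sketch in Python, assuming the segment names sort lexicographically in the right order (e.g., zero-padded suffixes); merge_segments and its parameters are hypothetical names, not part of any Chameleon tooling.

```python
import shutil
from pathlib import Path


def merge_segments(segment_dir, pattern, dest):
    """Concatenate all files in segment_dir matching pattern into dest.

    Assumption: sorting the file names gives the original segment order.
    """
    segments = sorted(Path(segment_dir).glob(pattern))
    if not segments:
        raise FileNotFoundError(f"no segments match {pattern!r} in {segment_dir}")
    with open(dest, "wb") as fout:
        for seg in segments:
            with seg.open("rb") as fin:
                # Stream in 64 MiB pieces; a segment is never loaded whole.
                shutil.copyfileobj(fin, fout, length=64 * 1024**2)
    return segments
```

Here segment_dir would be the mounted OS folder (read only) and dest a path on the real disk, so that all writing happens outside the OS folder.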

  • sudo lvmdiskscan lists all the available devices. Make sure to run it as root; otherwise, the information is limited and not helpful. It gives me output like the listing below. A path such as /dev/nvme0n1 is the path to the device.
  /dev/nvme0n1                                                                                  [      <1.82 TiB]
  /dev/ceph-7416c6b0-b419-4227-b9a9-5ca48d295f90/osd-block-edc9fd01-90b8-4aaf-bcfb-ce263a5f72c6 [    <400.00 GiB]
  /dev/sda1                                                                                     [     558.91 GiB]
  /dev/ceph-bbcb4121-a89c-44f8-a321-ef02e6577f29/osd-block-23d73189-88cd-45b8-a315-a8a00979b82d [    <400.00 GiB]
  /dev/ceph-a905226d-5dea-49dd-ab2d-e9984fcdf9cb/osd-block-6c90347c-2758-46e4-a6f7-d5c5e84cf29c [     363.01 GiB]
  /dev/ceph-47613a0a-c021-40d6-aa63-2b0121ec2c1f/osd-block-effdb3b8-d2aa-47d1-bf41-58ae6253928d [    <158.91 GiB]
  /dev/ceph-6c63c94c-0b3a-4d41-b3e1-216ed9457527/osd-block-5d84ef65-97a7-4d8e-add0-9df1e6a8dde8 [    <200.00 GiB]
  /dev/nvme1n1                                                                                  [      <1.82 TiB]
  /dev/ceph-43dfb719-bda9-46d0-9a7a-4807522300c9/osd-block-a790ea25-2fcc-442d-8610-8cb3906b6915 [    <400.00 GiB]
  /dev/nvme1n1p1                                                                                [     500.00 GiB] LVM physical volume
  /dev/ceph-2fcca38b-ea7b-406c-bd77-f76da9e94194/osd-block-426c2997-8550-4fb6-b132-3471da6d5c14 [    <200.00 GiB]
  /dev/nvme1n1p2                                                                                [     500.00 GiB] LVM physical volume
  /dev/ceph-1b085445-bc2c-44a7-8294-93b87798717a/osd-block-fd344ea5-24ce-4f9f-a392-252897acb1e2 [    <500.00 GiB]
  /dev/nvme1n1p3                                                                                [     500.00 GiB] LVM physical volume
  /dev/ceph-0f2967c8-090a-40c0-8f10-a9baf44ca4ef/osd-block-8478ba73-8099-4ef7-a155-36459b1561dd [    <500.00 GiB]
  /dev/nvme1n1p4                                                                                [    <363.02 GiB] LVM physical volume
  /dev/ceph-04f7289a-53be-4637-9c60-a7049c6f0b90/osd-block-4d6d60ec-6b49-4f77-9e46-27071e27132c [    <500.00 GiB]
  /dev/ceph-813e17e9-dd1d-4a1f-b544-11c161b47ea2/osd-block-43fbe500-b12f-4eb2-80c1-4e576a9f048e [     588.49 GiB]
  /dev/ceph-b7bb80f7-f023-49a7-94aa-6ce06734cae2/osd-block-95705a63-e60d-43e8-8b13-3eebb00288e4 [     558.91 GiB]
  /dev/sdb1                                                                                     [     200.00 GiB] LVM physical volume
  /dev/sdb2                                                                                     [     200.00 GiB] LVM physical volume
  /dev/sdb3                                                                                     [     158.91 GiB] LVM physical volume
  /dev/sdc                                                                                      [     558.91 GiB] LVM physical volume
  /dev/sde1                                                                                     [     400.00 GiB] LVM physical volume
  /dev/sde2                                                                                     [     400.00 GiB] LVM physical volume
  /dev/sde3                                                                                     [     400.00 GiB] LVM physical volume
  /dev/sde4                                                                                     [    <588.50 GiB] LVM physical volume
  /dev/sdf1                                                                                     [      <1.75 TiB]
  6 disks
  10 partitions
  1 LVM physical volume whole disk
  11 LVM physical volumes
  • sudo lvscan, then sudo vgchange -ay. lvscan lists the logical volumes and vgchange -ay activates them. I’m not sure both commands are strictly necessary, so I just run them both.
  • Oh, yes, if you get a mount: wrong fs type, bad option, bad superblock error message, you probably need to create a file system on the disk with, e.g., mkfs.ext4 /dev/sdb1. Note that this erases whatever is already on the device.
  • Mount the device on a folder, for example mount /dev/sdb1 /root/data_store.
  • Remember, use root all the time.
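The steps above can be collected into one small script. This is a minimal sketch under stated assumptions: mount_commands and run_all are hypothetical helpers, the mkdir -p step is my addition (the list above assumes the folder already exists), and everything is meant to be run as root. By default it only prints the commands.

```python
import subprocess


def mount_commands(device, mount_point, make_fs=False):
    """Return the commands from the steps above, in order, as argument lists."""
    commands = [
        ["lvscan"],           # list the logical volumes
        ["vgchange", "-ay"],  # activate the logical volumes
    ]
    if make_fs:
        # WARNING: mkfs.ext4 erases whatever is on the device.
        commands.append(["mkfs.ext4", device])
    # Assumption: create the mount point if it does not exist yet.
    commands.append(["mkdir", "-p", mount_point])
    commands.append(["mount", device, mount_point])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Device and folder are the examples from the list above.
    run_all(mount_commands("/dev/sdb1", "/root/data_store"))
```

Keeping the dry run as the default is deliberate: it is easy to point mkfs.ext4 at the wrong device, so it helps to read the printed commands before running them for real.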

Following the same procedure, mount as many available devices as you need, do the merge outside the OS folder (I’m guessing that working inside the cloudfuse folder would cause additional errors, and I don’t want to waste my time), and save the result to a mounted device. I now have the following devices:

Filesystem      Size  Used Avail Use% Mounted on
udev            252G     0  252G   0% /dev
tmpfs            51G  2.4M   51G   1% /run
/dev/sda1       550G   27G  501G   5% /
tmpfs           252G     0  252G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           252G     0  252G   0% /sys/fs/cgroup
tmpfs            51G     0   51G   0% /run/user/1000
tmpfs            51G     0   51G   0% /run/user/1010
/dev/sdf1       1.8T  1.1T  610G  64% /home/cc/hold
/dev/nvme0n1    1.8T   77M  1.7T   1% /root/tg_upload

/dev/sdf1 holds the entire 1TB zip file, which I then extract to /dev/nvme0n1. You may ask: why not upload the file to the server directly? Because it’s slow: even at an upload speed of 55MB/s or more, a dataset of this size takes more than 5 hours to upload.
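The upload estimate is simple arithmetic. A quick check in Python, assuming the 55MB figure means 55 MB per second and that 1TB means 10^12 bytes (both are my reading of the numbers above); transfer_hours is a hypothetical helper name.

```python
def transfer_hours(size_bytes, speed_bytes_per_s):
    """Hours needed to move size_bytes at a constant speed."""
    return size_bytes / speed_bytes_per_s / 3600


# 1 TB at 55 MB/s (decimal units assumed)
print(round(transfer_hours(1e12, 55e6), 2))  # 5.05
```

Since the dataset is somewhat larger than 1TB, the real upload takes longer than this.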

OK, start a Jupyter Notebook under root. Now let’s get to work.

Lineage–the Yangs

In August 2019, we finished our fieldwork in two rural villages in southeast China. The graph below shows how the self-governance organizations are woven together through local elites (xiangxian). I wrote a non-academic article introducing our work, which was featured in the Nonprofit Academic Centers Council’s monthly newsletter and on IC2’s website. You can read the full article here.



  • Schedule: TBD
  • Location: TBD





  • Talking with someone should be a sincere exchange


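Here is a counterexample. When I talked with an assistant professor from a certain university this year, he kept staring at the ceiling, and his only responses to me were “yeah, yeah, right, I’m not sure.” When I handed him my business card, hoping to read his paper, he took it with one hand and slipped it into his pocket without even a glance. I can guarantee that this kind of behavior would get him cut during a campus visit, because his research is not yet good enough for people to tolerate the flaws in his character. Many schools use academic conferences to evaluate candidates when they are hiring. Even if you are not on the job market this year, academia is a small world, and once you leave a bad impression, there is no telling when it will come back to affect you.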

  • Don’t ask questions that contain no question

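In the Q&A after a presentation, make sure your question actually contains a question, one that lets the presenter share information or views they had no chance to cover in the talk. A common counterexample: the questioner goes on and on about their own research and opinions until the chair has to interrupt and ask, “so what’s your question?” That is very awkward. The purpose of asking is to gain wisdom from the person answering, not to show yourself off; the right way to show yourself is to ask a concise question that gets to the point.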

  • Don’t offer suggestions that are not constructive


  • Sincerely understand and appreciate other people’s research


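I did not do this well at first either. I remember that many years ago, at my first academic conference, there was an emeritus professor on our panel. He not only read my paper carefully from beginning to end but also annotated it in great detail, and after the presentations he handed me the marked-up manuscript. I was stunned and deeply ashamed, because I had not read the other papers on our panel carefully. Another time, in the second year of my Ph.D., I presented at a doctoral student seminar on campus. We had not exchanged papers beforehand, but after my presentation I received two comment sheets from fellow students in my group, which left me both moved and embarrassed. When that paper was later published, I thanked them in the acknowledgments, which was also a way of expressing my apology.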


  • Don’t let your ego outgrow your curiosity



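These five points are only a small part of conference etiquette, and they are mostly common sense. “Etiquette” can never be listed exhaustively, but it comes down to one thing: being a nice scholar is more important than being a smart scholar. And those whose scholarship is truly top-notch are generally people of very good character. The greatest truths are the simplest. In my view, learning to be a good person is more important than learning to do good scholarship, and I am still working to understand this and improve myself. There is no technique for “learning to be a good person”; at its root it is our understanding and experience of life and living, which may be especially important for scholars in the humanities and social sciences.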


[Preprint] Funding nonprofits in a networked society: Two modes of crowding mechanism of government support

This paper studies the impact of social relations on the crowding process of government funding–the effect that government funding to nonprofits may crowd out or crowd in private donations. By using a novel panel dataset across 12 years from the People’s Republic of China, this study suggests that, although government funding to a nonprofit may crowd out the private donations to the same organization, private donations are not reduced but redistributed to other nonprofits in the organizational network. Policy and practical implications are discussed.

Keywords: crowd out, crowd in, social relation, government funding, nonprofit organization, networked society

Full-text: https://ssrn.com/abstract=3262798

[Voluntas] A Century of Nonprofit Studies: Scaling the Knowledge of the Field

I started working on this project in early 2015, and the first paper was finally accepted in Voluntas today, which is my birthday on the civil calendar. Our family tradition is to use the Chinese lunar calendar, but it is still a nice gift.

Sara and I started to work on the first draft at Mo’Joe Coffeehouse, which was permanently closed in June this year. Another coffeehouse, Thirsty Scholar, was also closed around the same time. Lots of memories with friends in both places.

There are at least three versions of this paper. The first draft relied almost entirely on a citation analysis software package named CiteSpace. It was a very simple paper, but it helped me get familiar with the relevant concepts and methodology and clean part of the dataset used in the final analysis. In the second draft, I started to write Python scripts for processing and analyzing data. In early 2017, while I was waiting for my wife, parents, and parents-in-law at Kuala Lumpur airport to start a wonderful journey in Malaysia, I received a rejection from a journal. Then I tried to rewrite the whole paper to analyze the literature published in the last century. I still remember the classroom in which I crawled the first few hundred records: it was on the first floor of Teaching Building 2 at Beijing Normal University, where I also spent many nights preparing my Ph.D. application. I then had lunch with a good friend who had returned from UPenn about a year earlier. She said she felt her heart was at peace, and she was sure about the direction of her career. That was a day in March, and it was snowing heavily in Beijing.

In late June of this year, I submitted the third draft to Voluntas in the office at IQSS, where Prof. Peter Bol treated me so well. We got “minor revision” in early August, and I had a phone call with Sara on the third day after moving to Austin. Life was pretty hectic.

A paper has two meanings for me: the words and numbers for reviewers and readers, and the memories for myself. All things grow, and I wait and watch (万物并作,吾以观复).


This empirical study examines knowledge production between 1925 and 2015 in nonprofit and philanthropic studies from quantitative and thematic perspectives. Quantitative results suggest that scholars in this field have been actively generating a considerable amount of literature and a solid intellectual base for developing this field towards a new discipline. Thematic analyses suggest that knowledge production in this field is also growing in cohesion – several main themes have been formed and actively advanced since the 1980s, and the study of volunteering can be identified as a unique core theme of this field. The lack of geographic and cultural diversity is a critical challenge for advancing nonprofit studies. New paradigms are needed for developing this research field and mitigating the tension between academia and practice. Methodological and pedagogical implications, limitations, and future studies are discussed.

Keywords: nonprofit and philanthropic studies; network analysis; knowledge production; paradigm shift; science mapping

Fulltext: https://papers.ssrn.com/abstract=2834121

Datasets in “state power and elite autonomy in a networked civil society”

The paper State power and elite autonomy in a networked civil society: The board interlocking of Chinese non-profits is published in Social Networks (Open Access, so you can get the paper free of charge because we’ve paid for the knowledge we produced). Here are the hand-coded datasets used in the paper. You are welcome to use them as long as you give appropriate attribution.

All the datasets used in this paper are open to use, review, or replicate. Feel free to send me an email if you need more information.

The research infrastructure of Chinese foundations, a database for Chinese civil society studies @Scientific Data

Ma, J., Wang, Q., Dong, C., & Li, H. (2017). The research infrastructure of Chinese foundations, a database for Chinese civil society studies. Scientific Data, 4, sdata201794. https://doi.org/10.1038/sdata.2017.94

web crawling and OCR of verification image

I’m crawling data from some websites for my research, and the most challenging issue is the verification image, the barrier websites set up to prevent programmed crawling. I’ve tried different approaches, but all failed: the success rate is too low to be usable. It looks like such verification mechanisms are not as vulnerable as people often assume. Still, it is worth writing down my lessons, for my own reference and for other folks who may want to give it a try. Promising ways to avoid verification may be IP pools and delayed requests (courtesy to servers!).
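Delayed requests are easy to sketch. This is a minimal illustration, not the crawler I used: fetch_politely, the caller-supplied fetch function, and the two-second default delay are all assumptions, and the right delay depends on the site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch is any caller-supplied function that downloads one URL.
    delay_seconds is an arbitrary default, not a recommended value.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results
```

Any download function can be passed in as fetch, for example a small wrapper around urllib.request.urlopen from the standard library.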

[Preprint] Thirty Years of Nonprofit Research: Scaling the Knowledge of the Field 1986 – 2015

Ji Ma, Sara Konrath

This empirical study examines knowledge production between 1986 and 2015 in nonprofit and philanthropic studies using science mapping and network analysis. Results suggest that scholars in this field have been actively generating a considerable amount of literature and a solid intellectual base for the continuing development of this field as a new discipline. Knowledge production in this field is also growing in cohesion – several main themes have been formed and actively developed since the mid-1980s. Future advancement of this field faces a critical challenge: the lack of geographic and cultural diversity resulting from the domination of research taking place in the “Anglosphere.” We also emphasize the importance of new paradigms in mitigating the tension between theory and practice – a challenge commonly faced by academic disciplines. Methodological and pedagogical implications, limitations, and future directions are also discussed.

Number of Pages in PDF File: 52

Keywords: nonprofit and philanthropic studies, network analysis, knowledge production, paradigm shift, science mapping

Full text available at SSRN.