Web crawling and OCR of verification images

I’m crawling data from some websites for my research, and the most challenging issue is the verification image (CAPTCHA), the barrier websites set up to prevent programmed crawling. I’ve tried different approaches, but all failed: the success rate is too low to be usable. It looks like this verification mechanism is not as vulnerable as people often assume. Still, it is worth writing down the lessons, for my own reference and for other folks who may want to give it a try. More promising ways to avoid triggering verification may be IP pools and delayed requests (a courtesy to the servers!).
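As a rough illustration of the delayed-requests idea, here is a minimal sketch that pauses for a randomized interval between consecutive requests. It assumes Python 3 and only the standard library; the names `polite_delay` and `fetch_all` and the default delay values are my own placeholders, not something taken from a particular crawler.

```python
import random
import time
import urllib.request


def polite_delay(base_seconds=5.0, jitter_seconds=2.0):
    """Return a wait time: base_seconds plus a random extra of up to jitter_seconds."""
    return base_seconds + random.uniform(0.0, jitter_seconds)


def default_fetch(url):
    """Download one page with the standard library (assumed fetcher)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_all(urls, base_seconds=5.0, jitter_seconds=2.0,
              fetch=default_fetch, sleep=time.sleep):
    """Fetch each URL in order, sleeping between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(polite_delay(base_seconds, jitter_seconds))
        pages.append(fetch(url))
    return pages
```

The `fetch` and `sleep` parameters are there only so the pacing logic can be tried without touching a real server; suitable delay values depend on the site and are not something I measured.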

General idea: process the raw image, filter out the noisy background (PIL and cv2), then recognize the text with an OCR library (pyocr, which depends on tesseract).
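The general idea can be sketched roughly as below. This is only an assumed minimal version using Pillow and pyocr (no cv2): the function names, the median filter, and the threshold of 140 are illustrative choices of mine, and real verification images will likely need different cleaning steps.

```python
def threshold_pixel(value, threshold=140):
    """Map a grayscale value (0-255) to pure black (0) or pure white (255)."""
    return 255 if value > threshold else 0


def clean_image(path, threshold=140):
    """Grayscale, denoise and binarize the image at `path` with Pillow."""
    # Imported here so the helper above can be used without Pillow installed.
    from PIL import Image, ImageFilter

    img = Image.open(path).convert("L")                  # convert to grayscale
    img = img.filter(ImageFilter.MedianFilter(size=3))   # smooth speckle noise
    return img.point(lambda v: threshold_pixel(v, threshold))  # binarize


def read_text(path, lang="eng", threshold=140):
    """Clean the image, then run the first OCR tool that pyocr can find."""
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is tesseract installed?")
    return tools[0].image_to_string(
        clean_image(path, threshold),
        lang=lang,
        builder=pyocr.builders.TextBuilder(),
    )
```

With tesseract installed, something like `print(read_text("captcha.png"))` would run the whole chain, where `captcha.png` is a hypothetical file name. In my experience the thresholding step is where most of the tuning goes.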

1. Python Imaging Library (PIL) or Pillow. Pillow is an actively developed fork of PIL, so Pillow is recommended. Pillow is still imported under the namespace “PIL,” so the commands are nearly the same. This is the foundational image processing package on which all the following work is based.

2. The pillowfight package contains various image processing algorithms, but unfortunately it does not support Python 2.x yet (as of 11/4/2016). I’m a 2.x user, so I don’t know whether this package would significantly improve the results (I hope so, but I’m not very optimistic …).

3. I tried the cv2 package and the built-in modules of PIL for cleaning the images. After filtering with them, OCR still recognizes only part of the text/digits, and the success rate is too low to be usable.

4. OCR packages: pyocr depends on tesseract, which must be installed first (with apt-get install on Linux).

5. A good article on the legal and ethical issues of web crawling, by Prof. Filippo Menczer (he is very famous, and from our university!).


