Online Captcha Solving When Using R for Web Scraping

R
Python
OCR
Web Scraping
use OCR
Author
Affiliation

Xinzhuo Huang

HKUST SOSC

Published

October 28, 2023

Modified

November 24, 2023

How can we write a robust web scraper in R that automatically gets past captcha checks? The basic approach is as follows:

First, maintain a logged-in state with httr. Next, download the captcha image to disk. Then use reticulate to import the Python module ddddocr and run OCR on the image to recognize the captcha characters. Finally, submit the recognized text to pass the verification.

Scraping can then continue uninterrupted. Exception handling matters a great deal in web scraping, and the functional programming tools provided by purrr make it much easier.
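To make the point about exception handling concrete, here is a minimal sketch of `purrr::possibly()`. The function `parse_page` is a hypothetical stand-in for a scraping step and is not part of the scraper in this post.

```r
library(purrr)

# Hypothetical stand-in for a scraping step that fails on some inputs.
parse_page <- function(page_num) {
    if (page_num %% 2 == 0) stop("request failed for page ", page_num)
    paste("page", page_num)
}

# possibly() returns `otherwise` instead of raising an error,
# so one bad page does not abort the whole map call.
safe_parse <- possibly(parse_page, otherwise = NA_character_)

pages <- map_chr(1:4, safe_parse)
```

Here the even-numbered pages come back as `NA` while the others are processed normally, which is the behavior one usually wants in a long scraping run.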

Package Management
require(pacman)
p_load(reticulate, httr, tidyverse, rvest, reactable, tippy, htmltools)
# Reuse one handle so that cookies (and thus the session) persist across requests.
handle <- handle("sample_website")

GET(
    "sample_website", 
    handle = handle, 
    write_disk("blog/captcha/captcha_new.jpg", overwrite = TRUE)
)
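The download step writes whatever the server returns, which may be an error page rather than an image. As a hedged sketch, assuming the captcha is served as a JPEG (suggested only by the `.jpg` file name), the hypothetical helper `is_jpeg()` below checks the file signature before the file is passed to OCR.

```r
# Hypothetical helper: TRUE if the file starts with the JPEG signature FF D8 FF.
is_jpeg <- function(path) {
    if (!file.exists(path)) return(FALSE)
    header <- readBin(path, what = "raw", n = 3)
    identical(header, as.raw(c(0xFF, 0xD8, 0xFF)))
}

# Demonstration on temporary files rather than a real download.
good <- tempfile(fileext = ".jpg")
writeBin(as.raw(c(0xFF, 0xD8, 0xFF, 0xE0)), good)

bad <- tempfile(fileext = ".jpg")
writeLines("<html>error page</html>", bad)

is_jpeg(good)
is_jpeg(bad)
```

In a scraper, a `FALSE` result could be treated as a reason to download the captcha again instead of running OCR on invalid data.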

OCR recognition:

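# Point reticulate at the Python environment that has ddddocr installed.
reticulate::use_python("C:\\Users\\xhuangcb\\anaconda3\\envs\\pytorch_gpu\\python.exe")

ddddocr <- reticulate::import("ddddocr")

# beta = TRUE switches ddddocr to its second OCR model.
ocr <- ddddocr$DdddOcr(beta = TRUE)

# Read the image as bytes via Python's built-in open(), then close the file.
builtins <- import_builtins()

f <- builtins$open("blog/captcha/captcha_new.jpg", "rb")

image <- f$read()

f$close()

yzm <- ocr$classification(image)

yzm
[1] "nf353"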
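OCR can misread a character or return a string of the wrong length, so it may help to check the result before submitting it. The sketch below is an assumption, not part of the original workflow: the helper name `is_plausible_captcha()` is hypothetical, and the expected format (five letters or digits) is inferred only from the single sample output above.

```r
# Hypothetical check; the expected length (5) and character set are assumptions.
is_plausible_captcha <- function(text, n = 5) {
    is.character(text) &&
        length(text) == 1 &&
        !is.na(text) &&
        grepl(sprintf("^[A-Za-z0-9]{%d}$", n), text)
}

is_plausible_captcha("nf353")
is_plausible_captcha("nf35")
is_plausible_captcha("nf 53")
```

If the check fails, requesting a fresh captcha is likely cheaper than submitting an answer that is known to be malformed.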

We then submit the recognized text. The response has status code 200, indicating that the request succeeded.

response <- POST(
    "sample_website",
    body = list(yzm = yzm),
    handle = handle
)

response$status_code
[1] 200
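A 200 status only says that the HTTP request succeeded; a site may still answer with the captcha form. The scraper later in this post recognizes that form by the page text `varcontextPath=提交` ("提交" is the label of the "Submit" button). As a minimal sketch, the same test can be factored into a helper; the name `is_captcha_page()` is hypothetical.

```r
library(stringr)

# Marker text used by the scraper in this post.
captcha_marker <- "varcontextPath=提交"

# Hypothetical helper: strip whitespace and punctuation, then compare with the marker.
is_captcha_page <- function(page_text) {
    cleaned <- str_remove_all(page_text, "\\s+|\\p{P}")
    identical(cleaned, captcha_marker)
}

is_captcha_page("var contextPath = \"\";\n\t提交")
is_captcha_page("Case list: 12 results")
```

The marker is specific to the target site, so it would need to be adapted for any other page.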

Now let’s integrate the OCR step into the scraper. Wrapping the request function in purrr’s insistently() and possibly() gives a more robust scraper: failed requests are retried automatically with backoff, captchas are solved along the way, and a page that still fails is skipped rather than aborting the whole run.
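Before the full scraper, a small sketch shows how the two wrappers compose. The function `flaky_fetch` is a hypothetical stand-in that fails twice before succeeding, and the very short pauses are chosen only to keep the demonstration fast.

```r
library(purrr)

attempts <- 0

# Hypothetical flaky step: fails twice, then succeeds.
flaky_fetch <- function() {
    attempts <<- attempts + 1
    if (attempts < 3) stop("temporary failure")
    "ok"
}

# Retry up to 5 times with exponential backoff.
retry_fetch <- insistently(
    flaky_fetch,
    rate = rate_backoff(pause_base = 0.01, pause_cap = 0.05, pause_min = 0.01, max_times = 5),
    quiet = TRUE
)

# possibly() turns a final failure (all retries exhausted) into NULL.
safe_fetch <- possibly(retry_fetch, otherwise = NULL)

outcome <- safe_fetch()
```

The order of wrapping matters: `insistently()` sits inside so that each call is retried, and `possibly()` sits outside so that it only catches the error raised once the retries are used up.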

extract_links <- possibly(
    insistently(
        \(page_num = 1, source = 2, sleep = sample(seq(2, 5, 0.05), 1), location = NULL) {
            page <- POST(
                url = "sample_website",
                encode = "form",
                body = list(
                    `ajlb` = "2",
                    `st` = "1",
                    `jbfyId` = "",
                    `sxnflx` = "0",
                    `zscq` = "",
                    `cwslbmc` = "",
                    `prompt` = "",
                    `dsrName` = "",
                    `ajmc` = "",
                    `ay` = "",
                    `ah` = "",
                    `startCprq` = "2013-01-01",
                    `endCprq` = "2023-10-23",
                    `page` = page_num
                ),
                handle = handle
            )

            # Collapse the page text to a compact signature by removing whitespace and punctuation.
            sign <- page %>%
                read_html() %>%
                html_text() %>%
                str_remove_all("\\s+|\\p{P}")

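            # The captcha page reduces to this marker text ("提交" is the "Submit" button label).
            if (sign == "varcontextPath=提交") {
                GET(
                    "sample_website",
                    handle = handle,
                    write_disk("blog/captcha/captcha_new.jpg", overwrite = TRUE)
                )

                # Re-read the freshly downloaded captcha; reusing the earlier
                # `image` bytes would submit the answer to an outdated captcha.
                f <- builtins$open("blog/captcha/captcha_new.jpg", "rb")
                image <- f$read()
                f$close()

                yzm <- ocr$classification(image)

                response <- POST(
                    "sample_website",
                    body = list(yzm = yzm),
                    handle = handle
                )

                if (response$status_code != 200) {
                    stop("OCR failed!")
                }

                # No links were extracted on this attempt, so raise an error
                # to make insistently() request the page again.
                stop("Captcha solved; retrying page.")
            } else {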
                links <- page %>%
                    read_html() %>% 
                    html_elements(xpath = "//li[@class='refushCpws']") %>%
                    html_nodes("a")

                id <- links %>%
                    html_attr("href")

                title <- links %>%
                    html_text(trim = TRUE)

                courts <- page %>%
                    read_html() %>%
                    html_elements(xpath = "//span[@class='sp_right']")

                court <- courts %>%
                    html_elements(xpath = "//span[@class='sp_name']") %>%
                    html_text(trim = TRUE)

                date <- courts %>%
                    html_elements(xpath = "//span[@class='sp_time']") %>%
                    html_text(trim = TRUE)

                result <- tibble(
                    id = id,
                    title = title,
                    court = court,
                    date = date
                )
            }

            Sys.sleep(sleep)

            if (is.null(location)) {
                return(result)
            } else {
                write_rds(result, file = str_c(location, "/", page_num, ".Rds"))
            }
        },
        rate = rate_backoff(
            pause_base = 2,
            pause_cap = 60,
            pause_min = 1,
            max_times = 10,
            jitter = TRUE
        )
    )
)

result <- map(1:10, extract_links, .progress = TRUE) 
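Because `possibly()` returns `NULL` for a page that still fails after all retries, the list returned by `map()` can contain gaps. The sketch below shows one way to find those pages and stack the rest into a single table. The rows are made-up placeholders for illustration, not actual scraping results.

```r
library(purrr)
library(tibble)

# Hypothetical output of map(): page 2 failed, so possibly() returned NULL.
result <- list(
    tibble(id = "/a", title = "Case A", court = "Court X", date = "2023-01-01"),
    NULL,
    tibble(id = "/b", title = "Case B", court = "Court Y", date = "2023-01-02")
)

# Pages that need to be scraped again.
failed_pages <- which(map_lgl(result, is.null))

# Drop the failures and stack the remaining pages into one table.
cases <- list_rbind(compact(result))
```

The indices in `failed_pages` can be passed to `extract_links` again in a second pass.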


The results of running this web scraper are as follows:

Citation

BibTeX citation:
@online{xinzhuo2023,
  author = {Xinzhuo, Huang},
  title = {Online {Captcha} {Solving} {When} {Using} {R} for {Web}
    {Scraping}},
  date = {2023-10-28},
  url = {https://xinzhuo.work/blog/captcha},
  langid = {en}
}
For attribution, please cite this work as:
Xinzhuo, Huang. 2023. “Online Captcha Solving When Using R for Web Scraping.” October 28, 2023. https://xinzhuo.work/blog/captcha.