Online Captcha Solving When Using R for Web Scraping

Python

OCR

Web Scraping

use OCR

Author: Xinzhuo Huang

Affiliation: HKUST SOSC

Published: October 28, 2023

Modified: November 24, 2023

How can we write a robust web scraper in R that gets past captcha checks automatically? The basic approach is as follows:

First, maintain a logged-in session with httr. Next, download the captcha image to disk. Then import the Python module ddddocr through reticulate and use it to recognize the characters in the captcha. Once the captcha has been recognized, submit the text to pass the verification.

This allows scraping to continue uninterrupted. Effective exception handling matters a great deal in web scraping, and the functional programming tools provided by purrr are very helpful here.
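To make the role of purrr concrete, here is a minimal sketch of how `insistently()` and `possibly()` combine. The function `flaky_fetch()` is a hypothetical stand-in for a network request, and the very short `rate_delay()` pause is chosen only so the demo finishes quickly; a real scraper would use longer pauses.

```r
library(purrr)

# Hypothetical stand-in for a request: fails on the first two calls, then succeeds.
attempts <- 0
flaky_fetch <- function(page_num) {
    attempts <<- attempts + 1
    if (attempts < 3) stop("temporary failure")
    paste("page", page_num)
}

# insistently() calls the function again whenever it throws an error,
# up to max_times attempts. possibly() converts a final failure into NULL.
safe_fetch <- possibly(
    insistently(
        flaky_fetch,
        rate = rate_delay(pause = 0.01, max_times = 5),
        quiet = TRUE
    ),
    otherwise = NULL
)

fetched <- safe_fetch(1)   # succeeds on the third attempt

# A function that never succeeds exhausts its retries and yields NULL.
always_fails <- possibly(
    insistently(
        \(x) stop("permanent failure"),
        rate = rate_delay(pause = 0.01, max_times = 2),
        quiet = TRUE
    ),
    otherwise = NULL
)

failed <- always_fails(1)
```

Wrapping in this order means the retries happen first, and only a page that still fails after every attempt is replaced by `NULL`, so a long `map()` run is not aborted by one bad page.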

Package Management

library(pacman)
p_load(reticulate, httr, tidyverse, rvest, reactable, tippy, htmltools)


handle <- handle("sample_website")

GET(
    "sample_website", 
    handle = handle, 
    write_disk("blog/captcha/captcha_new.jpg", overwrite = TRUE)
)

Recognize the captcha with OCR:


reticulate::use_python("C:\\Users\\xhuangcb\\anaconda3\\envs\\pytorch_gpu\\python.exe")

ddddocr <- reticulate::import("ddddocr")

ocr <- ddddocr$DdddOcr(beta = TRUE)

builtins <- import_builtins()

f <- builtins$open("blog/captcha/captcha_new.jpg", "rb")

image <- f$read()

f$close()

yzm <- ocr$classification(image)

[1] "nf353"
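OCR can misread characters, so it may be worth checking the recognized string before submitting it. The helper below is a hypothetical sketch, not part of ddddocr: it assumes captchas on the target site are five alphanumeric characters, which is only inferred from the sample output above and may not hold for other sites.

```r
# Hypothetical helper: TRUE if the OCR output looks like a complete captcha.
# n_chars = 5 is an assumption based on the sample output "nf353".
is_plausible_captcha <- function(text, n_chars = 5) {
    is.character(text) &&
        length(text) == 1 &&
        !is.na(text) &&
        grepl(paste0("^[A-Za-z0-9]{", n_chars, "}$"), text)
}

is_plausible_captcha("nf353")   # TRUE
is_plausible_captcha("nf35")    # FALSE: one character short
is_plausible_captcha("nf 53")   # FALSE: contains a space
```

If the check fails, downloading a fresh captcha and running OCR again is usually cheaper than submitting a guess that the server will reject.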

After the recognized text is submitted, the server responds with status code 200, indicating that the request succeeded.


response <- POST(
    "sample_website",
    body = list(yzm = yzm),
    handle = handle
)

response$status_code

[1] 200

Let’s integrate the OCR step into the scraping process. With the exception handling tools provided by purrr, we can build a more robust scraper that retries automatically and solves captchas along the way.


extract_links <- possibly(
    insistently(
        \(page_num = 1, source = 2, sleep = sample(seq(2, 5, 0.05), 1), location = NULL) {
            page <- POST(
                url = "sample_website",
                encode = "form",
                body = list(
                    `ajlb` = "2",
                    `st` = "1",
                    `jbfyId` = "",
                    `sxnflx` = "0",
                    `zscq` = "",
                    `cwslbmc` = "",
                    `prompt` = "",
                    `dsrName` = "",
                    `ajmc` = "",
                    `ay` = "",
                    `ah` = "",
                    `startCprq` = "2013-01-01",
                    `endCprq` = "2023-10-23",
                    `page` = page_num
                ),
                handle = handle
            )

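            # Strip whitespace and punctuation so the page text can be compared
            # against the signature of the captcha page.
            sign <- page %>%
                read_html() %>%
                html_text() %>%
                str_remove_all("\\s+|\\p{P}")

            # This text (提交 means "Submit") appears when the site serves its captcha page.
            if (sign == "varcontextPath=提交") {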
                GET(
                    "sample_website",
                    handle = handle,
                    write_disk("blog/captcha/captcha_new.jpg", overwrite = TRUE)
                )

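                # Re-read the freshly downloaded captcha; reusing the earlier
                # `image` object would run OCR on a stale image.
                f <- builtins$open("blog/captcha/captcha_new.jpg", "rb")
                image <- f$read()
                f$close()

                yzm <- ocr$classification(image)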

                response <- POST(
                    "sample_website",
                    body = list(yzm = yzm),
                    handle = handle
                )

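                if (response$status_code != 200) {
                    stop("OCR failed!")
                }

                # No links were extracted on this attempt, so raise an error
                # to make insistently() request the page again.
                stop("Captcha submitted; retrying page ", page_num)
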
            } else {
                links <- page %>%
                    read_html() %>% 
                    html_elements(xpath = "//li[@class='refushCpws']") %>%
                    html_elements("a")

                id <- links %>%
                    html_attr("href")

                title <- links %>%
                    html_text(trim = TRUE)

                courts <- page %>%
                    read_html() %>%
                    html_elements(xpath = "//span[@class='sp_right']")

                # Use relative XPath (".//") so the search stays inside `courts`.
                court <- courts %>%
                    html_elements(xpath = ".//span[@class='sp_name']") %>%
                    html_text(trim = TRUE)

                date <- courts %>%
                    html_elements(xpath = ".//span[@class='sp_time']") %>%
                    html_text(trim = TRUE)

                result <- tibble(
                    id = id,
                    title = title,
                    court = court,
                    date = date
                )
            }

            Sys.sleep(sleep)

            if (is.null(location)) {
                return(result)
            } else {
                write_rds(result, file = str_c(location, "/", page_num, ".Rds"))
            }
        },
        rate = rate_backoff(
            pause_base = 2,
            pause_cap = 60,
            pause_min = 1,
            max_times = 10,
            jitter = TRUE
        )
    )
)

result <- map(1:10, extract_links, .progress = TRUE)
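Because `possibly()` returns `NULL` for any page that still fails after all retries, the list returned by `map()` may contain gaps. The sketch below shows one way to combine the successful pages and find the failed ones. The `pages` list and its values are made-up placeholders that only mimic the shape of the scraper's output; they are not real results.

```r
library(purrr)
library(tibble)

# Placeholder stand-in for the list returned by map():
# pages 1 and 3 succeeded, page 2 failed and came back as NULL.
pages <- list(
    tibble(id = "/a", title = "Case A", court = "Court X", date = "2023-01-01"),
    NULL,
    tibble(id = "/b", title = "Case B", court = "Court Y", date = "2023-01-02")
)

# Positions of the pages that failed, so they can be requested again later.
failed_pages <- which(map_lgl(pages, is.null))

# Drop the NULL entries and stack the remaining tibbles into one.
all_links <- list_rbind(compact(pages))
```

Here `failed_pages` holds the indices to pass back to `extract_links` in a second run. `list_rbind()` requires purrr 1.0.0 or later, the same version that introduced the `.progress` argument used above.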

The results of running this web scraper are as follows:

Citation

BibTeX citation:

@online{xinzhuo2023,
  author = {Xinzhuo, Huang},
  title = {Online {Captcha} {Solving} {When} {Using} {R} for {Web}
    {Scraping}},
  date = {2023-10-28},
  url = {https://xinzhuo.work/blog/captcha},
  langid = {en}
}

For attribution, please cite this work as:

Xinzhuo, Huang. 2023. “Online Captcha Solving When Using R for Web Scraping.” October 28, 2023. https://xinzhuo.work/blog/captcha.