Upgrading to v1

Overview

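After three and a half years of rapid development, with many breaking changes and deprecations along the way, Apify SDK v1 is finally here. This release had two main goals: stability, and support for more browsers, including Firefox and Webkit (Safari).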

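The SDK has grown quite popular over the past few years and powers thousands of web scraping and automation projects. To give developers a stable environment to work in, with the release of SDK v1 we commit to making breaking changes only once a year, in a new major version.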

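We added support for more browsers by replacing PuppeteerPool with browser-pool, a new library created specifically for this purpose. It builds on the ideas of PuppeteerPool and extends them to support Playwright. Playwright is a browser automation library similar to Puppeteer. It works with all major browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. You can, of course, still use Puppeteer with the new BrowserPool.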

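A major breaking change in SDK v1 is that neither the puppeteer nor the playwright module is bundled with it. To make installation easier and faster, you now install the library and version you want yourself. This also allows us to add support for more libraries in the future.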

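Thanks to the addition of Playwright support, there is now a PlaywrightCrawler. It is very similar to PuppeteerCrawler, and you can pick whichever you prefer. This also required some interface changes. The launchPuppeteerFunction option of PuppeteerCrawler was removed, and launchPuppeteerOptions was replaced by launchContext. The structure of the handlePageFunction arguments changed as well. See the migration guide for a detailed explanation and migration examples.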

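What is in store for SDK v2? We plan to split the SDK into smaller libraries so that you can install only the features you need. We plan to migrate to TypeScript to make crawler development faster and safer. We will also review the interface of the whole SDK to improve the developer experience. Bug fixes and scraping features will keep landing in 1.X versions as well.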

Migration guide

The v1.0.0 release contains many breaking changes, but updating your code should take only a few minutes. Below you will find short tutorials on how to update and how to use the new features.

Some of the new features are aimed at advanced users, so don't worry if something looks complicated. You don't need to use it.

Installation

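Previous versions of the SDK bundled the puppeteer package, so you did not have to install it separately. SDK v1 also supports playwright, and we don't want to force users to install both. To install SDK v1 with Puppeteer, run the following command: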

npm install apify puppeteer

To install SDK v1 with Playwright, run:

npm install apify playwright

We tried to include the most important functionality in the initial release, but there may still be some Puppeteer-only utilities or options that are not yet supported with Playwright.

Running on the Apify platform

To use Playwright on the Apify platform, you need a Docker image that supports Playwright. We have prepared new images for this purpose, so see the Docker image guide and pick the one that best suits your needs.

Note that your package.json must list puppeteer or playwright as a dependency. If you don't declare it, the library will be removed from your node_modules folder when the actor is built.
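As a quick sanity check before building an actor, a small script could confirm that one of the two libraries is declared. This is only a sketch and not part of the SDK: the helper name `findBrowserLibraries` and the sample manifest are hypothetical, the version ranges are placeholders, and it assumes the library must sit under `dependencies` rather than `devDependencies`.

```javascript
// Hypothetical helper: returns the browser libraries declared in a parsed
// package.json object, looking only at the "dependencies" field.
function findBrowserLibraries(pkg) {
    const deps = (pkg && pkg.dependencies) || {};
    return ['puppeteer', 'playwright'].filter((name) => name in deps);
}

// Sample manifest for illustration; the version ranges are placeholders.
const samplePackageJson = {
    name: 'my-actor',
    dependencies: {
        apify: '^1.0.0',
        playwright: '*',
    },
};

const found = findBrowserLibraries(samplePackageJson);
if (found.length === 0) {
    console.warn('Neither puppeteer nor playwright is listed in dependencies.');
} else {
    console.log(`Browser libraries declared: ${found.join(', ')}`);
}
```

In a real project, the manifest would be loaded with something like `require('./package.json')` instead of the inline sample.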

Handler arguments are now the Crawling Context

Previously, the arguments of user-provided handler functions were passed as separate objects. This made it difficult to track values across function invocations.

const handlePageFunction = async (args1) => {
    args1.hasOwnProperty('proxyInfo') // true
}

const handleFailedRequestFunction = async (args2) => {
    args2.hasOwnProperty('proxyInfo') // false
}

args1 === args2 // false

This happened because a new arguments object was created for each function. SDK v1 uses a single object called the Crawling Context.

const handlePageFunction = async (crawlingContext1) => {
    crawlingContext1.hasOwnProperty('proxyInfo') // true
}

const handleFailedRequestFunction = async (crawlingContext2) => {
    crawlingContext2.hasOwnProperty('proxyInfo') // true
}

// All contexts are the same object.
crawlingContext1 === crawlingContext2 // true
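One practical consequence is that a value attached to the context in one handler is visible in another handler for the same request. The sketch below imitates this with plain objects and no SDK involved. `fakeContext` and the `startedAt` property are made up for illustration, and the handlers are synchronous and called by hand only to keep the example short.

```javascript
// Toy stand-in for the crawling context the SDK would create for a request.
const fakeContext = { id: 'ctx-1', request: { url: 'https://example.com' } };

const handlePageFunction = (crawlingContext) => {
    // Hypothetical custom property attached for later use.
    crawlingContext.startedAt = 1000;
};

const handleFailedRequestFunction = (crawlingContext) => {
    // The same object arrives here, so the property is still available.
    return `Request ${crawlingContext.request.url} started at ${crawlingContext.startedAt}`;
};

handlePageFunction(fakeContext);
const message = handleFailedRequestFunction(fakeContext);
console.log(message);
```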

`Map` of crawling contexts and their IDs

Now that all the objects are the same, we can keep track of all running crawling contexts. This is possible through the new id property of crawlingContext. It is useful when you need access across contexts.

let masterContextId;
const handlePageFunction = async ({ id, page, request, crawler }) => {
    if (request.userData.masterPage) {
        masterContextId = id;
        // Prepare the master page.
    } else {
        const masterContext = crawler.crawlingContexts.get(masterContextId);
        const masterPage = masterContext.page;
        const masterRequest = masterContext.request;
        // Now we can manipulate the master data from another handlePageFunction.
    }
}

`autoscaledPool` was moved under `crawlingContext.crawler`

To make access to key objects easier and to avoid unnecessary duplication, we exposed a crawler property on the handle page arguments.

const handlePageFunction = async ({ request, page, crawler }) => {
    await crawler.requestQueue.addRequest({ url: 'https://example.com' });
    await crawler.autoscaledPool.pause();
}

As a result, some shorthand properties such as puppeteerPool and autoscaledPool are no longer necessary.

const handlePageFunction = async (crawlingContext) => {
    crawlingContext.autoscaledPool // does NOT exist anymore
    crawlingContext.crawler.autoscaledPool // <= this is the correct usage
}

`PuppeteerPool` replaced by `BrowserPool`

BrowserPool extends PuppeteerPool with the ability to manage other browser automation libraries. The API is similar, but not identical.

Accessing the running `BrowserPool`

Only PuppeteerCrawler and PlaywrightCrawler use BrowserPool. You can access it through the crawler object.

const crawler = new Apify.PlaywrightCrawler({
    handlePageFunction: async ({ page, crawler }) => {
        crawler.browserPool // <-----
    }
});

crawler.browserPool // <-----

Pages now have IDs

The page ID is the same as crawlingContext.id, which gives you access to the full crawlingContext in hooks. See the lifecycle hooks section below for details.

const pageId = browserPool.getPageId(page);

Configuration and lifecycle hooks

The most important addition in BrowserPool is lifecycle hooks. In both crawlers, you can access them through browserPoolOptions. A full list of browserPoolOptions can be found in the browser-pool README.

const crawler = new Apify.PuppeteerCrawler({
    browserPoolOptions: {
        retireBrowserAfterPageCount: 10,
        preLaunchHooks: [
            async (pageId, launchContext) => {
                const { request } = crawler.crawlingContexts.get(pageId);
                if (request.userData.useHeadful === true) {
                    launchContext.launchOptions.headless = false;
                }
            }
        ]
    }
})

Introduction of `BrowserController`

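BrowserController is the browser-pool class responsible for browser management. Its purpose is to provide a single API for both Puppeteer and Playwright browsers. It works automatically in the background, but when you want to close a browser properly, you should use the browserController to do it. You can find it in the handle page arguments.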

const handlePageFunction = async ({ page, browserController }) => {
    // Wrong usage. This bypasses BrowserPool and could cause problems.
    await page.browser().close();

    // Correct usage. Allows a graceful shutdown.
    await browserController.close();

    const cookies = [/* some cookie objects */];
    // Wrong usage. Works only in Puppeteer, not in Playwright.
    await page.setCookie(...cookies);

    // Correct usage. Works in both.
    await browserController.setCookies(page, cookies);
}

The BrowserController also holds important information about the browser, such as the context it was launched with. This was difficult to implement before SDK v1.

const handlePageFunction = async ({ browserController }) => {
    // Information about the proxy used by the browser
    browserController.launchContext.proxyInfo

    // Session used by the browser
    browserController.launchContext.session
}

`BrowserPool` methods vs `PuppeteerPool`

Some functions were removed (in line with earlier deprecations), and some changed slightly:

// OLD
await puppeteerPool.recyclePage(page);

// NEW
await page.close();

// OLD
await puppeteerPool.retire(page.browser());

// NEW
browserPool.retireBrowserByPage(page);

// OLD
await puppeteerPool.serveLiveViewSnapshot();

// NEW
// There is no LiveView in BrowserPool

Updated `PuppeteerCrawlerOptions`

To keep PuppeteerCrawler and PlaywrightCrawler consistent, we updated the options.

Removal of `gotoFunction`

The concept of a configurable gotoFunction was not ideal, especially because a modified gotoExtended is used. To let users extend the default behavior more easily, we decided to replace gotoFunction with preNavigationHooks and postNavigationHooks.

The following example shows how gotoFunction adds complexity:

const gotoFunction = async ({ request, page }) => {
    // pre-processing
    await makePageStealthy(page);

    // You have to remember to do this:
    const response = await gotoExtended(page, request, {/* you have to remember the defaults */});

    // post-processing
    await page.evaluate(() => {
        window.foo = 'bar';
    });

    // You must not forget this!
    return response;
}

const crawler = new Apify.PuppeteerCrawler({
    gotoFunction,
    // ...
})

With preNavigationHooks and postNavigationHooks, this becomes much simpler. preNavigationHooks are called with two arguments, crawlingContext and gotoOptions. postNavigationHooks are called with crawlingContext only.

const preNavigationHooks = [
    async ({ page }) => makePageStealthy(page)
];

const postNavigationHooks = [
    async ({ page }) => page.evaluate(() => {
        window.foo = 'bar'
    })
]

const crawler = new Apify.PuppeteerCrawler({
    preNavigationHooks,
    postNavigationHooks,
    // ...
})
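Because the second argument of a pre-navigation hook is the gotoOptions object, a hook can also adjust how the navigation itself is performed. The following is a minimal sketch, assuming the hook mutates gotoOptions in place. `timeout` and `waitUntil: 'networkidle2'` are Puppeteer `page.goto()` options, the `slowPage` flag in userData is hypothetical, and the hook is invoked by hand with stub objects so that no browser is needed.

```javascript
// Pre-navigation hook that relaxes navigation settings for selected requests.
const relaxNavigationForSlowPages = async ({ request }, gotoOptions) => {
    // `slowPage` is a hypothetical flag set by whoever enqueued the request.
    if (request.userData.slowPage) {
        gotoOptions.timeout = 120000; // milliseconds
        gotoOptions.waitUntil = 'networkidle2';
    }
};

const preNavigationHooks = [relaxNavigationForSlowPages];

// Manual invocation with stub arguments, only to show the effect.
const stubContext = { request: { url: 'https://example.com', userData: { slowPage: true } } };
const stubGotoOptions = { timeout: 30000 };
relaxNavigationForSlowPages(stubContext, stubGotoOptions)
    .then(() => console.log(stubGotoOptions));
```

In a crawler, the `preNavigationHooks` array would be passed in the constructor options exactly as in the example above.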

`launchPuppeteerOptions` replaced by `launchContext`

Previously, Apify options and Puppeteer's launchOptions were mixed together, which was confusing.

const launchPuppeteerOptions = {
    useChrome: true, // Apify option
    headless: false, // Puppeteer option
}

Use the new launchContext object instead, which defines launchOptions explicitly. launchPuppeteerOptions was removed.

const crawler = new Apify.PuppeteerCrawler({
    launchContext: {
        useChrome: true, // Apify option
        launchOptions: {
            headless: false // Puppeteer option
        }
    }
})

LaunchContext is also a type in browser-pool, and its structure there is exactly the same. The SDK only adds extra options.
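If a project has many old flat option objects, the conversion could be automated with a small helper. This is a sketch under stated assumptions, not an SDK function: `toLaunchContext` is a hypothetical name, and only `useChrome` is known from this guide to be an SDK-level option, so the key list is a parameter you would extend for your own project.

```javascript
// Keys treated as SDK-level options. Only `useChrome` comes from the guide;
// extend this list to match the options your project actually uses.
const DEFAULT_SDK_KEYS = ['useChrome'];

// Hypothetical helper: converts an old flat launchPuppeteerOptions object
// into the nested launchContext shape.
function toLaunchContext(oldOptions, sdkKeys = DEFAULT_SDK_KEYS) {
    const launchContext = { launchOptions: {} };
    for (const [key, value] of Object.entries(oldOptions)) {
        if (sdkKeys.includes(key)) {
            launchContext[key] = value;
        } else {
            launchContext.launchOptions[key] = value;
        }
    }
    return launchContext;
}

const convertedContext = toLaunchContext({ useChrome: true, headless: false });
console.log(JSON.stringify(convertedContext));
// {"launchOptions":{"headless":false},"useChrome":true}
```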

Removal of `launchPuppeteerFunction`

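browser-pool introduces the concept of lifecycle hooks, which are functions executed when a certain event in the browser lifecycle occurs. Before v1, custom launch logic was written in launchPuppeteerFunction: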

const launchPuppeteerFunction = async (launchPuppeteerOptions) => {
    if (someVariable === 'chrome') {
        launchPuppeteerOptions.useChrome = true;
    }
    return Apify.launchPuppeteer(launchPuppeteerOptions);
}

const crawler = new Apify.PuppeteerCrawler({
    launchPuppeteerFunction,
    // ...
})

Now you can implement the same functionality with a preLaunchHook:

const maybeLaunchChrome = (pageId, launchContext) => {
    if (someVariable === 'chrome') {
        launchContext.useChrome = true;
    }
}

const crawler = new Apify.PuppeteerCrawler({
    browserPoolOptions: {
        preLaunchHooks: [maybeLaunchChrome]
    },
    // ...
})

This approach is better in several ways. It behaves consistently across Puppeteer and Playwright, and it lets you easily compose your browsers from pre-defined behaviors:

const preLaunchHooks = [
    maybeLaunchChrome,
    useHeadfulIfNeeded,
    injectNewFingerprint,
]
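To illustrate how such a list composes, the sketch below runs hooks one after another against a shared launchContext. The runner `runHooks` is a hypothetical stand-in for what browser-pool does with the hooks, not its actual implementation, and the hook bodies and the `forceHeadful` hook are invented examples.

```javascript
const someVariable = 'chrome'; // placeholder for your own configuration

const maybeLaunchChrome = (pageId, launchContext) => {
    if (someVariable === 'chrome') launchContext.useChrome = true;
};

// Invented example hook that always switches to headful mode.
const forceHeadful = async (pageId, launchContext) => {
    launchContext.launchOptions.headless = false;
};

// Hypothetical runner: awaits each hook in order with the same arguments,
// so synchronous and asynchronous hooks can be mixed.
async function runHooks(hooks, ...args) {
    for (const hook of hooks) {
        await hook(...args);
    }
}

const sharedLaunchContext = { launchOptions: {} };
const finished = runHooks([maybeLaunchChrome, forceHeadful], 'page-1', sharedLaunchContext)
    .then(() => sharedLaunchContext);
finished.then((ctx) => console.log(ctx));
```

Each hook stays small and focused on one concern, and the order of the array decides which hook has the final say when two of them touch the same option.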

Thanks to the addition of crawler.crawlingContexts, the hook functions can also access the crawlingContext of the request that triggered the launch.

const preLaunchHooks = [
    async function useHeadfulIfNeeded(pageId, launchContext) {
        const { request } = crawler.crawlingContexts.get(pageId);
        if (request.userData.useHeadful === true) {
            launchContext.launchOptions.headless = false;
        }
    }
]

Launch functions

In addition to Apify.launchPuppeteer(), you can now also use Apify.launchPlaywright().

Updated arguments

We updated the launch options object because it was a frequent source of confusion.

// OLD
await Apify.launchPuppeteer({
    useChrome: true,
    headless: true,
})

// NEW
await Apify.launchPuppeteer({
    useChrome: true,
    launchOptions: {
        headless: true,
    }
})

Custom modules

Apify.launchPuppeteer already supported the puppeteerModule option. With Playwright, the playwright module itself does not launch browsers, so we unified the option name as launcher.

const puppeteer = require('puppeteer');
const playwright = require('playwright');

await Apify.launchPuppeteer();
// Is the same as:
await Apify.launchPuppeteer({
    launcher: puppeteer
})

await Apify.launchPlaywright();
// Is the same as:
await Apify.launchPlaywright({
    launcher: playwright.chromium
})
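Since the launcher is one of the browser types exposed by the playwright module (`chromium`, `firefox`, `webkit`), choosing a browser comes down to picking the right property. The helper below is a hypothetical sketch, not an SDK function, and it uses a stub object in place of the real module so that it runs without playwright installed.

```javascript
// Hypothetical helper: picks a browser launcher from a playwright-like module.
function selectLauncher(playwrightModule, browserName) {
    const supported = ['chromium', 'firefox', 'webkit'];
    if (!supported.includes(browserName)) {
        throw new Error(`Unsupported browser: ${browserName}`);
    }
    return playwrightModule[browserName];
}

// Stub module so the sketch runs without playwright installed.
const fakePlaywright = {
    chromium: { name: 'chromium' },
    firefox: { name: 'firefox' },
    webkit: { name: 'webkit' },
};

const launcher = selectLauncher(fakePlaywright, 'firefox');
console.log(launcher.name); // firefox
```

With the real module, the result could be passed along as `Apify.launchPlaywright({ launcher: selectLauncher(require('playwright'), 'firefox') })`.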
