Browser Automation with Playwright (Apify)
This article is a follow-up to Browser Automation with Playwright (Robocorp), which discusses the downsides of traditional RPA browser extensions, the benefits of using browser automation libraries like Playwright, and how to use Playwright in Robocorp. This article will simply focus on how to use Playwright in Apify, so if you want more background info, read the beginning of the previous article first.
Apify prides itself on being "the most powerful web scraping and automation platform." Whether it's the most powerful automation platform is for you to decide, but Apify makes a very good case for being the most powerful web scraping platform, due in large part to their Node.js web scraping library Crawlee. I am not an experienced web scraping developer by any means (at least not yet), but one particular quote on Crawlee's site stands out to me as a sign of Crawlee's quality:
"We believe websites are best scraped in the language they're written in."
One alternative to Crawlee is Scrapy, a web scraping library for Python. Both libraries are maintained by companies that run cloud platforms where developers can deploy and run their crawlers: Apify maintains Crawlee, and Zyte maintains Scrapy. I would argue that Crawlee is the better web scraping library, and after using both Apify's and Zyte's cloud platforms, I find Apify to be decidedly the better cloud platform as well.
Web scraping is a whole other realm of automation, and if you want to learn more about it, Apify has their own academy that teaches everything about web scraping, from beginner to advanced. It's a very well-made course and well worth your time.
A major component of web scraping is using headless browsers to scrape dynamic content, so naturally Apify supports headless Playwright / Puppeteer browsers for Crawlee to use. These browsers can also be used without Crawlee to support RPA browser interaction. Here's how to create a basic RPA flow as an Apify actor:
- Create a new actor, and use the Basic Node.js actor template
`.actor/Dockerfile`
- replace `FROM apify/actor-node:16` with `FROM apify/actor-node-playwright-chrome:18`. This means we want to use Apify's Docker image that runs Node 18 and only has Playwright Chromium installed.

`.actor/INPUT_SCHEMA.json`
- delete all existing properties, i.e. leave `"properties": {}` (a minimal example follows this list)

`main.js`
- replace all existing code with the snippet below

`package.json`
- replace `"crawlee": "^3.0.0"` with `"playwright": "*"`. We use `*` as the version because we want to match the Playwright version that is already installed in the Docker image.
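For reference, here is roughly what the trimmed-down `.actor/INPUT_SCHEMA.json` ends up looking like. This is a minimal sketch: the `title` string is just a placeholder, and the exact fields in the Basic Node.js template may differ slightly from this.

```json
{
    "title": "Input schema for the actor",
    "type": "object",
    "schemaVersion": 1,
    "properties": {},
    "required": []
}
```

And here is the `main.js` snippet itself: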
```js
// Adapted from one of Playwright's Node.js examples
// https://playwright.dev/docs/library#library-example
import { Actor } from 'apify';
import { chromium } from 'playwright';

await Actor.init();
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/');
console.log(await page.title());
await browser.close();
await Actor.exit();
```
Save the project, build it, and then run it. Once it finishes running, view the run's log. You should see that your actor logged "Example Domain", which is the title of example.com. This example is as basic as it gets, but it demonstrates Apify's ability to use Playwright directly, thus providing the same serverless browser automation capabilities as Robocorp.
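If you want the run to produce more than a log line, the same script can hand its result over to Apify's storages. The sketch below is a variation I've added, not part of the template; it assumes the `Actor.pushData()` helper from the same `apify` package imported above, which writes records into the run's default dataset.

```js
// Variation on the snippet above: store the scraped title instead of only logging it
import { Actor } from 'apify';
import { chromium } from 'playwright';

await Actor.init();

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/');

const title = await page.title();
console.log(title);

// Push the result to the run's default dataset so it can be
// inspected or exported from the Apify console after the run
await Actor.pushData({ url: page.url(), title });

await browser.close();
await Actor.exit();
```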
Apify's web IDE lets you develop, build images, and run tests very quickly, but sometimes you may want to develop locally and then move your local code into your actor. The one caveat with local development is that your environment needs to match Apify's, which means installing the same versions of Node.js and Playwright that Apify uses.
The right versions of Node.js and Playwright can be found in your build logs. Once you find them, install that version of Node.js, and then run `npm install playwright@1.27.1` (replacing 1.27.1 with whatever version appears in your build logs).
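As a concrete example, if your build log happened to report Node 18 and Playwright 1.27.1, the local setup might look something like this. The version numbers are placeholders, and using nvm is just one way to switch Node versions.

```bash
# Hypothetical versions taken from the actor's build log -- substitute your own
nvm install 18
nvm use 18
npm install playwright@1.27.1
```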