Building an AI-Powered Crawler

1. Crawling + AI = Automated Research

Gathering information from websites, analyzing it, and organizing it takes a lot of time. Combining a crawler that collects data with AI that analyzes it lets you automate this entire process.

[Figure: AI crawler overview]

Use cases:

  • Competitor monitoring -- Detect changes in competitor blogs/social media
  • Price comparison -- Aggregate prices across multiple shopping sites
  • Review analysis -- Extract patterns and complaints from customer reviews
  • Job market analysis -- Identify trends from job postings

2. Crawling Tools

2.1 Simple Pages: fetch + cheerio

import * as cheerio from 'cheerio';

async function scrape(url: string) {
  const res = await fetch(url);
  const html = await res.text();
  const $ = cheerio.load(html);

  // Extract text
  const title = $('h1').text();
  const content = $('article').text();
  const links = $('a').map((_, el) => $(el).attr('href')).get();

  return { title, content, links };
}

2.2 Dynamic Pages: Playwright

Pages rendered with JavaScript require a browser:

import { chromium } from 'playwright';

async function scrapeWithBrowser(url: string) {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });
    return await page.textContent('main');
  } finally {
    // Always close the browser, even if navigation or extraction fails
    await browser.close();
  }
}

2.3 Large-Scale Crawling: Using Sitemaps

const sleep = (ms: number) => new Promise(r => setTimeout(r, ms));

async function crawlSitemap(sitemapUrl: string) {
  const res = await fetch(sitemapUrl);
  const xml = await res.text();

  // Extract URLs from sitemap.xml
  const urls = xml.match(/<loc>(.*?)<\/loc>/g)
    ?.map(m => m.replace(/<\/?loc>/g, '')) || [];

  // Crawl each URL (with rate limiting)
  const results = [];
  for (const url of urls) {
    results.push(await scrape(url));
    await sleep(1000); // 1 second interval
  }

  return results;
}

3. AI Analysis

Send the crawled data to AI for analysis:

3.1 Summary Analysis

const prompt = `The following is content crawled from a web page.

Please extract the key information:
1. Topic/category
2. Key points (3-5)
3. Important numbers/data
4. Related keywords

Web page content:
${crawledContent}`;
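To actually send a prompt like this to Claude, you can call the Anthropic Messages API. The sketch below uses plain `fetch` rather than the official SDK; the model id is an assumption, so check the current Anthropic documentation before relying on it.

```typescript
// Builds the summary prompt from crawled content
function buildSummaryPrompt(crawledContent: string): string {
  return [
    'The following is content crawled from a web page.',
    '',
    'Please extract the key information:',
    '1. Topic/category',
    '2. Key points (3-5)',
    '3. Important numbers/data',
    '4. Related keywords',
    '',
    'Web page content:',
    crawledContent,
  ].join('\n');
}

// Sends the prompt to the Anthropic Messages API.
// Requires ANTHROPIC_API_KEY in the environment; the model id is an assumed example.
async function summarize(crawledContent: string): Promise<string> {
  const res = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-api-key': process.env.ANTHROPIC_API_KEY ?? '',
      'anthropic-version': '2023-06-01',
    },
    body: JSON.stringify({
      model: 'claude-sonnet-4-5', // assumed model id -- check current docs
      max_tokens: 1024,
      messages: [{ role: 'user', content: buildSummaryPrompt(crawledContent) }],
    }),
  });
  const data = await res.json();
  return data.content[0].text;
}
```

The same pattern works for the comparison and trend prompts below; only the prompt builder changes.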

3.2 Comparison Analysis

const prompt = `The following is information about the same product collected from multiple websites.

Please create a comparison table:
- Price
- Key specs
- Pros and cons
- Recommended for

Collected data:
${JSON.stringify(products, null, 2)}`;

3.3 Trend Analysis

const prompt = `The following is a list of tech blog posts collected over the past week.

Please analyze the trends:
1. Top 10 most-mentioned technologies/keywords
2. Newly emerging trends
3. Fading trends
4. Weekly summary (3-5 lines)

Collected data:
${articles.map(a => `[${a.date}] ${a.title}: ${a.summary}`).join('\n')}`;

4. Practical Example: Competitor Monitoring

async function monitorCompetitors() {
  const competitors = [
    { name: 'CompanyA', url: 'https://companya.com/blog' },
    { name: 'CompanyB', url: 'https://companyb.com/changelog' },
  ];

  for (const comp of competitors) {
    const current = await scrape(comp.url);
    const previous = loadPrevious(comp.name); // Yesterday's crawl result

    // Compare the extracted text, not the object references
    if (current.content !== previous?.content) {
      const analysis = await analyzeChanges(previous, current);
      await sendSlackNotification(comp.name, analysis);
      saveCurrent(comp.name, current);
    }
  }
}
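The `loadPrevious`/`saveCurrent` helpers above are left to the reader; one cheap way to implement the comparison is to persist a content hash instead of the full page text. A minimal sketch, assuming Node's built-in `crypto` module:

```typescript
import { createHash } from 'node:crypto';

// Hash the extracted text so only a short string needs to be stored
function contentHash(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

// A missing previous hash (first crawl) counts as a change
function hasChanged(previousHash: string | undefined, currentText: string): boolean {
  return previousHash !== contentHash(currentText);
}
```

This keeps storage tiny, at the cost of not being able to show a diff; if you want `analyzeChanges` to describe what changed, store the full text as well.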

5. Important Considerations

  • Check robots.txt -- Only crawl pages where crawling is permitted
  • Rate limiting -- Space out requests to avoid overloading servers
  • Personal data -- Do not collect data containing personal information
  • Terms of service -- Review each site's terms of service
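The robots.txt check can be automated before each crawl. The sketch below handles only the `User-agent: *` group with simple prefix-based `Disallow` rules; real robots.txt semantics are richer (RFC 9309), so use a dedicated parser in production:

```typescript
// Minimal robots.txt check: only the `User-agent: *` group,
// prefix-matched Disallow rules, no wildcards or Allow overrides
function isAllowed(robotsTxt: string, path: string): boolean {
  let inStarGroup = false;
  const disallowed: string[] = [];

  for (const raw of robotsTxt.split('\n')) {
    const [rawKey, ...rest] = raw.trim().split(':');
    const key = rawKey.toLowerCase();
    const value = rest.join(':').trim();

    if (key === 'user-agent') inStarGroup = value === '*';
    else if (inStarGroup && key === 'disallow' && value) disallowed.push(value);
  }

  return !disallowed.some(prefix => path.startsWith(prefix));
}
```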

6. Summary

Step       Tool                   Role
Crawling   cheerio / Playwright   Extract text from web pages
Analysis   Claude / Gemini API    Summarization, comparison, trends
Storage    Files / DB             History management
Alerts     Slack / Email          Notifications on detected changes

The crawler collects the data, and AI analyzes it. You can automate the research you used to do manually every day.