Web scraping with GPT-4o: powerful but expensive

The author explores the new structured outputs feature in OpenAI’s API by building an AI-assisted web scraper on top of GPT-4o. The post details experiments with extracting data from simple and complex tables, noting both successes and failures. Combining two approaches (direct data extraction and XPath generation) improves results, but problems remain, such as image-to-text conversion. The author highlights the high cost of using GPT-4o and implements logic to strip unnecessary data from the HTML string before it is sent to the model. Despite the cost, the author sees potential for AI-assisted web scraping tools. A Streamlit demo is available, possible future improvements are discussed, and the source code is on GitHub.
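As a rough illustration of the cost-reduction idea, the sketch below shrinks an HTML string before it would be passed to a model. It is not the author's implementation: the `slim_html` helper, the list of dropped tags, and the list of kept attributes are all assumptions chosen for the example, and it uses only Python's standard library.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these elements rarely carry scrapeable data.
DROP_TAGS = {"script", "style", "noscript", "svg"}
# Assumption: only these attributes are useful for locating elements.
KEEP_ATTRS = {"id", "class", "href"}


class _Slimmer(HTMLParser):
    """Re-emits HTML without dropped elements or non-whitelisted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value is not None
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse whitespace; this also drops spaces next to inline tags.
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.parts.append(escape(text))


def slim_html(raw: str) -> str:
    """Hypothetical helper: return a smaller HTML string for the prompt."""
    parser = _Slimmer()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


raw = (
    '<div style="color:red" class="t"><script>var x=1;</script>'
    '<table><tr><td onclick="f()"> 42 </td></tr></table></div>'
)
print(slim_html(raw))
# prints: <div class="t"><table><tr><td>42</td></tr></table></div>
```

Fewer characters generally mean fewer input tokens, which is where the savings would come from; which tags and attributes are safe to drop depends on the pages being scraped.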

https://blancas.io/blog/ai-web-scraper/
