Python and command-line tool for gathering text and metadata on the Web: crawling, scraping, extraction, and output as CSV, JSON, HTML, MD, TXT, or XML.
- corpus
- html2text
- news-crawler
- natural-language-processing
- scraper
- tei-xml
- text-extraction
- webscraping
- web-scraping
- article-extractor
- corpus-builder
- corpus-tools
- crawler
- html-to-markdown
- llm
- news-aggregator
- nlp
- rag
- readability
- rss-feed
- scraping
- tei
- text-cleaning
- text-mining
- text-preprocessing