https://www.searchstax.com/podcasts/data-ingestion-podcast/
One solution is web scraping, but it comes with many quality issues. A better solution is to extract and clean the data from the database itself.
This is exactly what WPSOLR does: the plugin understands how your WordPress or WooCommerce data is stored, and can therefore index it into Solr in the best way possible.
What is a post type, or a product attribute? How should they be sent so they can be used as facets and filters?
This is where WPSOLR shines.
It can also index files from the media library, such as PDFs or .docx documents.
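To make the post type and product attribute question concrete, here is a minimal sketch of how a WooCommerce product could be flattened into a Solr document so those values become facetable. It is written in Python for illustration only and is not WPSOLR's own code. The function name to_solr_doc and the field names (type_s, pa_color_ss, and so on) are assumptions, chosen to match the *_s and *_ss dynamic string fields of Solr's default configset; WPSOLR's actual schema may differ.

```python
import json


def to_solr_doc(post):
    """Flatten a WordPress-style post dict into a flat Solr document.

    Hypothetical helper: field names are illustrative, not WPSOLR's schema.
    """
    doc = {
        "id": str(post["ID"]),
        "title_s": post["post_title"],
        "type_s": post["post_type"],  # single-valued string, usable as a facet
    }
    # Each product attribute becomes its own multi-valued string field,
    # so Solr can facet and filter on it independently.
    for name, values in post.get("attributes", {}).items():
        doc[f"{name}_ss"] = list(values)
    return doc


# Sample product; WooCommerce attribute taxonomies use the "pa_" prefix.
product = {
    "ID": 42,
    "post_title": "Blue T-shirt",
    "post_type": "product",
    "attributes": {"pa_color": ["blue"], "pa_size": ["M", "L"]},
}

doc = to_solr_doc(product)

# JSON array of documents, the shape accepted by a Solr /update handler.
payload = json.dumps([doc])

# Query parameters that would request facet counts on two of the fields.
facet_params = {
    "q": "*:*",
    "facet": "true",
    "facet.field": ["type_s", "pa_color_ss"],
}
```

The sketch only builds the payload and the query parameters; sending them to a Solr or SearchStax endpoint is left out because the URL and credentials depend on your deployment.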
See for yourself with WooCommerce + Apache Solr + SearchStax: wpsolr.com