Imagine a WooCommerce site with no products yet, ready to import thousands of products.
WPSOLR is then activated, and a new Elasticsearch index created.
A first import loads tens of thousands of products. More imports will follow, adding new products, but also deleting or updating products previously imported.
How can we optimize the indexing performance?
How can we also prevent the Elasticsearch index from becoming desynchronized from the WooCommerce database?
Deactivate the real-time indexing
Below is a schematic of how to deactivate real-time indexing during imports:
Detailed explanations
By default, WPSOLR is activated with a feature called “Real-time indexing”.
This feature will insert/update/delete a document in the Elasticsearch index immediately after the corresponding product (or any other post type) is created/updated/unpublished.
Real-time indexing is great when a single product is modified. Indeed, the indexing time is then small compared to the time required for WordPress to update the backend database, and to refresh the product editing page.
But if you’re importing thousands of products, this is terribly inefficient. Every imported product triggers a call to the Elasticsearch server to update the index, leading to thousands of calls. If your Elasticsearch server answers in 0.2 s (a standard internet network latency), that adds up to more than 3 minutes for a thousand products (1,000 × 0.2 s = 200 s ≈ 3.3 min). So, for 100K products it adds about 5.5 hours (100,000 × 0.2 s = 20,000 s ≈ 5.5 h).
Now, let’s compare with sending the products to Elasticsearch in batches of 100. It would require about 3 minutes (100K products / 100 products per call = 1,000 calls; 1,000 × 0.2 s = 200 s ≈ 3.3 min). A hundred times faster, assuming that ingesting 100 documents takes roughly the same time as ingesting one document (which is true).
With the same idea, batches of 1,000 products would be 1,000 times faster than real-time indexing. But this time, it is not clear that ingesting 1,000 documents costs about the same as ingesting one document, unless your cluster is distributed across several (many?) nodes. You also have to consider the WooCommerce back-end: gathering 1,000 products’ data from SQL could put a lot of stress on your server. Again, a dedicated replicated MySQL node could help.
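The back-of-the-envelope arithmetic above can be checked with a short script. The 0.2 s round-trip latency is the assumption from the text, and server-side ingestion time is ignored (assumed roughly constant per call):

```python
import math

# Assumed round-trip latency per Elasticsearch call (value from the text).
LATENCY_S = 0.2

def network_time_hours(n_products: int, batch_size: int) -> float:
    """Total latency cost in hours for n_products sent in batches,
    ignoring server-side ingestion time."""
    n_calls = math.ceil(n_products / batch_size)
    return n_calls * LATENCY_S / 3600

print(round(network_time_hours(100_000, 1), 2))    # one call per product: 5.56 hours
print(round(network_time_hours(100_000, 100), 3))  # batches of 100: 0.056 hours (~3.3 min)
```

The ratio between the two is exactly the batch size, which is why batching dominates any other tuning as long as per-call latency is the bottleneck.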
The optimal batch size is not fixed. It depends on your WordPress environment, but also on your Elasticsearch environment. Start with a batch size of 50, then increase in increments of 50 until performance starts to degrade.
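WPSOLR builds these batched calls internally, but to illustrate what one batched call looks like on the wire, here is a sketch that assembles an Elasticsearch Bulk API (NDJSON) payload. The index name and product fields are invented for the example:

```python
import json

def build_bulk_payload(index: str, products: list) -> str:
    """Build the NDJSON body of one Elasticsearch _bulk request:
    one action line plus one source line per document."""
    lines = []
    for product in products:
        lines.append(json.dumps({"index": {"_index": index, "_id": product["id"]}}))
        lines.append(json.dumps({"title": product["title"], "price": product["price"]}))
    return "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline

batch = [
    {"id": 1, "title": "T-shirt", "price": 19.9},
    {"id": 2, "title": "Mug", "price": 9.5},
]
payload = build_bulk_payload("products_v1", batch)
# POST this payload to the _bulk endpoint of your Elasticsearch server,
# with the header Content-Type: application/x-ndjson.
```

However many products the batch holds, the network cost stays that of a single call.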
Solve issues with deleted products
How products should be imported depends on the type of batch.
There are two types of batches: incremental and non-incremental.
Incremental batches
The first batch imports all the products, while the following batches deal with fewer products to create/update/delete.
The first batch should follow the “no real-time” principle described in the previous chapter:
Following batches must be performed with the “real-time” option:
- Performance is not an issue then, as far fewer products are imported and indexed
- Deletion of products has to be performed immediately with WPSOLR
Non-incremental batches
The first and following batches import all the products. This is like a “delete and recreate all products” operation each time.
a) Reuse the same index
If you want to keep the same index, you first need to delete all the products (with the “Real-time” option off), empty the index, then import the batch and re-index. But until the re-indexing is complete, your online search will return incomplete results.
b) Use a new index
To keep the online search complete during indexing, you can use a second index.
The first index is used during the first import, and powers the current online search.
The second index is used after the next import.
Below is the screen used to index the data of your second index:
Several other reasons to use a second index for indexing:
- Updating and deleting documents in an index is costly (it requires Lucene segment merging). Adding new documents to a fresh index is much more efficient.
- During indexing, the online search remains fully available on the first index
- The second index can be placed on dedicated nodes for better write speed, without penalizing read access to the first index
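WPSOLR switches indexes from its settings screen. Outside the plugin, the usual Elasticsearch way to make the second index take over without downtime is an index alias: search always queries the alias, and once the new index is fully built, the alias is re-pointed in a single atomic `_aliases` call. A minimal sketch, with alias and index names invented for the example:

```python
import json

def alias_swap_body(alias: str, old_index: str, new_index: str) -> str:
    """JSON body for POST /_aliases: atomically re-point `alias` from
    old_index to new_index, so search never sees a half-built index."""
    return json.dumps({
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    })

body = alias_swap_body("products_search", "products_v1", "products_v2")
# POST this body to the _aliases endpoint of your Elasticsearch server,
# with the header Content-Type: application/json.
```

Because both actions are executed in one request, there is no moment when the alias points to zero or two indexes.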