Indexing Millions of SKU's with Magento?

From BlueBen on Magento Forums:

http://www.magentocommerce.com/boards/viewthread/219698/#t300497

I had the pleasure of seeing and meeting Alistair Stead from Session Digital at the Magento Imagine conference. Alistair is doing lots of work for the Magento developer community, including MageTolol, a Magento unit testing facility. The Magento community benefits from people such as Alistair, and Magento, Inc. have committed to being more involved with community recommendations.

Alistair just posted some catalog index findings to the Magento developers’ list. I’m copying his post here for the benefit of others:

The import process is much improved in the latest version but the catalog category index process is still a problem if you have a larger catalog with multiple stores. Under testing with a catalog over 2 million SKUs the index process can take over 24 hours. The issue is the following statement from within the index process.

SELECT 3 AS `category_id`, `e`.`entity_id` AS `product_id`, 0 AS
`position`, 1 AS `is_parent`, 1 AS `store_id`, `ei`.`visibility` FROM
`catalog_product_entity` AS `e`
-> INNER JOIN `catalog_category_product_index_enbl_idx` AS `ei` ON
ei.product_id = e.entity_id
-> LEFT JOIN `catalog_category_product_index_idx` AS `i` ON
i.product_id = e.entity_id AND i.category_id = ‘3’ AND i.store_id =
‘1’
-> WHERE (i.product_id IS NULL)

This statement results in the database attempting to create a huge temporary table including many millions of records before running the final insert.

The reason for this is clarified by asking mysql to explain the query:

mysql> explain SELECT 3 AS `category_id`, `e`.`entity_id` AS
`product_id`, 0 AS `position`, 1 AS `is_parent`, 1 AS `store_id`,
`ei`.`visibility` FROM `catalog_product_entity` AS `e`
-> INNER JOIN `catalog_category_product_index_enbl_idx` AS `ei` ON
ei.product_id = e.entity_id
-> LEFT JOIN `catalog_category_product_index_idx` AS `i` ON
i.product_id = e.entity_id AND i.category_id = ‘3’ AND i.store_id =
‘1’
-> WHERE (i.product_id IS NULL)\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: e
type: index
possible_keys: PRIMARY
key: FK_CATALOG_PRODUCT_ENTITY_ENTITY_TYPE
key_len: 2
ref: NULL
rows: 503313
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: ei
type: ref
possible_keys: IDX_PRODUCT
key: IDX_PRODUCT
key_len: 4
ref: magento.e.entity_id
rows: 1
Extra:
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: i
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2653541
Extra: Using where; Not exists
3 rows in set (0.00 sec)

----------------------------------------------------------------------

In order to improve the performance we simply need to add the following index to the table. This will take the catalog category indexing down to 7 minutes within our testing.

CREATE INDEX catalog_cat_product_idx ON catalog_category_product_index_idx(product_id,category_id,store_id);

This is a massive performance boost if you need to run the entire indexing process.

Tags: