Collections / Elasticsearch / custom app

  • Hello,

    I have a Runway installation. The collection_index table has over 900,000 rows now. I'm only indexing the fields that I need for filtering/sorting.

    So I have some performance issues. Particularly with perch_collection(). Pages that perform simple filtering with perch_collection() and get visited more frequently are fast to load though, so I'm assuming that's the result of SQL caching.

    There are currently ~2300 items for all Collections. The site integrates with an external source to import new items and update existing ones which happens frequently. So the table keeps growing.

    Although I'm trying to avoid this, but I may also have to import 20k+ archived items.

    I'm trying to study what I can do to improve the performance here. Should I reach for something like Elasticsearch and use its queries instead of perch_collection()? Or should I perhaps build my own app so my database tables would have columns for each field I need for filtering/sorting (and fewer rows)? Is there something else I should be considering?

  • My intention was always that elasticsearch (or solr or whatever) would be the natural progression here. The region.index event enables you to listen for any content being indexed and push it into elasticsearch.

  • Sounds reasonable. So it seems this means I'd have to get the fields I want to push into Elasticsearch from itemJSON:

    1. $API->on('region.index', function(PerchSystemEvent $Event){
    2. $fields = json_decode($Event->subject->itemJSON());
    3. // do stuff
    4. });

    I guess I can also use the same method to populate each item into a database table (with each filterable field as a column) and query that table when looking up items. Would your recommendation still be to use something like Elasticsearch? (I'm feeling an "it depends" answer coming :S)

  • Hi both - do you mind if I turn this into a bit of a discussion?

    Drew - I have a (kind of) related question/request - would it be possible to allow a "non-obvious" config file setting to stop the index table optimisation that happens after a write to the collection index?

    The reason I ask this is due to my testing of changing the database/table engine to InnoDB - it appeared to give me a slight performance increase for queries, but it was unusable after updating anything - due to the optimisation failing on the InnoDB table.

    Hussein - I have a site with about 1000 items in collections which equates to about 114000 rows in the index table - I'm expecting this to grow to at least 5x bigger than it is now, and I've seen some intermittent performance issues. Hence why I tried the InnoDB experiment, purely down to the fact that tweaking the InnoDB cache size seems to be more effective than tweaking the MyISAM cache settings (I could be wrong about this though).

    I also tried a test with putting data into a CouchDB database (if you don't know it, it's another NoSQL document-based database, not too dissimilar to ElasticSearch) - as it uses JSON as its default data type, getting data in was easy. Getting it out was fairly trivial and in some cases a LOT faster - but as soon as the queries got a bit more complicated, it dramatically slowed. The same may happen with ElasticSearch...

    Again, apologies for a thread derail - let me know if it's annoying and I'll hide it behind a spoiler tag!

  • dunc - Your input is very much appreciated.

    I decided to test the approach of creating a database table with the filterable fields and querying it instead of using perch_collection().

    The results so far are very encouraging. Tested some of the queries that used to take a few seconds; with the new approach they now take a fraction of a second. I'm yet to test some of the more complex queries, but it seems this approach will get me the results I'm looking for.

    I'm not ruling out using something like Elasticsearch (for search at least), but I may test it at a later stage.

    Thanks drewm, dunc

  • As you're discovering, the trade-off for the flexibility of being able to randomly configure any fields you like and sort and filter by any of them is that you lose some performance as the volume of data grows.

    My rough plan for Runway in this regard is to enable the creation of custom indexes, which would basically do as hus_hmd is suggesting and spin out a new database table with a set structure. Something like:

    1. Create a new index
    2. Pick the fields you want to sort/filter by
    3. Runway builds a table to match those fields, and indexes into it
    4. A new set option in perch_collection() lets you specify which index you want to use for your query

    The downside to this is keeping control of those tables and making sure not to slow things up too much when writing changes and reindexing the collection.

    This isn't in development, but that's the next step I have mapped out in my head for providing for those cases.

  • I tested the re-indexing into a custom table approach first. Performance definitely improved, but updating Collection items via the Import API was still slow (because I was still indexing into the default index table). I also tried not indexing any fields into the default index table, but this meant I/editors couldn't locate items easily via the control panel (search and sort didn't work).

    I figured there's no reason why I shouldn't embrace Perch Runway for the framework it is and take full advantage of the Perch API. So I ended up building a full-featured custom app.

    Runway Collections is not the limit of a Runway project. You just need to look past the CMS. Perch is more than that. drewm and rachelandrew just don't brag enough about it :burd1fly: