cqlsh - Cassandra performance on distinct query


In Cassandra, reads require designing the table schema so that a minimum number of partitions is hit. I have designed a schema that meets this requirement. In one scenario I need the partition keys alone, so I am planning to use

select distinct <partitionkeys> from table

I ran the distinct query using cqlsh on around 15k rows. It was quite fast.

Questions

  1. Will there be any performance issues if I use distinct?
  2. How does Cassandra fetch the partition keys alone?
  3. I need to know the limitations of the distinct query.

Will there be any performance issues if I use distinct? How does Cassandra fetch the partition keys alone?

Basically, Cassandra has to rip through the nodes and pull the partition (row) keys for the table. Querying by these keys is how Cassandra was designed to work, so I'm not surprised that it performed well for you. The drawback is that it has to hit all or most of the nodes to complete the operation, so performance could be slow if you have a large number of nodes.

This is where the difference between CQL rows and rows in the underlying storage comes into play. If you look at the data with the cassandra-cli tool, you can see how partition keys are treated differently. Here is an example where the crew members of a ship are stored in a table, by ship.

aploetz@cqlsh:presentation> select * from shipcrewregistry ;

 shipname | lastname  | firstname | citizenid                            | aliases
----------+-----------+-----------+--------------------------------------+--------------------------------------
 serenity |      book |    derial | 48bc975a-c9f2-474d-8a29-247503445877 |                       {'classified'}
 serenity |      cobb |     jayne | 2d643fb1-54fb-4c98-8d2d-a5bb9c6c8354 |                   {'hero of canton'}
 serenity |      frye |    kaylee | d556cf44-348b-4ea3-8c19-ba9d4877818c |                                 null
 serenity |     inara |     serra | a25b7e02-8099-401a-8c41-d9d2ea894b72 |                                 null
 serenity |  reynolds |   malcolm | 169382b7-21b0-47bf-b1c8-19bc008a9060 |             {'mal', 'sgt. reynolds'}
 serenity |       tam |     river | af68201f-4135-413e-959c-dd81ea651e52 |                                 null
 serenity |       tam |     simon | aa090e1a-7792-4d7b-bba9-bac66f8c1f15 |                          {'dr. tam'}
 serenity | washburne |     hoban | 73f591df-c0dc-44c4-b3f3-9c37453c9537 |                             {'wash'}
 serenity | washburne |      zoey | 46bc77ad-53ad-4402-b252-a0543005c583 | {'corporal alleyne', 'zoey alleyne'}

(9 rows)

But when I run the same query within cassandra-cli:

[default@presentation] list shipcrewregistry;
using default limit of 100
using default cell limit of 100
-------------------
rowkey: serenity
=> (name=book:derial:48bc975a-c9f2-474d-8a29-247503445877:, value=, timestamp=1424904853420170)
=> (name=book:derial:48bc975a-c9f2-474d-8a29-247503445877:aliases:434c4153534946494544, value=, timestamp=1424904853420170)
=> (name=cobb:jayne:2d643fb1-54fb-4c98-8d2d-a5bb9c6c8354:, value=, timestamp=1424904853492976)
=> (name=cobb:jayne:2d643fb1-54fb-4c98-8d2d-a5bb9c6c8354:aliases:4865726f206f662043616e746f6e, value=, timestamp=1424904853492976)
=> (name=frye:kaylee:d556cf44-348b-4ea3-8c19-ba9d4877818c:, value=, timestamp=1428442425610395)
=> (name=inara:serra:a25b7e02-8099-401a-8c41-d9d2ea894b72:, value=, timestamp=1428442425621555)
=> (name=reynolds:malcolm:169382b7-21b0-47bf-b1c8-19bc008a9060:, value=, timestamp=1424904853505461)
=> (name=reynolds:malcolm:169382b7-21b0-47bf-b1c8-19bc008a9060:aliases:4d616c, value=, timestamp=1424904853505461)
=> (name=reynolds:malcolm:169382b7-21b0-47bf-b1c8-19bc008a9060:aliases:5367742e205265796e6f6c6473, value=, timestamp=1424904853505461)
=> (name=tam:river:af68201f-4135-413e-959c-dd81ea651e52:, value=, timestamp=1428442425575881)
=> (name=tam:simon:aa090e1a-7792-4d7b-bba9-bac66f8c1f15:, value=, timestamp=1424904853518092)
=> (name=tam:simon:aa090e1a-7792-4d7b-bba9-bac66f8c1f15:aliases:44722e2054616d, value=, timestamp=1424904853518092)
=> (name=washburne:hoban:73f591df-c0dc-44c4-b3f3-9c37453c9537:, value=, timestamp=1428442425587484)
=> (name=washburne:hoban:73f591df-c0dc-44c4-b3f3-9c37453c9537:aliases:57617368, value=, timestamp=1428442425587484)
=> (name=washburne:zoey:46bc77ad-53ad-4402-b252-a0543005c583:, value=, timestamp=1428442425596863)
=> (name=washburne:zoey:46bc77ad-53ad-4402-b252-a0543005c583:aliases:436f72706f72616c20416c6c65796e65, value=, timestamp=1428442425596863)
=> (name=washburne:zoey:46bc77ad-53ad-4402-b252-a0543005c583:aliases:5a6f657920416c6c65796e65, value=, timestamp=1428442425596863)

1 row returned.
elapsed time: 86 msec(s).

This is intended to show how 9 CQL rows are really 1 row "under the hood."
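The same collapse can be mimicked outside Cassandra. The following is only a rough sketch in Python, not how the storage engine is implemented: it takes the partition key and clustering columns from the cqlsh output above and groups them by partition key, the way the cassandra-cli listing does. The function name group_by_partition_key is made up for this illustration.

```python
from collections import defaultdict

# (shipname, lastname, firstname) for each CQL row, copied from the cqlsh output.
# shipname is the partition key; the remaining values are clustering columns.
cql_rows = [
    ("serenity", "book", "derial"),
    ("serenity", "cobb", "jayne"),
    ("serenity", "frye", "kaylee"),
    ("serenity", "inara", "serra"),
    ("serenity", "reynolds", "malcolm"),
    ("serenity", "tam", "river"),
    ("serenity", "tam", "simon"),
    ("serenity", "washburne", "hoban"),
    ("serenity", "washburne", "zoey"),
]


def group_by_partition_key(rows):
    """Group CQL rows under their partition key (hypothetical helper)."""
    partitions = defaultdict(list)
    for partition_key, *clustering in rows:
        partitions[partition_key].append(tuple(clustering))
    return dict(partitions)


partitions = group_by_partition_key(cql_rows)

# A DISTINCT on the partition key corresponds to the keys of this mapping.
distinct_keys = sorted(partitions)

print(len(cql_rows), "CQL rows ->", len(partitions), "storage row(s)")
print("distinct partition keys:", distinct_keys)
```

Here 9 CQL rows fall under the single key serenity, which is why a distinct on the partition key returns one value for this table.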

I need to know the limitations of the distinct query.

In CQL, distinct will only work on partition keys. I'm not sure how many rows would negate its usefulness. 15000 CQL rows should be fine for it. But if you have millions of distinct partition keys (high cardinality), I would expect performance to drop off... especially with several nodes in your cluster.
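If the key set does get large, reading it one page at a time at least keeps client memory bounded, although the cluster still does the same amount of work. Below is a minimal sketch, assuming a session object that behaves like the Session of the DataStax Python driver: it has a default_fetch_size attribute, and execute() returns rows that are fetched page by page as you iterate. The names iter_partition_keys and StubSession are hypothetical; the stub is only there so the sketch runs without a cluster.

```python
def iter_partition_keys(session, table, key_column, fetch_size=1000):
    """Yield distinct partition key values one at a time.

    Assumes `session` behaves like a DataStax Python driver Session.
    `table` and `key_column` are inserted into the query text as-is,
    so they must come from trusted code, never from user input.
    """
    # Note: this changes the page size for every later query on this session.
    session.default_fetch_size = fetch_size
    query = f"SELECT DISTINCT {key_column} FROM {table}"
    for row in session.execute(query):
        # Rows are tuple-like; the only selected column is at index 0.
        yield row[0]


class StubSession:
    """Hypothetical stand-in for a real session (no cluster needed)."""

    default_fetch_size = 5000

    def execute(self, query):
        self.last_query = query
        return iter([("serenity",)])


stub = StubSession()
keys = list(iter_partition_keys(stub, "shipcrewregistry", "shipname"))
print(keys)
```

With the real driver, the session would typically come from something like Cluster(["127.0.0.1"]).connect("presentation"), where the contact point is an assumption about your environment.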

