scala - How to do distributed Principal Components Analysis + K-means using Apache Spark?


I need to run Principal Components Analysis and k-means clustering on a large-ish dataset (around 10 GB) spread out over many files. I want to use Apache Spark since it's known to be fast and distributed.

I know Spark supports PCA, and PCA + k-means.

However, I haven't found an example that demonstrates how to do this over many files in a distributed manner.
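For reference, a minimal sketch of one way to do this with Spark's RDD-based MLlib API, assuming each input line is a comma-separated row of numeric features; the input path glob, delimiter, number of principal components, and cluster count are all placeholder assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.clustering.KMeans

object PcaKMeansExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PCA + KMeans")
    val sc = new SparkContext(conf)

    // textFile accepts a glob, so all matching files are read as a single
    // distributed RDD; the path below is only a placeholder.
    val data = sc.textFile("hdfs:///data/input/part-*")

    // Assumption: each line is a comma-separated list of numeric values.
    val rows = data.map { line =>
      Vectors.dense(line.split(',').map(_.toDouble))
    }.cache()

    // Distributed PCA: compute the top 10 principal components and
    // project every row onto them.
    val mat = new RowMatrix(rows)
    val pc = mat.computePrincipalComponents(10)   // local n-by-10 matrix
    val projected = mat.multiply(pc)              // distributed RowMatrix

    // k-means on the reduced representation (5 clusters, 20 iterations).
    val model = KMeans.train(projected.rows, 5, 20)

    model.clusterCenters.foreach(println)
    sc.stop()
  }
}

Because the RDD is partitioned across the files, both the PCA projection and the k-means iterations run in parallel across the cluster; only the principal-component matrix itself is collected to the driver.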

