Cassandra: The Definitive Guide

Eben Hewitt

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Cassandra: The Definitive Guide by Eben Hewitt

Copyright © 2011 Eben Hewitt. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com

Editor: Mike Loukides
Production Editor: Holly Bauer
Copyeditor: Genevieve d’Entremont
Proofreader: Emily Quill
Indexer: Ellen Troutman Zaig
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History: November 2010: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Cassandra: The Definitive Guide, the image of a Paradise flycatcher, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

This book uses RepKover™, a durable and flexible lay-flat binding.

ISBN: 978-1-449-39041-9 [M] 1289577822

This book is dedicated to my sweetheart,

Alison Brown. I can hear the sound of violins, long before it begins.

Table of Contents

Foreword
Preface

1. Introducing Cassandra
    What’s Wrong with Relational Databases?
    A Quick Review of Relational Databases
    RDBMS: The Awesome and the Not-So-Much
    Web Scale
    The Cassandra Elevator Pitch
    Cassandra in 50 Words or Less
    Distributed and Decentralized
    Elastic Scalability
    High Availability and Fault Tolerance
    Tuneable Consistency
    Brewer’s CAP Theorem
    Row-Oriented
    Schema-Free
    High Performance
    Where Did Cassandra Come From?
    Use Cases for Cassandra
    Large Deployments
    Lots of Writes, Statistics, and Analysis
    Geographical Distribution
    Evolving Applications
    Who Is Using Cassandra?
    Summary

2. Installing Cassandra
    Installing the Binary
    Extracting the Download
    What’s In There?
    Building from Source
    Additional Build Targets
    Building with Maven
    Running Cassandra
    On Windows
    On Linux
    Starting the Server
    Running the Command-Line Client Interface
    Basic CLI Commands
    Help
    Connecting to a Server
    Describing the Environment
    Creating a Keyspace and Column Family
    Writing and Reading Data
    Summary

3. The Cassandra Data Model
    The Relational Data Model
    A Simple Introduction
    Clusters
    Keyspaces
    Column Families
    Column Family Options
    Columns
    Wide Rows, Skinny Rows
    Column Sorting
    Super Columns
    Composite Keys
    Design Differences Between RDBMS and Cassandra
    No Query Language
    No Referential Integrity
    Secondary Indexes
    Sorting Is a Design Decision
    Denormalization
    Design Patterns
    Materialized View
    Valueless Column
    Aggregate Key
    Some Things to Keep in Mind
    Summary

4. Sample Application
    Data Design
    Hotel App RDBMS Design
    Hotel App Cassandra Design
    Hotel Application Code
    Creating the Database
    Data Structures
    Getting a Connection
    Prepopulating the Database
    The Search Application
    Twissandra
    Summary

5. The Cassandra Architecture
    System Keyspace
    Peer-to-Peer
    Gossip and Failure Detection
    Anti-Entropy and Read Repair
    Memtables, SSTables, and Commit Logs
    Hinted Handoff
    Compaction
    Bloom Filters
    Tombstones
    Staged Event-Driven Architecture (SEDA)
    Managers and Services
    Cassandra Daemon
    Storage Service
    Messaging Service
    Hinted Handoff Manager
    Summary

6. Configuring Cassandra
    Keyspaces
    Creating a Column Family
    Transitioning from 0.6 to 0.7
    Replicas
    Replica Placement Strategies
    Simple Strategy
    Old Network Topology Strategy
    Network Topology Strategy
    Replication Factor
    Increasing the Replication Factor
    Partitioners
    Random Partitioner
    Order-Preserving Partitioner
    Collating Order-Preserving Partitioner
    Byte-Ordered Partitioner
    Snitches
    Simple Snitch
    PropertyFileSnitch
    Creating a Cluster
    Changing the Cluster Name
    Adding Nodes to a Cluster
    Multiple Seed Nodes
    Dynamic Ring Participation
    Security
    Using SimpleAuthenticator
    Programmatic Authentication
    Using MD5 Encryption
    Providing Your Own Authentication
    Miscellaneous Settings
    Additional Tools
    Viewing Keys
    Importing Previous Configurations
    Summary

7. Reading and Writing Data
    Query Differences Between RDBMS and Cassandra
    No Update Query
    Record-Level Atomicity on Writes
    No Server-Side Transaction Support
    No Duplicate Keys
    Basic Write Properties
    Consistency Levels
    Basic Read Properties
    The API
    Ranges and Slices
    Setup and Inserting Data
    Using a Simple Get
    Seeding Some Values
    Slice Predicate
    Getting Particular Column Names with Get Slice
    Getting a Set of Columns with Slice Range
    Getting All Columns in a Row
    Get Range Slices
    Multiget Slice
    Deleting
    Batch Mutates
    Batch Deletes
    Range Ghosts
    Programmatically Defining Keyspaces and Column Families
    Summary

8. Clients
    Basic Client API
    Thrift
    Thrift Support for Java
    Exceptions
    Thrift Summary
    Avro
    Avro Ant Targets
    Avro Specification
    Avro Summary
    A Bit of Git
    Connecting Client Nodes
    Client List
    Round-Robin DNS
    Load Balancer
    Cassandra Web Console
    Hector (Java)
    Features
    The Hector API
    HectorSharp (C#)
    Chirper
    Chiton (Python)
    Pelops (Java)
    Kundera (Java ORM)
    Fauna (Ruby)
    Summary

9. Monitoring
    Logging
    Tailing
    General Tips
    Overview of JMX and MBeans
    MBeans
    Integrating JMX
    Interacting with Cassandra via JMX
    Cassandra’s MBeans
    org.apache.cassandra.concurrent
    org.apache.cassandra.db
    org.apache.cassandra.gms
    org.apache.cassandra.service
    Custom Cassandra MBeans
    Runtime Analysis Tools
    Heap Analysis with JMX and JHAT
    Detecting Thread Problems
    Health Check
    Summary

10. Maintenance
    Getting Ring Information
    Info
    Ring
    Getting Statistics
    Using cfstats
    Using tpstats
    Basic Maintenance
    Repair
    Flush
    Cleanup
    Snapshots
    Taking a Snapshot
    Clearing a Snapshot
    Load-Balancing the Cluster
    loadbalance and streams
    Decommissioning a Node
    Updating Nodes
    Removing Tokens
    Compaction Threshold
    Changing Column Families in a Working Cluster
    Summary

11. Performance Tuning
    Data Storage
    Reply Timeout
    Commit Logs
    Memtables
    Concurrency
    Caching
    Buffer Sizes
    Using the Python Stress Test
    Generating the Python Thrift Interfaces
    Running the Python Stress Test
    Startup and JVM Settings
    Tuning the JVM
    Summary

12. Integrating Hadoop
    What Is Hadoop?
    Working with MapReduce
    Cassandra Hadoop Source Package
    Running the Word Count Example
    Outputting Data to Cassandra
    Hadoop Streaming
    Tools Above MapReduce
    Pig
    Hive
    Cluster Configuration
    Use Cases
    Raptr.com: Keith Thornhill
    Imagini: Dave Gardner
    Summary

Appendix: The Nonrelational Landscape
Glossary
Index

Foreword

Cassandra was open-sourced by Facebook in July 2008. This original version of Cassandra was written primarily by an ex-employee from Amazon and one from Microsoft. It was strongly influenced by Dynamo, Amazon’s pioneering distributed key/value database. Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.

I became involved in December of that year, when Rackspace asked me to build them a scalable database. This was good timing, because all of today’s important open source scalable databases were available for evaluation. Despite initially having only a single major use case, Cassandra’s underlying architecture was the strongest, and I directed my efforts toward improving the code and building a community.

Cassandra was accepted into the Apache Incubator, and by the time it graduated in March 2010, it had become a true open source success story, with committers from Rackspace, Digg, Twitter, and other companies that wouldn’t have written their own database from scratch, but together built something important.

Today’s Cassandra is much more than the early system that powered (and still powers) Facebook’s inbox search; it has become “the hands down winner for transaction processing performance,” to quote Tony Bain, with a deserved reputation for reliability and performance at scale.

As Cassandra matured and began attracting more mainstream users, it became clear that there was a need for commercial support; thus, Matt Pfeil and I cofounded Riptano in April 2010. Helping drive Cassandra adoption has been very rewarding, especially seeing the uses that don’t get discussed in public.

Another need has been a book like this one. Like many open source projects, Cassandra’s documentation has historically been weak. And even when the documentation ultimately improves, a book-length treatment like this will remain useful.

Thanks to Eben for tackling the difficult task of distilling the art and science of developing against and deploying Cassandra. You, the reader, have the opportunity to learn these new concepts in an organized fashion.

Jonathan Ellis
Project Chair, Apache Cassandra, and Cofounder, Riptano

Preface

Why Apache Cassandra?

Apache Cassandra is a free, open source, distributed data storage system that differs sharply from relational database management systems.

Cassandra first started as an incubation project at Apache in January of 2009. Shortly thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, released version 0.3 of Cassandra, and have steadily made minor releases since that time. Though as of this writing it has not yet reached a 1.0 release, Cassandra is being used in production by some of the biggest properties on the Web, including Facebook, Twitter, Cisco, Rackspace, Digg, Cloudkick, Reddit, and more.

Cassandra has become so popular because of its outstanding technical features. It is durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there’s no single point of failure. It is highly available and offers a schema-free data model.

Is This Book for You?

This book is intended for a variety of audiences. It should be useful to you if you are:

• A developer working with large-scale, high-volume websites, such as Web 2.0 social applications

• An application architect or data architect who needs to understand the available options for high-performance, decentralized, elastic data stores

• A database administrator or database developer currently working with standard relational database systems who needs to understand how to implement a fault-tolerant, eventually consistent data store

• A manager who wants to understand the advantages (and disadvantages) of Cassandra and related columnar databases to help make decisions about technology strategy

• A student, analyst, or researcher who is designing a project related to Cassandra or other non-relational data store options

This book is a technical guide. In many ways, Cassandra represents a new way of thinking about data. Many developers who gained their professional chops in the last 15–20 years have become well-versed in thinking about data in purely relational or object-oriented terms. Cassandra’s data model is very different and can be difficult to wrap your mind around at first, especially for those of us with entrenched ideas about what a database is (and should be).

Using Cassandra does not mean that you have to be a Java developer. However, Cassandra is written in Java, so if you’re going to dive into the source code, a solid understanding of Java is crucial. Although it’s not strictly necessary to know Java, it can help you to better understand exceptions, how to build the source code, and how to use some of the popular clients. Many of the examples in this book are in Java. But because of the interface used to access Cassandra, you can use Cassandra from a wide variety of languages, including C#, Scala, Python, and Ruby.

Finally, it is assumed that you have a good understanding of how the Web works, can use an integrated development environment (IDE), and are somewhat familiar with the typical concerns of data-driven applications. You might be a well-seasoned developer or administrator but still, on occasion, encounter tools used in the Cassandra world that you’re not familiar with. For example, Apache Ivy is used to build Cassandra, and a popular client (Hector) is available via Git. In cases where I expect that you’ll need to do a little setup of your own in order to work with the examples, I try to support that.

What’s in This Book?

This book is designed with the chapters acting, to a reasonable extent, as standalone guides. This is important for a book on Cassandra, which has a variety of audiences and is changing rapidly. To borrow from the software world, I wanted the book to be “modular”—sort of. If you’re new to Cassandra, it makes sense to read the book in order; if you’ve passed the introductory stages, you will still find value in later chapters, which you can read as standalone guides.

Here is how the book is organized:

Chapter 1, Introducing Cassandra

This chapter introduces Cassandra and discusses what’s exciting and different about it, who is using it, and what its advantages are.

Chapter 2, Installing Cassandra

This chapter walks you through installing Cassandra on a variety of platforms.

Chapter 3, The Cassandra Data Model

Here we look at Cassandra’s data model to understand what columns, super columns, and rows are. Special care is taken to bridge the gap between the relational database world and Cassandra’s world.

Chapter 4, Sample Application

This chapter presents a complete working application that translates from a relational model in a well-understood domain to Cassandra’s data model.

Chapter 5, The Cassandra Architecture

This chapter helps you understand what happens during read and write operations and how the database accomplishes some of its notable aspects, such as durability and high availability. We go under the hood to understand some of the more complex inner workings, such as the gossip protocol, hinted handoffs, read repairs, Merkle trees, and more.

Chapter 6, Configuring Cassandra

This chapter shows you how to specify partitioners, replica placement strategies, and snitches. We set up a cluster and see the implications of different configuration choices.
