Towards Robust Distributed Systems
Dr. Eric A. Brewer
Professor, UC Berkeley; Co-Founder & Chief Scientist, Inktomi
PODC Keynote, July 19, 2000

Inktomi at a Glance
Company Overview
– "INKT" on NASDAQ
– Founded 1996 out of UC Berkeley
– ~700 Employees
Applications
– Search Technology
– Network Products
– Online Shopping
– Wireless Systems
Our Perspective
Inktomi builds two distributed systems:
– Global Search Engines
– Distributed Web Caches
Based on scalable cluster & parallel computing technology
But very little use of classic DS research...

"Distributed Systems" don't work...
There exist working DS:
– Simple protocols: DNS, WWW
– Inktomi search, Content Delivery Networks
– Napster, Verisign, AOL
But these are not classic DS:
– Not distributed objects
– No RPC
– No modularity
– Complex ones are single owner (except phones)
Three Basic Issues
– Where is the state?
– Consistency vs. Availability
– Understanding Boundaries
Where's the state?
(not all locations are equal)

Santa Clara Cluster
– Very uniform
– No monitors
– No people
– No cables
– Working power
– Working A/C
– Working BW

Delivering High Availability
We kept up the service through:
– Crashes & disk failures (weekly)
– Database upgrades (daily)
– Software upgrades (weekly to monthly)
– OS upgrades (twice)
– Power outages (several)
– Network outages (now have 11 connections)
– Physical move of all equipment (twice)
Persistent State is HARD
Classic DS focus on the computation, not the data
– this is WRONG, computation is the easy part
Data centers exist for a reason
– can't have consistency or availability without them
Other locations are for caching only:
– proxies, basestations, set-top boxes, desktops
– phones, PDAs, ...
Distributed systems can't ignore location distinctions

Berkeley Ninja Architecture
Base: Scalable, highly-available platform for persistent-state services
Active Proxy (AP): Bootstraps thin devices into infrastructure, runs mobile code
[Architecture diagram: workstations & PCs, cellphones, pagers, and PDAs (e.g. IBM Workpad) reach Bases through Active Proxies over the Internet]
Consistency vs. Availability
(ACID vs. BASE)
DBMS research is about ACID (mostly)
But we forfeit "C" and "I" for availability, graceful degradation, and performance
This tradeoff is fundamental.
BASE:
– Basically Available
– Soft-state
– Eventual consistency
ACID vs. BASE
ACID:
– Strong consistency
– Isolation
– Focus on "commit"
– Nested transactions
– Availability?
– Conservative (pessimistic)
– Difficult evolution (e.g. schema)
BASE:
– Weak consistency: stale data OK
– Availability first
– Best effort
– Approximate answers OK
– Aggressive (optimistic)
– Simpler!
– Faster
– Easier evolution
But I think it's a spectrum

The CAP Theorem
Theorem: You can have at most two of these properties for any shared-data system:
– Consistency
– Availability
– Tolerance to network Partitions
Forfeit Partitions
Examples:
– Single-site databases
– Cluster databases
– LDAP
– xFS file system
Traits:
– 2-phase commit
– cache validation protocols
[CAP triangle: Consistency + Availability chosen; tolerance to network Partitions forfeited]

Forfeit Availability
Examples:
– Distributed databases
– Distributed locking
– Majority protocols
Traits:
– Pessimistic locking
– Make minority partitions unavailable
[CAP triangle: Consistency + Partition tolerance chosen; Availability forfeited]
Forfeit Consistency
Examples:
– Coda
– Web caching
– DNS
Traits:
– expirations/leases (sketch below)
– conflict resolution
– optimistic
[CAP triangle: Availability + Partition tolerance chosen; Consistency forfeited]

These Tradeoffs are Real
The whole space is useful
Real internet systems are a careful mixture of ACID and BASE subsystems
– We use ACID for user profiles and logging (for revenue)
But there is almost no work in this area
Symptom of a deeper problem: systems and database communities are separate but overlapping (with distinct vocabulary)
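The "expirations/leases" trait listed under Forfeit Consistency is the easiest of these to make concrete. Below is a minimal soft-state cache sketch in Python, assuming an invented SoftStateCache class, a made-up 30-second TTL, and a fetch_from_origin callback standing in for the authoritative server; it illustrates expiration-based (eventually consistent) caching and is not code from Inktomi, DNS, or any real web cache.

import time

class SoftStateCache:
    """Toy expiration-based cache: entries carry a TTL and simply age out.

    Readers may see stale data until the entry expires; there is no
    invalidation protocol, so the cache stays available under partitions
    and converges to fresh data only after the TTL (eventual consistency).
    """
    def __init__(self):
        self._entries = {}                    # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds):
        self._entries[key] = (value, time.time() + ttl_seconds)

    def get(self, key, fetch_from_origin):
        value, expires_at = self._entries.get(key, (None, 0.0))
        if time.time() < expires_at:
            return value                      # possibly stale, but served immediately
        try:
            fresh = fetch_from_origin(key)    # refresh only on expiration
        except OSError:
            return value                      # origin unreachable: serve stale (or None)
        self.put(key, fresh, ttl_seconds=30)
        return fresh

# Example: the "origin" callback stands in for an authoritative server.
cache = SoftStateCache()
cache.put("www.example.com", "192.0.2.1", ttl_seconds=30)
print(cache.get("www.example.com", fetch_from_origin=lambda k: "192.0.2.2"))

Because nothing is ever invalidated, reads never block on the origin and the service stays available across partitions; the price is that an answer may be up to one TTL stale.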
CAP Take Homes
– Can have consistency & availability within a cluster (foundation of Ninja), but it is still hard in practice
– OS/Networking good at BASE/Availability, but terrible at consistency
– Databases better at C than Availability
– Wide-area databases can't have both
– Disconnected clients can't have both
– All systems are probabilistic...
Understanding Boundaries
(the RPC hangover)
The Boundary
The interface between two modules
– client/server, peers, libraries, etc...
Basic boundary = the procedure call
– thread traverses the boundary
– two sides are in the same address space
[Diagram: client C calling server S]

Different Address Spaces
What if the two sides are NOT in the same address space?
– IPC or LRPC
Can't do pass-by-reference (pointers)
– Most IPC screws this up: pass by value-result
– There are TWO copies of args, not one (sketch below)
What if they share some memory?
– Can pass pointers, but...
– Need synchronization between client/server
– Not all pointers can be passed
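To see why pass-by-reference cannot cross an address-space boundary, here is a small Python sketch. The annotate function and the dictionaries are invented for illustration, and pickle stands in for whatever marshalling a real IPC/RPC layer would do: the local call shares one copy of the argument, while the simulated remote call operates on a second copy, so the callee's mutation is visible only if the caller explicitly adopts the value-result reply.

import pickle

def annotate(record):
    """Callee that mutates its argument in place."""
    record["checked"] = True
    return record

# Local procedure call: caller and callee share one address space,
# so there is exactly one copy of the argument.
local_arg = {"user": "alice"}
annotate(local_arg)
print(local_arg)                      # {'user': 'alice', 'checked': True}

# Simulated IPC/RPC: the argument is marshalled across the boundary,
# so the callee works on its own copy of the args.
remote_arg = {"user": "alice"}
wire_request = pickle.dumps(remote_arg)           # first copy crosses the boundary
reply = annotate(pickle.loads(wire_request))      # callee mutates its own copy
print(remote_arg)                     # {'user': 'alice'} -- caller's copy unchanged
remote_arg = pickle.loads(pickle.dumps(reply))    # value-result: explicitly copy back
print(remote_arg)                     # {'user': 'alice', 'checked': True}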
Trust the other side?
What if we don't trust the other side?
– Have to check args, no pointer passing
Kernels get this right:
– copy/check args
– use opaque references (e.g. File Descriptors)

Partial Failure
Can the two sides fail independently?
– RPC, IPC, LRPC
Can't be transparent (like RPC)!!
– New exceptions (other side gone)
– Reclaim local resources
– e.g. kernels leak sockets over time => reboot
Most systems do not:
– TCP
– Napster
– web browsers
Can use leases? (sketch below)
– Different new exceptions: lease expired
RPC tries to hide these issues (but fails)
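One way to reclaim local resources after a partial failure is the lease idea mentioned above. The sketch below is a toy, single-threaded illustration in Python; LeaseTable, LeaseExpired, and the 0.1-second duration are invented names and numbers, not any real kernel's or RPC system's API.

import time

class LeaseExpired(Exception):
    pass

class LeaseTable:
    """Toy lease manager: resources are reclaimed if the holder stops renewing.

    This replaces "the other side is gone" with a bounded wait: after
    `duration` seconds without a renewal, the server frees the resource
    and later calls from the holder fail with LeaseExpired.
    """
    def __init__(self, duration):
        self.duration = duration
        self._leases = {}                     # resource -> expiry time

    def grant(self, resource):
        self._leases[resource] = time.time() + self.duration

    def renew(self, resource):
        if time.time() > self._leases.get(resource, 0.0):
            raise LeaseExpired(resource)      # a new kind of exception
        self._leases[resource] = time.time() + self.duration

    def reap(self):
        now = time.time()
        for resource, expiry in list(self._leases.items()):
            if now > expiry:                  # holder crashed or got partitioned
                del self._leases[resource]    # reclaim the local resource

# Example run with a short lease so the expiry is visible.
table = LeaseTable(duration=0.1)
table.grant("socket-42")
time.sleep(0.2)                               # client goes silent
table.reap()
try:
    table.renew("socket-42")
except LeaseExpired as exc:
    print("lease expired for", exc)

The point is the shape of the failure handling: instead of a resource leaking forever because the other side silently disappeared, the holder sees an explicit "lease expired" exception and the server frees the resource after a bounded wait.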
Multiplexing clients?
Does the server have to:
– deal with high concurrency?
– say "no" sometimes (graceful degradation)
– treat clients equally (fairness)
– bill for resources (and have audit trail)
– isolate clients (performance, data, ...)
These all affect the boundary definition

Boundary evolution?
Can the two sides be updated independently? (NO)
The DLL problem...
Boundaries need versions
Negotiation protocol for upgrade? (sketch below)
Promises of backward compatibility?
Affects naming too (version number)
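As a tiny illustration of versioned boundaries, the sketch below (Python, with invented names and version numbers) shows the minimal negotiation step: each side advertises the boundary versions it speaks, and they settle on the highest common one or refuse cleanly.

def negotiate_version(client_supports, server_supports):
    """Pick the highest protocol version both sides speak, or None.

    A real negotiation would also carry feature flags and a promise of
    backward compatibility; this only shows why the boundary itself
    (not just each side) needs an explicit version.
    """
    common = set(client_supports) & set(server_supports)
    return max(common) if common else None

# Example: an old client talking to an upgraded server.
print(negotiate_version(client_supports={1, 2}, server_supports={2, 3}))   # 2
print(negotiate_version(client_supports={1}, server_supports={3}))         # None -> refuse or proxy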
Example: protocols vs. APIs
Protocols have been more successful than APIs
Some reasons:
– protocols are pass by value
– protocols designed for partial failure
– not trying to look like local procedure calls
– explicit state machine, rather than call/return (this exposes exceptions well)
Protocols still not good at trust, billing, evolution

Example: XML
XML doesn't solve any of these issues
It is RPC with an extensible type system
It makes evolution better?
– two sides need to agree on schema
– can ignore stuff you don't understand (sketch below)
Can mislead us to ignore the real issues
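"Can ignore stuff you don't understand" is easy to show in miniature. The sketch below uses JSON rather than XML purely to stay short, and the field names (query, max_results, timeout_ms) are invented; the same idea applies to skipping unknown XML elements.

import json

KNOWN_FIELDS = {"query", "max_results"}       # the schema this side understands

def parse_request(wire_bytes):
    """Keep the fields we know, silently drop the rest.

    This is the "ignore stuff you don't understand" style of evolution:
    a newer peer can add fields without breaking us, but nothing here
    addresses trust, billing, or partial failure.
    """
    message = json.loads(wire_bytes)
    return {k: v for k, v in message.items() if k in KNOWN_FIELDS}

# A newer client adds a "timeout_ms" field this server has never heard of.
wire = json.dumps({"query": "robust systems", "max_results": 10, "timeout_ms": 50})
print(parse_request(wire))            # {'query': 'robust systems', 'max_results': 10}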
Boundary Summary
We have been very sloppy about boundaries
Leads to fragile systems
Root cause is false transparency: trying to look like local procedure calls
Relatively little work in evolution, federation, client-based resource allocation, failure recovery

Conclusions
Classic Distributed Systems are fragile
Some of the causes:
– focus on computation, not data
– ignoring location distinctions
– poor definitions of consistency/availability goals
– poor understanding of boundaries (RPC in particular)
These are all fixable, but need to be far more common
The DQ Principle
Data/query × Queries/sec = constant = DQ
– for a given node
– for a given app/OS release
A fault can reduce the capacity (Q), completeness (D) or both
Faults reduce this constant linearly (at best)

Harvest & Yield
Yield: Fraction of Answered Queries
– Related to uptime but measured by queries, not by time
– Drop 1 out of 10 connections => 90% yield
– At full utilization: yield ~ capacity ~ Q
Harvest: Fraction of the Complete Result
– Reflects that some of the data may be missing due to faults
– Replication: maintain D under faults
DQ corollary: harvest × yield ~ constant (sketch below)
– ACID => choose 100% harvest (reduce Q but 100% D)
– Internet => choose 100% yield (available but reduced D)
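The corollary can be restated as a few lines of arithmetic. The sketch below (Python, with an invented degrade helper) assumes a fault has removed some fraction of DQ and simply shows the two ends of the spectrum the slide names: keep yield at 100% and give up harvest, or keep harvest at 100% and give up yield.

def degrade(dq_loss_fraction, prefer="yield"):
    """Given a fault that removes a fraction of DQ, choose what to give up.

    A sketch of the DQ corollary (harvest x yield ~ constant), not a model
    of any particular system: an "Internet"-style service keeps yield at
    100% and returns partial results; an "ACID"-style service keeps
    harvest at 100% and turns some queries away.
    """
    remaining = 1.0 - dq_loss_fraction
    if prefer == "yield":
        return {"yield": 1.0, "harvest": remaining}      # answer everyone, partially
    return {"yield": remaining, "harvest": 1.0}          # answer fewer, completely

# One node lost out of ten => 10% of DQ gone.
print(degrade(0.10, prefer="yield"))     # {'yield': 1.0, 'harvest': 0.9}
print(degrade(0.10, prefer="harvest"))   # {'yield': 0.9, 'harvest': 1.0}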
Harvest Options
1) Ignore lost nodes
– RPC gives up
– forfeit small part of the database
– reduce D, keep Q
2) Pair up nodes
– RPC tries alternate
– survives one fault per pair
– reduce Q, keep D
3) n-member replica groups
Decide when you care...

Replica Groups
With n members:
– Each fault reduces Q by 1/n
– D stable until nth fault
– Added load is 1/(n-1) per fault (cf. RAID)
– n=2 => double load or 50% capacity
– n=4 => 133% load or 75% capacity
– "load redirection problem" (sketch below)
Disaster tolerance: better have > 3 mirrors
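The load-redirection numbers above follow from simple arithmetic. The sketch below (an invented replica_group_after_faults helper in Python) assumes full replication within the group and an even spread of the group's queries over the survivors, which is enough to reproduce the n=2 and n=4 figures on the slide.

def replica_group_after_faults(n, faults):
    """Load and capacity for one n-member replica group after `faults` failures.

    Assumes the group's data is fully replicated on every member and its
    query load is spread evenly over the surviving members.
    """
    survivors = n - faults
    if survivors <= 0:
        return {"capacity": 0.0, "load_per_survivor": None}   # the group's D is now lost
    return {
        "capacity": survivors / n,                 # fraction of the group's Q left
        "load_per_survivor": n / survivors,        # 1.0 means normal load
    }

# n=2: one fault => 50% capacity, the survivor carries double load.
print(replica_group_after_faults(2, 1))   # {'capacity': 0.5, 'load_per_survivor': 2.0}
# n=4: one fault => 75% capacity, each survivor carries ~133% load.
print(replica_group_after_faults(4, 1))   # {'capacity': 0.75, 'load_per_survivor': 1.33...}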
Graceful Degradation
Goal: smooth decrease in harvest/yield proportional to faults
– we know DQ drops linearly
Saturation will occur
– high peak/average ratios...
– must reduce harvest or yield (or both)
– must do admission control!!!
One answer: reduce D dynamically
– disaster => redirect load, then reduce D to compensate for extra load

Thinking Probabilistically
Maximize symmetry
– SPMD + simple replication schemes
Make faults independent
– requires thought
– avoid cascading errors/faults
– understand redirected load
– KISS
Use randomness
– makes worst-case and average case the same
– ex: Inktomi spreads data & queries randomly (sketch below)
– Node loss implies a random 1% harvest reduction
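A hashed placement is one simple way to get the "random 1% harvest reduction" behavior. The sketch below is not Inktomi's actual scheme; it just hashes invented document IDs onto a hypothetical 100-node cluster and counts what one node failure takes away.

import hashlib

NODES = [f"node{i:02d}" for i in range(100)]      # a hypothetical 100-node cluster

def home_node(doc_id, nodes):
    """Map a document to a node with a hash, so placement looks random.

    With documents spread this way, losing any single node removes a
    roughly random 1% of the documents (a ~1% harvest reduction) instead
    of, say, everything whose ID starts with the same prefix.
    """
    digest = hashlib.sha1(doc_id.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

docs = [f"doc-{i}" for i in range(10_000)]
placement = {d: home_node(d, NODES) for d in docs}

lost = "node17"                                    # any one node fails
missing = sum(1 for d in docs if placement[d] == lost)
print(f"losing {lost} drops {missing / len(docs):.1%} of the documents")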
Server Pollution
Can't fix all memory leaks
Third-party software leaks memory and sockets
– so does the OS sometimes
Some failures tie up local resources
Solution: planned periodic "bounce"
– Not worth the stress to do any better
– Bounce time is less than 10 seconds
– Nice to remove load first...

Evolution
Three Approaches (comparison sketch below):
Flash Upgrade
– Fast reboot into new version
– Focus on MTTR (< 10 sec)
– Reduces yield (and uptime)
Rolling Upgrade
– Upgrade nodes one at a time in a "wave"
– Temporary 1/n harvest reduction, 100% yield
– Requires co-existing versions
"Big Flip"
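Before the Big Flip details on the next slide, the three approaches can be compared with back-of-the-envelope harvest/yield numbers. The sketch below (an invented upgrade_cost helper in Python) only restates the slides' accounting under simple assumptions: uniform data spread, no extra replication of the offline slice, and yield treated as zero during a flash reboot window.

def upgrade_cost(approach, n_nodes):
    """Rough harvest/yield during an upgrade, per the slides' accounting.

    This is not a scheduler for any real system; it just captures what
    each approach sacrifices while the upgrade is in progress.
    """
    if approach == "flash":          # reboot everything into the new version
        return {"yield": 0.0, "harvest": 1.0, "mixed_versions": False}
    if approach == "rolling":        # one node at a time, in a "wave"
        return {"yield": 1.0, "harvest": 1.0 - 1.0 / n_nodes, "mixed_versions": True}
    if approach == "big_flip":       # half the site at a time
        # or the inverse: keep harvest at 100% and give up 50% of yield
        return {"yield": 1.0, "harvest": 0.5, "mixed_versions": False}
    raise ValueError(approach)

for approach in ("flash", "rolling", "big_flip"):
    print(approach, upgrade_cost(approach, n_nodes=100))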
The Big Flip
Steps:
1) take down 1/2 the nodes
2) upgrade that half
3) flip the "active half" (site upgraded)
4) upgrade second half
5) return to 100%
50% Harvest, 100% Yield
– or inverse?
No mixed versions
– can replace schema, protocols, ...
Twice used to change physical location

Key New Problems
Unknown but large growth
– Incremental & Absolute scalability
– 1000's of components
Must be truly highly available
– Hot swap everything (no recovery time allowed)
– No "night"
– Graceful degradation under faults & saturation
Constant evolution (internet time)
– Software will be buggy
– Hardware will fail
– These can't be emergencies...
Conclusions
Parallel Programming is very relevant, except...
– historically avoids availability
– no notion of online evolution
– limited notions of graceful degradation (checkpointing)
– best for CPU-bound tasks
Winning solution is message-passing clusters
– fine-grain communication => fine-grain exception handling
– don't want every load/store to deal with partial failure

Conclusions
Must think probabilistically about everything
– no such thing as a 100% working system
– no such thing as 100% fault tolerance
– partial results are often OK (and better than none)
– Capacity × Completeness ~ constant
Key open problems:
– libraries & data structures for HA shared state
– support for replication and partial failure
– better understanding of probabilistic systems
– cleaner support for exceptions (graceful degradation)
– support for split-phase I/O and many concurrent threads
– support for 10,000 threads/node (to avoid FSMs)
Backup slides

New Hard Problems...
Really need to manage disks well
– problems are I/O bound, not CPU bound
Lots of simultaneous connections
– 50Kb/s => at least 2000 connections/node
HAS to be highly available
– no maintenance window, even for upgrades
Continuous evolution
– constant site changes, always small bugs...
– large but unpredictable traffic growth
Graceful degradation under saturation
Parallel Disk I/O
Want 50+ outstanding reads/disk
– Provides disk-head scheduler with many choices
– Trades response time for throughput
Pushes towards a split-phase approach to disks
General trend: each query is a finite-state machine (sketch below)
– split-phase disk/network operations are state transitions
– multiplex many FSMs over small number of threads
– FSM handles state rather than thread stack
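A toy version of the FSM-per-query structure, in Python with invented names (QueryFSM, a faked disk read), is sketched below: the per-query state lives in an object, split-phase completions drive the transitions, and a single loop multiplexes many queries without one thread per query. It is an illustration of the pattern, not Inktomi's code.

from collections import deque

class QueryFSM:
    """One query as an explicit finite-state machine.

    The per-query state lives in this object rather than on a thread
    stack; completions of split-phase operations drive the transitions.
    """
    def __init__(self, query_id):
        self.query_id = query_id
        self.state = "ISSUE_DISK_READ"
        self.result = None

    def step(self, completion=None):
        if self.state == "ISSUE_DISK_READ":
            self.state = "WAIT_DISK"
            return ("disk_read", self.query_id)       # request phase of the split
        if self.state == "WAIT_DISK":                 # reply phase of the split
            self.result = f"hits for query {self.query_id}: {completion}"
            self.state = "DONE"
        return None

def run(fsms):
    """Single-threaded loop multiplexing many FSMs over faked disk I/O."""
    pending_io = deque()
    for fsm in fsms:                                  # start every query
        request = fsm.step()
        if request:
            pending_io.append((fsm, request))
    while pending_io:                                 # complete I/O in any order
        fsm, (op, qid) = pending_io.popleft()
        fake_blocks = [qid * 7 % 13]                  # stand-in for the disk read result
        fsm.step(completion=fake_blocks)
    return [f.result for f in fsms]

print(run([QueryFSM(i) for i in range(5)]))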