Sunday, 18 April 2021

Finding defects (on paper)

The shelf life of technology is decreasing, and the contraction is accelerating every day. What does this mean? The time from ideation to production is shortening, and new disruptions have quickly become the norm.

A high-quality service and/or product is an implicit requirement for the organization or product to exist.

How do we measure product quality? Run it through a test cycle, file the deviations, fix what is necessary, and document what can be managed.

The testing phase involves writing a test plan, building testbeds and environments, recording and reporting deviations, validating the changes, and finally handing the product over to customers. Testing is a costly but mandatory endeavor to gain confidence in the success of the product.


So what does finding defects on paper mean?

In the current system of product validation, rewards and recognition are based on what is recorded 😀.

Organizations tend to reward folks for whom no (or almost no) defects were reported against the software/hardware piece they delivered, and/or folks who reported the largest number of high-quality defects. This seems to be a sensible way to differentiate among employees by the value they provide in a specified time.

What most organizations don't measure is how many of these defects could have been found without literally testing (remember, literal testing is a cost to the organization and its profitability). Nor do they continuously encourage people (by providing time, resources, and training) to find as many defects as possible on paper, that is, without literally testing.

Finding defects on paper means finding them during PRD review (yes, as early as the PRD), design review, code review, and test plan review.

Now one can argue that this is how it is done today, so how is this different?

There are two tactical actions organizations can take to measure their current effectiveness and build a feedback loop for continuous improvement.

First, in any review, take the feedback, comments, and action items (AIs) that are not clarifications but problems (defects) caught early, and start tracking them as early offline defects. Keep the overhead of tracking these offline defects as low as possible by using automation, for example by closing them automatically when the review owner confirms compliance.

Second, during PPA (post-project analysis), review the defects that were found "literally" and identify whether strengthening any upstream review process would have helped find them "on paper". This builds a feedback loop that continuously strengthens the review processes.
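As a rough illustration of the two actions above, here is a minimal Python sketch of how the numbers could be computed. Everything in it is an assumption made for illustration: the `Defect` record, the stage names, and the `could_have_been_found_in` field (filled in during PPA) are hypothetical and not taken from any specific defect-tracking tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Review stages named in this post; any other stage counts as "literal" testing.
PAPER_STAGES = {"prd_review", "design_review", "code_review", "test_plan_review"}


@dataclass
class Defect:
    defect_id: str
    found_in: str  # stage where the defect was actually found
    # Filled in during PPA: the upstream review that could have caught it, if any.
    could_have_been_found_in: Optional[str] = None


def paper_ratio(defects: List[Defect]) -> float:
    """Fraction of all defects that were found on paper (in a review)."""
    if not defects:
        return 0.0
    on_paper = sum(1 for d in defects if d.found_in in PAPER_STAGES)
    return on_paper / len(defects)


def review_escapes(defects: List[Defect]) -> Counter:
    """Per upstream review, count the literally found defects it could have caught."""
    return Counter(
        d.could_have_been_found_in
        for d in defects
        if d.found_in not in PAPER_STAGES
        and d.could_have_been_found_in in PAPER_STAGES
    )


# Hypothetical sample data for one project.
defects = [
    Defect("D1", "design_review"),
    Defect("D2", "code_review"),
    Defect("D3", "system_test", could_have_been_found_in="design_review"),
    Defect("D4", "system_test", could_have_been_found_in="test_plan_review"),
    Defect("D5", "system_test"),
]

print(f"found on paper: {paper_ratio(defects):.0%}")
print("reviews to strengthen:", review_escapes(defects).most_common())
```

The first number is what can be rewarded, and the second points at the review process that deserves the next round of investment.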

In summary, it is difficult for us to appreciate and reward actions that we cannot measure (and 👀) and compare.

Finding defects through the review process reduces the cost of defects (because they are found early), saves cost (running labs, servers), enables any participant in the review process to report defects, and moves the focus of improvement (and the investment to strengthen it) to upstream processes.



Friday, 16 April 2021

Disaggregation ?( Vs/&) Whitebox

Hello friends, thank you for providing feedback on the topics in my previous posts.

While discussing disaggregation with some of my colleagues, I realized that we sometimes use Disaggregation and Whitebox interchangeably. While that may work OK in some contexts, there is a subtle difference between the two.

What is Disaggregation?

The name says it all: it is "to separate into component parts". In the context of networks, it is the methodical approach of separating a closely integrated system and/or network into multiple sub-components.

When we slice something, we can slice it either horizontally or vertically (don't ask me about diagonally 😀), and the same can be applied to networks. The picture below is an example of horizontal disaggregation, where the system manufacturer has partitioned its systems into various subcomponents and the subcomponents interwork with each other using well-defined interfaces.

A good example of horizontal disaggregation is the "open line system" (originally deployed for submarine networks), where submarine network providers integrated transponders from different vendors on a common line system, thus partitioning the network horizontally.




The disaggregated architecture allows service providers to select different vendors for these subcomponents. AT&T (and many others) are leading the journey toward production deployment of disaggregated architecture, and if you get a chance you can read about the DDC architecture (for IP) and OpenROADM (for optical) for additional insights (see the References section below).

So where is the Whitebox? 

Disaggregated architectures made it possible to move the data plane (and other network functions such as firewall and traffic control) from vendor-specific hardware to any standard hardware. This standard "blank" hardware is a white-label switch, a.k.a. Whitebox.

A white-label box helps the service provider save CAPEX and gives the CSP an option to "choose". It may, however, require some additional effort for integration, operations, and troubleshooting in a disaggregated deployment. Companies like Cumulus Networks are providing integrated solutions on white box hardware from vendors like Facebook (yes, "Facebook" 😯), HPE, Dell, and others, helping CSPs adopt the model and sail through the early integration challenges.

To summarize, Disaggregation is a broad philosophy of slicing and dicing your system/network into multiple sub-components, while Whitebox is a specific action: replacing proprietary hardware with standard hardware and managing this commodity hardware using intelligent software (an SDN controller).


References:

OCP-DDC Architecture

ONF- Open Disaggregated Transport Network

Linkedin Disaggregation journey

Linkedin DataCenter architecture

My earlier post on Open Networking


"We must all suffer one of two things: the pain of discipline or the pain of regret - Jim Rohn"

Tuesday, 13 April 2021

SRE - Making engineers' lives better

Hello everyone, welcome back. With this post I plan to share my perspective on SRE (Site Reliability Engineering). I was impressed by the idea of harnessing the power of automation to support services in a production environment.


While DevOps (along with all its tools) is changing the way (mindset, culture & style) products are developed and enabled in production, SRE is one of the major pillars of the DevOps transformational journey (the other two being Continuous Delivery and Infrastructure Automation).

Key callouts:

1. SRE = Site (Service) Reliability Engineering

Almost all organizations (even hardcore product companies) want to provide their offering as a service (XaaS, Anything as a Service). Service offerings help flatten the revenue curve, making it more predictable. They are also a welcome change for consumers, who no longer have to plan for steep (and uneven) CAPEX allocations for greenfield deployments or network refreshes.

2. Process workflow as Code

This perspective fascinated me. While engineering teams already focus heavily on automation (and the results are visible), automating production service support is a well-deserved extension of the "Automate Everything" philosophy.

And why not? Most of these processes have well-defined workflows and are repeated over and over again.

3. Mindset and Skillset:

SRE thinking forces us to solve the same operational problems, but with an engineering mindset. An SRE team is a group of software engineers and systems engineers who build and maintain operational processes "as code".

Self-service automation can reduce manual dependencies among team members and make the entire process more efficient. It cuts down on manual mistakes, and the operational code keeps improving as it is updated with retrospective learnings and modified for new business needs.
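To make the "as code" idea a little more concrete, here is a minimal Python sketch of an operational workflow written as code. It is only an illustration under my own assumptions: the `Step` and `Runbook` classes, the disk-cleanup scenario, and the 80% threshold are all hypothetical and do not come from any SRE tool or from Google's material.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Context = Dict[str, int]


@dataclass
class Step:
    name: str
    action: Callable[[Context], bool]  # returns True when the step succeeded


@dataclass
class Runbook:
    name: str
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def run(self, ctx: Context) -> bool:
        """Run the steps in order; stop and hand over to a human at the first failure."""
        for step in self.steps:
            ok = step.action(ctx)
            self.log.append(f"{step.name}: {'ok' if ok else 'FAILED'}")
            if not ok:
                self.log.append("escalate to on-call engineer")
                return False
        return True


# Hypothetical actions; a real runbook would call out to actual systems.
def rotate_logs(ctx: Context) -> bool:
    ctx["disk_used_pct"] -= ctx["log_pct"]  # simulate freeing the space held by logs
    return True


def verify_disk(ctx: Context) -> bool:
    return ctx["disk_used_pct"] < 80  # assumed threshold for "healthy"


runbook = Runbook(
    "disk-cleanup",
    [Step("rotate logs", rotate_logs), Step("verify disk usage", verify_disk)],
)
ctx = {"disk_used_pct": 92, "log_pct": 20}
resolved = runbook.run(ctx)
print(resolved, runbook.log)
```

Because the workflow lives in code, a retrospective learning becomes a new `Step` or a changed threshold that is reviewed and versioned like any other change, and the human is only pulled in when the automation cannot resolve the issue.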

4. Engineer's quality of life: 

All of us are aware of how tough it is to maintain highly reliable, business-critical services. My sincere regards to all the folks who spend sleepless nights and sacrifice family time for the wider good.

SRE thinking aims to automate everything that can be automated and to enable humans to do more thinking (feedback) than doing.

People are our greatest asset and any change that promises an improvement in their quality of life is welcome.  


I will share additional insights from Google's SRE material and from Richard Cook's "How Complex Systems Fail".

Feel free to share your insights, resources, and any feedback. 

Have a great day!

References:

SRE Fundamentals- by Google Inc

SRE: The Big Picture (Pluralsight.com)



Protobuf ?

Hello friends this is a follow-up to my earlier post related to gRPC Vs Restconf and as promised below is a quick summary on Protobuf (the...