Thursday, 3 March 2022

NETCONF Vs RESTCONF

Hello friends, last week many of my colleagues asked me about NETCONF, RESTCONF & gRPC, specifically what the difference is among them.

At a high level, my colleagues understand that these protocols were developed to minimize "vendor lock-in" and to build vendor-agnostic network management & monitoring applications.

Let me try to summarize (as succinctly as possible):

1. Both NETCONF and RESTCONF have CONF in common, which means these interfaces can be used to retrieve and manipulate device configuration.

2. Both NETCONF and RESTCONF use YANG data models, which define the schema/template of the information that is pushed to and retrieved from devices.

3. Why do we need RESTCONF: RESTful APIs were already widely adopted in the software industry, and many processes and integrations were built on them. Keeping this in mind, RESTCONF was promoted as another technique (apart from NETCONF) to work with YANG data models.
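As a rough, minimal sketch of how each looks in practice (the device address, credentials, and data path below are placeholders; the NETCONF snippet assumes the ncclient library and the RESTCONF snippet assumes the requests library):

```python
# NETCONF: fetch the running configuration over an SSH session (default port 830).
from ncclient import manager

with manager.connect(host="192.0.2.1", port=830, username="admin",
                     password="admin", hostkey_verify=False) as nc:
    reply = nc.get_config(source="running")
    print(reply.data_xml)   # XML payload described by the device's YANG models

# RESTCONF: fetch the same YANG-modeled data over HTTPS, here as JSON.
import requests

url = "https://192.0.2.1/restconf/data/ietf-interfaces:interfaces"
headers = {"Accept": "application/yang-data+json"}
resp = requests.get(url, headers=headers, auth=("admin", "admin"), verify=False)
print(resp.json())
```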

Let's compare the major differences:


A quick summary of major differences
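In brief (based on the NETCONF and RESTCONF RFCs, 6241 and 8040):

- Transport: NETCONF typically runs over SSH (port 830); RESTCONF runs over HTTPS.
- Encoding: NETCONF payloads are XML; RESTCONF supports XML or JSON.
- Operations: NETCONF uses RPCs (<get-config>, <edit-config>, ...); RESTCONF uses HTTP methods (GET, POST, PUT, PATCH, DELETE).
- Datastores: NETCONF exposes running/candidate/startup datastores with locking; RESTCONF operates on a single conceptual datastore with no locking, and each edit is applied immediately.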

Comparison of operations between Netconf and Restconf:
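The usual mapping (per RFC 8040) between RESTCONF methods and NETCONF operations:

- GET ↔ <get> / <get-config>
- POST ↔ <edit-config> (operation="create"), or invoking an RPC/action
- PUT ↔ <edit-config> (operation="create/replace")
- PATCH ↔ <edit-config> (operation="merge")
- DELETE ↔ <edit-config> (operation="delete")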


NETCONF layers (from the RFC):
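As defined in RFC 6241, NETCONF is organized into four layers:

1. Content: the configuration and notification data (modeled in YANG)
2. Operations: <get-config>, <edit-config>, <copy-config>, <lock>, and so on
3. Messages: <rpc>, <rpc-reply>, and notification framing
4. Secure Transport: SSH (mandatory to implement), TLS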

I will cover gRPC in another post and link it to this one. Happy reading!

References:




Saturday, 17 July 2021

Edge Compute - Why ?

Edge computing is a logical evolution of cloud-based network architecture (centralized compute and storage hosting business applications).

Cloud architecture resulted in the exponential growth of business applications by providing a pay-as-you-grow business model, truly revolutionary. What happened next was that almost every business application (there are exceptions and valid use cases for not moving to the cloud) moved to the cloud to save CAPEX, improve time to market, and gain operational flexibility (XaaS).

Data centers began to proliferate at an exponential pace, and as per Statista's survey, there are 7.2 million data centers globally as of 2021.

A business problem that became much more noticeable with cloud architecture was that almost all end-user data was routed to a data center for business analytics, with return traffic needed to execute any business decision on the end device.

IoT exacerbated this problem, with millions of devices now ready to send traffic to these data centers. You can guess what happens next: the Internet becomes a bottleneck (bandwidth), on top of latency and unpredictable network disruptions.

How does Edge computing help? 

Edge compute moves some portion of the storage and compute resources out of centralized data centers and closer to the data source itself. It can be considered an evolution from centralized computing to distributed computing. Keep in mind it is not the same as traditional on-premises computing, although you can think of it as something similar to the branch-office concept.

What could be the challenge? 

Decentralization may demand higher levels of monitoring and control compared to the centralized model, are we going back 😉?

Interesting use cases: 

1. Autonomous vehicles: Everyone's favorite. Firstly, you will have millions of cars, and sending everything to a centralized data center will choke the internet; secondly, you don't want latency or networking disruptions while the car is on its way 😉

2. Farming: Collect all the sensor data, compute it locally, and ensure crops are harvested in peak season.

3. User experience (latency): Measure performance for users across the internet and determine the most reliable, low-latency path.
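As a small, hypothetical sketch of the third use case (the site names and addresses below are made up), an application could probe a few candidate edge sites and steer the user toward the one with the lowest round-trip time:

```python
import socket
import time

# Hypothetical edge sites; in practice these would be discovered dynamically.
EDGE_SITES = {
    "edge-east": ("203.0.113.10", 443),
    "edge-west": ("203.0.113.20", 443),
}

def tcp_rtt(host, port, timeout=2.0):
    """Measure one TCP connect as a rough round-trip-time estimate."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.monotonic() - start

# Pick the edge site that answers fastest and steer the user's traffic there.
best = min(EDGE_SITES, key=lambda site: tcp_rtt(*EDGE_SITES[site]))
print(f"steering user traffic to {best}")
```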

What is the opportunity :

For product vendors, it provides a new market segment: devices with limited compute and storage but enormous volume, and pluggables to connect these edge mini data centers to end users and upstream to the parent data centers.

I plan to follow up with an additional post on the architecture for edge computing; till then,

Happy reading and stay safe ! 

Thursday, 27 May 2021

Transponder(less) Architecture

Hello friends, today I will touch upon how innovations in silicon photonics and the miniaturization of transponder functions are catalyzing the evolution of a new transport network architecture.

You can also read transponder(less) architectures as architectures with fewer (or limited) transponders.

Every device, connection, and piece of software code is a potential point of failure in a network, and keeping the number of components required to realize a service minimal is always a good idea. The new architecture with reduced components enables this design philosophy.

Now let's see what we mean by transponder"less" architecture.

**I am using transponder and muxponder interchangeably; read transponder as (trans/mux)ponder.

In general, transponders provide gray-to-color translation (for long-distance transmission), and muxponders aggregate many smaller services into a bigger colored pipe to transport over the optical transport network.

Transponder(less) architecture is the act of collapsing the features provided by the transponder/muxponder into another point/layer in the network.



How is this possible now?

The main reason this is possible now is the miniaturization of transponder features into small form factors like CFP2-DCO and ZR/ZR+ pluggable optics.

These optics have absorbed the DSP and other optical analog functions provided by transponders, thus eliminating the need for a dedicated transponder (generally fulfilled by specific hardware).

IPoDWDM was a similar attempt in the past, but that solution required transponder(like) hardware on the routing boxes, consuming slots on the router and also interlocking the refresh cycles of routers and transport hardware (which have considerably different refresh cycles).

How would it benefit?

With commercial versions of CFP2-DCO and ZR optics available, many CSPs can take advantage of these innovations to reduce network cost and, more importantly, minimize the number of components required to realize the same service.

In many metro and DCI deployments where distance is limited (~30-80 km), network connections can now be made with these new optics and some passive (or active) mux/demux blades, without the need to deploy an entire transport line system.

Apart from reducing CAPEX, this should significantly reduce OPEX (lower power requirements, employee training, hardware spares), and the most important benefit, in my view, is a simplified network.

Happy reading! Stay safe!


References:

CFP2 DCO optics 

400G ZR Optics




Saturday, 1 May 2021

Customer is WRONG !

Hello friends, apologies for deploying a provocative title to catch your attention 😉

All organizations aspire to "delight" their customers; quality is implicit, and any slack in user experience can turn into an existential crisis for the product and the organization.

No one can escape accountability when customers are not able to use the product/service for the problem it was intended to solve. But what happens when customers discover a new workflow, or configure/use the product in a way it was never designed or built for 😟?

Customer fault  -  A BIG NO 

Now how can you find these "unusual", "out of the ordinary" workflows before they are reported by customers?

Validation teams within an organization are accountable for ensuring these problems are found and fixed before the feature/product/service is available to the customer.

For products with limited complexity and combinations, validating these negative scenarios can work (especially with an automated test suite), but for a multilayer, multi-component complex product/service this approach (find & fix) will not scale.

Secondly, does it even make sense to spend time and resources validating something which is not a revenue (💲💲) generating activity?

What is a better approach? - "Mistake proofing"

One such mistake-proofing technique is "Poka-Yoke": Poka ➡ "mistake", Yoke ➡ "prevent".


Poka-Yoke is a process introduced by Japanese engineer Shigeo Shingo. The purpose of Poka-Yoke is to develop processes to reduce defects by avoiding or correcting mistakes in the early design and development phases.

Building a product that handles these unusual workflows during design and development is the best way to avoid a humongous effort later (with the find & fix approach).

All the cross-functional teams, product owners, software architects, development, validation, operations & support, should come together during the design phase and plug all possible entry points, preventing any inadvertent misuse of the product/service by the end user.
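As a toy illustration (the parameter names and limits are hypothetical), mistake-proofing in software often means rejecting an invalid or never-intended configuration at the entry point, so it can never propagate into the running product:

```python
from dataclasses import dataclass

VALID_MODES = {"active", "standby"}        # hypothetical allowed values

@dataclass(frozen=True)
class PortConfig:
    name: str
    mtu: int
    mode: str

    def __post_init__(self):
        # Poka-Yoke at the entry point: a configuration the product was never
        # designed for fails immediately, instead of surfacing later in the field.
        if not self.name:
            raise ValueError("port name must not be empty")
        if not 64 <= self.mtu <= 9216:
            raise ValueError(f"MTU {self.mtu} is outside the supported range 64-9216")
        if self.mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")

# PortConfig(name="xe-0/0/1", mtu=20000, mode="turbo")  -> raises ValueError
```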

"Remember Proactive is better than Reactive"

Sunday, 18 April 2021

Finding defects(on paper)

The shelf life of technology is decreasing, and the contraction is accelerating every day. What does this mean? The time from ideation to production is shortening, and new disruptions have quickly become the norm.

High quality of the service and/or product is an implicit assumption for the organization/product to exist. 

How do we measure product quality? Run it through a test cycle, file the deviations, fix what is necessary, and document what can be managed.

The testing phase involves writing a test plan, building testbeds and environments, recording/reporting deviations, validating the changes, and finally handing the product over to the customers. Testing is a costly but mandatory endeavor to gain confidence in the success of the product.


So what does finding defects on paper mean?

In the current system of product validation, rewards/recognition are based on what is recorded 😀.

Organizations tend to reward folks for whom NO (or almost no) defects were reported against the software/hardware piece they delivered, AND/OR folks who reported the maximum number of high-quality defects. This seems a sensible way to differentiate among the employees providing the maximum value in a specified time.

What most organizations don't measure is how many of these defects could be found without literally testing (remember, literal testing is a cost to the organization and to profitability), and they don't continuously encourage (and provide time, resources, and training) people to find as many defects as possible on paper (without testing literally).

Finding defects on paper means counting how many defects can be found during PRD review (yes, as early as the PRD), design review, code review, and test plan review.

Now we can debate: isn't that how it is done today, so how is this different?

There are two tactical actions organizations can start with to measure their current effectiveness and build a feedback loop for continuous improvement.

First, for any review, take the feedback/comments/action items which are not clarifications but problems (defects) caught early, and start tracking them as early offline defects. Keep the overhead of tracking these offline defects as minimal as possible by using automation, closing them automatically when the review owner confirms compliance.

Second, during PPA (post-project analysis), review the defects which were found "literally" and identify whether strengthening any upstream review process would have helped find them "on paper". This builds a feedback loop that continuously strengthens the review processes.
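As a rough sketch of the first point (everything here, the "DEFECT:" tag, the CSV tracker, and the field names, is hypothetical), review comments that are actual problems could be harvested automatically and recorded as offline defects:

```python
import csv
from datetime import date

def extract_offline_defects(review_comments, review_id):
    """Treat any review comment tagged 'DEFECT:' as an early, offline defect."""
    defects = []
    for author, text in review_comments:
        if text.startswith("DEFECT:"):
            defects.append({
                "review": review_id,
                "reported_by": author,
                "summary": text[len("DEFECT:"):].strip(),
                "found_on": date.today().isoformat(),
                "phase": "review",          # found on paper, not in the lab
            })
    return defects

def record(defects, path="offline_defects.csv"):
    """Append offline defects to a lightweight tracker (a CSV file here)."""
    fields = ["review", "reported_by", "summary", "found_on", "phase"]
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        if f.tell() == 0:                   # write a header only for a new file
            writer.writeheader()
        writer.writerows(defects)

comments = [
    ("alice", "DEFECT: the spec allows MTU 0, which the forwarding code cannot handle"),
    ("bob", "Question: is the retry interval configurable?"),
]
record(extract_offline_defects(comments, review_id="PRD-42"))
```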

In summary, it is difficult for us to appreciate and reward actions that we cannot measure (and see 👀) and compare.

Finding defects through the review process helps reduce the cost of defects (because they are found early), saves cost (running labs, servers), enables any participant in the review process to report them, and moves the focus of improvement (and investment to strengthen) onto upstream processes.



Friday, 16 April 2021

Disaggregation ?( Vs/&) Whitebox

Hello friends, thank you for providing feedback on the topics in my previous posts.

While discussing disaggregation with some of my colleagues, I realized we sometimes use Disaggregation and Whitebox interchangeably. While that may work in some contexts, there is a subtle difference between the two.

What is Disaggregation: 

The name says it all: it means "to separate into component parts". In the context of networks, it is a methodical approach to separating a closely integrated system and/or network into multiple sub-components.

When we slice something, we can slice it horizontally or vertically (don't ask me about diagonally 😀), and the same can be applied to networks. The picture below is an example of horizontal disaggregation, where the system manufacturer has partitioned the system into various subcomponents, and the subcomponents interwork with each other using well-defined interfaces.

A good example of horizontal disaggregation is the "open line system" (originally deployed for submarine networks), where submarine network providers integrated transponders of different vendors onto a common line system, thus partitioning the network horizontally.




The disaggregated architecture allows service providers to pick and choose different vendors for these subcomponents. AT&T (and many others) are leading the journey toward production deployment of disaggregated architectures, and if you get a chance you can read about the DDC architecture (for IP) and OpenROADM (for optical) for additional insights (check the reference section below).

So where is the Whitebox? 

Disaggregated architectures enabled swapping the data plane (and other network functions like firewalls and traffic control) from a specific vendor onto any standard hardware. This standard 'blank' hardware is a white-label switch, a.k.a. a Whitebox.

A white-label box helps the service provider save CAPEX and gives the CSP an option to "choose". It may, however, bring some additional effort for integration, operations, and troubleshooting in a disaggregated deployment. Companies like Cumulus Networks are providing integrated solutions with whitebox hardware from vendors like Facebook (yes, "Facebook" 😯), HPE, Dell, and others, helping CSPs adopt the model and sail through early integration challenges.

To summarize, Disaggregation is a broad philosophy of slicing and dicing your system/network into multiple sub-components, while Whitebox is a specific action of replacing proprietary hardware with standard hardware and managing this commodity hardware using intelligent software (an SDN controller).


References:

OCP-DDC Architecture

ONF- Open Disaggregated Transport Network

Linkedin Disaggregation journey

Linkedin DataCenter architecture

My earlier post on Open Networking


"We must all suffer one of two things: the pain of discipline or the pain of regret - Jim Rohn"

Tuesday, 13 April 2021

SRE - Making engineer's life better

Hello everyone, welcome back. With this post I plan to share my perspective on SRE (Site Reliability Engineering). I was impressed by the idea of harnessing the power of automation to support services in a production environment.


While DevOps (along with all its tools) is changing the way (mindset, culture & style) products are developed and enabled in production, SRE is one of the major pillars (the other two being Continuous Delivery and Infrastructure Automation) of the DevOps transformational journey.

Key callouts:

1. SRE = Site(Service) Reliability Engineering 

Almost all organizations (even hardcore product companies) want to provide their offering as a service (XaaS, Anything as a Service). Service offerings help flatten the revenue curve, making it more predictable, and it is a welcome change for consumers since they don't have to plan for steep (and uneven) CAPEX allocations for greenfield deployments or network refreshes.

2. Process workflow as Code

This perspective fascinated me. While there is already a lot of focus within engineering teams on automation (and the results are visible), automating production service support is a well-deserved extension of the "Automate Everything" philosophy.

And why not? Most of these processes have well-defined workflows and are repeated over and over again.
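A minimal sketch of what "workflow as code" can look like (the service name, health endpoint, and restart command are all hypothetical): instead of a human following a runbook at 3 a.m., the same steps are encoded and executed automatically:

```python
import subprocess
import requests

SERVICE = "demo-api"                             # hypothetical service name
HEALTH_URL = "http://localhost:8080/health"      # hypothetical health endpoint

def is_healthy() -> bool:
    """Runbook step 1: does the service answer its health check?"""
    try:
        return requests.get(HEALTH_URL, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def remediate() -> None:
    """Runbook step 2: restart the service instead of paging a human first."""
    subprocess.run(["systemctl", "restart", SERVICE], check=True)

if __name__ == "__main__":
    if not is_healthy():
        remediate()
        print(f"{SERVICE} was unhealthy and has been restarted")
    else:
        print(f"{SERVICE} is healthy, nothing to do")
```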

3. Mindset and Skillset : 

SRE thinking forces us to solve the same problems, but with an engineering mindset. The SRE team is a group of software engineers and systems engineers building and maintaining operational processes "as code".

Self-service automation can reduce manual dependencies among team members and make the entire process more efficient by reducing manual mistakes, while the operational code is uplifted based on retrospective learnings and modified for new business needs.

4. Engineer's quality of life: 

All of us are aware of how tough it is to maintain highly reliable, business-critical services. My sincere regards to all the folks who spend sleepless nights and sacrifice family time for the wider good.

SRE thinking aims to automate everything that can be automated and to enable humans to do more thinking (feedback) than doing.

People are our greatest asset and any change that promises an improvement in their quality of life is welcome.  


I will share additional insights from the Google SRE book and from Richard Cook's "How Complex Systems Fail".

Feel free to share your insights, resources, and any feedback. 

Have a great day!

References:

SRE Fundamentals- by Google Inc

SRE: The Big picture ( Pluralsight.com)


