
PNDA forum discussion

PNDA-4031


This is a large area, divided into several separate topics:

Access control between users & services

Identity - a consistent notion of what constitutes an identity across PNDA & the execution of functions for that identity

  • E.g. applications started by Bob are then associated with Bob for the purposes of resource allocation

Authentication - a consistent approach to establishing the veracity of identities across PNDA

  • E.g. Alice cannot pretend to be Bob

Authorization - a consistent approach to controlling what an authenticated identity can & cannot do across PNDA

  • E.g. Bob can configure ingest and create applications, whereas Alice can only control the lifecycle of applications that have already been created.

Some of the key areas to be addressed include -

  • Most PNDA services are identity-aware. However, there's no consistent authentication of identity.
    • Where PAM is the underlying framework, authentication is deferred to the configured mechanism, which could be local or LDAP
    • Some services can be configured to use LDAP directly
    • Other services assume a default identity without any authentication
  • Today, all services are accessed directly. Access control would be greatly simplified by having one control point through which all services are accessed.
  • Some PNDA specific services do not currently implement authorization
  • For the services that do implement authorization, there are a multitude of schemes and control mechanisms. 
    • Management of authorization is greatly simplified by having a consistent approach, ideally with a single point of management.

Securing interaction between services

This also sub-divides into identity, authentication & authorization. Typically, we will use TLS on links and mutually authenticate using certificates.
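As an illustration of mutually authenticated TLS between two services, here is a minimal Python sketch using only the standard `ssl` module. The function names and the certificate, key and CA file parameters are hypothetical placeholders, not part of PNDA; the point is only that the server side must require a client certificate in addition to the client verifying the server.

```python
import ssl


def make_server_context(cert_file=None, key_file=None, ca_file=None):
    """Build a TLS server context that requires a client certificate.

    All three paths are hypothetical placeholders. Nothing is loaded when
    they are omitted, so the sketch can be imported without key material.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Refuse the handshake unless the peer presents a certificate we trust.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # CA bundle used to validate certificates presented by peers.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # This service's own certificate and private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def make_client_context(cert_file=None, key_file=None, ca_file=None):
    """Build a TLS client context that verifies the server and can present
    its own certificate when the server asks for one."""
    # Defaults already verify the server certificate and host name.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

In a real deployment each service would be given its own certificate and a CA bundle shared across the cluster; how those are issued and distributed is outside the scope of this sketch.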


  • Introduce identity to PNDA services where missing today
  • As far as possible, introduce one point through which access to PNDA is controlled
    • We believe Apache Knox is the most suitable technology for PNDA in this space: it applies across a wide range of Hadoop services, is pluggable enough to support PNDA services, and supports a number of widely used authentication frameworks out of the box.
    • Some services are not covered by Knox and need separate analysis of how to provide a consistent authentication scheme overall.
    • More about Knox.
  • Authenticate identity consistently across PNDA services
    • Kerberos is the key Hadoop technology of interest for strong authentication, but there are other options and other PNDA technologies to consider, as well as applicability to cloud-based deployment.
  • Authorize operations consistently across PNDA services
    • The key technology of interest here for Hadoop is Apache Ranger.
    • There are services across PNDA that would not be addressed by Ranger and need separate analysis of how to provide a consistent authorization scheme overall.

Overall Plan


  • Add identity awareness to all PNDA services and APIs where missing
  • This work is mostly complete


  • Integrate upstream Knox version 1.0.x as part of PNDA deployment in order to cover main HDP and other services. 


  • Authentication at the perimeter using Knox is pluggable: LDAP is likely; OAuth needs some investigation
  • Authentication behind the perimeter is likely to be based on Kerberos; some components will need work to enable this


  • Initially, distributed management of authorization & a simple fixed scheme based on users (not roles)
    • See here for how this will be achieved for the Deployment Manager in the short term
  • Later, centralized authorization, likely using Ranger.
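A minimal sketch of what a simple fixed scheme based on users (not roles) could look like, reusing the Bob and Alice example from the start of this page. The operation names, the `GRANTS` table and the HTTP-style status codes are assumptions made for illustration; they are not the Deployment Manager's actual API.

```python
# Hypothetical fixed grants: user name -> set of permitted operations.
GRANTS = {
    "bob": {"configure_ingest", "create_application",
            "start_application", "stop_application"},
    "alice": {"start_application", "stop_application"},
}


class Forbidden(Exception):
    """Raised when an identity is not permitted to perform an operation."""


def authorize(user, operation, grants=GRANTS):
    """Raise Forbidden unless `user` has been granted `operation`."""
    # Unknown users get an empty grant set, so everything is denied.
    if operation not in grants.get(user, set()):
        raise Forbidden(f"{user!r} may not perform {operation!r}")


def handle(user, operation):
    """Return an HTTP-style (status, message) pair for a requested operation."""
    try:
        authorize(user, operation)
    except Forbidden as exc:
        return 403, str(exc)
    return 200, f"{operation} accepted for {user}"
```

Because the grants are keyed directly on user names, each service would carry its own table in this first phase; moving to centralized, role-based management later replaces the lookup without changing the callers of `authorize`.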




  1. Unknown User (ashishpok)

    Trev, James,

    I agree with this proposal. A couple of things I'd like to bring up:

    1- I feel that passing the username as a query parameter opens the door to undesired "impersonation" attacks. Granted, the DM can be run on localhost with iptables filters to protect it, but that still leaves a door open (especially if you want to scale out the DM onto a separate host). I'm not sure offhand how other applications handle this, but one way could be HTTPS with client certificate validation to secure the channel between the console backend and the DM.

    2- I also think the frontend components need some semblance of role awareness (probably via the user session object). This would (1) act as a first line of defense before hitting the DM and (2), more importantly, allow the UI to control links and buttons (you would want to disable links in the UI).

    Thanks, Ashish

    1. Unknown User (trsmith2)

      Hi Ashish, thanks for the comments!

      On (1), this isn't intended to address authentication. With a gateway and/or an authentication framework you'd expect the caller to be verified as 'bob' before the call is allowed. Once it's allowed, the way the user is communicated for the purposes of checking authorization is via the query component &user=.  

      For example, Knox can be plugged into an authentication mechanism. It will then verify identity before forwarding on the call, adding the query component &user=. In the case of the Console acting as the gateway, it should (and does) carry out authentication in the same way.

      TLS is indeed applicable for securing the intra-cluster links between components but doesn't solve (very elegantly, at least) user-to-cluster security. This is something that should be added to this PDP. 

      On (2), the console already uses sessions to carry user context through to the DM API. Once simple authorization is implemented in the DM, performing a disallowed operation will return an error. The console should handle that elegantly. Note that the console isn't the first line of defence to the DM, as many people use the DM API directly in automation for example.

      I agree the Console would ideally pre-emptively suggest what's possible and what isn't for a given session (grey buttons and other visual cues) - this is part of the more advanced topic of role based access control and centralized authorization management, and where Apache Ranger probably has a part to play if we wish to avoid re-inventing the wheel.
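A minimal Python sketch of the gateway behaviour described in the reply above: once the caller has been authenticated, the gateway forwards the request with the verified identity in the `user` query component. The function name, host, port and paths are hypothetical; dropping any caller-supplied `user` value is an assumption of this sketch, added so that a client cannot ask the gateway to act as somebody else.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def forward_url(url, authenticated_user):
    """Return `url` with its `user` query component set to the verified identity.

    Any `user` value supplied by the caller is discarded before the
    authenticated name is appended.
    """
    parts = urlsplit(url)
    # Keep every query component except a caller-supplied `user`.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "user"
    ]
    query.append(("user", authenticated_user))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Example: the caller authenticated as bob but tried to pass user=alice.
print(forward_url("http://dm.example:5000/applications?user=alice&x=1", "bob"))
# prints "http://dm.example:5000/applications?x=1&user=bob"
```

This only helps if the downstream service is reachable solely through the gateway; a caller who can reach it directly can still set `user` to anything.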

  2. Unknown User (ashishpok)

    Makes sense on both fronts!

    I actually didn't mean authentication, but rather impersonating a user by calling the DM directly with the query parameter, and whether the DM needs some sort of trust store. And yes, the intended use of TLS would be for the intra-cluster links between components. A secure gateway in front of the DM would also work. In the absence of TLS, I suppose the DM port has to be restricted to trusted IPs only (Knox, console backend, automation node, etc.), correct? iptables rules are great but tend to become a management headache as the cluster keeps growing :)

    I definitely like the idea of centralized authorization management.

    1. Unknown User (trsmith2)

      I think deployment on a segmented network is the answer to the connectivity question - the gateway would be on the outside (for example in AWS we deploy into a VPC with two subnets, one private). You can simulate this kind of arrangement with iptables but agreed, that could get unwieldy. 

      If the DM is deployed without authentication then malicious impersonation is trivial (you just change the URL). If the DM is deployed with authentication then malicious impersonation should be impossible - i.e. bob will be proved to be bob and the gateway will be proved to be the gateway before anything proceeds (via SPNEGO for example). We'll need to allow both options, I expect, given the range of deployment scenarios.

      As you say, TLS is the right solution for intra-cluster link security and mutual authentication, and I suspect it's easier to drop this in (it's readily supported by the web technologies in use) than to hand-roll some kind of whitelist at the application layer.

      We need to fill out the authentication section of this PDP with more detail on this aspect, but this slideshare gives a good overview of how it fits in (slide 10 in particular). The 'doAs' here is essentially authenticated (non-malicious!) impersonation. The Hadoop interfaces generally work the same way.