
    Reverse Engineer Apache Jackrabbit Setup

    • anthonyhA
      anthonyh
      last edited by

      We have a system that uses Apache Jackrabbit as an image (document) storage repository. We would really like to be able to pull documents for use with applications outside said system. The vendor of the system is, of course, not willing to volunteer how we can do this. So, I've been asked to reverse engineer it. I've looked at the database (MS-SQL) that's being used as storage and, yeah, I need to get into it from the Jackrabbit side...

      Anyone have any pointers on resources to help me with this? At least a pointer on where to start?

      It goes without saying, I have no clue what I'm doing. 😄

      • MattSpellerM
        MattSpeller
        last edited by

        Upvoted for ambition + visibility

        • gjacobseG
          gjacobse
          last edited by

          well good thing is that it's Open Source and runs on apache.

          Sounds like a @scottalanmiller question.

          • JaredBuschJ
            JaredBusch
            last edited by

            Jackrabbit has an API. Why go into the DB when you can use the API?

            • AmbarishrhA
              Ambarishrh
              last edited by Ambarishrh

              I am not sure if this is helpful, but a search got me this http://blog.mooregreatsoftware.com/

              Part of that blog:
              Sadly, the metadata files for AEM Package Manager are very, very poorly documented. To make matters worse, there is a lot of duplication and inconsistencies between them. There is a little bit of information at the Apache Jackrabbit FileVault Documentation site, but it is focussed at the Vault filesystem and the like, not specifically how to use packages. The Adobe 6.1 Package Manager documentation discusses creating a package through the UI, but doesn't discuss any of the mechanics. The Maven VLT plugin talks a little about how to set up Maven, but has huge holes in what is actually done and what the values really mean.

              In an effort to get some better understanding, I've done a lot of reading, testing, and reverse engineering to come up with the following information. If anyone knows where I can learn more, I'd love to know and pass that along!

              Not sure if it completely talks about Apache Jackrabbit, but thought this might help.

              And another one; talks about exporting data as XML
              https://wiki.apache.org/jackrabbit/BackupAndMigration

              • scottalanmillerS
                scottalanmiller @gjacobse
                last edited by

                @gjacobse said in Reverse Engineer Apache Jackrabbit Setup:

                well good thing is that it's Open Source and runs on apache.

                Sounds like a @scottalanmiller question.

                LOL, yes. Jackrabbit itself is fully open. No reverse engineering needed. You can look right at the code or docs.

                • scottalanmillerS
                  scottalanmiller
                  last edited by

                  So the MS SQL Server database is overly complex? Hard to believe that the image data is not relatively easy to find in there.

                  • anthonyhA
                    anthonyh @scottalanmiller
                    last edited by anthonyh

                    @scottalanmiller said in Reverse Engineer Apache Jackrabbit Setup:

                    So the MS SQL Server database is overly complex? Hard to believe that the image data is not relatively easy to find in there.

                    The SQL database appears to be fairly simple. However, it's not in any easy-for-a-human-to-decipher structure (at least this human).

                    For what it's worth, we used to have a system that used IBM's FileNet for document storage...and I easily reverse engineered the Oracle back-end of that and was able to pull docs from that with no issues.

                    This is nothing like FileNet, unfortunately.

                    • T
                      tiagom
                      last edited by

                      @anthonyh said in Reverse Engineer Apache Jackrabbit Setup:

                      The SQL database appears to be fairly simple. However, it's not in any easy-for-a-human-to-decipher structure (at least this human).

                      For what it's worth, we used to have a system that used IBM's FileNet for document storage...and I easily reverse engineered the Oracle back-end of that and was able to pull docs from that with no issues.

                      This is nothing like FileNet, unfortunately.

                      Of course, it's so you pay them to do whatever customization you are after.

                      Sadly, I have no experience with Apache Jackrabbit. Hope you figure this out!

                      • anthonyhA
                        anthonyh
                        last edited by

                        I think I may go with a less elegant method, but one I can put together more quickly.

                        I discovered that once I'm logged into the system (it's web based), I can simply browse to the document retrieval URL and stick the appropriate document ID in said URL. This will spit out said document.

                        I can script this via Lynx on a Linux VM relatively easily.

                        All we need to do is dump the desired document IDs to a list that I can then read on the Lynx side and, boom, we'll have the docs to do with as we please.
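
                        The ID-list approach above can be sketched roughly as follows. This uses Python's standard library rather than Lynx, and it is only a sketch: BASE_URL, the docId parameter name, and the .bin output naming are hypothetical placeholders, since the thread never shows the real retrieval URL. Logging in is site-specific and is not shown; the opener simply keeps whatever session cookies a login step sets. It also assumes the document IDs are safe to use as file names.

```python
from pathlib import Path
from urllib.parse import urlencode
import http.cookiejar
import urllib.request

# Hypothetical placeholders: the real retrieval URL and parameter name
# are not given in the thread.
BASE_URL = "https://docs.example.internal/retrieve"
ID_PARAM = "docId"


def read_ids(text):
    """Return the stripped, non-empty lines of an ID list, dropping repeats."""
    seen = set()
    ids = []
    for line in text.splitlines():
        doc_id = line.strip()
        if doc_id and doc_id not in seen:
            seen.add(doc_id)
            ids.append(doc_id)
    return ids


def build_url(doc_id, base_url=BASE_URL, id_param=ID_PARAM):
    """Build the retrieval URL for one document ID (query-string encoded)."""
    return f"{base_url}?{urlencode({id_param: doc_id})}"


def make_opener():
    """Create an opener that keeps session cookies between requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch_all(ids, out_dir, opener):
    """Download each document and save it as <out_dir>/<id>.bin."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc_id in ids:
        with opener.open(build_url(doc_id), timeout=30) as resp:
            (out / f"{doc_id}.bin").write_bytes(resp.read())
```

                        Once the opener holds a logged-in session, something like fetch_all(read_ids(Path("ids.txt").read_text()), "docs", opener) would pull each document in turn; ids.txt is likewise a made-up file name.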

                        • dafyreD
                          dafyre @anthonyh
                          last edited by

                          @anthonyh said in Reverse Engineer Apache Jackrabbit Setup:

                          I think I may go with a less elegant method, but one I can put together more quickly.

                          I discovered that once I'm logged into the system (it's web based), I can simply browse to the document retrieval URL and stick the appropriate document ID in said URL. This will spit out said document.

                          I can script this via Lynx on a Linux VM relatively easily.

                          All we need to do is dump the desired document IDs to a list that I can then read on the Lynx side and, boom, we'll have the docs to do with as we please.

                          You could also browse the database tables and figure out where said document IDs live, that way you can simply pull straight from the DB. 🙂

                          • anthonyhA
                            anthonyh @dafyre
                            last edited by

                            @dafyre said in Reverse Engineer Apache Jackrabbit Setup:

                            You could also browse the database tables and figure out where said document IDs live, that way you can simply pull straight from the DB. 🙂

                            If I could do that, I would. The DB is in no way/shape/form readable by anything other than Jackrabbit. This was just confirmed by the vendor of the system. They actually just suggested exactly what I'm working on doing (after my boss had what he calls a "come to Jesus" moment with them).

                            • travisdh1T
                              travisdh1 @anthonyh
                              last edited by

                              @anthonyh said in Reverse Engineer Apache Jackrabbit Setup:

                              If I could do that, I would. The DB is in no way/shape/form readable by anything other than Jackrabbit. This was just confirmed by the vendor of the system. They actually just suggested exactly what I'm working on doing (after my boss had what he calls a "come to Jesus" moment with them).

                              Hrm, let me guess, they're storing entire tables of values from PHP in single database columns? That is so very highly annoying, and goes against everything relational databases are supposed to be. I've had bad experiences with this in Drupal myself.

                              • anthonyhA
                                anthonyh @travisdh1
                                last edited by anthonyh

                                @travisdh1 said in Reverse Engineer Apache Jackrabbit Setup:

                                Hrm, let me guess, they're storing entire tables of values from PHP in single database columns? That is so very highly annoying, and goes against everything relational databases are supposed to be. I've had bad experiences with this in Drupal myself.

                                No, it's not doing that. What it's doing kinda makes sense (at least from the limited sleuthing knowledge I have), it's just organized for Jackrabbit and not for a human. There are 6 tables:

                                GLOBAL_REVISION - Not sure what this is, we only have one record here. I believe it has to do with clustering (there are 4 app servers and Jackrabbit runs on each app).
                                JOURNAL - I believe this is something to do with clustering as well.
                                BINVAL - Where the documents are stored, I believe. There are two columns, BINVAL_ID and BINVAL_DATA.
                                BUNDLE - Not sure what this is.
                                NAMES - A reference table for various object names.
                                REFS - Empty in our implementation.

                                From what I've researched, the docs are stored in hexadecimal format. However, when I pull the BINVAL_DATA field for a given record and convert from hex to binary, the file is unreadable. Even if I could successfully convert the doc, the IDs for these records do not correspond to the IDs that we see on the front-end. I have not found any sort of relationship table/list in the front-end database, I suspect it's all done via Jackrabbit.
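
                                One quick sanity check on a converted value is to look at its first few bytes. The sketch below (Python, with hypothetical helper names) turns a hex dump into bytes and compares the start against a handful of well-known file signatures. It assumes the hex was copied as text from a query tool, possibly with a leading 0x; it says nothing about how Jackrabbit itself lays out BINVAL_DATA.

```python
# Well-known leading byte signatures for a few common document/image types.
MAGIC = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def hex_to_bytes(hex_text):
    """Convert a hex dump (whitespace and an optional 0x prefix allowed) to bytes."""
    cleaned = "".join(hex_text.split())
    if cleaned[:2].lower() == "0x":
        cleaned = cleaned[2:]
    return bytes.fromhex(cleaned)


def sniff(data):
    """Return a type name if the data starts with a known signature, else None."""
    for signature, name in MAGIC:
        if data.startswith(signature):
            return name
    return None
```

                                A None result only means that none of these signatures matched; it does not by itself show that the data is corrupt or what format it is in.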

                                • travisdh1T
                                  travisdh1 @anthonyh
                                  last edited by

                                  @anthonyh said in Reverse Engineer Apache Jackrabbit Setup:

                                  No, it's not doing that. What it's doing kinda makes sense (at least from the limited sleuthing knowledge I have), it's just organized for Jackrabbit and not for a human. There are 6 tables:

                                  GLOBAL_REVISION - Not sure what this is, we only have one record here. I believe it has to do with clustering (there are 4 app servers and Jackrabbit runs on each app).
                                  JOURNAL - I believe this is something to do with clustering as well.
                                  BINVAL - Where the documents are stored, I believe. There are two columns, BINVAL_ID and BINVAL_DATA.
                                  BUNDLE - Not sure what this is.
                                  NAMES - A reference table for various object names.
                                  REFS - Empty in our implementation.

                                  From what I've researched, the docs are stored in hexadecimal format. However, when I pull the BINVAL_DATA field for a given record and convert from hex to binary, the file is unreadable. Even if I could successfully convert the doc, the IDs for these records do not correspond to the IDs that we see on the front-end. I have not found any sort of relationship table/list in the front-end database, I suspect it's all done via Jackrabbit.

                                  BINVAL_DATA is probably the raw jpg/gif/whatever, I'd be surprised if you needed to convert it.

                                  Overall, Jackrabbit sounds like it was designed horribly, and you've found the best option out of the bad choices you have 😞

                                  • JaredBuschJ
                                    JaredBusch @anthonyh
                                    last edited by JaredBusch

                                    @anthonyh said in Reverse Engineer Apache Jackrabbit Setup:

                                    I have not found any sort of relationship table/list in the front-end database, I suspect it's all done via Jackrabbit.

                                    This is obviously not true. There will be a record someplace that contains all of the cross references or there would be no way for anything to be pulled out after it was stored. This is just silly reasoning. Just because you do not know where to find it does not mean it does not exist.

                                    That said, I told you all the way at the beginning of this thread to use the native API to pull documents instead of trying to kludge some hack together. That is the entire point of having an API.

                                    • dafyreD
                                      dafyre
                                      last edited by dafyre

                                      Compare ID fields in the NAMES and BINVAL tables... A system like this is not likely to have the correct information in one place.

                                      • anthonyhA
                                        anthonyh @JaredBusch
                                        last edited by

                                        @JaredBusch said in Reverse Engineer Apache Jackrabbit Setup:

                                        This is obviously not true. There will be a record someplace that contains all of the cross references or there would be no way for anything to be pulled out after it was stored. This is just silly reasoning. Just because you do not know where to find it does not mean it does not exist.

                                        That said, I told you all the way at the beginning of this thread to use the native API to pull documents instead of trying to kludge some hack together. That is the entire point of having an API.

                                        I am pretty knowledgeable about the non Jackrabbit side of this application, and I am going to say you're wrong. I'm confident the relationship is stored on the Jackrabbit side and NOT the front-end side.

                                        Yes, Jackrabbit has an API (I am fully aware of this). I looked at their "First Hops" exercise (making a connection to Jackrabbit), and you need to know about the JCR specification and how to program in Java. I do not have these skill sets (yet).

                                        http://jackrabbit.apache.org/jcr/first-hops.html

                                        • anthonyhA
                                          anthonyh @dafyre
                                          last edited by

                                          @dafyre said in Reverse Engineer Apache Jackrabbit Setup:

                                          Compare ID fields in the NAMES and BINVAL tables... A system like this is not likely to have the correct information in one place.

                                          Unfortunately the NAMES table has a total of 10 records. It's not document names (good guess, though!).

                                          • anthonyhA
                                            anthonyh @travisdh1
                                            last edited by

                                            @travisdh1 said in Reverse Engineer Apache Jackrabbit Setup:

                                            BINVAL_DATA is probably the raw jpg/gif/whatever, I'd be surprised if you needed to convert it.

                                            Overall, Jackrabbit sounds like it was designed horribly, and you've found the best option out of the bad choices you have 😞

                                            Looks like BINVAL_DATA is a byte array type. This link below, though not Jackrabbit specific, shows how to convert between a file and byte array.

                                            http://www.programcreek.com/2009/02/java-convert-a-file-to-byte-array-then-convert-byte-array-to-a-file/
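
                                            The linked article is Java. The same round trip is short in Python as well; the sketch below uses hypothetical helper names and only shows the file/byte-array conversion itself, nothing Jackrabbit-specific.

```python
from pathlib import Path


def file_to_bytes(path):
    """Read a whole file into an immutable bytes object."""
    return Path(path).read_bytes()


def bytes_to_file(data, path):
    """Write a bytes or bytearray value to a file and return its path."""
    target = Path(path)
    target.write_bytes(bytes(data))
    return target
```

                                            Writing a value out with bytes_to_file and opening the result in an image or PDF viewer is a quick way to see whether the bytes are a usable document as they stand.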
